C/C++ Compilation Process

Last updated Dec 27, 2025

The Compilation Process

In short, “compilation” refers to the process of taking your human-readable source code and translating it into something the computer can execute. This means converting your code into machine code, the raw binary instructions the processor runs (assembly is the human-readable text form of those instructions). Somewhat confusingly, compilation is not a single thing completed by one particular program on your computer. Instead, it’s a multi-stage process involving several intermediate file formats and multiple programs. Fortunately, the process is not super complex at its base level, and each individual stage is relatively simple.

Additionally, many people refer to the whole group of programs that convert your code to an executable format as simply the “compiler”, when in reality the compiler itself is just one part of the process. So when you see the word “compiler”, in most cases it actually refers to a suite of different programs.

The Scenario

To give some context to the rest of this article, we will establish a mock project we are compiling. Our file structure looks as follows:

|
| - src/
    | - file1.c
    | - file2.c
    | - file3.c
|
| - include/
    | - extras/
        | - header1.h
    | - header2.h
    | - header3.h
|
| - lib/
    | - libcoolstuff.a
    | - libawesomestuff.a

Here we have 3 folders: one containing our source (src), one containing headers (include), and the last containing libraries (lib). We will discuss libraries later.

Keep this file structure in mind when reading about each stage of the compilation process.

Overview

Each individual step of the compilation process is important, but it is helpful to first have an overview of how the steps fit together.

There are generally 2 options for compiling your code. You can pass every source file to a single invocation of the compiler, and it will chug through all of your code at once and spit out an executable with a single command. However, this is largely not how it is done in practice: any change forces everything to be recompiled, and the work cannot be spread across CPU cores, so builds take longer.

The more typical way is to invoke the compiler individually on each source file, and then use the linker to combine them all into a single executable. This has a few advantages:

  • You can change one source file and not need to recompile everything. If every file is processed individually and then later combined, changing just one source means you only need to recompile that source and recombine everything, instead of recompiling every source file in the project.
  • Compiling the source files can be parallelized. Each source in your project can be separately compiled on its own CPU core, speeding up the process tremendously.

You will see this latter type of process reflected in the compilation steps below. The first two steps are executed on a single source file, or translation unit, and the final step is executed on the whole combination of the intermediates.

Preprocessing

The first stage of processing, ironically, is called preprocessing. This stage is the very first pass over your code, during which the preprocessor evaluates all preprocessor directives. Everything from #include to #define will be fully expanded, and any code sections excluded from compilation with #if/#ifdef/#ifndef will be removed. The output of this stage is still perfectly readable C/C++. Nothing has been translated into assembly yet, so it is almost exactly the same as your source, except that included files have been fully pasted in and all macros have been expanded.

The result is a C/C++ file with no further preprocessor directives. This is also called a preprocessed translation unit.
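To make this concrete, here is a small sketch of what the preprocessor does to macros and conditional sections. The macro names BUFFER_SIZE, SQUARE and VERBOSE and the function scaled_buffer_size are invented for this illustration. With GCC or Clang, the -E option should stop after preprocessing and print the result, which lets you compare it with the comments below.

```cpp
// Hypothetical macros, invented for this illustration only.
#define BUFFER_SIZE 16
#define SQUARE(x) ((x) * (x))

// VERBOSE is deliberately left undefined, so the #ifdef branch is dropped.
int scaled_buffer_size() {
#ifdef VERBOSE
    return -1;  // removed by the preprocessor: VERBOSE is not defined
#else
    return SQUARE(BUFFER_SIZE);  // becomes: return ((16) * (16));
#endif
}
```

After preprocessing, only the second return statement is left in the function body, with both macros replaced by their expansions.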

This is also a good time to mention how the compiler locates header files that get included. The list of directories (or folders) that the compiler searches for a header matching the name you provided is called the include search path. When you install a compiler, it will by default search certain directories on your computer where things like system headers are located. This includes things like stdio.h and stdlib.h, or any other headers provided by the standard library. These headers are usually included with angle brackets (e.g. #include <stdio.h>) instead of quotes (#include "stdio.h"), which signifies to the compiler that these headers should be located with the system headers, so it should look there first.

In contrast, quoted header names are headers that may be user defined, or added by a library imported by the project. Here, the order in which the compiler traverses the search paths is slightly different (with quotes, the directory of the including file is typically searched first). In the end, if it has to, the compiler will search the same set of directories in both the angle bracket and quoted cases; only the order differs. The search path can be modified and have new directories added via arguments to the compiler itself (see command line syntax).

One other thing to note is that the argument given to the #include directive is relative to the set of search paths. This means that headers within folders not directly in the include path can still be added. For example, if we are compiling the scenario from above and we have the folder include/ in the search path, then we can include both header2.h and header3.h by just their names. We can still include header1.h even though extras/ is not explicitly in the search path by specifying the extras folder: #include "extras/header1.h". When the compiler is searching for header1.h, it will know to first look for a folder named extras somewhere in the include path, then a file named header1.h within that folder.
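As a rough sketch of how the search path affects what can be included, the snippet below uses __has_include (available since C++17) to ask the preprocessor whether the scenario's extras/header1.h is reachable. The macro HAVE_HEADER1 and both functions are hypothetical names. With GCC or Clang, a directory is normally added to the search path with the -I option (for example -Iinclude); without it, the header is not expected to be found.

```cpp
// Angle brackets: a standard library header, found in the system directories.
#include <cstdio>

// __has_include (C++17) checks whether a header can be found on the current
// search path, without failing the build if it cannot.
// "extras/header1.h" is only found when include/ is on the search path.
#if __has_include("extras/header1.h")
#include "extras/header1.h"
#define HAVE_HEADER1 1
#else
#define HAVE_HEADER1 0
#endif

// Reports whether the scenario header was reachable when this file was compiled.
bool header1_found() {
    return HAVE_HEADER1 == 1;
}

// Prints the result, using the standard header included above.
void print_header1_status() {
    std::printf("extras/header1.h found: %s\n", header1_found() ? "yes" : "no");
}
```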

Now that all the preprocessing directives have been handled, we can move on to translation.

Translation

Translation is the step of the compilation process that actually converts the human-readable text into machine code. The compiler first turns the preprocessed source into assembly language, a human-readable text form of processor instructions, and an assembler then encodes that into the raw binary the processor executes (knowing how to read assembly is not necessary for most programming, so we won’t discuss it much here). This translated code is written into what are called object files, usually ending in .o (or .obj with MSVC).

This step is where translation units matter the most. This is where things like static linkage start to matter, as each object file is created from exactly one preprocessed translation unit in this step. Compiler optimizations are also applied during translation.

Compiler optimizations refer to a wide range of code optimization techniques you can enable in the compile settings to make your program faster or smaller. This is also when constexpr things may be evaluated. Most compiler optimizations are things like removing unused variables and functions, removing code that is unreachable, evaluating whether inlining is worthwhile, and many other tricks to make your code slightly faster or slightly smaller.

One crucial thing to note about translation is that it does not resolve calls to functions or references to extern variables that are defined in other translation units. What this means is that while the resulting object file contains machine code, it also contains a lot of placeholders for functions and variables that have not yet been located, and these need to be resolved during the next and final step.
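As a small illustration of work being done during translation rather than at run time, consider the sketch below. The function names are invented for this example. The static_assert is checked while the file is being compiled, so the build fails if the value is wrong; whether the later ordinary call is also folded into a constant depends on the compiler and its optimization settings.

```cpp
// constexpr allows the compiler to evaluate this function during translation.
constexpr int factorial(int n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

// Checked during translation: the build fails here if the value is wrong.
static_assert(factorial(5) == 120, "factorial(5) is computed at compile time");

// The compiler may fold this call down to a constant in the object file.
int six_factorial() {
    return factorial(6);
}
```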

At this point, we have now used all of the source and header files to create something. From here on out, we will be working with the intermediate object files, and we have no more use for the original sources.

Linking

Linking is the final step in the compilation process. This is the step when all the various object files are combined together to produce the finished program.

Recall that we can place the definition for a function in one source file, but still use the function from another source so long as we have the appropriate forward declaration for it available to our program. The linking step is how the compiler (or actually the linker) associates the function call in one source/object file with the definition in another. Additionally, a similar process is completed for variables declared as extern.
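A minimal sketch of this situation is shown below. It is written as one block so that it stays self-contained, with comments marking where each part would live in the scenario project. The functions add and sum_three, and the idea that header2.h holds the declaration, are assumptions made for the example.

```cpp
// ---- would live in include/header2.h (hypothetical contents) ----
int add(int a, int b);  // declaration only: promises a definition exists somewhere

// ---- would live in src/file1.c ----
// When this file is translated on its own, the calls to add are left as
// placeholders in the object file.
int sum_three(int a, int b, int c) {
    return add(add(a, b), c);
}

// ---- would live in src/file2.c ----
// The linker matches the placeholders above to this definition.
int add(int a, int b) {
    return a + b;
}
```

If the definition in the last part were missing, each file would still compile, and the problem would only surface at link time.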

Libraries

Now is the time to talk about how libraries function. Libraries are pre-compiled pieces of code that implement hard or repetitive stuff for you, so that you don’t have to code it yourself. There are 2 main types of libraries, but for now we will only worry about static libraries, since dynamic/shared libraries (.dll on Windows and .so on Linux) are a little more complicated.

Static libraries will typically end with the extension .a, as can be seen in the lib folder in our example project. Static libraries are actually very similar to the object files we produced via the compilation steps above: in fact, they are just a bunch of object files bundled together into one archive. It’s the same as if someone performed all the compilation steps on the library source files for us, and sent us the resulting objects to link into our code. Of course, we would need the appropriate header files as well.

After resolving all the various function calls from both our own object files and those from static libraries, we finally have a finished executable!

Linker weirdness

The linker is a very different program that deals primarily with what are called symbols. Symbols are things like variable or function names, and have some kind of memory address associated with them (a variable is located at a particular address which is associated with the symbol for it, a function’s code is located at some particular memory address which has an associated symbol with the function name, etc.). In the object files, things like variables and functions are still in their symbolic form, instead of actual memory addresses that can be used by the computer. At the end of the day, this is what the linker is really doing: cataloging all the available symbols, and replacing references to symbols with the appropriate memory address.
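The cataloging and replacing described above can be sketched as a toy model. This is not how a real linker is implemented; the ObjectFile struct, the link_objects function and the made-up addresses are all assumptions chosen to show the idea of building a symbol table and checking every reference against it.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Toy model of an object file: the symbols it defines and the ones it only uses.
struct ObjectFile {
    std::vector<std::string> defined;
    std::vector<std::string> referenced;
};

// Gives every defined symbol a made-up address, then checks every reference.
std::map<std::string, unsigned> link_objects(const std::vector<ObjectFile>& objects) {
    std::map<std::string, unsigned> table;
    unsigned next_address = 0x1000;  // arbitrary starting address for the sketch
    for (const ObjectFile& object : objects) {
        for (const std::string& name : object.defined) {
            // emplace reports false if the symbol was already in the table.
            if (!table.emplace(name, next_address).second) {
                throw std::runtime_error("duplicate symbol: " + name);
            }
            next_address += 0x10;
        }
    }
    for (const ObjectFile& object : objects) {
        for (const std::string& name : object.referenced) {
            if (table.find(name) == table.end()) {
                throw std::runtime_error("unresolved symbol: " + name);
            }
        }
    }
    return table;
}
```

The two exceptions correspond to the two broad kinds of link failure: a symbol defined more than once, and a symbol that is used but never defined.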

However, symbols don’t always have to match the exact name you type in to your code; they can be anything, as long as the compiler is self-consistent with what symbols it uses.

The most obvious example of this is C++ name mangling. In C, function names must be unique: no two functions may have the same name, even if they take different parameters. However, C++ introduces function overloading, in which two functions can have the same name, so long as their parameter lists are different. When called, the compiler can choose the correct function based on the parameters.

This is implemented in a way that may seem somewhat hacky: in C++, the symbols for functions contain extra characters to encode the parameter list, that way the function symbols are still distinguished, and so two functions can have the same name.

Let’s look at an example. Take a look at this function:

#include <stdio.h>

double hello(int i, double d) {
    printf("Hello! %d", i);
    return 3 + d;
}

After compiling with a C compiler (GCC 13.2.0 with MinGW), we can extract the symbol name for the function: hello. We can see here the symbol name matches exactly the name we gave it!

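However, when compiling as C++ (G++ 13.2.0 with MinGW), the symbol is instead _Z5helloid. You may be tempted to say “oh, the function takes an int and a double as parameters, so that’s why it ends in id!”. For GCC, that reading is correct: it follows the Itanium C++ ABI, where _Z marks a mangled name, 5hello is the function name prefixed by its length, and i and d encode int and double. However, the mangling scheme depends on the compiler and platform (MSVC, for example, uses a completely different one), so you should not rely on any particular format.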

Either way, we can see the symbol is no longer as simple as just our function name. This fact that symbols don’t need to match names becomes more relevant when talking about linkage errors.
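To see why the extra characters are needed, here is a sketch with three overloads that share one name. The functions are invented for this example. The symbol names in the comments are what the Itanium C++ ABI mangling used by GCC and Clang is expected to produce; other compilers will produce something different.

```cpp
// Three overloads sharing one source-level name (hypothetical functions).
double hello(int i, double d) { return i + d; }  // expected symbol: _Z5helloid
double hello(int i)           { return i; }      // expected symbol: _Z5helloi
double hello(double d)        { return d * 2; }  // expected symbol: _Z5hellod

// The compiler picks the overload from the argument types at each call site.
double call_all() {
    return hello(1, 2.0) + hello(3) + hello(4.0);
}
```

Because each overload ends up with a distinct symbol, the linker never has to know that overloading exists.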

However, sometimes we still want to force C symbol names for certain things. For example, if we are calling C code from C++, we want the C++ compiler to know that, and to not mangle the function name like it normally would. We can mark this by declaring the function with extern "C". You may see this a lot in the headers of C libraries, since this is how you indicate to a C++ compiler that it is calling a C function, so it should use C symbol names.
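A common pattern for headers shared between C and C++ is sketched below. The function c_style_add is a hypothetical name. The __cplusplus macro is only defined when compiling as C++, so a C compiler never sees the extern "C" lines, which it would not understand.

```cpp
// ---- would live in a header shared by C and C++ code ----
#ifdef __cplusplus
extern "C" {
#endif

int c_style_add(int a, int b);  // declared with C linkage: no name mangling

#ifdef __cplusplus
}
#endif

// ---- in the scenario this definition would live in a C source file ----
// It is marked extern "C" here so the block also compiles as a single C++ file.
extern "C" int c_style_add(int a, int b) {
    return a + b;
}
```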

If you want another example: try decoding this function’s name and parameters: _Z12goodbyeworldNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorIiSaIiEE [1]
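If decoding by hand gets tedious, a demangler can do it for you. The sketch below assumes GCC or Clang, which provide abi::__cxa_demangle in the cxxabi.h header; it is not available with MSVC. The wrapper function demangle is a hypothetical helper written for this example. The c++filt command line tool from GNU binutils does the same job.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle, provided by GCC and Clang
#include <cstdlib>   // std::free
#include <string>

// Returns the readable form of an Itanium ABI mangled symbol,
// or the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    int status = 0;
    char* readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || readable == nullptr) {
        return mangled;
    }
    std::string result(readable);
    std::free(readable);  // the returned buffer is allocated with malloc
    return result;
}
```

Calling demangle("_Z5helloid") should give back hello(int, double), matching the example function from earlier.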

Linkage Errors

There is a whole separate class of errors that can arise during the linking stage, which can’t be caught by the compiler. This mostly amounts to missing or duplicated symbols (things like functions, variables, etc.) that make it so the linker cannot complete the program. Unfortunately, these errors can be very difficult to pinpoint and fix, due to the nature of preprocessing, and the fact that your IDE cannot highlight these errors, since they are not syntactic. Additionally, due to name mangling (see above), it can sometimes be hard to identify exactly what is causing the problem.

The first step to solving a linkage error is to determine what the problematic symbol is. This may involve some guesswork, especially when deciphering name mangling, but it is usually enough to know just the name of the function/variable. The mangling bits can typically be safely ignored.

From there, the symbol will either be missing (reported as an “unresolved external symbol” or an “undefined reference”, depending on the toolchain), or there will be duplicates.

Unresolved errors are much more straightforward to solve: you are using some function or variable that is not actually defined anywhere the linker can see. For functions, make sure you have a definition somewhere in your source code, an object file, or a library you are linking against. For variables (only variables declared extern can be unresolved), make sure it is actually defined without extern in exactly one source file.

Duplicate symbols are unfortunately much harder to solve. They can come about because preprocessing introduces duplicates somewhere you don’t expect (a common cause is a function or variable defined in a header that is included by more than one source file), or perhaps you really did just make a silly mistake of defining a function twice. In the former case, you unfortunately can’t trust your IDE to show the full story: when it is doing syntax highlighting, it processes files individually and does not account for definitions in other files.
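One frequent source of duplicate symbols is a function defined in a header that several sources include. The sketch below shows two common ways around it, using hypothetical function names. In C++, marking the function inline allows the same definition to appear in multiple translation units. The approach that also works in C is to keep only the declaration in the header and put the definition in exactly one source file.

```cpp
// ---- imagine this part lives in a header included by several sources ----

// Without 'inline', every source file including the header would emit its own
// definition of clamp_percent, and the linker would report a duplicate.
// With 'inline', identical definitions in multiple translation units are allowed.
inline int clamp_percent(int value) {
    if (value < 0) return 0;
    if (value > 100) return 100;
    return value;
}

// Alternative that also works in C: only declare the function in the header.
int double_percent(int value);

// ---- and define it in exactly one source file ----
int double_percent(int value) {
    return clamp_percent(value * 2);
}
```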

Linkage Types

There are a few different ways a function or variable can be given visibility to the rest of the program. These are often called linkage types, and they are selected with storage class specifiers. The two specifiers that are most important for linking are extern and static.

Extern Variables

Extern variables (or functions) are any symbols that you declare to exist, but don’t have the definition for yet. Variables declared extern are very similar to forward declarations for a function: they are actually the exact same kind of thing!

The following two forward declarations are equivalent:

int hello(double d);
extern int hello(double d);

For function forward declarations, the extern keyword is optional, which is why it is not usually included. For variables, declaring one extern effectively just introduces a forward declaration for it, in the same way as with a function. In order for a program with extern variables to link correctly, the actual definition of each variable must be present in exactly one source file.
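A minimal sketch of an extern variable and its definition follows. The names request_count and record_request are invented for this example, and the comments mark where each part would normally live in a multi-file project.

```cpp
// ---- would live in a shared header ----
extern int request_count;  // declaration: no storage is reserved here
int record_request();      // function declaration; 'extern' is implied

// ---- would live in exactly one source file ----
int request_count = 0;     // the single definition the linker resolves to

// Increments the shared counter and returns its new value.
int record_request() {
    return ++request_count;
}
```

Leaving out the definition of request_count would produce an unresolved symbol at link time, and defining it in two source files would produce a duplicate.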

Static Variables/Functions

Oftentimes, you may want a symbol to be restricted in scope to just the translation unit you are using it in. This is useful if, for example, you have a helper function you don’t necessarily want to expose to the rest of the program. By declaring a variable or function as static, it will be made unavailable outside of the translation unit it is defined in. Providing a forward declaration for such a symbol in another translation unit and then using it will result in a link error, as the linker cannot find a symbol with that name (since it is restricted to the translation unit).

In C++, another way to declare translation-unit local symbols is with the anonymous namespace:

namespace { ... }

This is functionally equivalent to the static keyword for most intents and purposes.
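Both approaches can be sketched side by side. The function names are hypothetical. Only sum_of_powers is visible to other translation units; the two helpers cannot be linked against from outside this file.

```cpp
// Restricted to this translation unit via 'static' (works in both C and C++).
static int square(int x) {
    return x * x;
}

// Restricted to this translation unit via an anonymous namespace (C++ only).
namespace {
int cube(int x) {
    return x * x * x;
}
}  // namespace

// Visible to the rest of the program: other translation units can call this.
int sum_of_powers(int x) {
    return square(x) + cube(x);
}
```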

Making changes

Once we’ve compiled and linked our code, that most likely isn’t the end for it. Chances are, a bug will be discovered somewhere and we’ll need to go back in and fix something in one of our source files. It would be a huge pain to recompile our entire program, especially if it is a few hundred sources rather than just 3. Luckily, we don’t have to.

Since we compiled each source into its own independent object file, we can just recompile that one source file and re-link the program (unfortunately linking needs to be completely re-done every time).
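Build tools such as make automate this decision by comparing file modification times. The sketch below is a simplified model of that idea and not how any particular tool is implemented. The SourceEntry struct, the sources_to_recompile function and the use of plain numbers as timestamps are all assumptions made for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical record for one source file. Times are arbitrary increasing
// numbers; an object_time below zero means no object file exists yet.
struct SourceEntry {
    std::string name;
    long source_time;
    long object_time;
};

// Returns the sources whose object files are missing or older than the source.
std::vector<std::string> sources_to_recompile(const std::vector<SourceEntry>& entries) {
    std::vector<std::string> stale;
    for (const SourceEntry& entry : entries) {
        if (entry.object_time < 0 || entry.source_time > entry.object_time) {
            stale.push_back(entry.name);
        }
    }
    return stale;
}
```

Only the sources in the returned list need to go through preprocessing and translation again; the link step then runs once over all object files, old and new.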

If there’s one thing to take away from this whole article, it’s this concept. Source files are processed individually, and only combined at the very end.


  1. Realistically, decoding the parameters for this function by hand is impractical: there are too many things going on in the STL for you to reasonably be expected to figure it out from the symbol name alone (a demangling tool such as c++filt can do it for you). However, the function name itself should be decipherable fairly easily. If you must know, the function was named goodbyeworld, and it took an std::string and std::vector<int>.