Compilation command -
gcc -o hello hello.c
From the outside this looks like one simple step, but internally the compiler goes through several stages: pre-processing, compilation, assembly, and linking.
Let's look at each stage in detail. To make it more fun, we will take a small C program and see how the file is transformed at each stage.
Sample C Program [sample.c]
#include <stdio.h>
#define MAX 125

int main()
{
    int i = 1023;
    if(i > MAX)
    {
        printf("i is greater than %d\n", MAX);
    }
    return 0;
}
Pre-process
The first step in compilation is called pre-processing. It is done by a part of the compiler called the pre-processor, which does the following:
- Includes the header files.
- Expands the macros.
The pre-processor can be run on its own with the following command:
cpp sample.c > sample.i
cpp stands for C pre-processor. Here sample.c is the input to the pre-processor and sample.i is its output. Let's see how our C source file is converted to pre-processed output. On my system the 12 lines of sample.c produce a sample.i that is 860 lines long. I won't explain or paste all 860 lines; only the relevant parts of sample.i are shown here. Note: <CUT> means that the part in between has been stripped.
# 1 "sample.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "sample.c"
# 1 "/usr/include/stdio.h" 1 3 4
<CUT1>
typedef unsigned char __u_char;
typedef unsigned short int __u_short;
<CUT2>
enum __codecvt_result
{
  __codecvt_ok,
  __codecvt_partial,
  __codecvt_error,
  __codecvt_noconv
};
<CUT3>
extern int __underflow (_IO_FILE *);
extern int __uflow (_IO_FILE *);
<CUT4>
# 2 "sample.c" 2

int main()
{
    int i = 1023;
    if(i > 125)
    {
        printf("i is greater than %d\n", 125);
    }
    return 0;
}
Question: How can a 12-line C file generate an 860-line translation unit?
Answer: It is very simple: most of it comes from the #include of stdio.h.
stdio.h contains:
- typedefs, e.g. see <CUT1>
- enums, e.g. see <CUT2>
- function declarations, e.g. see <CUT3>
- structure and union definitions (no example shown)
Compiler
The compiler takes the pre-processed file and generates an assembly file with the extension ".s". The assembly output depends on the CPU architecture being compiled for; in our case it is x86_64 (x86 is the Intel architecture and 64 signifies a 64-bit CPU). Compilation is done using the following command:
gcc -S sample.i
    .file "sample.c"
    .section .rodata
.LC0:
    .string "i is greater than %d\n"
    .text
.globl main
    .type main, @function
main:
.LFB0:
    .cfi_startproc
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register 6
    subq $16, %rsp
    movl $1023, -4(%rbp)
    cmpl $125, -4(%rbp)
    jle .L2
    movl $.LC0, %eax
    movl $125, %esi
    movq %rax, %rdi
    movl $0, %eax
    call printf
.L2:
    movl $0, %eax
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size main, .-main
    .ident "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-4)"
    .section .note.GNU-stack,"",@progbits
I am not an assembly expert (nor do I want to be), but here are a few observations:
- .file "sample.c"
  - Names the source file this assembly was generated from.
- .section .rodata
  - This is the assembly notation for the start of read-only (ro) data.
  - The only entry here is .string "i is greater than %d\n".
  - Constant strings are placed in the rodata section.
- .text
  - Marks the beginning of the text section (where the actual code starts).
- main:
  - See how the function main has become just a label here.
  - Note that the name of the local variable i is gone. The variable itself lives on the stack at -4(%rbp): movl $1023, -4(%rbp) stores its initial value and cmpl $125, -4(%rbp) compares it with MAX.
- call printf
  - This is the call to the function printf.
- Stack
  - The first few instructions in main (pushq %rbp, movq %rsp, %rbp, subq $16, %rsp) set up the stack frame of main.
Assembler
The assembler converts the assembly file into an object file (machine code). This can be done with the following command:
as -o sample.o sample.s
Until now we could examine each file by opening it in our favorite text editor. Object files, however, are in a binary format such as ELF or COFF, which a text editor will not display meaningfully. We use a command-line tool called readelf to inspect our object file.
readelf -h sample.o
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              REL (Relocatable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          0 (bytes into file)
  Start of section headers:          344 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           0 (bytes)
  Number of program headers:         0
  Size of section headers:           64 (bytes)
  Number of section headers:         13
  Section header string table index: 10
- This prints the header of the ELF file.
- The header gives information about the ELF file.
- We can see that the file uses the ELF64 representation (this is for 64-bit machines).
- Data is represented in 2's complement, little endian.
- The architecture is X86-64 (a 64-bit machine).
- There are 13 section headers, each describing one section of the object file.
readelf -S sample.o
There are 13 section headers, starting at offset 0x158:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         0000000000000000  00000040
       0000000000000033  0000000000000000  AX       0     0     4
  [ 2] .rela.text        RELA             0000000000000000  000005b8
       0000000000000030  0000000000000018          11     1     8
  [ 3] .data             PROGBITS         0000000000000000  00000074
       0000000000000000  0000000000000000  WA       0     0     4
  [ 4] .bss              NOBITS           0000000000000000  00000074
       0000000000000000  0000000000000000  WA       0     0     4
  [ 5] .rodata           PROGBITS         0000000000000000  00000074
       0000000000000016  0000000000000000   A       0     0     1
  [ 6] .comment          PROGBITS         0000000000000000  0000008a
       000000000000002d  0000000000000001  MS       0     0     1
  [ 7] .note.GNU-stack   PROGBITS         0000000000000000  000000b7
       0000000000000000  0000000000000000           0     0     1
  [ 8] .eh_frame         PROGBITS         0000000000000000  000000b8
       0000000000000038  0000000000000000   A       0     0     8
  [ 9] .rela.eh_frame    RELA             0000000000000000  000005e8
       0000000000000018  0000000000000018          11     8     8
  [10] .shstrtab         STRTAB           0000000000000000  000000f0
       0000000000000061  0000000000000000           0     0     1
  [11] .symtab           SYMTAB           0000000000000000  00000498
       0000000000000108  0000000000000018          12     9     8
  [12] .strtab           STRTAB           0000000000000000  000005a0
       0000000000000016  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings)
  I (info), L (link order), G (group), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)
We will not go into the details of every section, but here are some interesting ones:
- .text - holds our code.
- .data - initialized global variables.
- .bss - uninitialized global variables.
- .rodata - read-only constants and constant strings.
- .symtab - holds the symbol table information.
$ readelf -s sample.o

Symbol table '.symtab' contains 11 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS sample.c
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    3
     4: 0000000000000000     0 SECTION LOCAL  DEFAULT    4
     5: 0000000000000000     0 SECTION LOCAL  DEFAULT    5
     6: 0000000000000000     0 SECTION LOCAL  DEFAULT    7
     7: 0000000000000000     0 SECTION LOCAL  DEFAULT    8
     8: 0000000000000000     0 SECTION LOCAL  DEFAULT    6
     9: 0000000000000000    51 FUNC    GLOBAL DEFAULT    1 main
    10: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND printf
- sample.c is of type FILE.
- main is a GLOBAL FUNC.
- printf is UND (undefined). Resolving it is the job of the linker, which comes later.
- Also note that every Value is zero: final addresses have not been assigned yet, which again is the job of the linker.
Linker
The linker takes the object file, combines it with any libraries that need to be linked, and produces an executable. The command for linking is a little complicated:
ld -o sample -dynamic-linker /lib64/ld-linux-x86-64.so.2 \
   /usr/lib64/crt1.o /usr/lib64/crti.o \
   sample.o /usr/lib64/crtn.o \
   /usr/lib64/libc.so
- -o names the output file, which is just sample in our case.
- -dynamic-linker sets the path of the dynamic linker (here /lib64/ld-linux-x86-64.so.2) that is recorded in the executable. That program loads the shared libraries when sample is run.
- crt1.o, crti.o and crtn.o are the C runtime startup object files that are normally linked into every C program. crt1.o provides _start, which eventually calls main.
- sample.o is our object file.
- libc.so is a shared object (so) that provides printf, which will be linked dynamically.
The output file sample is also in ELF format. Here are just its symbol tables:
readelf -s sample

Symbol table '.dynsym' contains 4 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND printf@GLIBC_2.2.5 (2)
     2: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
     3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@GLIBC_2.2.5 (2)

Symbol table '.symtab' contains 42 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 00000000004001c8     0 SECTION LOCAL  DEFAULT    1
     2: 00000000004001e4     0 SECTION LOCAL  DEFAULT    2
     3: 0000000000400208     0 SECTION LOCAL  DEFAULT    3
     4: 0000000000400230     0 SECTION LOCAL  DEFAULT    4
     5: 0000000000400290     0 SECTION LOCAL  DEFAULT    5
     6: 00000000004002d0     0 SECTION LOCAL  DEFAULT    6
     7: 00000000004002d8     0 SECTION LOCAL  DEFAULT    7
     8: 00000000004002f8     0 SECTION LOCAL  DEFAULT    8
     9: 0000000000400310     0 SECTION LOCAL  DEFAULT    9
    10: 0000000000400340     0 SECTION LOCAL  DEFAULT   10
    11: 0000000000400350     0 SECTION LOCAL  DEFAULT   11
    12: 0000000000400380     0 SECTION LOCAL  DEFAULT   12
    13: 000000000040049c     0 SECTION LOCAL  DEFAULT   13
    14: 00000000004004a8     0 SECTION LOCAL  DEFAULT   14
    15: 00000000004004c8     0 SECTION LOCAL  DEFAULT   15
    16: 0000000000600540     0 SECTION LOCAL  DEFAULT   16
    17: 00000000006006d0     0 SECTION LOCAL  DEFAULT   17
    18: 00000000006006d8     0 SECTION LOCAL  DEFAULT   18
    19: 0000000000600700     0 SECTION LOCAL  DEFAULT   19
    20: 0000000000000000     0 SECTION LOCAL  DEFAULT   20
    21: 00000000004003ac     0 FUNC    LOCAL  DEFAULT   12 call_gmon_start
    22: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS sample.c
    23: 00000000006006d8     0 OBJECT  LOCAL  DEFAULT   18 _GLOBAL_OFFSET_TABLE_
    24: 0000000000600540     0 NOTYPE  LOCAL  DEFAULT   16 __init_array_end
    25: 0000000000600540     0 NOTYPE  LOCAL  DEFAULT   16 __init_array_start
    26: 0000000000600540     0 OBJECT  LOCAL  DEFAULT   16 _DYNAMIC
    27: 0000000000600700     0 NOTYPE  WEAK   DEFAULT   19 data_start
    28: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND printf@@GLIBC_2.2.5
    29: 0000000000400400     2 FUNC    GLOBAL DEFAULT   12 __libc_csu_fini
    30: 0000000000400380     0 FUNC    GLOBAL DEFAULT   12 _start
    31: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
    32: 000000000040049c     0 FUNC    GLOBAL DEFAULT   13 _fini
    33: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@@GLIBC_
    34: 00000000004004a8     4 OBJECT  GLOBAL DEFAULT   14 _IO_stdin_used
    35: 0000000000600700     0 NOTYPE  GLOBAL DEFAULT   19 __data_start
    36: 0000000000400410   137 FUNC    GLOBAL DEFAULT   12 __libc_csu_init
    37: 0000000000600704     0 NOTYPE  GLOBAL DEFAULT  ABS __bss_start
    38: 0000000000600708     0 NOTYPE  GLOBAL DEFAULT  ABS _end
    39: 0000000000600704     0 NOTYPE  GLOBAL DEFAULT  ABS _edata
    40: 00000000004003c4    51 FUNC    GLOBAL DEFAULT   12 main
    41: 0000000000400340     0 FUNC    GLOBAL DEFAULT   10 _init
- There are two symbol tables: the dynamic symbol table (.dynsym) and the normal symbol table (.symtab).
- main now has an address (Value) assigned to it.
- printf is still UND, since the loader will assign its address when it loads the library code dynamically at run time. (The loader will be covered later.)
- There are a lot of internal symbols that are not relevant to our program.