Jun 30, 2014

C Programming #34: Journey from source code to executable

This article explains how C source code is converted into a binary, looking at each step in detail. In the article [C Programming #03: First Program - Hello C] I explained how to compile and run C code.
Compilation command -


gcc -o hello hello.c

It looks like one simple step from the outside, but internally the compiler goes through several steps: pre-processing, compilation, assembly and linking.



Let's look at each step in detail. To make it more fun, we will take a small C program and see how the file is transformed at each step.

Sample C Program [sample.c]


#include <stdio.h>
#define MAX 125
int main()
{
   int i = 1023;

   if(i > MAX) {
      printf("i is greater than %d\n", MAX);
   }
   return 0;
}

Pre-process

The first step in compilation is called pre-processing. It is done by a part of the compiler called the pre-processor. The pre-processor does the following:
  1. Includes the header files.
  2. Expands the macros.
Pre-processing can be done with the following command -


cpp sample.c > sample.i

cpp stands for C pre-processor. Here sample.c is the input to the pre-processor and sample.i is its output. Let's see how our C source file is converted to pre-processed output. On my system the 12 lines of sample.c produce a sample.i that is 860 lines long. I won't explain all 860 lines, nor will I paste all of them here; only the relevant parts of sample.i are shown. Note: <CUT> means the part in between has been stripped.


# 1 "sample.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "sample.c"
# 1 "/usr/include/stdio.h" 1 3 4
<CUT1>
typedef unsigned char __u_char;
typedef unsigned short int __u_short;
<CUT2>
enum __codecvt_result
{
  __codecvt_ok,
  __codecvt_partial,
  __codecvt_error,
  __codecvt_noconv
};
<CUT3>
extern int __underflow (_IO_FILE *);
extern int __uflow (_IO_FILE *);
<CUT4>
# 2 "sample.c" 2

int main()
{
   int i = 1023;

   if(i > 125) {
      printf("i is greater than %d\n", 125);
   }

   return 0;
}

Question: How does a 12-line C file generate an 860-line translation unit?
Answer: It is very simple: most of it comes from the #include of stdio.h.

So stdio.h has
  1. typedefs - e.g. see <CUT1>
  2. enums - e.g. see <CUT2>
  3. function declarations - e.g. see <CUT3>
  4. structure and union definitions (no example)
Finally, our own code starts after the line # 2 "sample.c" 2. Lines of this form are linemarkers that the pre-processor emits so that later stages can map the output back to the original file and line; we need not worry about them here. If you want to know what they mean, you can read about them in this documentation. Note one more thing: MAX has now been replaced by 125 everywhere. Now that we understand pre-processing, let's move to the next step.

Compiler

The compiler takes the pre-processed file and generates an assembly file with the extension ".s". The assembly output depends on the CPU architecture being compiled for; in our case it is x86_64 (x86 is the Intel architecture family and 64 signifies a 64-bit CPU). Compilation is done using the following command

gcc -S sample.i
The -S option tells gcc to do only the compilation and stop after generating the assembly file. In my case it generated sample.s, which is 35 lines long and looks as follows


   .file "sample.c"
   .section .rodata
.LC0:
   .string  "i is greater than %d\n"
   .text
.globl main
   .type main, @function
main:
.LFB0:
   .cfi_startproc
   pushq %rbp
   .cfi_def_cfa_offset 16
   .cfi_offset 6, -16
   movq  %rsp, %rbp
   .cfi_def_cfa_register 6
   subq  $16, %rsp
   movl  $1023, -4(%rbp)
   cmpl  $125, -4(%rbp)
   jle   .L2
   movl  $.LC0, %eax
   movl  $125, %esi
   movq  %rax, %rdi
   movl  $0, %eax
   call  printf
.L2:
   movl  $0, %eax
   leave
   .cfi_def_cfa 7, 8
   ret
   .cfi_endproc
.LFE0:
   .size main, .-main
   .ident   "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-4)"
   .section .note.GNU-stack,"",@progbits

I am not (and do not want to be) an assembly expert, but there are a few observations here:
  • .file "sample.c"
    • This gives the name of the source file this assembly was generated from.
  • .section .rodata
    • This is the assembly notation for the start of read-only (ro) data.
    • The only entry here is
      • .string  "i is greater than %d\n"
    • Constant strings are placed in the rodata section.
  •  .text
    • This marks the beginning of the text section (where the actual code starts).
  • main:
    • See how the function main has become just a label here.
  • Note that the name of the local variable i is gone.
    • The variable itself lives on the stack at -4(%rbp): see movl $1023, -4(%rbp) and cmpl $125, -4(%rbp). With optimization enabled the compiler could keep it in a register or remove it altogether.
  • call  printf
    • This is the call to the function printf.
  • Stack
    • The first few instructions in main set up the stack frame of main.
This is the last place where we can still correlate the output so closely with the C code.

Assembler

The assembler converts the assembly file into an object file (machine code). This can be done with the following command -


as -o sample.o sample.s
-o names the output file, which in our case is the object file sample.o.

Until now we could examine each file by opening it directly in our favorite text editor. Object files, however, are binary files in a format such as ELF or COFF (ELF on Linux), which a text editor cannot display meaningfully. We use a command line tool called readelf to inspect our object file.


readelf -h sample.o 
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              REL (Relocatable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          0 (bytes into file)
  Start of section headers:          344 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           0 (bytes)
  Number of program headers:         0
  Size of section headers:           64 (bytes)
  Number of section headers:         13
  Section header string table index: 10
  1. This prints the header of the ELF file.
  2. The header gives information about the ELF file.
  3. We can see that it uses the ELF64 representation (this is for 64-bit machines).
  4. Data is represented in 2's complement, little endian.
  5. The architecture is X86-64 (64-bit machine).
  6. There are 13 section headers, each describing one section of the object file.
Above we read only the header of the ELF file. Now let's look at the section headers


readelf -S sample.o
There are 13 section headers, starting at offset 0x158:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         0000000000000000  00000040
       0000000000000033  0000000000000000  AX       0     0     4
  [ 2] .rela.text        RELA             0000000000000000  000005b8
       0000000000000030  0000000000000018          11     1     8
  [ 3] .data             PROGBITS         0000000000000000  00000074
       0000000000000000  0000000000000000  WA       0     0     4
  [ 4] .bss              NOBITS           0000000000000000  00000074
       0000000000000000  0000000000000000  WA       0     0     4
  [ 5] .rodata           PROGBITS         0000000000000000  00000074
       0000000000000016  0000000000000000   A       0     0     1
  [ 6] .comment          PROGBITS         0000000000000000  0000008a
       000000000000002d  0000000000000001  MS       0     0     1
  [ 7] .note.GNU-stack   PROGBITS         0000000000000000  000000b7
       0000000000000000  0000000000000000           0     0     1
  [ 8] .eh_frame         PROGBITS         0000000000000000  000000b8
       0000000000000038  0000000000000000   A       0     0     8
  [ 9] .rela.eh_frame    RELA             0000000000000000  000005e8
       0000000000000018  0000000000000018          11     8     8
  [10] .shstrtab         STRTAB           0000000000000000  000000f0
       0000000000000061  0000000000000000           0     0     1
  [11] .symtab           SYMTAB           0000000000000000  00000498
       0000000000000108  0000000000000018          12     9     8
  [12] .strtab           STRTAB           0000000000000000  000005a0
       0000000000000016  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings)
  I (info), L (link order), G (group), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)

We will not go into the details of all the sections, but here are some interesting ones
  1. .text - where our code lives.
  2. .data - initialized global (and static) variables.
  3. .bss - uninitialized global (and static) variables; these start out as zero.
  4. .rodata - read-only constants, such as constant strings.
  5. .symtab - symbol table information.
Now let's look into the symbol table using the following command -


readelf -s sample.o

Symbol table '.symtab' contains 11 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS sample.c
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    3 
     4: 0000000000000000     0 SECTION LOCAL  DEFAULT    4 
     5: 0000000000000000     0 SECTION LOCAL  DEFAULT    5 
     6: 0000000000000000     0 SECTION LOCAL  DEFAULT    7 
     7: 0000000000000000     0 SECTION LOCAL  DEFAULT    8 
     8: 0000000000000000     0 SECTION LOCAL  DEFAULT    6 
     9: 0000000000000000    51 FUNC    GLOBAL DEFAULT    1 main
    10: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND printf

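  1. sample.c is of type FILE.
  2. main is a GLOBAL FUNC.
  3. printf is UND - undefined (resolving it is the job of the linker, which comes later).
  4. Also note that every Value is 0. In a relocatable file a symbol's value is only an offset within its section (main starts at offset 0 of .text); the final addresses are filled in by the linker.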

Linker


The linker takes the object file, combines it with any libraries that need to be linked and produces the executable. The command for linking is a little complicated and is as follows -


ld -o sample -dynamic-linker /lib64/ld-linux-x86-64.so.2 \
/usr/lib64/crt1.o /usr/lib64/crti.o \
sample.o /usr/lib64/crtn.o \
/usr/lib64/libc.so 

  1. -o names the output file, which is just sample in our case.
  2. -dynamic-linker gives the path of the dynamic linker (loader), here /lib64/ld-linux-x86-64.so.2. The path is recorded in the executable, and the loader is run at startup to load the shared libraries.
  3. crt1.o, crti.o and crtn.o are C runtime startup object files that always need to be linked; crt1.o provides _start, which eventually calls main.
  4. sample.o is our object file.
  5. libc.so is a shared object (so) which provides printf (it will be linked dynamically).

The output sample is also in ELF format. Let's look only at its symbol tables -


readelf -s sample
Symbol table '.dynsym' contains 4 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND printf@GLIBC_2.2.5 (2)
     2: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
     3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@GLIBC_2.2.5 (2)

Symbol table '.symtab' contains 42 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 00000000004001c8     0 SECTION LOCAL  DEFAULT    1 
     2: 00000000004001e4     0 SECTION LOCAL  DEFAULT    2 
     3: 0000000000400208     0 SECTION LOCAL  DEFAULT    3 
     4: 0000000000400230     0 SECTION LOCAL  DEFAULT    4 
     5: 0000000000400290     0 SECTION LOCAL  DEFAULT    5 
     6: 00000000004002d0     0 SECTION LOCAL  DEFAULT    6 
     7: 00000000004002d8     0 SECTION LOCAL  DEFAULT    7 
     8: 00000000004002f8     0 SECTION LOCAL  DEFAULT    8 
     9: 0000000000400310     0 SECTION LOCAL  DEFAULT    9 
    10: 0000000000400340     0 SECTION LOCAL  DEFAULT   10 
    11: 0000000000400350     0 SECTION LOCAL  DEFAULT   11 
    12: 0000000000400380     0 SECTION LOCAL  DEFAULT   12 
    13: 000000000040049c     0 SECTION LOCAL  DEFAULT   13 
    14: 00000000004004a8     0 SECTION LOCAL  DEFAULT   14 
    15: 00000000004004c8     0 SECTION LOCAL  DEFAULT   15 
    16: 0000000000600540     0 SECTION LOCAL  DEFAULT   16 
    17: 00000000006006d0     0 SECTION LOCAL  DEFAULT   17 
    18: 00000000006006d8     0 SECTION LOCAL  DEFAULT   18 
    19: 0000000000600700     0 SECTION LOCAL  DEFAULT   19 
    20: 0000000000000000     0 SECTION LOCAL  DEFAULT   20 
    21: 00000000004003ac     0 FUNC    LOCAL  DEFAULT   12 call_gmon_start
    22: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS sample.c
    23: 00000000006006d8     0 OBJECT  LOCAL  DEFAULT   18 _GLOBAL_OFFSET_TABLE_
    24: 0000000000600540     0 NOTYPE  LOCAL  DEFAULT   16 __init_array_end
    25: 0000000000600540     0 NOTYPE  LOCAL  DEFAULT   16 __init_array_start
    26: 0000000000600540     0 OBJECT  LOCAL  DEFAULT   16 _DYNAMIC
    27: 0000000000600700     0 NOTYPE  WEAK   DEFAULT   19 data_start
    28: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND printf@@GLIBC_2.2.5
    29: 0000000000400400     2 FUNC    GLOBAL DEFAULT   12 __libc_csu_fini
    30: 0000000000400380     0 FUNC    GLOBAL DEFAULT   12 _start
    31: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
    32: 000000000040049c     0 FUNC    GLOBAL DEFAULT   13 _fini
    33: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@@GLIBC_
    34: 00000000004004a8     4 OBJECT  GLOBAL DEFAULT   14 _IO_stdin_used
    35: 0000000000600700     0 NOTYPE  GLOBAL DEFAULT   19 __data_start
    36: 0000000000400410   137 FUNC    GLOBAL DEFAULT   12 __libc_csu_init
    37: 0000000000600704     0 NOTYPE  GLOBAL DEFAULT  ABS __bss_start
    38: 0000000000600708     0 NOTYPE  GLOBAL DEFAULT  ABS _end
    39: 0000000000600704     0 NOTYPE  GLOBAL DEFAULT  ABS _edata
    40: 00000000004003c4    51 FUNC    GLOBAL DEFAULT   12 main
    41: 0000000000400340     0 FUNC    GLOBAL DEFAULT   10 _init
  • There are two symbol tables: the dynamic symbol table (.dynsym) and the normal symbol table (.symtab).
  • main has now been assigned an address (value).
  • The printf reference is still UND, since the loader will assign its value when it dynamically loads libc at run time. (Loader will be covered later)
  • There are a lot of internal symbols that have nothing to do with the code of our own program.
sample can now be executed directly as ./sample.

Links

Next Article - C Programming #35: Preprocessor
Previous Article - C Programming #33: Global Variable
All Article - C Programming
