NVCC Compilation Flow

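I have been reviewing GPU fundamentals lately. My understanding of the NVCC compilation flow had always been partial, so this time I dig a little deeper.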

Basic Concepts

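The figure above comes from the official documentation. As the saying goes, "there is no problem in computer science that cannot be solved by adding another layer of abstraction." Just like compilation for the CPU, going from source code to machine code passes through several layers of abstraction, so before looking at the NVCC compilation flow we first need a basic picture of the abstraction levels involved: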

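Type	Suffix	Description
CUDA source code	.cu	Contains both host and device source code
NVVM IR	.ll/.bc	Essentially LLVM IR: every NVVM IR program is a valid LLVM IR program, but not the other way around
Virtual-architecture assembly	.ptx	PTX (Parallel Thread Execution) code. Comparable to Java bytecode: independent of any physical GPU, which enables forward compatibility with newer GPUs and JIT compilation; corresponds to a virtual architecture compute_xy
Physical-architecture machine code	.cubin	SASS code (the native GPU assembly), the machine code that actually runs; corresponds to a physical architecture sm_xy
fat binary	.fatbin	Can bundle multiple PTX files and cubin files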

The other file suffixes in the figure are the same as in CPU compilation.

Learn by Examples

Environment

Debian 11
A100
Driver 535.113.01
CUDA 11.2
GCC 10.2.1

Architectures

List the virtual and physical architectures supported by NVCC
- nvcc --list-gpu-code --list-gpu-arch

> nvcc --list-gpu-code --list-gpu-arch
arch=compute_35,code=sm_35
arch=compute_37,code=sm_37
arch=compute_50,code=sm_50
arch=compute_52,code=sm_52
arch=compute_53,code=sm_53
arch=compute_60,code=sm_60
arch=compute_61,code=sm_61
arch=compute_62,code=sm_62
arch=compute_70,code=sm_70
arch=compute_72,code=sm_72
arch=compute_75,code=sm_75
arch=compute_80,code=sm_80
arch=compute_86,code=sm_86

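When first learning CUDA, the usual approach is to look up the device's compute capability on the official site and set the flags without thinking. That generally works, but it leaves their meaning unclear. Below are comparison experiments with different flag combinations; the source code comes from code-samples.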

nvcc shared-memory.cu -o shared-memory-base && 
nvcc shared-memory.cu -o shared-memory-arch_compute_70 --gpu-architecture=compute_70 &&
nvcc shared-memory.cu -o shared-memory-arch_compute_80 --gpu-architecture=compute_80 &&
nvcc shared-memory.cu -o shared-memory-arch_sm_80 --gpu-architecture=sm_80 &&
nvcc shared-memory.cu -o shared-memory-arch_compute_80-code_compute_80 --gpu-architecture=compute_80 --gpu-code=compute_80 &&
nvcc shared-memory.cu -o shared-memory-arch_compute_80-code_compute_80_sm_80 --gpu-architecture=compute_80 --gpu-code=compute_80,sm_80 &&
nvcc shared-memory.cu -o shared-memory-arch_compute_80-code_compute_80_sm_80_sm_86 --gpu-architecture=compute_80 --gpu-code=compute_80,sm_80,sm_86 &&
nvcc shared-memory.cu -o shared-memory-arch_compute_80-code_compute_80_sm_86 --gpu-architecture=compute_80 --gpu-code=compute_80,sm_86 &&
nvcc shared-memory.cu -o shared-memory-arch_compute_80-code_sm_80 --gpu-architecture=compute_80 --gpu-code=sm_80 &&
nvcc shared-memory.cu -o shared-memory-arch_compute_80-code_sm_80_sm_86 --gpu-architecture=compute_80 --gpu-code=sm_80,sm_86 &&
nvcc shared-memory.cu -o shared-memory-arch_compute_80-code_sm_86 --gpu-architecture=compute_80 --gpu-code=sm_86

A Puzzle About Binary Size

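After compiling with each flag combination above, let's look at the resulting executables
- ls -la shared-memory-*
- The executable sizes alone differ considerably, falling into two groups: 1,500,000+ bytes and 700,000+ bytes
- This naturally raises questions: why do some flags seem to have no effect on the generated program while others do, and how exactly do these flags affect the executable and its final size?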

> ls -la shared-memory-*

-rwxr-xr-x 1 zty zty 1520216 Nov 24 20:12 shared-memory-arch_compute_70
-rwxr-xr-x 1 zty zty 1520216 Nov 24 20:12 shared-memory-arch_compute_80
-rwxr-xr-x 1 zty zty 1520216 Nov 24 20:12 shared-memory-arch_compute_80-code_compute_80
-rwxr-xr-x 1 zty zty  700480 Nov 24 20:12 shared-memory-arch_compute_80-code_compute_80_sm_80
-rwxr-xr-x 1 zty zty  704576 Nov 24 20:12 shared-memory-arch_compute_80-code_compute_80_sm_80_sm_86
-rwxr-xr-x 1 zty zty  700480 Nov 24 20:12 shared-memory-arch_compute_80-code_compute_80_sm_86
-rwxr-xr-x 1 zty zty  700480 Nov 24 20:12 shared-memory-arch_compute_80-code_sm_80
-rwxr-xr-x 1 zty zty  704576 Nov 24 20:12 shared-memory-arch_compute_80-code_sm_80_sm_86
-rwxr-xr-x 1 zty zty  700480 Nov 24 20:12 shared-memory-arch_compute_80-code_sm_86
-rwxr-xr-x 1 zty zty  700480 Nov 24 20:12 shared-memory-arch_sm_80
-rwxr-xr-x 1 zty zty  700480 Nov 24 20:12 shared-memory-base
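
To make the size classes easier to see, the listing can be grouped by byte count. The following is a minimal Python sketch, not part of the original experiment: the helper name group_by_size is hypothetical, it assumes the plain nine-column ls -l layout shown above, and it uses four rows of the listing as sample input.

```python
from collections import defaultdict


def group_by_size(ls_output):
    """Map file size in bytes -> list of file names, from `ls -l` style lines."""
    groups = defaultdict(list)
    for line in ls_output.splitlines():
        fields = line.split()
        # Assumed layout: mode, links, owner, group, size, month, day, time, name.
        if len(fields) == 9 and fields[4].isdigit():
            groups[int(fields[4])].append(fields[8])
    return dict(groups)


# Four rows copied from the listing above.
sample = """\
-rwxr-xr-x 1 zty zty 1520216 Nov 24 20:12 shared-memory-arch_compute_80
-rwxr-xr-x 1 zty zty  700480 Nov 24 20:12 shared-memory-arch_compute_80-code_sm_80
-rwxr-xr-x 1 zty zty  704576 Nov 24 20:12 shared-memory-arch_compute_80-code_sm_80_sm_86
-rwxr-xr-x 1 zty zty  700480 Nov 24 20:12 shared-memory-base
"""

groups = group_by_size(sample)
for size, names in sorted(groups.items()):
    print(size, names)
```

Files that end up in the same group are candidates for "these flags produced the same fatbin", which is the question the following sections look into.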

Analysis: Sections

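Next we look at the section information of the executables
- size -A -d shared-memory-*
- The output is long, so only two representative cases are shown. The difference in file size comes mainly from the section __nv_relfatbin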

> size -A -d shared-memory-*
...

shared-memory-arch_compute_80  :
section                 size      addr
.interp                   28       736
.note.gnu.build-id        36       764
.note.ABI-tag             32       800
.gnu.hash                 36       832
.dynsym                 3792       872
.dynstr                 1909      4664
.gnu.version             316      6574
.gnu.version_r           240      6896
.rela.dyn              21552      7136
.rela.plt               3648     28688
.init                     23     32768
.plt                    2448     32800
.plt.got                   8     35248
.text                 411566     35264
.fini                      9    446832
.rodata                48076    450560
__nv_module_id           148    498656
.nv_fatbin               648    498808
__nv_relfatbin        785960    499456
.eh_frame_hdr          12340   1285416
.eh_frame              82032   1297760
.tbss                   4096   1384448
.init_array               40   1384448
.fini_array                8   1384488
.data.rel.ro           15456   1384512
.dynamic                 544   1399968
.got                      40   1400512
.got.plt                1240   1400832
.data                     88   1402080
.nvFatBinSegment          72   1402168
.bss                    3744   1402240
.comment                  83         0
Total                1400258

...

shared-memory-arch_compute_80-code_sm_80  :
section                size     addr
.interp                  28      736
.note.gnu.build-id       36      764
.note.ABI-tag            32      800
.gnu.hash                36      832
.dynsym                3792      872
.dynstr                1909     4664
.gnu.version            316     6574
.gnu.version_r          240     6896
.rela.dyn             15936     7136
.rela.plt              3648    23072
.init                    23    28672
.plt                   2448    28704
.plt.got                  8    31152
.text                389646    31168
.fini                     9   420816
.rodata               42700   421888
.nv_fatbin             5752   464592
__nv_module_id           15   470344
.eh_frame_hdr         12044   470360
.eh_frame             79800   482408
.tbss                  4096   569344
.init_array              32   569344
.fini_array               8   569376
.data.rel.ro          12672   569408
.dynamic                544   582080
.got                     40   582624
.got.plt               1240   585728
.data                    88   586976
.nvFatBinSegment         48   587064
.bss                   3424   587136
.comment                 83        0
Total                580693

...
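
One way to confirm which section accounts for the difference is to diff the two size -A -d reports. The following Python sketch assumes that every data row has the form `<section> <size> <addr>`; parse_sections and largest_differences are illustrative names, and only a few rows from the two reports above are copied in as sample data.

```python
def parse_sections(size_output):
    """Parse `size -A -d` rows of the form `<section> <size> <addr>`."""
    sections = {}
    for line in size_output.splitlines():
        fields = line.split()
        # Header, file-name and Total lines do not have three fields
        # with two numeric columns, so they are skipped.
        if len(fields) == 3 and fields[1].isdigit() and fields[2].isdigit():
            sections[fields[0]] = int(fields[1])
    return sections


def largest_differences(a, b, top=3):
    """Return the `top` sections with the largest absolute size difference a - b."""
    deltas = {name: a.get(name, 0) - b.get(name, 0) for name in set(a) | set(b)}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]


# A few rows copied from the two reports above.
big = parse_sections("""\
section                 size      addr
.rela.dyn              21552      7136
.text                 411566     35264
.rodata                48076    450560
.nv_fatbin               648    498808
__nv_relfatbin        785960    499456
""")
small = parse_sections("""\
section                size     addr
.rela.dyn             15936     7136
.text                389646    31168
.rodata               42700   421888
.nv_fatbin             5752   464592
""")

top = largest_differences(big, small)
for name, delta in top:
    print(f"{name:16s} {delta:+d}")
```

On these rows, __nv_relfatbin dominates by a wide margin, while .text and .rela.dyn differ by much smaller amounts.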

Analysis: Disassembly

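The name __nv_relfatbin suggests that it is related to the .fatbin mentioned earlier. A fatbin holds PTX/SASS code, so the section probably plays a role similar to the .text code segment. On the CPU side we disassemble with objdump; on the GPU side the analogous tool is cuobjdump.
- objdump can in fact be used as well, but it only disassembles the CPU code sections and shows nothing of the fatbin
First, look at the PTX and SASS disassembly
- cuobjdump shared-memory-base -ptx/cuobjdump shared-memory-base -sass/cuobjdump shared-memory-arch_compute_80-code_sm_80 -ptx
- The good news: we can see inside the fatbin, which appears to consist mainly of Fatbin ptx code and Fatbin elf code entries
- The bad news: not every ELF file yields PTX when disassembled, and the number of Fatbin ptx code and Fatbin elf code entries varies widely, which only deepens the confusion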

> cuobjdump shared-memory-base -ptx

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_52
code version = [7,2]
producer = <unknown>
host = linux
compile_size = 64bit
compressed

.version 7.2
.target sm_52
.address_size 64

.extern .shared .align 4 .b8 s[];

.visible .entry _Z13staticReversePii(
.param .u64 _Z13staticReversePii_param_0,
.param .u32 _Z13staticReversePii_param_1
)
{
.reg .b32 %r<12>;
.reg .b64 %rd<5>;

        .shared .align 4 .b8 _ZZ13staticReversePiiE1s[256];

ld.param.u64 %rd1, [_Z13staticReversePii_param_0];
ld.param.u32 %r1, [_Z13staticReversePii_param_1];
cvta.to.global.u64 %rd2, %rd1;
mov.u32 %r2, %tid.x;
not.b32 %r3, %r2;
add.s32 %r4, %r3, %r1;
mul.wide.s32 %rd3, %r2, 4;
add.s64 %rd4, %rd2, %rd3;
ld.global.u32 %r5, [%rd4];
shl.b32 %r6, %r2, 2;
mov.u32 %r7, _ZZ13staticReversePiiE1s;
add.s32 %r8, %r7, %r6;
st.shared.u32 [%r8], %r5;
bar.sync 0;
shl.b32 %r9, %r4, 2;
add.s32 %r10, %r7, %r9;
ld.shared.u32 %r11, [%r10];
st.global.u32 [%rd4], %r11;
ret;

}

.visible .entry _Z14dynamicReversePii(
.param .u64 _Z14dynamicReversePii_param_0,
.param .u32 _Z14dynamicReversePii_param_1
)
{
.reg .b32 %r<12>;
.reg .b64 %rd<5>;


ld.param.u64 %rd1, [_Z14dynamicReversePii_param_0];
ld.param.u32 %r1, [_Z14dynamicReversePii_param_1];
cvta.to.global.u64 %rd2, %rd1;
mov.u32 %r2, %tid.x;
not.b32 %r3, %r2;
add.s32 %r4, %r3, %r1;
mul.wide.s32 %rd3, %r2, 4;
add.s64 %rd4, %rd2, %rd3;
ld.global.u32 %r5, [%rd4];
shl.b32 %r6, %r2, 2;
mov.u32 %r7, s;
add.s32 %r8, %r7, %r6;
st.shared.u32 [%r8], %r5;
bar.sync 0;
shl.b32 %r9, %r4, 2;
add.s32 %r10, %r7, %r9;
ld.shared.u32 %r11, [%r10];
st.global.u32 [%rd4], %r11;
ret;

}

> cuobjdump shared-memory-base -sass


Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_52

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_52
                Function : _Z14dynamicReversePii
        .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                    /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;           /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;               /* 0xf0c8000002170000 */
        /*0018*/                   SHL R5, R0.reuse, 0x2 ;          /* 0x3848000000270005 */
                                                                    /* 0x001fc800fec007f5 */
        /*0028*/                   SHR R3, R0, 0x1e ;               /* 0x3829000001e70003 */
        /*0030*/                   IADD R2.CC, R5, c[0x0][0x140] ;  /* 0x4c10800005070502 */
        /*0038*/                   IADD.X R3, R3, c[0x0][0x144] ;   /* 0x4c10080005170303 */
                                                                    /* 0x001fd800fcc007b1 */
        /*0048*/                   LDG.E R4, [R2] ;                 /* 0xeed4200000070204 */
        /*0050*/                   LOP.PASS_B R0, RZ, ~R0 ;         /* 0x5c4707000007ff00 */
        /*0058*/                   IADD R0, R0, c[0x0][0x148] ;     /* 0x4c10000005270000 */
                                                                    /* 0x001ffc20fe2007f1 */
        /*0068*/                   SHL R0, R0, 0x2 ;                /* 0x3848000000270000 */
        /*0070*/                   STS [R5], R4 ;                   /* 0xef5c000000070504 */
        /*0078*/                   BAR.SYNC 0x0 ;                   /* 0xf0a81b8000070000 */
                                                                    /* 0x003fc400e3e007f5 */
        /*0088*/                   MEMBAR.CTA ;                     /* 0xef98000000070000 */
        /*0090*/                   LDS.U.32 R0, [R0] ;              /* 0xef4c100000070000 */
        /*0098*/                   STG.E [R2], R0 ;                 /* 0xeedc200000070200 */
                                                                    /* 0x001ffc00fda007ef */
        /*00a8*/                   NOP ;                            /* 0x50b0000000070f00 */
        /*00b0*/                   NOP ;                            /* 0x50b0000000070f00 */
        /*00b8*/                   EXIT ;                           /* 0xe30000000007000f */
                                                                    /* 0x001f8000fc0007ff */
        /*00c8*/                   BRA 0xc0 ;                       /* 0xe2400fffff07000f */
        /*00d0*/                   NOP;                             /* 0x50b0000000070f00 */
        /*00d8*/                   NOP;                             /* 0x50b0000000070f00 */
                                                                    /* 0x001f8000fc0007e0 */
        /*00e8*/                   NOP;                             /* 0x50b0000000070f00 */
        /*00f0*/                   NOP;                             /* 0x50b0000000070f00 */
        /*00f8*/                   NOP;                             /* 0x50b0000000070f00 */
                ..........


                Function : _Z13staticReversePii
        .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                    /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;           /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;               /* 0xf0c8000002170000 */
        /*0018*/                   SHL R5, R0.reuse, 0x2 ;          /* 0x3848000000270005 */
                                                                    /* 0x001fc800fec007f5 */
        /*0028*/                   SHR R3, R0, 0x1e ;               /* 0x3829000001e70003 */
        /*0030*/                   IADD R2.CC, R5, c[0x0][0x140] ;  /* 0x4c10800005070502 */
        /*0038*/                   IADD.X R3, R3, c[0x0][0x144] ;   /* 0x4c10080005170303 */
                                                                    /* 0x001fd800fcc007b1 */
        /*0048*/                   LDG.E R4, [R2] ;                 /* 0xeed4200000070204 */
        /*0050*/                   LOP.PASS_B R0, RZ, ~R0 ;         /* 0x5c4707000007ff00 */
        /*0058*/                   IADD R0, R0, c[0x0][0x148] ;     /* 0x4c10000005270000 */
                                                                    /* 0x001ffc20fe2007f1 */
        /*0068*/                   SHL R0, R0, 0x2 ;                /* 0x3848000000270000 */
        /*0070*/                   STS [R5], R4 ;                   /* 0xef5c000000070504 */
        /*0078*/                   BAR.SYNC 0x0 ;                   /* 0xf0a81b8000070000 */
                                                                    /* 0x003fc400e3e007f5 */
        /*0088*/                   MEMBAR.CTA ;                     /* 0xef98000000070000 */
        /*0090*/                   LDS.U.32 R0, [R0] ;              /* 0xef4c100000070000 */
        /*0098*/                   STG.E [R2], R0 ;                 /* 0xeedc200000070200 */
                                                                    /* 0x001ffc00fda007ef */
        /*00a8*/                   NOP ;                            /* 0x50b0000000070f00 */
        /*00b0*/                   NOP ;                            /* 0x50b0000000070f00 */
        /*00b8*/                   EXIT ;                           /* 0xe30000000007000f */
                                                                    /* 0x001f8000fc0007ff */
        /*00c8*/                   BRA 0xc0 ;                       /* 0xe2400fffff07000f */
        /*00d0*/                   NOP;                             /* 0x50b0000000070f00 */
        /*00d8*/                   NOP;                             /* 0x50b0000000070f00 */
                                                                    /* 0x001f8000fc0007e0 */
        /*00e8*/                   NOP;                             /* 0x50b0000000070f00 */
        /*00f0*/                   NOP;                             /* 0x50b0000000070f00 */
        /*00f8*/                   NOP;                             /* 0x50b0000000070f00 */
                ..........



Fatbin ptx code:
================
arch = sm_52
code version = [7,2]
producer = <unknown>
host = linux
compile_size = 64bit
compressed

> cuobjdump shared-memory-arch_compute_80-code_sm_80 -ptx

Fatbin elf code:
================
arch = sm_80
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

Fatbin elf code:
================
arch = sm_80
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

> for i in `ls shared-memory-*`; do echo -------$i------- && cuobjdump $i -all -lptx -lelf; done

-------shared-memory-arch_compute_70-------
PTX file    1: shared-memory-arch_compute_70.1.sm_70.ptx
ELF file    1: shared-memory-arch_compute_70.1.sm_35.cubin
ELF file    2: shared-memory-arch_compute_70.2.sm_37.cubin
ELF file    3: shared-memory-arch_compute_70.3.sm_50.cubin
ELF file    4: shared-memory-arch_compute_70.4.sm_52.cubin
ELF file    5: shared-memory-arch_compute_70.5.sm_60.cubin
ELF file    6: shared-memory-arch_compute_70.6.sm_61.cubin
ELF file    7: shared-memory-arch_compute_70.7.sm_70.cubin
ELF file    8: shared-memory-arch_compute_70.8.sm_75.cubin
ELF file    9: shared-memory-arch_compute_70.9.sm_80.cubin
PTX file    2: shared-memory-arch_compute_70.2.sm_86.ptx
ELF file   10: shared-memory-arch_compute_70.10.sm_86.cubin
-------shared-memory-arch_compute_80-------
PTX file    1: shared-memory-arch_compute_80.1.sm_80.ptx
ELF file    1: shared-memory-arch_compute_80.1.sm_35.cubin
ELF file    2: shared-memory-arch_compute_80.2.sm_37.cubin
ELF file    3: shared-memory-arch_compute_80.3.sm_50.cubin
ELF file    4: shared-memory-arch_compute_80.4.sm_52.cubin
ELF file    5: shared-memory-arch_compute_80.5.sm_60.cubin
ELF file    6: shared-memory-arch_compute_80.6.sm_61.cubin
ELF file    7: shared-memory-arch_compute_80.7.sm_70.cubin
ELF file    8: shared-memory-arch_compute_80.8.sm_75.cubin
ELF file    9: shared-memory-arch_compute_80.9.sm_80.cubin
PTX file    2: shared-memory-arch_compute_80.2.sm_86.ptx
ELF file   10: shared-memory-arch_compute_80.10.sm_86.cubin
-------shared-memory-arch_compute_80-code_compute_80-------
PTX file    1: shared-memory-arch_compute_80-code_compute_80.1.sm_80.ptx
ELF file    1: shared-memory-arch_compute_80-code_compute_80.1.sm_35.cubin
ELF file    2: shared-memory-arch_compute_80-code_compute_80.2.sm_37.cubin
ELF file    3: shared-memory-arch_compute_80-code_compute_80.3.sm_50.cubin
ELF file    4: shared-memory-arch_compute_80-code_compute_80.4.sm_52.cubin
ELF file    5: shared-memory-arch_compute_80-code_compute_80.5.sm_60.cubin
ELF file    6: shared-memory-arch_compute_80-code_compute_80.6.sm_61.cubin
ELF file    7: shared-memory-arch_compute_80-code_compute_80.7.sm_70.cubin
ELF file    8: shared-memory-arch_compute_80-code_compute_80.8.sm_75.cubin
ELF file    9: shared-memory-arch_compute_80-code_compute_80.9.sm_80.cubin
PTX file    2: shared-memory-arch_compute_80-code_compute_80.2.sm_86.ptx
ELF file   10: shared-memory-arch_compute_80-code_compute_80.10.sm_86.cubin
-------shared-memory-arch_compute_80-code_compute_80_sm_80-------
ELF file    1: shared-memory-arch_compute_80-code_compute_80_sm_80.1.sm_80.cubin
ELF file    2: shared-memory-arch_compute_80-code_compute_80_sm_80.2.sm_80.cubin
PTX file    1: shared-memory-arch_compute_80-code_compute_80_sm_80.1.sm_80.ptx
-------shared-memory-arch_compute_80-code_compute_80_sm_80_sm_86-------
ELF file    1: shared-memory-arch_compute_80-code_compute_80_sm_80_sm_86.1.sm_86.cubin
ELF file    2: shared-memory-arch_compute_80-code_compute_80_sm_80_sm_86.2.sm_80.cubin
ELF file    3: shared-memory-arch_compute_80-code_compute_80_sm_80_sm_86.3.sm_86.cubin
ELF file    4: shared-memory-arch_compute_80-code_compute_80_sm_80_sm_86.4.sm_80.cubin
PTX file    1: shared-memory-arch_compute_80-code_compute_80_sm_80_sm_86.1.sm_80.ptx
-------shared-memory-arch_compute_80-code_compute_80_sm_86-------
ELF file    1: shared-memory-arch_compute_80-code_compute_80_sm_86.1.sm_86.cubin
ELF file    2: shared-memory-arch_compute_80-code_compute_80_sm_86.2.sm_86.cubin
PTX file    1: shared-memory-arch_compute_80-code_compute_80_sm_86.1.sm_80.ptx
-------shared-memory-arch_compute_80-code_sm_80-------
ELF file    1: shared-memory-arch_compute_80-code_sm_80.1.sm_80.cubin
ELF file    2: shared-memory-arch_compute_80-code_sm_80.2.sm_80.cubin
cuobjdump info    : No PTX file found to extract from 'shared-memory-arch_compute_80-code_sm_80'. You may try with -all option.
-------shared-memory-arch_compute_80-code_sm_80_sm_86-------
ELF file    1: shared-memory-arch_compute_80-code_sm_80_sm_86.1.sm_86.cubin
ELF file    2: shared-memory-arch_compute_80-code_sm_80_sm_86.2.sm_80.cubin
ELF file    3: shared-memory-arch_compute_80-code_sm_80_sm_86.3.sm_86.cubin
ELF file    4: shared-memory-arch_compute_80-code_sm_80_sm_86.4.sm_80.cubin
cuobjdump info    : No PTX file found to extract from 'shared-memory-arch_compute_80-code_sm_80_sm_86'. You may try with -all option.
-------shared-memory-arch_compute_80-code_sm_86-------
ELF file    1: shared-memory-arch_compute_80-code_sm_86.1.sm_86.cubin
ELF file    2: shared-memory-arch_compute_80-code_sm_86.2.sm_86.cubin
cuobjdump info    : No PTX file found to extract from 'shared-memory-arch_compute_80-code_sm_86'. You may try with -all option.
-------shared-memory-arch_sm_80-------
ELF file    1: shared-memory-arch_sm_80.1.sm_80.cubin
ELF file    2: shared-memory-arch_sm_80.2.sm_80.cubin
PTX file    1: shared-memory-arch_sm_80.1.sm_80.ptx
-------shared-memory-base-------
ELF file    1: shared-memory-base.1.sm_52.cubin
ELF file    2: shared-memory-base.2.sm_52.cubin
PTX file    1: shared-memory-base.1.sm_52.ptx
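
The listing is easier to compare once it is reduced to "which targets does each binary carry as PTX and as ELF". The Python sketch below does that for the output format shown above; summarize_listing is a hypothetical helper, and the sample contains only two of the binaries from the listing.

```python
import re


def summarize_listing(text):
    """Collect the target of every PTX/ELF entry, grouped by binary name."""
    summary = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Separator lines printed by the shell loop: 7 dashes, name, 7 dashes.
        header = re.fullmatch(r"-{7}(.+)-{7}", line)
        if header:
            current = header.group(1)
            summary[current] = {"PTX": [], "ELF": []}
            continue
        # Entry lines such as "ELF file    1: <name>.1.sm_80.cubin".
        entry = re.match(r"(PTX|ELF) file\s+\d+: .*\.(sm_\d+)\.(?:ptx|cubin)$", line)
        if entry and current is not None:
            summary[current][entry.group(1)].append(entry.group(2))
    return summary


sample = """\
-------shared-memory-arch_compute_80-code_sm_80-------
ELF file    1: shared-memory-arch_compute_80-code_sm_80.1.sm_80.cubin
ELF file    2: shared-memory-arch_compute_80-code_sm_80.2.sm_80.cubin
cuobjdump info    : No PTX file found to extract from 'shared-memory-arch_compute_80-code_sm_80'. You may try with -all option.
-------shared-memory-arch_sm_80-------
ELF file    1: shared-memory-arch_sm_80.1.sm_80.cubin
ELF file    2: shared-memory-arch_sm_80.2.sm_80.cubin
PTX file    1: shared-memory-arch_sm_80.1.sm_80.ptx
"""

summary = summarize_listing(sample)
for name, images in summary.items():
    print(name, images)
```

Note that cuobjdump labels PTX entries with an sm_ name as well, so the PTX list reports sm_80 even though the PTX was generated for compute_80.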

A Closer Look at the Fatbin

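Let's examine the fatbin further
- for i in `ls shared-memory-*`; do echo -------$i------- && cuobjdump $i -all -lptx -lelf; done
- This result clears up most of the confusion: the larger ELF executables contain more PTX files and ELF files in their fatbin. For different compile options, NVCC generates a different fatbin to satisfy the requested virtual architecture (--gpu-architecture/-arch) and physical architecture (--gpu-code/-code)
Back to what the options themselves mean
- --gpu-architecture/-arch specifies the virtual architecture
- --gpu-code/-code specifies the physical architecture, which must be compatible with the virtual architecture (-code cannot be used without -arch). Several values may be given. A virtual architecture is also allowed as a value and must likewise be compatible with -arch. For a virtual architecture passed to -code, the official documentation says the following (in effect, it requests that PTX be embedded):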
- In this case the stage 2 translation will be omitted for such virtual architecture, and the stage 1 PTX result will be embedded instead. At application launch, and in case the driver does not find a better alternative, the stage 2 compilation will be invoked by the driver with the PTX as input.
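Shorthands
- -arch may name a physical architecture when -code is not used. This is only a shorthand: the closest virtual architecture is chosen, so -arch=sm_80 is equivalent to -arch=compute_80 -code=compute_80,sm_80
- -code may be omitted: -arch=compute_80 is equivalent to -arch=compute_80 -code=compute_80
- If neither option is given, the default is -arch=sm_52, i.e. -arch=compute_52 -code=compute_52,sm_52
- These rules explain why some of the results are identical
-arch=compute_xy, -code=compute_xy and --ptx need to be distinguished from one another
See the official documentation for details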
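
The shorthand rules above can be written down as a small function. This is only a sketch of the three rules as listed in this post, not NVCC's actual option parser: expand_arch_code and embedded_images are hypothetical names, the sm_52 default is the one observed with CUDA 11.2 here, and compatibility checks between -arch and -code are left out.

```python
def expand_arch_code(arch=None, code=None):
    """Return (virtual_arch, code_targets) after applying the shorthand rules."""
    if arch is None:
        if code is not None:
            raise ValueError("-code cannot be used without -arch")
        arch = "sm_52"  # default observed with CUDA 11.2 in this post
    if arch.startswith("sm_"):
        if code is not None:
            raise ValueError("-arch must be virtual when -code is given")
        # -arch=sm_XY is shorthand for -arch=compute_XY -code=compute_XY,sm_XY
        virtual = "compute_" + arch[len("sm_"):]
        return virtual, [virtual, arch]
    if code is None:
        # -arch=compute_XY alone implies -code=compute_XY
        return arch, [arch]
    return arch, code.split(",")


def embedded_images(code_targets):
    """compute_XY targets are embedded as PTX, sm_XY targets as cubin."""
    return [("ptx" if t.startswith("compute_") else "cubin", t) for t in code_targets]


for args in [(), ("sm_80",), ("compute_80",), ("compute_80", "sm_80,sm_86")]:
    virtual, targets = expand_arch_code(*args)
    print(args, "->", virtual, embedded_images(targets))
```

Under these rules, the base build and an explicit -arch=sm_52 expand to the same targets, and -arch=sm_80 expands to the same targets as -arch=compute_80 -code=compute_80,sm_80, which matches the identical pairs in the listing.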

New Questions

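We now understand the fatbin better: it can contain several PTX files and ELF files. New questions follow naturally. Under what conditions is PTX generated, and under what conditions a cubin? How are the versions they target determined? How is the fatbin loaded and run? At run time, when is a cubin used and when is PTX JIT-compiled?
For the first two questions
- One detail matters when disassembling: -all makes cuobjdump also process the relocatable fatbin. Without it, the fatbin of shared-memory-arch_compute_70/shared-memory-arch_compute_80/shared-memory-arch_compute_80-code_compute_80 shows only a single PTX file, while the other files are unaffected. Some guesses:
  - Without the option, only the section .nv_fatbin is examined; with it, the section __nv_relfatbin is included as well
  - When no physical architecture is specified, NVCC places the cubins that could satisfy the request in the section __nv_relfatbin, and the section .nv_fatbin contains only PTX
  - In all other cases the appropriate cubins and PTX are placed in the section .nv_fatbin
- There is probably no need to dwell on these details, which are largely NVCC implementation choices. When possible, spell out both the virtual and the physical architecture, or simply use -arch=sm_xy, and investigate only when a problem appears
For the last two questions
- Reading between the lines of the official documentation suggests a natural guess: at run time a suitable cubin is used if one exists, and otherwise the PTX is JIT-compiled
As for whether these guesses are correct, and what the relocatable fatbin really is, that is left for later study, especially of linking and loading ~~though of course that part may not be open source~~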
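
The guess about run-time selection can be modeled in a few lines. The sketch below encodes the compatibility rules described in the CUDA documentation: a cubin runs on devices of the same major compute capability with an equal or higher minor version, and PTX can be JIT-compiled for any device whose compute capability is at least the PTX target. It is a model of those rules, not the driver's actual algorithm; select_image and parse_cc are hypothetical names, and a single-digit minor version is assumed.

```python
def parse_cc(name):
    """'sm_86' or 'compute_80' -> (8, 6) or (8, 0); assumes a one-digit minor."""
    digits = name.split("_")[1]
    return int(digits[:-1]), int(digits[-1])


def select_image(device, images):
    """Pick an image for `device` (e.g. 'sm_80') from names like 'sm_80'/'compute_80'."""
    dev = parse_cc(device)
    cubins = [i for i in images if i.startswith("sm_")]
    ptx = [i for i in images if i.startswith("compute_")]
    # A cubin is usable if it has the same major version and a minor version
    # no higher than the device's.
    usable = [c for c in cubins
              if parse_cc(c)[0] == dev[0] and parse_cc(c)[1] <= dev[1]]
    if usable:
        return ("cubin", max(usable, key=parse_cc))
    # Otherwise fall back to JIT-compiling PTX whose target is not newer
    # than the device.
    jit = [p for p in ptx if parse_cc(p) <= dev]
    if jit:
        return ("ptx-jit", max(jit, key=parse_cc))
    return None


# The A100 used in this post is an sm_80 device.
print(select_image("sm_80", ["compute_80", "sm_80"]))  # cubin available
print(select_image("sm_80", ["compute_52", "sm_52"]))  # the default build
print(select_image("sm_80", ["sm_86"]))                # -code=sm_86 only
```

Under this model the default build can only run on the A100 by JIT-compiling its compute_52 PTX, and a binary carrying only an sm_86 cubin has no usable image on an sm_80 device.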

Full Pipeline Analysis

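The analysis above clarified what GPU compilation produces. Now, with the official figure in hand, we go back and walk through the whole compilation flow using the commands below. Three representative examples from above are compared and explained step by step, with some simplification (environment variables, path information and rm commands are removed; mainly the output of Example 1 is shown).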

> nvcc shared-memory.cu -o shared-memory-arch_compute_80 -dryrun --gpu-architecture=compute_80 # Example 1
> nvcc shared-memory.cu -o shared-memory-arch_compute_80-code_sm_80 -dryrun --gpu-architecture=compute_80 --gpu-code=sm_80 # Example 2
> nvcc shared-memory.cu -o shared-memory-arch_compute_80-code_compute_80_sm_80 -dryrun --gpu-architecture=compute_80 --gpu-code=compute_80,sm_80 # Example 3

Step 1: gcc preprocesses shared-memory.cu and produces a .cpp1.ii file

gcc -D__CUDA_ARCH__=800 -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__  "-I/.../cuda-11.2.2-tzm4z2qcmxczmsu7bpowrsca4tv5g3ch/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=2 -D__CUDACC_VER_BUILD__=152 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=2 -include "cuda_runtime.h" -m64 "shared-memory.cu" -o "/tmp/tmpxft_00206b3f_00000000-9_shared-memory.cpp1.ii"

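Step 2: cicc compiles the .cpp1.ii file into a .ptx file
- A .cudafe1.stub.c file is also generated at this point. It contains #include "tmpxft_00206b3f_00000000-3_shared-memory.fatbin.c", matching the --include_file_name argument, so the inclusion of the .fatbin.c file is already fixed at this stage
- CICC is an LLVM-based compiler. Its middle end and back end together form the libNVVM library, and the back end has been open-sourced as part of LLVM (the NVPTX back end)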

cicc --c++14 --gnu_version=100201 --orig_src_file_name "shared-memory.cu" --allow_managed   -arch compute_80 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_00206b3f_00000000-3_shared-memory.fatbin.c" -tused --gen_module_id_file --module_id_file_name "/tmp/tmpxft_00206b3f_00000000-4_shared-memory.module_id" --gen_c_file_name "/tmp/tmpxft_00206b3f_00000000-6_shared-memory.cudafe1.c" --stub_file_name "/tmp/tmpxft_00206b3f_00000000-6_shared-memory.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00206b3f_00000000-6_shared-memory.cudafe1.gpu"  "/tmp/tmpxft_00206b3f_00000000-9_shared-memory.cpp1.ii" -o "/tmp/tmpxft_00206b3f_00000000-6_shared-memory.ptx"

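Step 3: ptxas compiles the .ptx file into a .cubin file. Example 1 has no such step; Examples 2 and 3 do
- This completes the solid green box in the figure; the process is repeated for each virtual architecture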

ptxas -arch=sm_80 -m64  "/tmp/tmpxft_00206bdd_00000000-6_shared-memory.ptx"  -o "/tmp/tmpxft_00206bdd_00000000-10_shared-memory.cubin"

Step 4: fatbinary bundles the .ptx/.cubin files into a .fatbin.c file, which completes compilation of the device side

fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=ptx,sm=80,file=/tmp/tmpxft_00206b3f_00000000-6_shared-memory.ptx" --embedded-fatbin="/tmp/tmpxft_00206b3f_00000000-3_shared-memory.fatbin.c"

Step 5: gcc preprocesses shared-memory.cu again and produces a .cpp4.ii file

gcc -E -x c++ -D__CUDACC__ -D__NVCC__  "-I/.../cuda-11.2.2-tzm4z2qcmxczmsu7bpowrsca4tv5g3ch/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=2 -D__CUDACC_VER_BUILD__=152 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=2 -include "cuda_runtime.h" -m64 "shared-memory.cu" -o "/tmp/tmpxft_00206b3f_00000000-5_shared-memory.cpp4.ii"

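Step 6: cudafe++ separates out the host code in the .cpp4.ii file and produces a .cudafe1.cpp file
1. It includes the .cudafe1.stub.c file
2. The .cudafe1.stub.c file in turn includes the .fatbin.c file, which is how the device code gets embedded into the .cudafe1.cpp file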

cudafe++ --c++14 --gnu_version=100201 --orig_src_file_name "shared-memory.cu" --allow_managed  --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_00206b3f_00000000-6_shared-memory.cudafe1.cpp" --stub_file_name "tmpxft_00206b3f_00000000-6_shared-memory.cudafe1.stub.c" --module_id_file_name "/tmp/tmpxft_00206b3f_00000000-4_shared-memory.module_id" "/tmp/tmpxft_00206b3f_00000000-5_shared-memory.cpp4.ii"

Step 7: gcc compiles the .cudafe1.cpp file into a .o file
- This completes the dashed green box in the figure and ends all compilation; the process is repeated for each .cu file

gcc -D__CUDA_ARCH__=800 -c -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS "-I/.../cuda-11.2.2-tzm4z2qcmxczmsu7bpowrsca4tv5g3ch/bin/../targets/x86_64-linux/include"   -m64 "/tmp/tmpxft_00206b3f_00000000-6_shared-memory.cudafe1.cpp" -o "/tmp/tmpxft_00206b3f_00000000-10_shared-memory.o"

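Steps 8 and 9: I did not fully understand these two steps and saw no intermediate output, so I follow the figure for now. They perform device-side linking: nvlink (not to be confused with NVLink) links multiple .o files into one cubin, and fatbinary then turns it into a dlink.fatbin.c file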

nvlink --arch=compute_80 --register-link-binaries="/tmp/tmpxft_00206b3f_00000000-7_shared-memory-arch_compute_80_dlink.reg.c"  -m64   "-L/.../cuda-11.2.2-tzm4z2qcmxczmsu7bpowrsca4tv5g3ch/bin/../targets/x86_64-linux/lib/stubs" "-L/.../cuda-11.2.2-tzm4z2qcmxczmsu7bpowrsca4tv5g3ch/bin/../targets/x86_64-linux/lib" -cpu-arch=X86_64 "/tmp/tmpxft_00206b3f_00000000-10_shared-memory.o"  -lcudadevrt
fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " -link --embedded-fatbin="/tmp/tmpxft_00206b3f_00000000-8_shared-memory-arch_compute_80_dlink.fatbin.c"

Step 10: still unclear to me, in particular the roles of the dlink.reg.c file and link.stub. In any case, it produces the _dlink.o file

gcc -c -x c++ -DFATBINFILE="\"/tmp/tmpxft_00206b3f_00000000-8_shared-memory-arch_compute_80_dlink.fatbin.c\"" -DREGISTERLINKBINARYFILE="\"/tmp/tmpxft_00206b3f_00000000-7_shared-memory-arch_compute_80_dlink.reg.c\"" -I. -D__NV_EXTRA_INITIALIZATION= -D__NV_EXTRA_FINALIZATION= -D__CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__  "-I/.../cuda-11.2.2-tzm4z2qcmxczmsu7bpowrsca4tv5g3ch/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=2 -D__CUDACC_VER_BUILD__=152 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=2 -m64 "/.../cuda-11.2.2-tzm4z2qcmxczmsu7bpowrsca4tv5g3ch/bin/crt/link.stub" -o "/tmp/tmpxft_00206b3f_00000000-12_shared-memory-arch_compute_80_dlink.o"

Step 11: the final host link, which produces the executable

g++ -m64 -Wl,--start-group "/tmp/tmpxft_00206b3f_00000000-12_shared-memory-arch_compute_80_dlink.o" "/tmp/tmpxft_00206b3f_00000000-10_shared-memory.o"   "-L/.../cuda-11.2.2-tzm4z2qcmxczmsu7bpowrsca4tv5g3ch/bin/../targets/x86_64-linux/lib/stubs" "-L/.../cuda-11.2.2-tzm4z2qcmxczmsu7bpowrsca4tv5g3ch/bin/../targets/x86_64-linux/lib"  -lcudadevrt  -lcudart_static  -lrt -lpthread  -ldl  -Wl,--end-group -o "shared-memory-arch_compute_80"

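Since a .o file is also an ELF file, it can be disassembled the same way. The result shows that in all three examples the intermediate .o file contains only a single PTX/ELF file, which differs greatly from the final executable, so the link stage evidently does a good deal of work
A more detailed analysis of linking probably requires some theory; a natural next step is to study compilation and linking with multiple .cu files and multiple architectures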
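
When comparing several -dryrun runs, it helps to reduce each one to the sequence of tools it invokes. The Python sketch below does this under two assumptions that are not shown in this post: that nvcc prefixes every dry-run line with `#$ `, and that environment assignments appear as `NAME=value` lines. tool_sequence is a hypothetical helper, and the sample commands are heavily abbreviated stand-ins for the real ones above.

```python
import os
import shlex


def tool_sequence(dryrun_text):
    """Return the executable name of each command in a -dryrun transcript."""
    tools = []
    for raw in dryrun_text.splitlines():
        line = raw.strip()
        if line.startswith("#$ "):  # assumed nvcc dry-run prefix
            line = line[3:]
        if not line:
            continue
        first = shlex.split(line)[0]
        if "=" in first:  # environment assignment, not a command
            continue
        tools.append(os.path.basename(first))
    return tools


# Abbreviated, illustrative commands; real lines carry many more options.
sample = """\
#$ _SPACE_=
#$ gcc -E -x c++ "shared-memory.cu" -o "shared-memory.cpp1.ii"
#$ cicc -arch compute_80 "shared-memory.cpp1.ii" -o "shared-memory.ptx"
#$ ptxas -arch=sm_80 "shared-memory.ptx" -o "shared-memory.cubin"
#$ fatbinary -64 "--image3=kind=ptx,sm=80,file=shared-memory.ptx"
"""

tools = tool_sequence(sample)
print(" -> ".join(tools))
```

Running this on the three examples would make the difference described in Step 3 visible directly: the sequence for Example 1 should lack the ptxas entry.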