
LLVM PGO

build clang & compiler-rt on e1v

Bash
cmake $HOME/llvm/llvm \
  -DLLVM_TARGETS_TO_BUILD=PowerPC \
  -DCMAKE_INSTALL_PREFIX=$HOME/llvm/install \
  -DLLVM_ENABLE_ASSERTIONS=On \
  -DCMAKE_BUILD_TYPE=RELEASE \
  -DLLVM_ENABLE_PROJECTS="clang;compiler-rt" \
  -DCMAKE_C_COMPILER=/opt/at12.0/bin/gcc \
  -DCMAKE_CXX_COMPILER=/opt/at12.0/bin/g++ \
  -DLLVM_BINUTILS_INCDIR=/gsa/tlbgsa/projects/x/xlcdl/shkzhang/p9_software/binutils/binutils/include
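
After configuring, the build and install step would typically look like this (a sketch, assuming the default Makefile generator):

Bash
make -j$(nproc)
make install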

How to use LLVM PGO:

Bash
https://source.android.google.cn/devices/tech/perf/pgo

llvm-profdata tool:

Bash
http://llvm.org/docs/CommandGuide/llvm-profdata.html

clang pgo official document:

Bash
https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization

How SPEC uses PGO

The link flags here do not carry -fprofile*, but at build time every file must be compiled with -fprofile*.

Bash
default=peak:
PASS1_OPTIMIZE = -fprofile-generate
PASS1_LDFLAGS  = -pie %{LINKER} -Wl,-q  -Wl,-rpath=%{BASE_DIR}/lib
fdo_post1      =  %{BASE_DIR}/bin/llvm-profdata merge default*.profraw -output default.profdata
PASS2_OPTIMIZE = -fprofile-use %{LTO_OPTION}
PASS2_LDFLAGS  = -pie %{LINKER} -Wl,-q  -Wl,-rpath=%{BASE_DIR}/lib

1. Clang PGO

Bash
https://www.jianshu.com/p/bd2fe89e2025
https://source.android.com/devices/tech/perf/pgo

1.1 Two kinds of profiles

Clang can use two kinds of profiles to perform profile-guided optimization:

Bash
- Instrumentation-based profiles (two flavors, AST-based and LLVM-IR-based, both provided by clang/llvm)
  Generated from an instrumented copy of the target program. These profiles are very detailed but incur a high runtime overhead.
- Sample-based profiles (the profile is produced by an external tool such as perf)
  Usually generated by sampling hardware counters. They incur a low runtime overhead and can be collected without instrumenting or modifying the binary at all, but they are less detailed than instrumentation-based profiles.

The two instrumentation-based profile flavors

All profiles should be generated from a representative workload that exercises the application's typical behavior. Clang supports both AST-based (clang front-end) profiles (-fprofile-instr-generate, -fprofile-instr-use) and LLVM-IR-based profiles (-fprofile-generate, -fprofile-use).
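
A minimal sketch of the two flag pairs (foo.c and foo.profdata are placeholder names; foo.profdata is assumed to be the merged profile from the matching training run):

Bash
# front-end (AST) instrumentation, then use
clang -O2 -fprofile-instr-generate foo.c -o foo
clang -O2 -fprofile-instr-use=foo.profdata foo.c -o foo
# IR-level instrumentation, then use
clang -O2 -fprofile-generate foo.c -o foo
clang -O2 -fprofile-use=foo.profdata foo.c -o foo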

Sample-based profiles:

You must generate the profile with an external profiling tool such as Linux perf, convert it into a format LLVM can read, and then consume it with -fprofile-sample-use=pathname.

Differences

Bash
https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
Differences Between Sampling and Instrumentation
Although both techniques are used for similar purposes, there are important differences between the two:

  • Profile data generated with one cannot be used by the other, and there is no conversion tool that can convert one to the other. So, a profile generated via -fprofile-instr-generate must be used with -fprofile-instr-use. Similarly, sampling profiles generated by external profilers must be converted and used with -fprofile-sample-use.
  • Instrumentation profiles can be used for code coverage analysis and for optimization.
  • Sampling profiles can only be used for optimization. They cannot be used for code coverage analysis. Although it would be technically possible to use sampling profiles for code coverage, sample-based profiles are too coarse-grained for code coverage purposes; it would yield poor results. In short, sampling profiles are only used for optimization.
  • Sampling profiles must be produced by an external tool and then converted into LLVM's format.

2. Instrumentation-based profiles: multiple files & a single file

2.1 Get the profile data file.

This needs the compiler-rt profile runtime (libclang_rt.profile), and the link step must also use -fprofile-instr-generate / -fprofile-generate.

Bash
TARGET = main
OBJ=cal.o ioput.o main.o
CFLAG= -O2
PDF= -fprofile-generate
#with the line below, the .profraw files will be written into the directory default-dir.dataraw
#PDF= -fprofile-generate=default-dir.dataraw
#CFLAG= -fprofile-generate
LFLAG=
CC=clang
main:$(OBJ)
    $(CC) $(OBJ) -o $(TARGET) $(LFLAG) $(PDF)
%.o:%.c
    $(CC) -c $(CFLAG) $(PDF) $<
clean:
    rm $(TARGET) *.o

Run the program

Bash
./main < input.txt
LLVM_PROFILE_FILE="code.profraw" ./main < input.txt
If you do not set LLVM_PROFILE_FILE you will get default_711218735311555632_0.profraw; with it you will get code.profraw.
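
LLVM_PROFILE_FILE also accepts pattern specifiers, which helps when several processes write profiles at the same time (a sketch; per the clang docs, %p expands to the process id and %m to a per-binary merge pool):

Bash
LLVM_PROFILE_FILE="code-%p.profraw" ./main < input.txt   # one .profraw per process
LLVM_PROFILE_FILE="code-%m.profraw" ./main < input.txt   # concurrent runs merge into a pool pool file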

2.2 Merge & Show data file.

llvm-profdata merge [options] [filename…]

Bash
llvm-profdata merge foo.profraw bar.profdata baz.profraw -output=merge.profdata
llvm-profdata merge *.profraw -output=merged.profdata
Even if you have only one .profraw file, you must still run the merge command; otherwise you will get the error below:
Bash
error: Could not read profile default_2461475265565158006_0.profraw: Invalid
      instrumentation profile data (bad magic)
1 error generated.

llvm-profdata show [options] [filename]. It is best to run show after merging.
Bash
llvm-profdata show -help
llvm-profdata show --help-list-hidden
llvm-profdata show -all-functions default*.profraw


This can report the basic-block (BB) execution counts of hot functions:
 --text                                            - Show instr profile data in text dump 
 llvm-profdata show default.profdata --text -all-functions -o text.log

 Pick out the functions with the hottest basic blocks:
 --topn=<uint>                                     - Show the list of functions with the largest internal counts
 llvm-profdata show default.profdata  -all-functions --topn=10 -o topn.log

2.3 Use the profile file:

The file passed to -fprofile-use is the one produced by the merge step above.

Bash
TARGET = main
OBJ=cal.o ioput.o main.o
CFLAG= -O2
PDF= -fprofile-use=merge.profraw
#CFLAG= -fprofile-generate
LFLAG=
CC=clang
main:$(OBJ)
    $(CC) $(OBJ) -o $(TARGET) $(LFLAG) $(PDF)
%.o:%.c
    $(CC) -c $(CFLAG) $(PDF) $<
#   $(CC) -c $(CFLAG) $<
clean:
    rm $(TARGET) *.o

2.4 Single-file example

  • build
    Bash
    clang -O3 bintree.c -fprofile-generate -o bintree
    
  • run
    Bash
    ./bintree < input.txt
    
  • merge (convert to the format which llvm expects)
    Bash
    llvm-profdata merge -output=merge.profraw default_2461475265565158006_0.profraw
    
  • use
    Bash
    clang -O3 bintree.c -fprofile-use=merge.profraw -mllvm -debug
    

Bash
Clang supports both:
- AST-based instrumentation (-fprofile-instr-generate)
- LLVM-IR-based instrumentation (-fprofile-generate).



# Options
./clang/lib/CodeGen/BackendUtil.cpp
-fdebug-info-for-profiling
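
A usage sketch for that option (it is normally combined with sample-based profiling builds; the source file name is assumed):

Bash
clang++ -O2 -gline-tables-only -fdebug-info-for-profiling code.cc -o code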

3 Instrumentation-based profile example

Bash
Note: this blog has some errors in it.
https://cmdlinelinux.blogspot.com/2018/04/profiling-c-code-with-clang-using.html

3.1 test code

test.c

C
#include <stdio.h>
#include <stdlib.h>
#define CTR 10

int
main()
{
    int i, j, k;
    for(i=0; i < CTR; ++i) {
        printf("3: %d", i);
    }
    for(i=0; i < CTR*10; ++i) {
        printf("3: %d", i);
    }
    for(i=0; i < CTR*100; ++i) {
        printf("3: %d", i);
    }
    //  exit(0);
    return 0;
}

3.2 build

Notice that if you want to use -fcoverage-mapping, you must use -fprofile-instr-generate; you cannot use -fprofile-generate.

Bash
clang -O3 -fprofile-instr-generate  test.c -fcoverage-mapping -o test

3.3 Run

Bash
./test
You will get default.profraw

3.4 merge

Bash
llvm-profdata merge -output=merge.profraw default.profraw

3.5 View the result

  • Use llvm-profdata show:
    Bash
    llvm-profdata show -all-functions -counts -ic-targets merge.profraw
    

Output will look like:

Bash
Counters:
  main:
    Hash: 0x0000000000004104
    Counters: 4
    Function count: 1
    Indirect Call Site Count: 0
    Block counts: [10, 100, 1000]
    Indirect Target Results:
Functions shown: 1
Total functions: 1
Maximum function count: 1
Maximum internal block count: 1000

  • Use llvm-cov show:
    Bash
    llvm-cov show ./test -instr-profile=merge.profraw

    The output will look like the following, so you can easily find the hot spots.
    Bash
    Count|line#| source code
           |    1|#include <stdio.h>
           |    2|#include <stdlib.h>
      1.11k|    3|#define CTR 10
           |    4|
           |    5|int
           |    6|main()
          1|    7|{
          1|    8|    int i, j, k;
         11|    9|    for(i=0; i < CTR; ++i) {
         10|   10|        printf("3: %d", i);
         10|   11|    }
        101|   12|    for(i=0; i < CTR*10; ++i) {
        100|   13|        printf("3: %d", i);
        100|   14|    }
      1.00k|   15|    for(i=0; i < CTR*100; ++i) {
      1.00k|   16|        printf("3: %d", i);
      1.00k|   17|    }
           1|   18|    //  exit(0);
           1|   19|    return 0;
           1|   20|}

Notice that the positions of the count and line-number columns may vary.
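
Besides llvm-cov show, a compact per-function summary is also available (a sketch, reusing the test binary and merge.profraw from above):

Bash
llvm-cov report ./test -instr-profile=merge.profraw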

4 Context-Sensitive PGO (CSPGO)

Bash
https://reviews.llvm.org/D54175
https://reviews.llvm.org/rL354930

4.1 PGO & CSPGO

The current PGO profile counts are not context sensitive. For an inlined function, the branch probabilities are the same at every call site, and they can be very different from the actual branch probabilities. These suboptimal profiles can greatly affect some downstream optimizations, in particular machine basic-block placement.

The patch proposes a post-inline PGO instrumentation/use pass, called context-sensitive PGO (CSPGO). Users who want the best possible performance can run a second round of PGO instrumentation/use on top of regular PGO. They then have two sets of profile counts. The first-pass profile is mainly used for inlining, indirect-call promotion, and the CGSCC simplification passes. The second-pass profile is used for post-inline optimizations and code generation.

  • Regular PGO instrumentation runs before inlining.
  • CSPGO instrumentation runs after inlining: -fprofile-use=pass1.profdata -fcs-profile-generate

4.2 How to use CSPGO

Regular PGO instrumentation: generate the pass1 profile.

Bash
clang -O2 -fprofile-generate source.c -o gen
./gen
llvm-profdata merge default*.profraw -o pass1.profdata

CSPGO instrumentation.

Bash
clang -O2 -fprofile-use=pass1.profdata -fcs-profile-generate source.c -o gen2
./gen2

Merge the two sets of profiles.

Bash
llvm-profdata merge default*.profraw pass1.profdata -o profile.profdata

Use the combined profile. The pass manager will invoke two PGO use passes.

Text Only
clang -O2 -fprofile-use=profile.profdata source.c -o use

The first rebuild uses -fprofile-use=pass1.profdata -fcs-profile-generate; the final build uses -fprofile-use=profile.profdata.
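
Pulling the steps together into one script (a sketch; source.c and the file names are the ones assumed above, and the pass-1 raw files are removed so the second merge only picks up the CSPGO run):

Bash
# pass 1: regular PGO instrumentation, train, merge
clang -O2 -fprofile-generate source.c -o gen
./gen
llvm-profdata merge default*.profraw -o pass1.profdata
rm default*.profraw

# pass 2: CSPGO instrumentation on top of the pass-1 profile, train
clang -O2 -fprofile-use=pass1.profdata -fcs-profile-generate source.c -o gen2
./gen2

# merge both sets of counts and do the final build
llvm-profdata merge default*.profraw pass1.profdata -o profile.profdata
clang -O2 -fprofile-use=profile.profdata source.c -o use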

5 Using sample-based profiles

Sampling profilers are used to collect runtime information, such as hardware counters, while your application executes. They are typically very efficient and do not incur a large runtime overhead. The sample data collected by the profiler can be used during compilation to determine what the most executed areas of the code are.

Using the data from a sample profiler requires some changes in the way a program is built. Before the compiler can use profiling information, the code needs to execute under the profiler. The following is the usual build cycle when using sample profilers for optimization:

  • Build the code with source line table information. You can use all the usual build flags that you always build your application with. The only requirement is that you add -gline-tables-only or -g to the command line. This is important for the profiler to be able to map instructions back to source line locations.

    Bash
    $ clang++ -O2 -gline-tables-only code.cc -o code
    

  • Run the executable under a sampling profiler. The specific profiler you use does not really matter, as long as its output can be converted into the format that the LLVM optimizer understands. Currently, there exists a conversion tool for the Linux Perf profiler (https://perf.wiki.kernel.org/), so these examples assume that you are using Linux Perf to profile your code.

    Bash
    $ perf record -b ./code
    
    Note the use of the -b flag. This tells Perf to use the Last Branch Record (LBR) to record call chains. While this is not strictly required, it provides better call information, which improves the accuracy of the profile data.

  • Convert the collected profile data to LLVM’s sample profile format. This is currently supported via the AutoFDO converter create_llvm_prof. It is available at https://github.com/google/autofdo. Once built and installed, you can convert the perf.data file to LLVM using the command:

    Bash
    $ create_llvm_prof --binary=./code --out=code.prof
    
    This will read perf.data and the binary file ./code and emit the profile data in code.prof. Note that if you ran perf without the -b flag, you need to use --use_lbr=false when calling create_llvm_prof.

  • Build the code again using the collected profile. This step feeds the profile back to the optimizers. This should result in a binary that executes faster than the original one. Note that you are not required to build the code with the exact same arguments that you used in the first step. The only requirement is that you build the code with -gline-tables-only and -fprofile-sample-use.

    Bash
    $ clang++ -O2 -gline-tables-only -fprofile-sample-use=code.prof code.cc -o code
    

6 Code

Bash
lib/Transforms/Instrumentation/PGOInstrumentation.cpp
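
To confirm the instrumentation pass is actually scheduled, one option is to dump the pass structure (a sketch; assumes a clang that still uses the legacy pass manager, as the assertions-enabled build above does, and an arbitrary foo.c):

Bash
clang -O2 -fprofile-generate -c foo.c -mllvm -debug-pass=Structure 2>&1 | grep -i pgo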

7 Some options

1 -print-machine-bfi

Print the machine block frequency info.

Bash
llc ctrloop-shortLoops.ll -print-machine-bfi

block-frequency-info: testTripCount2NonSmallLoop
 - BB0[entry]: float = 1.0, int = 8
 - BB1[for.body]: float = 32.0, int = 255
 - BB2[if.then]: float = 20.0, int = 159
 - BB3[if.end]: float = 32.0, int = 255
 - BB4[for.end]: float = 1.0, int = 8

2 -print-bfi

Print the block frequency info.

Bash
llc ctrloop-shortLoops.ll -print-bfi

block-frequency-info: testTripCount2NonSmallLoop
 - entry: float = 1.0, int = 8
 - for.body: float = 32.0, int = 255
 - if.then: float = 20.0, int = 159
 - if.end: float = 32.0, int = 255
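
The same block-frequency information can also be printed at the IR level with opt (a sketch; this is the new-pass-manager spelling, so the exact flag depends on the LLVM version):

Bash
opt -passes='print<block-freq>' -disable-output ctrloop-shortLoops.ll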

8 ./lib/Analysis/InlineCost.cpp

If you use the -fprofile-use option, you can find the hot callees in the debug output: look for the "Hot callee" messages.
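
One way to surface those messages (a sketch; it needs an assertions-enabled clang and a translation unit that both defines and calls the hot callee, e.g. the bintree.c example from section 2.4):

Bash
clang -O3 bintree.c -fprofile-use=merge.profraw -mllvm -debug 2> debug.log
grep -i "hot callee" debug.log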

9 An example

cat foo.c

C++
struct parm {
  int *arr;
  int m;
  int n;
};
void foo(struct parm *arg) {
  struct parm localArg = *arg;
  int m = localArg.m;
  int *s = localArg.arr;
  int n = localArg.n;
  do{
    int k = n;
    do{
      s[++k] = k++;
      s[k++] = k;
      s[k++] = k;
      s[k] = k;
      s[--k] = k--;
      s[k--] = k;
      s[--k] = k;
    }while(k--);
  } while(m--);

  s[n]=0;
}

cat main.c

C++
struct parm {
  int *arr;
  int m;
  int n;
};
void foo(struct parm*);
int main() {
  int a[5000];
  struct parm arg = {a, 2000000000, 5};
  foo(&arg);
  return 0;
}

pgo.ksh

Bash
set -x
# profile-generate
rm t t.* t_* *.o *.s *.profraw *.profdata
clang -c main.c -O -fprofile-generate
clang -S foo.c -O -fno-vectorize -mllvm -unroll-count=0 -fprofile-generate
clang -o t main.o foo.s -fprofile-generate
objdump -dr t > t.dis
time -p ./t

# merge
llvm-profdata merge *.profraw -output=merge.profdata

# profile-use
clang -c main.c -O -fprofile-use=merge.profdata
clang -S foo.c -O -fno-vectorize -mllvm -unroll-count=0 -fprofile-use=merge.profdata
clang -o t_pgo main.o foo.s -fprofile-use=merge.profdata
objdump -dr t_pgo > t_pgo.dis
time -p ./t_pgo

Another Case

C++
$ cat def.c 
int test(int a, int b) {
  return a <= b ? a : -a;
}

$ cat use.c 
int test(int, int);
int main(int argc, const char **argv) {
  int Ret = 0;
  for (int i = 0; i < 1 << 28; i += 1) {
    Ret += test(i, i - 1);
  }

  return Ret;
}

$ clang -O3 def.c use.c -fprofile-generate
$ ./a.out
$ llvm-profdata merge default_*.profraw -o default.profdata
$ clang -O3 def.c -fprofile-use -emit-llvm -S
$ grep '\!30' def.ll 
  %cond = select i1 %cmp, i32 %sub, i32 %a, !prof !30
!30 = !{!"branch_weights", i32 1073741818, i32 6}
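
To inspect what the training run recorded for test (reusing the show command from section 2.2):

Bash
llvm-profdata show -all-functions -counts default.profdata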