LLVM PGO (profile data) notes
Build clang & compiler-rt on e1v:
cmake $HOME/llvm/llvm \
  -DLLVM_TARGETS_TO_BUILD=PowerPC \
  -DCMAKE_INSTALL_PREFIX=$HOME/llvm/install \
  -DLLVM_ENABLE_ASSERTIONS=On \
  -DCMAKE_BUILD_TYPE=RELEASE \
  -DLLVM_ENABLE_PROJECTS="clang;compiler-rt" \
  -DCMAKE_C_COMPILER=/opt/at12.0/bin/gcc \
  -DCMAKE_CXX_COMPILER=/opt/at12.0/bin/g++ \
  -DLLVM_BINUTILS_INCDIR=/gsa/tlbgsa/projects/x/xlcdl/shkzhang/p9_software/binutils/binutils/include
How to use llvm pdf:
llvm-profdata tool:
clang pgo official document:
How SPEC uses PGO¶
The link step doesn't need -fprofile, but at build time every file needs -fprofile.
default=peak:
PASS1_OPTIMIZE = -fprofile-generate
PASS1_LDFLAGS = -pie %{LINKER} -Wl,-q -Wl,-rpath=%{BASE_DIR}/lib
fdo_post1 = %{BASE_DIR}/bin/llvm-profdata merge default*.profraw -output default.profdata
PASS2_OPTIMIZE = -fprofile-use %{LTO_OPTION}
PASS2_LDFLAGS = -pie %{LINKER} -Wl,-q -Wl,-rpath=%{BASE_DIR}/lib
1. Clang PGO¶
1.1 Two kinds of profiles¶
Clang can use two kinds of profiles to perform profile-guided optimization:
- Instrumentation-based profiles (AST-based and LLVM-IR-based, both shipped with clang/llvm)
Generated by running an instrumented build of the target program. These profiles are very detailed but incur high runtime overhead.
- Sampling-based profiles (produced with an external tool such as perf)
Typically generated by sampling hardware counters. They incur low runtime overhead and can be collected without instrumenting or modifying the binary, but they are less detailed than instrumentation-based profiles.
The two instrumentation-based profiles¶
All profiles should be generated from a representative workload that exercises the application's typical behavior. Clang supports both AST-based (clang frontend) profiles (-fprofile-instr-generate, -fprofile-instr-use) and LLVM-IR-based profiles (-fprofile-generate, -fprofile-use).
Sampling-based profiles:¶
You must generate the profile with an external profiling tool such as Linux perf, convert it into a format LLVM can read, and then consume it with -fprofile-sample-use=pathname.
Differences¶
https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
Differences Between Sampling and Instrumentation
- Profile data generated with one cannot be used by the other, and there is no tool that can convert one to the other. So a profile generated via -fprofile-instr-generate must be used with -fprofile-instr-use; similarly, sampling profiles generated by external profilers must be converted and used with -fprofile-sample-use.
- Instrumentation profiles can be used for both code coverage analysis and optimization.
- Sampling profiles can only be used for optimization; they cannot be used for code coverage analysis. Although it would be technically possible, sample-based profiles are too coarse-grained for coverage purposes and would yield poor results.
- Sampling profiles must be produced by an external tool and converted into LLVM's format.
2. Instrumentation-based profiles: multiple files & a single file¶
2.1 Get the profile data file.¶
Needs libclang-rt (the profile runtime), and the link step must also use -fprofile-instr-generate / -fprofile-generate.
TARGET = main
OBJ = cal.o ioput.o main.o
CFLAG = -O2
PDF = -fprofile-generate
# the raw profiles will be written into the directory default-dir.dataraw
#PDF = -fprofile-generate=default-dir.dataraw
#CFLAG = -fprofile-generate
LFLAG =
CC = clang

main: $(OBJ)
	$(CC) $(OBJ) -o $(TARGET) $(LFLAG) $(PDF)
%.o: %.c
	$(CC) -c $(CFLAG) $(PDF) $<
clean:
	rm $(TARGET) *.o
Run the program
If you don't set LLVM_PROFILE_FILE, you will get default_711218735311555632_0.profraw or code.profraw.
2.2 Merge & Show the data file.¶
llvm-profdata merge [options] [filename…]¶
llvm-profdata merge foo.profraw bar.profdata baz.profraw -output=merge.profdata
llvm-profdata merge *.profraw -output=merged.profdata
error: Could not read profile default_2461475265565158006_0.profraw: Invalid instrumentation profile data (bad magic)
1 error generated.
llvm-profdata show [options] [filename]; it is best to run show after merging.¶
llvm-profdata show -help
llvm-profdata show --help-list-hidden
llvm-profdata show -all-functions default*.profraw
- Sample profile
Reports the basic-block execution counts of hot functions.
--text - Show instr profile data in text dump
llvm-profdata show default.profdata --text -all-functions -o text.log
Select the hottest basic blocks:
--topn=<uint> - Show the list of functions with the largest internal counts
llvm-profdata show default.profdata -all-functions --topn=10 -o topn.log
2.3 Use the profile file:¶
merge.profdata here is the file produced by the merge step.
TARGET = main
OBJ = cal.o ioput.o main.o
CFLAG = -O2
PDF = -fprofile-use=merge.profdata
LFLAG =
CC = clang

main: $(OBJ)
	$(CC) $(OBJ) -o $(TARGET) $(LFLAG) $(PDF)
%.o: %.c
	$(CC) -c $(CFLAG) $(PDF) $<
clean:
	rm $(TARGET) *.o
2.4 single file example¶
- build
- run
- merge (convert into the format LLVM expects)
- use
Clang supports both:
- AST-based (-fprofile-instr-generate)
- LLVM-IR-based (-fprofile-generate).
# Options
./clang/lib/CodeGen/BackendUtil.cpp
3 Instrumentation-based profile example¶
Note: the blog post below contains some errors.
https://cmdlinelinux.blogspot.com/2018/04/profiling-c-code-with-clang-using.html
3.1 test code¶
test.c
#include <stdio.h>
#include <stdlib.h>
#define CTR 10
int
main()
{
int i, j, k;
for(i=0; i < CTR; ++i) {
printf("3: %d", i);
}
for(i=0; i < CTR*10; ++i) {
printf("3: %d", i);
}
for(i=0; i < CTR*100; ++i) {
printf("3: %d", i);
}
// exit(0);
return 0;
}
3.2 build¶
Notice that if you want to use -fcoverage-mapping, you must use -fprofile-instr-generate; you cannot use -fprofile-generate.
3.3 Run¶
You will get default.profraw.
3.4 merge¶
3.5 View the result¶
- Use llvm-profdata show. Output will look like:
Counters:
  main:
    Hash: 0x0000000000004104
    Counters: 4
    Function count: 1
    Indirect Call Site Count: 0
    Block counts: [10, 100, 1000]
    Indirect Target Results:
Functions shown: 1
Total functions: 1
Maximum function count: 1
Maximum internal block count: 1000
- Use llvm-cov show. Output will look like the following, so it is easy to find the hot spots:
Count|line#| source code
     |    1|#include <stdio.h>
     |    2|#include <stdlib.h>
1.11k|    3|#define CTR 10
     |    4|
     |    5|int
     |    6|main()
    1|    7|{
    1|    8|    int i, j, k;
   11|    9|    for(i=0; i < CTR; ++i) {
   10|   10|        printf("3: %d", i);
   10|   11|    }
  101|   12|    for(i=0; i < CTR*10; ++i) {
  100|   13|        printf("3: %d", i);
  100|   14|    }
1.00k|   15|    for(i=0; i < CTR*100; ++i) {
1.00k|   16|        printf("3: %d", i);
1.00k|   17|    }
    1|   18|    // exit(0);
    1|   19|    return 0;
    1|   20|}
Note that the positions of the count and line# columns may vary.
4 Context Sensitive PGO(CSPGO)¶
4.1 PGO & CSPGO¶
The current PGO profile counts are context-insensitive: an inlined function's branch probabilities are the same at every call site, and they may be very different from the actual probabilities. These suboptimal profiles can significantly affect some downstream optimizations, especially machine basic-block layout.
This patch proposes a post-inline PGO instrumentation/use pass, called Context-Sensitive PGO (CSPGO). Users who want the best performance can run a second round of PGO instrumentation/use on top of regular PGO, giving them two sets of profile counts. The first-pass profile is mainly used by inlining, indirect-call promotion, and CGSCC simplification passes; the second-pass profile is used by post-inline optimizations and code generation.
- Regular PGO instrumentation happens before inlining.
- CSPGO instrumentation happens post-inline.
-fprofile-use=pass1.profdata -fcs-profile-generate
4.2 How to use CSPGO¶
Regular PGO instrumentation and generate pass1 profile.
clang -O2 -fprofile-generate source.c -o gen
./gen
llvm-profdata merge default*.profraw -o pass1.profdata
Build with -fprofile-use=pass1.profdata -fcs-profile-generate; the second use pass again passes -fprofile-use=pass1.profdata.
5 Using sampling-based profiles¶
Sampling profilers are used to collect runtime information, such as hardware counters, while your application executes. They are typically very efficient and do not incur a large runtime overhead. The sample data collected by the profiler can be used during compilation to determine what the most executed areas of the code are.
Using the data from a sample profiler requires some changes in the way a program is built. Before the compiler can use profiling information, the code needs to execute under the profiler. The following is the usual build cycle when using sample profilers for optimization:
1. Build the code with source line table information. You can use all the usual build flags that you always build your application with. The only requirement is that you add -gline-tables-only or -g to the command line. This is important for the profiler to be able to map instructions back to source line locations.
2. Run the executable under a sampling profiler. The specific profiler you use does not really matter, as long as its output can be converted into the format that the LLVM optimizer understands. Currently, there exists a conversion tool for the Linux Perf profiler (https://perf.wiki.kernel.org/), so these examples assume that you are using Linux Perf to profile your code. Note the use of the -b flag: this tells Perf to use the Last Branch Record (LBR) to record call chains. While this is not strictly required, it provides better call information, which improves the accuracy of the profile data.
3. Convert the collected profile data to LLVM's sample profile format. This is currently supported via the AutoFDO converter create_llvm_prof, available at https://github.com/google/autofdo. Once built and installed, it reads perf.data and the binary file ./code and emits the profile data in code.prof. Note that if you ran perf without the -b flag, you need to pass --use_lbr=false when calling create_llvm_prof.
4. Build the code again using the collected profile. This step feeds the profile back to the optimizers and should result in a binary that executes faster than the original one. Note that you are not required to build the code with the exact same arguments that you used in the first step. The only requirement is that you build the code with -gline-tables-only and -fprofile-sample-use.
6 Code¶
7 Some options¶
1 -print-machine-bfi¶
Print the machine block frequency info.
llc ctrloop-shortLoops.ll -print-machine-bfi
block-frequency-info: testTripCount2NonSmallLoop
- BB0[entry]: float = 1.0, int = 8
- BB1[for.body]: float = 32.0, int = 255
- BB2[if.then]: float = 20.0, int = 159
- BB3[if.end]: float = 32.0, int = 255
- BB4[for.end]: float = 1.0, int = 8
2 -print-bfi¶
Print the block frequency info.
llc ctrloop-shortLoops.ll -print-bfi
block-frequency-info: testTripCount2NonSmallLoop
- entry: float = 1.0, int = 8
- for.body: float = 32.0, int = 255
- if.then: float = 20.0, int = 159
- if.end: float = 32.0, int = 255
8 ./lib/Analysis/InlineCost.cpp¶
If you build with the -fprofile-use option, you can find the hot callees in the debug output:
Hot callee
9 An example¶
cat foo.c
struct parm {
int *arr;
int m;
int n;
};
void foo(struct parm *arg) {
struct parm localArg = *arg;
int m = localArg.m;
int *s = localArg.arr;
int n = localArg.n;
do{
int k = n;
do{
s[++k] = k++;
s[k++] = k;
s[k++] = k;
s[k] = k;
s[--k] = k--;
s[k--] = k;
s[--k] = k;
}while(k--);
} while(m--);
s[n]=0;
}
cat main.c
struct parm {
int *arr;
int m;
int n;
};
void foo(struct parm*);
int main() {
int a[5000];
struct parm arg = {a, 2000000000, 5};
foo(&arg);
return 0;
}
pgo.ksh
set -x
# profile-generate
rm t t.* t_* *.o *.s *.profraw *.profdata
clang -c main.c -O -fprofile-generate
clang -S foo.c -O -fno-vectorize -mllvm -unroll-count=0 -fprofile-generate
clang -o t main.o foo.s -fprofile-generate
objdump -dr t > t.dis
time -p ./t
# merge
llvm-profdata merge *.profraw -output=merge.profdata
# profile-use
clang -c main.c -O -fprofile-use=merge.profdata
clang -S foo.c -O -fno-vectorize -mllvm -unroll-count=0 -fprofile-use=merge.profdata
clang -o t_pgo main.o foo.s -fprofile-use=merge.profdata
objdump -dr t_pgo > t_pgo.dis
time -p ./t_pgo
Another Case¶
$ cat def.c
int test(int a, int b) {
return a <= b ? a : -a;
}
$ cat use.c
int test(int, int);
int main(int argc, const char **argv) {
int Ret = 0;
for (int i = 0; i < 1 << 28; i += 1) {
Ret += test(i, i - 1);
}
return Ret;
}
$ clang -O3 def.c use.c -fprofile-generate
$ ./a.out
$ llvm-profdata merge default_*.profraw -o default.profdata
$ clang -O3 def.c -fprofile-use -emit-llvm -S
$ grep '\!30' def.ll
%cond = select i1 %cmp, i32 %sub, i32 %a, !prof !30
!30 = !{!"branch_weights", i32 1073741818, i32 6}