从源码编译安装
知乎地址:
https://zhuanlan.zhihu.com/p/572757265?
源码安装地址¶
有所支持的显卡的机器上安装tensorflow和CUDA。
这里我使用的源代码是r2.10,不同的源代码需要使用的python,cmake,bazel的版本可能不同。
commit 359c3cdfc5fabac82b3c70b3b6de2b0a8c16874f (HEAD -> r2.10, tag: v2.10.0, origin/r2.10)
Merge: 203b3338195 724308fadba
Author: Mihai Maruseac <mihaimaruseac@google.com>
Date: Fri Sep 2 15:59:55 2022 -0700
Merge pull request #57609 from tensorflow/vinila21-patch-6
Update estimator and keras version in TF 2.10 branch for 2.10.0.
https://cloud-atlas.readthedocs.io/zh_CN/latest/machine_learning/tensorflow/build_tensorflow.html
我使用的安装环境是8核,8G内存,swap 16G内存。
1.编译¶
先把packaging装上,最后装包时要用。
1.1 configure¶
1.2 Build(看最后我用的命令行)¶
仅支持 CPU 的情况下,调用下面的命令建立一个 TenslorFlow pip 包:
支持 GPU 的情况下,调用下面的命令建立一个 TensorFlow pip 包:
1.3 生成whl安装文件¶
出现以下信息表示build成功
INFO: Found applicable config definition build:dynamic_kernels in file /home/ken/workspace/tensorflow/tensorflow/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
INFO: Analyzed target //tensorflow/tools/pip_package:build_pip_package (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 5.072s, Critical Path: 4.12s
INFO: 3 processes: 1 internal, 2 local.
INFO: Build completed successfully, 3 total actions
bazel build 命令执行一个叫 build_pip_package 的脚本,该脚本在第5行所有的位置。执行这个脚本将会在 /tmp/tensorflow_pkg 目录下构建一个 .whl 文件。
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg // 可以改成~/tensorflow_pkg
# 以下是我执行该命令后的输出
~/workspace/tensorflow/tensorflow/bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles/org_tensorflow ~/workspace/tensorflow/tensorflow
~/workspace/tensorflow/tensorflow
/tmp/tmp.4uUFP30gwO/tensorflow/include ~/workspace/tensorflow/tensorflow
~/workspace/tensorflow/tensorflow
2022年 10月 12日 星期三 18:12:49 CST : === Building wheel
/usr/lib/python3.7/distutils/dist.py:274: UserWarning: Unknown distribution option: 'long_description_content_type'
warnings.warn(msg)
warning: no files found matching 'README'
warning: no files found matching '*.pyd' under directory '*'
warning: no files found matching '*.pyi' under directory '*'
warning: no files found matching '*.pd' under directory '*'
warning: no files found matching '*.dylib' under directory '*'
warning: no files found matching '*.dll' under directory '*'
warning: no files found matching '*.lib' under directory '*'
warning: no files found matching '*.csv' under directory '*'
warning: no files found matching '*.h' under directory 'tensorflow/include/tensorflow'
warning: no files found matching '*.proto' under directory 'tensorflow/include/tensorflow'
warning: no files found matching '*' under directory 'tensorflow/include/third_party'
2022年 10月 12日 星期三 18:13:25 CST : === Output wheel file is in: /home/ken/tensorflow_pkg
1.4 安装whl文件(这里最好是使用virtualenv进行安装)¶
pip3 --user install /tmp/tensorflow_pkg/tensorflow-1.8.0-py3-none-any.whl
pip2 --user install /tmp/tensorflow_pkg/tensorflow-1.8.0-py2-none-any.whl
如果编译时用的是python3,名字中会有py3,使用pip3命令。
如果编译时用的是python2,名字中会有py2,使用pip命令。
1.5 验证是否安装成功¶
按照以下说明验证 TensorFlow 是否安装成功:
启动终端。
在 python 交互式 shell 中输入以下代码:
TF1.x hello world:
import tensorflow as tf
msg = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(msg))
TF2.x hello world:
如果系统输出以下内容,你就可以开始编写 TensorFlow 程序了:
2.问题解决¶
https://blog.csdn.net/surtol/article/details/97638399
这里是以r2.10为例。
2.1 bazel版本错误¶
需要安装指定版本的bazel,根据报错信息选择指定版本,r2.10需要使用bazel-5.1.1。
https://github.com/bazelbuild/bazel/releases?page=3
下载指定版本的 bazel-5.1.1-installer-linux-x86_64.sh
即可安装完成
安装bazel
2.2 依赖无法下载¶
这是在安装tensorflow中最大的问题。很多依赖都是从github或google下载的,国内有gfw导致无法下载而失败。
依赖的解决有三种方式,前两种方式是首选,我用前两种方式解决的所有依赖问题,第三种方式没试过
https://stackoverflow.com/questions/42918662/bazel-error-building-tensorflow-with-local-llvm-repository
1 --over_repository方式¶
先将所需有依赖根据报错的链接用wget下下来,下载指定的报错的commit版本,一般会有google链接和github链接。如果没办法下载的话,通过下载源码然后使用git reset commit --hard 的方式。
下载好之后,解压到某个位置,然后用--over_repository进行指定。如tf_runtime
--override_repository=tf_runtime=/home/ken/workspace/tf_dep/runtime-6ca793b5d862ef6c50f242d77a811f06cce9b60a
2. 自己搭建http服务器¶
-
下载所需的依赖打包文件。放到某个位置。
-
搭建http服务器
这里,执行该命令的目录为home目录,可以在网页上输入http://localhost:8000进行测试。
最好直接在自己的$HOME目录执行该操作。这里我将依赖文件放于/home/ken/workspace/tf_dep/0538e5431afdb1fa05bdcedf70ee502ccfcd112a.tar.gz
- 将链接加到对应的
workspace.bzl文件中
用以下命令清空bazel build cache
bazel clean --expunge
~/workspace/tensorflow/tensorflow$ git diff third_party/llvm/workspace.bzl
diff --git a/third_party/llvm/workspace.bzl b/third_party/llvm/workspace.bzl
index c9760c3ab38..75308a26ec4 100644
--- a/third_party/llvm/workspace.bzl
+++ b/third_party/llvm/workspace.bzl
@@ -13,6 +13,7 @@ def repo(name):
strip_prefix = "llvm-project-{commit}".format(commit = LLVM_COMMIT),
urls = [
"https://storage.googleapis.com/mirror.tensorflow.org/github.com/llvm/llvm-project/archive/{commit}.tar.gz".format(commit = LLVM_COMMIT),
+ "http://localhost:8000/workspace/tf_dep/0538e5431afdb1fa05bdcedf70ee502ccfcd112a.tar.gz",
"https://github.com/llvm/llvm-project/archive/{commit}.tar.gz".format(commit = LLVM_COMMIT),
],
build_file = "//third_party/llvm:llvm.BUILD",
这样,会自动下载该依赖。
3. 使用native.local_repository¶
该方法还没研究清楚具体怎么用,没用过,不保证是对的
Changed the temp_workaround_http_archive rule in //tensorflow/workspace.bzl to:
native.local_repository (
name = "llvm",
path = "/git/llvm/",
)
In /git/llvm I added the file WORKSPACE containing:
workspace( name = "llvm" )
2.3 mlir依赖报错¶
DIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --distinct_host_configuration=false --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /home/ken/workspace/tensorflow/tensorflow/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
INFO: Analyzed target //tensorflow/tools/pip_package:build_pip_package (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
ERROR: /home/ken/.cache/bazel/_bazel_ken/7958fc4ffdc3fda57f47c66f5611f525/external/llvm-project/mlir/BUILD.bazel:6592:10: Compiling mlir/tools/mlir-tblgen/DirectiveCommonGen.cpp failed: undeclared inclusion(s) in rule '@llvm-project//mlir:mlir-tblgen':
this rule is missing dependency declarations for the following files included by 'mlir/tools/mlir-tblgen/DirectiveCommonGen.cpp':
'bazel-out/k8-opt-exec-50AE0418/bin/external/llvm-project/llvm/Demangle.cppmap'
'bazel-out/k8-opt-exec-50AE0418/bin/external/llvm_terminfo/terminfo.cppmap'
'bazel-out/k8-opt-exec-50AE0418/bin/external/llvm_zlib/zlib.cppmap'
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 136.286s, Critical Path: 29.42s
INFO: 487 processes: 199 internal, 288 local.
FAILED: Build did NOT complete successfully
此种情况可能是使用了clang/clang++进行编译,换成gcc/g++后,该问题得到解决。
2.4 gcc: internal compiler error: Killed (program cc1plus)¶
Configuration: 3ae23968b857c69df50fb9f45895d8d26e5e5626ba667099462d25a3ebb9874f
# Execution platform: @local_execution_config_platform//:platform
gcc: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
可能是资源不够用。
使用free -m或者htop查看资源使用情况
如果是swap分区不够用,如果是ubuntu的话,用下面的方法对swap分区进行扩容。
https://blog.csdn.net/AlexWang30/article/details/90341172
2.5 SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA¶
2022-10-12 17:29:57.576173: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
这不是个错误 ,只是告诉你你可以加AVX等相关的选项去提高性能。
2.6 packaing not found¶
from packaging import version as packaging_version # pylint: disable=g-bad-import-order
ModuleNotFoundError: No module named 'packaging'
Target //tensorflow/tools/pip_package:build_pip_package failed to build
ERROR: /home/ken/workspace/tensorflow/tensorflow/tensorflow/lite/python/BUILD:68:10 Middleman _middlemen/tensorflow_Slite_Spython_Stflite_Uconvert-runfiles failed: (Exit 1): bash failed: error executing command
3 编译命令¶
下面是我用的编译命令,--override_repository是提前下好解决的文件。这里只针对r2.10版本,其它版本所需要的依赖commit可能是不同的。
使用4GB内存,4个核来编译。htop来看有多少核,我这里有8G内存,8个核,只用了一半ram,另外设置了16G的swapp空间。
--local_ram_resources=4096 --local_cpu_resources=6, 不设置,则所有资源可用,可能会异常中止.
如果使用clang来编译,需要使用,建议不要用clang
CC=clang bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
--verbose_failures出错的话,会显示详细的出错信息
CC=gcc CXX=g++ bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package --override_repository=tf_runtime=/home/ken/workspace/tf_dep/runtime-6ca793b5d862ef6c50f242d77a811f06cce9b60a --override_repository=upb=/home/ken/workspace/tf_dep/upb-9effcbcb27f0a665f9f345030188c0b291e32482 --override_repository=io_bazel_rules_go=/home/ken/workspace/tf_dep/rules_go-0.18.5 --override_repository=envoy_api=/home/ken/workspace/tf_dep/data-plane-api-c83ed7ea9eb5fb3b93d1ad52b59750f1958b8bde --override_repository=rules_java=/home/ken/workspace/tf_dep/rules_java-981f06c3d2bd10225e85209904090eb7b5fb26bd --override_repository=XNNPACK=/home/ken/workspace/tf_dep/XNNPACK-6b409ac0a3090ebe74d0cdfb494c4cd91485ad39 --override_repository=rules_cc=/home/ken/workspace/tf_dep/rules_cc-081771d4a0e9d7d3aa0eed2ef389fa4700dfb23e --local_ram_resources=4096 --local_cpu_resources=6 --verbose_failures