Tensorflow Debugging Methods

Timeline

The Timeline gives an intuitive, Op-granularity view of execution within a single step, including which GPU stream each Op runs on, and is a useful reference for performance optimization.

Type chrome://tracing/ into Chrome, then use the Load button to load time.json and view it graphically.
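
As a reference, here is a minimal sketch of how such a time.json can be produced with the TF 1.x session API (the matmul graph and the output file name are only illustrative):

    import tensorflow as tf
    from tensorflow.python.client import timeline

    # A toy graph; replace with your own model.
    a = tf.random_normal([1024, 1024])
    b = tf.random_normal([1024, 1024])
    c = tf.matmul(a, b)

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    with tf.Session() as sess:
        sess.run(c, options=run_options, run_metadata=run_metadata)
        # Convert the collected step_stats into the Chrome trace format.
        tl = timeline.Timeline(run_metadata.step_stats)
        with open('time.json', 'w') as f:
            f.write(tl.generate_chrome_trace_format())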

"Op granularity" means that the Timeline only shows, for each Op executed in the DAG, its start and end time and the Stream it ran on (TensorFlow manages Streams through StreamExecutor). Op-level analysis can be misleading. For example, in distributed training with Horovod, the AllreduceOps have no dependencies on each other and sit on different Streams, so the Timeline makes them look highly parallel. Inside Horovod, however, tensors are fused, and only a single background thread actually performs the network communication; on top of that, IO throughput is ultimately bounded by the NIC. So the real IO path is nothing like the parallelism seen on the Timeline: it is serialized in the Horovod background thread, and at the latest at the NIC.

Environment Variables

> This part of the code changes fairly often; 1.14rc0 is used as the example here.

TF_CPP_MIN_LOG_LEVEL: log level of the TF runtime; the four levels INFO, WARN, ERROR, FATAL are selected with 0, 1, 2, 3 respectively; the default is INFO.
TF_CPP_MIN_VLOG_LEVEL: verbose debug logging of the TF runtime, only effective when TF_CPP_MIN_LOG_LEVEL = 0; verbosity increases with 0, 1, 2, ..., and at 2 or above the output is already very large.
TF_DUMP_GRAPH_PREFIX: setting this variable, e.g. TF_DUMP_GRAPH_PREFIX="/tmp/tf_dump_graph/", dumps the whole graph before and after each optimization pass into the corresponding pbtxt files under /tmp/tf_dump_graph/. If the graph consists of TF ops, converting it to pb format lets you view it in TensorBoard; but graphs like XLA's, which contain many ops outside the TF op set, are still not recognized by TensorBoard after conversion to pb.
TF_XLA_FLAGS: debug flags of the XLA JIT, defined in tensorflow/compiler/jit/flags.cc. Usage: export TF_XLA_FLAGS="--tf_xla_enable_lazy_compilation --tf_xla_print_cluster_outputs"

| item | default value | description |
| --- | --- | --- |
| tf_xla_enable_lazy_compilation | true | |
| tf_xla_print_cluster_outputs | false | If true then insert Print nodes to print out values produced by XLA clusters. |
| tf_xla_compile_on_demand | false | Switch a device into 'on-demand' mode, where instead of autoclustering ops are compiled one by one just-in-time. |
| tf_xla_always_defer_compilation | false | |
| tf_introduce_floating_point_jitter_to_tensors | X | The Tensors to add the jitter to. The tensors are named in the TensorId format of <node name>:<output idx>. |

XLA_FLAGS: debug flags at the XLA HLO level, defined in tensorflow/compiler/xla/debug_options_flags.cc

| item | default value | description |
| --- | --- | --- |
| xla_cpu_enable_fast_math | true | Enable unsafe fast-math optimizations in the CPU compiler; this may produce faster code at the expense of some accuracy. |
| xla_cpu_fast_math_honor_nans | X | When xla_cpu_enable_fast_math is true then this controls whether we allow operations to produce NaNs. Ignored when xla_cpu_enable_fast_math is false. |
| xla_cpu_fast_math_honor_infs | X | When xla_cpu_enable_fast_math is true then this controls whether we allow operations to produce infinities. Ignored when xla_cpu_enable_fast_math is false. |
| xla_gpu_enable_fast_min_max | true | Enable fast floating point min/max lowering that does not propagate NaNs. |
| xla_llvm_enable_alias_scope_metadata | true | In LLVM-based backends, enable the emission of !alias.scope metadata in the generated IR. |
| xla_llvm_enable_noalias_metadata | true | In LLVM-based backends, enable the emission of !noalias metadata in the generated IR. |
| xla_llvm_enable_invariant_load_metadata | true | In LLVM-based backends, enable the emission of !invariant.load metadata in the generated IR. |
| xla_llvm_disable_expensive_passes | false | In LLVM-based backends, disable a custom set of expensive optimization passes. |
| xla_backend_optimization_level | 3 | Numerical optimization level for the XLA compiler backend. |
| xla_disable_hlo_passes | X | Comma-separated list of HLO passes to be disabled. These names must exactly match the passes' names; no whitespace around commas. |
| xla_disable_all_hlo_passes | false | Disables all HLO passes. Note that some passes are necessary for correctness and the invariants that must be satisfied by 'fully optimized' HLO are different for different devices and may change over time. The only 'guarantee', such as it is, is that if you compile XLA and dump the optimized HLO for some graph, you should be able to run it again on the same device with the same build of XLA. |
| xla_embed_ir_in_executable | X | Embed the compiler IR as a string in the executable. |
| xla_eliminate_hlo_implicit_broadcast | true | Eliminate implicit broadcasts when lowering user computations to HLO instructions; use explicit broadcast instead. |
| xla_cpu_multi_thread_eigen | true | When generating calls to Eigen in the CPU backend, use multi-threaded Eigen mode. |
| xla_gpu_cuda_data_dir | "./cuda_sdk_lib" | If non-empty, specifies a local directory containing ptxas and nvvm libdevice files; otherwise we use those from runfile directories. |
| xla_gpu_ftz | X | If true, flush-to-zero semantics are enabled in the code generated for GPUs. |
| xla_gpu_disable_multi_streaming | true | If true, multi-streaming in the GPU backend is disabled. |
| xla_gpu_max_kernel_unroll_factor | 4 | Specify the maximum kernel unroll factor for the GPU backend. |
| xla_test_all_output_layouts | X | Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of output layouts. For example, with a 3D shape, all permutations of the set {0, 1, 2} are tried. |
| xla_test_all_input_layouts | X | Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of *input* layouts. For example, for 2 input arguments with 2D shape and 4D shape, the computation will run 2! * 4! times for all possible layouts. |
| xla_hlo_profile | X | Instrument the computation to collect per-HLO cycle counts. |
| xla_backend_extra_options | X | Extra options to pass to a backend; comma-separated list of 'key=val' strings (=val may be omitted); no whitespace around commas. |
| xla_reduce_precision | X | Directions for adding reduce-precision operations. Format is 'LOCATION=E,M:OPS;NAMES' where LOCATION is the class of locations in which to insert the operations (e.g., 'OP_OUTPUTS'), E and M are the exponent and mantissa bit counts respectively, and OPS and NAMES are comma-separated (no spaces) lists of the operation types and names to which to attach the reduce-precision operations. The NAMES string and its preceding ';' may be omitted. This option may be repeated to define multiple sets of added reduce-precision operations. |
| xla_gpu_use_cudnn_batchnorm | false | Allows the GPU backend to implement batchnorm HLOs using cudnn, rather than expanding them to a soup of HLOs. |
| xla_cpu_use_mkl_dnn | true | Generate calls to MKL-DNN in the CPU backend. |
| xla_gpu_crash_on_verification_failures | X | Crashes the program on extra verification failures, e.g. cuDNN cross checking failures. |
| xla_gpu_disable_autotune | X | Disable GEMM and Convolution auto-tuning. |
| xla_force_host_platform_device_count | 1 | Force the host platform to pretend that there are these many host "devices". All of these host devices are backed by the same threadpool. Setting this to anything other than 1 can increase overhead from context switching but we let the user override this behavior to help run tests on the host that run models in parallel across multiple devices. |
| xla_gpu_disable_ptxas_optimizations | X | In XLA:GPU run ptxas in -O0 (default is -O3). |
| xla_dump_to | X | Directory into which debugging data is written. If not specified but another dumping flag is passed, data will be written to stdout. To explicitly write to stdout, set this to "-". The values "sponge" and "test_undeclared_outputs_dir" have a special meaning: they cause us to dump into the directory specified by the environment variable TEST_UNDECLARED_OUTPUTS_DIR. |
| xla_dump_hlo_as_text | X | Dumps HLO modules as text before and after optimizations. Results are written to the --xla_dump_to dir, or, if no dir is specified, to stdout. |
| xla_dump_hlo_as_proto | X | Dumps HLO modules as HloProtos to the directory specified by --xla_dump_to. |
| xla_dump_hlo_as_dot | X | Dumps HLO modules rendered as dot files to the directory specified by --xla_dump_to. |
| xla_dump_hlo_as_html | X | Dumps HLO modules rendered as HTML files to the directory specified by --xla_dump_to. |
| xla_dump_hlo_as_url | X | Tries to dump HLO modules rendered as URLs to stdout (and also to the directory specified by --xla_dump_to). This is not implemented by default; you need to add a plugin which calls RegisterGraphToURLRenderer(). |
| xla_dump_hlo_snapshots | X | Every time an HLO module is run, dumps an HloSnapshot to the directory specified by --xla_dump_to. |
| xla_dump_hlo_module_re | X | Limits dumping only to modules which match this regular expression. Default is to dump all modules. |
| xla_dump_hlo_pass_re | X | If specified, dumps HLO before and after optimization passes which match this regular expression, in addition to dumping at the very beginning and end of compilation. |
| xla_hlo_graph_addresses | X | When rendering graphs (--xla_dump_hlo_as_{dot,html,url}), displays the address in memory of each HloInstruction object. |
| xla_hlo_graph_sharding_color | X | Assign colors based on sharding assignments when generating the HLO graphs. |
| xla_allow_excess_precision | true | Allow XLA to increase the output precision of an instruction. |

When debugging XLA, the following basic configuration is a good starting point: it dumps the graphs around each optimization pass under /tmp/tf_dump_graph/ and the HLO dumps under /tmp/foo. Add more options as needed, or modify the ToString() methods in the source code to dump whatever extra information you want.
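
A sketch of that configuration as environment variables (the dump directories are just examples; pick additional XLA_FLAGS dump options from the table above as needed):

    export TF_CPP_MIN_LOG_LEVEL=0
    export TF_CPP_MIN_VLOG_LEVEL=2
    # Dump the TF graph before and after each optimization pass
    export TF_DUMP_GRAPH_PREFIX="/tmp/tf_dump_graph/"
    # Dump HLO (text and proto) before and after the HLO passes
    export XLA_FLAGS="--xla_dump_to=/tmp/foo --xla_dump_hlo_as_text --xla_dump_hlo_as_proto"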

Output XLA supported ops

The device type used to generate the list is XLA_GPU_JIT or XLA_CPU_JIT. In the source code, XLA_GPU_JIT is wrapped in the macro DEVICE_GPU_XLA_JIT, i.e. it is the device_type used by the OpKernels registered under the DEVICE_XLA_GPU device. For details, see "Tensorflow XlaOpKernel机制详解".
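
One way to generate that list (assuming a TensorFlow source checkout; the tf2xla_supported_ops tool lives under tensorflow/compiler/tf2xla):

    # List the ops the XLA JIT supports for a given device type
    bazel run //tensorflow/compiler/tf2xla:tf2xla_supported_ops -- --device=XLA_GPU_JIT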

Using GDB

The TensorFlow build produces a single .so, and keeping the symbol table in it greatly increases link time.

Steps:
1. Modify the build command line to keep debug symbols

TensorFlow uses Bazel as its build framework. Bazel itself provides a very rich set of build options, but usually we can only change the options of the outermost workspace, which means the options we pass must not conflict with those of the internal build targets. To add debug symbols we need to add "--copt=-g --strip=never"; here "-g" is a GCC option, equivalent under Bazel to "--compilation_mode dbg" or "-c dbg". The final command line looks roughly like the sketch below.
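
(A sketch assuming the usual pip package target; keep --config=cuda only for a GPU build.)

    bazel build --config=opt --config=cuda \
        --copt=-g --strip=never \
        //tensorflow/tools/pip_package:build_pip_package
    # Then build the wheel as usual
    ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg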

2. Build with python3
The main reason for using python3 is that enabling debug symbols makes the .so grow dramatically, which triggers an int overflow in python2 when packaging the wheel.

This bug is fixed in python3. You could of course hack the python package to work around it, but that felt ugly to me, so I chose to change my own code and embrace python3.
Switching to python3 is easy: re-run configure, or directly edit .tf_configure.bazelrc and tools/python_bin_path.sh. On CentOS you need to install a few packages such as python3, python3-pip and python36-devel (see the sketch below).
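
For example (a sketch; the package names are the CentOS ones mentioned above and may differ on other distributions):

    yum install -y python3 python3-pip python36-devel
    # Re-run configure and point PYTHON_BIN_PATH at python3,
    # or edit .tf_configure.bazelrc and tools/python_bin_path.sh directly
    ./configure
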
After the build, you can confirm the debug info with "readelf -S /usr/local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so": many additional sections show up, and the .so gets much larger (148 MB -> 6.1 GB).

3. Debug python by attaching
TensorFlow is packaged as a .so inside the Python package and called from there, so the gdb python + set args + break + run approach that works for simple programs does not apply here: before python actually runs the script, the TF package (tf.so) has not been loaded yet, so gdb cannot resolve the symbols in the .so. The right approach is to start python XXX.py first, so that tf.so is loaded, and then attach gdb to the process. In practice, you can add something like the snippet below to the python script and let it continue once the attach (which loads a lot of symbols, so it is slow) has finished.
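
One possible version of that pause (a sketch, not the author's original line):

    import os
    # Print the PID so gdb can attach from another terminal: gdb -p <pid>
    print("pid: %d" % os.getpid())
    # Block here; press Enter after gdb has finished loading symbols
    input("attach gdb now, then press Enter to continue...")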

4. Set breakpoints
Once gdb has attached, you can set breakpoints. Thanks to the -g option at build time, the function symbols are all in the .so. A typical choice is to break at the Session entry point of the runtime:
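
For example (assuming a local session, whose C++ entry point is tensorflow::DirectSession::Run; adjust the symbol for other session types):

    (gdb) b tensorflow::DirectSession::Run
    (gdb) c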

5. Set the source directory

When execution stops at the breakpoint, you can print the call stack with backtrace; but to browse the source with list, you also need to add the source directory in gdb so it can find the surrounding context:
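
For example (replace the path with your TensorFlow source checkout):

    (gdb) directory /path/to/tensorflow
    (gdb) bt
    (gdb) list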

After that, happy debugging.
