The TensorFlow OptimizationPassRegistry Mechanism Explained

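The overall pipeline is roughly: Init Graph -> Grappler optimizes the whole graph -> the Graph is split into Partitions by Device -> OptimizationPassRegistry optimizes each Partition -> graph execution. Like other registration mechanisms, OptimizationPassRegistry is built on the concepts of a registry, a registration, and a registrar.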

In the source code, an optimization pass is registered with REGISTER_OPTIMIZATION(), which is implemented as follows:

//core/common_runtime/optimization_registry.h
#define REGISTER_OPTIMIZATION(grouping, phase, optimization) \
  REGISTER_OPTIMIZATION_UNIQ_HELPER(__COUNTER__, grouping, phase, optimization)

#define REGISTER_OPTIMIZATION_UNIQ_HELPER(ctr, grouping, phase, optimization) \
  REGISTER_OPTIMIZATION_UNIQ(ctr, grouping, phase, optimization)

#define REGISTER_OPTIMIZATION_UNIQ(ctr, grouping, phase, optimization)         \
  static ::tensorflow::optimization_registration::OptimizationPassRegistration \
      register_optimization_##ctr(                                             \
          grouping, phase,                                                     \
          ::std::unique_ptr<::tensorflow::GraphOptimizationPass>(              \
              new optimization()),                                             \
          #optimization)

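The macro uses new to create an instance of the optimization being registered and wraps it in a unique_ptr. This unique_ptr is the object the registry manages, and through it the registry indirectly manages the optimization itself. In essence, registration defines a static object of type 'OptimizationPassRegistration' named 'register_optimization_##ctr'. The preprocessor macro '__COUNTER__' (a compiler extension rather than standard C++, but supported by GCC, Clang, and MSVC) is used to generate a unique variable name for each registration.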

The constructor of an `OptimizationPassRegistration` object shows what actually happens: it registers the pass constructed for 'register_optimization_##ctr' with the global 'global_optimization_registry'.
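The registry / registration relationship described above can be sketched in a few lines of Python. This is only an analogy, not TensorFlow's API: `PassRegistry`, `GLOBAL_REGISTRY`, `register_optimization`, `EarlyPass`, and `LatePass` are all hypothetical names, and a class decorator that runs at import time stands in for the static object that the C++ macro defines. The sketch assumes that passes within a grouping run in ascending phase order.

```python
from collections import defaultdict

# Grouping names borrowed from the TensorFlow runtime; everything else is hypothetical.
PRE_PLACEMENT = "PRE_PLACEMENT"
POST_PLACEMENT = "POST_PLACEMENT"


class PassRegistry:
    """Toy registry: grouping -> phase -> list of pass instances."""

    def __init__(self):
        self._passes = defaultdict(lambda: defaultdict(list))

    def register(self, grouping, phase, optimization_pass):
        self._passes[grouping][phase].append(optimization_pass)

    def run_grouping(self, grouping, graph):
        # Assumption of this sketch: lower phase numbers run first, and
        # passes within one phase run in registration order.
        for phase in sorted(self._passes[grouping]):
            for optimization_pass in self._passes[grouping][phase]:
                optimization_pass.run(graph)


# Plays the role of the single global registry.
GLOBAL_REGISTRY = PassRegistry()


def register_optimization(grouping, phase):
    """Class decorator standing in for the REGISTER_OPTIMIZATION macro."""
    def decorator(cls):
        instance = cls()               # like `new optimization()`
        instance.name = cls.__name__   # like the stringified `#optimization`
        GLOBAL_REGISTRY.register(grouping, phase, instance)
        return cls
    return decorator


@register_optimization(PRE_PLACEMENT, phase=10)
class LatePass:
    def run(self, graph):
        graph.append(self.name)  # a real pass would rewrite the graph


@register_optimization(PRE_PLACEMENT, phase=0)
class EarlyPass:
    def run(self, graph):
        graph.append(self.name)
```

Calling `GLOBAL_REGISTRY.run_grouping(PRE_PLACEMENT, graph)` with an empty list as the "graph" records the order in which the passes ran; registration happens as a side effect of defining the classes, much like the static objects in C++ are constructed before main() runs.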


An Overview of the TensorFlow Graph Execution Engine

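TensorFlow users can build their graphs in many languages, but every language API ultimately enters the TensorFlow runtime through the C API. For the runtime code, the C API is effectively its upper boundary. For example, a network described in Python enters the runtime through Python-to-C interfaces like the frames below.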

#9  0x00007f9de118daa4 in PyEval_EvalFrameEx ()
#10 0x00007f9de118f0bd in PyEval_EvalCodeEx ()

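As a whole, TensorFlow can be viewed as a "compiler for a graph language". Like any compiler, which both optimizes and translates, the runtime processes a Graph in stages: graph construction -> graph optimization -> graph execution. Graph optimization is performed together with graph construction.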

Full-graph construction and optimization

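When a Session is first constructed, the dataflow graph defined in application code is converted to GraphDef format and passed through the C API into DirectSession::Extend(). The call stack, for reference: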

PyEval_EvalCodeEx()
  PyEval_EvalFrameEx()
    _wrap_ExtendSession()
      tensorflow::ExtendSession()
        tensorflow::ExtendSessionGraphHelper()
          tensorflow::SessionRef::Extend()
            tensorflow::DirectSession::ExtendLocked()
              tensorflow::DirectSession::MaybeInitializeExecutionState(out_already_initialized/already_initialized)
                if out_already_initialized:
                  return
                flib_def_.reset(new FunctionLibraryDefinition())
                tensorflow::GraphExecutionState::MakeForBaseGraph()
                    std::unique_ptr<GraphExecutionState> ret(new GraphExecutionState(graph_def, options));
                    if (!ret->session_options_->config.graph_options().place_pruned_graph()):
                      tensorflow::GraphExecutionState::InitBaseGraph()
                        OptimizationPassRegistry::Global()->RunGrouping(PRE_PLACEMENT)
                        Placer placer()
                        placer.Run()
                        OptimizationPassRegistry::Global()->RunGrouping(POST_PLACEMENT)
                out_already_initialized = false
              if already_initialized:
                flib_def_->AddLibrary(graph.library())
                std::unique_ptr<GraphExecutionState> state
                execution_state_->Extend(graph, &state)
                execution_state_.swap(state)
    _wrap_TF_SessionRun_wrapper()
      tensorflow::TF_SessionRun_wrapper()
        tensorflow::TF_SessionRun_wrapper_helper()
          TF_SessionRun()
            TF_Run_Helper()
              tensorflow::SessionRef::Run()
                tensorflow::DirectSession::Run()  
                  DirectSession::GetOrCreateExecutors(executors_and_keys)
                    CreateExecutors()
                      std::unique_ptr<ExecutorsAndKeys> ek(new ExecutorsAndKeys); 
                      std::unordered_map<string, std::unique_ptr<Graph>> graphs;
                      CreateGraphs(&graphs)
                        if (options_.config.graph_options().place_pruned_graph()):
                          MakeForPrunedGraph()
                            ret->InitBaseGraph()
                              if (session_options_ && session_options_->config.graph_options().place_pruned_graph()):
                                PruneGraph()
                                  if (options.use_function_convention):
                                    feed_rewrites.emplace_back(new subgraph::ArgFeedRewrite())
                                    fetch_rewrites.emplace_back(new subgraph::RetvalFetchRewrite())
                                    ValidateFeedAndFetchDevices()
                                  else:
                                    feed_rewrites.emplace_back(new subgraph::RecvFeedRewrite())
                                    fetch_rewrites.emplace_back(new subgraph::SendFetchRewrite())
                                  subgraph::RewriteGraphForExecution(graph, feed_rewrites, fetch_rewrites)
                              OptimizationPassRegistry::Global()->RunGrouping(PRE_PLACEMENT)
                              Placer placer()
                              placer.Run()
                              OptimizationPassRegistry::Global()->RunGrouping(POST_PLACEMENT)
                            graph_ = new_graph.release();
                            ret->BuildGraph()
                              OptimizeGraph()
                              if (session_options_ == nullptr || !session_options_->config.graph_options().place_pruned_graph()):
                                PruneGraph()
                              std::unique_ptr<ClientGraph> dense_copy(new ClientGraph)
                        else:
                          execution_state->BuildGraph();
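The call stack above is long, so here is a small Python summary of when each step runs depending on place_pruned_graph. It is only a reading aid derived from the stack: `placement_steps` and `session_steps` are hypothetical helpers, and the step names are plain strings rather than real calls.

```python
def placement_steps():
    """The three steps that InitBaseGraph() performs around placement."""
    return [
        "RunGrouping(PRE_PLACEMENT)",
        "Placer.Run()",
        "RunGrouping(POST_PLACEMENT)",
    ]


def session_steps(place_pruned_graph):
    """Return (steps at Extend time, steps at Run time) as read off the call stack."""
    if place_pruned_graph:
        # Nothing is placed at Extend time; the graph is pruned first
        # (MakeForPrunedGraph) and only the pruned graph is placed.
        extend = []
        run = ["PruneGraph()"] + placement_steps() + ["OptimizeGraph()", "ClientGraph"]
    else:
        # The full graph is placed once at Extend time (MakeForBaseGraph);
        # each Run then optimizes and prunes the already placed graph.
        extend = placement_steps()
        run = ["OptimizeGraph()", "PruneGraph()", "ClientGraph"]
    return extend, run
```

For instance, `session_steps(False)` puts all three placement steps in the Extend list, while `session_steps(True)` moves them into the Run list, after pruning.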

Estimating the Parameter Count of BERT

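According to the BERT paper, the 12-layer transformer has 110M parameters and the 24-layer one as many as 340M. Google has released pre-trained models for both networks, so users only need to add one fully connected layer on top. But if you want to run BERT pre-training yourself, how much GPU memory will it actually take?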

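In TensorFlow, trainable parameters are created with tf.get_variable() (or tf.Variable()). In addition, tf.layers.dense() contains two kinds of parameters internally, a weight and a bias. For example,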

tf.get_variable(shape=[vocab_size, embedding_size], initializer=…)

creates a variable of shape [vocab_size, embedding_size], that is, vocab_size * embedding_size parameters, and

to_tensor_2d = tf.layers.dense(from_tensor_2d, kernel_initializer=...)

contains from_tensor_2d.shape[-1] * to_tensor_2d.shape[-1] weights and to_tensor_2d.shape[-1] biases, for a total of (from_tensor_2d.shape[-1] + 1) * to_tensor_2d.shape[-1] parameters. This is the parameter count of one FC layer.
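The two formulas above are easy to check with plain arithmetic. The helper names `embedding_params` and `dense_params` are made up for this sketch, and the sizes in the usage note (a vocabulary of 30522 and a width of 768) are assumed example values, not numbers taken from this post.

```python
def embedding_params(vocab_size, embedding_size):
    """Parameters in a [vocab_size, embedding_size] variable."""
    return vocab_size * embedding_size


def dense_params(in_dim, out_dim):
    """Parameters in one FC layer: in_dim * out_dim weights plus out_dim biases."""
    return (in_dim + 1) * out_dim
```

With the assumed sizes, `embedding_params(30522, 768)` gives 23,440,896 and `dense_params(768, 768)` gives 590,592, so a single embedding table can easily outweigh dozens of FC layers.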

With the examples above, BERT's parameter count is not hard to analyze. The whole BERT network can be divided into 4 modules: embedding layer + transformer layers + pooled layer + classifier layer
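As a rough cross-check of the 110M figure, the four modules can be tallied in a short script. Everything here is an assumption layered on top of the post: the default sizes are those of the publicly released 12-layer configuration (vocabulary 30522, hidden size 768, feed-forward size 3072, 512 positions, 2 token types), each LayerNorm is counted as 2 * hidden parameters, each transformer layer is taken to hold four hidden-to-hidden FC layers for attention plus a two-layer feed-forward block, and `num_labels` is a hypothetical task setting. `bert_param_counts` is not part of any library.

```python
def dense_params(in_dim, out_dim):
    # One FC layer: weights plus biases.
    return (in_dim + 1) * out_dim


def layer_norm_params(dim):
    # Assumed: one scale and one shift vector per LayerNorm.
    return 2 * dim


def bert_param_counts(vocab_size=30522, hidden=768, num_layers=12,
                      intermediate=3072, max_position=512, type_vocab=2,
                      num_labels=2):
    """Per-module parameter estimate under the assumptions stated above."""
    # Word, position and token-type tables, followed by one LayerNorm.
    embedding = ((vocab_size + max_position + type_vocab) * hidden
                 + layer_norm_params(hidden))
    # Query, key, value and output projections, followed by one LayerNorm.
    attention = 4 * dense_params(hidden, hidden) + layer_norm_params(hidden)
    # Two FC layers (hidden -> intermediate -> hidden), followed by one LayerNorm.
    feed_forward = (dense_params(hidden, intermediate)
                    + dense_params(intermediate, hidden)
                    + layer_norm_params(hidden))
    return {
        "embedding": embedding,
        "transformer": num_layers * (attention + feed_forward),
        "pooler": dense_params(hidden, hidden),
        "classifier": dense_params(hidden, num_labels),
    }
```

Under these assumptions the first three modules add up to 109,482,240 parameters, which is consistent with the roughly 110M quoted from the paper; the classifier adds only (hidden + 1) * num_labels on top.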

embedding-layer
