Tensorflow XLA Service 详解 II

本文主要介绍在XLA service阶段针对HloInstruction做的一些显存优化, 对于训练框架来说, 显存优化的工作至关重要, 主要是由于现阶段GPU+CUDA远没有CPU+Linux组合强大, 后者有完善的建立在虚拟内存基础上的内存管理机制, 内存的高效使用由linux kernel来负责, 即便物理内存不足, 还可以使用swap, 内存压缩等技术确保内存的高效供应, 而在GPU+CUDA里, 这方面的工作很大程度让渡给了程序员自己来搞定, GPU程序接触到的就是物理显存, 如果程序的显存申请超过显存容量, 整个程序就会直接coredump, 此外, 显存本身就集成在GPU板卡上, 无法像内存一样扩展, 而GPU本身造价昂贵, 最后, 在深度学习训练中, 大力出奇迹的现状下, 显存的消耗明显超过的摩尔定律, 这也加剧了显存供求关系的矛盾, 正式由于训练框架做了大量的优化, 才能让模型跑起来.

XLA Service的显存优化设计思想与tensorflow整体一样遵循”静态图”的设计: 先整体优化, 再落地实施. 其中, xla/service/buffer_assignment.cc 是整个显存优化的核心, 在1.14版本中, xla/service/支持两种后端: cpu和gpu, 纷纷针对两种backend有进一步的优化算法, 本文主要针对GPU的优化逻辑进行分析

核心文件

内存优化公共:
xla/service/buffer_assignment 内存优化核心文件
xla/service/buffer_liveness.cc  内存片生命周期分析
GPU相关:
xla/service/gpu/buffer_allocations.cc BufferAllocation的组合
xla/service/gpu/gpu_hlo_scheudle.cc   Hlo的处理顺序, 与显存的优化策略息息相关, 简单地说, 按照BFS并行执行的HloInstruction消耗的显存肯定大于所有的HloInstruction都顺序执行. 

核心抽象

HloSchedule

XxxSchedule是TF的代码风格, 类似的有更底层用于Thunk调度的ThunkSchedule, 以及Service提供的HloSchedule. HloSchedule中最重要的就是封装了HloOrdering.

HloSchedule

XLAService内存优化的本质是处理LoigicalBuffer和BufferAllocation之间的关系, 原则是使用尽可能少的BufferAllocation去承载尽可能多的LogicalBuffer, 而如何使用的更少, 就涉及到了对Hlo图的分析, 就涉及到了Ordering的问题, 使用不同策略生成Ordering, 直接影响两个LogicalBuffer之间的约束关系, 最简单的, 在图遍历中, 使用DFS和BFS的2种方式遍历会导致图上节点的内存依赖关系大有不同.

HloOrdering是描述HloInstruction加载序列的基类, 派生类有PredecessorHloOrdering, DependencyHloOrdering 和 SequentialHloOrdering, 其中, DependencyHloOrdering基于依赖关系, 所以可以并行, 性能更高, 但耗更多的内存, 而SequentialHloOrdering完全串行, 性能相对低, 但可以节约更多内存, 而 PredecessorHloOrdering 是个虚类, 需要子类进一步填充predecessors_, 这也是GPU后端使用的方式.不同的Ordering会影响内存的依赖关系, 进一步影响Launch到GPU后Kernel的执行序列. 

// An HLO ordering based on data dependencies in the HLO graph. In this partial
// order, instruction A executes before instruction B only if there is a path
// from A to B in the HLO graph. For example, given the following graph:
/*
          param
         /     \
      negate   exp
          \    /
           add
*/
// DependencyHloOrdering gives the following executes-before relations:
//   param executes before negate, exp, and add
//   negate executes before add
//   exp executes before add
//   add executes before nothing
// negate and exp are not ordered because the dependencies allow either to
// execute before the other (or in parallel). DependencyHloOrdering ordering
// allows maximum parallelism and enables any execution order which satisfies
// data dependencies. This requires pessimistic assumptions about buffer live
// ranges and can result in more memory used than more constrained orderings.
class DependencyHloOrdering : public PredecessorHloOrdering {

// An HLO ordering based on a total order of instructions in each computation.
// The computation total order is a sequencing of all of its instructions in
// the computation (eg, {inst0, inst1, inst2,...}) as in single-threaded
// execution. For example, given the following HLO graph:
/*
          param
         /     \
      negate   exp
          \    /
           add
*/
// and the following sequence:
//
//  {param, negate, exp, add}
//
// SequentialHloOrdering gives the following executes-before relations:
//   param executes before negate, exp, and add
//   negate executes before exp and add
//   exp executes before add
//   add executes before nothing
// This is more constrained than DependencyHloOrdering in this example because
// negate and exp are ordered (negate before exp). This enables param to share
// the same buffer as exp (param buffer is dead after exp). Generally, this
// ordering enables more buffer sharing (reduced memory usage) because buffer
// interference is reduced relative to DependencyHloOrdering.
class SequentialHloOrdering : public HloOrdering {  

GpuHloSchedule

Build()      构造GpuHloSchedule单例, 同时会根据Stream的数量采用BFS或Sequence的LaunchOrder,, 这个LaunchOrder会用于构造GpuHloOrdering, 以及thunk_launch_order_
ThunkLaunchOrder() 拿到上述thunk_launch_order_
ConsumeHloOrdering() 拿到上述GpuHloOrdering

GpuHloOrdering

class GpuHloOrdering : public PredecessorHloOrdering : public HloOrdering

在使用Multi-stream的情况下, 一方面stream no的分配算法会遍历图给每个HloInstruction分配不同的stream no, 另一方面, GpuHloOrdering也会据此选择BFS算法来生成Ordering存入thunk_launch_order以及HloOrdering.predecessors_ , 后续的显存优化会据此优化显存, 确保Multi-Stream在实际执行的过程中不会产生显存问题.  

BufferAssigner

BufferAssigner 用于构造BufferAssignment对象, 可以理解为BufferAssignment的wrapper

BufferAssignment

内存管理入口类. 所有关于内存优化的内容均在其Run()中实现, 在nvptx_compiler.cc:RunBackend()调用

BufferLiveness

描述一块内存的生命周期, “computes liveness of the output buffers of HLOs and their interference”, 为显存优化的决策提供支持. BufferLiveness 的构造过程经历3层分析, 从内而外, 即从前到后:
LogicalBufferAnalysis: 构造LogicalBuffer, 每个Instruction都可以有多个LogicalBuffer, 通过Accept机制遍历所有的Instruction
TuplePointsToAnalysis: 构造PointToSet, 需要之前的LogccalBuffer,  最终也是通过Accept机制遍历所有的Instruction二者其实的都是Hlo的DFS遍历handler

class TuplePointsToAnalysis : public DfsHloVisitorWithDefault
class LogicalBufferAnalysis : public DfsHloVisitorWithDefault 
NVPTXCompiler::RunBackend()
  BufferAssigner::Run()
    assigner::CreateAssignment()
      liveness = BufferLiveness::Run(module, std::move(hlo_ordering)  //class BufferLiveness
        liveness = new BufferLiveness()
        liveness->Analyze()
          points_to_analysis_ = TuplePointsToAnalysis::Run()
            logical_buffer_analysis = LogicalBufferAnalysis::Run()
              analysis = new LogicalBufferAnalysis()
              analysis.Analyze()
              return analysis
            analysis = new TuplePointsToAnalysis(logical_buffer_analysis)
            analysis.Analyze()
            return analysis
          maybe_live_out_buffers_ = points_to_analysis_->GetPointsToSet(root).CreateFlattenedSet()
        return liveness
      assignment(new BufferAssignment(module, std::move(liveness)

-2- 规划内存的总入口, 注意入参
-15- liveout的部分只发生在root instruction(TODO)

LogicalBuffer

“虚拟内存”, LogicalBuffer 对内存需求的抽象, 内存优化的实质就是调整LogicalBuffer和BufferAllocation的关系, 将来会被HloValue替换掉. 同时, 整个内存优化过程都使用color的概念, 相同的color表示可以融合, 这点在BufferAllocation和BufferSet中都有用到, 显然, 处于一个BufferAllocation之下的LogicalBuffer都有相同的color, 要不也不会在一起, 相应的BufferAllocation实例也使用想用的color进行标记
什么样的两个LogicalBuffer会被分配给同一个BufferAllocation?
1. kWhile, kCall, kConditional 的LogicalBuffer
2. `cannot_merge_buffer_sets()`: !(两个buffer的id不同(不是一个buffer) && 会发生干涉)

void LogicalBufferAnalysis::NewLogicalBuffer(HloInstruction* instruction,  const ShapeIndex& index) {
  CHECK_EQ(logical_buffers_.size(), next_buffer_id_);
  logical_buffers_.emplace_back( absl::make_unique<LogicalBuffer>(instruction, index, next_buffer_id_));
    //LogicalBuffer
    : BufferValue(instruction, index, id),
      id_(id)
      const Shape& shape = ShapeUtil::GetSubshape(instruction->shape(), index);
    index_(index) {}
  output_buffers_[std::make_pair(instruction, index)] = logical_buffers_.back().get();
  ++next_buffer_id_;
}

-1- LogicalBuffer只存Shape(确切的是SubShape) id 和inst指针
-3- 所有的LogicalBuffers对象都会丢到, LogicalBufferAnalysis::logical_buffers中
-7- 所以, 一个LogicalBuffer只对应一个Subshape的内存, 不是整个Instruction

BufferAllocation

BufferAllocation  物理内存的抽象, 一个 BufferAllocation 实例最终对应一块连续物理内存, 通常, 一块物理内存会被多个不干涉的LogicalBuffer公用.一个BufferAllocation实例可以分为:

Status BufferAssignment::ComputeSummaryStats() {
  for (auto& allocation : Allocations()) {
    if (allocation.is_entry_computation_parameter()) {
      stats_.parameter_allocation_count++;
      stats_.parameter_allocation_bytes += allocation.size();
    }
    if (allocation.is_constant()) {
      stats_.constant_allocation_count++;
      stats_.constant_allocation_bytes += allocation.size();
    }
    if (allocation.maybe_live_out()) {
      stats_.maybe_live_out_allocation_count++;
      stats_.maybe_live_out_allocation_bytes += allocation.size();
    }
    if (allocation.IsPreallocatedTempBuffer()) {
      stats_.preallocated_temp_allocation_count++;
      stats_.preallocated_temp_allocation_bytes += allocation.size();
    }
    stats_.total_allocation_count++;
    stats_.total_allocation_bytes += allocation.size();

BufferAllocation::Slice

BufferAllocation里的一片, 关联在该BufferAllocation的每一个LogicalBuffer都会有对应的Slice

GetSlice 根据 buffer,  重新构造一个 Slice 对象

BufferAllocation::Slice BufferAllocation::GetSlice(
    const LogicalBuffer& buffer) const {
  const OffsetSize os = FindOrDie(assigned_buffers_, &buffer);
  return Slice(this, os.offset, os.size);
}

ShapeIndex

一个 ShapeIndex 对应的是一个 LogicalBufferList, 真正使用的时候是用的 ShapeIndex 来索引到buffer

StatusOr<BufferAllocation::Slice> BufferAssignment::GetUniqueSlice
  for (const LogicalBuffer* buffer : GetPointsToSet(instruction).element(index))
    //elements()
    return tree_.element(index).buffers

-4- index是ShapeIndex, 描述一个Shape的索引, 不是Intuction. buffers 是 BufferList,  此处可以看到, 直接用index来索引得到的BufferList, 

   IrEmitterUnnested::BuildInitializerThunk()
       return {absl::make_unique<MemzeroThunk>(a, nullptr)};

-2- 使用 ShapeIndex 索引 直接->Allocation::Slice->Thunk, 

BufferAllocations

GPU后端对一组BufferAllocation的抽象

 std::vector<se::DeviceMemoryBase> buffers_

-1- 物理内存对象, se::DeviceMemoryBase是StreamExecutor对一块线性内存资源的抽象,  提供了最基础offset+size的计算方法

StatusOr<std::unique_ptr<BufferAllocations>> BufferAllocations::Builder::Build()
  for BufferAllocation::Index i in num_buffers:  
    BufferAllocation allocation = buffer_assignment->GetAllocation():
    if address  already registered:
      buffer_allocations->SetBuffer(i, *address);
    if (allocation.maybe_live_out() || allocation.IsPreallocatedTempBuffer()):
      se::DeviceMemoryBase buffer_address;
      buffer = memory_allocator->Allocate(device_ordinal, buffer_size)
      buffer_address = buffer.Release();
      buffer_allocations->SetBuffer(i, buffer_address);

-2- 每个 buffer 都有一个 BufferAllocation
-8- 真正分配新的buffer

优化过程

   2 NVPTXCompiler::RunBackend()
   3   hlo_schedule = GpuHloSchedule::Build(*module, *stream_assignment, pointer_size_))
   4     std::unique_ptr<GpuHloSchedule> schedule(new GpuHloSchedule)
   5   //this analysis figures out which temp buffers are required to run the computation
   6   BufferAssigner::Run(hlo_schedule->ConsumeHloOrdering(), BufferSizeBytesFunction(), BufferL
   7     //BufferSizeBytesFunction()
   8     shape_size = ShapeSizeBytesFunction()
   9       ShapeUtil::ByteSizeOf(shape, pointer_size);
  10         ByteSizeOfTupleIndexTable(shape, pointer_size);
  11           return pointer_size * shape.tuple_shapes_size();
  12     //Run()
  13     BufferAssigner assigner
  14     assigner.CreateAssignment(HloModule, hlo_ordering, buffer_size)
  15       liveness = BufferLiveness::Run(module, std::move(hlo_ordering)
  16         liveness = new BufferLiveness()
  17           liveness->Analyze()
  18             points_to_analysis_ = TuplePointsToAnalysis::Run() 
  19               logical_buffer_analysis = LogicalBufferAnalysis::Run(module)
  20                 std::unique_ptr<LogicalBufferAnalysis> analysis(new LogicalBufferAnalysis()
  21                 analysis->Analyze();
  22                   NewLogicalBuffer()
  23               analysis(new TuplePointsToAnalysis(module, logical_buffer_analysis.ConsumeValu
  24               analysis->Analyze()
  25                 for computation in module_->MakeNoFusionComputations():
  26                   computation->Accept(this)
  27                   for instruction in computation->instructions():
  28                     if instruction is not kFusion:
  29                       continue
  30                     GatherFusionInstructions(instruction, fusion_instructions)
  31                 for instruction in fusion_instructions:
  32                   instruction->fused_expression_root()->Accept(this)
  33               return analysis
  34             for computation in module->computation():
  35               if computation->IsFusionComputation():
  36                 continue
  37               for instruction in computation->instructions():
  38                 for alias_buffer in points_to_analysis_->GetPointsToSet().CreateFlattenedSet
  39                   if aliased_buffer->instruction() != instruction:
  40                     alias_buffers_.insert(aliased_buffer)
  41               if computation == module_->entry_computation():
  42                 maybe_live_out_buffers_ = points_to_analysis_->getPointsToSet().CreateFlatte
  43         return liveness
  44       assignment(new BufferAssignment(module, std::move(liveness)
  45
  46       flat_hash_set<const LogicalBuffer*> colocated_buffers;
  47       BuildColocatedBufferSets(&colocated_buffer_sets)
  48         points_to_analysis = buffer_liveness.points_to_analysis()
  49         module->input_output_alias_config().ForEachAlias([]{AddSetToColocatedBufferSets()})
  50         for computation in module->MakeComputationPostOrder():
  51           for instruction in computation->MakeInstructionPostOrder():
  52             if instruction->opcode()== HloOpCode::kWhile:
  53               AddBufferToColocatedSet()
  54               AddSetToColocatedBufferSets()
  55             else if == HloOpCode::kCall:
  56               //...
  57             else if == HloOpcode::kConditional:
  58               //...
  59         // Given a list of colocated buffer sets (each colocated buffer set represents
  60         // the logical buffers that would be assigned to the same physical buffer),
  61         // try to merge the sets if the buffers can be shared. Returns the merged set.
  62         new_colocated_buffer_sets = MergeColocatedBufferSets()
  63           // Given the interference map of a graph (the list of interfering node indices
  64           // for each node), perform graph coloring such that interfering nodes are
  65           // assigned to different colors. Returns the assigned color of the nodes, where
  66           // the colors are represented as integer values [0, color_count).
  67           std::vector<int64> ColorInterferenceGraph()
  68           // Assign a color to each colocation set in colocated_buffer_sets, such that
  69           // the sets that can be merged are assigned with the same color.
  70           auto assigned_colors = ColorInterferenceGraph(interference_map);
  71             for node in nodes: //nodes: 0 ~ node_count-1
  72               for neighbor in interference_map[node]:
  73                 color = assigned_colors[neighbor]
  74                 if color != kColorUnassigned:
  75                   available_colors[color] = false
  76                 ...
  77           std::vector<ColocatedBufferSet> new_colocated_buffer_sets(num_sets)
  78           for i in colocated_buffer_sets.size():
  79             buffer_sets = colocated_buffer_sets[i]
  80             new_colocated_buffer_sets[assigned_colors[i]].insert(buffer_set.begin(), buffer_
  81           return new_colocated_buffer_sets;
  82         swap(colocated_buffer_sets, new_colocated_buffer_sets)
  83       colorer_(assignment->liveness())
  84       GatherComputationsByAllocationType(module, &thread_local_computations, &global_computa
  85       // For each buffer set in 'colocated_buffer_sets', assigns all buffers in the
  86       // same set to the same buffer allocation in 'assignment'.
  87       AssignColocatedBufferSets(colocated_buffer_sets, &colocated_buffers, &colocated_alloca
  88         for colocated_buffer_set in colocated_buffer_sets:
  89           BufferAllocation* allocation = nullptr
  90           for LogicalBuffer buffer in colocated_buffer_set:
  91             instruction-> buffer->instruction()
  92             computation = instruction->parent()
  93           for LogicalBuffer buffer in colocated_buffer_set:
  94             buffer_size = assignment->buffer_size_(*buffer);
  95             if allocation == nullptr:
  96               allocation = assignment->NewAllocation(*buffer, buffer_size)
  97               colocated_allocations->insert(allocation->index())
  98             else:
  99               assignment->AddAssignment(allocation, *buffer, 0, buffer_size)
 100                 allocation->AddAssignment(buffer, offset, size);
 101                   assigned_buffers_.emplace(&buffer, offset_size);
 102             colocated_buffers->insert(buffer)
 103       GatherComputationsByAllocationType(module, &thread_local_computations, &global_computa
 104       // First assign buffers for global computatations. Temporary buffers for
 105       // sequential computations are collected in 'buffers_to_assign_sequentially'.
 106       flat_hash_map<const HloComputation*, flat_hash_set<const LogicalBuffer*>>buffers_to_as
 107       for computation in global_computations:
 108         // Assigns buffers to the instructions in the given computation. "assignment"
 109         // is modified to reflect the new buffer assignments. If is_thread_local is
 110         // true, then all assigned buffers have the is_thread_local flag set to
 111         // true.
 112         AssignBuffersForComputation(colocated_buffers, buffers_to_assign_sequentially, / *is
 113           std::vector<const LogicalBuffer*> sorted_buffers;
 114           for (auto* instruction : computation->instructions()):
 115             for buffer in assignment->points_to_analysis().GetBuffersDefinedByInstruction(in
 116               sorted_buffers.push_back()
 117           for instruction : computation->MakeInstructionPostOrder():
 118             post_order_position.emplace(instruction, position);
 119           // Every sequential computation must get an entry in the
 120           // buffers_to_assign_sequentially map, even if we end up with an empty set
 121           // of buffers. This ensures we can correctly determine whether to run
 122           // whole-module heap simulation.
 123           buffers_to_assign_sequentially->emplace(computation, flat_hash_set<const LogicalBu
 124           std::vector<BufferAllocation::Index> new_allocation_indices;
 125           // A sorted multimap from size to indices of colocated allocations.
 126           std::multimap<int64, BufferAllocation::Index> colocated_allocation_size_to_indices
 127           for index in colocated_allocations:
 128             for buffer_offset_size in assignment->GetAllocations(index).assign_buffers():
 129               if IsTuple():
 130                 consider_reuse = false
 131                 break;
 132             if considering_reusing:
 133               sorted_colocated_indices.push(index)
 134           while !sorted_colocated_indices.empty():
 135             auto index = sorted_colocated_indices.top()
 136             sorted_colocated_indices.pop()
 137             // Returns the size of the allocation. Necessarily this must be at least as
 138             // large as any LogicalBuffer assigned to this allocation.
 139             colocated_allocation_size_to_indices.emplace(assigne...size(), index)
 140           for buffer in sorted_buffers:
 141             if colocated_buffers.contains(buffer)
 142               continue
 143             instruction = buffer->instruction()
 144             // Function which returns the buffer size for a given logical buffer (shape)
 145             buffer_size = assignment->buffer_size_(*buffer)
 146             if instruction->opcode() == HloOpcode::kConstant:
 147               BufferAllocation* allocation = assignment->NewAllocation(*buffer)
 148             ...
 149             if ... ==
 150             if ... ==
 151             if ... ==
 152             if buffer->...:
 153               for operand in instruction->operands():
 154                 for operand_slice in assignment->GetAllSlices():
 155                   BufferAllocation allocation = assign.GetMutableAllocation(operand_slice.in
 156                   if(MaybeAssignBuffer(allocation, buffer, assignment)):
 157                     //bool BufferAssigner::MaybeAssignBuffer()
 158                       ...
 159                     assigned_operand = true
 160                 if assigned_operand
 161                   break
 162             if reuse_colocated_checker_ ... && !assignment->HasAllocation(*buffer):
 163               it = colocated_allocation_size_to_indices.lower_bound(buffer_size)
 164               while(it != colocated_allocation_size_....)
 165                 allocation = assignment->GetMutableAllocation(it->second)
 166                 if (MaybeAssignBuffer(allocation)):
 167                   colocated_allocation_size_to_indices.erase(it)
 168                   break
 169                 ++it;
 170             if(!assignment->HasAllocation(*buffer)):
 171               // Find the smallest buffer which can be reused iterating from end of
 172               // new_allocation_indices (smallest) to beginning (largest).
 173               for allocation_index = new_allocation_indices.size() - 1; allocation_index >=
 174                 BufferAllocation* allocation = assignment->GetMutableAllocation(new_allocati
 175                 if MaybeAssignBuffer(allocation, *buffer, assignment):
 176                   break
 177             if !assignment->HasAllocation(buffer) && has_sequential_order && !liveness.Maybe
 178               // There is a sequential instruction ordering, so we delay assignment of
 179               // temp buffers until after the loop. We do this right before we decide to
 180               // create a new allocation, to ensure we've exhausted all the buffer
 181               // re-use cases above.
 182               //
 183               // Entry parameters and thread local buffers were already handled earlier
 184               // in this loop iteration.  See BufferAllocation::IsPreallocatedTempBuffer
 185               // for the definition of temp buffers.
 186               (*buffers_to_assign_sequentially)[computation].insert(buffer);
 187               continue
 188             if !assignment->HasAlocation()
 189               allocation = assigment->NewAllocation(buffer, buffer_size)
 190               new_allocation_indices.push_back(allocation->index())
 191       // Assigns 'buffers_to_assign_sequentially' using heap simulation, assuming
 192       // the HLO instructions will be executed in the sequential order given by
 193       // assignment->liveness().hlo_ordering().SequentialOrder. If
 194       // 'run_whole_module_heap_simulation' is true, the heap simulation will be run
 195       // assuming all global computations are sequentially ordered.
 196       // Assign buffers with sequential ordering, if any. If all global computations
 197       // are sequential, we can run heap simuation on the whole module, which
 198       // reduces memory usage.
 199       AssignBuffersWithSequentialOrdering()
 200         HloSchedule schedule(&assignment->module());
 201         hlo_ordering = assignment->liveness().hlo_ordering();
 202         auto get_heap_algorithm = ...
 203         if (run_whole_module_heap_simulation):
 204           for (const auto& pair : buffers_to_assign_sequentially):
 205             HloInstructionSequence* instruction_sequence = hlo_ordering.SequentialOrder(*com
 206             schedule.set_sequence()
 207             all_buffers_to_assign.insert()
 208             color_map = SplitBuffersByColor()
 209           for (auto& single_colored_set : color_map) : 
 210             result = HeapSimulator::Run()
 211             AssignBuffersFromHeapSimulator()
 212         else:
 213           for (const auto& pair : buffers_to_assign_sequentially):
 214             instruction_sequence = hlo_ordering.SequentialOrder(*computation)
 215             color_map = SplitBuffersByColor(buffers_to_assign)
 216             for (auto& single_colored_set : color_map):
 217               buffer_value_set = ToBufferValueFlatSet(single_colored_set.second)
 218               options.buffers_to_assign = &buffer_value_set;
 219               result = HeapSimulator::Run(get_heap_algorithm(alignment),
 220                           assignment->points_to_analysis(),
 221                           assignment->buffer_size_)
 222                 HeapSimulator heap(std::move(algorithm), size_fn, options, &schedule)
 223                 const HloComputation* entry_computation = module.entry_computation();
 224                 HloInstructionSequence& instruction_sequence = schedule.sequence(entry_compu
 225                 heap.RunComputation(entry_computation, instruction_sequence, points_to_analy
 226                   for instruction in instruction_sequence.instructions():
 227                     points_to = points_to_analysis.GetPointsToSet(instruction);
 228                     for user in instruction.users():
 229                       ...
 230                   for instruction in instruction_sequence.instructions():
 231                     buffers_defined_by_instruction = points_analysis.GetBufferDefinedByInstr
 232                     for buffer in buffers_defined_by_instruction:
 233                       ...
 234                     for oprand_buffer in used_buffers[instruction]:
 235                       ...
 236                     for buffer in buffers_defined_by_instruction:
 237                       if !shared: -->
 238                         alloc_size_by_instruction += size_fn_(*buffer);
 239                         Alloc(buffer, instruction);
 240                 heap->finish()
 241                   Result result = algorithm_->Finish();
 242               AssignBuffersFromHeapSimulator(result)
 243                 BufferAllocation* allocation = assignement->NewEmptyAllocation(result.heap_s
 244                 for (const auto& buffer_chunk : result.chunk_map):
 245                   assignment->AddAssignment(allocation, buffer, chunk.offset, chunk.size);
 246                     allocation->AddAssignment(bufferm, offset, size)
 247                 allocation->peak_buffers_ = ComputePeakMemoryLogicalBuffers(*allocation, res
 248       for computation in thread_local_computation:
 249         if computation IsFusionComputation():
 250           continue
 251         AssignBuffersForComputation(/*is_thread_local=*/true)
 252       for buffer in assignment_.liveness().maybe_live_out_buffers()
 253         if assignement->HasAllocation():
 254           alloc = assignment->GetMutableAssignedAllocation()
 255           alloc -> set_maybe_live_out(true)
 256       assignment->CombineTempAllocations();
 257         std::partition(allocations_.begin(), allocations_.end(),
 258                      [](const BufferAllocation& allocation) {
 259                        return !allocation.IsPreallocatedTempBuffer();
 260                      });
 261         allocations_.erase(first_temp_it, )
 262         for combined:combined_allocation_map:
 263           allocations_.push_back()
 264         for index in allocations_.size():
 265           allocation->set_index(index):
 266             for buffer_offset_size in allocation->assigned_buffers_:
 267               LogicalBuffer buffer = buffer_offset_size.first
 268               allocation_index_for_buffer_[buffer] = index
 269   ir_emitter_context();
 270   IrEmitterUnnested ir_emitter
 271   entry_computation->Accept(&ir_emitter)
 272   ptx = CompileToPtx()
 273   const std::vector<uint8> cubin = CompilePtxOrGetCachedResult()
 274   ir_emitter.ConsumeThunkSequence(), std::move(stream_assignment), hlo_schedule->ThunkLaunch
 275   auto* gpu_executable = new GpuExecutable()
 276   gpu_executable->set_ir_module_string(ir_module_string_before_opt); 

-2- 这个backend其实不是llvm的backend 而是HLO的backend?
-6- //buffer_assignment.cc , 都是对元数据的管理, 没有真的分配, 包含了所有的内存管理代码, 显存优化的入口, 这里ConsumerHloOrdering()返回的是hlo_ordering
-8- 就是下文的buffer_size
-11- 对于一个tuple描述的shape, size的计算
-15- 遍历 module//buffer_liveness.cc  –> DFS analysis, 构造一个liveness, 需要针对HloModule的TuplePointsToAnalysis::Run()->LogicalBufferAnalysis::Run()等一系列分析, 
-19- 遍历 HloModule, 为每个HloInstruction构造相应的Logicalbuffer, 存入LogicalBufferAnalysis, 并返回其指针, `class LogicalBufferAnalysis : public DfsHloVisitorWithDefault`, 说明也是用来遍历的handler
-21- 遍历节点, 执行各种handler, 建立logicalbuffer和instruction及其subshape的关系, 并存储在logicalbuffer中
-22- 构造LogicalBuffer对象
-26- 所有非fusion的buffer构造完毕
-27- 这个循环收集所有的fusion_inst
-30- gather 该computation下的所有的fusion instruction
-32- 给fusion的inst分配logical buffer
-44- 构造BufferAssignment
-52- 当前只有这3种op可以colocated
-53- 将这个instruction对应的LogicalBuffer放到BufferSet
-54- 将BufferSet放到BufferSets
-60- colocate意味着最终分配的是同一块物理内存
-62- 使用邻接矩阵的方式表示一张图, 用int64标识一个节点. 遍历每一个node, 首先遍历其每一个neighbor, 这个neighbor不是相邻节点, 而是通过 cannot_merge_buffer_set 设置的, 比如id()不同&&… 他们用过的color就标记为不可用, 取一个neighbor都没用过的color, 最小未用, 给这个node, 最终的效果是, 每一个node和他的neighbors节点的颜色都不同, 但可能和其他节点相同, 相邻的不能共用buffer, 猜测应该是不能共用输入及输出, 前面的输出还是可以作为后面的输入的. 最终, 就是在这里解决内存冲突: 不相邻的标记为相同颜色, 说明后续可能可以合并. MergeColocatedBufferSets 遍历所有的BufferSet, 将能合并的set合并了,  合并之前, 每个BufferSet都和Instruction/node关联, 合并之后, BufferSets中的一个Set里的buffer就不是和instruction强相关了, 可能是几个合在一起的, 不过, 不存在把一个BufferSet拆成几个的情况, 两个bufferset, 要么能合并, 要么不能合并, 不会把一个BufferSet内的LogicalBuffer拆分. 
-70- interference_map表示vector的vector,  表征的是colocated_buffer_sets中两个set的关系,  而从这里看, 每个set里存储的一组logicalbuffer其实都是一个node使用, 这一点其实也可以从colocated_buffer_sets的构造过程可以看出: 3种op每种op都会先添加一组上logical_buffer到colocated_buffer_set, 再把set放到sets中
-72- 将 interference_map中表征的有冲突的node用不同的color表示
-77- num_sets是最大的color值, 即能将nodes划分为互斥最小集合的数量
-80- 将不冲突的, 即color相同的node都放一起
-87- AssignColocatedBufferSets, 代码里的AssignXXX都是构造/分配BufferAllocation的意思, 比如这个 AssignColocatedBufferSets, 就是根据BufferSets分配BufferAllocation, 遍历colocated_buffer_sets 里的每一个Set里的每一个LogicalBuffer, 将一个Set的LogicalBuffer都放在一个BufferAllocation中, 一个Allocation里的LogicalBuffer不一定一样大, 将二者关联的时候会传入buffer_size的, 此时, offset都还是0 
-92- Module->computation->instruction的嵌套关系
-94- 根据assigner构造时传入的函数对象计算,  这里是遍历所有的buffersize, 将相同的合并到一个allocation中
-95- 为 buffer 分配allocation
-96- **每个allocation内的buffer或buffer_size是固定的?**
-99- 将这个buffer 赋给已有BufferAllocation对象, 这里 offset是0, size就是buffer_size, 是第一次存储这两个值
-106- 第一个顺序执行的点
-107- 至此, 就完成了初始的内存分配代码, 接下来 AssignBufferForComputation(), 就是将 Computation都关联到Allocation, 首先是 global_computations, 遍历所有的Instruction->points_to_analysis->GetBuffersDefinedByInstruction(), 将所有的Buffer按照从大到小排序, 之所以从大到小排序, 应该和 BufferAllocation的Size 要比里面的所有 Buffer的size 都大的原因, 这样就能充分利用 allocation, 毕竟越少越好, 前面的用大块 buffer 搞了大的, 后面就可以乘凉, NewAllocation 本身就会将自己链入Assignment, 总之, 这里会通过一系列判断, 将大部分buffer存入尽可能少的Allocation中. global_computations 进行Assign的过程中, 有一些buffer被加入了sequential的范围,  这部分buffer在AssignBuffersWithSequtialOrdering, 同样, 这里Assign也是找Allocation的意思, 这里会用到堆模拟器, 会计算很多offset, size之类的, 将buffer加入到Heap中, `NewEmptyAllocation(result.heap_size,` 就能看出, 此外, 由于根据hlo_ordering, stream_no> 1时不会所有的都是顺序的, 所以, 会存在根据hlo_ordering最终有多个chunk的情况, 更具体的, 一个computation对应多个LogicalBuffer, 并进行一个simulate, 得到一组chunk, 和一个allocation. 
-112- 有关于顺序执行的代码
-116- 将传入的computation下的每一个instruction的每一个LogicalBuffer都入队
-118- 保存传入的computation下每一个instruction及其位置
-130- 根据注释, :// Output tuple table may be allocated at run-time, so make sure we don’t // overwrite them.
-140- 处理在sorted_buffer中, 但不在 colocated_buffers 中的, 前文的 colocated_buffers 只有3种HloOpCode可以用
-155- 获取每一个已有的Allocation
-161- 说明该buffer已经找到了 allocation
-162- 说明之前没有找到能用的
-163- 找到最接近的大于的key
-172- 从小到大(即从后向前) 找合适的allocation存储buffer
-186- 不是liveout的buffer, 即temp的buffer, 先存着, 没有相应的Allocation, 这里存的是comutation和buffer们的映射关系
-189- 实在不行了, 分配个新的Allocation
-193- 计算一个buffer的 live out 或 interfere 等
-199- buffer 顺序执行, 重点关注
-202- 这里用的LazyBestFitHeap, ChoostBestHeapAlgorithm
-208- 默认构造一个buffer只有0, 通过default_colorer(), 还有其他逻辑导致可能没有0吗?-213- out/trace.log:2019-08-02 10:44:04.437956: I tensorflow/compiler/xla/service/buffer_assignment.cc: 1262] Running per-computation heap simulation
-215- 这里的是LogicalBuffer->BufferValue里的color, 不是前文用来合并BufferSet的color, 是通过buffer_liveness.h 中的 DefaultColorer赋值的. 这部分buffers都有color, 但没有被colocate合并过的, 也没有被合并单奥已有allocation中的temp buffer
-216-  single_colored_set里的buffer是可以合并的.
-237- 分析时候可以共享内存
-243- 用整个优化过的堆大小来定义新的allocation, 一个allocation里是可以存放很多不同offset和size的buffer的. allocation有自己的size, 全文来看, 同样的color分配一个allocation. 那直接取最大的不就好了。 所以最终, Assignment里存储的tmp buffer是最小数量的Allocation和他的size, 用color区分 + 之前3种已经处理好的Op,  + 其他
-245- 会存入allocation
-256- 把多个allocation里的buffer存到同一个combined_allocation里
-257- 划分的原则
-270- XLA HLO Visitor机制
-271- 使用上述代码构造的buffer来构造Thunk




Related:
Tensorflow XLA Service Buffer优化详解
Tensorflow XLA Service 详解 II
Tensorflow XLA Service 详解 I
Tensorflow XLA Client | HloModuleProto 详解
Tensorflow XlaOpKernel | tf2xla 机制详解
Tensorflow JIT 技术详解
Tensorflow JIT/XLA UML
Tensorflow OpKernel机制详解
Tensorflow Op机制详解
Tensorflow Optimization机制详解
Tensorflow 图计算引擎概述

One comment on “Tensorflow XLA Service 详解 II

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Discover more from sketch2sky

Subscribe now to keep reading and get access to the full archive.

Continue reading