本文主要介绍在XLA service阶段针对HloInstruction做的一些显存优化, 对于训练框架来说, 显存优化的工作至关重要, 主要是由于现阶段GPU+CUDA远没有CPU+Linux组合强大, 后者有完善的建立在虚拟内存基础上的内存管理机制, 内存的高效使用由linux kernel来负责, 即便物理内存不足, 还可以使用swap, 内存压缩等技术确保内存的高效供应, 而在GPU+CUDA里, 这方面的工作很大程度让渡给了程序员自己来搞定, GPU程序接触到的就是物理显存, 如果程序的显存申请超过显存容量, 整个程序就会直接coredump, 此外, 显存本身就集成在GPU板卡上, 无法像内存一样扩展, 而GPU本身造价昂贵, 最后, 在深度学习训练中, 大力出奇迹的现状下, 显存的消耗明显超过的摩尔定律, 这也加剧了显存供求关系的矛盾, 正式由于训练框架做了大量的优化, 才能让模型跑起来.
XLA Service的显存优化设计思想与tensorflow整体一样遵循”静态图”的设计: 先整体优化, 再落地实施. 其中, xla/service/buffer_assignment.cc 是整个显存优化的核心, 在1.14版本中, xla/service/支持两种后端: cpu和gpu, 纷纷针对两种backend有进一步的优化算法, 本文主要针对GPU的优化逻辑进行分析
核心文件
内存优化公共:
xla/service/buffer_assignment 内存优化核心文件
xla/service/buffer_liveness.cc 内存片生命周期分析
GPU相关:
xla/service/gpu/buffer_allocations.cc BufferAllocation的组合
xla/service/gpu/gpu_hlo_scheudle.cc Hlo的处理顺序, 与显存的优化策略息息相关, 简单地说, 按照BFS并行执行的HloInstruction消耗的显存肯定大于所有的HloInstruction都顺序执行.
核心抽象
HloSchedule
XxxSchedule是TF的代码风格, 类似的有更底层用于Thunk调度的ThunkSchedule, 以及Service提供的HloSchedule. HloSchedule中最重要的就是封装了HloOrdering.
HloSchedule
XLAService内存优化的本质是处理LoigicalBuffer和BufferAllocation之间的关系, 原则是使用尽可能少的BufferAllocation去承载尽可能多的LogicalBuffer, 而如何使用的更少, 就涉及到了对Hlo图的分析, 就涉及到了Ordering的问题, 使用不同策略生成Ordering, 直接影响两个LogicalBuffer之间的约束关系, 最简单的, 在图遍历中, 使用DFS和BFS的2种方式遍历会导致图上节点的内存依赖关系大有不同.
HloOrdering是描述HloInstruction加载序列的基类, 派生类有PredecessorHloOrdering, DependencyHloOrdering 和 SequentialHloOrdering, 其中, DependencyHloOrdering基于依赖关系, 所以可以并行, 性能更高, 但耗更多的内存, 而SequentialHloOrdering完全串行, 性能相对低, 但可以节约更多内存, 而 PredecessorHloOrdering 是个虚类, 需要子类进一步填充predecessors_, 这也是GPU后端使用的方式.不同的Ordering会影响内存的依赖关系, 进一步影响Launch到GPU后Kernel的执行序列.
// An HLO ordering based on data dependencies in the HLO graph. In this partial // order, instruction A executes before instruction B only if there is a path // from A to B in the HLO graph. For example, given the following graph: /* param / \ negate exp \ / add */ // DependencyHloOrdering gives the following executes-before relations: // param executes before negate, exp, and add // negate executes before add // exp executes before add // add executes before nothing // negate and exp are not ordered because the dependencies allow either to // execute before the other (or in parallel). DependencyHloOrdering ordering // allows maximum parallelism and enables any execution order which satisfies // data dependencies. This requires pessimistic assumptions about buffer live // ranges and can result in more memory used than more constrained orderings. class DependencyHloOrdering : public PredecessorHloOrdering { // An HLO ordering based on a total order of instructions in each computation. // The computation total order is a sequencing of all of its instructions in // the computation (eg, {inst0, inst1, inst2,...}) as in single-threaded // execution. For example, given the following HLO graph: /* param / \ negate exp \ / add */ // and the following sequence: // // {param, negate, exp, add} // // SequentialHloOrdering gives the following executes-before relations: // param executes before negate, exp, and add // negate executes before exp and add // exp executes before add // add executes before nothing // This is more constrained than DependencyHloOrdering in this example because // negate and exp are ordered (negate before exp). This enables param to share // the same buffer as exp (param buffer is dead after exp). Generally, this // ordering enables more buffer sharing (reduced memory usage) because buffer // interference is reduced relative to DependencyHloOrdering. class SequentialHloOrdering : public HloOrdering {
GpuHloSchedule
Build() | 构造GpuHloSchedule单例, 同时会根据Stream的数量采用BFS或Sequence的LaunchOrder,, 这个LaunchOrder会用于构造GpuHloOrdering, 以及thunk_launch_order_ |
ThunkLaunchOrder() | 拿到上述thunk_launch_order_ |
ConsumeHloOrdering() | 拿到上述GpuHloOrdering |
GpuHloOrdering
class GpuHloOrdering : public PredecessorHloOrdering : public HloOrdering
在使用Multi-stream的情况下, 一方面stream no的分配算法会遍历图给每个HloInstruction分配不同的stream no, 另一方面, GpuHloOrdering也会据此选择BFS算法来生成Ordering存入thunk_launch_order以及HloOrdering.predecessors_ , 后续的显存优化会据此优化显存, 确保Multi-Stream在实际执行的过程中不会产生显存问题.
BufferAssigner
BufferAssigner 用于构造BufferAssignment对象, 可以理解为BufferAssignment的wrapper
BufferAssignment
内存管理入口类. 所有关于内存优化的内容均在其Run()中实现, 在nvptx_compiler.cc:RunBackend()调用
BufferLiveness
描述一块内存的生命周期, “computes liveness of the output buffers of HLOs and their interference”, 为显存优化的决策提供支持. BufferLiveness 的构造过程经历3层分析, 从内而外, 即从前到后:
LogicalBufferAnalysis: 构造LogicalBuffer, 每个Instruction都可以有多个LogicalBuffer, 通过Accept机制遍历所有的Instruction
TuplePointsToAnalysis: 构造PointToSet, 需要之前的LogccalBuffer, 最终也是通过Accept机制遍历所有的Instruction二者其实的都是Hlo的DFS遍历handler
class TuplePointsToAnalysis : public DfsHloVisitorWithDefault class LogicalBufferAnalysis : public DfsHloVisitorWithDefault
NVPTXCompiler::RunBackend() BufferAssigner::Run() assigner::CreateAssignment() liveness = BufferLiveness::Run(module, std::move(hlo_ordering) //class BufferLiveness liveness = new BufferLiveness() liveness->Analyze() points_to_analysis_ = TuplePointsToAnalysis::Run() logical_buffer_analysis = LogicalBufferAnalysis::Run() analysis = new LogicalBufferAnalysis() analysis.Analyze() return analysis analysis = new TuplePointsToAnalysis(logical_buffer_analysis) analysis.Analyze() return analysis maybe_live_out_buffers_ = points_to_analysis_->GetPointsToSet(root).CreateFlattenedSet() return liveness assignment(new BufferAssignment(module, std::move(liveness)
-2- 规划内存的总入口, 注意入参
-15- liveout的部分只发生在root instruction(TODO)
LogicalBuffer
“虚拟内存”, LogicalBuffer 对内存需求的抽象, 内存优化的实质就是调整LogicalBuffer和BufferAllocation的关系, 将来会被HloValue替换掉. 同时, 整个内存优化过程都使用color的概念, 相同的color表示可以融合, 这点在BufferAllocation和BufferSet中都有用到, 显然, 处于一个BufferAllocation之下的LogicalBuffer都有相同的color, 要不也不会在一起, 相应的BufferAllocation实例也使用想用的color进行标记
什么样的两个LogicalBuffer会被分配给同一个BufferAllocation?
1. kWhile, kCall, kConditional 的LogicalBuffer
2. `cannot_merge_buffer_sets()`: !(两个buffer的id不同(不是一个buffer) && 会发生干涉)
void LogicalBufferAnalysis::NewLogicalBuffer(HloInstruction* instruction, const ShapeIndex& index) { CHECK_EQ(logical_buffers_.size(), next_buffer_id_); logical_buffers_.emplace_back( absl::make_unique<LogicalBuffer>(instruction, index, next_buffer_id_)); //LogicalBuffer : BufferValue(instruction, index, id), id_(id) const Shape& shape = ShapeUtil::GetSubshape(instruction->shape(), index); index_(index) {} output_buffers_[std::make_pair(instruction, index)] = logical_buffers_.back().get(); ++next_buffer_id_; }
-1- LogicalBuffer只存Shape(确切的是SubShape) id 和inst指针
-3- 所有的LogicalBuffers对象都会丢到, LogicalBufferAnalysis::logical_buffers中
-7- 所以, 一个LogicalBuffer只对应一个Subshape的内存, 不是整个Instruction
BufferAllocation
BufferAllocation 物理内存的抽象, 一个 BufferAllocation 实例最终对应一块连续物理内存, 通常, 一块物理内存会被多个不干涉的LogicalBuffer公用.一个BufferAllocation实例可以分为:
Status BufferAssignment::ComputeSummaryStats() { for (auto& allocation : Allocations()) { if (allocation.is_entry_computation_parameter()) { stats_.parameter_allocation_count++; stats_.parameter_allocation_bytes += allocation.size(); } if (allocation.is_constant()) { stats_.constant_allocation_count++; stats_.constant_allocation_bytes += allocation.size(); } if (allocation.maybe_live_out()) { stats_.maybe_live_out_allocation_count++; stats_.maybe_live_out_allocation_bytes += allocation.size(); } if (allocation.IsPreallocatedTempBuffer()) { stats_.preallocated_temp_allocation_count++; stats_.preallocated_temp_allocation_bytes += allocation.size(); } stats_.total_allocation_count++; stats_.total_allocation_bytes += allocation.size();
BufferAllocation::Slice
BufferAllocation里的一片, 关联在该BufferAllocation的每一个LogicalBuffer都会有对应的Slice
GetSlice 根据 buffer, 重新构造一个 Slice 对象
BufferAllocation::Slice BufferAllocation::GetSlice( const LogicalBuffer& buffer) const { const OffsetSize os = FindOrDie(assigned_buffers_, &buffer); return Slice(this, os.offset, os.size); }
ShapeIndex
一个 ShapeIndex 对应的是一个 LogicalBufferList, 真正使用的时候是用的 ShapeIndex 来索引到buffer
StatusOr<BufferAllocation::Slice> BufferAssignment::GetUniqueSlice for (const LogicalBuffer* buffer : GetPointsToSet(instruction).element(index)) //elements() return tree_.element(index).buffers
-4- index是ShapeIndex, 描述一个Shape的索引, 不是Intuction. buffers 是 BufferList, 此处可以看到, 直接用index来索引得到的BufferList,
IrEmitterUnnested::BuildInitializerThunk() return {absl::make_unique<MemzeroThunk>(a, nullptr)};
-2- 使用 ShapeIndex 索引 直接->Allocation::Slice->Thunk,
BufferAllocations
GPU后端对一组BufferAllocation的抽象
std::vector<se::DeviceMemoryBase> buffers_
-1- 物理内存对象, se::DeviceMemoryBase是StreamExecutor对一块线性内存资源的抽象, 提供了最基础offset+size的计算方法
StatusOr<std::unique_ptr<BufferAllocations>> BufferAllocations::Builder::Build() for BufferAllocation::Index i in num_buffers: BufferAllocation allocation = buffer_assignment->GetAllocation(): if address already registered: buffer_allocations->SetBuffer(i, *address); if (allocation.maybe_live_out() || allocation.IsPreallocatedTempBuffer()): se::DeviceMemoryBase buffer_address; buffer = memory_allocator->Allocate(device_ordinal, buffer_size) buffer_address = buffer.Release(); buffer_allocations->SetBuffer(i, buffer_address);
-2- 每个 buffer 都有一个 BufferAllocation
-8- 真正分配新的buffer
优化过程
2 NVPTXCompiler::RunBackend() 3 hlo_schedule = GpuHloSchedule::Build(*module, *stream_assignment, pointer_size_)) 4 std::unique_ptr<GpuHloSchedule> schedule(new GpuHloSchedule) 5 //this analysis figures out which temp buffers are required to run the computation 6 BufferAssigner::Run(hlo_schedule->ConsumeHloOrdering(), BufferSizeBytesFunction(), BufferL 7 //BufferSizeBytesFunction() 8 shape_size = ShapeSizeBytesFunction() 9 ShapeUtil::ByteSizeOf(shape, pointer_size); 10 ByteSizeOfTupleIndexTable(shape, pointer_size); 11 return pointer_size * shape.tuple_shapes_size(); 12 //Run() 13 BufferAssigner assigner 14 assigner.CreateAssignment(HloModule, hlo_ordering, buffer_size) 15 liveness = BufferLiveness::Run(module, std::move(hlo_ordering) 16 liveness = new BufferLiveness() 17 liveness->Analyze() 18 points_to_analysis_ = TuplePointsToAnalysis::Run() 19 logical_buffer_analysis = LogicalBufferAnalysis::Run(module) 20 std::unique_ptr<LogicalBufferAnalysis> analysis(new LogicalBufferAnalysis() 21 analysis->Analyze(); 22 NewLogicalBuffer() 23 analysis(new TuplePointsToAnalysis(module, logical_buffer_analysis.ConsumeValu 24 analysis->Analyze() 25 for computation in module_->MakeNoFusionComputations(): 26 computation->Accept(this) 27 for instruction in computation->instructions(): 28 if instruction is not kFusion: 29 continue 30 GatherFusionInstructions(instruction, fusion_instructions) 31 for instruction in fusion_instructions: 32 instruction->fused_expression_root()->Accept(this) 33 return analysis 34 for computation in module->computation(): 35 if computation->IsFusionComputation(): 36 continue 37 for instruction in computation->instructions(): 38 for alias_buffer in points_to_analysis_->GetPointsToSet().CreateFlattenedSet 39 if aliased_buffer->instruction() != instruction: 40 alias_buffers_.insert(aliased_buffer) 41 if computation == module_->entry_computation(): 42 maybe_live_out_buffers_ = points_to_analysis_->getPointsToSet().CreateFlatte 43 return liveness 44 assignment(new BufferAssignment(module, std::move(liveness) 45 46 flat_hash_set<const LogicalBuffer*> colocated_buffers; 47 BuildColocatedBufferSets(&colocated_buffer_sets) 48 points_to_analysis = buffer_liveness.points_to_analysis() 49 module->input_output_alias_config().ForEachAlias([]{AddSetToColocatedBufferSets()}) 50 for computation in module->MakeComputationPostOrder(): 51 for instruction in computation->MakeInstructionPostOrder(): 52 if instruction->opcode()== HloOpCode::kWhile: 53 AddBufferToColocatedSet() 54 AddSetToColocatedBufferSets() 55 else if == HloOpCode::kCall: 56 //... 57 else if == HloOpcode::kConditional: 58 //... 59 // Given a list of colocated buffer sets (each colocated buffer set represents 60 // the logical buffers that would be assigned to the same physical buffer), 61 // try to merge the sets if the buffers can be shared. Returns the merged set. 62 new_colocated_buffer_sets = MergeColocatedBufferSets() 63 // Given the interference map of a graph (the list of interfering node indices 64 // for each node), perform graph coloring such that interfering nodes are 65 // assigned to different colors. Returns the assigned color of the nodes, where 66 // the colors are represented as integer values [0, color_count). 67 std::vector<int64> ColorInterferenceGraph() 68 // Assign a color to each colocation set in colocated_buffer_sets, such that 69 // the sets that can be merged are assigned with the same color. 70 auto assigned_colors = ColorInterferenceGraph(interference_map); 71 for node in nodes: //nodes: 0 ~ node_count-1 72 for neighbor in interference_map[node]: 73 color = assigned_colors[neighbor] 74 if color != kColorUnassigned: 75 available_colors[color] = false 76 ... 77 std::vector<ColocatedBufferSet> new_colocated_buffer_sets(num_sets) 78 for i in colocated_buffer_sets.size(): 79 buffer_sets = colocated_buffer_sets[i] 80 new_colocated_buffer_sets[assigned_colors[i]].insert(buffer_set.begin(), buffer_ 81 return new_colocated_buffer_sets; 82 swap(colocated_buffer_sets, new_colocated_buffer_sets) 83 colorer_(assignment->liveness()) 84 GatherComputationsByAllocationType(module, &thread_local_computations, &global_computa 85 // For each buffer set in 'colocated_buffer_sets', assigns all buffers in the 86 // same set to the same buffer allocation in 'assignment'. 87 AssignColocatedBufferSets(colocated_buffer_sets, &colocated_buffers, &colocated_alloca 88 for colocated_buffer_set in colocated_buffer_sets: 89 BufferAllocation* allocation = nullptr 90 for LogicalBuffer buffer in colocated_buffer_set: 91 instruction-> buffer->instruction() 92 computation = instruction->parent() 93 for LogicalBuffer buffer in colocated_buffer_set: 94 buffer_size = assignment->buffer_size_(*buffer); 95 if allocation == nullptr: 96 allocation = assignment->NewAllocation(*buffer, buffer_size) 97 colocated_allocations->insert(allocation->index()) 98 else: 99 assignment->AddAssignment(allocation, *buffer, 0, buffer_size) 100 allocation->AddAssignment(buffer, offset, size); 101 assigned_buffers_.emplace(&buffer, offset_size); 102 colocated_buffers->insert(buffer) 103 GatherComputationsByAllocationType(module, &thread_local_computations, &global_computa 104 // First assign buffers for global computatations. Temporary buffers for 105 // sequential computations are collected in 'buffers_to_assign_sequentially'. 106 flat_hash_map<const HloComputation*, flat_hash_set<const LogicalBuffer*>>buffers_to_as 107 for computation in global_computations: 108 // Assigns buffers to the instructions in the given computation. "assignment" 109 // is modified to reflect the new buffer assignments. If is_thread_local is 110 // true, then all assigned buffers have the is_thread_local flag set to 111 // true. 112 AssignBuffersForComputation(colocated_buffers, buffers_to_assign_sequentially, / *is 113 std::vector<const LogicalBuffer*> sorted_buffers; 114 for (auto* instruction : computation->instructions()): 115 for buffer in assignment->points_to_analysis().GetBuffersDefinedByInstruction(in 116 sorted_buffers.push_back() 117 for instruction : computation->MakeInstructionPostOrder(): 118 post_order_position.emplace(instruction, position); 119 // Every sequential computation must get an entry in the 120 // buffers_to_assign_sequentially map, even if we end up with an empty set 121 // of buffers. This ensures we can correctly determine whether to run 122 // whole-module heap simulation. 123 buffers_to_assign_sequentially->emplace(computation, flat_hash_set<const LogicalBu 124 std::vector<BufferAllocation::Index> new_allocation_indices; 125 // A sorted multimap from size to indices of colocated allocations. 126 std::multimap<int64, BufferAllocation::Index> colocated_allocation_size_to_indices 127 for index in colocated_allocations: 128 for buffer_offset_size in assignment->GetAllocations(index).assign_buffers(): 129 if IsTuple(): 130 consider_reuse = false 131 break; 132 if considering_reusing: 133 sorted_colocated_indices.push(index) 134 while !sorted_colocated_indices.empty(): 135 auto index = sorted_colocated_indices.top() 136 sorted_colocated_indices.pop() 137 // Returns the size of the allocation. Necessarily this must be at least as 138 // large as any LogicalBuffer assigned to this allocation. 139 colocated_allocation_size_to_indices.emplace(assigne...size(), index) 140 for buffer in sorted_buffers: 141 if colocated_buffers.contains(buffer) 142 continue 143 instruction = buffer->instruction() 144 // Function which returns the buffer size for a given logical buffer (shape) 145 buffer_size = assignment->buffer_size_(*buffer) 146 if instruction->opcode() == HloOpcode::kConstant: 147 BufferAllocation* allocation = assignment->NewAllocation(*buffer) 148 ... 149 if ... == 150 if ... == 151 if ... == 152 if buffer->...: 153 for operand in instruction->operands(): 154 for operand_slice in assignment->GetAllSlices(): 155 BufferAllocation allocation = assign.GetMutableAllocation(operand_slice.in 156 if(MaybeAssignBuffer(allocation, buffer, assignment)): 157 //bool BufferAssigner::MaybeAssignBuffer() 158 ... 159 assigned_operand = true 160 if assigned_operand 161 break 162 if reuse_colocated_checker_ ... && !assignment->HasAllocation(*buffer): 163 it = colocated_allocation_size_to_indices.lower_bound(buffer_size) 164 while(it != colocated_allocation_size_....) 165 allocation = assignment->GetMutableAllocation(it->second) 166 if (MaybeAssignBuffer(allocation)): 167 colocated_allocation_size_to_indices.erase(it) 168 break 169 ++it; 170 if(!assignment->HasAllocation(*buffer)): 171 // Find the smallest buffer which can be reused iterating from end of 172 // new_allocation_indices (smallest) to beginning (largest). 173 for allocation_index = new_allocation_indices.size() - 1; allocation_index >= 174 BufferAllocation* allocation = assignment->GetMutableAllocation(new_allocati 175 if MaybeAssignBuffer(allocation, *buffer, assignment): 176 break 177 if !assignment->HasAllocation(buffer) && has_sequential_order && !liveness.Maybe 178 // There is a sequential instruction ordering, so we delay assignment of 179 // temp buffers until after the loop. We do this right before we decide to 180 // create a new allocation, to ensure we've exhausted all the buffer 181 // re-use cases above. 182 // 183 // Entry parameters and thread local buffers were already handled earlier 184 // in this loop iteration. See BufferAllocation::IsPreallocatedTempBuffer 185 // for the definition of temp buffers. 186 (*buffers_to_assign_sequentially)[computation].insert(buffer); 187 continue 188 if !assignment->HasAlocation() 189 allocation = assigment->NewAllocation(buffer, buffer_size) 190 new_allocation_indices.push_back(allocation->index()) 191 // Assigns 'buffers_to_assign_sequentially' using heap simulation, assuming 192 // the HLO instructions will be executed in the sequential order given by 193 // assignment->liveness().hlo_ordering().SequentialOrder. If 194 // 'run_whole_module_heap_simulation' is true, the heap simulation will be run 195 // assuming all global computations are sequentially ordered. 196 // Assign buffers with sequential ordering, if any. If all global computations 197 // are sequential, we can run heap simuation on the whole module, which 198 // reduces memory usage. 199 AssignBuffersWithSequentialOrdering() 200 HloSchedule schedule(&assignment->module()); 201 hlo_ordering = assignment->liveness().hlo_ordering(); 202 auto get_heap_algorithm = ... 203 if (run_whole_module_heap_simulation): 204 for (const auto& pair : buffers_to_assign_sequentially): 205 HloInstructionSequence* instruction_sequence = hlo_ordering.SequentialOrder(*com 206 schedule.set_sequence() 207 all_buffers_to_assign.insert() 208 color_map = SplitBuffersByColor() 209 for (auto& single_colored_set : color_map) : 210 result = HeapSimulator::Run() 211 AssignBuffersFromHeapSimulator() 212 else: 213 for (const auto& pair : buffers_to_assign_sequentially): 214 instruction_sequence = hlo_ordering.SequentialOrder(*computation) 215 color_map = SplitBuffersByColor(buffers_to_assign) 216 for (auto& single_colored_set : color_map): 217 buffer_value_set = ToBufferValueFlatSet(single_colored_set.second) 218 options.buffers_to_assign = &buffer_value_set; 219 result = HeapSimulator::Run(get_heap_algorithm(alignment), 220 assignment->points_to_analysis(), 221 assignment->buffer_size_) 222 HeapSimulator heap(std::move(algorithm), size_fn, options, &schedule) 223 const HloComputation* entry_computation = module.entry_computation(); 224 HloInstructionSequence& instruction_sequence = schedule.sequence(entry_compu 225 heap.RunComputation(entry_computation, instruction_sequence, points_to_analy 226 for instruction in instruction_sequence.instructions(): 227 points_to = points_to_analysis.GetPointsToSet(instruction); 228 for user in instruction.users(): 229 ... 230 for instruction in instruction_sequence.instructions(): 231 buffers_defined_by_instruction = points_analysis.GetBufferDefinedByInstr 232 for buffer in buffers_defined_by_instruction: 233 ... 234 for oprand_buffer in used_buffers[instruction]: 235 ... 236 for buffer in buffers_defined_by_instruction: 237 if !shared: --> 238 alloc_size_by_instruction += size_fn_(*buffer); 239 Alloc(buffer, instruction); 240 heap->finish() 241 Result result = algorithm_->Finish(); 242 AssignBuffersFromHeapSimulator(result) 243 BufferAllocation* allocation = assignement->NewEmptyAllocation(result.heap_s 244 for (const auto& buffer_chunk : result.chunk_map): 245 assignment->AddAssignment(allocation, buffer, chunk.offset, chunk.size); 246 allocation->AddAssignment(bufferm, offset, size) 247 allocation->peak_buffers_ = ComputePeakMemoryLogicalBuffers(*allocation, res 248 for computation in thread_local_computation: 249 if computation IsFusionComputation(): 250 continue 251 AssignBuffersForComputation(/*is_thread_local=*/true) 252 for buffer in assignment_.liveness().maybe_live_out_buffers() 253 if assignement->HasAllocation(): 254 alloc = assignment->GetMutableAssignedAllocation() 255 alloc -> set_maybe_live_out(true) 256 assignment->CombineTempAllocations(); 257 std::partition(allocations_.begin(), allocations_.end(), 258 [](const BufferAllocation& allocation) { 259 return !allocation.IsPreallocatedTempBuffer(); 260 }); 261 allocations_.erase(first_temp_it, ) 262 for combined:combined_allocation_map: 263 allocations_.push_back() 264 for index in allocations_.size(): 265 allocation->set_index(index): 266 for buffer_offset_size in allocation->assigned_buffers_: 267 LogicalBuffer buffer = buffer_offset_size.first 268 allocation_index_for_buffer_[buffer] = index 269 ir_emitter_context(); 270 IrEmitterUnnested ir_emitter 271 entry_computation->Accept(&ir_emitter) 272 ptx = CompileToPtx() 273 const std::vector<uint8> cubin = CompilePtxOrGetCachedResult() 274 ir_emitter.ConsumeThunkSequence(), std::move(stream_assignment), hlo_schedule->ThunkLaunch 275 auto* gpu_executable = new GpuExecutable() 276 gpu_executable->set_ir_module_string(ir_module_string_before_opt);
-2- 这个backend其实不是llvm的backend 而是HLO的backend?
-6- //buffer_assignment.cc , 都是对元数据的管理, 没有真的分配, 包含了所有的内存管理代码, 显存优化的入口, 这里ConsumerHloOrdering()返回的是hlo_ordering
-8- 就是下文的buffer_size
-11- 对于一个tuple描述的shape, size的计算
-15- 遍历 module//buffer_liveness.cc –> DFS analysis, 构造一个liveness, 需要针对HloModule的TuplePointsToAnalysis::Run()->LogicalBufferAnalysis::Run()等一系列分析,
-19- 遍历 HloModule, 为每个HloInstruction构造相应的Logicalbuffer, 存入LogicalBufferAnalysis, 并返回其指针, `class LogicalBufferAnalysis : public DfsHloVisitorWithDefault`, 说明也是用来遍历的handler
-21- 遍历节点, 执行各种handler, 建立logicalbuffer和instruction及其subshape的关系, 并存储在logicalbuffer中
-22- 构造LogicalBuffer对象
-26- 所有非fusion的buffer构造完毕
-27- 这个循环收集所有的fusion_inst
-30- gather 该computation下的所有的fusion instruction
-32- 给fusion的inst分配logical buffer
-44- 构造BufferAssignment
-52- 当前只有这3种op可以colocated
-53- 将这个instruction对应的LogicalBuffer放到BufferSet
-54- 将BufferSet放到BufferSets
-60- colocate意味着最终分配的是同一块物理内存
-62- 使用邻接矩阵的方式表示一张图, 用int64标识一个节点. 遍历每一个node, 首先遍历其每一个neighbor, 这个neighbor不是相邻节点, 而是通过 cannot_merge_buffer_set 设置的, 比如id()不同&&… 他们用过的color就标记为不可用, 取一个neighbor都没用过的color, 最小未用, 给这个node, 最终的效果是, 每一个node和他的neighbors节点的颜色都不同, 但可能和其他节点相同, 相邻的不能共用buffer, 猜测应该是不能共用输入及输出, 前面的输出还是可以作为后面的输入的. 最终, 就是在这里解决内存冲突: 不相邻的标记为相同颜色, 说明后续可能可以合并. MergeColocatedBufferSets 遍历所有的BufferSet, 将能合并的set合并了, 合并之前, 每个BufferSet都和Instruction/node关联, 合并之后, BufferSets中的一个Set里的buffer就不是和instruction强相关了, 可能是几个合在一起的, 不过, 不存在把一个BufferSet拆成几个的情况, 两个bufferset, 要么能合并, 要么不能合并, 不会把一个BufferSet内的LogicalBuffer拆分.
-70- interference_map表示vector的vector, 表征的是colocated_buffer_sets中两个set的关系, 而从这里看, 每个set里存储的一组logicalbuffer其实都是一个node使用, 这一点其实也可以从colocated_buffer_sets的构造过程可以看出: 3种op每种op都会先添加一组上logical_buffer到colocated_buffer_set, 再把set放到sets中
-72- 将 interference_map中表征的有冲突的node用不同的color表示
-77- num_sets是最大的color值, 即能将nodes划分为互斥最小集合的数量
-80- 将不冲突的, 即color相同的node都放一起
-87- AssignColocatedBufferSets, 代码里的AssignXXX都是构造/分配BufferAllocation的意思, 比如这个 AssignColocatedBufferSets, 就是根据BufferSets分配BufferAllocation, 遍历colocated_buffer_sets 里的每一个Set里的每一个LogicalBuffer, 将一个Set的LogicalBuffer都放在一个BufferAllocation中, 一个Allocation里的LogicalBuffer不一定一样大, 将二者关联的时候会传入buffer_size的, 此时, offset都还是0
-92- Module->computation->instruction的嵌套关系
-94- 根据assigner构造时传入的函数对象计算, 这里是遍历所有的buffersize, 将相同的合并到一个allocation中
-95- 为 buffer 分配allocation
-96- **每个allocation内的buffer或buffer_size是固定的?**
-99- 将这个buffer 赋给已有BufferAllocation对象, 这里 offset是0, size就是buffer_size, 是第一次存储这两个值
-106- 第一个顺序执行的点
-107- 至此, 就完成了初始的内存分配代码, 接下来 AssignBufferForComputation(), 就是将 Computation都关联到Allocation, 首先是 global_computations, 遍历所有的Instruction->points_to_analysis->GetBuffersDefinedByInstruction(), 将所有的Buffer按照从大到小排序, 之所以从大到小排序, 应该和 BufferAllocation的Size 要比里面的所有 Buffer的size 都大的原因, 这样就能充分利用 allocation, 毕竟越少越好, 前面的用大块 buffer 搞了大的, 后面就可以乘凉, NewAllocation 本身就会将自己链入Assignment, 总之, 这里会通过一系列判断, 将大部分buffer存入尽可能少的Allocation中. global_computations 进行Assign的过程中, 有一些buffer被加入了sequential的范围, 这部分buffer在AssignBuffersWithSequtialOrdering, 同样, 这里Assign也是找Allocation的意思, 这里会用到堆模拟器, 会计算很多offset, size之类的, 将buffer加入到Heap中, `NewEmptyAllocation(result.heap_size,` 就能看出, 此外, 由于根据hlo_ordering, stream_no> 1时不会所有的都是顺序的, 所以, 会存在根据hlo_ordering最终有多个chunk的情况, 更具体的, 一个computation对应多个LogicalBuffer, 并进行一个simulate, 得到一组chunk, 和一个allocation.
-112- 有关于顺序执行的代码
-116- 将传入的computation下的每一个instruction的每一个LogicalBuffer都入队
-118- 保存传入的computation下每一个instruction及其位置
-130- 根据注释, :// Output tuple table may be allocated at run-time, so make sure we don’t // overwrite them.
-140- 处理在sorted_buffer中, 但不在 colocated_buffers 中的, 前文的 colocated_buffers 只有3种HloOpCode可以用
-155- 获取每一个已有的Allocation
-161- 说明该buffer已经找到了 allocation
-162- 说明之前没有找到能用的
-163- 找到最接近的大于的key
-172- 从小到大(即从后向前) 找合适的allocation存储buffer
-186- 不是liveout的buffer, 即temp的buffer, 先存着, 没有相应的Allocation, 这里存的是comutation和buffer们的映射关系
-189- 实在不行了, 分配个新的Allocation
-193- 计算一个buffer的 live out 或 interfere 等
-199- buffer 顺序执行, 重点关注
-202- 这里用的LazyBestFitHeap, ChoostBestHeapAlgorithm
-208- 默认构造一个buffer只有0, 通过default_colorer(), 还有其他逻辑导致可能没有0吗?-213- out/trace.log:2019-08-02 10:44:04.437956: I tensorflow/compiler/xla/service/buffer_assignment.cc: 1262] Running per-computation heap simulation
-215- 这里的是LogicalBuffer->BufferValue里的color, 不是前文用来合并BufferSet的color, 是通过buffer_liveness.h 中的 DefaultColorer赋值的. 这部分buffers都有color, 但没有被colocate合并过的, 也没有被合并单奥已有allocation中的temp buffer
-216- single_colored_set里的buffer是可以合并的.
-237- 分析时候可以共享内存
-243- 用整个优化过的堆大小来定义新的allocation, 一个allocation里是可以存放很多不同offset和size的buffer的. allocation有自己的size, 全文来看, 同样的color分配一个allocation. 那直接取最大的不就好了。 所以最终, Assignment里存储的tmp buffer是最小数量的Allocation和他的size, 用color区分 + 之前3种已经处理好的Op, + 其他
-245- 会存入allocation
-256- 把多个allocation里的buffer存到同一个combined_allocation里
-257- 划分的原则
-270- XLA HLO Visitor机制
-271- 使用上述代码构造的buffer来构造Thunk
Related:
Tensorflow XLA Service Buffer优化详解
Tensorflow XLA Service 详解 II
Tensorflow XLA Service 详解 I
Tensorflow XLA Client | HloModuleProto 详解
Tensorflow XlaOpKernel | tf2xla 机制详解
Tensorflow JIT 技术详解
Tensorflow JIT/XLA UML
Tensorflow OpKernel机制详解
Tensorflow Op机制详解
Tensorflow Optimization机制详解
Tensorflow 图计算引擎概述
One comment on “Tensorflow XLA Service 详解 II”