Tensorflow XlaOpKernel | tf2xla Mechanism Explained

compiler/aot/ hooks tf2xla/ into the TF engine in AOT fashion.
compiler/jit/ hooks tf2xla/ into the TF engine in JIT fashion. Its core consists of nine optimization passes and three TF ops, among which XlaCompileOp drives the "compile" entry point of tf2xla/, and XlaRunOp drives the "run" path through xla/client.
compiler/tf2xla/ exposes xla_compiler.cc:XlaCompiler::CompileFunction() upward for jit:compile_fn() to turn a cluster into an XlaComputation. Its core is the "symbolic execution" of XlaOpKernels on top of the interfaces provided by xla/client. Every XlaOpKernel subclass does the same three things: fetch XlaExpressions or XlaOps from the XlaOpKernelContext, call the methods provided by xla/client/xla_builder.h to perform the computation, and store the resulting XlaOp back into the XlaOpKernelContext.
compiler/xla/client/ exposes xla_builder.cc:XlaBuilder and friends upward for CompileFunction() to use, converting the Graph from its Op representation into the HloModuleProto:HloComputationProto:HloInstructionProto representation held by an XlaComputation.
It also exposes local_client.cc:LocalClient::Compile() as the compile entry point for jit:BuildExecutable(), handing the assembled XlaComputation to the service layer to be compiled further into binary.
And it exposes local_client.cc:LocalExecutable::Run() as the run entry point for jit/kernels/xla_ops.cc:XlaRunOp, which finds the corresponding binary by key and hands it to the service layer.
compiler/xla/service/ exposes local_service.cc:LocalService::BuildExecutable() for LocalClient::Compile() to perform the real compilation: it takes the HloProto wrapped inside the XlaComputation, converts it into the HloModule:HloComputation:HloInstruction representation, optimizes it, and then uses the LLVM backend to compile it into binary code for the target Executable backend.
It also exposes executable.cc:Executable::ExecuteOnStream() for LocalExecutable::Run() to actually execute that binary.

From the kernel's point of view, XLA does not add new Ops; rather, for each existing Op it adds another flavor of kernel backed by XLA: the XlaOpKernel. In the TF engine, OpKernel is already the bottom of the software stack, i.e. the final computation happens inside the OpKernel. In XLA, however, the XlaOpKernel is merely the entry point to compilation; the bulk of the actual work is handed off to the XLA engine underneath. The XLA-related code lives in tensorflow/compiler.

tf2xla/ is responsible for constructing and registering XlaOpKernels. Although XLA and the TF engine do not sit at the same layer, the problems they face are similar in many ways: both need to keep kernels and devices easily extensible, and both need to maintain the data-flow and control-flow relationships between predecessor and successor kernels. For these and similar reasons, XLA's internal interface for registering XlaOpKernels looks very much like OpKernel registration in the TF engine, while its internals differ fundamentally; these "differences" are exactly what we need to pay attention to.

The key to understanding how XlaOpKernel differs from OpKernel is "symbolic execution".
Look at the TF engine first. Its OpKernel::Compute() method does: OpKernelContext.Input() to fetch the input data ==> compute ==> OpKernelContext.SetOutput() to store the output; the result then flows through the OpKernelContext into the successor OpKernels. What flows here is real training data; call this process "execution".
By contrast, the "symbolic" in XLA's "symbolic execution" means the XlaOpKernel is not designed to process training data, but to generate code that will process that data correctly. The JIT as a whole resembles a Python interpreter, which first compiles the program to binary and then runs the binary; the "symbolic execution" performed by XlaOpKernels corresponds to the "compile to binary" half. Concretely, in XlaOpKernel::Compile(): XlaOpKernelContext.Input() fetches the input in XlaOp form ==> the methods in xla/client/xla_builder.h are called to implement the Op's semantics, which in effect generates a group of HloInstructions that will process the data ==> XlaOpKernelContext.SetOutput() stores the result in XlaOp form; the result then flows through the XlaOpKernelContext into the successor XlaOpKernels. What flows here is, in every case, a description in XlaOp form of how the training data will be processed.

//compiler/xla/client/xla_builder.h
// This represents an instruction that has been enqueued using the XlaBuilder.
// This is used to pass to subsequent computations that depends upon the
// instruction as an operand.
class XlaOp {
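
To make "generates a group of HloInstructions" concrete, here is a minimal standalone sketch of driving XlaBuilder directly, outside any XlaOpKernel (the BuildRelu wrapper is mine; written against the ~1.14 xla/client API, error handling elided):

#include "tensorflow/compiler/xla/client/xla_builder.h"
#include "tensorflow/compiler/xla/shape_util.h"

xla::XlaComputation BuildRelu() {
  xla::XlaBuilder builder("relu");
  // A symbolic f32[2,2] parameter; no real data is involved.
  xla::Shape shape = xla::ShapeUtil::MakeShape(xla::F32, {2, 2});
  xla::XlaOp x = xla::Parameter(&builder, 0, shape, "x");
  // Each call enqueues an HloInstructionProto and returns an XlaOp handle.
  xla::XlaOp zero = xla::ConstantR0<float>(&builder, 0.0f);
  xla::Max(zero, x);  // scalar broadcast; the last enqueued op becomes the root
  // Build() wraps the accumulated proto into an XlaComputation.
  return builder.Build().ValueOrDie();
}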

As for when the data actually gets processed, that is up to the XLA engine, which takes care of the subsequent "compile" and "run" stages. Concretely, under JIT, once every XlaOpKernel::Compile() has run, XlaCompileOp calls the relevant methods in xla/service to compile all of the generated HloInstructions into a binary, which is then handed to XlaRunOp for execution.
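
The client-side counterpart of that two-stage flow can be sketched as follows (again a hedged example against the ~1.14 xla/client API; ValueOrDie() used for brevity, and the CompileAndRun wrapper is mine):

#include <memory>

#include "tensorflow/compiler/xla/client/client_library.h"
#include "tensorflow/compiler/xla/client/xla_builder.h"
#include "tensorflow/compiler/xla/literal_util.h"
#include "tensorflow/compiler/xla/shape_util.h"

void CompileAndRun() {
  // "Symbolic execution": build the computation.
  xla::XlaBuilder builder("double");
  auto x = xla::Parameter(&builder, 0,
                          xla::ShapeUtil::MakeShape(xla::F32, {2}), "x");
  xla::Add(x, x);
  xla::XlaComputation computation = builder.Build().ValueOrDie();

  // "Compile" + "run": hand the XlaComputation to the local service,
  // transfer an argument literal, execute, and fetch the result.
  xla::LocalClient* client = xla::ClientLibrary::LocalClientOrDie();
  xla::Literal arg = xla::LiteralUtil::CreateR1<float>({1.0f, 2.0f});
  std::unique_ptr<xla::GlobalData> data =
      client->TransferToServer(arg).ValueOrDie();
  xla::Literal result =
      client->ExecuteAndTransfer(computation, {data.get()}).ValueOrDie();
  // result now holds {2.0f, 4.0f}.
}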

XlaOpKernel Definition

"XlaOpKernel(compiler/tf2xla/xla_op_kernel.h)"继承自"OpKernel"(当前XlaOpKernel不支持AsynOpKernel类似的异步机制), 通过这种继承, XlaOpKernel与OpKernel注册框架有天然的相容性, 同时, 又针对XLA的设计要求作了以下处理, 来实现XLA需要的Symbolic Execution: 一个XlaOpKernel子类不再实现以OpKernelContext为参数的Compute()方法, 而要实现以XlaOpKernelContext为参数的Compile()方法.

//compiler/tf2xla/xla_op_kernel.h
class XlaOpKernel : public OpKernel {
  // Subclasses should implement Compile(), much as standard OpKernels implement
  // Compute().
  virtual void Compile(XlaOpKernelContext* context) = 0;
  void Compute(OpKernelContext* context) final;
};

//compiler/tf2xla/xla_op_kernel.cc
void XlaOpKernel::Compute(OpKernelContext* context) {
  XlaOpKernelContext xla_context(context);
  Compile(&xla_context);
}

Also, as described above, the execution context and the inputs/outputs of an XlaOpKernel differ a great deal from an OpKernel's, so XLA wraps a new context class: "XlaOpKernelContext". Note that generating the data-processing XlaOps still depends on the data's shape, type and so on, i.e. its metadata, and that metadata lives in the "OpKernelContext"; this is why an XlaOpKernelContext is constructed by wrapping the associated OpKernelContext.

//compiler/tf2xla/xla_op_kernel.h
// The context passed to the Compile() method of XlaOpKernel. An
// XlaOpKernelContext is a variant of the standard OpKernel class, tailored for
// implementing operators that perform symbolic execution as part of the XLA
// compiler. The key difference is that XlaOpKernelContext produces and consumes
// data as XLA computations, rather than as standard Tensors.
//
// Under the hood, symbolic execution communicates using special Tensors that
// wrap XlaExpression objects, however this is an implementation detail that
// this class hides. The *only* correct way to allocate a Tensor during
// compilation is using the XlaOpKernelContext methods, since they ensure there
// is a valid XlaExpression backing the tensor. No Op should ever call
// allocate_output or allocate_temp directly on the underlying OpKernelContext.
class XlaOpKernelContext {
 public:
  explicit XlaOpKernelContext(OpKernelContext* context);

  XlaContext* xla_context() const;
  // Returns input `index` as a XlaOp. Unlike
  // OpKernelContext::Input returns a symbolic value rather than a concrete
  // Tensor.
  xla::XlaOp Input(int index);

The XlaOpKernels that XLA has already implemented live in tensorflow/compiler/tf2xla/kernels/. To implement a new XlaOpKernel, a subclass implements the "Compile()" method and registers itself with "REGISTER_XLA_OP". For example:

//compiler/tf2xla/kernels/relu_op.cc
class ReluOp : public XlaOpKernel {
 public:
  explicit ReluOp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {}
  void Compile(XlaOpKernelContext* ctx) override {
    xla::XlaBuilder* builder = ctx->builder();
    auto zero = XlaHelpers::Zero(builder, input_type(0));
    ctx->SetOutput(0, xla::Max(zero, ctx->Input(0)));
  }
};
REGISTER_XLA_OP(Name("Relu"), ReluOp);

XlaOpKernel Registration

Much like the OpKernel machinery in the TF engine, XLA manages its own XlaOpKernels through the same registry -> registrar -> registration -> create_fn() structure.

//compiler/tf2xla/xla_op_registry.h
#define REGISTER_XLA_OP(NAME, OP) \
  REGISTER_XLA_OP_UNIQ_HELPER(__COUNTER__, NAME, OP)
class XlaOpRegistrationBuilder {
 public:
  // Starts an operator registration chain.
  static XlaOpRegistrationBuilder Name(absl::string_view name);
  ...
}
class XlaOpRegistrar {
 public:
  XlaOpRegistrar(std::unique_ptr<XlaOpRegistry::OpRegistration> registration);
};
#define REGISTER_XLA_OP_UNIQ_HELPER(COUNTER, BUILDER, OP) \
  REGISTER_XLA_OP_UNIQ(COUNTER, BUILDER, OP)
#define REGISTER_XLA_OP_UNIQ(CTR, BUILDER, OP)                                 \
  static ::tensorflow::XlaOpRegistrar xla_op_registrar__body__##CTR##__object( \
      ::tensorflow::XlaOpRegistrationBuilder::BUILDER.Build(                   \
          [](::tensorflow::OpKernelConstruction* context)                      \
              -> ::tensorflow::OpKernel* { return new OP(context); }));  
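
For concreteness, the Relu registration above expands after preprocessing into roughly the following (a sketch; __COUNTER__ assumed to evaluate to 0):

static ::tensorflow::XlaOpRegistrar xla_op_registrar__body__0__object(
    ::tensorflow::XlaOpRegistrationBuilder::Name("Relu").Build(
        [](::tensorflow::OpKernelConstruction* context)
            -> ::tensorflow::OpKernel* { return new ReluOp(context); }));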

//compiler/tf2xla/xla_op_registry.cc
XlaOpRegistry& XlaOpRegistry::Instance() {
  static XlaOpRegistry* r = new XlaOpRegistry;
  return *r;
}
std::unique_ptr<XlaOpRegistry::OpRegistration> XlaOpRegistrationBuilder::Build(
    XlaOpRegistry::Factory factory) {
  registration_->factory = factory;
  return std::move(registration_);
}
XlaOpRegistrar::XlaOpRegistrar(
    std::unique_ptr<XlaOpRegistry::OpRegistration> registration) {
  XlaOpRegistry& registry = XlaOpRegistry::Instance();
  mutex_lock lock(registry.mutex_);
  auto& existing_ops = registry.ops_[registration->name];   //  std::unordered_map<string, std::vector<std::unique_ptr<OpRegistration>>> ops_
  for (auto& existing : existing_ops) {
    if (!XlaOpRegistry::IsCompatible(*existing, *registration)) {
      LOG(FATAL)
          << "XLA op registration " << registration->name
          << " is incompatible with existing registration of the same name.";
    }
  }
  existing_ops.emplace_back(std::move(registration));  // register the registration into the registry
}

With that, the new XlaOpKernel (in reality an XlaOpRegistry::OpRegistration) has been registered into the XlaOpRegistry, but the story is far from over. Like the TF engine, XLA needs to look up large numbers of XlaOpKernels, yet it has no intention of re-implementing the registration management the TF engine already provides. To reuse the TF machinery, beyond inheriting from OpKernel, each XlaOpRegistry::OpRegistration must be converted into a KernelRegistration while preserving its create_fn(), after which the TF engine's OpKernel management applies as-is. Concretely, in JIT this conversion is performed by the optimization pass MarkForCompilationPass:

MarkForCompilationPassImpl::Run() //compiler/jit/mark_for_compilation_pass.cc
  RegisterCompilationKernels()    //compiler/tf2xla/xla_op_registry.cc
    for ops in registry.ops_:
      std::vector<std::unique_ptr<OpRegistration>>& op_registrations = ops.second
      for op_registration in op_registrations:
        for backend in registry.backends_:
          std::unique_ptr<KernelDef> kdef(new KernelDef);
          kdef->set_op(op_registration->name);
          kdef->set_device_type(backend.first);
          for type_attr in type_attrs:
            ...
          registry.kernel_registrars_.emplace_back(new kernel_factory::OpKernelRegistrar(new KernelDef(*kdef), "XlaJitOp",op_registration->factory));

-9- Stores the device_type in the KernelDef; it is precisely this device_type that lets an XlaOpKernel be told apart from the OpKernels in "GlobalKernelRegistry()". In the current version (1.14), XLA registers 3 backends, so kdef here takes one of the 3 values listed below, such as "DEVICE_GPU_XLA_JIT". "::tensorflow::XlaBackendRegistrar" is covered in a separate article; for now it is enough to know that these backend types end up as the XlaOpKernel's kdef.device_type_.

//compiler/tf2xla/xla_op_registry.h
REGISTER_XLA_BACKEND(DEVICE_CPU_XLA_JIT, kCpuAllTypes, CpuOpFilter);
REGISTER_XLA_BACKEND(DEVICE_INTERPRETER_XLA_JIT, kExecAllTypes, OpFilter);
REGISTER_XLA_BACKEND(DEVICE_GPU_XLA_JIT, kGpuAllTypes, GpuOpFilter);

-12- As described in "Tensorflow OpKernel Mechanism Explained", "kernel_factory::OpKernelRegistrar()" calls "InitInternal()" to register the "KernelRegistration" VALUE under its KEY in "GlobalKernelRegistry()"; the factory here is the create_fn() supplied with REGISTER_XLA_OP: a function that news up an XlaOpKernel instance. At this point the XlaOpKernels we registered have made it into GlobalKernelRegistry(), where they can be constructed and retrieved through the TF engine's generic "FindKernelRegistration()" interface. From now on, in the TF engine, one Op has multiple kernels: the OpKernel plus several XlaOpKernels adapted to different devices, distinguished from one another by device_type.

Version 1.14 ships an integrated debug facility to inspect the OpKernels and XlaOpKernels registered in the running process. It shows that, even setting aside the extra hardware devices OpKernels support, only 229 of 1062 Op types support XLA, so plenty of work remains in this area. The full Op breakdown is listed at the end of this article for those interested.

XlaOpKernel Replaces OpKernel

JIT registers all the XlaOpKernels into the TF engine, so how does the runtime find the right XlaOpKernel when it actually executes? This comes back to the point stressed above: when registering into the TF engine, every OpKernelRegistrar used a JIT-internal device type. At system initialization, JIT also registers 3 DeviceFactory instances:

//compiler/jit/xla_interpreter_device.cc
REGISTER_LOCAL_DEVICE_FACTORY(DEVICE_XLA_INTERPRETER, XlaInterpreterDeviceFactory, 40);
//compiler/jit/xla_cpu_device.cc
REGISTER_LOCAL_DEVICE_FACTORY(DEVICE_XLA_CPU, XlaCpuDeviceFactory);
//compiler/jit/xla_gpu_device.cc
REGISTER_LOCAL_DEVICE_FACTORY(DEVICE_XLA_GPU, XlaGpuDeviceFactory);

//compiler/jit/xla_device.cc
XlaGpuDeviceFactory::CreateDevices(devices)
  for i in gpu_ids:
    options.compilation_device_name = DEVICE_GPU_XLA_JIT
    device = absl::make_unique<XlaDevice>(session_options, options)
      //XlaDevice::XlaDevice():
      xla_metadata_(DeviceType(options.compilation_device_name)
        device_type_(device_type)
      jit_device_name_(options.compilation_device_name),
    devices->push_back(std::move(device))

By Tensorflow's design, every registered DeviceFactory has its CreateDevices() called by the execution engine during initialization to "manufacture" device instances, and the three registered by JIT are no exception. Taking JIT's GPU path as an example, the trace above shows that each device ends up carrying "DEVICE_GPU_XLA_JIT" as its device_type_. Later, in the OpKernel execution phase, when XlaCompileOp compiles, it combines this DEVICE_GPU_XLA_JIT-style attribute to look up the kernels it needs in GlobalKernelRegistry():

tensorflow::XlaCompileOp::Compute()
  tensorflow::CompileToLocalExecutable()
    BuildCompilationCache(platform_info, &cache)
      *cache = new XlaCompilationCache(platform_info.xla_device_metadata()->jit_device_type())
        //XlaCompilationCache()
        device_type_(std::move(device_type))
        //compiler/jit/xla_device.cc XlaDevice::Metadata::jit_device_type()
        return device_type_;
    XlaCompiler::Options options
    options.device_type = cache->device_type();
    //XlaCompilationCache::Compile
    return  cache->Compile(options)
      XlaCompilationCache::CompileImpl(options
        XlaCompiler compiler(options);
        //XlaCompiler::XlaCompiler(XlaCompiler::Options options):
          options_(options),
          device_(new XlaCompilationDevice(SessionOptions(), options_.device_type))
            LocalDevice(options, Device::BuildDeviceAttributes(absl::StrCat("/device:",type.type()...
          device_mgr_(absl::WrapUnique(device_))
          local_pflr_.reset(new ProcessFunctionLibraryRuntime(&device_mgr_
            for d in device_mgr->ListDevices():
              flr_map_[d] = NewFunctionLibraryRuntime()
                return std::unique_ptr<FunctionLibraryRuntime>(new FunctionLibraryRuntimeImpl(device_mgr, env, device
          pflr_.reset(new ProcessFunctionLibraryRuntime(&device_mgr_,)
          local_flib_runtime_ = local_pflr_->GetFLR(device_->name())
          flib_runtime_ = pflr_->GetFLR(device_->name());
        tensorflow::XlaCompiler::CompileFunction()
          tensorflow::XlaCompiler::CompileGraph()
            xla::XlaBuilder builder(name);
            XlaContext* context = new XlaContext(this, &builder)
            ExecuteGraph(context, std::move(graph), device_, flib_runtime_)
              device->resource_manager()->Create(
              GraphCompiler graph_compiler(device, graph.get(), flib, step_container.get());
              //tensorflow::GraphCompiler::Compile()
              graph_compiler.Compile()
                for (Node* n : topo_sorted_nodes):
                  //core/common_runtime/function.cc tensorflow::FunctionLibraryRuntimeImpl::CreateKernel()
                  flib_->CreateKernel(n->def(), &op_kernel_raw);
                    tensorflow::CreateNonCachedKernel(device_)
                      device_type = DeviceType(device->attributes().device_type());
                      //core/framework/op_kernel.cc
                      tensorflow::CreateOpKernel(device_type)
                        Status s = OpRegistry::Global()->LookUpOpDef(node_def.op(),&op_def);
                        FindKernelRegistration(device_type)
                          FindKernelRegistration(device_type)
                            string key = Key(node_op, device_type, label);
                            KernelRegistry* typed_registry = GlobalKernelRegistryTyped();
                            auto regs = typed_registry->registry.equal_range(key)
                            for iter in regs:
                              KernelAttrsMatch(iter->second.def, node_attrs, &match)
                              *reg = &iter->second
                        // Everything needed for OpKernel construction.
                        OpKernelConstruction context(...)
                        //tensorflow::XlaOpKernel::XlaOpKernel()
                        *kernel = registration->factory->Create(&context);
                  std::unique_ptr<OpKernel> op_kernel(op_kernel_raw);
                  OpKernelContext op_context()
                  *flib_->GetFunctionLibraryDefinition(), *n)
                  device_->Compute(CHECK_NOTNULL(params.op_kernel), &op_context)
                    op_kernel->Compute(context)
            BuildComputation()
        entry->compilation_status = compile_fn(
        entry->compilation_status = BuildExecutable(
        *out_compilation_result = &entry->compilation_result;
        *out_executable = entry->executable.get()

-1- The JIT compile-and-run entry point: once initialization is done, the Tensorflow execution engine runs XlaCompileOp and enters the "compile" phase
-4-8- Store the DEVICE_GPU_XLA_JIT obtained from the device into the XlaCompilationCache
-10- Use the device_type held by the XlaCompilationCache to build the XlaCompiler::Options, which in turn are used to construct the XlaCompiler
-18- Construct the XlaCompiler::device_ object; note that the device_type_ used to build its DeviceAttributes is exactly the DEVICE_GPU_XLA_JIT passed in above
-19,21,23- Store XlaCompiler::device_, the device carrying DEVICE_GPU_XLA_JIT, into XlaCompiler::flib_runtime_
-33- Construct the GraphCompiler; the flib passed in is XlaCompiler::flib_runtime_, which is stored as GraphCompiler::flib_
-39- Trace this device back to its origin and it is the device constructed by CreateDevices() at system initialization, i.e. DEVICE_GPU_XLA_JIT
-18,40- These two lines echo each other
-46- Build the lookup key from DEVICE_GPU_XLA_JIT
-54,52- Run create_fn(), which in XLA means constructing an XlaOpKernel
-59- The entry point for running the OpKernel
-61- This is XlaOpKernel::Compute(), which, as described earlier, does nothing but run Compile()
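
To see this device-type-keyed lookup in isolation, here is a hedged sketch using the TF engine's FindKernelDef() helper (core/framework/op_kernel.h); it assumes RegisterCompilationKernels() has already materialized the XLA registrations, and LookupReluKernels is my own wrapper:

#include <string>

#include "tensorflow/compiler/tf2xla/xla_op_registry.h"  // DEVICE_GPU_XLA_JIT
#include "tensorflow/core/framework/node_def.pb.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/types.h"              // DEVICE_GPU
#include "tensorflow/core/lib/core/errors.h"

// Sketch: the same NodeDef resolves to different kernel registrations
// depending on which device type is used as the lookup key.
tensorflow::Status LookupReluKernels(const tensorflow::NodeDef& relu_node) {
  const tensorflow::KernelDef* kdef = nullptr;
  std::string kernel_class;
  // The regular TF kernel, keyed by DEVICE_GPU.
  TF_RETURN_IF_ERROR(tensorflow::FindKernelDef(
      tensorflow::DeviceType(tensorflow::DEVICE_GPU), relu_node, &kdef,
      &kernel_class));
  // The XLA kernel for the same op, keyed by DEVICE_GPU_XLA_JIT.
  TF_RETURN_IF_ERROR(tensorflow::FindKernelDef(
      tensorflow::DeviceType(tensorflow::DEVICE_GPU_XLA_JIT), relu_node, &kdef,
      &kernel_class));
  return tensorflow::Status::OK();
}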

One more remark on using devices to distinguish kernels: the TF engine proper provides only three device types for OpKernels: "DEVICE_CPU", "DEVICE_GPU" and "DEVICE_SYCL" (core/framework/types.h). By registering virtual devices, JIT neatly folds both kinds of kernels into one data structure, a rather elegant design. Many Linux kernel modules play the same trick, registering virtual devices through the device-driver interface to talk to user space. Good code designs are all alike; every piece of bad code is bad in its own way.

//core/common_runtime/executor.cc
for iter in graphs:
  Device* device;
  TF_RETURN_IF_ERROR(device_mgr_->LookupDevice(partition_name, &device));

XlaOpKernel Debugging

Like KernelRegistry, XlaOpRegistry does not offer many debugging hooks; the current version has only this one. Then again, as developers we cannot ask for too much.

//compiler/tf2xla/xla_op_registry.h
  // Returns all operations for which there are XLA kernels on any device.
  static std::vector<string> GetAllRegisteredOps();
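
A minimal sketch of using it (my own main() wrapper; assumes the tf2xla kernels are linked into the binary so their static registrars have run):

#include <iostream>
#include <string>
#include <vector>

#include "tensorflow/compiler/tf2xla/xla_op_registry.h"

int main() {
  // Prints every op name that has an XLA kernel registered on any backend.
  for (const std::string& op :
       tensorflow::XlaOpRegistry::GetAllRegisteredOps()) {
    std::cout << op << "\n";
  }
  return 0;
}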

XlaOpKernel vs OpKernel

NR | Op supported by XLA (column 1) | Op not supported by XLA (columns 2-5); entries read down each column alphabetically
1 _ArrayToList __MklDummyConv2DBackpropFilterWithBias DatasetToGraph NonDeterministicInts SparseReshape
2 _ListToArray __MklDummyConv2DWithBias DatasetToSingleElement NonMaxSuppression SparseSegmentMean
3 AddN __MklDummyPadWithConv2D DebugGradientIdentity NonMaxSuppressionV2 SparseSegmentMeanGrad
4 AdjustContrastv2 __MklDummyPadWithFusedConv2D DebugGradientRefIdentity NonMaxSuppressionV3 SparseSegmentMeanWithNumSegments
5 AdjustHue _FusedConv2D DebugIdentity NonMaxSuppressionV4 SparseSegmentSqrtN
6 AdjustSaturation _FusedMatMul DebugNanCount NonMaxSuppressionWithOverlaps SparseSegmentSqrtNGrad
7 All _HostCast DebugNumericSummary NotEqual SparseSegmentSqrtNWithNumSegments
8 Any _HostRecv DecodeAndCropJpeg NthElement SparseSegmentSum
9 ApproximateEqual _HostSend DecodeBase64 OneShotIterator SparseSegmentSumWithNumSegments
10 ArgMax _If DecodeBmp OptimizeDataset SparseSlice
11 ArgMin _MklAddN DecodeCompressed OptionalFromValue SparseSliceGrad
12 Assert _MklAvgPool DecodeCSV OptionalGetValue SparseSoftmax
13 AssignVariableOp _MklAvgPool3D DecodeGif OptionalHasValue SparseSparseMaximum
14 AvgPool _MklAvgPool3DGrad DecodeJpeg OptionalNone SparseSparseMinimum
15 AvgPool3D _MklAvgPoolGrad DecodeJSONExample OrderedMapClear SparseSplit
16 BatchMatMul _MklConcat DecodePaddedRaw OrderedMapIncompleteSize SparseTensorDenseAdd
17 BatchMatMulV2 _MklConcatV2 DecodePng OrderedMapPeek SparseTensorDenseMatMul
18 BatchToSpace _MklConv2D DecodeProtoV2 OrderedMapSize SparseTensorSliceDataset
19 BatchToSpaceND _MklConv2DBackpropFilter DecodeRaw OrderedMapStage SparseToSparseSetOperation
20 BiasAdd _MklConv2DBackpropFilterWithBias DecodeWav OrderedMapUnstage Square
21 BiasAddGrad _MklConv2DBackpropInput DeepCopy OrderedMapUnstageNoKey SquaredDifference
22 BiasAddV1 _MklConv2DWithBias DeleteIterator PaddedBatchDataset Stack
23 BroadcastArgs _MklConv2DWithBiasBackpropBias DeleteSessionTensor PaddedBatchDatasetV2 StackClose
24 BroadcastGradientArgs _MklConv3D DenseToDenseSetOperation PaddingFIFOQueue StackPop
25 BroadcastTo _MklConv3DBackpropFilterV2 DenseToSparseSetOperation PaddingFIFOQueueV2 StackPush
26 Bucketize _MklConv3DBackpropInputV2 Dequantize ParallelConcat StackV2
27 Case _MklDepthwiseConv2dNative DeserializeIterator ParallelDynamicStitch Stage
28 Cast _MklDepthwiseConv2dNativeBackpropFilter DeserializeManySparse ParallelInterleaveDatasetV2 StageClear
29 CheckNumerics _MklDepthwiseConv2dNativeBackpropInput DeserializeSparse ParallelMapDataset StagePeek
30 ClipByValue _MklDequantize DestroyResourceOp ParameterizedTruncatedNormal StageSize
31 Concat _MklElu DestroyTemporaryVariable ParseExample StatefulRandomBinomial
32 ConcatOffset _MklEluGrad Dilation2D ParseSequenceExample StatefulStandardNormal
33 ConcatV2 _MklFusedBatchNorm Dilation2DBackpropFilter ParseSingleExample StaticRegexFullMatch
34 ConjugateTranspose _MklFusedBatchNormGrad Dilation2DBackpropInput ParseSingleSequenceExample StaticRegexReplace
35 Const _MklFusedBatchNormGradV2 Div ParseTensor StatsAggregatorHandleV2
36 ControlTrigger _MklFusedBatchNormV2 DrawBoundingBoxes Placeholder StatsAggregatorSetSummaryWriter
37 Conv2D _MklFusedConv2D DrawBoundingBoxesV2 PlaceholderV2 StridedSliceAssign
38 Conv3D _MklIdentity DynamicPartition PopulationCount StringFormat
39 Cross _MklInputConversion EditDistance PrefetchDataset StringJoin
40 Cumprod _MklLeakyRelu EncodeBase64 Print StringLength
41 Cumsum _MklLeakyReluGrad EncodeJpeg PrintV2 StringLower
42 DepthToSpace _MklLRN EncodeJpegVariableQuality PriorityQueue StringSplit
43 DepthwiseConv2dNative _MklLRNGrad EncodePng PriorityQueueV2 StringSplitV2
44 DepthwiseConv2dNativeBackpropFilter _MklMaxPool EncodeProto QuantizeAndDequantize StringStrip
45 DepthwiseConv2dNativeBackpropInput _MklMaxPool3D EncodeWav QuantizedAdd StringToHashBucket
46 Diag _MklMaxPool3DGrad EnsureShape QuantizedAvgPool StringToHashBucketFast
47 DiagPart _MklMaxPoolGrad Enter QuantizedBatchNormWithGlobalNormalization StringToHashBucketStrong
48 DynamicStitch _MklPadWithConv2D Equal QuantizedBiasAdd StringToNumber
49 Elu _MklPadWithFusedConv2D EuclideanNorm QuantizedConcat StringUpper
50 EluGrad _MklQuantizedAvgPool Exit QuantizedConcatV2 Sub
51 Empty _MklQuantizedConcatV2 ExperimentalAssertNextDataset QuantizedConv2D Substr
52 EmptyTensorList _MklQuantizedConv2D ExperimentalAutoShardDataset QuantizedConv2DAndRelu SummaryWriter
53 ExpandDims _MklQuantizedConv2DAndRelu ExperimentalBytesProducedStatsDataset QuantizedConv2DAndReluAndRequantize Switch
54 ExtractImagePatches _MklQuantizedConv2DAndReluAndRequantize ExperimentalChooseFastestDataset QuantizedConv2DAndRequantize T
55 FakeParam _MklQuantizedConv2DAndRequantize ExperimentalCSVDataset QuantizedConv2DPerChannel TakeDataset
56 FakeQuantWithMinMaxArgs _MklQuantizedConv2DPerChannel ExperimentalDatasetCardinality QuantizedConv2DWithBias TakeManySparseFromTensorsMap
57 FakeQuantWithMinMaxArgsGradient _MklQuantizedConv2DWithBias ExperimentalDatasetToTFRecord QuantizedConv2DWithBiasAndRelu TemporaryVariable
58 FakeQuantWithMinMaxVars _MklQuantizedConv2DWithBiasAndRelu ExperimentalDenseToSparseBatchDataset QuantizedConv2DWithBiasAndReluAndRequantize TensorArray
59 FakeQuantWithMinMaxVarsGradient _MklQuantizedConv2DWithBiasAndReluAndRequantize ExperimentalDirectedInterleaveDataset QuantizedConv2DWithBiasAndRequantize TensorArrayClose
60 FFT _MklQuantizedConv2DWithBiasAndRequantize ExperimentalGroupByReducerDataset QuantizedConv2DWithBiasSignedSumAndReluAndRequantize TensorArrayCloseV2
61 FFT2D _MklQuantizedConv2DWithBiasSignedSumAndReluAndRequantize ExperimentalGroupByWindowDataset QuantizedConv2DWithBiasSumAndRelu TensorArrayConcat
62 FFT3D _MklQuantizedConv2DWithBiasSumAndRelu ExperimentalIgnoreErrorsDataset QuantizedConv2DWithBiasSumAndReluAndRequantize TensorArrayConcatV2
63 Fill _MklQuantizedConv2DWithBiasSumAndReluAndRequantize ExperimentalIteratorGetDevice QuantizedDepthwiseConv2D TensorArrayGather
64 FusedBatchNorm _MklQuantizedDepthwiseConv2D ExperimentalLatencyStatsDataset QuantizedDepthwiseConv2DWithBias TensorArrayGatherV2
65 FusedBatchNormGrad _MklQuantizedDepthwiseConv2DWithBias ExperimentalLMDBDataset QuantizedDepthwiseConv2DWithBiasAndRelu TensorArrayGrad
66 FusedBatchNormGradV2 _MklQuantizedDepthwiseConv2DWithBiasAndRelu ExperimentalMapAndBatchDataset QuantizedDepthwiseConv2DWithBiasAndReluAndRequantize TensorArrayGradV2
67 FusedBatchNormGradV3 _MklQuantizedDepthwiseConv2DWithBiasAndReluAndRequantize ExperimentalMapDataset QuantizedInstanceNorm TensorArrayGradWithShape
68 FusedBatchNormV2 _MklQuantizedMaxPool ExperimentalMatchingFilesDataset QuantizedMatMul TensorArrayPack
69 FusedBatchNormV3 _MklQuantizeV2 ExperimentalMaxIntraOpParallelismDataset QuantizedMaxPool TensorArrayRead
70 Gather _MklRelu ExperimentalNonSerializableDataset QuantizedMul TensorArrayReadV2
71 GatherNd _MklRelu6 ExperimentalParallelInterleaveDataset QuantizeDownAndShrinkRange TensorArrayScatter
72 GatherV2 _MklRelu6Grad ExperimentalParseExampleDataset QuantizedRelu TensorArrayScatterV2
73 HSVToRGB _MklReluGrad ExperimentalPrivateThreadPoolDataset QuantizedRelu6 TensorArraySize
74 IdentityN _MklReshape ExperimentalRandomDataset QuantizedReshape TensorArraySizeV2
75 If _MklSlice ExperimentalRebatchDataset QuantizedResizeBilinear TensorArraySplit
76 IFFT _MklSoftmax ExperimentalScanDataset QuantizeV2 TensorArraySplitV2
77 IFFT2D _MklTanh ExperimentalSetStatsAggregatorDataset QueueClose TensorArrayUnpack
78 IFFT3D _MklTanhGrad ExperimentalSleepDataset QueueCloseV2 TensorArrayV2
79 InTopKV2 _MklToTf ExperimentalSlidingWindowDataset QueueDequeue TensorArrayWrite
80 InvertPermutation _NcclBroadcastRecv ExperimentalSqlDataset QueueDequeueMany TensorArrayWriteV2
81 IRFFT _NcclBroadcastSend ExperimentalStatsAggregatorHandle QueueDequeueManyV2 TensorDataset
82 IRFFT2D _NcclReduceRecv ExperimentalStatsAggregatorSummary QueueDequeueUpTo TensorForestCreateTreeVariable
83 IRFFT3D _NcclReduceSend ExperimentalTakeWhileDataset QueueDequeueUpToV2 TensorForestTreeDeserialize
84 L2Loss _ParallelConcatStart ExperimentalThreadPoolDataset QueueDequeueV2 TensorForestTreeIsInitializedOp
85 LeakyRelu _ParallelConcatUpdate ExperimentalThreadPoolHandle QueueEnqueue TensorForestTreePredict
86 LeakyReluGrad _ReadVariablesOp ExperimentalUnbatchDataset QueueEnqueueMany TensorForestTreeSerialize
87 LinSpace _Recv ExperimentalUniqueDataset QueueEnqueueManyV2 TensorForestTreeSize
88 ListDiff _ScopedAllocator ExtractGlimpse QueueEnqueueV2 TensorListConcat
89 LogSoftmax _ScopedAllocatorConcat ExtractJpegShape QueueIsClosed TensorListConcatLists
90 LRN _ScopedAllocatorSplit ExtractVolumePatches QueueIsClosedV2 TensorListConcatV2
91 LRNGrad _Send Fact QueueSize TensorListFromTensor
92 MatMul _UnaryOpsComposition FakeQuantWithMinMaxVarsPerChannel QueueSizeV2 TensorListGather
93 MatrixBandPart _VarHandlesOp FakeQuantWithMinMaxVarsPerChannelGradient RaggedGather TensorListPushBackBatch
94 MatrixDiag _While FakeQueue RaggedRange TensorListResize
95 MatrixDiagPart ; group FIFOQueue RaggedTensorFromVariant TensorListScatter
96 MatrixSetDiag ; idx: FIFOQueueV2 RaggedTensorToSparse TensorListScatterIntoExistingList
97 Max Abort FilterByLastComponentDataset RaggedTensorToVariant TensorListScatterV2
98 MaxPool Abs FilterDataset RandomCrop TensorListSplit
99 MaxPool3D AccumulatorApplyGradient Fingerprint RandomGamma TensorScatterAdd
100 MaxPool3DGrad AccumulatorNumAccumulated FixedLengthRecordDataset RandomPoisson TensorScatterSub
101 MaxPool3DGradGrad AccumulatorSetGlobalStep FixedLengthRecordDatasetV2 RandomPoissonV2 TensorScatterUpdate
102 MaxPoolGrad AccumulatorTakeGradient FixedLengthRecordReader RandomShuffleQueue TensorSliceDataset
103 MaxPoolGradGrad Add FixedLengthRecordReaderV2 RandomShuffleQueueV2 TensorStridedSliceUpdate
104 MaxPoolGradGradV2 AddManySparseToTensorsMap FixedUnigramCandidateSampler RangeDataset TensorSummary
105 MaxPoolGradV2 AddSparseToTensorsMap FlatMapDataset ReaderNumRecordsProduced TensorSummaryV2
106 MaxPoolV2 AddV2 FloorDiv ReaderNumRecordsProducedV2 TextLineDataset
107 Mean AdjustContrast FloorMod ReaderNumWorkUnitsCompleted TextLineReader
108 Min AllCandidateSampler FlushSummaryWriter ReaderNumWorkUnitsCompletedV2 TextLineReaderV2
109 MirrorPad Angle For ReaderRead TFRecordDataset
110 Multinomial AnonymousIterator FractionalAvgPool ReaderReadUpTo TFRecordReader
111 NoOp AnonymousIteratorV2 FractionalAvgPoolGrad ReaderReadUpToV2 TFRecordReaderV2
112 OneHot ApplyAdadelta FractionalMaxPool ReaderReadV2 ThreadUnsafeUnigramCandidateSampler
113 OnesLike ApplyAdagrad FractionalMaxPoolGrad ReaderReset TileGrad
114 Pack ApplyAdagradDA FusedPadConv2D ReaderResetV2 Timestamp
115 Pad ApplyAdam FusedResizeAndPadConv2D ReaderRestoreState TopK
116 PadV2 ApplyAdaMax GenerateVocabRemapping ReaderRestoreStateV2 TruncateMod
117 PartitionedCall ApplyAddSign GeneratorDataset ReaderSerializeState TryRpc
118 PlaceholderWithDefault ApplyCenteredRMSProp GetSessionHandle ReaderSerializeStateV2 Unbatch
119 PreventGradient ApplyFtrl GetSessionHandleV2 ReadFile UnbatchGrad
120 Prod ApplyFtrlV2 GetSessionTensor Real UnicodeDecode
121 Qr ApplyGradientDescent Greater RecordInput UnicodeDecodeWithOffsets
122 QuantizeAndDequantizeV2 ApplyMomentum GreaterEqual ReduceDataset UnicodeEncode
123 QuantizeAndDequantizeV3 ApplyPowerSign GuaranteeConst ReduceJoin UnicodeScript
124 RandomShuffle ApplyProximalAdagrad HashTable RefEnter UnicodeTranscode
125 RandomStandardNormal ApplyProximalGradientDescent HashTableV2 RefExit UniformCandidateSampler
126 RandomUniform ApplyRMSProp HistogramFixedWidth RefIdentity Unique
127 RandomUniformInt Assign HistogramSummary RefMerge UniqueV2
128 Range AssignAdd HostConst RefNextIteration UniqueWithCounts
129 Rank AssignAddVariableOp Identity RefSelect UniqueWithCountsV2
130 ReadVariableOp AssignSub IdentityReader RefSwitch UnravelIndex
131 Relu AssignSubVariableOp IdentityReaderV2 RegexFullMatch Unstage
132 Relu6 AsString Imag RegexReplace UpperBound
133 Relu6Grad AudioSpectrogram ImageSummary RemoteCall VarHandleOp
134 ReluGrad AudioSummary ImmutableConst RemoteFusedGraphExecute Variable
135 Reshape AudioSummaryV2 ImportEvent RepeatDataset VariableV2
136 ResizeBilinear AvgPool3DGrad InitializeTable RequantizationRange Where
137 ResizeBilinearGrad AvgPoolGrad InitializeTableFromTextFile RequantizationRangePerChannel WholeFileReader
138 ResizeNearestNeighbor Barrier InitializeTableFromTextFileV2 Requantize WholeFileReaderV2
139 ResourceApplyAdadelta BarrierClose InitializeTableV2 RequantizePerChannel WindowDataset
140 ResourceApplyAdagrad BarrierIncompleteSize InplaceAdd ResizeArea WriteAudioSummary
141 ResourceApplyAdagradDA BarrierInsertMany InplaceSub ResizeBicubic WriteFile
142 ResourceApplyAdam BarrierReadySize InplaceUpdate ResizeBicubicGrad WriteGraphSummary
143 ResourceApplyAdaMax BarrierTakeMany InterleaveDataset ResizeNearestNeighborGrad WriteHistogramSummary
144 ResourceApplyAddSign Batch InTopK ResourceApplyAdamWithAmsgrad WriteImageSummary
145 ResourceApplyFtrl BatchDataset IsBoostedTreesEnsembleInitialized ResourceApplyCenteredRMSProp WriteRawProtoSummary
146 ResourceApplyFtrlV2 BatchDatasetV2 IsBoostedTreesQuantileStreamResourceInitialized ResourceApplyGradientDescent WriteScalarSummary
147 ResourceApplyMomentum BatchFFT IsVariableInitialized ResourceApplyKerasMomentum WriteSummary
148 ResourceApplyPowerSign BatchFFT2D Iterator ResourceApplyProximalAdagrad
149 ResourceApplyProximalGradientDescent BatchFFT3D IteratorFromStringHandle ResourceCountUpTo
150 ResourceApplyRMSProp BatchFunction IteratorFromStringHandleV2 ResourceGatherNd
151 ResourceGather BatchIFFT IteratorGetNext ResourceSparseApplyAdadelta
152 ResourceScatterUpdate BatchIFFT2D IteratorGetNextAsOptional ResourceSparseApplyAdagrad
153 ResourceStridedSliceAssign BatchIFFT3D IteratorGetNextSync ResourceSparseApplyAdagradDA
154 Reverse BatchMatrixBandPart IteratorToStringHandle ResourceSparseApplyCenteredRMSProp
155 ReverseSequence BatchMatrixDiag IteratorV2 ResourceSparseApplyFtrl
156 ReverseV2 BatchMatrixDiagPart KMC2ChainInitialization ResourceSparseApplyFtrlV2
157 RFFT BatchMatrixSetDiag KmeansPlusPlusInitialization ResourceSparseApplyKerasMomentum
158 RFFT2D BatchNormWithGlobalNormalization LearnedUnigramCandidateSampler ResourceSparseApplyMomentum
159 RFFT3D BatchNormWithGlobalNormalizationGrad LeftShift ResourceSparseApplyProximalAdagrad
160 RGBToHSV Betainc Less ResourceSparseApplyProximalGradientDescent
161 Select Bincount LessEqual ResourceSparseApplyRMSProp
162 SelectV2 BitwiseAnd LMDBReader Restore
163 SelfAdjointEigV2 BitwiseOr LoadAndRemapMatrix RestoreSlice
164 Selu BitwiseXor LogicalAnd RestoreV2
165 SeluGrad BoostedTreesAggregateStats LogicalNot RightShift
166 Shape BoostedTreesBucketize LogicalOr RngSkip
167 ShapeN BoostedTreesCalculateBestFeatureSplit LogUniformCandidateSampler Roll
168 Size BoostedTreesCalculateBestGainsPerFeature LookupTableExport Rpc
169 Slice BoostedTreesCenterBias LookupTableExportV2 SampleDistortedBoundingBox
170 Snapshot BoostedTreesCreateEnsemble LookupTableFind SampleDistortedBoundingBoxV2
171 Softmax BoostedTreesCreateQuantileStreamResource LookupTableFindV2 SamplingDataset
172 SoftmaxCrossEntropyWithLogits BoostedTreesDeserializeEnsemble LookupTableImport Save
173 SpaceToBatch BoostedTreesExampleDebugOutputs LookupTableImportV2 SaveSlices
174 SpaceToBatchND BoostedTreesGetEnsembleStates LookupTableInsert SaveV2
175 SpaceToDepth BoostedTreesMakeQuantileSummaries LookupTableInsertV2 ScalarSummary
176 SparseMatMul BoostedTreesMakeStatsSummary LookupTableRemoveV2 ScaleAndTranslate
177 SparseSoftmaxCrossEntropyWithLogits BoostedTreesPredict LookupTableSize ScaleAndTranslateGrad
178 SparseToDense BoostedTreesQuantileStreamResourceAddSummaries LookupTableSizeV2 SdcaFprint
179 Split BoostedTreesQuantileStreamResourceDeserialize LoopCond SdcaOptimizer
180 SplitV BoostedTreesQuantileStreamResourceFlush LowerBound SdcaOptimizerV2
181 Squeeze BoostedTreesQuantileStreamResourceGetBucketBoundaries Lu SdcaShrinkL1
182 StackCloseV2 BoostedTreesSerializeEnsemble MakeIterator SegmentSum
183 StackPopV2 BoostedTreesTrainingPredict MapClear SerializeIterator
184 StackPushV2 BoostedTreesUpdateEnsemble MapDataset SerializeManySparse
185 StatefulPartitionedCall CacheDataset MapDefun SerializeSparse
186 StatefulStandardNormalV2 ChooseFastestBranchDataset MapIncompleteSize SerializeTensor
187 StatefulTruncatedNormal CloseSummaryWriter MapPeek SetSize
188 StatefulUniform CollectiveBcastRecv MapSize ShardDataset
189 StatefulUniformFullInt CollectiveBcastSend MapStage ShardedFilename
190 StatefulUniformInt CollectiveGather MapUnstage ShardedFilespec
191 StatelessIf CollectiveReduce MapUnstageNoKey ShuffleAndRepeatDataset
192 StatelessMultinomial CombinedNonMaxSuppression MatchingFiles ShuffleDataset
193 StatelessRandomNormal CompareAndBitpack Maximum Sign
194 StatelessRandomUniform Complex MaxPoolGradGradWithArgmax Sinh
195 StatelessRandomUniformInt ComputeAccidentalHits MaxPoolGradWithArgmax SkipDataset
196 StatelessTruncatedNormal ConcatenateDataset MaxPoolWithArgmax Skipgram
197 StatelessWhile ConditionalAccumulator MemmappedTensorAllocator SnapshotDataset
198 StopGradient Conj Merge Softplus
199 StridedSlice ConsumeMutexLock MergeSummary SoftplusGrad
200 StridedSliceGrad Conv2DBackpropFilter MergeV2Checkpoints Softsign
201 Sum Conv2DBackpropInput Mfcc SoftsignGrad
202 TensorArrayCloseV3 Conv3DBackpropFilter Minimum SparseAccumulatorApplyGradient
203 TensorArrayConcatV3 Conv3DBackpropFilterV2 MirrorPadGrad SparseAccumulatorTakeGradient
204 TensorArrayGatherV3 Conv3DBackpropInput Mod SparseAdd
205 TensorArrayGradV3 Conv3DBackpropInputV2 ModelDataset SparseAddGrad
206 TensorArrayReadV3 Copy Mul SparseApplyAdadelta
207 TensorArrayScatterV3 CopyHost MultiDeviceIterator SparseApplyAdagrad
208 TensorArraySizeV3 Cosh MultiDeviceIteratorFromStringHandle SparseApplyAdagradDA
209 TensorArraySplitV3 CountUpTo MultiDeviceIteratorGetNextFromShard SparseApplyCenteredRMSProp
210 TensorArrayV3 CreateSummaryDbWriter MultiDeviceIteratorInit SparseApplyFtrl
211 TensorArrayWriteV3 CreateSummaryFileWriter MultiDeviceIteratorToStringHandle SparseApplyFtrlV2
212 TensorListElementShape CropAndResize MutableDenseHashTable SparseApplyMomentum
213 TensorListGetItem CropAndResizeGradBoxes MutableDenseHashTableV2 SparseApplyProximalAdagrad
214 TensorListLength CropAndResizeGradImage MutableHashTable SparseApplyProximalGradientDescent
215 TensorListPopBack CTCBeamSearchDecoder MutableHashTableOfTensors SparseApplyRMSProp
216 TensorListPushBack CTCGreedyDecoder MutableHashTableOfTensorsV2 SparseConcat
217 TensorListReserve CTCLoss MutableHashTableV2 SparseConditionalAccumulator
218 TensorListSetItem CudnnRNN MutexLock SparseCross
219 TensorListStack CudnnRNNBackprop MutexV2 SparseDenseCwiseAdd
220 Tile CudnnRNNBackpropV2 Name: SparseDenseCwiseDiv
221 TopKV2 CudnnRNNBackpropV3 NcclAllReduce SparseDenseCwiseMul
222 Transpose CudnnRNNCanonicalToParams NcclBroadcast SparseFillEmptyRows
223 TruncatedNormal CudnnRNNParamsSize NcclReduce SparseFillEmptyRowsGrad
224 Unpack CudnnRNNParamsToCanonical NearestNeighbors SparseReduceMax
225 VariableShape CudnnRNNV2 Neg SparseReduceMaxSparse
226 VarIsInitializedOp CudnnRNNV3 NegTrain SparseReduceSum
227 While DataFormatDimMap NextAfter SparseReduceSumSparse
228 ZerosLike DataFormatVecPermute NextIteration SparseReorder




Related:
Tensorflow XLA Service Buffer Optimization Explained
Tensorflow XLA Service Explained II
Tensorflow XLA Service Explained I
Tensorflow XLA Client | HloModuleProto Explained
Tensorflow XlaOpKernel | tf2xla Mechanism Explained
Tensorflow JIT Technology Explained
Tensorflow JIT/XLA UML
Tensorflow OpKernel Mechanism Explained
Tensorflow Op Mechanism Explained
Tensorflow Optimization Mechanism Explained
Tensorflow Graph Execution Engine Overview
