TVM学习（九）codegen中的内存申请

2024-03-15 14:02:05

在BuildRelay函数中，完成了基于IR的硬件无关优化之后，接下来是在新function基础上进行codegen以及schedule的处理。我们详细考察以下代码：

    // Generate code for the updated function.
    graph_codegen_ = std::unique_ptr<GraphCodegen>(new GraphCodegen());
    graph_codegen_->Init(nullptr, targets_);
graph_codegen_->Codegen(func);

进行codegen是通过一个封装了GraphRuntimeCodegenModule的类GraphCodegen来实现的。GraphCodegen会调用GraphRuntimeCodegenModule中的函数。对于init操作就是传入function和targets建立一个GraphRuntimeCodegen的类。真正的codegen实现是在这个类中。函数实现代码如下：

LoweredOutput Codegen(relay::Function func) {
    auto pf = GetPackedFunc("relay.backend.GraphPlanMemory");
    storage_device_map_ = (*pf)(func);
    // First we convert all the parameters into input nodes.
    for (auto param : func->params) {
      auto node_ptr = GraphInputNode::make_node_ptr(param->name_hint(), GraphAttrs());
      var_map_[param.get()] = AddNode(node_ptr, param);
    }
    heads_ = VisitExpr(func->body);
    std::ostringstream os;
    dmlc::JSONWriter writer(&os);
    GetJSON(&writer);
    LoweredOutput ret;
    ret.graph_json = os.str();
    ret.params = params_;

    for (auto& kv : lowered_funcs_) {
      if (ret.lowered_funcs.count(kv.first) == 0) {
        ret.lowered_funcs.Set(kv.first, IRModule());
      }
      auto& mod = ret.lowered_funcs[kv.first];
      mod->Update(kv.second);
      ret.lowered_funcs.Set(kv.first, mod);
    }
    ret.external_mods = compile_engine_->LowerExternalFunctions();
    return ret;
  }

内存申请
内存申请优化在src/relay/backend/http://graph_plan_memory.cc中。如下：

Map<Expr, Array<IntegerArray> > GraphPlanMemory(const Function& func) {
  return StorageAllocator().Plan(func);
}

主要有两个类和内存申请有关：StorageAllocator和StorageAllocaInit。StorageAllocaInit主要是创建封装内存申请信息的TokenMap，收集不同算子所在的设备信息。主要函数是GetINitTokenMap，在这个函数中先是调用CollectDeviceInfo来遍历func的节点，获得每个节点运行的设备属性。获得每个算子节点的设备属性主要是通过copy算子来进行推断。我们知道当前后两个算子所在设备不一样的时候，需要实现两个设备的数据交换，这个交换是通过copy算子来实现的。因此copy之后连接的算子就是和这个copy算子具有相同的设备。算法上采用从copy算子向前和向后遍历的方式来推断非copy节点的设备信息。比如官网给出的例子：

/*
 * \brief Return device allocation map based on the post order traversed graph.
 * For the following program:
 * .. code-block:: python
 *     x = relay.var("x")
 *     y = relay.var("y")
 *     add = relay.add(x, y)
 *     sqrt = relay.sqrt(add)
 *     log = relay.log(add)
 *     subtract = relay.subtract(sqrt, log)
 *     exp = relay.exp(subtract)
 *
 * Suppose we have annotated add, sqrt, and log with device 1, 2, and 3,
 * respectively. The fallback/default device is 4. After Rewriting the
 * program, we can have the following graph, where each copy op has both
 * source and destination device type denoting which device the data should be
 * copied from and to.
 *
 *         x     y
 *          \   /
 *          add/1
 *          /   \
 *       copy1  copy2
 *         |     |
 *      sqrt/2 log/3
 *         |     |
 *       copy3 copy4
 *          \   /
 *        subtract
 *            |
 *           exp
 *
 * To Get the device mapping of each expression, we need to propagate the
 * device information from the copy ops. This can be done in two passes.
 *  -Pass 1: Propagating the source device type to ops in a bottom-up way to the
 *           ancestors until encountering another copy op. For example, this way
 *           provides add, x, and y device types from the copy operator, `copy1`.
 *  -Pass 2: Propagating the destination device type of "the last" copy op to the
 *           remain nodes. For instance, this offers `subtract` and `exp` the
 *           same device type as `copy3`.
 */

具体代码实现是首先进行post深度遍历，建立节点的map，在或者map中会记录该节点是否存在相连的copy节点，为之后通过copy来推断节点设备信息使用。BottomUpPropagation用来从copy节点由底向上遍历，获得节点设备信息。FillPropagation得到copy之后节点的设备信息。

完成device信息收集后返回node_device_map_，用于创建TokenMap的时候使用。之后StorageAllocaInit开始创建TokenMap，TokenMap中包含了ttype，device_type的信息。接下来需要给每个节点进行内存分配。TVM中通过复用内存来优化内存申请。

void CreateToken(const ExprNode* op, bool can_realloc) final {
    CHECK(!token_map_.count(op));
    auto it = prototype_.find(op);
    CHECK(it != prototype_.end());
    std::vector<StorageToken*> tokens;
    for (StorageToken* tok : it->second) {
      if (can_realloc) {
        tokens.push_back(Request(tok));
      } else {
        // Allocate a new token,
        StorageToken* allocated_tok = Alloc(tok, GetMemorySize(tok));
        allocated_tok->device_type = tok->device_type;
        // ensure it never get de-allocated.
        allocated_tok->ref_counter += 1;
        tokens.push_back(allocated_tok);
      }
    }
    token_map_[op] = tokens;
  }

内存分配分为两种情况，一种是can_realloc为真，即可以复用内存，另外一种是不可以复用内存的。如果内存可以复用，调用Request函数来重新计算tok大小。并将其放入tokens列表中。还有一个free_列表记录了可用的内存块，在申请新的内存的时候会从free_中查找有没有可用的符合大小的内存块。符合条件的内存块大小位于区间[size/match_range, size*match_range]之间，首先寻找大于需要申请空间的tok，如果没有再寻找不低于size/match_range的tok，如果都没有就创建新的storageTok。为什么要使用一个match_range来扩大tok搜索范围呢？我猜测可能是为了能增加toke的复用率，减少tok的数量。

这些不同内存块用storage_id来进行标记。如果free_中找不到符合大小的内存块，就重新申请一个storageToke。

如果不可复用，就需要创建新的tok了。Tok分配完成返回tok的映射表，这个映射表包含了算子和tok的一一对应关系。

码农公寓

相关文章