## Environment Setup

According to the official documentation, the SparseTIR environment cannot use Python 3.8, and the open-source version of SparTA is simply unfinished — reproduction is only possible with the artifact version the authors provide. The original TVM docker therefore cannot be used here. Instead, we rebuild new docker images on top of the previous docker image; the base image itself is the same one used for TVM.

### Building the Docker Images

Because the two projects conflict on Python versions and package dependencies, they have to be built as two separate docker images.

`Dockerfile.sparsetir` is as follows:

```dockerfile
FROM nvidia/cuda:11.7.0-cudnn8-devel-ubuntu20.04
# Install tools and dependencies.
RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
RUN apt-get -y update && apt -y upgrade
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Asia/Shanghai
RUN apt-get install -y \
    vim \
    git \
    wget \
    libgoogle-glog-dev
RUN apt install -y \
    gcc \
    libtinfo-dev \
    zlib1g-dev \
    build-essential \
    libedit-dev \
    libxml2-dev \
    libssl-dev \
    unzip \
    pip \
    libsndfile1

# Setup to install the latest version of cmake.
RUN apt-get install -y software-properties-common && \
    apt-get update && \
    wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \
    apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' && \
    apt-get update && apt-get install -y cmake

# Set the working directory.
WORKDIR /root

RUN wget https://repo.anaconda.com/archive/Anaconda3-2023.03-1-Linux-x86_64.sh && \
    bash Anaconda3-2023.03-1-Linux-x86_64.sh -b -p /root/anaconda
RUN eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda create -n tir python=3.9 -y && \
    conda activate tir && pip3 install torch torchvision torchaudio

# install llvm12
RUN wget https://apt.llvm.org/llvm.sh && \
    chmod +x llvm.sh && \
    ./llvm.sh 12

# add github host
RUN echo '20.205.243.166 github.com\n\
199.59.148.96 github.global.ssl.fastly.Net\n' >> /etc/hosts

# install sparseTIR
RUN git clone --recursive https://github.com/uwsampl/SparseTIR.git sparsetir && cd sparsetir/cmake && \
    echo set\(USE_LLVM ON\) >> config.cmake && \
    echo set\(HIDE_PRIVATE_SYMBOLS ON\) >> config.cmake && \
    echo set\(USE_CUDA ON\) >> config.cmake && \
    echo set\(USE_CUBLAS ON\) >> config.cmake && \
    echo set\(USE_CUDNN ON\) >> config.cmake && \
    echo set\(USE_RELAY_DEBUG ON\) >> config.cmake && \
    cd .. && \
    mkdir -p build && \
    cd build && \
    cp ../cmake/config.cmake . && \
    cmake .. && \
    make -j$(nproc) && \
    pip install decorator && \
    cd .. && \
    export SPARSETIR_PATH=$(pwd) && \
    export PYTHONPATH=${SPARSETIR_PATH}/python:${PYTHONPATH} && \
    eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda activate tir && \
    cd python && python setup.py install && cd ..
```

Build the image with:

```bash
docker build . -f Dockerfile.sparsetir -t tir
```

Once it finishes, enter the container with:

```bash
docker system prune
docker run -p 8088:22 --restart=on-failure --runtime=nvidia -it tir /bin/bash
```

`Dockerfile.sparta` is as follows:

```dockerfile
FROM nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
# Install tools and dependencies.
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Asia/Shanghai
RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
RUN apt-get -y update && apt -y upgrade
RUN apt-get install -y \
    emacs \
    git \
    wget \
    libgoogle-glog-dev \
    libsndfile1

# Setup to install the latest version of cmake.
RUN apt-get install -y software-properties-common && \
    apt-get update && \
    wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \
    apt-add-repository 'deb https://apt.kitware.com/ubuntu/ bionic main' && \
    apt-get update && apt-get install -y cmake
# Set the working directory.
WORKDIR /root

# install sputnik
RUN git clone --recurse-submodules https://github.com/zheng-ningxin/sputnik.git && \
    cd sputnik && mkdir build && cd build && \
    cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_TEST=ON -DBUILD_BENCHMARK=ON -DCUDA_ARCHS="70;75" && \
    make -j && cp sputnik/libspmm.so /usr/local/lib/ && cp -r /root/sputnik/third_party/abseil-cpp/absl /usr/local/include/

# install nnfusion
RUN git clone https://github.com/zheng-ningxin/nnfusion.git && cd nnfusion && git checkout hubert_antares && \
    ./maint/script/install_dependency.sh && mkdir build && cd build && cmake .. && make -j

# install anaconda
RUN wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh && \
    bash Anaconda3-2021.11-Linux-x86_64.sh -b -p /root/anaconda && \
    eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda create -n artifact python=3.8 -y && \
    conda activate artifact && pip install torch==1.7.0 torchvision==0.8.0

# install nni
RUN git clone https://github.com/zheng-ningxin/nni.git && cd nni && git checkout artifact && \
    eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda activate artifact && pip install -U -r dependencies/setup.txt && \
    pip install -r dependencies/develop.txt && python setup.py develop && pip install tensorboard transformers==3.5.0 onnxruntime graphviz onnx soundfile datasets==2.0.0 ply matplotlib numpy librosa

# install antares
RUN eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda activate artifact && \
    pip install antares==0.3.12.1

# install tensorrt
RUN eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda activate artifact && \
    pip install pycuda==2020.1 && python3 -m pip install --upgrade setuptools pip && \
    python3 -m pip install nvidia-pyindex && python3 -m pip install --upgrade nvidia-tensorrt==8.4.0.6 && \
    pip install six

# install tvm
RUN wget https://github.com/llvm/llvm-project/releases/download/llvmorg-13.0.0/clang+llvm-13.0.0-x86_64-linux-gnu-ubuntu-16.04.tar.xz && \
    tar -xvf clang+llvm-13.0.0-x86_64-linux-gnu-ubuntu-16.04.tar.xz
RUN eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda activate artifact && \
    git clone --recursive https://github.com/linbinskn/tvm.git tvm && cd tvm && git checkout cuda_old && \
    apt-get update && apt-get install -y python3 python3-dev python3-setuptools gcc libtinfo-dev zlib1g-dev build-essential cmake libedit-dev libxml2-dev && \
    cd build && cmake .. && make -j4 && \
    pip install decorator

# install taco
RUN export PATH=/usr/local/cuda/bin:$PATH && export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH && export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH && git clone https://github.com/QuanluZhang/taco.git && cd taco && git checkout artifact && \
    mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Release -D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda -DCUDA=ON .. && \
    make -j8

# install azcopy
RUN wget https://aka.ms/downloadazcopy-v10-linux && tar xzvf downloadazcopy-v10-linux && cp azcopy_linux_amd64_10.14.1/azcopy /usr/local/bin

# configure the bashrc
RUN echo 'export NNFUSION_HOME=/root/nnfusion \n\
export TACO_HOME=/root/taco \n\
export PATH=$PATH:$TACO_HOME/build/bin \n\
export PYTHONPATH=/root/tvm/python:$PYTHONPATH \n\
export PATH=$NNFUSION_HOME/build/src/tools/nnfusion:$PATH \n\
export CUDA_HOME=/usr/local/cuda \n\
source ~/anaconda/etc/profile.d/conda.sh \n\
' >> /root/.bashrc
```

Build it with:

```bash
docker build . -f Dockerfile.sparta -t sparta
```

and once it finishes, enter the container with:

```bash
docker system prune
docker run -p 8087:22 --restart=on-failure --runtime=nvidia -it sparta /bin/bash
```

### Setting Up the Environment Inside Docker

Once inside the container, run `nvidia-smi`; if it prints the usual GPU table, the setup is working. Then run:

```bash
mkdir workspace && cd workspace
```

and start installing the two frameworks.

#### Installing SparTA

Run:

```bash
git clone https://github.com/microsoft/nni && cd nni && git checkout sparta_artifact
conda activate spar
python setup.py develop
cd script && bash init_env.sh
```

Downloading the checkpoints fails at this step; the likely cause is that nni moved the download URLs but the documentation was never updated.

#### Installing SparseTIR

If the docker build failed at the last step, delete the final `RUN` instruction from `Dockerfile.sparsetir` and do the following instead.

Write a `build.sh` as follows:

```bash
git clone --recursive https://github.com/uwsampl/SparseTIR.git sparsetir && cd sparsetir/cmake && \
echo set\(USE_LLVM ON\) >> config.cmake && \
echo set\(HIDE_PRIVATE_SYMBOLS ON\) >> config.cmake && \
echo set\(USE_CUDA ON\) >> config.cmake && \
echo set\(USE_CUBLAS ON\) >> config.cmake && \
echo set\(USE_CUDNN ON\) >> config.cmake && \
echo set\(USE_RELAY_DEBUG ON\) >> config.cmake && \
cd .. && \
mkdir -p build && \
cd build && \
cp ../cmake/config.cmake . && \
cmake .. && \
make -j$(nproc) && \
pip install decorator && \
cd .. && \
export SPARSETIR_PATH=$(pwd) && \
export PYTHONPATH=${SPARSETIR_PATH}/python:${PYTHONPATH} && \
eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda activate tir && \
cd python && python setup.py install && cd ..
```

Then run:

```bash
bash build.sh
```

`make` may fail with compile errors in the LLVM codegen. The fix is to change every call site the compiler complains about to:

```cpp
alloca->getAlign().value()
```

This error appears when the installed LLVM is too new (likely llvm-15 or llvm-16); installing an older LLVM also solves it (but go no lower than llvm-10).

Afterwards, install the Python package:

```bash
cd python
python setup.py install
cd ..
```
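Before moving on to the experiments, it is worth verifying that the freshly built package actually loads and sees the GPU. Below is a minimal sketch of such a check (my own addition, not part of the artifact), assuming the `tir` conda environment is active and `PYTHONPATH` points at the sparsetir `python` directory as in `build.sh`:

```python
# sanity_check.py -- minimal smoke test for the SparseTIR install.
# A sketch, assuming the `tir` env is active and the build above succeeded.
import tvm
import tvm.sparse  # SparseTIR's sparse extensions; this import fails on stock TVM

dev = tvm.cuda(0)
print("TVM version:", tvm.__version__)      # version string of the SparseTIR fork
print("CUDA device available:", dev.exist)  # True if the runtime can see GPU 0
```

If the import or the device check fails, fix the install before attempting any reproduction run.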
Note that reproduction needs many more dependency packages, which you will have to install yourself.

> **Important:** the dgl package must be version <= 1.0, otherwise it errors out; the other packages appear to have no version constraints, and the latest releases all work.

Install dgl with:

```bash
conda install -c dglteam/label/cu117 dgl
```

## Reproducing the SparTA Experiments

Reproduction environment:

| Environment | Version |
| --- | --- |
| OS | Ubuntu 20.04 |
| CUDA | 11.7.0 (cudnn8) |
| Python | 3.9.16 |
| PyTorch | 2.0.1 |
| torchvision | 0.15.2 |
| GPU | NVIDIA GeForce RTX 3090 × 2 |

We reproduce the paper figure by figure, starting with Figure 8. Run:

```bash
cd script
bash init_checkpoint.sh
cd figure8 && bash run.sh
```

This may fail with `ModuleNotFoundError: No module named 'tqdm'`; simply pip install it.

> **Important:** nni also needs to be installed:

```bash
git clone https://github.com/zheng-ningxin/nni.git && cd nni && git checkout artifact
pip install -U -r dependencies/setup.txt
pip install -r dependencies/develop.txt
python setup.py develop
pip install tensorboard transformers==3.5.0 onnxruntime graphviz onnx soundfile datasets==2.0.0 ply matplotlib numpy librosa
```

If you hit `ERROR: Could not find a version that satisfies the requirement sentencepiece==0.1.91 (from transformers)`, upgrade pip first:

```bash
pip install --upgrade pip
pip install transformers
```

## Reproducing the SparseTIR Experiments

| Environment | Version |
| --- | --- |
| OS | Ubuntu 20.04 |
| CUDA | 11.7.0 (cudnn8-devel) |
| Python | 3.9.16 |
| Anaconda | 23.3.1 |
| PyTorch | 2.0.1 |
| torchvision | 0.15.2 |
| GPU | 2 × RTX 3090 |

### SpMM

The AE code is as follows:

```python
# SpMM artifact-evaluation benchmark: compares SparseTIR's hybrid-format
# (column-partitioned + bucketed ELL) SpMM against the non-decomposed CSR
# baseline on DGL graph datasets.
import dgl
import tvm
import tvm.testing
import tvm.tir as tir
import scipy.sparse as sp
import argparse
import numpy as np
import torch as th
from tvm.script import tir as T
from tvm.sparse import (
    FormatRewriteRule,
    lower_sparse_buffer,
    lower_sparse_iter,
    column_part_hyb,
    format_decompose,
)
import tvm.sparse
from utils import get_dataset, ell


col_part_config = {
    "arxiv": 1,
    "proteins": 8,
    "pubmed": 1,
    "citeseer": 1,
    "cora": 1,
    "ppi": 16,
    "reddit": 8,
    "products": 16,
}

bucketing_config = {
    "arxiv": [1, 2, 4, 8, 16, 32],
    "proteins": [1, 2, 4, 8, 16, 32, 64, 128, 256],
    "pubmed": [1, 2, 4, 8, 16, 32],
    "citeseer": [1, 2, 4],
    "cora": [1, 2, 4],
    "ppi": [1, 2, 4, 8, 16, 32],
    "products": [1, 2, 4, 8, 16, 32],
    "reddit": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
}

@T.prim_func
def csrmm(
    a: T.handle,
    b: T.handle,
    c: T.handle,
    indptr: T.handle,
    indices: T.handle,
    m: T.int32,
    n: T.int32,
    num_tiles: T.int32,
    nnz: T.int32,
    cwm: T.int32,
) -> None:
    T.func_attr({"global_symbol": "main", "tir.noalias": True, "sparse_tir_level": 2})
    I = T.dense_fixed(m)
    J = T.sparse_variable(I, (n, nnz), (indptr, indices), "int32")
    J_detach = T.dense_fixed(n)
    K1 = T.dense_fixed(num_tiles)
    K2 = T.dense_fixed(cwm)
    K3 = T.dense_fixed(32)
    A = T.match_sparse_buffer(a, (I, J), "float32")
    B = T.match_sparse_buffer(b, (J_detach, K1, K2, K3), "float32")
    C = T.match_sparse_buffer(c, (I, K1, K2, K3), "float32")
    with T.sp_iter([I, J, K1, K2, K3], "SRSSS", "csrmm") as [i, j, k1, k2, k3]:
        with T.init():
            C[i, k1, k2, k3] = T.float32(0)
        # if hyb enable C[i, k1, k2, k3] = 0.0
        C[i, k1, k2, k3] = C[i, k1, k2, k3] + A[i, j] * B[j, k1, k2, k3]


class TorchOpTimer(object):
    def __enter__(self):
        self.start_event = th.cuda.Event(enable_timing=True)
        self.end_event = th.cuda.Event(enable_timing=True)
        self.start_event.record()
        return self

    def __exit__(self, type, value, traceback):
        self.end_event.record()
        th.cuda.synchronize()  # Wait for the events to be recorded!
        self.time = self.start_event.elapsed_time(self.end_event) / 1e3


def csr2ell_inv_index_map(o, i, j):
    return i, j


def csr2ell_index_map(i, j):
    return 0, i, j


cached_bucketing_format = None


def bench_nodecomposition(
    g,
    x,
    y_golden,
    feat_size=128,
    cwm=2,
):
    indptr, indices, _ = g.adj_sparse("csc")
    m = g.num_dst_nodes()
    n = g.num_src_nodes()
    nnz = g.num_edges()
    if feat_size < 64:
        cwm = 1
    mod = tvm.IRModule.from_expr(csrmm)
    # specialize
    params = mod["main"].params
    param_map = {
        params[5]: m,  # m
        params[6]: n,  # n
        params[7]: feat_size // cwm // 32,  # num_tiles,
        params[8]: nnz,  # nnz
        params[9]: cwm,  # cwm
    }

    mod["main"] = mod["main"].specialize(param_map)

    # schedule
    mod = tvm.sparse.lower_sparse_iter(mod)
    sch = tvm.tir.Schedule(mod)
    outer_blk = sch.get_block("csrmm0")
    inner_blk = sch.get_block("csrmm1")
    (i,) = sch.get_loops(outer_blk)
    j, foo, foi, fi = sch.get_loops(inner_blk)
    sch.reorder(foo, fi, j, foi)
    sch.bind(fi, "threadIdx.x")
    sch.bind(foo, "blockIdx.y")
    sch.unroll(foi)
    io, ii = sch.split(i, [None, 8])
    sch.bind(io, "blockIdx.x")
    sch.bind(ii, "threadIdx.y")
    init_blk = sch.decompose_reduction(inner_blk, fi)
    ax0, ax1 = sch.get_loops(init_blk)[-2:]
    sch.bind(ax0, "threadIdx.x")
    mod = tvm.sparse.lower_sparse_buffer(sch.mod)
    f = tvm.build(mod["main"], target="cuda")
    # prepare nd array
    indptr_nd = tvm.nd.array(indptr.numpy().astype("int32"), device=tvm.cuda(0))
    b_nd = tvm.nd.array(
        x.numpy().reshape(-1).astype("float32"),
        device=tvm.cuda(0),
    )
    indices_nd = tvm.nd.array(indices.numpy().astype("int32"), device=tvm.cuda(0))
    c_nd = tvm.nd.array(np.zeros((n * feat_size,)).astype("float32"), device=tvm.cuda(0))
    a_nd = tvm.nd.array(np.ones((nnz,)).astype("float32"), device=tvm.cuda(0))
    args = [a_nd, b_nd, c_nd, indptr_nd, indices_nd]
    f(*args)
    tvm.testing.assert_allclose(c_nd.numpy().reshape(-1, feat_size), y_golden.numpy(), rtol=1e-4)
    evaluator = f.time_evaluator(f.entry_name, tvm.cuda(0), number=100)
    return evaluator(*args).mean * 1000


def bench_decomposition(
    g,
    x,
    y_golden,
    feat_size=128,
    bucket_sizes=[],
    coersening_factor=2,
    num_col_parts=1,
    use_implicit_unroll=False,
):
    num_buckets = len(bucket_sizes)
    coersening_factor = min(coersening_factor, feat_size // 32)
    indptr, indices, _ = g.adj_sparse("csc")
    m = g.num_dst_nodes()
    n = g.num_src_nodes()
    nnz = g.num_edges()
    global cached_bucketing_format
    indptr_nd = tvm.nd.array(indptr.numpy(), device=tvm.cpu())
    indices_nd = tvm.nd.array(indices.numpy(), device=tvm.cpu())
    cached_bucketing_format = column_part_hyb(
        m, n, indptr_nd, indices_nd, num_col_parts, bucket_sizes
    )
    row_indices, col_indices, mask = cached_bucketing_format

    # rewrite csrmm
    nnz_cols_symbol = ell.params[-1]
    rewrites = []
    for part_id in range(num_col_parts):
        for bucket_id, bucket_size in enumerate(bucket_sizes):
            rewrites.append(
                FormatRewriteRule(
                    str(part_id) + "_" + str(bucket_id),
                    ell.specialize({nnz_cols_symbol: bucket_size}),
                    ["A"],
                    ["I", "J"],
                    ["O", "I", "J"],
                    {"I": ["O", "I"], "J": ["J"]},
                    csr2ell_index_map,
                    csr2ell_inv_index_map,
                )
            )
    mod = tvm.IRModule.from_expr(csrmm)
    mod = format_decompose(mod, rewrites)
    mod = tvm.tir.transform.RemovePreprocess()(mod)

    # specialize
    params = mod["main"].params
    param_map = {
        params[5]: m,  # m
        params[6]: n,  # n
        params[7]: feat_size // coersening_factor // 32,  # num_tiles,
        params[8]: nnz,  # nnz
        params[9]: coersening_factor,  # coersening_factor
    }
    for part_id in range(num_col_parts):
        for bucket_id in range(num_buckets):
            param_map[params[10 + 7 * (part_id * num_buckets + bucket_id) + 4]] = m
            param_map[params[10 + 7 * (part_id * num_buckets + bucket_id) + 5]] = n
            param_map[params[10 + 7 * (part_id * num_buckets + bucket_id) + 6]] = row_indices[
                part_id
            ][bucket_id].shape[0]

    mod["main"] = mod["main"].specialize(param_map).with_attr("horizontal_fuse", True)

    # schedule
    sch = tvm.tir.Schedule(mod)
    for sp_iter_name in [
        "csrmm_{}_{}".format(i, j) for j in range(num_buckets) for i in range(num_col_parts)
    ]:
        sp_iteration = sch.get_sparse_iteration(sp_iter_name)
        o, i, j, k1, k2, k3 = sch.get_sp_iters(sp_iteration)
        sch.sparse_fuse(sp_iteration, [o, i])

    mod = sch.mod
    mod = tvm.sparse.lower_sparse_iter(mod)
    sch = tvm.tir.Schedule(mod)
    for part_id in range(num_col_parts):
        for bucket_id, bucket_size in enumerate(bucket_sizes):
            is_atomic = num_col_parts > 1 or bucket_id + 1 == num_buckets
            blk = sch.get_block("csrmm_{}_{}0".format(part_id, bucket_id))
            i, j, foo, foi, fi = sch.get_loops(blk)
            sch.reorder(foo, fi, j, foi)
            if is_atomic:
                sch.annotate(blk, "atomic", True)
                write_blk = sch.reverse_cache_write(blk, 0, "local")
                sch.reverse_compute_at(write_blk, fi, True)
                # sch.unroll(sch.get_loops(write_blk)[-2])
            sch.bind(fi, "threadIdx.x")
            sch.bind(foo, "blockIdx.y")
            sch.unroll(foi)
            if use_implicit_unroll:
                sch.annotate(foi, "pragma_unroll_explicit", 0)
            sch.unroll(j)
            if use_implicit_unroll:
                sch.annotate(j, "pragma_unroll_explicit", 0)
            io, ioi, ii = sch.split(i, [None, bucket_sizes[-1] // bucket_size, 8])
            sch.bind(io, "blockIdx.x")
            sch.bind(ii, "threadIdx.y")
            init_blk = sch.decompose_reduction(blk, fi)
            ax0, ax1 = sch.get_loops(init_blk)[-2:]
            sch.bind(ax0, "threadIdx.x")
            sch.unroll(ax1)
            if use_implicit_unroll:
                sch.annotate(ax1, "pragma_unroll_explicit", 0)

    mod = tvm.sparse.lower_sparse_buffer(sch.mod)
    mod = tvm.tir.transform.RemoveUnusedArgs()(mod)
    f = tvm.build(mod, target="cuda")

    # prepare nd array
    b_nd = tvm.nd.array(
        x.numpy().reshape(-1).astype("float32"),
        device=tvm.cuda(0),
    )
    c_nd = tvm.nd.array(np.zeros((n * feat_size,)).astype("float32"), device=tvm.cuda(0))
    # prepare args
    args = [b_nd, c_nd]

    for part_id in range(num_col_parts):
        for bucket_id, _ in enumerate(bucket_sizes):
            weight = tvm.nd.array(
                mask[part_id][bucket_id].numpy().reshape(-1).astype("float32"), device=tvm.cuda(0)
            )
            rows = tvm.nd.array(
                row_indices[part_id][bucket_id].numpy().astype("int32"), device=tvm.cuda(0)
            )
            cols = tvm.nd.array(
                col_indices[part_id][bucket_id].numpy().reshape(-1).astype("int32"),
                device=tvm.cuda(0),
            )
            args += [weight, rows, cols]

    # test accuracy
    f(*args)
    tvm.testing.assert_allclose(c_nd.numpy().reshape(-1, feat_size), y_golden.numpy(), rtol=1e-4)

    # evaluate time
    evaluator = f.time_evaluator(f.entry_name, tvm.cuda(0), number=100)
    return evaluator(*args).mean * 1000

def spmm_hyb(dataset="arxiv"):
    time_list = []
    parser = argparse.ArgumentParser("hybrid format spmm in sparse-tir")
    parser.add_argument("--dataset", "-d", type=str, default=dataset, help="dataset name")
    parser.add_argument("--implicit-unroll", "-i", action="store_true", help="use implicit unroll")
    args = parser.parse_args()
    name = args.dataset
    g = get_dataset(name)

    for feat_size in [32, 64, 128, 256, 512]:
        x = th.rand((g.num_src_nodes(), feat_size))
        y_golden = dgl.ops.copy_u_sum(g, x)
        exec_time = bench_decomposition(
            g,
            x,
            y_golden,
            feat_size=feat_size,
            bucket_sizes=bucketing_config[name],
            coersening_factor=2,
            num_col_parts=col_part_config[name],
            use_implicit_unroll=args.implicit_unroll,
        )
        time_list.append((feat_size, exec_time))

    return time_list

def spmm_nohyb(dataset="arxiv"):
    time_list = []
    parser = argparse.ArgumentParser("hybrid format spmm in sparse-tir")
    parser.add_argument("--dataset", "-d", type=str, default=dataset, help="dataset name")
    args = parser.parse_args()
    name = args.dataset
    g = get_dataset(name)

    for feat_size in [32, 64, 128, 256, 512]:
        x = th.rand((g.num_src_nodes(), feat_size))
        y_golden = dgl.ops.copy_u_sum(g, x)
        exec_time = bench_nodecomposition(
            g,
            x,
            y_golden,
            feat_size=feat_size,
            cwm=2,
        )
        time_list.append((feat_size, exec_time))

    return time_list


if __name__ == "__main__":
    dataset_name = ["arxiv", "proteins", "pubmed", "citeseer", "cora", "ppi", "reddit"]
    time_log = {}
    speedup = lambda x, y: x / y
    for dataset in dataset_name:
        nodecomposition_time = spmm_nohyb(dataset)
        decomposition_time = spmm_hyb(dataset)
        # ratio of hyb to non-hyb running time per feat_size (lower means hyb is faster)
        time_log[dataset] = [decomposition_time[i][1] / nodecomposition_time[i][1] for i in range(5)]

    print(time_log)
```

### Installing taco

```bash
git clone https://github.com/tensor-compiler/taco.git
cd taco
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DPYTHON=ON -DCUDA=ON ..
make -j8
export PYTHONPATH=/root/taco/build/lib:$PYTHONPATH
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
```

Note that the system Python version must match the one in conda: if conda provides Python 3.9.16, the machine itself must also have Python 3.9.16 installed.

After installation, run:

```bash
python build/python_bindings/unit_tests.py
```

to check whether the tests pass.
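Once the unit tests pass, the bindings can also be exercised directly. Below is a minimal SpMV smoke test of `pytaco` (my own sketch, not part of the artifact), assuming `PYTHONPATH` is set as above so that `pytaco` resolves to `/root/taco/build/lib`:

```python
# spmv_check.py -- a small pytaco smoke test (a sketch; assumes PYTHONPATH
# points at /root/taco/build/lib so that `pytaco` is importable).
import numpy as np
import pytaco as pt

# A 3x3 CSR matrix with two nonzeros.
A = pt.tensor([3, 3], pt.csr)
A.insert([0, 0], 2.0)
A.insert([1, 2], 3.0)

# Dense operand and result vectors (float32, matching taco's default dtype).
x = pt.from_array(np.array([1.0, 2.0, 3.0], dtype=np.float32))
y = pt.tensor([3], pt.dense)

# y(i) = A(i, j) * x(j): taco generates and compiles the SpMV kernel on the fly.
i, j = pt.get_index_vars(2)
y[i] = A[i, j] * x[j]
print(pt.to_array(y))  # expected: [2., 9., 0.]
```

If this prints the expected vector, the taco toolchain used by the SparseTIR baselines is in working order.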