## Environment Setup

According to the official documentation, the SparseTIR environment cannot use Python 3.8, and the open-source version of SparTA is simply unfinished — reproduction is only possible with the artifact version the authors provide. The original TVM docker therefore cannot be used here. Instead, we rebuild new docker images on top of the previous docker image; the base image itself is the same one used for TVM.

### Building the Docker Images

Because the two projects conflict on Python versions and package dependencies, they have to be built as two separate docker images.

`Dockerfile.sparsetir` is as follows:

```dockerfile
FROM nvidia/cuda:11.7.0-cudnn8-devel-ubuntu20.04
# Install tools and dependencies.
RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
RUN apt-get -y update && apt -y upgrade
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Asia/Shanghai
RUN apt-get install -y \
    vim \
    git \
    wget \
    libgoogle-glog-dev
RUN apt install -y \
    gcc \
    libtinfo-dev \
    zlib1g-dev \
    build-essential \
    libedit-dev \
    libxml2-dev \
    libssl-dev \
    unzip \
    pip \
    libsndfile1

# Setup to install the latest version of cmake.
RUN apt-get install -y software-properties-common && \
    apt-get update && \
    wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \
    apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' && \
    apt-get update && apt-get install -y cmake

# Set the working directory.
WORKDIR /root

RUN wget https://repo.anaconda.com/archive/Anaconda3-2023.03-1-Linux-x86_64.sh && \
    bash Anaconda3-2023.03-1-Linux-x86_64.sh -b -p /root/anaconda
RUN eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda create -n tir python=3.9 -y && \
    conda activate tir && pip3 install torch torchvision torchaudio

# install llvm12
RUN wget https://apt.llvm.org/llvm.sh && \
    chmod +x llvm.sh && \
    ./llvm.sh 12

# add github host
RUN echo '20.205.243.166 github.com\n\
199.59.148.96 github.global.ssl.fastly.Net\n' >> /etc/hosts

# install sparseTIR
RUN git clone --recursive https://github.com/uwsampl/SparseTIR.git sparsetir && cd sparsetir/cmake && \
    echo set\(USE_LLVM ON\) >> config.cmake && \
    echo set\(HIDE_PRIVATE_SYMBOLS ON\) >> config.cmake && \
    echo set\(USE_CUDA ON\) >> config.cmake && \
    echo set\(USE_CUBLAS ON\) >> config.cmake && \
    echo set\(USE_CUDNN ON\) >> config.cmake && \
    echo set\(USE_RELAY_DEBUG ON\) >> config.cmake && \
    cd .. && \
    mkdir -p build && \
    cd build && \
    cp ../cmake/config.cmake . && \
    cmake .. && \
    make -j$(nproc) && \
    pip install decorator && \
    cd .. && \
    export SPARSETIR_PATH=$(pwd) && \
    export PYTHONPATH=${SPARSETIR_PATH}/python:${PYTHONPATH} && \
    eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda activate tir && \
    cd python && python setup.py install && cd ..
```

Build the image with:

```bash
docker build . -f Dockerfile.sparsetir -t tir
```

Once it finishes, enter the container with:

```bash
docker system prune
docker run -p 8088:22 --restart=on-failure --runtime=nvidia -it tir /bin/bash
```

`Dockerfile.sparta` is as follows:

```dockerfile
FROM nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
# Install tools and dependencies.
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Asia/Shanghai
RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
RUN apt-get -y update && apt -y upgrade
RUN apt-get install -y \
    emacs \
    git \
    wget \
    libgoogle-glog-dev \
    libsndfile1

# Setup to install the latest version of cmake.
RUN apt-get install -y software-properties-common && \
    apt-get update && \
    wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \
    apt-add-repository 'deb https://apt.kitware.com/ubuntu/ bionic main' && \
    apt-get update && apt-get install -y cmake
# Set the working directory.
WORKDIR /root

# install sputnik
RUN git clone --recurse-submodules https://github.com/zheng-ningxin/sputnik.git && \
    cd sputnik && mkdir build && cd build && \
    cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_TEST=ON -DBUILD_BENCHMARK=ON -DCUDA_ARCHS="70;75" && \
    make -j && cp sputnik/libspmm.so /usr/local/lib/ && cp -r /root/sputnik/third_party/abseil-cpp/absl /usr/local/include/

# install nnfusion
RUN git clone https://github.com/zheng-ningxin/nnfusion.git && cd nnfusion && git checkout hubert_antares && \
    ./maint/script/install_dependency.sh && mkdir build && cd build && cmake .. && make -j

# install anaconda
RUN wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh && \
    bash Anaconda3-2021.11-Linux-x86_64.sh -b -p /root/anaconda && \
    eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda create -n artifact python=3.8 -y && \
    conda activate artifact && pip install torch==1.7.0 torchvision==0.8.0

# install nni
RUN git clone https://github.com/zheng-ningxin/nni.git && cd nni && git checkout artifact && \
    eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda activate artifact && pip install -U -r dependencies/setup.txt && \
    pip install -r dependencies/develop.txt && python setup.py develop && pip install tensorboard transformers==3.5.0 onnxruntime graphviz onnx soundfile datasets==2.0.0 ply matplotlib numpy librosa

# install antares
RUN eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda activate artifact && \
    pip install antares==0.3.12.1

# install tensorrt
RUN eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda activate artifact && \
    pip install pycuda==2020.1 && python3 -m pip install --upgrade setuptools pip && \
    python3 -m pip install nvidia-pyindex && python3 -m pip install --upgrade nvidia-tensorrt==8.4.0.6 && \
    pip install six

# install tvm
RUN wget https://github.com/llvm/llvm-project/releases/download/llvmorg-13.0.0/clang+llvm-13.0.0-x86_64-linux-gnu-ubuntu-16.04.tar.xz && \
    tar -xvf clang+llvm-13.0.0-x86_64-linux-gnu-ubuntu-16.04.tar.xz
RUN eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda activate artifact && \
    git clone --recursive https://github.com/linbinskn/tvm.git tvm && cd tvm && git checkout cuda_old && \
    apt-get update && apt-get install -y python3 python3-dev python3-setuptools gcc libtinfo-dev zlib1g-dev build-essential cmake libedit-dev libxml2-dev && \
    cd build && cmake .. && make -j4 && \
    pip install decorator

# install taco
RUN export PATH=/usr/local/cuda/bin:$PATH && export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH && export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH && git clone https://github.com/QuanluZhang/taco.git && cd taco && git checkout artifact && \
    mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Release -D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda -DCUDA=ON .. && \
    make -j8

# install azcopy
RUN wget https://aka.ms/downloadazcopy-v10-linux && tar xzvf downloadazcopy-v10-linux && cp azcopy_linux_amd64_10.14.1/azcopy /usr/local/bin

# configure the bashrc
RUN echo 'export NNFUSION_HOME=/root/nnfusion \n\
export TACO_HOME=/root/taco \n\
export PATH=$PATH:$TACO_HOME/build/bin \n\
export PYTHONPATH=/root/tvm/python:$PYTHONPATH \n\
export PATH=$NNFUSION_HOME/build/src/tools/nnfusion:$PATH \n\
export CUDA_HOME=/usr/local/cuda \n\
source ~/anaconda/etc/profile.d/conda.sh \n\
' >> /root/.bashrc
```

Build it with:

```bash
docker build . -f Dockerfile.sparta -t sparta
```

and once it finishes, enter the container with:

```bash
docker system prune
docker run -p 8087:22 --restart=on-failure --runtime=nvidia -it sparta /bin/bash
```

### Setting Up the Environment Inside Docker

Once inside the container, run `nvidia-smi`; if it prints the usual GPU table, the setup is working. Then run:

```bash
mkdir workspace && cd workspace
```

and start installing the two frameworks.

#### Installing SparTA

Run:

```bash
git clone https://github.com/microsoft/nni && cd nni && git checkout sparta_artifact
conda activate spar
python setup.py develop
cd script && bash init_env.sh
```

Downloading the checkpoints fails at this step; the likely cause is that nni moved the download URLs but the documentation was never updated.

#### Installing SparseTIR

If the docker build failed at the last step, delete the final `RUN` instruction from `Dockerfile.sparsetir` and do the following instead.

Write a `build.sh` as follows:

```bash
git clone --recursive https://github.com/uwsampl/SparseTIR.git sparsetir && cd sparsetir/cmake && \
echo set\(USE_LLVM ON\) >> config.cmake && \
echo set\(HIDE_PRIVATE_SYMBOLS ON\) >> config.cmake && \
echo set\(USE_CUDA ON\) >> config.cmake && \
echo set\(USE_CUBLAS ON\) >> config.cmake && \
echo set\(USE_CUDNN ON\) >> config.cmake && \
echo set\(USE_RELAY_DEBUG ON\) >> config.cmake && \
cd .. && \
mkdir -p build && \
cd build && \
cp ../cmake/config.cmake . && \
cmake .. && \
make -j$(nproc) && \
pip install decorator && \
cd .. && \
export SPARSETIR_PATH=$(pwd) && \
export PYTHONPATH=${SPARSETIR_PATH}/python:${PYTHONPATH} && \
eval "$(/root/anaconda/bin/conda shell.bash hook)" && conda activate tir && \
cd python && python setup.py install && cd ..
```

Then run:

```bash
bash build.sh
```

`make` may fail with compile errors in the LLVM codegen. The fix is to change every call site the compiler complains about to:

```cpp
alloca->getAlign().value()
```

This error appears when the installed LLVM is too new (likely llvm-15 or llvm-16); installing an older LLVM also solves it (but go no lower than llvm-10).

Afterwards, install the Python package:

```bash
cd python
python setup.py install
cd ..
```
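Before moving on to the experiments, it is worth verifying that the freshly built package actually loads and sees the GPU. Below is a minimal sketch of such a check (my own addition, not part of the artifact), assuming the `tir` conda environment is active and `PYTHONPATH` points at the sparsetir `python` directory as in `build.sh`:

```python
# sanity_check.py -- minimal smoke test for the SparseTIR install.
# A sketch, assuming the `tir` env is active and the build above succeeded.
import tvm
import tvm.sparse  # SparseTIR's sparse extensions; this import fails on stock TVM

dev = tvm.cuda(0)
print("TVM version:", tvm.__version__)      # version string of the SparseTIR fork
print("CUDA device available:", dev.exist)  # True if the runtime can see GPU 0
```

If the import or the device check fails, fix the install before attempting any reproduction run.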
Note that reproduction needs many more dependency packages, which you will have to install yourself.

> **Important:** the dgl package must be version <= 1.0, otherwise it errors out; the other packages appear to have no version constraints, and the latest releases all work.

Install dgl with:

```bash
conda install -c dglteam/label/cu117 dgl
```

## Reproducing the SparTA Experiments

Reproduction environment:

| Environment | Version |
| --- | --- |
| OS | Ubuntu 20.04 |
| CUDA | 11.7.0 (cudnn8) |
| Python | 3.9.16 |
| PyTorch | 2.0.1 |
| torchvision | 0.15.2 |
| GPU | NVIDIA GeForce RTX 3090 × 2 |

We reproduce the paper figure by figure, starting with Figure 8. Run:

```bash
cd script
bash init_checkpoint.sh
cd figure8 && bash run.sh
```

This may fail with `ModuleNotFoundError: No module named 'tqdm'`; simply pip install it.

> **Important:** nni also needs to be installed:

```bash
git clone https://github.com/zheng-ningxin/nni.git && cd nni && git checkout artifact
pip install -U -r dependencies/setup.txt
pip install -r dependencies/develop.txt
python setup.py develop
pip install tensorboard transformers==3.5.0 onnxruntime graphviz onnx soundfile datasets==2.0.0 ply matplotlib numpy librosa
```

If you hit `ERROR: Could not find a version that satisfies the requirement sentencepiece==0.1.91 (from transformers)`, upgrade pip first:

```bash
pip install --upgrade pip
pip install transformers
```

## Reproducing the SparseTIR Experiments

| Environment | Version |
| --- | --- |
| OS | Ubuntu 20.04 |
| CUDA | 11.7.0 (cudnn8-devel) |
| Python | 3.9.16 |
| Anaconda | 23.3.1 |
| PyTorch | 2.0.1 |
| torchvision | 0.15.2 |
| GPU | 2 × RTX 3090 |

### SpMM

The AE code is as follows:

```python
# SpMM artifact-evaluation benchmark: compares SparseTIR's hybrid-format
# (column-partitioned + bucketed ELL) SpMM against the non-decomposed CSR
# baseline on DGL graph datasets.
import dgl
import tvm
import tvm.testing
import tvm.tir as tir
import scipy.sparse as sp
import argparse
import numpy as np
import torch as th
from tvm.script import tir as T
from tvm.sparse import (
    FormatRewriteRule,
    lower_sparse_buffer,
    lower_sparse_iter,
    column_part_hyb,
    format_decompose,
)
import tvm.sparse
from utils import get_dataset, ell


col_part_config = {
    "arxiv": 1,
    "proteins": 8,
    "pubmed": 1,
    "citeseer": 1,
    "cora": 1,
    "ppi": 16,
    "reddit": 8,
    "products": 16,
}

bucketing_config = {
    "arxiv": [1, 2, 4, 8, 16, 32],
    "proteins": [1, 2, 4, 8, 16, 32, 64, 128, 256],
    "pubmed": [1, 2, 4, 8, 16, 32],
    "citeseer": [1, 2, 4],
    "cora": [1, 2, 4],
    "ppi": [1, 2, 4, 8, 16, 32],
    "products": [1, 2, 4, 8, 16, 32],
    "reddit": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
}

@T.prim_func
def csrmm(
    a: T.handle,
    b: T.handle,
    c: T.handle,
    indptr: T.handle,
    indices: T.handle,
    m: T.int32,
    n: T.int32,
    num_tiles: T.int32,
    nnz: T.int32,
    cwm: T.int32,
) -> None:
    T.func_attr({"global_symbol": "main", "tir.noalias": True, "sparse_tir_level": 2})
    I = T.dense_fixed(m)
    J = T.sparse_variable(I, (n, nnz), (indptr, indices), "int32")
    J_detach = T.dense_fixed(n)
    K1 = T.dense_fixed(num_tiles)
    K2 = T.dense_fixed(cwm)
    K3 = T.dense_fixed(32)
    A = T.match_sparse_buffer(a, (I, J), "float32")
    B = T.match_sparse_buffer(b, (J_detach, K1, K2, K3), "float32")
    C = T.match_sparse_buffer(c, (I, K1, K2, K3), "float32")
    with T.sp_iter([I, J, K1, K2, K3], "SRSSS", "csrmm") as [i, j, k1, k2, k3]:
        with T.init():
            C[i, k1, k2, k3] = T.float32(0)
        # if hyb enable C[i, k1, k2, k3] = 0.0
        C[i, k1, k2, k3] = C[i, k1, k2, k3] + A[i, j] * B[j, k1, k2, k3]


class TorchOpTimer(object):
    def __enter__(self):
        self.start_event = th.cuda.Event(enable_timing=True)
        self.end_event = th.cuda.Event(enable_timing=True)
        self.start_event.record()
        return self

    def __exit__(self, type, value, traceback):
        self.end_event.record()
        th.cuda.synchronize()  # Wait for the events to be recorded!
        self.time = self.start_event.elapsed_time(self.end_event) / 1e3


def csr2ell_inv_index_map(o, i, j):
    return i, j


def csr2ell_index_map(i, j):
    return 0, i, j


cached_bucketing_format = None


def bench_nodecomposition(
    g,
    x,
    y_golden,
    feat_size=128,
    cwm=2,
):
    indptr, indices, _ = g.adj_sparse("csc")
    m = g.num_dst_nodes()
    n = g.num_src_nodes()
    nnz = g.num_edges()
    if feat_size < 64:
        cwm = 1
    mod = tvm.IRModule.from_expr(csrmm)
    # specialize
    params = mod["main"].params
    param_map = {
        params[5]: m,  # m
        params[6]: n,  # n
        params[7]: feat_size // cwm // 32,  # num_tiles,
        params[8]: nnz,  # nnz
        params[9]: cwm,  # cwm
    }

    mod["main"] = mod["main"].specialize(param_map)

    # schedule
    mod = tvm.sparse.lower_sparse_iter(mod)
    sch = tvm.tir.Schedule(mod)
    outer_blk = sch.get_block("csrmm0")
    inner_blk = sch.get_block("csrmm1")
    (i,) = sch.get_loops(outer_blk)
    j, foo, foi, fi = sch.get_loops(inner_blk)
    sch.reorder(foo, fi, j, foi)
    sch.bind(fi, "threadIdx.x")
    sch.bind(foo, "blockIdx.y")
    sch.unroll(foi)
    io, ii = sch.split(i, [None, 8])
    sch.bind(io, "blockIdx.x")
    sch.bind(ii, "threadIdx.y")
    init_blk = sch.decompose_reduction(inner_blk, fi)
    ax0, ax1 = sch.get_loops(init_blk)[-2:]
    sch.bind(ax0, "threadIdx.x")
    mod = tvm.sparse.lower_sparse_buffer(sch.mod)
    f = tvm.build(mod["main"], target="cuda")
    # prepare nd array
    indptr_nd = tvm.nd.array(indptr.numpy().astype("int32"), device=tvm.cuda(0))
    b_nd = tvm.nd.array(
        x.numpy().reshape(-1).astype("float32"),
        device=tvm.cuda(0),
    )
    indices_nd = tvm.nd.array(indices.numpy().astype("int32"), device=tvm.cuda(0))
    c_nd = tvm.nd.array(np.zeros((n * feat_size,)).astype("float32"), device=tvm.cuda(0))
    a_nd = tvm.nd.array(np.ones((nnz,)).astype("float32"), device=tvm.cuda(0))
    args = [a_nd, b_nd, c_nd, indptr_nd, indices_nd]
    f(*args)
    tvm.testing.assert_allclose(c_nd.numpy().reshape(-1, feat_size), y_golden.numpy(), rtol=1e-4)
    evaluator = f.time_evaluator(f.entry_name, tvm.cuda(0), number=100)
    return evaluator(*args).mean * 1000


def bench_decomposition(
    g,
    x,
    y_golden,
    feat_size=128,
    bucket_sizes=[],
    coersening_factor=2,
    num_col_parts=1,
    use_implicit_unroll=False,
):
    num_buckets = len(bucket_sizes)
    coersening_factor = min(coersening_factor, feat_size // 32)
    indptr, indices, _ = g.adj_sparse("csc")
    m = g.num_dst_nodes()
    n = g.num_src_nodes()
    nnz = g.num_edges()
    global cached_bucketing_format
    indptr_nd = tvm.nd.array(indptr.numpy(), device=tvm.cpu())
    indices_nd = tvm.nd.array(indices.numpy(), device=tvm.cpu())
    cached_bucketing_format = column_part_hyb(
        m, n, indptr_nd, indices_nd, num_col_parts, bucket_sizes
    )
    row_indices, col_indices, mask = cached_bucketing_format

    # rewrite csrmm
    nnz_cols_symbol = ell.params[-1]
    rewrites = []
    for part_id in range(num_col_parts):
        for bucket_id, bucket_size in enumerate(bucket_sizes):
            rewrites.append(
                FormatRewriteRule(
                    str(part_id) + "_" + str(bucket_id),
                    ell.specialize({nnz_cols_symbol: bucket_size}),
                    ["A"],
                    ["I", "J"],
                    ["O", "I", "J"],
                    {"I": ["O", "I"], "J": ["J"]},
                    csr2ell_index_map,
                    csr2ell_inv_index_map,
                )
            )
    mod = tvm.IRModule.from_expr(csrmm)
    mod = format_decompose(mod, rewrites)
    mod = tvm.tir.transform.RemovePreprocess()(mod)

    # specialize
    params = mod["main"].params
    param_map = {
        params[5]: m,  # m
        params[6]: n,  # n
        params[7]: feat_size // coersening_factor // 32,  # num_tiles,
        params[8]: nnz,  # nnz
        params[9]: coersening_factor,  # coersening_factor
    }
    for part_id in range(num_col_parts):
        for bucket_id in range(num_buckets):
            param_map[params[10 + 7 * (part_id * num_buckets + bucket_id) + 4]] = m
            param_map[params[10 + 7 * (part_id * num_buckets + bucket_id) + 5]] = n
            param_map[params[10 + 7 * (part_id * num_buckets + bucket_id) + 6]] = row_indices[
                part_id
            ][bucket_id].shape[0]

    mod["main"] = mod["main"].specialize(param_map).with_attr("horizontal_fuse", True)

    # schedule
    sch = tvm.tir.Schedule(mod)
    for sp_iter_name in [
        "csrmm_{}_{}".format(i, j) for j in range(num_buckets) for i in range(num_col_parts)
    ]:
        sp_iteration = sch.get_sparse_iteration(sp_iter_name)
        o, i, j, k1, k2, k3 = sch.get_sp_iters(sp_iteration)
        sch.sparse_fuse(sp_iteration, [o, i])

    mod = sch.mod
    mod = tvm.sparse.lower_sparse_iter(mod)
    sch = tvm.tir.Schedule(mod)
    for part_id in range(num_col_parts):
        for bucket_id, bucket_size in enumerate(bucket_sizes):
            is_atomic = num_col_parts > 1 or bucket_id + 1 == num_buckets
            blk = sch.get_block("csrmm_{}_{}0".format(part_id, bucket_id))
            i, j, foo, foi, fi = sch.get_loops(blk)
            sch.reorder(foo, fi, j, foi)
            if is_atomic:
                sch.annotate(blk, "atomic", True)
                write_blk = sch.reverse_cache_write(blk, 0, "local")
                sch.reverse_compute_at(write_blk, fi, True)
                # sch.unroll(sch.get_loops(write_blk)[-2])
            sch.bind(fi, "threadIdx.x")
            sch.bind(foo, "blockIdx.y")
            sch.unroll(foi)
            if use_implicit_unroll:
                sch.annotate(foi, "pragma_unroll_explicit", 0)
            sch.unroll(j)
            if use_implicit_unroll:
                sch.annotate(j, "pragma_unroll_explicit", 0)
            io, ioi, ii = sch.split(i, [None, bucket_sizes[-1] // bucket_size, 8])
            sch.bind(io, "blockIdx.x")
            sch.bind(ii, "threadIdx.y")
            init_blk = sch.decompose_reduction(blk, fi)
            ax0, ax1 = sch.get_loops(init_blk)[-2:]
            sch.bind(ax0, "threadIdx.x")
            sch.unroll(ax1)
            if use_implicit_unroll:
                sch.annotate(ax1, "pragma_unroll_explicit", 0)

    mod = tvm.sparse.lower_sparse_buffer(sch.mod)
    mod = tvm.tir.transform.RemoveUnusedArgs()(mod)
    f = tvm.build(mod, target="cuda")

    # prepare nd array
    b_nd = tvm.nd.array(
        x.numpy().reshape(-1).astype("float32"),
        device=tvm.cuda(0),
    )
    c_nd = tvm.nd.array(np.zeros((n * feat_size,)).astype("float32"), device=tvm.cuda(0))
    # prepare args
    args = [b_nd, c_nd]

    for part_id in range(num_col_parts):
        for bucket_id, _ in enumerate(bucket_sizes):
            weight = tvm.nd.array(
                mask[part_id][bucket_id].numpy().reshape(-1).astype("float32"), device=tvm.cuda(0)
            )
            rows = tvm.nd.array(
                row_indices[part_id][bucket_id].numpy().astype("int32"), device=tvm.cuda(0)
            )
            cols = tvm.nd.array(
                col_indices[part_id][bucket_id].numpy().reshape(-1).astype("int32"),
                device=tvm.cuda(0),
            )
            args += [weight, rows, cols]

    # test accuracy
    f(*args)
    tvm.testing.assert_allclose(c_nd.numpy().reshape(-1, feat_size), y_golden.numpy(), rtol=1e-4)

    # evaluate time
    evaluator = f.time_evaluator(f.entry_name, tvm.cuda(0), number=100)
    return evaluator(*args).mean * 1000

def spmm_hyb(dataset="arxiv"):
    time_list = []
    parser = argparse.ArgumentParser("hybrid format spmm in sparse-tir")
    parser.add_argument("--dataset", "-d", type=str, default=dataset, help="dataset name")
    parser.add_argument("--implicit-unroll", "-i", action="store_true", help="use implicit unroll")
    args = parser.parse_args()
    name = args.dataset
    g = get_dataset(name)

    for feat_size in [32, 64, 128, 256, 512]:
        x = th.rand((g.num_src_nodes(), feat_size))
        y_golden = dgl.ops.copy_u_sum(g, x)
        exec_time = bench_decomposition(
            g,
            x,
            y_golden,
            feat_size=feat_size,
            bucket_sizes=bucketing_config[name],
            coersening_factor=2,
            num_col_parts=col_part_config[name],
            use_implicit_unroll=args.implicit_unroll,
        )
        time_list.append((feat_size, exec_time))

    return time_list

def spmm_nohyb(dataset="arxiv"):
    time_list = []
    parser = argparse.ArgumentParser("hybrid format spmm in sparse-tir")
    parser.add_argument("--dataset", "-d", type=str, default=dataset, help="dataset name")
    args = parser.parse_args()
    name = args.dataset
    g = get_dataset(name)

    for feat_size in [32, 64, 128, 256, 512]:
        x = th.rand((g.num_src_nodes(), feat_size))
        y_golden = dgl.ops.copy_u_sum(g, x)
        exec_time = bench_nodecomposition(
            g,
            x,
            y_golden,
            feat_size=feat_size,
            cwm=2,
        )
        time_list.append((feat_size, exec_time))

    return time_list


if __name__ == "__main__":
    dataset_name = ["arxiv", "proteins", "pubmed", "citeseer", "cora", "ppi", "reddit"]
    time_log = {}
    speedup = lambda x, y: x / y
    for dataset in dataset_name:
        nodecomposition_time = spmm_nohyb(dataset)
        decomposition_time = spmm_hyb(dataset)
        # ratio of hyb to non-hyb running time per feat_size (lower means hyb is faster)
        time_log[dataset] = [decomposition_time[i][1] / nodecomposition_time[i][1] for i in range(5)]

    print(time_log)
```

### Installing taco

```bash
git clone https://github.com/tensor-compiler/taco.git
cd taco
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DPYTHON=ON -DCUDA=ON ..
make -j8
export PYTHONPATH=/root/taco/build/lib:$PYTHONPATH
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
```

Note that the system Python version must match the one in conda: if conda provides Python 3.9.16, the machine itself must also have Python 3.9.16 installed.

After installation, run:

```bash
python build/python_bindings/unit_tests.py
```

to check whether the tests pass.
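Once the unit tests pass, the bindings can also be exercised directly. Below is a minimal SpMV smoke test of `pytaco` (my own sketch, not part of the artifact), assuming `PYTHONPATH` is set as above so that `pytaco` resolves to `/root/taco/build/lib`:

```python
# spmv_check.py -- a small pytaco smoke test (a sketch; assumes PYTHONPATH
# points at /root/taco/build/lib so that `pytaco` is importable).
import numpy as np
import pytaco as pt

# A 3x3 CSR matrix with two nonzeros.
A = pt.tensor([3, 3], pt.csr)
A.insert([0, 0], 2.0)
A.insert([1, 2], 3.0)

# Dense operand and result vectors (float32, matching taco's default dtype).
x = pt.from_array(np.array([1.0, 2.0, 3.0], dtype=np.float32))
y = pt.tensor([3], pt.dense)

# y(i) = A(i, j) * x(j): taco generates and compiles the SpMV kernel on the fly.
i, j = pt.get_index_vars(2)
y[i] = A[i, j] * x[j]
print(pt.to_array(y))  # expected: [2., 9., 0.]
```

If this prints the expected vector, the taco toolchain used by the SparseTIR baselines is in working order.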