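Lab Preparation

Here are a few resources for learning how a cache works:

For the lab itself, read the write-up carefully and, if needed, install Valgrind (search the web for installation instructions).

I moved my lab environment from WSL2 to a native Linux machine, but that makes no difference for this lab.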

Part A

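Requirements

Implement a command-line tool, csim, that simulates how a cache behaves. The official reference version works as follows:

The parameters s, E, and b have the meanings shown in the figure: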
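As a quick numeric illustration of these parameters, here is a minimal sketch that derives the number of sets, the block size, and the total capacity from s, E, and b, assuming the usual definitions S = 2^s and B = 2^b. The names cache_geometry and make_geometry are made up for this example and are not part of the lab handout.

```c
#include <assert.h>

/* Hypothetical helper type, used only for this illustration. */
struct cache_geometry {
  unsigned long sets;        /* S = 2^s */
  unsigned long lines;       /* E lines per set */
  unsigned long block_bytes; /* B = 2^b */
  unsigned long capacity;    /* S * E * B bytes of cached data */
};

/* Derive the cache geometry from the three command-line parameters. */
struct cache_geometry make_geometry(int s, int E, int b) {
  assert(s >= 0 && b >= 0 && E > 0);
  struct cache_geometry g;
  g.sets = 1UL << s;
  g.lines = (unsigned long)E;
  g.block_bytes = 1UL << b;
  g.capacity = g.sets * g.lines * g.block_bytes;
  return g;
}
```

For example, make_geometry(5, 1, 5) describes a direct-mapped cache with 32 sets of one 32-byte block each, 1024 bytes in total.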

When writing a command-line tool, we can parse the arguments with the getopt() function declared in <unistd.h>; this is mentioned in the slides referenced earlier.

Data Structure

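First, we design the cache's data structure from the figure above. It is clearly a matrix (S sets by E lines) in which each entry is a struct holding a valid bit, a tag, and a block. We also need to implement LRU, so each entry additionally carries a timestamp recording when it was last used.

This leads to the following data structure: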

struct cache_line {
  u_int16_t valid;
  u_int32_t tag;
  u_int32_t *block;
  u_int32_t timestamp;
} **cache;

Note that I made block an array here. In fact this member is not needed at all, because the simulator stores no data; it only simulates hits, misses, and evictions.
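Since the block member is optional, a leaner variant could drop it and allocate the whole S×E matrix up front, zero-initialized so that every valid bit starts at 0. The following is only a sketch under those assumptions: slim_line, alloc_cache, and free_cache are hypothetical names, and it uses the fixed-width types from <stdint.h> rather than the u_int types above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical line type without the block member. */
struct slim_line {
  uint8_t valid;
  uint64_t tag;
  uint64_t timestamp;
};

/* Release a cache allocated by alloc_cache; NULL rows are skipped by free. */
void free_cache(struct slim_line **c, int sets) {
  if (c == NULL) {
    return;
  }
  for (int i = 0; i < sets; i++) {
    free(c[i]);
  }
  free(c);
}

/* Allocate sets x lines entries, all zeroed (valid == 0 everywhere).
 * Returns NULL if any allocation fails. */
struct slim_line **alloc_cache(int sets, int lines) {
  assert(sets > 0 && lines > 0);
  struct slim_line **c = calloc((size_t)sets, sizeof *c);
  if (c == NULL) {
    return NULL;
  }
  for (int i = 0; i < sets; i++) {
    c[i] = calloc((size_t)lines, sizeof **c);
    if (c[i] == NULL) {
      free_cache(c, sets); /* rows not yet allocated are still NULL */
      return NULL;
    }
  }
  return c;
}
```

Allocating everything eagerly removes the need for a NULL check per set during simulation, at the cost of memory for sets that are never touched.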

Control Flow

We can design the control flow as follows:

sequenceDiagram
  main()->>parsing_cmd(): "Parsing Command Line"
  activate parsing_cmd()
  parsing_cmd()-->>main(): "Some Result"
  deactivate parsing_cmd()
  main()->> eval(): "start to simulate"
  activate eval()
  loop UNTIL EOF
    eval() ->> get_page(): "get page from disk or cache"
    get_page() -x eval(): "finish"
  end
  eval()-->>main(): "finish"
  deactivate eval()

With the problem divided up, we can tackle each subproblem in turn.

parsing_cmd

In practice, this logic can live directly in main:

#include "cachelab.h" // printSummary(), provided by the lab handout
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h> // u_int16_t, u_int32_t, u_int64_t
#include <unistd.h>    // getopt(), optarg

struct cache_line {
  u_int16_t valid;
  u_int32_t tag;
  u_int32_t *block;
  u_int32_t timestamp;
} **cache;

int S, s, E, b, verbose;

u_int32_t hits, misses, evictions;

char *trace_file;

void eval(void); // defined further down

int main(int argc, char *argv[]) {
  int opt;
  while (-1 != (opt = getopt(argc, argv, "hvs:E:b:t:"))) {
    switch (opt) {
    case 's':
      s = atoi(optarg);
      S = (1 << s);
      break;
    case 'E':
      E = atoi(optarg);
      break;
    case 'b':
      b = atoi(optarg);
      break;
    case 't':
      trace_file = optarg;
      break;
    case 'h':
      fprintf(stdout,
              "Usage: %s [-hv] -s <s> -E <E> -b <b> -t <tracefile>\n"
              "  -h: Optional help flag that prints usage info\n"
              "  -v: Optional verbose flag that displays trace info\n"
              "  -s <s>: Number of set index bits (S = 2^s is the number of sets)\n"
              "  -E <E>: Associativity (number of lines per set)\n"
              "  -b <b>: Number of block bits (B = 2^b is the block size)\n"
              "  -t <tracefile>: Name of the valgrind trace to replay\n",
              argv[0]);
      exit(0);
    case 'v':
      verbose = 1;
      break;
    default:
      fprintf(stderr, "Usage: %s [-hv] -s <s> -E <E> -b <b> -t <tracefile>\n",
              argv[0]);
      exit(1);
    }
  }

  // Reject missing or invalid arguments before touching the cache
  if (S <= 0 || E <= 0 || b < 0 || trace_file == NULL) {
    fprintf(stderr, "Usage: %s [-hv] -s <s> -E <E> -b <b> -t <tracefile>\n",
            argv[0]);
    exit(1);
  }

  // Allocate the array of sets; calloc zeroes it, so every set starts as NULL
  cache = calloc(S, sizeof(struct cache_line *));
  if (cache == NULL) {
    perror("calloc");
    exit(1);
  }

  eval();

  // Free cache (free(NULL) is a no-op for sets that were never touched)
  for (int i = 0; i < S; i++) {
    free(cache[i]);
  }
  free(cache);

  printSummary(hits, misses, evictions);

  return 0;
}

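If you are not familiar with getopt, consult the slides or read the manual with man 3 getopt.

The skeleton needs little explanation; the focus should be on the eval function.

eval

Lines that start with anything other than S, L, or M need no processing (bear in mind that the input may contain characters such as \r and \n), and S, L, and M are all handled alike, by calling get_page(). The code is therefore straightforward: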

void get_page(char identifier, u_int64_t address, u_int32_t size);

void eval() {
  FILE *file = fopen(trace_file, "r");
  if (file == NULL) {
    perror(trace_file);
    exit(1);
  }
  char identifier;
  u_int64_t address;
  u_int32_t size;

  // The leading space in the format skips any whitespace, including the
  // '\r' and '\n' left over from the previous line.
  while (fscanf(file, " %c %lx,%u", &identifier, &address, &size) == 3) {
    switch (identifier) {
    case 'I': // instruction fetch: ignored by the simulator
      break;
    case 'L':
      get_page(identifier, address, size);
      break;
    case 'S':
      get_page(identifier, address, size);
      break;
    case 'M': // modify = one load followed by one store
      get_page(identifier, address, size);
      get_page(identifier, address, size);
      break;
    default:
      fprintf(stderr, "Unrecognized identifier: %c\n", identifier);
      exit(1);
    }
  }
  fclose(file);
}

Note that M stands for modify, that is, a load (L) followed by a store (S), so the page has to be accessed (get) twice.
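As a side note, the trace can also be read line by line with fgets and sscanf, which makes the load-then-store rule for M easy to express as a count. The following is a minimal sketch, assuming trace lines of the form " L 10,1" (operation, hexadecimal address, size); accesses_for_line is a hypothetical helper and not part of the code above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse one trace line and report how many cache
 * accesses it stands for (0 = ignore, 1 = load or store, 2 = modify). */
int accesses_for_line(const char *line, unsigned long *address,
                      unsigned *size) {
  assert(line != NULL && address != NULL && size != NULL);
  char op;
  /* The leading space skips indentation, so " %c" never yields whitespace. */
  if (sscanf(line, " %c %lx,%u", &op, address, size) != 3) {
    return 0; /* blank or malformed line */
  }
  switch (op) {
  case 'L':
  case 'S':
    return 1;
  case 'M':
    return 2; /* one load followed by one store */
  default:
    return 0; /* 'I' (instruction fetch) and anything else */
  }
}
```

A driver loop could read each line with fgets and then call get_page as many times as this function returns.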

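The logic of get_page is just as straightforward:

- Extract the three values set_index, tag, and offset from the address.
- Check for a cold miss on the set, i.e. cache[set_index] == NULL (the set holds no data at all yet).
  - If so, load the page from disk directly. Here that only means setting the tag of one line in cache[set_index] to the tag of this address and marking the line valid. (Strictly speaking, the "page" is a block and the "disk" is the next level of the memory hierarchy.)
  - If not, check whether the page has already been brought into the cache.
    - If it has, we have a hit: reset its timestamp to the newest value, age the timestamps of all entries, and return.
    - Otherwise, the page has to be loaded from disk.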
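The first step, splitting an address into its three fields, can be sketched in isolation as follows. The addr_parts type and split_address function are hypothetical names for this illustration, and the sketch uses 64-bit fields so that no tag bits are lost for large addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three fields of an address. */
struct addr_parts {
  uint64_t tag;
  uint64_t set_index;
  uint64_t offset;
};

/* Layout from high to low bits: tag | set index (s bits) | offset (b bits). */
struct addr_parts split_address(uint64_t address, int s, int b) {
  assert(s >= 0 && b >= 0 && s + b < 64);
  struct addr_parts p;
  p.offset = address & ((UINT64_C(1) << b) - 1);
  p.set_index = (address >> b) & ((UINT64_C(1) << s) - 1);
  p.tag = address >> (s + b);
  return p;
}
```

With s = 4 and b = 4, for instance, the address 0x7ff000398 splits into offset 0x8, set index 0x9, and tag 0x7ff0003.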

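When loading a page, we must check whether the target set is already full (it suffices to check whether every valid bit is 1). If it is, one page has to be evicted before the new one can be loaded.

The code is as follows: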

void update_cache(void);
void insert_page(u_int32_t set_index, u_int32_t tag, u_int32_t offset);
void evict_page(u_int32_t set_index, u_int32_t tag, u_int32_t offset);

void get_page(char identifier, u_int64_t address, u_int32_t size) {
  u_int32_t offset = address & ((1 << b) - 1);
  u_int32_t set_index = (address >> b) & ((1 << s) - 1);
  u_int32_t tag = address >> (b + s);

  if (verbose) {
    printf("%c %lx,%u ", identifier, address, size);
  }

  if (cache[set_index] == NULL) {
    // Cold miss on a set that was never touched: allocate it zeroed so that
    // every valid bit starts at 0
    cache[set_index] = calloc(E, sizeof(struct cache_line));
    if (cache[set_index] == NULL) {
      perror("calloc");
      exit(1);
    }
    insert_page(set_index, tag, offset);
  } else {

    for (int i = 0; i < E; i++) {
      if (cache[set_index][i].valid && cache[set_index][i].tag == tag) {
        hits++;
        // Mark as most recently used, then age every line
        cache[set_index][i].timestamp = 1;
        update_cache();
        if (verbose) {
          printf("hit\n");
        }
        return;
      }
    }

    insert_page(set_index, tag, offset);
  }

  if (verbose) {
    putchar('\n');
  }
}

// Age every valid line by one tick; a larger timestamp means less recently used
void update_cache() {
  for (int i = 0; i < S; i++) {
    struct cache_line *set = cache[i];
    if (set == NULL)
      continue;
    for (int j = 0; j < E; j++) {
      if (set[j].valid) {
        set[j].timestamp++;
      }
    }
  }
}

void insert_page(u_int32_t set_index, u_int32_t tag, u_int32_t offset) {
  misses++;
  if (verbose) {
    printf("miss ");
  }
  struct cache_line *set = cache[set_index];
  for (int i = 0; i < E; i++) {
    if (set[i].valid == 0) {
      set[i].valid = 1;
      set[i].tag = tag;
      set[i].timestamp = 1;
      // The block contents are never read; this only gives offset a role
      set[i].block = calloc(1 << b, sizeof(u_int32_t));
      if (set[i].block != NULL) {
        set[i].block[offset] = 1;
      }
      update_cache();
      return;
    }
  }
  // No free line in this set: evict the least recently used one
  evict_page(set_index, tag, offset);
}

void evict_page(u_int32_t set_index, u_int32_t tag, u_int32_t offset) {
  evictions++;
  if (verbose) {
    printf("eviction ");
  }
  struct cache_line *set = cache[set_index];
  u_int32_t max = 0;
  int max_index = 0;
  // The line with the largest timestamp is the least recently used
  for (int i = 0; i < E; i++) {
    if (set[i].timestamp > max) {
      max = set[i].timestamp;
      max_index = i;
    }
  }
  // Reuse the victim line for the incoming page
  set[max_index].valid = 1;
  set[max_index].tag = tag;
  set[max_index].timestamp = 1;
  // Age every line here too, as on a hit or a plain insert; otherwise a
  // later access could end up with the same timestamp as this line
  update_cache();
}

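Note that offset serves no real purpose here; I gave it a small role only for the sake of the simulation.

At the end of evict_page, we overwrite the tag directly with that of the incoming page, which spares us the pain of calling get_page again.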

Results

Part B

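Getting this part to work is easy, but getting full marks is hard; I will write a separate, detailed post on tuning later.

The simplest approach is blocking. Since a cache block here is 32 bytes and an int is 4 bytes, we can copy values eight at a time, which leads to the following code: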

char transpose_submit_desc[] = "Transpose submission";
// Transposes the N x M matrix A into the M x N matrix B in 8 x 8 tiles.
// Assumes both N and M are multiples of 8.
void transpose_submit(int M, int N, int A[N][M], int B[M][N]) {
  int tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
  for (int i = 0; i < N; i += 8) {     // rows of A
    for (int j = 0; j < M; j += 8) {   // columns of A
      for (int k = i; k < i + 8; k++) {
        // Read one full row of the tile (a single cache block of A) first
        tmp0 = A[k][j];
        tmp1 = A[k][j + 1];
        tmp2 = A[k][j + 2];
        tmp3 = A[k][j + 3];
        tmp4 = A[k][j + 4];
        tmp5 = A[k][j + 5];
        tmp6 = A[k][j + 6];
        tmp7 = A[k][j + 7];

        // Then write it out as a column of the corresponding tile of B
        B[j][k] = tmp0;
        B[j + 1][k] = tmp1;
        B[j + 2][k] = tmp2;
        B[j + 3][k] = tmp3;
        B[j + 4][k] = tmp4;
        B[j + 5][k] = tmp5;
        B[j + 6][k] = tmp6;
        B[j + 7][k] = tmp7;
      }
    }
  }
}

Of course, this code cannot reach the best possible result: the theoretical miss count for a 32×32 matrix is $256$, whereas ours is $288$.
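To see why 8×8 tiles suit the 32×32 case, it helps to compute which cache set each matrix element falls into. The sketch below assumes the Part B cache parameters s = 5, E = 1, b = 5, a 4-byte int, and a matrix stored row by row starting at the given base address; set_of_element is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: cache set of element [i][j] in a row-major int matrix
 * with `cols` columns, for a cache with 2^5 sets and 2^5-byte blocks. */
unsigned set_of_element(uint64_t base, int cols, int i, int j) {
  assert(cols > 0 && i >= 0 && j >= 0 && j < cols);
  uint64_t address = base + 4u * ((uint64_t)i * (uint64_t)cols + (uint64_t)j);
  return (unsigned)((address >> 5) & 31u); /* drop 5 offset bits, keep 5 set bits */
}
```

With 32 columns a row spans 128 bytes, i.e. 4 blocks, so rows that are 8 apart map to the same set and an 8×8 tile touches 8 distinct sets. With 64 columns a row spans 8 blocks, so rows that are only 4 apart already collide.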

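If we apply the same code to a 64×64 matrix, the miss count reaches a staggering $4612$, which is clearly unacceptable. One remedy is to shrink the block size; with 4×4 blocks, for example, the miss count goes down: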
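A 4×4 blocked transpose along these lines might look like the following minimal sketch. It is an illustration under stated assumptions rather than a tuned solution: transpose_4x4 is a hypothetical name, both dimensions are assumed to be multiples of 4, and no miss count is claimed for it.

```c
#include <assert.h>

/* Transpose the N x M matrix A into the M x N matrix B in 4 x 4 tiles.
 * Assumes N and M are multiples of 4. */
void transpose_4x4(int M, int N, int A[N][M], int B[M][N]) {
  assert(M > 0 && N > 0 && M % 4 == 0 && N % 4 == 0);
  int t0, t1, t2, t3;
  for (int i = 0; i < N; i += 4) {     /* rows of A */
    for (int j = 0; j < M; j += 4) {   /* columns of A */
      for (int k = i; k < i + 4; k++) {
        /* Read four neighbouring values of one row of A first ... */
        t0 = A[k][j];
        t1 = A[k][j + 1];
        t2 = A[k][j + 2];
        t3 = A[k][j + 3];
        /* ... then store them as part of one column of B. */
        B[j][k] = t0;
        B[j + 1][k] = t1;
        B[j + 2][k] = t2;
        B[j + 3][k] = t3;
      }
    }
  }
}
```

The structure mirrors the 8×8 version above; only the tile size and the number of temporaries change.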
