Hello / 你好

Gu EnHao's study blog

Sharing code and study notes.

std::string gpt_random_prompt(std::mt19937 & rng) {
    const int r = rng() % 10;
    switch (r) {
        case 0: return "So";
        case 1: return "Once upon a time";
        case 2: return "When";
        case 3: return "The";
        case 4: return "After";
        case 5: return "If";
        case 6: return "import";
        case 7: return "He";
        case 8: return "She";
        case 9: return "They";
    }

    return "The";
}

vocab

vocabulary

The vocabulary converts natural-language text into a numeric representation that a computer can process.

A typical text-processing pipeline has two steps:

Tokenization: split the input natural-language text into a sequence of tokens according to some rule; a token can be a word, a subword, or a single character.

Lookup: look up each token produced by tokenization in the vocabulary to obtain its numeric ID.
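
The two steps are easy to demonstrate with a toy example. The sketch below is not the gpt_tokenize implementation from the ggml example; the whitespace split and the tiny vocabulary (including the <unk> fallback) are made up purely for illustration.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main() {
    // hypothetical vocabulary: token string -> numeric ID
    std::map<std::string, int> vocab = { {"the", 1}, {"cat", 2}, {"sat", 3}, {"<unk>", 0} };

    std::istringstream in("the cat sat");
    std::vector<int> ids;

    std::string tok;
    while (in >> tok) {                        // step 1: tokenization (whitespace split)
        auto it = vocab.find(tok);             // step 2: vocabulary lookup
        ids.push_back(it != vocab.end() ? it->second : vocab["<unk>"]);
    }

    for (int id : ids) std::cout << id << ' '; // prints: 1 2 3
    std::cout << '\n';
}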

gpt2_eval

bool gpt2_eval(
        const gpt2_model & model,
        const int n_threads,
        const int n_past,
        const std::vector<gpt_vocab::id> & embd_inp,
              std::vector<float>         & embd_w,
              size_t                     & mem_per_token);
  • model: the GPT-2 weights and hyperparameters currently loaded; gpt2_model holds all of the ggml tensors (embeddings, per-layer weights, the KV cache, and so on) that the forward pass reads.
  • n_threads: the number of threads used when running ggml_graph_compute_with_ctx, i.e. the degree of parallelism of the forward pass (the -t/--threads command-line option).
  • n_past: the number of tokens already processed, i.e. the context length currently stored in the KV cache; self-attention uses it to compute the write/read offsets into memory_k/v and to mask out positions beyond the available history in the causal mask.
  • embd_inp: the batch of token ids to feed to the model in this call (typically the next chunk of the prompt or the token that was just sampled), of type std::vector<gpt_vocab::id>.
  • embd_w: output parameter; the function writes the forward result (the logits of the last token, of length n_vocab) into this vector for the caller to sample from.
  • mem_per_token: an estimate of the scratch memory needed per token; pass 0 on the first call and the function fills it in from ggml_used_mem(ctx0)/N, after which the caller can size its buffer accordingly and avoid repeated reallocations.

In short, gpt2_eval takes the model, the input tokens, the thread count, and the context length, computes the final logits, and reports the memory usage; the outer generation loop then samples the next token from those logits.

ggml_tensor

struct ggml_tensor {
    enum ggml_type type;

    struct ggml_backend_buffer * buffer;

    int64_t ne[GGML_MAX_DIMS]; // number of elements per dimension
    size_t  nb[GGML_MAX_DIMS]; // stride in bytes per dimension

    enum ggml_op op;           // the operator that produced this tensor
    int32_t op_params[...];    // operator parameters (elided here)

    int32_t flags;

    struct ggml_tensor * src[GGML_MAX_SRC]; // source tensor pointers
    struct ggml_tensor * view_src;
    size_t               view_offs;

    void * data;               // pointer to the actual data
    char   name[GGML_MAX_NAME];
    void * extra;
    char   padding[8];
};
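
Since ne and nb do most of the work when reading ggml kernels, a small sketch can make them concrete. This is a minimal example, assuming ggml.h and the ggml library are available to build against; the tensor sizes are arbitrary.

#include <stdio.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // ne[0] = 4 is the fastest-varying (innermost) dimension, ne[1] = 3 rows
    struct ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);

    // for a contiguous F32 tensor: nb[0] = sizeof(float), nb[1] = ne[0]*nb[0], and so on
    for (int i = 0; i < 4; i++) {
        printf("ne[%d] = %lld, nb[%d] = %zu\n", i, (long long) t->ne[i], i, t->nb[i]);
    }

    ggml_free(ctx);
    return 0;
}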

ggml_op

Common ops (non-exhaustive):

| op               | purpose                                   |
| ---------------- | ----------------------------------------- |
| GGML_OP_MUL_MAT  | Matrix multiplication                     |
| GGML_OP_NORM     | Norm step used in layer/RMS normalization |
| GGML_OP_SOFT_MAX | Softmax                                   |
| GGML_OP_ROPE     | Rotary positional embedding               |

gpt2-ctx-main

int main(int argc, char ** argv) {
ggml_time_init();

const int64_t t_main_start_us = ggml_time_us();

gpt_params params;
params.model = "models/gpt-2-117M/ggml-model.bin";

if (gpt_params_parse(argc, argv, params) == false) {
return 1;
}

if (params.seed < 0) {
params.seed = time(NULL);
}

printf("%s: seed = %d\n", __func__, params.seed);

std::mt19937 rng(params.seed);
if (params.prompt.empty()) {
params.prompt = gpt_random_prompt(rng);
}

int64_t t_load_us = 0;

gpt_vocab vocab;
gpt2_model model;

// load the model
{
const int64_t t_start_us = ggml_time_us();

if (!gpt2_model_load(params.model, model, vocab)) {
fprintf(stderr, "%s: failed to load model from '%s'\n", __func__, params.model.c_str());
return 1;
}

t_load_us = ggml_time_us() - t_start_us;

test_gpt_tokenizer(vocab, params.token_test);
}

int n_past = 0;

int64_t t_sample_us = 0;
int64_t t_predict_us = 0;

std::vector<float> logits;

// tokenize the prompt
std::vector<gpt_vocab::id> embd_inp = ::gpt_tokenize(vocab, params.prompt);

params.n_predict = std::min(params.n_predict, model.hparams.n_ctx - (int) embd_inp.size());

printf("%s: prompt: '%s'\n", __func__, params.prompt.c_str());
printf("%s: number of tokens in prompt = %zu, first 8 tokens: ", __func__, embd_inp.size());
for (int i = 0; i < std::min(8, (int) embd_inp.size()); i++) {
printf("%d ", embd_inp[i]);
}
printf("\n\n");

// submit the input prompt token-by-token
// this reduces the memory usage during inference, at the cost of a bit of speed at the beginning
std::vector<gpt_vocab::id> embd;

// determine the required inference memory per token:
size_t mem_per_token = 0;
gpt2_eval(model, params.n_threads, 0, { 0, 1, 2, 3 }, logits, mem_per_token);

for (size_t i = embd.size(); i < embd_inp.size() + params.n_predict; i++) {
// predict
if (embd.size() > 0) {
const int64_t t_start_us = ggml_time_us();

if (!gpt2_eval(model, params.n_threads, n_past, embd, logits, mem_per_token)) {
printf("Failed to predict\n");
return 1;
}

t_predict_us += ggml_time_us() - t_start_us;
}

n_past += embd.size();
embd.clear();

if (i >= embd_inp.size()) {
// sample next token
const int top_k = params.top_k;
const float top_p = params.top_p;
const float temp = params.temp;

const int n_vocab = model.hparams.n_vocab;

gpt_vocab::id id = 0;

{
const int64_t t_start_sample_us = ggml_time_us();

id = gpt_sample_top_k_top_p(vocab, logits.data() + (logits.size() - n_vocab), top_k, top_p, temp, rng);

t_sample_us += ggml_time_us() - t_start_sample_us;
}

// add it to the context
embd.push_back(id);
} else {
// if here, it means we are still processing the input prompt
for (size_t k = i; k < embd_inp.size(); k++) {
embd.push_back(embd_inp[k]);
if (int32_t(embd.size()) >= params.n_batch) {
break;
}
}
i += embd.size() - 1;
}

// display text
for (auto id : embd) {
printf("%s", vocab.id_to_token[id].c_str());
}
fflush(stdout);

// end of text token
if (embd.back() == 50256) {
break;
}
}

// report timing
{
const int64_t t_main_end_us = ggml_time_us();

printf("\n\n");
printf("%s: mem per token = %8zu bytes\n", __func__, mem_per_token);
printf("%s: load time = %8.2f ms\n", __func__, t_load_us/1000.0f);
printf("%s: sample time = %8.2f ms\n", __func__, t_sample_us/1000.0f);
printf("%s: predict time = %8.2f ms / %.2f ms per token\n", __func__, t_predict_us/1000.0f, t_predict_us/1000.0f/n_past);
printf("%s: total time = %8.2f ms\n", __func__, (t_main_end_us - t_main_start_us)/1000.0f);
}

ggml_free(model.ctx_w);

return 0;
}

Forward and backward passes

Forward

  • At the application level (e.g. examples/gpt-2/main-ctx.cpp (lines 392-695)), the forward pass follows ggml's "build the graph, then execute it" pattern: create the input tensors in a fresh ggml_context, stack operators such as ggml_mul_mat, ggml_norm, and ggml_soft_max_inplace into the full Transformer compute graph, then run it with ggml_build_forward_expand and ggml_graph_compute_with_ctx. No matrix math is written by hand at this level; every operator is defined in the ggml C core (ggml/src/ggml.c).

Backward

  • ggml implements automatic differentiation in ggml/src/ggml.c (lines 6025-6669): ggml_compute_backward() dispatches on each graph node's op type and accumulates gradients through primitive operators such as ggml_add_or_set, ggml_mul, and ggml_repeat_back. For example, GGML_OP_MUL_MAT uses the standard "upstream gradient × transposed weights" / "transposed input × upstream gradient" formulas, and GGML_OP_SOFT_MAX, GGML_OP_ROPE, and friends have dedicated backward functions. Graph traversal, gradient storage (cgraph->grads), and gradient accumulation all live in this code.

The forward pass is "describe the compute graph with the ggml API in C++, then let the ggml core operators execute it";

the backward pass is ggml_compute_backward generating, for each ggml_op, the gradient tensor from the corresponding formula and propagating it back through the graph.
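
To make the "build the graph, then execute it" pattern concrete, here is a minimal sketch. It assumes a reasonably recent ggml checkout where ggml_new_graph and ggml_graph_compute_with_ctx are available; the tensor sizes and constant fill values are arbitrary toy choices, not anything from the gpt-2 example.

#include <stdio.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // leading dimension first: a is a 2x2 matrix, b is a 2-element vector
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 2);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 2);
    ggml_set_f32(a, 2.0f); // fill with constants just to have data
    ggml_set_f32(b, 3.0f);

    // describe the computation: c = mul_mat(a, b); no math happens yet
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);

    // build the graph and execute it
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    for (int i = 0; i < 2; i++) {
        printf("c[%d] = %f\n", i, ggml_get_f32_1d(c, i)); // each element is 2*3 + 2*3 = 12
    }

    ggml_free(ctx);
    return 0;
}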

ggml_mul_mat

// ggml_mul_mat

static inline bool ggml_can_mul_mat(const struct ggml_tensor * t0, const struct ggml_tensor * t1) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");

    return (t0->ne[0] == t1->ne[0])   &&
           (t1->ne[2]%t0->ne[2] == 0) && // verify t0 is broadcastable
           (t1->ne[3]%t0->ne[3] == 0);
}

struct ggml_tensor * ggml_mul_mat(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b) {
    GGML_ASSERT(ggml_can_mul_mat(a, b));
    GGML_ASSERT(!ggml_is_transposed(a));

    const int64_t ne[4] = { a->ne[1], b->ne[1], b->ne[2], b->ne[3] };
    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);

    result->op     = GGML_OP_MUL_MAT;
    result->src[0] = a;
    result->src[1] = b;

    return result;
}

void ggml_mul_mat_set_prec(
        struct ggml_tensor * a,
        enum ggml_prec       prec) {
    GGML_ASSERT(a->op == GGML_OP_MUL_MAT);

    const int32_t prec_i32 = (int32_t) prec;

    ggml_set_op_params_i32(a, 0, prec_i32);
}
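
A note on shapes, read directly off the code above: the assert requires a->ne[0] == b->ne[0] (call it K, the contraction length), and the result is created with ne = { a->ne[1], b->ne[1], b->ne[2], b->ne[3] }. So if a holds ne = [K, M] and b holds ne = [K, N], the product has ne = [M, N]; element-wise this is result[m, n] = Σ_k a[k, m] · b[k, n], i.e. every length-K row of b is dotted against every row of a. The modulo checks on ne[2] and ne[3] let a single weight matrix a be broadcast across the extra batch dimensions of b.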

ggml_norm

// ggml_norm

static struct ggml_tensor * ggml_norm_impl(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        float                 eps,
        bool                  inplace) {
    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    ggml_set_op_params(result, &eps, sizeof(eps));

    result->op     = GGML_OP_NORM;
    result->src[0] = a;

    return result;
}

struct ggml_tensor * ggml_norm(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        float                 eps) {
    return ggml_norm_impl(ctx, a, eps, false);
}

struct ggml_tensor * ggml_norm_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        float                 eps) {
    return ggml_norm_impl(ctx, a, eps, true);
}

ggml_softmax

// ggml_soft_max

static struct ggml_tensor * ggml_soft_max_impl(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias,
        bool                  inplace) {
    GGML_ASSERT(ggml_is_contiguous(a));

    if (mask) {
        GGML_ASSERT(mask->type == GGML_TYPE_F16 || mask->type == GGML_TYPE_F32);
        GGML_ASSERT(ggml_is_contiguous(mask));
        GGML_ASSERT(mask->ne[0] == a->ne[0]);
        GGML_ASSERT(mask->ne[1] >= a->ne[1]);
        GGML_ASSERT(a->ne[2]%mask->ne[2] == 0);
        GGML_ASSERT(a->ne[3]%mask->ne[3] == 0);
    }

    if (max_bias > 0.0f) {
        GGML_ASSERT(mask);
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    float params[] = { scale, max_bias };
    ggml_set_op_params(result, params, sizeof(params));

    result->op     = GGML_OP_SOFT_MAX;
    result->src[0] = a;
    result->src[1] = mask;

    return result;
}

struct ggml_tensor * ggml_soft_max(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_soft_max_impl(ctx, a, NULL, 1.0f, 0.0f, false);
}

struct ggml_tensor * ggml_soft_max_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_soft_max_impl(ctx, a, NULL, 1.0f, 0.0f, true);
}

struct ggml_tensor * ggml_soft_max_ext(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias) {
    return ggml_soft_max_impl(ctx, a, mask, scale, max_bias, false);
}

struct ggml_tensor * ggml_soft_max_ext_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias) {
    return ggml_soft_max_impl(ctx, a, mask, scale, max_bias, true);
}

void ggml_soft_max_add_sinks(
        struct ggml_tensor * a,
        struct ggml_tensor * sinks) {
    if (!sinks) {
        a->src[2] = NULL;
        return;
    }

    GGML_ASSERT(a->op == GGML_OP_SOFT_MAX);
    GGML_ASSERT(a->src[2] == NULL);
    GGML_ASSERT(a->src[0]->ne[2] == sinks->ne[0]);
    GGML_ASSERT(sinks->type == GGML_TYPE_F32);

    a->src[2] = sinks;
}

This is the standard C idiom for returning a pointer to a struct: the return type of ggml_soft_max_inplace is struct ggml_tensor *, i.e. a pointer to a ggml_tensor structure. Because the source never gives struct ggml_tensor a typedef alias, the declaration must spell out struct ggml_tensor *; with a C++-style typedef struct ggml_tensor ggml_tensor;, one could simply write ggml_tensor *. The notation only looks odd because ggml sticks to the most traditional C style for compatibility with plain C compilers. The function body just forwards the argument a, together with the default parameters, to the internal implementation ggml_soft_max_impl and returns its result.
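
As a side note, the same idiom can be shown with a self-contained toy that has nothing to do with ggml; the struct and function names below are made up for illustration.

#include <stdio.h>

struct point { int x, y; };          // no typedef: the type is spelled "struct point"
typedef struct point point_t;        // optional alias, C++-style convenience

static struct point * move_right(struct point * p) { // returns a pointer to the struct
    p->x += 1;
    return p;
}

int main(void) {
    struct point a = { 0, 0 };
    point_t * b = move_right(&a);    // the alias and the struct tag name the same type
    printf("%d %d\n", b->x, b->y);
    return 0;
}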

ggml_rope

// ggml_rope

static struct ggml_tensor * ggml_rope_impl(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int sections[GGML_MROPE_SECTIONS],
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow,
bool inplace) {
GGML_ASSERT((mode & 1) == 0 && "mode & 1 == 1 is no longer supported");

GGML_ASSERT(ggml_is_vector(b));
GGML_ASSERT(b->type == GGML_TYPE_I32);

bool mrope_used = mode & GGML_ROPE_TYPE_MROPE;
if (mrope_used) {
GGML_ASSERT(a->ne[2] * 4 == b->ne[0]); // mrope expecting 4 position ids per token
} else {
GGML_ASSERT(a->ne[2] == b->ne[0]);
}

if (c) {
GGML_ASSERT(c->type == GGML_TYPE_F32);
GGML_ASSERT(c->ne[0] >= n_dims / 2);
}

struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

int32_t params[15] = { /*n_past*/ 0, n_dims, mode, /*n_ctx*/ 0, n_ctx_orig };
memcpy(params + 5, &freq_base, sizeof(float));
memcpy(params + 6, &freq_scale, sizeof(float));
memcpy(params + 7, &ext_factor, sizeof(float));
memcpy(params + 8, &attn_factor, sizeof(float));
memcpy(params + 9, &beta_fast, sizeof(float));
memcpy(params + 10, &beta_slow, sizeof(float));
if (mrope_used && sections) {
memcpy(params + 11, sections, sizeof(int32_t) * GGML_MROPE_SECTIONS);
} else {
memset(params + 11, 0, sizeof(int32_t) * GGML_MROPE_SECTIONS);
}
ggml_set_op_params(result, params, sizeof(params));

result->op = GGML_OP_ROPE;
result->src[0] = a;
result->src[1] = b;
result->src[2] = c;

return result;
}

struct ggml_tensor * ggml_rope(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
int n_dims,
int mode) {
return ggml_rope_impl(
ctx, a, b, NULL, n_dims, NULL, mode, 0, 10000.0f, 1.0f, 0.0f, 1.0f, 0.0f, 0.0f, false
);
}

struct ggml_tensor * ggml_rope_multi(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int sections[GGML_MROPE_SECTIONS],
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
return ggml_rope_impl(
ctx, a, b, c, n_dims, sections, mode, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow, false
);
}

struct ggml_tensor * ggml_rope_multi_inplace(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int sections[GGML_MROPE_SECTIONS],
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
return ggml_rope_impl(
ctx, a, b, c, n_dims, sections, mode, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow, true
);
}

struct ggml_tensor * ggml_rope_inplace(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
int n_dims,
int mode) {
return ggml_rope_impl(
ctx, a, b, NULL, n_dims, NULL, mode, 0, 10000.0f, 1.0f, 0.0f, 1.0f, 0.0f, 0.0f, true
);
}

struct ggml_tensor * ggml_rope_ext(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
return ggml_rope_impl(
ctx, a, b, c, n_dims, NULL, mode, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow, false
);
}

struct ggml_tensor * ggml_rope_ext_inplace(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
return ggml_rope_impl(
ctx, a, b, c, n_dims, NULL, mode, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow, true
);
}

struct ggml_tensor * ggml_rope_custom(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
int n_dims,
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
return ggml_rope_impl(
ctx, a, b, NULL, n_dims, NULL, mode, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow, false
);
}

struct ggml_tensor * ggml_rope_custom_inplace(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
int n_dims,
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
return ggml_rope_impl(
ctx, a, b, NULL, n_dims, NULL, mode, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow, true
);
}

// Apparently solving `n_rot = 2pi * x * base^((2 * max_pos_emb) / n_dims)` for x, we get
// `corr_dim(n_rot) = n_dims * log(max_pos_emb / (n_rot * 2pi)) / (2 * log(base))`
static float ggml_rope_yarn_corr_dim(int n_dims, int n_ctx_orig, float n_rot, float base) {
return n_dims * logf(n_ctx_orig / (n_rot * 2 * (float)M_PI)) / (2 * logf(base));
}

void ggml_rope_yarn_corr_dims(
int n_dims, int n_ctx_orig, float freq_base, float beta_fast, float beta_slow, float dims[2]
) {
// start and end correction dims
float start = floorf(ggml_rope_yarn_corr_dim(n_dims, n_ctx_orig, beta_fast, freq_base));
float end = ceilf(ggml_rope_yarn_corr_dim(n_dims, n_ctx_orig, beta_slow, freq_base));
dims[0] = MAX(0, start);
dims[1] = MIN(n_dims - 1, end);
}

// ggml_rope_back

struct ggml_tensor * ggml_rope_ext_back(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
struct ggml_tensor * result = ggml_rope_ext(
ctx, a, b, c, n_dims, mode, n_ctx_orig, freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow);
result->op = GGML_OP_ROPE_BACK;
return result;
}

struct ggml_tensor * ggml_rope_multi_back(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int sections[4],
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
struct ggml_tensor * result = ggml_rope_multi(
ctx, a, b, c, n_dims, sections, mode, n_ctx_orig, freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow);
result->op = GGML_OP_ROPE_BACK;
return result;
}

Hand-rolled Vector

#include <bits/stdc++.h>
using namespace std;
#define int long long

template <typename T>
class Vector {
    T* _data;
    size_t _size;
    size_t _capacity;

    // double the capacity (1 on the first allocation) and copy the elements over
    void expand() {
        size_t new_capacity = (_capacity == 0) ? 1 : _capacity * 2;
        T* new_data = new T[new_capacity];
        for (size_t i = 0; i < _size; i++) new_data[i] = _data[i];
        if (_data) delete[] _data;
        _data = new_data;
        _capacity = new_capacity;
        cout << "capacity expanded to " << _capacity << '\n';
    }

public:
    Vector() : _data(nullptr), _size(0), _capacity(0) {}
    ~Vector() {
        if (_data) {
            delete[] _data;
            _data = nullptr;
        }
    }
    void push_back(const T& value) {
        if (_size == _capacity) {
            expand();
        }
        _data[_size] = value;
        _size++;
    }
    T& operator[](size_t index) { return _data[index]; }
    size_t size() const { return _size; }
    size_t capacity() const { return _capacity; }
};

void solve() {
    Vector<int> v;
    for (int i = 0; i < 6; i++) {
        v.push_back(i);
        cout << "inserted " << i << ", size: " << v.size() << "\n";
    }
}

signed main() {
    solve();
    return 0;
}

CS336 Assignment1

BPE (Byte Pair Encoding): how it works, with a from-scratch implementation

BPE (Byte Pair Encoding) is a very popular subword tokenization algorithm. Originally a data-compression technique, it was later adopted widely in natural language processing, most notably in the tokenizers of large language models such as the GPT series, LLaMA, and RoBERTa.

The core idea of BPE

Starting from the most frequent pair of adjacent symbols in the corpus (initially single characters or bytes), repeatedly merge that pair into a larger subword unit, until the target vocabulary size is reached.

BPE training procedure (classic example)

Suppose we have the following small corpus (each word is suffixed with </w> to mark the end of the word):

low</w>:    5
lower</w>: 2
newest</w>: 6
widest</w>: 3

After splitting into characters:

l o w </w> ×5
l o w e r </w> ×2
n e w e s t </w> ×6
w i d e s t </w> ×3

Count the frequency of every adjacent pair → (e, s) and (s, t) both occur 9 times → merge either one (say es) → keep iterating → high-frequency subwords such as est, low, and lowest eventually emerge.

Two important BPE variants in practice

Original byte-level BPE (used by OpenAI GPT-2)

  • Operates on UTF-8 bytes
  • The base vocabulary is the 256 byte values plus the learned merge rules
  • Advantage: it can encode any Unicode text, so OOV (out-of-vocabulary) tokens never occur

SentencePiece BPE (used by Google, LLaMA, T5, ...)

  • Trained directly on raw text (no pre-tokenization)
  • Also offers a unigram mode (a variant of BPE)
  • Special control symbols can be added (e.g. ▁ to mark a space)

The greedy rule when applying BPE

When tokenizing, apply the merge rules greedily in the order they were learned, earliest first (merges learned earlier in training have higher priority).

For example: if the merge "u" + "n" → "un" was learned first and "un" + "able" → "unable" was learned later, then the string "unable" is merged step by step into "un" + "able" → "unable" rather than being split any other way.

Pros and cons of BPE

Pros:

  • Largely solves the OOV problem (especially for rare words, typos, and newly coined words)
  • Keeps common words intact (high frequency → merged early → a single token)
  • Rare words are split into subwords that still carry meaning
  • The vocabulary size is controllable

Cons:

  • The splits do not necessarily follow semantics or morphology (they are purely statistical)
  • Low-resource languages may end up with very fragmented tokenizations
  • Token efficiency can be lower than WordPiece or a Unigram LM for some languages

In one sentence: BPE is an unsupervised tokenization algorithm that builds a moderately sized, broad-coverage subword vocabulary by repeatedly merging the most frequent adjacent symbol pair in the corpus, and it is one of the cornerstones of today's mainstream LLM tokenizers. A small sketch of the greedy application rule follows.
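
The sketch below is a toy: the merge list is hypothetical and it works on plain strings rather than on bytes and token ids like the assignment code. The point is only that each rule is applied across the whole word before the next rule is considered, so a merge learned earlier always wins.

#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // symbols of the word "unable", split into characters
    std::vector<std::string> word = {"u", "n", "a", "b", "l", "e"};

    // hypothetical merges, in the order they were learned during training
    std::vector<std::pair<std::string, std::string>> merges = {
        {"u", "n"}, {"a", "b"}, {"l", "e"}, {"ab", "le"}, {"un", "able"},
    };

    for (const auto & m : merges) {
        std::vector<std::string> out;
        for (size_t i = 0; i < word.size(); ) {
            if (i + 1 < word.size() && word[i] == m.first && word[i + 1] == m.second) {
                out.push_back(m.first + m.second); // apply this merge rule
                i += 2;
            } else {
                out.push_back(word[i]);
                i += 1;
            }
        }
        word = std::move(out);
    }

    for (const auto & tok : word) std::cout << tok << ' '; // prints: unable
    std::cout << '\n';
}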

Implementation

def run_train_bpe(
    input_path: str | os.PathLike,
    vocab_size: int,
    special_tokens: list[str],
    **kwargs,
) -> tuple[dict[int, bytes], list[tuple[bytes, bytes]]]:
    """Given the path to an input corpus, train a BPE tokenizer and
    output its vocabulary and merges.

    Args:
        input_path (str | os.PathLike): Path to BPE tokenizer training data.
        vocab_size (int): Total number of items in the tokenizer's vocabulary (including special tokens).
        special_tokens (list[str]): A list of string special tokens to be added to the tokenizer vocabulary.
            These strings will never be split into multiple tokens, and will always be
            kept as a single token. If these special tokens occur in the input_path,
            they are treated as any other string.

    Returns:
        tuple[dict[int, bytes], list[tuple[bytes, bytes]]]:
            vocab:
                The trained tokenizer vocabulary, a mapping from int (token ID in the vocabulary)
                to bytes (token bytes)
            merges:
                BPE merges. Each list item is a tuple of bytes (<token1>, <token2>),
                representing that <token1> was merged with <token2>.
                Merges are ordered by order of creation.
    """
    # 1. Argument validation and initialization
    pat_str = kwargs.get("pat_str", GPT2_PRETOKENIZER_PATTERN)
    special_tokens = special_tokens or []
    unique_special_tokens: list[str] = []
    seen_specials: set[str] = set()

    # Deduplicate the special tokens while preserving their order
    for token in special_tokens:
        if not isinstance(token, str):
            msg = f"Expected special tokens to be strings, got {type(token)!r}"
            raise TypeError(msg)
        if token not in seen_specials:
            seen_specials.add(token)
            unique_special_tokens.append(token)

    special_tokens_bytes = [token.encode("utf-8") for token in unique_special_tokens]
    num_special_tokens = len(special_tokens_bytes)

    # The base vocabulary covers the 256 byte values
    if vocab_size < 2**8 + num_special_tokens:
        msg = "vocab_size must be at least 256 + number of special tokens"
        raise ValueError(msg)

    merges_target = vocab_size - num_special_tokens - 2**8
    pretokenizer = regex.compile(pat_str)

    # 2. Read the input file
    with open(input_path, "r", encoding="utf-8") as f:
        text = f.read()

    words: list[list[int]] = []
    word_frequencies: list[int] = []
    word_lookup: dict[str, int] = {}

    # 3. Pre-tokenization
    # Split on special tokens first so the regex cannot break them apart
    removable_specials = [token for token in unique_special_tokens if token]
    segments = [text]
    if removable_specials:
        escaped = [regex.escape(token) for token in removable_specials]
        split_pattern = regex.compile("|".join(escaped))
        segments = [segment for segment in split_pattern.split(text) if segment]

    for segment in segments:
        for match in pretokenizer.finditer(segment):
            token = match.group(0)
            if not token:
                continue

            idx = word_lookup.get(token)
            if idx is None:
                token_bytes = token.encode("utf-8")
                if not token_bytes:
                    continue
                idx = len(words)
                word_lookup[token] = idx
                words.append(list(token_bytes))
                word_frequencies.append(0)

            word_frequencies[idx] += 1

    # 4. Initialize the BPE statistics
    token_id_to_bytes: dict[int, bytes] = {i: bytes([i]) for i in range(256)}
    merges: list[tuple[bytes, bytes]] = []
    next_token_id = 256

    pair_stats: Counter[tuple[int, int]] = Counter()
    pair_indices: dict[tuple[int, int], set[int]] = {}
    word_pair_counters: list[Counter[tuple[int, int]]] = []

    # Initial pass: count the pairs in every word
    for idx, token_ids in enumerate(words):
        freq = word_frequencies[idx]
        if freq == 0 or len(token_ids) < 2:
            word_pair_counters.append(Counter())
            continue

        pair_counter = Counter(zip(token_ids[:-1], token_ids[1:]))
        word_pair_counters.append(pair_counter)

        for pair, count in pair_counter.items():
            pair_stats[pair] += count * freq
            pair_indices.setdefault(pair, set()).add(idx)

    # --- Internal helpers (closures) ---
    def remove_word_from_stats(word_idx: int) -> None:
        counter = word_pair_counters[word_idx]
        if not counter:
            return
        freq = word_frequencies[word_idx]
        for pair, count in counter.items():
            pair_stats[pair] -= count * freq
            if pair_stats[pair] <= 0:
                pair_stats.pop(pair, None)

            indices = pair_indices.get(pair)
            if indices is not None:
                indices.discard(word_idx)
                if not indices:
                    pair_indices.pop(pair, None)

    def add_word_to_stats(word_idx: int) -> None:
        tokens = words[word_idx]
        if len(tokens) < 2:
            word_pair_counters[word_idx] = Counter()
            return

        counter = Counter(zip(tokens[:-1], tokens[1:]))
        word_pair_counters[word_idx] = counter
        freq = word_frequencies[word_idx]
        for pair, count in counter.items():
            pair_stats[pair] += count * freq
            pair_indices.setdefault(pair, set()).add(word_idx)

    def merge_word(word_idx: int, pair: tuple[int, int], new_token_id: int) -> None:
        tokens = words[word_idx]
        if len(tokens) < 2:
            return

        merged: list[int] = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
                merged.append(new_token_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        words[word_idx] = merged

    # 5. Main BPE training loop
    for _ in range(max(0, merges_target)):
        if not pair_stats:
            break

        # Priority: higher count first; break ties on the byte contents for determinism
        def pair_priority(item: tuple[tuple[int, int], int]) -> tuple[int, bytes, bytes]:
            (left_id, right_id), count = item
            return count, token_id_to_bytes[left_id], token_id_to_bytes[right_id]

        best_pair, _ = max(pair_stats.items(), key=pair_priority)

        left_bytes = token_id_to_bytes[best_pair[0]]
        right_bytes = token_id_to_bytes[best_pair[1]]

        merges.append((left_bytes, right_bytes))

        new_token_id = next_token_id
        token_id_to_bytes[new_token_id] = left_bytes + right_bytes

        affected_words = pair_indices.pop(best_pair, set())

        # If no word is affected (should not happen, since the pair is still in the stats), skip it
        if not affected_words:
            next_token_id += 1
            pair_stats.pop(best_pair, None)
            continue

        # Update the statistics of the affected words
        for word_idx in sorted(affected_words):
            remove_word_from_stats(word_idx)
            merge_word(word_idx, best_pair, new_token_id)
            add_word_to_stats(word_idx)

        pair_stats.pop(best_pair, None)
        next_token_id += 1

    # 6. Build the final vocabulary
    vocab: dict[int, bytes] = {
        idx: token for idx, token in token_id_to_bytes.items() if idx < next_token_id
    }

    # Add the special tokens
    for token_bytes in special_tokens_bytes:
        if len(vocab) >= vocab_size:
            break
        vocab[next_token_id] = token_bytes
        next_token_id += 1

    return vocab, merges

ALL code

from __future__ import annotations

import builtins
import locale
import math
import os
from collections import Counter
from collections.abc import Iterable
from typing import IO, Any, BinaryIO

import numpy as np
import numpy.typing as npt
import regex
import tiktoken
import torch
import torch.nn.functional as F
from jaxtyping import Bool, Float, Int
from torch import Tensor
from torch.nn.utils import clip_grad_norm_


def _ensure_utf8_locale() -> None:
try:
preferred = locale.getpreferredencoding(False)
except Exception:
preferred = "utf-8"
if preferred.lower() != "utf-8":
locale.getpreferredencoding = lambda *_args, **_kwargs: "utf-8" # type: ignore[assignment]


_ensure_utf8_locale()

_ORIGINAL_OPEN = builtins.open


def _utf8_default_open(
file,
mode="r",
buffering=-1,
encoding: str | None = None,
errors: str | None = None,
newline: str | None = None,
closefd: bool = True,
opener=None,
):
if "b" not in mode and encoding is None:
encoding = "utf-8"
return _ORIGINAL_OPEN(file, mode, buffering, encoding, errors, newline, closefd, opener)


builtins.open = _utf8_default_open # type: ignore[assignment]


GPT2_PRETOKENIZER_PATTERN = (
r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)


def run_linear(
d_in: int,
d_out: int,
weights: Float[Tensor, " d_out d_in"],
in_features: Float[Tensor, " ... d_in"],
) -> Float[Tensor, " ... d_out"]:
"""
Given the weights of a Linear layer, compute the transformation of a batched input.

Args:
in_dim (int): The size of the input dimension
out_dim (int): The size of the output dimension
weights (Float[Tensor, "d_out d_in"]): The linear weights to use
in_features (Float[Tensor, "... d_in"]): The output tensor to apply the function to

Returns:
Float[Tensor, "... d_out"]: The transformed output of your linear module.
"""

if tuple(weights.shape) != (d_out, d_in):
msg = f"weights shape {tuple(weights.shape)} does not match ({d_out}, {d_in})"
raise ValueError(msg)

return F.linear(in_features, weights, bias=None)


def run_embedding(
vocab_size: int,
d_model: int,
weights: Float[Tensor, " vocab_size d_model"],
token_ids: Int[Tensor, " ..."],
) -> Float[Tensor, " ... d_model"]:
"""
Given the weights of an Embedding layer, get the embeddings for a batch of token ids.

Args:
vocab_size (int): The number of embeddings in the vocabulary
d_model (int): The size of the embedding dimension
weights (Float[Tensor, "vocab_size d_model"]): The embedding vectors to fetch from
token_ids (Int[Tensor, "..."]): The set of token ids to fetch from the Embedding layer

Returns:
Float[Tensor, "... d_model"]: Batch of embeddings returned by your Embedding layer.
"""

if tuple(weights.shape) != (vocab_size, d_model):
msg = f"weights shape {tuple(weights.shape)} does not match ({vocab_size}, {d_model})"
raise ValueError(msg)

token_ids = token_ids.to(torch.long)
return F.embedding(token_ids, weights)


def run_swiglu(
d_model: int,
d_ff: int,
w1_weight: Float[Tensor, " d_ff d_model"],
w2_weight: Float[Tensor, " d_model d_ff"],
w3_weight: Float[Tensor, " d_ff d_model"],
in_features: Float[Tensor, " ... d_model"],
) -> Float[Tensor, " ... d_model"]:
"""Given the weights of a SwiGLU network, return
the output of your implementation with these weights.

Args:
d_model (int): Dimensionality of the feedforward input and output.
d_ff (int): Dimensionality of the up-project happening internally to your swiglu.
w1_weight (Float[Tensor, "d_ff d_model"]): Stored weights for W1
w2_weight (Float[Tensor, "d_model d_ff"]): Stored weights for W2
w3_weight (Float[Tensor, "d_ff d_model"]): Stored weights for W3
in_features (Float[Tensor, "... d_model"]): Input embeddings to the feed-forward layer.

Returns:
Float[Tensor, "... d_model"]: Output embeddings of the same shape as the input embeddings.
"""
# Example:
# If your state dict keys match, you can use `load_state_dict()`
# swiglu.load_state_dict(weights)
# You can also manually assign the weights
# swiglu.w1.weight.data = w1_weight
# swiglu.w2.weight.data = w2_weight
# swiglu.w3.weight.data = w3_weight
if d_model <= 0 or d_ff <= 0:
raise ValueError("d_model and d_ff must be positive")

gate = F.linear(in_features, w1_weight, bias=None)
up = F.linear(in_features, w3_weight, bias=None)
activated = F.silu(gate) * up
return F.linear(activated, w2_weight, bias=None)


def run_scaled_dot_product_attention(
Q: Float[Tensor, " ... queries d_k"],
K: Float[Tensor, " ... keys d_k"],
V: Float[Tensor, " ... values d_v"],
mask: Bool[Tensor, " ... queries keys"] | None = None,
) -> Float[Tensor, " ... queries d_v"]:
"""
Given key (K), query (Q), and value (V) tensors, return
the output of your scaled dot product attention implementation.

Args:
Q (Float[Tensor, " ... queries d_k"]): Query tensor
K (Float[Tensor, " ... keys d_k"]): Key tensor
V (Float[Tensor, " ... values d_v"]): Values tensor
mask (Bool[Tensor, " ... queries keys"] | None): Mask tensor
Returns:
Float[Tensor, " ... queries d_v"]: Output of SDPA
"""
d_k = Q.shape[-1]
if d_k == 0:
raise ValueError("d_k must be positive")

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
if mask is not None:
fill = torch.finfo(scores.dtype).min
mask = mask.to(dtype=torch.bool, device=scores.device)
if mask.shape != scores.shape:
mask = mask.expand(scores.shape)
scores = scores.masked_fill(~mask, fill)

attention = torch.softmax(scores, dim=-1)
return torch.matmul(attention, V)


def _build_causal_mask(
batch_dims: tuple[int, ...], num_heads: int, seq_len: int, device: torch.device
) -> Bool[Tensor, " ..."]:
mask = torch.ones(seq_len, seq_len, dtype=torch.bool, device=device).tril()
view_shape = (1,) * len(batch_dims) + (1, seq_len, seq_len)
return mask.view(view_shape).expand(*batch_dims, num_heads, seq_len, seq_len)


def run_multihead_self_attention(
d_model: int,
num_heads: int,
q_proj_weight: Float[Tensor, " d_k d_in"],
k_proj_weight: Float[Tensor, " d_k d_in"],
v_proj_weight: Float[Tensor, " d_v d_in"],
o_proj_weight: Float[Tensor, " d_model d_v"],
in_features: Float[Tensor, " ... sequence_length d_in"],
) -> Float[Tensor, " ... sequence_length d_out"]:
"""
Given the key, query, and value projection weights of a naive unbatched
implementation of multi-head attention, return the output of an optimized batched
implementation. This implementation should handle the key, query, and value projections
for all heads in a single matrix multiply.
This function should not use RoPE.
See section 3.2.2 of Vaswani et al., 2017.

Args:
d_model (int): Dimensionality of the feedforward input and output.
num_heads (int): Number of heads to use in multi-headed attention.
max_seq_len (int): Maximum sequence length to pre-cache if your implementation does that.
q_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the Q projection
k_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the K projection
v_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the V projection
o_proj_weight (Float[Tensor, "d_model d_v"]): Weights for the output projection
in_features (Float[Tensor, "... sequence_length d_in"]): Tensor to run your implementation on.

Returns:
Float[Tensor, " ... sequence_length d_out"]: Tensor with the output of running your optimized, batched multi-headed attention
implementation with the given QKV projection weights and input features.
"""
if d_model % num_heads != 0:
raise ValueError("d_model must be divisible by num_heads")

head_dim = d_model // num_heads
batch_dims = tuple(in_features.shape[:-2])
seq_len = in_features.shape[-2]

def _project(weight: Tensor) -> Tensor:
proj = F.linear(in_features, weight, bias=None)
new_shape = (*batch_dims, seq_len, num_heads, head_dim)
proj = proj.reshape(new_shape)
permute_order = list(range(len(batch_dims))) + [len(batch_dims) + 1, len(batch_dims), len(batch_dims) + 2]
return proj.permute(permute_order)

q = _project(q_proj_weight)
k = _project(k_proj_weight)
v = _project(v_proj_weight)

mask = _build_causal_mask(batch_dims, num_heads, seq_len, in_features.device)
attn_output = run_scaled_dot_product_attention(q, k, v, mask=mask)
permute_order = list(range(len(batch_dims))) + [len(batch_dims) + 1, len(batch_dims), len(batch_dims) + 2]
attn_output = attn_output.permute(permute_order)
merged = attn_output.reshape(*batch_dims, seq_len, d_model)
return F.linear(merged, o_proj_weight, bias=None)


def run_multihead_self_attention_with_rope(
d_model: int,
num_heads: int,
max_seq_len: int,
theta: float,
q_proj_weight: Float[Tensor, " d_k d_in"],
k_proj_weight: Float[Tensor, " d_k d_in"],
v_proj_weight: Float[Tensor, " d_v d_in"],
o_proj_weight: Float[Tensor, " d_model d_v"],
in_features: Float[Tensor, " ... sequence_length d_in"],
token_positions: Int[Tensor, " ... sequence_length"] | None = None,
) -> Float[Tensor, " ... sequence_length d_out"]:
"""
Given the key, query, and value projection weights of a naive unbatched
implementation of multi-head attention, return the output of an optimized batched
implementation. This implementation should handle the key, query, and value projections
for all heads in a single matrix multiply.
This version of MHA should include RoPE.
In this case, the RoPE embedding dimension must be the head embedding dimension (d_model // num_heads).
See section 3.2.2 of Vaswani et al., 2017.

Args:
d_model (int): Dimensionality of the feedforward input and output.
num_heads (int): Number of heads to use in multi-headed attention.
max_seq_len (int): Maximum sequence length to pre-cache if your implementation does that.
theta (float): RoPE parameter.
q_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the Q projection
k_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the K projection
v_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the V projection
o_proj_weight (Float[Tensor, "d_model d_v"]): Weights for the output projection
in_features (Float[Tensor, "... sequence_length d_in"]): Tensor to run your implementation on.
token_positions (Int[Tensor, " ... sequence_length"] | None): Optional tensor with the positions of the tokens

Returns:
Float[Tensor, " ... sequence_length d_out"]: Tensor with the output of running your optimized, batched multi-headed attention
implementation with the given QKV projection weights and input features.
"""
if d_model % num_heads != 0:
raise ValueError("d_model must be divisible by num_heads")

head_dim = d_model // num_heads
batch_dims = tuple(in_features.shape[:-2])
seq_len = in_features.shape[-2]
device = in_features.device

def _project(weight: Tensor) -> Tensor:
proj = F.linear(in_features, weight, bias=None)
new_shape = (*batch_dims, seq_len, num_heads, head_dim)
proj = proj.reshape(new_shape)
permute_order = list(range(len(batch_dims))) + [len(batch_dims) + 1, len(batch_dims), len(batch_dims) + 2]
return proj.permute(permute_order)

q = _project(q_proj_weight)
k = _project(k_proj_weight)
v = _project(v_proj_weight)

if token_positions is None:
base = torch.arange(seq_len, device=device, dtype=torch.long)
view_shape = (1,) * len(batch_dims) + (seq_len,)
token_positions = base.view(view_shape)
else:
token_positions = torch.as_tensor(token_positions, dtype=torch.long, device=device)
target_shape = batch_dims + (seq_len,)
if token_positions.shape != target_shape:
missing = len(target_shape) - token_positions.ndim
if missing < 0:
raise ValueError("token_positions has too many dimensions for the provided input")
shape = (1,) * missing + tuple(token_positions.shape)
token_positions = token_positions.reshape(shape)
token_positions = token_positions.expand(target_shape)

rope_positions = token_positions.unsqueeze(-2).expand(*batch_dims, num_heads, seq_len)
q = run_rope(head_dim, theta, max_seq_len, q, rope_positions)
k = run_rope(head_dim, theta, max_seq_len, k, rope_positions)

mask = _build_causal_mask(batch_dims, num_heads, seq_len, device)
attn_output = run_scaled_dot_product_attention(q, k, v, mask=mask)
permute_order = list(range(len(batch_dims))) + [len(batch_dims) + 1, len(batch_dims), len(batch_dims) + 2]
attn_output = attn_output.permute(permute_order)
merged = attn_output.reshape(*batch_dims, seq_len, d_model)
return F.linear(merged, o_proj_weight, bias=None)


def run_rope(
d_k: int,
theta: float,
max_seq_len: int,
in_query_or_key: Float[Tensor, " ... sequence_length d_k"],
token_positions: Int[Tensor, " ... sequence_length"],
) -> Float[Tensor, " ... sequence_length d_k"]:
"""
Run RoPE for a given input tensor.

Args:
d_k (int): Embedding dimension size for the query or key tensor.
theta (float): RoPE parameter.
max_seq_len (int): Maximum sequence length to pre-cache if your implementation does that.
in_query_or_key (Float[Tensor, "... sequence_length d_k"]): Input tensor to run RoPE on.
token_positions (Int[Tensor, "... sequence_length"]): Tensor of shape (batch_size, sequence_length) with the token positions
Returns:
Float[Tensor, " ... sequence_length d_k"]: Tensor with RoPEd input.
"""
if d_k % 2 != 0:
raise ValueError("d_k must be even for RoPE")
if theta <= 0:
raise ValueError("theta must be positive")

x = in_query_or_key
device = x.device
dtype = x.dtype
seq_len = x.shape[-2]

if token_positions is None:
base = torch.arange(seq_len, device=device, dtype=torch.long)
view_shape = (1,) * (x.ndim - 2) + (seq_len,)
token_positions = base.view(view_shape)
else:
token_positions = torch.as_tensor(token_positions, dtype=torch.long, device=device)
expected_prefix = x.shape[:-1]
if token_positions.shape != expected_prefix:
missing = len(expected_prefix) - token_positions.ndim
if missing < 0:
raise ValueError("token_positions incompatible with input shape")
shape = (1,) * missing + tuple(token_positions.shape)
token_positions = token_positions.reshape(shape)
token_positions = token_positions.expand(expected_prefix)

half_dim = d_k // 2
freq_exponents = torch.arange(0, half_dim, device=device, dtype=torch.float32) / half_dim
inv_freq = torch.exp(-math.log(theta) * freq_exponents).to(dtype)
angles = token_positions.to(dtype).unsqueeze(-1) * inv_freq
cos = torch.cos(angles)
sin = torch.sin(angles)

reshaped = x.reshape(*x.shape[:-1], half_dim, 2)
x_even = reshaped[..., 0]
x_odd = reshaped[..., 1]
rotated_even = x_even * cos - x_odd * sin
rotated_odd = x_even * sin + x_odd * cos
prefix_shape = in_query_or_key.shape[:-1]
return torch.stack((rotated_even, rotated_odd), dim=-1).reshape(*prefix_shape, d_k)


def run_transformer_block(
d_model: int,
num_heads: int,
d_ff: int,
max_seq_len: int,
theta: float,
weights: dict[str, Tensor],
in_features: Float[Tensor, " batch sequence_length d_model"],
) -> Float[Tensor, " batch sequence_length d_model"]:
"""
Given the weights of a pre-norm Transformer block and input features,
return the output of running the Transformer block on the input features.

This function should use RoPE.
Depending on your implementation, you may simply need to pass the relevant args
to your TransformerBlock constructor, or you may need to initialize your own RoPE
class and pass that instead.

Args:
d_model (int): The dimensionality of the Transformer block input.
num_heads (int): Number of heads to use in multi-headed attention. `d_model` must be
evenly divisible by `num_heads`.
d_ff (int): Dimensionality of the feed-forward inner layer.
max_seq_len (int): Maximum sequence length to pre-cache if your implementation does that.
theta (float): RoPE parameter.
weights (dict[str, Tensor]):
State dict of our reference implementation.
The keys of this dictionary are:
- `attn.q_proj.weight`
The query projections for all `num_heads` attention heads.
Shape is (d_model, d_model).
The rows are ordered by matrices of shape (num_heads, d_k),
so `attn.q_proj.weight == torch.cat([q_heads.0.weight, ..., q_heads.N.weight], dim=0)`.
- `attn.k_proj.weight`
The key projections for all `num_heads` attention heads.
Shape is (d_model, d_model).
The rows are ordered by matrices of shape (num_heads, d_k),
so `attn.k_proj.weight == torch.cat([k_heads.0.weight, ..., k_heads.N.weight], dim=0)`.
- `attn.v_proj.weight`
The value projections for all `num_heads` attention heads.
Shape is (d_model, d_model).
The rows are ordered by matrices of shape (num_heads, d_v),
so `attn.v_proj.weight == torch.cat([v_heads.0.weight, ..., v_heads.N.weight], dim=0)`.
- `attn.output_proj.weight`
Weight of the multi-head self-attention output projection
Shape is (d_model, d_model).
- `ln1.weight`
Weights of affine transform for the first RMSNorm
applied in the transformer block.
Shape is (d_model,).
- `ffn.w1.weight`
Weight of the first linear transformation in the FFN.
Shape is (d_model, d_ff).
- `ffn.w2.weight`
Weight of the second linear transformation in the FFN.
Shape is (d_ff, d_model).
- `ffn.w3.weight`
Weight of the third linear transformation in the FFN.
Shape is (d_model, d_ff).
- `ln2.weight`
Weights of affine transform for the second RMSNorm
applied in the transformer block.
Shape is (d_model,).
in_features (Float[Tensor, "batch sequence_length d_model"]):
Tensor to run your implementation on.

Returns:
Float[Tensor, "batch sequence_length d_model"] Tensor with the output of
running the Transformer block on the input features while using RoPE.
"""
eps = 1e-5
batch_dims = tuple(in_features.shape[:-2])
seq_len = in_features.shape[-2]
device = in_features.device

base_positions = torch.arange(seq_len, device=device, dtype=torch.long)
view_shape = (1,) * len(batch_dims) + (seq_len,)
token_positions = base_positions.view(view_shape).expand(*batch_dims, seq_len)

attn_input = run_rmsnorm(d_model=d_model, eps=eps, weights=weights["ln1.weight"], in_features=in_features)
attn_output = run_multihead_self_attention_with_rope(
d_model=d_model,
num_heads=num_heads,
max_seq_len=max_seq_len,
theta=theta,
q_proj_weight=weights["attn.q_proj.weight"],
k_proj_weight=weights["attn.k_proj.weight"],
v_proj_weight=weights["attn.v_proj.weight"],
o_proj_weight=weights["attn.output_proj.weight"],
in_features=attn_input,
token_positions=token_positions,
)
residual = in_features + attn_output

ffn_input = run_rmsnorm(d_model=d_model, eps=eps, weights=weights["ln2.weight"], in_features=residual)
ffn_output = run_swiglu(
d_model=d_model,
d_ff=d_ff,
w1_weight=weights["ffn.w1.weight"],
w2_weight=weights["ffn.w2.weight"],
w3_weight=weights["ffn.w3.weight"],
in_features=ffn_input,
)
return residual + ffn_output


def run_transformer_lm(
vocab_size: int,
context_length: int,
d_model: int,
num_layers: int,
num_heads: int,
d_ff: int,
rope_theta: float,
weights: dict[str, Tensor],
in_indices: Int[Tensor, " batch_size sequence_length"],
) -> Float[Tensor, " batch_size sequence_length vocab_size"]:
"""Given the weights of a Transformer language model and input indices,
return the output of running a forward pass on the input indices.

This function should use RoPE.

Args:
vocab_size (int): The number of unique items in the output vocabulary to be predicted.
context_length (int): The maximum number of tokens to process at once.
d_model (int): The dimensionality of the model embeddings and sublayer outputs.
num_layers (int): The number of Transformer layers to use.
num_heads (int): Number of heads to use in multi-headed attention. `d_model` must be
evenly divisible by `num_heads`.
d_ff (int): Dimensionality of the feed-forward inner layer (section 3.3).
rope_theta (float): The RoPE $\\Theta$ parameter.
weights (dict[str, Tensor]):
State dict of our reference implementation. {num_layers} refers to an
integer between `0` and `num_layers - 1` (the layer index).
The keys of this dictionary are:
- `token_embeddings.weight`
Token embedding matrix. Shape is (vocab_size, d_model).
- `layers.{num_layers}.attn.q_proj.weight`
The query projections for all `num_heads` attention heads.
Shape is (num_heads * (d_model / num_heads), d_model).
The rows are ordered by matrices of shape (num_heads, d_k),
so `attn.q_proj.weight == torch.cat([q_heads.0.weight, ..., q_heads.N.weight], dim=0)`.
- `layers.{num_layers}.attn.k_proj.weight`
The key projections for all `num_heads` attention heads.
Shape is (num_heads * (d_model / num_heads), d_model).
The rows are ordered by matrices of shape (num_heads, d_k),
so `attn.k_proj.weight == torch.cat([k_heads.0.weight, ..., k_heads.N.weight], dim=0)`.
- `layers.{num_layers}.attn.v_proj.weight`
The value projections for all `num_heads` attention heads.
Shape is (num_heads * (d_model / num_heads), d_model).
The rows are ordered by matrices of shape (num_heads, d_v),
so `attn.v_proj.weight == torch.cat([v_heads.0.weight, ..., v_heads.N.weight], dim=0)`.
- `layers.{num_layers}.attn.output_proj.weight`
Weight of the multi-head self-attention output projection
Shape is ((d_model / num_heads) * num_heads, d_model).
- `layers.{num_layers}.ln1.weight`
Weights of affine transform for the first RMSNorm
applied in the transformer block.
Shape is (d_model,).
- `layers.{num_layers}.ffn.w1.weight`
Weight of the first linear transformation in the FFN.
Shape is (d_model, d_ff).
- `layers.{num_layers}.ffn.w2.weight`
Weight of the second linear transformation in the FFN.
Shape is (d_ff, d_model).
- `layers.{num_layers}.ffn.w3.weight`
Weight of the third linear transformation in the FFN.
Shape is (d_model, d_ff).
- `layers.{num_layers}.ln2.weight`
Weights of affine transform for the second RMSNorm
applied in the transformer block.
Shape is (d_model,).
- `ln_final.weight`
Weights of affine transform for RMSNorm applied to the output of the final transformer block.
Shape is (d_model, ).
- `lm_head.weight`
Weights of the language model output embedding.
Shape is (vocab_size, d_model).
in_indices (Int[Tensor, "batch_size sequence_length"]) Tensor with input indices to run the language model on. Shape is (batch_size, sequence_length), where
`sequence_length` is at most `context_length`.

Returns:
Float[Tensor, "batch_size sequence_length vocab_size"]: Tensor with the predicted unnormalized
next-word distribution for each token.
"""
if in_indices.shape[-1] > context_length:
raise ValueError("sequence length exceeds context length")

x = run_embedding(
vocab_size=vocab_size,
d_model=d_model,
weights=weights["token_embeddings.weight"],
token_ids=in_indices,
)

for layer_idx in range(num_layers):
prefix = f"layers.{layer_idx}."
layer_weights = {k[len(prefix) :]: v for k, v in weights.items() if k.startswith(prefix)}
x = run_transformer_block(
d_model=d_model,
num_heads=num_heads,
d_ff=d_ff,
max_seq_len=context_length,
theta=rope_theta,
weights=layer_weights,
in_features=x,
)

x = run_rmsnorm(d_model=d_model, eps=1e-5, weights=weights["ln_final.weight"], in_features=x)
logits = run_linear(
d_in=d_model,
d_out=vocab_size,
weights=weights["lm_head.weight"],
in_features=x,
)
return logits


def run_rmsnorm(
d_model: int,
eps: float,
weights: Float[Tensor, " d_model"],
in_features: Float[Tensor, " ... d_model"],
) -> Float[Tensor, " ... d_model"]:
"""Given the weights of a RMSNorm affine transform,
return the output of running RMSNorm on the input features.

Args:
d_model (int): The dimensionality of the RMSNorm input.
eps: (float): A value added to the denominator for numerical stability.
weights (Float[Tensor, "d_model"]): RMSNorm weights.
in_features (Float[Tensor, "... d_model"]): Input features to run RMSNorm on. Can have arbitrary leading
dimensions.

Returns:
Float[Tensor,"... d_model"]: Tensor of with the same shape as `in_features` with the output of running
RMSNorm of the `in_features`.
"""
if weights.shape != (d_model,):
msg = f"weights shape {tuple(weights.shape)} does not match ({d_model},)"
raise ValueError(msg)
if in_features.shape[-1] != d_model:
msg = f"Input features last dimension {in_features.shape[-1]} does not equal d_model {d_model}"
raise ValueError(msg)

variance = in_features.pow(2).mean(dim=-1, keepdim=True)
scale = torch.rsqrt(variance + eps)
return in_features * scale * weights


def run_silu(in_features: Float[Tensor, " ..."]) -> Float[Tensor, " ..."]:
"""Given a tensor of inputs, return the output of applying SiLU
to each element.

Args:
in_features(Float[Tensor, "..."]): Input features to run SiLU on. Shape is arbitrary.

Returns:
Float[Tensor,"..."]: of with the same shape as `in_features` with the output of applying
SiLU to each element.
"""
return F.silu(in_features)


def run_get_batch(
dataset: npt.NDArray, batch_size: int, context_length: int, device: str
) -> tuple[torch.Tensor, torch.Tensor]:
"""
Given a dataset (a 1D numpy array of integers) and a desired batch size and
context length, sample language modeling input sequences and their corresponding
labels from the dataset.

Args:
dataset (np.array): 1D numpy array of integer token IDs in the dataset.
batch_size (int): Desired batch size to sample.
context_length (int): Desired context length of each sampled example.
device (str): PyTorch device string (e.g., 'cpu' or 'cuda:0') indicating the device
to place the sampled input sequences and labels on.

Returns:
Tuple of torch.LongTensors of shape (batch_size, context_length). The first tuple item
is the sampled input sequences, and the second tuple item is the corresponding
language modeling labels.
"""
data = torch.as_tensor(dataset, dtype=torch.long)
if data.ndim != 1:
raise ValueError("dataset must be 1D")
if context_length <= 0:
raise ValueError("context_length must be positive")
if context_length >= data.shape[0]:
raise ValueError("context_length must be smaller than dataset length")

max_start = data.shape[0] - context_length
starts = torch.randint(0, max_start, (batch_size,))
offsets = torch.arange(context_length)
x = data[starts.unsqueeze(1) + offsets]
y = data[starts.unsqueeze(1) + offsets + 1]

target_device = torch.device(device)
return x.to(target_device), y.to(target_device)


def run_softmax(in_features: Float[Tensor, " ..."], dim: int) -> Float[Tensor, " ..."]:
"""
Given a tensor of inputs, return the output of softmaxing the given `dim`
of the input.

Args:
in_features (Float[Tensor, "..."]): Input features to softmax. Shape is arbitrary.
dim (int): Dimension of the `in_features` to apply softmax to.

Returns:
Float[Tensor, "..."]: Tensor of with the same shape as `in_features` with the output of
softmax normalizing the specified `dim`.
"""
shifted = in_features - in_features.max(dim=dim, keepdim=True).values
exps = shifted.exp()
return exps / exps.sum(dim=dim, keepdim=True)


def run_cross_entropy(
inputs: Float[Tensor, " batch_size vocab_size"], targets: Int[Tensor, " batch_size"]
) -> Float[Tensor, ""]:
"""Given a tensor of inputs and targets, compute the average cross-entropy
loss across examples.

Args:
inputs (Float[Tensor, "batch_size vocab_size"]): inputs[i][j] is the
unnormalized logit of jth class for the ith example.
targets (Int[Tensor, "batch_size"]): Tensor of shape (batch_size,) with the index of the correct class.
Each value must be between 0 and `num_classes - 1`.

Returns:
Float[Tensor, ""]: The average cross-entropy loss across examples.
"""
logits = inputs.to(torch.float32)
targets = targets.to(torch.long)
log_probs = logits.log_softmax(dim=-1)
return F.nll_loss(log_probs, targets, reduction="mean")


def run_gradient_clipping(parameters: Iterable[torch.nn.Parameter], max_l2_norm: float) -> None:
"""Given a set of parameters, clip their combined gradients to have l2 norm at most max_l2_norm.

Args:
parameters (Iterable[torch.nn.Parameter]): collection of trainable parameters.
max_l2_norm (float): a positive value containing the maximum l2-norm.

The gradients of the parameters (parameter.grad) should be modified in-place.
"""
clip_grad_norm_(parameters, max_l2_norm)


def get_adamw_cls() -> Any:
"""
Returns a torch.optim.Optimizer that implements AdamW.
"""
return torch.optim.AdamW


def run_get_lr_cosine_schedule(
it: int,
max_learning_rate: float,
min_learning_rate: float,
warmup_iters: int,
cosine_cycle_iters: int,
):
"""
Given the parameters of a cosine learning rate decay schedule (with linear
warmup) and an iteration number, return the learning rate at the given
iteration under the specified schedule.

Args:
it (int): Iteration number to get learning rate for.
max_learning_rate (float): alpha_max, the maximum learning rate for
cosine learning rate schedule (with warmup).
min_learning_rate (float): alpha_min, the minimum / final learning rate for
the cosine learning rate schedule (with warmup).
warmup_iters (int): T_w, the number of iterations to linearly warm-up
the learning rate.
cosine_cycle_iters (int): T_c, the number of cosine annealing iterations.

Returns:
Learning rate at the given iteration under the specified schedule.
"""
if warmup_iters < 0 or cosine_cycle_iters < 0:
raise ValueError("warmup_iters and cosine_cycle_iters must be non-negative")

if warmup_iters > 0 and it <= warmup_iters:
return max_learning_rate * (it / warmup_iters)

if cosine_cycle_iters <= 0:
return min_learning_rate

if it >= cosine_cycle_iters:
return min_learning_rate

cosine_span = max(cosine_cycle_iters - warmup_iters, 1)
progress = (it - warmup_iters) / cosine_span
progress = min(max(progress, 0.0), 1.0)
cosine = 0.5 * (1 + math.cos(math.pi * progress))
return min_learning_rate + (max_learning_rate - min_learning_rate) * cosine


def run_save_checkpoint(
model: torch.nn.Module,
optimizer: torch.optim.Optimizer,
iteration: int,
out: str | os.PathLike | BinaryIO | IO[bytes],
):
"""
Given a model, optimizer, and an iteration number, serialize them to disk.

Args:
model (torch.nn.Module): Serialize the state of this model.
optimizer (torch.optim.Optimizer): Serialize the state of this optimizer.
iteration (int): Serialize this value, which represents the number of training iterations
we've completed.
out (str | os.PathLike | BinaryIO | IO[bytes]): Path or file-like object to serialize the model, optimizer, and iteration to.
"""
state = {
"model": model.state_dict(),
"optimizer": optimizer.state_dict(),
"iteration": int(iteration),
}
torch.save(state, out)


def run_load_checkpoint(
src: str | os.PathLike | BinaryIO | IO[bytes],
model: torch.nn.Module,
optimizer: torch.optim.Optimizer,
) -> int:
"""
Given a serialized checkpoint (path or file-like object), restore the
serialized state to the given model and optimizer.
Return the number of iterations that we previously serialized in
the checkpoint.

Args:
src (str | os.PathLike | BinaryIO | IO[bytes]): Path or file-like object to serialized checkpoint.
model (torch.nn.Module): Restore the state of this model.
optimizer (torch.optim.Optimizer): Restore the state of this optimizer.
Returns:
int: the previously-serialized number of iterations.
"""
checkpoint = torch.load(src, map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
return int(checkpoint["iteration"])


class _BPETokenizer:
"""Simple GPT-2 style BPE tokenizer supporting streaming inputs."""

_STREAM_CHUNK_SIZE = 8192

def __init__(
self,
vocab: dict[int, bytes],
merges: list[tuple[bytes, bytes]],
special_tokens: list[str] | None,
) -> None:
self._pretokenizer = regex.compile(GPT2_PRETOKENIZER_PATTERN)

self._id_to_token_bytes: dict[int, bytes] = {}
self._token_bytes_to_id: dict[bytes, int] = {}
for token_id, token_bytes in vocab.items():
idx = int(token_id)
token_bytes = bytes(token_bytes)  # accept any bytes-like input (bytes, bytearray, ...)
self._id_to_token_bytes[idx] = token_bytes
self._token_bytes_to_id[token_bytes] = idx

self._pair_ranks: dict[tuple[bytes, bytes], int] = {}
for rank, pair in enumerate(merges):
if len(pair) != 2:
continue
left, right = (bytes(part) for part in pair)  # normalize both sides to bytes
self._pair_ranks[(left, right)] = rank

self._bpe_cache: dict[bytes, tuple[int, ...]] = {}

deduped_specials: list[str] = []
seen_specials: set[str] = set()
if special_tokens:
for token in special_tokens:
if not isinstance(token, str):
msg = f"Expected special tokens to be strings, got {type(token)!r}"
raise TypeError(msg)
if not token:
raise ValueError("Special tokens must be non-empty strings.")
if token in seen_specials:
continue
seen_specials.add(token)
deduped_specials.append(token)

self._special_tokens = deduped_specials
self._special_token_to_id: dict[str, int] = {}
self._special_regex: regex.Pattern[str] | None = None
self._special_prefixes: dict[int, set[str]] = {}
self._max_special_prefix_len = 0

if self._special_tokens:
regex_tokens = sorted(self._special_tokens, key=len, reverse=True)
pattern = "|".join(regex.escape(token) for token in regex_tokens)
self._special_regex = regex.compile(pattern)
for token in self._special_tokens:
token_bytes = token.encode("utf-8")
token_id = self._token_bytes_to_id.get(token_bytes)
if token_id is None:
msg = f"Special token {token!r} does not exist in the vocabulary."
raise ValueError(msg)
self._special_token_to_id[token] = token_id
for prefix_len in range(1, len(token)):
self._special_prefixes.setdefault(prefix_len, set()).add(token[:prefix_len])
if len(token) > 1:
self._max_special_prefix_len = max(self._max_special_prefix_len, len(token) - 1)

def encode(self, text: str) -> list[int]:
if not isinstance(text, str):
msg = f"Tokenizer.encode expects a string, got {type(text)!r}"
raise TypeError(msg)
return list(self._encode_from_chunks([text]))

def encode_iterable(self, iterable: Iterable[str] | IO[str]) -> Iterable[int]:
chunks = self._chunk_source(iterable)

def generator() -> Iterable[int]:
yield from self._encode_from_chunks(chunks)

return generator()

def decode(self, token_ids: Iterable[int]) -> str:
byte_segments: list[bytes] = []
for token_id in token_ids:
idx = int(token_id)
try:
token_bytes = self._id_to_token_bytes[idx]
except KeyError as exc:
raise KeyError(f"Unknown token id {idx}") from exc
byte_segments.append(token_bytes)
data = b"".join(byte_segments)
if not data:
return ""
try:
return data.decode("utf-8")
except UnicodeDecodeError:
# Decoding individual tokens may produce incomplete multi-byte sequences.
# Fall back to a byte-preserving decode so callers can still inspect tokens.
return data.decode("latin-1")

def _chunk_source(self, source: Iterable[str] | IO[str]) -> Iterable[str]:
read_method = getattr(source, "read", None)
if callable(read_method):
while True:
chunk = read_method(self._STREAM_CHUNK_SIZE)
if not chunk:
break
if not isinstance(chunk, str):
chunk = chunk.decode("utf-8")
if chunk:
yield chunk
return
for chunk in source:
if not isinstance(chunk, str):
msg = f"encode_iterable expects strings, got {type(chunk)!r}"
raise TypeError(msg)
if chunk:
yield chunk

def _encode_from_chunks(self, chunks: Iterable[str]) -> Iterable[int]:
for segment, is_special in self._split_on_special(chunks):
if not segment:
continue
if is_special:
yield self._special_token_to_id[segment]
continue
for match in self._pretokenizer.finditer(segment):
piece = match.group(0)
if not piece:
continue
token_bytes = piece.encode("utf-8")
if not token_bytes:
continue
yield from self._bpe(token_bytes)

def _split_on_special(self, chunks: Iterable[str]) -> Iterable[tuple[str, bool]]:
if not self._special_regex:
for chunk in chunks:
if chunk:
yield chunk, False
return

buffer = ""
for chunk in chunks:
if not chunk:
continue
buffer += chunk
while True:
match = self._special_regex.search(buffer)
if not match:
break
start, end = match.span()
if start:
yield buffer[:start], False
yield match.group(0), True
buffer = buffer[end:]
keep = self._pending_special_prefix_length(buffer)
if keep == 0:
if buffer:
yield buffer, False
buffer = ""
else:
safe_len = len(buffer) - keep
if safe_len > 0:
yield buffer[:safe_len], False
buffer = buffer[safe_len:]
if buffer:
yield buffer, False

def _pending_special_prefix_length(self, text: str) -> int:
if self._max_special_prefix_len == 0 or not text:
return 0
upto = min(len(text), self._max_special_prefix_len)
for length in range(upto, 0, -1):
suffix = text[-length:]
prefixes = self._special_prefixes.get(length)
if prefixes and suffix in prefixes:
return length
return 0

def _bpe(self, token_bytes: bytes) -> tuple[int, ...]:
cached = self._bpe_cache.get(token_bytes)
if cached is not None:
return cached

if token_bytes in self._token_bytes_to_id:
result = (self._token_bytes_to_id[token_bytes],)
self._bpe_cache[token_bytes] = result
return result

word = tuple(token_bytes[i : i + 1] for i in range(len(token_bytes)))
pairs = self._get_pairs(word)

while pairs:
best_pair = min(
pairs,
key=lambda pair: self._pair_ranks.get(pair, float("inf")),
)
if best_pair not in self._pair_ranks:
break
first, second = best_pair
new_word: list[bytes] = []
i = 0
while i < len(word):
if (
i < len(word) - 1
and word[i] == first
and word[i + 1] == second
):
new_word.append(word[i] + word[i + 1])
i += 2
else:
new_word.append(word[i])
i += 1
word = tuple(new_word)
if len(word) == 1:
break
pairs = self._get_pairs(word)

result = tuple(self._token_bytes_to_id[symbol] for symbol in word)
self._bpe_cache[token_bytes] = result
return result

@staticmethod
def _get_pairs(word: tuple[bytes, ...]) -> set[tuple[bytes, bytes]]:
pairs: set[tuple[bytes, bytes]] = set()
if len(word) < 2:
return pairs
prev = word[0]
for symbol in word[1:]:
pairs.add((prev, symbol))
prev = symbol
return pairs


def get_tokenizer(
vocab: dict[int, bytes],
merges: list[tuple[bytes, bytes]],
special_tokens: list[str] | None = None,
) -> Any:
"""Given a vocabulary, a list of merges, and a list of special tokens,
return a BPE tokenizer that uses the provided vocab, merges, and special tokens.

Args:
vocab (dict[int, bytes]): The tokenizer vocabulary, a mapping from int (token ID in the vocabulary)
to bytes (token bytes)
merges (list[tuple[bytes, bytes]]): BPE merges. Each list item is a tuple of bytes (<token1>, <token2>),
representing that <token1> was merged with <token2>.
Merges are ordered by order of creation.
special_tokens (list[str] | None): A list of string special tokens for the tokenizer. These strings will never
be split into multiple tokens, and will always be kept as a single token.

Returns:
A BPE tokenizer that uses the provided vocab, merges, and special tokens.
"""
if vocab is None:
raise ValueError("vocab must be provided.")
if merges is None:
raise ValueError("merges must be provided.")
return _BPETokenizer(vocab, merges, special_tokens or [])
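As a quick sanity check of the class above, the tokenizer can be driven with a purely byte-level vocabulary (no merges) plus one special token. The vocabulary below is made up for illustration and assumes the same module context (regex and GPT2_PRETOKENIZER_PATTERN) that _BPETokenizer already relies on.

# Minimal usage sketch with an illustrative byte-level vocabulary.
byte_vocab = {i: bytes([i]) for i in range(256)}
byte_vocab[256] = b"<|endoftext|>"

tok = get_tokenizer(byte_vocab, merges=[], special_tokens=["<|endoftext|>"])
ids = tok.encode("hi<|endoftext|>")
assert ids == [104, 105, 256]              # 'h', 'i', then the special token id
assert tok.decode(ids) == "hi<|endoftext|>"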



def run_train_bpe(
input_path: str | os.PathLike,
vocab_size: int,
special_tokens: list[str],
**kwargs,
) -> tuple[dict[int, bytes], list[tuple[bytes, bytes]]]:
"""Given the path to an input corpus, run train a BPE tokenizer and
output its vocabulary and merges.

Args:
input_path (str | os.PathLike): Path to BPE tokenizer training data.
vocab_size (int): Total number of items in the tokenizer's vocabulary (including special tokens).
special_tokens (list[str]): A list of string special tokens to be added to the tokenizer vocabulary.
These strings will never be split into multiple tokens, and will always be
kept as a single token. If these special tokens occur in the input_path,
they are treated as any other string.

Returns:
tuple[dict[int, bytes], list[tuple[bytes, bytes]]]:
vocab:
The trained tokenizer vocabulary, a mapping from int (token ID in the vocabulary)
to bytes (token bytes)
merges:
BPE merges. Each list item is a tuple of bytes (<token1>, <token2>),
representing that <token1> was merged with <token2>.
Merges are ordered by order of creation.
"""
# 1. 参数校验与初始化
pat_str = kwargs.get("pat_str", GPT2_PRETOKENIZER_PATTERN)
special_tokens = special_tokens or []
unique_special_tokens: list[str] = []
seen_specials: set[str] = set()

# 这里的逻辑是去重并保持顺序
for token in special_tokens:
if not isinstance(token, str):
msg = f"Expected special tokens to be strings, got {type(token)!r}"
raise TypeError(msg)
if token not in seen_specials:
seen_specials.add(token)
unique_special_tokens.append(token)

special_tokens_bytes = [token.encode("utf-8") for token in unique_special_tokens]
num_special_tokens = len(special_tokens_bytes)

# 基础词表大小为 256 (字节范围)
if vocab_size < 2**8 + num_special_tokens:
msg = "vocab_size must be at least 256 + number of special tokens"
raise ValueError(msg)

merges_target = vocab_size - num_special_tokens - 2**8
pretokenizer = regex.compile(pat_str)

# 2. 读取文件
with open(input_path, "r", encoding="utf-8") as f:
text = f.read()

words: list[list[int]] = []
word_frequencies: list[int] = []
word_lookup: dict[str, int] = {}

# 3. 预分词 (Pre-tokenization)
# 首先按特殊 token 切分,防止特殊 token 被正则拆散
removable_specials = [token for token in unique_special_tokens if token]
segments = [text]
if removable_specials:
escaped = [regex.escape(token) for token in removable_specials]
split_pattern = regex.compile("|".join(escaped))
segments = [segment for segment in split_pattern.split(text) if segment]

for segment in segments:
for match in pretokenizer.finditer(segment):
token = match.group(0)
if not token:
continue

idx = word_lookup.get(token)
if idx is None:
token_bytes = token.encode("utf-8")
if not token_bytes:
continue
idx = len(words)
word_lookup[token] = idx
words.append(list(token_bytes))
word_frequencies.append(0)

word_frequencies[idx] += 1

# 4. Initialize the BPE statistics (base byte-level vocabulary: one token per byte value, 0-255)
token_id_to_bytes: dict[int, bytes] = {i: bytes([i]) for i in range(256)}
merges: list[tuple[bytes, bytes]] = []
next_token_id = 256

pair_stats: Counter[tuple[int, int]] = Counter()
pair_indices: dict[tuple[int, int], set[int]] = {}
word_pair_counters: list[Counter[tuple[int, int]]] = []

# 初次统计所有单词中的 pair
for idx, token_ids in enumerate(words):
freq = word_frequencies[idx]
if freq == 0 or len(token_ids) < 2:
word_pair_counters.append(Counter())
continue

pair_counter = Counter(zip(token_ids[:-1], token_ids[1:]))
word_pair_counters.append(pair_counter)

for pair, count in pair_counter.items():
pair_stats[pair] += count * freq
pair_indices.setdefault(pair, set()).add(idx)

# --- 内部辅助函数 (闭包) ---
def remove_word_from_stats(word_idx: int) -> None:
counter = word_pair_counters[word_idx]
if not counter:
return
freq = word_frequencies[word_idx]
for pair, count in counter.items():
pair_stats[pair] -= count * freq
if pair_stats[pair] <= 0:
pair_stats.pop(pair, None)

indices = pair_indices.get(pair)
if indices is not None:
indices.discard(word_idx)
if not indices:
pair_indices.pop(pair, None)

def add_word_to_stats(word_idx: int) -> None:
tokens = words[word_idx]
if len(tokens) < 2:
word_pair_counters[word_idx] = Counter()
return

counter = Counter(zip(tokens[:-1], tokens[1:]))
word_pair_counters[word_idx] = counter
freq = word_frequencies[word_idx]
for pair, count in counter.items():
pair_stats[pair] += count * freq
pair_indices.setdefault(pair, set()).add(word_idx)

def merge_word(word_idx: int, pair: tuple[int, int], new_token_id: int) -> None:
tokens = words[word_idx]
if len(tokens) < 2:
return

merged: list[int] = []
i = 0
while i < len(tokens):
if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
merged.append(new_token_id)
i += 2
else:
merged.append(tokens[i])
i += 1
words[word_idx] = merged

# 5. BPE 训练主循环
for _ in range(max(0, merges_target)):
if not pair_stats:
break

# 定义优先级:优先频次高,频次相同比较字节内容(为了确定性)
def pair_priority(item: tuple[tuple[int, int], int]) -> tuple[int, bytes, bytes]:
(left_id, right_id), count = item
return count, token_id_to_bytes[left_id], token_id_to_bytes[right_id]

best_pair, _ = max(pair_stats.items(), key=pair_priority)

left_bytes = token_id_to_bytes[best_pair[0]]
right_bytes = token_id_to_bytes[best_pair[1]]

merges.append((left_bytes, right_bytes))

new_token_id = next_token_id
token_id_to_bytes[new_token_id] = left_bytes + right_bytes

affected_words = pair_indices.pop(best_pair, set())

# 如果没有单词受到影响(理论上不应发生,因为 stats 里有),直接跳过
if not affected_words:
next_token_id += 1
pair_stats.pop(best_pair, None)
continue

# 更新受影响单词的统计信息
for word_idx in sorted(affected_words):
remove_word_from_stats(word_idx)
merge_word(word_idx, best_pair, new_token_id)
add_word_to_stats(word_idx)

pair_stats.pop(best_pair, None)
next_token_id += 1

# 6. 构建最终词表
vocab: dict[int, bytes] = {
idx: token for idx, token in token_id_to_bytes.items() if idx < next_token_id
}

# 添加特殊 Token
for token_bytes in special_tokens_bytes:
if len(vocab) >= vocab_size:
break
vocab[next_token_id] = token_bytes
next_token_id += 1

return vocab, merges
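An end-to-end sketch tying run_train_bpe and get_tokenizer together on a tiny throwaway corpus; the file name, corpus text, and vocab_size are illustrative only.

from pathlib import Path

corpus = Path("tiny_corpus.txt")
corpus.write_text("low lower lowest low low<|endoftext|>", encoding="utf-8")

vocab, merges = run_train_bpe(
    input_path=corpus,
    vocab_size=300,                          # 256 byte tokens + a few merges + 1 special token
    special_tokens=["<|endoftext|>"],
)
tok = get_tokenizer(vocab, merges, special_tokens=["<|endoftext|>"])
ids = tok.encode("low lower<|endoftext|>")
print(ids, tok.decode(ids))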

ViT (Vision Transformer)

ViT (Vision Transformer) 是 Google 在 ICLR 2021 提出的里程碑式工作。它把 Transformer 架构直接搬到图像域,在大规模预训练上打破了 CNN 的统治。CLIP、LLaVA、Stable Diffusion 等多模态模型都以 ViT 作为视觉骨干,因此面试常考。

ViT 总览

一、核心思想:An Image is Worth 16×16 Words

  • 把图片均匀切成 Patch,把每个 Patch 当成一个 Token;整张图就对应一个 Token 序列。
  • 不再依赖卷积的局部归纳偏置和平移不变性,第一层自注意力就拥有全局视野。
  • 视觉和语言共享 Transformer 结构,图文特征更容易对齐。

二、架构流程(Pipeline)

假设输入图像尺寸为 H × W × C(如 224 × 224 × 3),Patch Size P = 16,Embedding 维度 D = 768

  1. Patch Partition
    将图像切成 N = (H × W) / P² = (224 × 224) / (16 × 16) = 14 × 14 = 196 个 Patch,每个 Patch 的形状为 P × P × C
  2. Linear Projection / Patch Embedding
    展平每个 Patch,并通过线性层映射到 D 维。工程中常用 Conv2d(kernel_size=stride=P) 直接完成切块 + 映射。
  3. Positional Embedding
    Transformer 对序列无序,需要向 Patch Embedding 中加可学习的 1D 位置编码,保留 Patch 的空间位置。
  4. Class Token
    在序列最前插入可学习的 [CLS] token,序列长度从 N 变为 N+1。分类时读取 [CLS] 的输出向量。
  5. Transformer Encoder
    堆叠 L 层 Pre-Norm Transformer:LN → MSA → LN → MLP(FFN),层间带残差连接。
  6. MLP Head
    最后再接一个 LN + Linear,输出分类 logits。

Patch Embedding 流程示意

三、ViT vs. CNN(面试高频题)

维度 CNN (ResNet) ViT (Transformer)
归纳偏置 强:先验地假设局部性与平移不变性 弱:没有结构先验,全靠数据学习
数据需求 在小数据集上易训练,表现稳 需要海量数据(JFT-300M 等),ImageNet-1K 上训练更难
感受野 局部 → 随层数加深逐步全局 天然全局,第一层即可关联所有 Patch
计算复杂度 O(H × W),与图像分辨率线性 O(N²),与 Patch 数平方成正比,分辨率高时显存压力大
多模态适配 特征空间与文本差距大,难对齐 与 LLM 架构一致,便于图文对齐(CLIP)

四、关键技术细节

  1. Positional-embedding extrapolation: training at 224² but evaluating at 384² changes the number of patches. The pretrained 2D positional embeddings are usually resized with bicubic interpolation to fit the new sequence length (see the sketch after this list).
  2. Weak performance on small data: lacking convolutional inductive biases, the model has to learn priors such as "neighboring pixels are correlated" from data, so it overfits easily on small datasets.
  3. Self-attention complexity: O(N² · D), where N is the number of tokens. This becomes too expensive at high resolution, motivating variants such as Swin and window attention that reduce the cost.
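A minimal sketch of the bicubic interpolation mentioned in point 1, assuming ViT-B/16-style shapes; the helper name and the random positional embeddings are illustrative.

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Bicubic-resize ViT positional embeddings, keeping the [CLS] slot untouched."""
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]        # [1, 1, D], [1, N, D]
    d = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_tok, patch_pos], dim=1)

pe_224 = torch.randn(1, 1 + 14 * 14, 768)                    # trained at 224x224 with 16x16 patches
pe_384 = resize_pos_embed(pe_224, old_grid=14, new_grid=24)  # evaluate at 384x384
print(pe_384.shape)                                          # torch.Size([1, 577, 768])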

五、常见变体(SOTA 储备)

  • Swin Transformer:窗口注意力 + 移位窗口,使复杂度近似线性 O(N),适合检测和分割。
  • MAE (Masked Autoencoders):ViT 自监督预训练范式,随机 Mask 75% Patch,让模型重建像素,预训练表现突出。
  • DeiT (Data-efficient Image Transformers):引入 Distillation Token,让 ViT 在 ImageNet-1K 这类中等规模数据上也能高效训练。

六、手撕代码:Patch Embedding(PyTorch)

import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
"""Image to Patch Embedding."""

def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
super().__init__()
self.img_size = img_size
self.patch_size = patch_size
self.n_patches = (img_size // patch_size) ** 2
# Conv2d 一次性完成切块与映射,避免手动 reshape
self.proj = nn.Conv2d(
in_chans, embed_dim, kernel_size=patch_size, stride=patch_size
)

def forward(self, x):
# x: [B, C, H, W]
x = self.proj(x) # [B, D, H/P, W/P]
x = x.flatten(2) # [B, D, N]
x = x.transpose(1, 2) # [B, N, D]
return x
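Continuing the sketch, a minimal module covering steps 3-4 of the pipeline above (class token plus learnable positional embeddings) could look like this; it reuses the PatchEmbed defined above, and the hyperparameters are illustrative.

import torch
import torch.nn as nn

class ViTEmbed(nn.Module):
    """Patch embedding + [CLS] token + learnable 1D positional embeddings."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.patch_embed = PatchEmbed(img_size, patch_size, in_chans, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + self.patch_embed.n_patches, embed_dim))

    def forward(self, x):
        x = self.patch_embed(x)                          # [B, N, D]
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # [B, 1, D]
        x = torch.cat([cls, x], dim=1)                   # [B, N+1, D]
        return x + self.pos_embed                        # ready for the Transformer encoder

print(ViTEmbed()(torch.randn(2, 3, 224, 224)).shape)     # torch.Size([2, 197, 768])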

七、总结与代表模型

  • ViT 是多模态大模型的“视觉骨干”,掌握 Input/Output 维度、Patching 机制与 CNN 的区别是面试必备。
  • LLaVA:CLIP-ViT-L/14。
  • Qwen-VL:ViT-bigG。
  • Stable Diffusion:CLIP-ViT-L/14。

CLIP (Contrastive Language-Image Pre-training)

CLIP 是 OpenAI 于 2021 年提出的双塔多模态模型,被称为“图文对齐的基石”。在多模态岗位面试中,它的核心思想、损失函数与工程细节几乎必问。

CLIP 双塔架构示意

一、核心思想

  • 使用图像编码器和文本编码器分别提取特征,再映射到统一语义空间。
  • 通过对比学习(Contrastive Learning)拉近正样本距离、推远负样本距离,从而实现 Zero-shot 分类。
  • 面试金句:CLIP 打通视觉与语言的语义壁垒,让模型“看图懂语义”。

二、架构细节

  1. Image Encoder(视觉塔):ResNet-50、ViT-B/16、ViT-L/14 等结构,输出 D 维视觉向量。
  2. Text Encoder(文本塔):Transformer 结构,输入加入 [SOS][EOS],取 [EOS] 位置作为句子表示。
  3. Projection Head(映射层):线性层映射至同一维度并 L2 归一化,无 Cross-Attention,推理高效。

三、训练目标:对比学习

  1. 数据规模:WIT-400M(4 亿图文对),弱监督规模决定上限。

  2. Similarity matrix: for the N pairs in a batch, compute image features {v_i^I} and text features {v_j^T}, then s_ij = (v_i^I · v_j^T) / (||v_i^I|| · ||v_j^T||), i.e. the cosine similarity after L2 normalization.

  3. InfoNCE / 对称 Cross Entropy

    • 对角线 (i, i) 为正样本,其余为负样本。
    • 行维度做 Softmax(Image→Text),列维度做 Softmax(Text→Image),两者平均。
    • 引入可学习温度 τ 控制分布尖锐度,通常约束 τ ≥ 0.01 避免梯度爆炸。
# image_encoder: ResNet / ViT
# text_encoder: Transformer
# W_i, W_t: 线性映射到共享空间
# t: learnable temperature

I_f = image_encoder(I) # [N, d_i]
T_f = text_encoder(T) # [N, d_t]
I_e = l2_normalize(I_f @ W_i, axis=1) # [N, D]
T_e = l2_normalize(T_f @ W_t, axis=1) # [N, D]
logits = (I_e @ T_e.T) * np.exp(t) # [N, N]
labels = np.arange(N)
loss_i = cross_entropy(logits, labels, axis=1)
loss_t = cross_entropy(logits.T, labels, axis=1)
loss = (loss_i + loss_t) / 2
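A runnable PyTorch version of the loss above, assuming the image and text features have already been projected to a shared dimension; the batch size, feature dimension, and the log(1/0.07) temperature initialization are illustrative.

import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over an [N, D] batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale.exp() * image_emb @ text_emb.t()   # [N, N] cosine similarities / tau
    labels = torch.arange(logits.shape[0], device=logits.device)
    loss_i = F.cross_entropy(logits, labels)                # image -> text
    loss_t = F.cross_entropy(logits.t(), labels)            # text -> image
    return (loss_i + loss_t) / 2

logit_scale = torch.nn.Parameter(torch.tensor(2.6593))      # log(1 / 0.07)
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512), logit_scale)
print(loss)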

四、Zero-shot 推理

  1. Prompt Engineering:将标签写成模板句子 A photo of a {label}.
  2. 使用文本塔编码所有 Prompt,缓存文本特征。
  3. 图片通过视觉塔得到特征,与所有文本特征计算余弦相似度,得分最高者即预测。
  4. 多模板取平均可显著提升 Zero-shot 表现。

五、面试常问问题

  • 为什么 Batch Size 极大? 对比学习依赖负样本,Batch 越大,负样本越多,特征越鲁棒;CLIP 训练 Batch 可达 32K。
  • 温度 τ 的作用? 调节 Softmax 尖锐度,CLIP 中为可学习标量,常在 log 域裁剪保障下界。
  • 有哪些局限? 不擅长计数/空间关系/OCR,输入分辨率 224×224 对小目标不敏感。
  • 相比 ImageNet 预训练优势? 数据量大、语言监督更丰富、对分布偏移更鲁棒。
  • 如何拓展到检测/分割? GLIP、Grounding DINO、RegionCLIP 通过区域对齐文本;结合 SAM 可做文本分割。

六、应用与地位

  • Stable Diffusion:使用 CLIP Text Encoder 解析 Prompt。
  • LLaVA / Qwen-VL:采用 CLIP ViT-L/14 作为视觉骨干再接 LLM。
  • CLIP + ViT 基本覆盖现阶段多模态视觉前端 80% 的面试考点。

拟合与泛化

过拟合 vs 欠拟合

  • 过拟合:训练集表现很好,但验证/测试集性能下降,说明模型记住了噪声或特例。
  • 欠拟合:训练集和验证集都表现糟糕,通常意味着模型容量不足或训练不充分。

如何判断

  • 绘制训练 Loss 与验证 Loss 曲线。
    • 训练 Loss 持续下降而验证 Loss 上升 → 过拟合。
    • 两者都停留在高位 → 欠拟合。
  • 对比训练/验证准确率或其他指标是否出现明显分叉。

缓解策略

  • 应对过拟合
    • 收集更多数据或做增强(翻转、裁剪、颜色抖动、Mixup/CutMix 等)。
    • 增加正则化(L1/L2、Dropout、Label Smoothing、数据噪声)。
    • 降低模型复杂度、使用 Early Stopping、应用 BatchNorm。
  • 应对欠拟合
    • 使用更大的模型或更强的结构。
    • 训练更久、采用更合适的学习率策略。
    • 降低正则化强度或改进特征。

数据准备与特征工程

数据集划分

  • 典型拆分:训练集/验证集/测试集(例如 8/1/1),保证各子集分布一致。
  • 数据量有限时可使用 k 折交叉验证轮流作为验证集。
  • 保持随机种子与分层抽样,避免类别不平衡导致的偏差。

预处理与特征工程

  • 数值特征常做标准化(Z-score)或归一化到固定区间,防止量纲影响梯度。
  • 图像常做均值方差归一化、直方图均衡、白化等;文本需要分词/Tokenizer、构建词典、转换为 Embedding。
  • 离散特征通过 one-hot、Embedding、目标编码等方式注入模型。

批处理与数据管线

  • DataLoader 通常负责 shuffle、batch、并行加载与缓存,保证训练稳定。
  • Prefetch、pin memory、mmap、TFRecord 之类技巧可提高吞吐。
  • 在线数据增强(在读取时随机变换)能避免存储大量增强样本。
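A small illustration of such a pipeline with torch.utils.data.DataLoader; the dataset is random and the settings are illustrative (in real training you would also raise num_workers / prefetch_factor, which requires a __main__ guard for worker processes).

import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
loader = DataLoader(
    ds,
    batch_size=64,
    shuffle=True,        # reshuffle every epoch
    pin_memory=True,     # faster host-to-GPU copies
    drop_last=True,
)
for x, y in loader:
    pass                 # forward/backward would go here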

前向传播与反向传播

计算图

  • 深度学习模型可视为由线性/非线性层组成的有向无环图,前向传播按拓扑顺序计算输出值。
  • 常见操作:矩阵乘、卷积、逐元素激活、拼接、归一化等。

反向传播

  • Based on the chain rule: if y = f(u) and u = g(x), then dy/dx = (dy/du) · (du/dx).
  • Starting from the gradient of the loss with respect to the output and multiplying by local gradients in reverse order along the compute graph yields the gradients of all parameters.

自动微分

  • 框架(PyTorch、TensorFlow、JAX)都会记录计算图并自动求导,开发者只需定义前向过程。
  • 明确 requires_grad/stop_gradient、合理释放梯度(zero grad),可以避免显存泄漏。
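A tiny autograd walkthrough of these points; the tensors are made up.

import torch

w = torch.randn(3, requires_grad=True)      # parameter tracked by autograd
x = torch.tensor([1.0, 2.0, 3.0])
loss = (w @ x - 1.0) ** 2                   # scalar loss; the graph is recorded on the fly
loss.backward()                             # chain rule applied automatically
print(w.grad)                               # dL/dw = 2 * (w @ x - 1) * x
w.grad.zero_()                              # reset before the next backward pass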

正则化

L1 与 L2

  • L1 regularization: add λ · Σ|w| to the loss; many weights are pushed exactly to 0, giving a sparse model that helps feature selection.
  • L2 regularization: add λ · Σw² to the loss; all weights shrink but stay non-zero, giving a smoother, more stable model.
  • Intuition: L1 "turns many parameters off entirely", while L2 "shrinks every parameter a little".

其他常见手段

  • Dropout:训练时随机屏蔽部分神经元,相当于做子网络集成,减少共适应。
  • BatchNorm / LayerNorm:稳定每层输入分布,允许更大学习率,并带来自然的正则化效果。
  • Label Smoothing: replace the 1 in the one-hot target with 1 − ε and spread ε/(K−1) over the other K−1 classes, reducing over-confidence.
  • Early Stopping / Weight Decay:监控验证集,当指标不再提升时提前停止;Weight Decay 与 L2 等价,常直接作用于优化器。
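One way these regularizers typically appear together in PyTorch; the architecture and hyperparameters are illustrative.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(256, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)   # decoupled L2
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)                            # softened targets

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()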

激活函数

Mathematical forms and key points of common activation functions:

  • Sigmoid: σ(x) = 1 / (1 + e^(−x)), outputs in (0, 1) and can be read as a probability; its gradient saturates at both ends.
  • Tanh: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)), outputs in (−1, 1); converges faster than Sigmoid but still saturates.
  • ReLU: max(0, x); simple and efficient, but neurons can die ("Dead ReLU").
  • Leaky ReLU: max(αx, x) with a small fixed α (e.g. 0.01), which mitigates Dead ReLU.
  • PReLU: same form as Leaky ReLU, but the negative-slope α is learned.
  • ELU: x for x > 0 and α(e^x − 1) for x ≤ 0, giving a smoother negative half.
  • GELU: x · Φ(x), where Φ is the standard normal CDF; common in Transformers.
  • Swish: x · σ(βx) (usually β = 1, i.e. SiLU); smoother gradients and slightly better performance than ReLU.

损失函数与指标

交叉熵

  • 标签为 one-hot 时等价于最大化正确类别的 log 概率。
  • 与 softmax 组合后的梯度稳定、本质在最小化真实分布与预测分布的 KL 散度。

其他损失函数

  • MSE: (1/n) Σ (y_i − ŷ_i)², sensitive to outliers, the default choice for regression.
  • MAE: (1/n) Σ |y_i − ŷ_i|, more robust to outliers but not differentiable at 0.
  • Huber: quadratic while the error is below a threshold δ, linear beyond it, combining the strengths of MSE and MAE.
  • Hinge / Multi-class Hinge: max(0, 1 − y·f(x)), used in max-margin classifiers such as SVMs.
  • Focal Loss: −(1 − p_t)^γ · log(p_t), where p_t is the predicted probability of the true class; the exponent γ down-weights easy examples, so it is common in detection and imbalanced classification.

常见指标

  • 分类:Accuracy、Precision、Recall、F1、ROC/AUC。
  • 回归:MSE/MAE、RMSE、R²;MSE 对离群点敏感,MAE 更鲁棒但在 0 点不可导。

优化与训练策略

Mini-Batch 必要性

  • Full-batch 梯度最精确但慢且耗显存;纯 SGD (batch=1) 更新快却噪声大。
  • Mini-batch 在效率与稳定性之间取折中,可利用 GPU 并行,梯度估计更平滑。

优化器

  • SGD:沿负梯度方向更新。
  • Momentum:引入动量项,积累之前的梯度方向,加速收敛并抑制震荡。
  • Adam:同时估计梯度的一阶/二阶矩,为不同参数分配自适应学习率,前期收敛快,但泛化有时略弱于 SGD+Momentum,可在后期切换。

学习率调度

  • Step/Exponential Decay:按固定间隔或指数下调。
  • Cosine Annealing:富有周期感,可配合 Warmup。
  • Warmup:训练初始从较小 LR 逐步升高,避免震荡。
  • Cyclic / OneCycle:先升后降,在 CV、NLP 任务中常见。
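As a sketch, the cosine-with-warmup schedule implemented earlier in run_get_lr_cosine_schedule can be attached to an optimizer via LambdaLR; setting the base learning rate to 1.0 makes the lambda return the absolute rate. The model and schedule hyperparameters here are illustrative.

import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0)    # base lr 1.0: the lambda returns the absolute lr
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda it: run_get_lr_cosine_schedule(
        it,
        max_learning_rate=3e-4,
        min_learning_rate=3e-5,
        warmup_iters=100,
        cosine_cycle_iters=1000,
    ),
)
for step in range(1000):
    optimizer.step()
    scheduler.step()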

梯度消失/爆炸

  • 原因:深层链式相乘、激活饱和、初始化不当。
  • 对策:使用 ReLU 家族、Xavier/He 初始化、残差结构、BatchNorm/LayerNorm、梯度裁剪。

典型网络模块

CNN

  • Convolution layers have three key advantages: parameter sharing, local receptive fields, and translation invariance.
  • Convolution output size: for input width W, kernel size K, padding P, and stride S, the output width is W_out = ⌊(W − K + 2P) / S⌋ + 1 (verified in the snippet below).
  • Pooling is used for downsampling and robustness: max pooling keeps the strongest response, average pooling is smoother, and global average pooling is common at the classification head.
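A quick numerical check of the output-size formula, using a 7×7 kernel with stride 2 and padding 3 on a 224×224 input.

import torch
import torch.nn as nn

# floor((224 - 7 + 2*3) / 2) + 1 = 112
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=2, padding=3)
print(conv(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 64, 112, 112])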

RNN / LSTM / GRU

  • Vanilla RNNs suffer from vanishing/exploding gradients on long sequences.
  • LSTMs maintain a cell state c_t through forget, input, and output gates, easing long-range dependencies.
  • GRUs merge the cell state into the hidden state and keep only update and reset gates, with fewer parameters and faster computation.

Attention 与 Transformer

  • Self-Attention: for each position i, compute the similarity between its Query and the Keys of all positions, then take a weighted sum of the Values.
  • Multi-Head: several (Q, K, V) projections run in parallel, capturing different kinds of relations.
  • Because attention itself is order-agnostic, learnable or sinusoidal positional encodings must be added.
  • Core formula: Attention(Q, K, V) = softmax(QK^T / √d_k) · V (a runnable version follows this list).
  • Transformers are built entirely on self-attention, process sequences in parallel, and scale readily to GPT, BERT, ViT, and other large models.
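A compact implementation of the core formula, with an optional causal mask; the shapes are illustrative.

import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    """q, k, v: [batch, heads, seq, d_k]; returns the attended values."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])     # [B, H, T, T]
    if causal:
        t = scores.shape[-1]
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))          # hide future positions
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 8, 16, 64)
print(scaled_dot_product_attention(q, k, v, causal=True).shape)   # torch.Size([2, 8, 16, 64])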

归一化与正则细节

  • BatchNorm:在通道维上对 mini-batch 标准化,再学习缩放/平移,提升收敛速度并具备轻微正则化;训练和推理需区分均值/方差的来源。
  • LayerNorm:对同一样本的特征维做标准化,与 batch 大小无关,适合 Transformer/NLP。
  • Dropout数据增强 搭配使用可明显提升泛化。
  • 权重初始化:Xavier/Glorot 适合近似线性激活,He 初始化匹配 ReLU 家族;良好的初始化能避免一开始就梯度消失。

一份把 CNN 与 Transformer 串起来的快速笔记,记录关键公式、训练直觉与二者之间的联系。

1. 深度网络训练循环

  1. 正向传播:输入沿着网络逐层计算得到输出。
  2. 计算损失:把输出与标签送入损失函数,得到标量损失。
  3. 反向传播:利用链式法则计算各层梯度。
  4. 参数更新:优化器使用梯度更新权重,循环往复。

2. CNN 基础组件

2.1 卷积层与特征提取

  • 卷积核在图像上滑动,通过局部感受野提取空间特征,可并行堆叠多组卷积层。
  • 示例:输入为 224×224×3,使用 64 个 7×7 卷积核、步长 2,可得到 112×112×64 的输出;空间分辨率减半、通道数等于卷积核个数。
  • 更深的卷积核(通道数更多)可以捕获更复杂的特征模式,输出 tensor 的空间尺寸由步长与 padding 控制。

2.2 ReLU(Rectified Linear Unit)

  • 引入非线性,提升模型表征能力。
  • 通过截断负值缓解梯度消失,使深层网络更易训练。

2.3 池化层

  • 常见的 2×2 最大池化会在每个窗口取最大值,输出 56×56×128 这样的结果(由 112×112×128 池化而来)。
  • 作用:降低空间分辨率、聚合局部信息、减少计算与过拟合风险。

2.4 全连接层与 Softmax

  • The feature maps produced by convolution and pooling are flattened into a vector before the fully connected layers.
  • Example: a 56×56×128 = 401408-dimensional input mapped to 4096 dimensions needs a 4096×401408 weight matrix and a 4096×1 bias: h = W·x + b.
  • For a 10-class task, one more 10×4096 linear layer produces the logits.
  • Softmax turns the logits z into a probability distribution: p_i = e^(z_i) / Σ_j e^(z_j), so that Σ_i p_i = 1.

2.5 梯度消失的来源

  • Sigmoid / saturating activations: the derivative peaks at only 0.25 and is nearly 0 in the saturated regions, so repeated multiplication makes the gradient decay exponentially.
  • Weights initialized too small: if W ≈ 0.01, backpropagation keeps multiplying by 0.01 and the gradient tends to 0; hence initialization schemes such as Xavier and He.
  • Very deep networks: the gradient passed back through L layers is a product of L per-layer Jacobians ∂h_l/∂h_(l−1); if each factor is slightly below 1, the product shrinks rapidly.
  • No skip connections: in a plain chain, the gradient must pass through every layer and cannot bypass poorly behaved intermediate layers.

3. ResNet 的核心思想

3.1 残差连接公式

  • Residual block output: y = F(x) + x, where F(x) is the residual branch (a few Conv/BN/ReLU layers) and x is the identity mapping.
  • Backpropagation: ∂L/∂x = ∂L/∂y · (1 + ∂F/∂x).

The "+1" term lets the gradient flow straight back to earlier layers instead of being squeezed toward 0 by repeated multiplication.

残差连接让梯度可直达前层

3.2 退化问题与 ResNet 的改进

  • In plain deep networks, adding layers can make the training error go up (the degradation problem).
  • In the paper, a 34-layer plain network reaches 28.54% top-1 error on ImageNet, worse than the 18-layer network's 27.94%; with residual connections the 34-layer network drops to 25.03%.
  • Reason: if some layers cannot improve the representation further, the residual branch can learn F(x) = 0, so the block degenerates to y = x and extra depth never destroys what has already been learned.
plain 网络与 ResNet 的对比

3.3 残差块结构示意

输入 x

├──▶ F(x):Conv → BN → ReLU → Conv → BN

└──────────────▶ +


y = F(x) + x
  • If the dimensions do not match, a 1×1 convolution or projection matrix reshapes x to the same shape as F(x) before the addition (see the sketch below).
  • The residual path lets information "skip" layers, so even if the intermediate convolutions are temporarily poorly trained, gradient flow is not blocked.
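A minimal PyTorch sketch of such a residual block, using a 1×1 projection on the shortcut when the shapes differ; the channel counts are illustrative, and the BatchNorm that full ResNets place on the projection shortcut is omitted for brevity.

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet-style basic block: y = ReLU(F(x) + shortcut(x))."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection when the residual branch changes the shape
        self.shortcut = (
            nn.Identity()
            if stride == 1 and in_ch == out_ch
            else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.shortcut(x))

print(BasicBlock(64, 128, stride=2)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])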

3.4 与 Transformer 的联系

  • Transformers extend the residual idea to the self-attention and feed-forward sublayers: each sublayer output is written as y = LayerNorm(x + Sublayer(x)).

  • This likewise lets gradients pass straight through the attention or FFN sublayer, stabilizing deeply stacked models.

4. Transformer 架构速记

4.1 整体结构

  • 经典 Transformer 采用编码器-解码器架构:每层由自注意力 + 前馈网络组成,堆叠多层后可建模长序列关系。
  • 解码端还包含编码器-解码器注意力,用于关注编码器输出。
Transformer 编码器-解码器宏观结构

4.2 自注意力(Scaled Dot-Product Attention)

  • The input is projected to queries Q, keys K, and values V, and attention is computed as Attention(Q, K, V) = softmax(QK^T / √d_k) · V.

  • Dot-product attention maps directly onto efficient matrix multiplication and is memory friendly.
缩放点积注意力计算流程
多头注意力并行关注不同子空间

4.3 多头注意力

  • Multi-head attention lets the model attend to different subspaces at once: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O, with head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V).

  • Each head has a small dimension, so the total cost is close to single-head attention while capturing dependencies at multiple scales.

4.4 自回归与注意力掩码

  • A language model factorizes autoregressively: p(x_1, ..., x_T) = Π_t p(x_t | x_1, ..., x_(t−1)).

  • To preserve this property, the decoder's self-attention adds a −∞ mask on future positions, so the softmax depends only on tokens that have already been generated.

4.5 Position-wise Feed-Forward Network

  • A two-layer MLP applied independently at every position, sharing weights across positions but not across layers: FFN(x) = max(0, x·W_1 + b_1)·W_2 + b_2.

  • It can be viewed as a convolution with kernel size 1; the inner dimension is typically d_ff = 2048 for d_model = 512 (about 4× d_model).
逐位置前馈网络结构示意

4.6 Embedding、Softmax 与参数共享

  • Token IDs are first mapped into a d_model-dimensional space by the embedding matrix; the same matrix can be reused for the output layer (weight sharing), and the decoder logits are passed through a Softmax to obtain a probability distribution.
  • For numerical stability, the input embeddings are usually scaled by √d_model.

4.7 位置编码(Positional Encoding)

  • A pure attention model has no notion of order, so extra vectors must inject positional information. The paper uses fixed sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).

  • Each dimension corresponds to a different frequency; relative positions can be expressed as linear transforms of these encodings, they generalize to longer sequences, and they add no parameters (a small implementation follows below).
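A small implementation of these fixed encodings; the maximum length and model dimension are illustrative.

import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sin/cos positional encodings of shape [max_len, d_model]."""
    position = torch.arange(max_len).unsqueeze(1)                                  # [max_len, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

print(sinusoidal_positional_encoding(128, 512).shape)   # torch.Size([128, 512])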

线性代数是机器学习、图形学、控制论乃至量子计算的底层语言。学习过程中如果只背公式,往往无法把抽象符号与具体场景联系起来。本文从零开始梳理关键概念、常见例子与学习路线,帮助你在“理解—计算—应用”之间建立桥梁。

1. 为什么线性代数如此重要

  • 数据表示:向量可以描述单个样本的特征,矩阵可以并行操作一批样本。
  • 线性变换:模型权重(全连接层、卷积核)本质都是线性映射。
  • 优化与分解:梯度、Hessian、奇异值分解 (SVD) 等都植根于线性代数。

2. 前置知识清单

  1. 代数基础:熟悉实数运算、因式分解、函数图像。
  2. 集合与映射:理解“输入 输出”的函数关系,知道多元函数的含义。
  3. 基础几何:二维平面、三维空间中的点、向量、角度与面积。
  4. 初等数列:能处理求和符号 与简单递推。

若上述内容尚不牢固,建议先配合高中代数/几何教材或可汗学院的基础课程复习。

3. 向量:既是箭头也是列表

  • Definition: an n-dimensional vector is written x = (x_1, x_2, ..., x_n)^T and is usually treated as a column vector.
  • Geometric intuition: a 2-D vector is an arrow in the plane; its length ||x|| = √(x_1² + x_2²) measures the arrow's "strength", and its direction is set by the coordinates.
  • 现实例子:影评情感向量 可表示“正面、负面、中性”特征贡献。
  • Basic operations
    • Addition: x + y = (x_1 + y_1, ..., x_n + y_n), equivalent to chaining two displacements.
    • Scalar multiplication: c·x stretches or shrinks the vector's length.
    • Dot product: x · y = Σ_i x_i y_i, a measure of similarity or projection.

示例:设用户喜好向量 表示“动作片”“爱情片”的偏好权重,电影 A 的向量为 ,电影 B 的向量为 。点积 说明用户与电影 A 的特征更接近,推荐系统就会优先推送电影 A。

4. 向量空间、线性组合与基

  • Vector space: a set closed under addition and scalar multiplication; for example, all 2-D vectors form R².
  • Linear combination: c_1·v_1 + c_2·v_2 + ... + c_k·v_k. If the linear combinations of a set of vectors cover the whole space, that set spans (generates) the space.
  • Basis and dimension: a minimal spanning set is called a basis. In 2-D the standard basis is e_1 = (1, 0), e_2 = (0, 1); the number of basis vectors is the dimension.
  • Concrete picture: if a drink's flavor space is described by "sweet" and "sour" vectors, every drink can be written as a linear combination of them; changing the basis is like describing the same space with "citrus" and "berry" instead.

5. 矩阵:线性变换的载体

  • Definition: an m×n matrix is a table with m rows and n columns; it can be read as a linear map from n-dimensional inputs to m-dimensional outputs.
  • Matrix action: y = A·x. The columns of the matrix record where the basis vectors land after the transformation.
  • Geometric examples
    • Scaling matrix [[2, 0], [0, 1/2]]: stretch by 2 horizontally, compress to half vertically.
    • Rotation matrix [[cos θ, −sin θ], [sin θ, cos θ]]: rotate counter-clockwise by θ around the origin.
  • Composition: applying two transforms in sequence is the same as multiplying their matrices, B·A (A first, then B).

A matrix can be viewed as a "coordinate-axis deformation machine". In 2-D the standard basis is e_1 = (1, 0), e_2 = (0, 1); a matrix A = [a_1 | a_2] maps e_1 to its first column a_1 and e_2 to its second column a_2. These two new vectors are the "deformed" axes. Any vector v = v_1·e_1 + v_2·e_2 is therefore mapped to A·v = v_1·a_1 + v_2·a_2.

This formula says that the columns of a matrix are the stretched or twisted coordinate axes: adding the original coefficients v_1, v_2 along the new axes gives the final coordinates.

Axis-view example: let S = [[1, k], [0, 1]] (a horizontal shear matrix). Then S·e_1 = (1, 0) and S·e_2 = (k, 1), so the x-axis stays fixed while the y-axis is tilted toward the direction (k, 1). Any point (x, y) is mapped to (x + k·y, y), which makes the "dragged along the new axis" effect easy to see.

Example: a pixel p in an image is first rotated by 45° and then scaled by 2. The combined matrix is M = S·R, where R is the 45° rotation and S = [[2, 0], [0, 2]], and the final coordinates are M·p; reading the product from right to left makes the "rotate first, then scale" order explicit.

6. 矩阵基本运算

Operation Expression Meaning
Addition (A + B)_ij = a_ij + b_ij Add element-wise
Scalar multiple (c·A)_ij = c·a_ij Multiply every element by a constant
Product (A·B)_ij = Σ_k a_ik · b_kj Compose transforms, or wire "features → outputs"
Transpose (A^T)_ij = a_ji Swap rows and columns; the dot product can be written x·y = x^T·y
Inverse A^(−1), with A·A^(−1) = I Undo the transform; exists only for invertible matrices

理解矩阵乘法的“行 × 列”视角十分关键:每一行代表一个输出变量如何聚合输入各分量。

7. 线性方程组与高斯消元

Writing a system of equations as A·x = b gives a uniform way to discuss how to solve it:

通过高斯消元(行变换)可化为上三角矩阵,再回代得到解。几何上,该例表示两条直线的交点;若系数矩阵行向量共线则无唯一交点。

示例 1(唯一解):上式增广矩阵 得到 。二者对应平面直线在点 相交。

示例 2(欠定/最小二乘):若

, 两行相同导致方程组无解。最小二乘解

表示找到距离两条“重合直线”最近的点。

8. 行列式:体积与可逆性的度量

  • The determinant det(A) measures how the volume of the unit cube is scaled by the transformation A.
  • det(A) = 0 means the transformation squashes space into a lower-dimensional subspace, so the matrix is not invertible.
  • 2-D example: a 2×2 matrix with determinant −2 scales areas by a factor of 2 and flips orientation.

三维示例:设

, 利用展开可得

这意味着单位立方体被拉伸为体积为 3 的平行六面体,并保持右手坐标系方向。

9. 特征值与特征向量

An eigenvector satisfies A·v = λ·v: the transformation leaves its direction unchanged and only scales it by λ.

  • Real-world use: in principal component analysis (PCA), the eigenvectors of the covariance matrix are the directions of maximum variance, and the eigenvalues give the variance along them (see the sketch below).
  • Computation: solve the characteristic equation det(A − λI) = 0 for the eigenvalues, then substitute back to find the eigenvectors.
  • Geometric meaning: find the axes that are "never twisted", which helps judge the stability of a system or the principal directions of data.
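A short PCA sketch along these lines, using synthetic data and the eigendecomposition of its covariance matrix.

import torch

torch.manual_seed(0)
x = torch.randn(500, 3) * torch.tensor([3.0, 1.0, 0.1])   # synthetic data with anisotropic variance
x = x - x.mean(dim=0)                                     # zero-mean the samples
cov = x.t() @ x / (x.shape[0] - 1)                        # covariance matrix
eigvals, eigvecs = torch.linalg.eigh(cov)                 # ascending eigenvalues, orthonormal eigenvectors
top2 = eigvecs[:, -2:]                                    # directions of largest variance
projected = x @ top2                                      # 2-D PCA projection
print(eigvals)
print(projected.shape)                                    # torch.Size([500, 2])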

10. 正交性、投影与最小二乘

  • Orthogonal vectors: u · v = 0, i.e. the angle between them is 90°.
  • Orthogonal matrix: Q^T·Q = I, preserving both lengths and angles.
  • Projection formula: proj_u(v) = ((v · u) / (u · u)) · u.
  • Least squares: when A·x = b has no exact solution, choose x̂ minimizing the residual ||A·x − b||², which solves the normal equations A^T·A·x̂ = A^T·b and amounts to projecting b onto the column space of A.

投影示例:令 表示“沿对角线移动”的方向, 表示某个三维信号。将 投影到 上得到 意味着信号在“对角线平面”上的分量是 ,剩余的 则是与 正交的噪声。

11. 常见矩阵分解

Decomposition Form Typical uses
LU A = L·U, with L lower-triangular and U upper-triangular Factor once, solve many right-hand sides
QR A = Q·R, with Q orthogonal and R upper-triangular Orthogonalization, numerically stable least squares
Eigendecomposition A = Q·Λ·Q^(−1) (symmetric matrices diagonalize orthogonally: A = Q·Λ·Q^T) Spectral clustering, PCA, dynamical-systems analysis
SVD A = U·Σ·V^T Dimensionality reduction, pseudo-inverse, low-rank approximation, recommender systems

分解的本质是把复杂映射拆解成“旋转 + 缩放 + 投影”等易理解的步骤。

奇异值分解 (SVD) 的 LaTeX 图示

SVD splits an arbitrary matrix into three factors, A = U·Σ·V^T:

  1. V^T: rotate/reflect in the input space so the data lines up with the "principal axes".
  2. Σ: keep the non-negative singular values σ_1 ≥ σ_2 ≥ ... ≥ 0 and apply a pure stretch or compression along the axes.
  3. U: place the deformed result back in the output space, rotating/reflecting it into the target coordinate system.

To stay compatible with Hexo's default MathJax setup, the three steps can be written as an "arrow chain": x → V^T·x → Σ·V^T·x → U·Σ·V^T·x.

This arrow chain shows the "rotate → scale → rotate again" pipeline: every point on the unit circle is first aligned by V^T, then stretched into an ellipse by Σ, and finally mapped into the output space by U. The largest singular value σ_1 is the maximum factor by which the matrix can amplify any direction; the larger it is, the more energy that direction carries, and rapidly decaying singular values mean the matrix is approximately low-rank.

Numerical picture: for a 2×2 matrix, V^T first rotates the unit circle, Σ stretches it into an ellipse whose semi-axes are the singular values σ_1 and σ_2, and U rotates the result into the output coordinates. For dimensionality reduction, keeping only the largest singular values and their corresponding columns gives the best low-rank approximation (a short sketch follows below).
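A brief low-rank-approximation sketch with torch.linalg.svd; the matrix is random and the kept rank k is illustrative.

import torch

torch.manual_seed(0)
a = torch.randn(8, 6)
u, s, vh = torch.linalg.svd(a, full_matrices=False)       # a == u @ torch.diag(s) @ vh
k = 2
a_k = u[:, :k] @ torch.diag(s[:k]) @ vh[:k, :]            # best rank-k approximation (Eckart-Young)
print(s)                                                  # singular values in descending order
print(torch.linalg.matrix_rank(a_k))                      # tensor(2)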

12. 线性代数与机器学习的衔接

  1. 数据标准化:零均值处理等价于把样本投影到均值向量的正交补。
  2. 特征工程:PCA/ICA 即寻找协方差矩阵的特征向量或独立基。
  3. 优化算法:梯度、牛顿法中的 Hessian、二阶近似全部依赖矩阵微积分。
  4. Regularization: L2 regularization limits a weight vector's Euclidean norm, while L1 regularization encourages sparse coefficients.
  5. 深度学习:注意力矩阵、卷积核展开、BatchNorm 统计量都可用线性映射描述。

Examples:
  • Data standardization: zero-centering house-price features (area, age, number of bedrooms) before fitting a linear regression lets the model work in a subspace with the shared offset removed.
  • PCA: reducing 784-dimensional MNIST images to 32 dimensions keeps only the 32 largest singular values and their directions, cutting storage while preserving the main stroke structure.
  • Deep learning: the Query-Key dot product in a Transformer measures how well two vectors align in a shared subspace; when singular values grow too large, spectral normalization is often used to limit how much the attention weights can be amplified.

13. 建议的学习路线

  1. 向量直觉阶段:手画二维、三维向量的加法、点积、投影,理解长度与角度的含义。
  2. 矩阵运算阶段:练习矩阵与向量、矩阵与矩阵的乘法,关注形状匹配与变换效果。
  3. 行列式与消元阶段:手算 2×2、3×3 行列式,掌握高斯消元和秩的概念。
  4. 谱分析阶段:从对称矩阵入手求特征值/特征向量,并用真实数据做 PCA。
  5. 分解与数值阶段:实现 Gram-Schmidt、QR、SVD,理解数值稳定性和条件数。
  6. 应用阶段:编程实现线性回归、低秩图像压缩、推荐系统嵌入,观察每一步的线性代数含义。

14. 练习与资源

  • 可视化课程:3Blue1Brown 的 Essence of Linear Algebra 动画。
  • 系统教材:Gilbert Strang 的 Introduction to Linear Algebra 及 MIT 公开课 18.06。
  • 编程练习:使用 NumPy/PyTorch 验证矩阵运算、SVD、最小二乘公式。
  • 自测题:自己构造小矩阵,判断其可逆性、特征值,并在坐标纸上描绘变换后的网格。

循序渐进地把抽象概念与“箭头如何移动”“数据方差指向哪里”等可视图像绑定,线性代数就能真正成为解决问题的工具而非考试公式。

GGML 学习笔记大纲

1. 矩阵乘法基础

Matrix multiplication is usually written C = A·B. The product is defined only when the number of columns of the left matrix A equals the number of rows of the right matrix B, and its elements are given by the formula below.

1.1 Form

For A of shape m×n and B of shape n×p, the product C = A·B has shape m×p.

1.2 General formula

c_ij = Σ_{k=1..n} a_ik · b_kj, i.e. entry (i, j) is the dot product of row i of A with column j of B.

1.3 Numerical example

[[1, 2], [3, 4]] · [[5, 6], [7, 8]] = [[1·5 + 2·7, 1·6 + 2·8], [3·5 + 4·7, 3·6 + 4·8]] = [[19, 22], [43, 50]]

2. ggml_tensor 结构

struct ggml_tensor {
enum ggml_type type;
struct ggml_backend_buffer * buffer;
int64_t ne[GGML_MAX_DIMS]; // number of elements
size_t nb[GGML_MAX_DIMS]; // stride in bytes:
// nb[0] = ggml_type_size(type)
// nb[1] = nb[0] * (ne[0] / ggml_blck_size(type)) + padding
// nb[i] = nb[i-1] * ne[i-1]

// compute data
enum ggml_op op;
// op params - allocated as int32_t for alignment
int32_t op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];
int32_t flags;
struct ggml_tensor * src[GGML_MAX_SRC];
// source tensor and offset for views
struct ggml_tensor * view_src;
size_t view_offs;
void * data;
char name[GGML_MAX_NAME];
void * extra; // extra things e.g. for ggml-cuda.cu
char padding[8];
};

Key points:
  • ne (number of elements) and nb (number of bytes) describe, per dimension, the element count and the byte stride.
  • op and op_params identify the operator node this tensor corresponds to and its parameters, used to build the compute graph.
  • view_src and view_offs let a view tensor share the underlying data, commonly used for slicing, reshape, and similar operations.

3. 常用命令

# GPT-2 单路推理
.\build\bin\Release\gpt-2.exe -m .\models\gpt-2-117M\ggml-model.bin -p "This is an example" -n 128 -t 8 --top_k 40 --top_p 0.9 --temp 0.8

# GPT-2 带批次生成
.\build\bin\Release\gpt-2-batched.exe -np 4 -m .\models\gpt-2-117M\ggml-model.bin -p "Hello my name is" -n 64

# GPT-2 内存分配(alloc 版本)
.\build\bin\Release\gpt-2-alloc.exe -m .\models\gpt-2-117M\ggml-model.bin -p "Sample prompt" -n 80

# GPT-J 推理
.\build\bin\Release\gpt-j.exe -m .\models\gpt-j-6B\ggml-model.bin -p "int main(int argc, char ** argv) {" -n 200 -t 8

# 模型量化示例(F16 -> Q4_0)
.\build\bin\Release\gpt-2-quantize.exe .\models\gpt-2-1558M\ggml-model-f16.bin .\models\gpt-2-1558M\ggml-model-q4_0.bin 2

# SAM 图像分割
.\build\bin\Release\sam.exe -i .\examples\sam\example.jpg -m .\examples\sam\ggml-model-f16.bin -t 8

# YOLOv3-tiny 目标检测
.\build\bin\Release\yolov3-tiny.exe -m .\examples\yolo\yolov3-tiny.gguf -i .\examples\yolo\dog.jpg

4. metadata 速查

字段名 含义示例
shape 矩阵维度,如 (3, 4) 表示 3 行 4 列
dtype 元素类型,例如 float64int32
nnz 稀疏矩阵中非零元素(number of non-zero entries)

创建 Markdown 文件

运行 Hexo 命令自动生成草稿: npx hexo new post "你的标题"

它会在 source/_posts/ 下生成 你的标题.md,带好 front‑matter。 或者直接在 source/_posts/ 里手动新建 yyyy-mm-dd-xxx.md,内容格式如下:

title: 新文章标题
date: 2025-06-06 10:00:00
categories:
- 分类名
tags:
- 标签1
- 标签2
cover: https://你的封面图 (可选)
sticky: 0 # 可选,越大越靠前
The post body starts here…

Local preview / build

npm run dev → open http://localhost:4000 in the browser to check the result. Once everything looks OK, run npm run build (optional, to check that generation completes without errors).

Commit and push

git add source/_posts/xxx.md
git commit -m "post: xxx"
git push origin main
推送后 GitHub Actions 会触发 “Build & Deploy Blog”,几分钟内博客自动更新。

未来想写的东西

TODO

如果你对这些感兴趣,欢迎来 GitHub 找我聊天。新的旅程开始啦!
