Hello / 你好

Gu EnHao's study blog

Sharing code and study notes.

std::string gpt_random_prompt(std::mt19937 & rng) {
    const int r = rng() % 10;
    switch (r) {
        case 0: return "So";
        case 1: return "Once upon a time";
        case 2: return "When";
        case 3: return "The";
        case 4: return "After";
        case 5: return "If";
        case 6: return "import";
        case 7: return "He";
        case 8: return "She";
        case 9: return "They";
    }

    return "The";
}

vocab

vocabulary

The vocabulary converts natural-language text into a numeric representation that a computer can process.

A typical text-processing pipeline has two steps:

Tokenization: split the input natural-language text into a sequence of tokens according to some rule; a token can be a word, a subword, or a single character.

Lookup: look up each token produced by tokenization in the vocabulary to obtain its numeric ID.
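
The two steps are easy to demonstrate with a toy example. The sketch below is not the gpt_tokenize implementation from the ggml example; the whitespace split and the tiny vocabulary (including the <unk> fallback) are made up purely for illustration.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main() {
    // hypothetical vocabulary: token string -> numeric ID
    std::map<std::string, int> vocab = { {"the", 1}, {"cat", 2}, {"sat", 3}, {"<unk>", 0} };

    std::istringstream in("the cat sat");
    std::vector<int> ids;

    std::string tok;
    while (in >> tok) {                        // step 1: tokenization (whitespace split)
        auto it = vocab.find(tok);             // step 2: vocabulary lookup
        ids.push_back(it != vocab.end() ? it->second : vocab["<unk>"]);
    }

    for (int id : ids) std::cout << id << ' '; // prints: 1 2 3
    std::cout << '\n';
}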

gpt2_eval

bool gpt2_eval(
        const gpt2_model & model,
        const int n_threads,
        const int n_past,
        const std::vector<gpt_vocab::id> & embd_inp,
              std::vector<float>         & embd_w,
              size_t                     & mem_per_token);
  • model: the GPT-2 weights and hyperparameters currently loaded; gpt2_model holds all of the ggml tensors (embeddings, per-layer weights, the KV cache, and so on) that the forward pass reads.
  • n_threads: the number of threads used when running ggml_graph_compute_with_ctx, i.e. the degree of parallelism of the forward pass (the -t/--threads command-line option).
  • n_past: the number of tokens already processed, i.e. the context length currently stored in the KV cache; self-attention uses it to compute the write/read offsets into memory_k/v and to mask out positions beyond the available history in the causal mask.
  • embd_inp: the batch of token ids to feed to the model in this call (typically the next chunk of the prompt or the token that was just sampled), of type std::vector<gpt_vocab::id>.
  • embd_w: output parameter; the function writes the forward result (the logits of the last token, of length n_vocab) into this vector for the caller to sample from.
  • mem_per_token: an estimate of the scratch memory needed per token; pass 0 on the first call and the function fills it in from ggml_used_mem(ctx0)/N, after which the caller can size its buffer accordingly and avoid repeated reallocations.

In short, gpt2_eval takes the model, the input tokens, the thread count, and the context length, computes the final logits, and reports the memory usage; the outer generation loop then samples the next token from those logits.

ggml_tensor

struct ggml_tensor {
    enum ggml_type type;

    struct ggml_backend_buffer * buffer;

    int64_t ne[GGML_MAX_DIMS]; // number of elements per dimension
    size_t  nb[GGML_MAX_DIMS]; // stride in bytes per dimension

    enum ggml_op op;           // the operator that produced this tensor
    int32_t op_params[...];    // operator parameters (elided here)

    int32_t flags;

    struct ggml_tensor * src[GGML_MAX_SRC]; // source tensor pointers
    struct ggml_tensor * view_src;
    size_t               view_offs;

    void * data;               // pointer to the actual data
    char   name[GGML_MAX_NAME];
    void * extra;
    char   padding[8];
};
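
Since ne and nb do most of the work when reading ggml kernels, a small sketch can make them concrete. This is a minimal example, assuming ggml.h and the ggml library are available to build against; the tensor sizes are arbitrary.

#include <stdio.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // ne[0] = 4 is the fastest-varying (innermost) dimension, ne[1] = 3 rows
    struct ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);

    // for a contiguous F32 tensor: nb[0] = sizeof(float), nb[1] = ne[0]*nb[0], and so on
    for (int i = 0; i < 4; i++) {
        printf("ne[%d] = %lld, nb[%d] = %zu\n", i, (long long) t->ne[i], i, t->nb[i]);
    }

    ggml_free(ctx);
    return 0;
}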

ggml_op

Common ops (non-exhaustive):

| op               | purpose                                   |
| ---------------- | ----------------------------------------- |
| GGML_OP_MUL_MAT  | Matrix multiplication                     |
| GGML_OP_NORM     | Norm step used in layer/RMS normalization |
| GGML_OP_SOFT_MAX | Softmax                                   |
| GGML_OP_ROPE     | Rotary positional embedding               |

gpt2-ctx-main

int main(int argc, char ** argv) {
ggml_time_init();

const int64_t t_main_start_us = ggml_time_us();

gpt_params params;
params.model = "models/gpt-2-117M/ggml-model.bin";

if (gpt_params_parse(argc, argv, params) == false) {
return 1;
}

if (params.seed < 0) {
params.seed = time(NULL);
}

printf("%s: seed = %d\n", __func__, params.seed);

std::mt19937 rng(params.seed);
if (params.prompt.empty()) {
params.prompt = gpt_random_prompt(rng);
}

int64_t t_load_us = 0;

gpt_vocab vocab;
gpt2_model model;

// load the model
{
const int64_t t_start_us = ggml_time_us();

if (!gpt2_model_load(params.model, model, vocab)) {
fprintf(stderr, "%s: failed to load model from '%s'\n", __func__, params.model.c_str());
return 1;
}

t_load_us = ggml_time_us() - t_start_us;

test_gpt_tokenizer(vocab, params.token_test);
}

int n_past = 0;

int64_t t_sample_us = 0;
int64_t t_predict_us = 0;

std::vector<float> logits;

// tokenize the prompt
std::vector<gpt_vocab::id> embd_inp = ::gpt_tokenize(vocab, params.prompt);

params.n_predict = std::min(params.n_predict, model.hparams.n_ctx - (int) embd_inp.size());

printf("%s: prompt: '%s'\n", __func__, params.prompt.c_str());
printf("%s: number of tokens in prompt = %zu, first 8 tokens: ", __func__, embd_inp.size());
for (int i = 0; i < std::min(8, (int) embd_inp.size()); i++) {
printf("%d ", embd_inp[i]);
}
printf("\n\n");

// submit the input prompt token-by-token
// this reduces the memory usage during inference, at the cost of a bit of speed at the beginning
std::vector<gpt_vocab::id> embd;

// determine the required inference memory per token:
size_t mem_per_token = 0;
gpt2_eval(model, params.n_threads, 0, { 0, 1, 2, 3 }, logits, mem_per_token);

for (size_t i = embd.size(); i < embd_inp.size() + params.n_predict; i++) {
// predict
if (embd.size() > 0) {
const int64_t t_start_us = ggml_time_us();

if (!gpt2_eval(model, params.n_threads, n_past, embd, logits, mem_per_token)) {
printf("Failed to predict\n");
return 1;
}

t_predict_us += ggml_time_us() - t_start_us;
}

n_past += embd.size();
embd.clear();

if (i >= embd_inp.size()) {
// sample next token
const int top_k = params.top_k;
const float top_p = params.top_p;
const float temp = params.temp;

const int n_vocab = model.hparams.n_vocab;

gpt_vocab::id id = 0;

{
const int64_t t_start_sample_us = ggml_time_us();

id = gpt_sample_top_k_top_p(vocab, logits.data() + (logits.size() - n_vocab), top_k, top_p, temp, rng);

t_sample_us += ggml_time_us() - t_start_sample_us;
}

// add it to the context
embd.push_back(id);
} else {
// if here, it means we are still processing the input prompt
for (size_t k = i; k < embd_inp.size(); k++) {
embd.push_back(embd_inp[k]);
if (int32_t(embd.size()) >= params.n_batch) {
break;
}
}
i += embd.size() - 1;
}

// display text
for (auto id : embd) {
printf("%s", vocab.id_to_token[id].c_str());
}
fflush(stdout);

// end of text token
if (embd.back() == 50256) {
break;
}
}

// report timing
{
const int64_t t_main_end_us = ggml_time_us();

printf("\n\n");
printf("%s: mem per token = %8zu bytes\n", __func__, mem_per_token);
printf("%s: load time = %8.2f ms\n", __func__, t_load_us/1000.0f);
printf("%s: sample time = %8.2f ms\n", __func__, t_sample_us/1000.0f);
printf("%s: predict time = %8.2f ms / %.2f ms per token\n", __func__, t_predict_us/1000.0f, t_predict_us/1000.0f/n_past);
printf("%s: total time = %8.2f ms\n", __func__, (t_main_end_us - t_main_start_us)/1000.0f);
}

ggml_free(model.ctx_w);

return 0;
}

Forward and backward passes

Forward

  • At the application level (e.g. examples/gpt-2/main-ctx.cpp (lines 392-695)), the forward pass follows ggml's "build the graph, then execute it" pattern: create the input tensors in a fresh ggml_context, stack operators such as ggml_mul_mat, ggml_norm, and ggml_soft_max_inplace into the full Transformer compute graph, then run it with ggml_build_forward_expand and ggml_graph_compute_with_ctx. No matrix math is written by hand at this level; every operator is defined in the ggml C core (ggml/src/ggml.c).

Backward

  • ggml implements automatic differentiation in ggml/src/ggml.c (lines 6025-6669): ggml_compute_backward() dispatches on each graph node's op type and accumulates gradients through primitive operators such as ggml_add_or_set, ggml_mul, and ggml_repeat_back. For example, GGML_OP_MUL_MAT uses the standard "upstream gradient × transposed weights" / "transposed input × upstream gradient" formulas, and GGML_OP_SOFT_MAX, GGML_OP_ROPE, and friends have dedicated backward functions. Graph traversal, gradient storage (cgraph->grads), and gradient accumulation all live in this code.

The forward pass is "describe the compute graph with the ggml API in C++, then let the ggml core operators execute it";

the backward pass is ggml_compute_backward generating, for each ggml_op, the gradient tensor from the corresponding formula and propagating it back through the graph.
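
To make the "build the graph, then execute it" pattern concrete, here is a minimal sketch. It assumes a reasonably recent ggml checkout where ggml_new_graph and ggml_graph_compute_with_ctx are available; the tensor sizes and constant fill values are arbitrary toy choices, not anything from the gpt-2 example.

#include <stdio.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // leading dimension first: a is a 2x2 matrix, b is a 2-element vector
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 2);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 2);
    ggml_set_f32(a, 2.0f); // fill with constants just to have data
    ggml_set_f32(b, 3.0f);

    // describe the computation: c = mul_mat(a, b); no math happens yet
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);

    // build the graph and execute it
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    for (int i = 0; i < 2; i++) {
        printf("c[%d] = %f\n", i, ggml_get_f32_1d(c, i)); // each element is 2*3 + 2*3 = 12
    }

    ggml_free(ctx);
    return 0;
}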

ggml_mul_mat

// ggml_mul_mat

static inline bool ggml_can_mul_mat(const struct ggml_tensor * t0, const struct ggml_tensor * t1) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");

    return (t0->ne[0] == t1->ne[0])   &&
           (t1->ne[2]%t0->ne[2] == 0) && // verify t0 is broadcastable
           (t1->ne[3]%t0->ne[3] == 0);
}

struct ggml_tensor * ggml_mul_mat(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b) {
    GGML_ASSERT(ggml_can_mul_mat(a, b));
    GGML_ASSERT(!ggml_is_transposed(a));

    const int64_t ne[4] = { a->ne[1], b->ne[1], b->ne[2], b->ne[3] };
    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);

    result->op     = GGML_OP_MUL_MAT;
    result->src[0] = a;
    result->src[1] = b;

    return result;
}

void ggml_mul_mat_set_prec(
        struct ggml_tensor * a,
        enum ggml_prec       prec) {
    GGML_ASSERT(a->op == GGML_OP_MUL_MAT);

    const int32_t prec_i32 = (int32_t) prec;

    ggml_set_op_params_i32(a, 0, prec_i32);
}
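
A note on shapes, read directly off the code above: the assert requires a->ne[0] == b->ne[0] (call it K, the contraction length), and the result is created with ne = { a->ne[1], b->ne[1], b->ne[2], b->ne[3] }. So if a holds ne = [K, M] and b holds ne = [K, N], the product has ne = [M, N]; element-wise this is result[m, n] = Σ_k a[k, m] · b[k, n], i.e. every length-K row of b is dotted against every row of a. The modulo checks on ne[2] and ne[3] let a single weight matrix a be broadcast across the extra batch dimensions of b.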

ggml_norm

// ggml_norm

static struct ggml_tensor * ggml_norm_impl(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        float                 eps,
        bool                  inplace) {
    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    ggml_set_op_params(result, &eps, sizeof(eps));

    result->op     = GGML_OP_NORM;
    result->src[0] = a;

    return result;
}

struct ggml_tensor * ggml_norm(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        float                 eps) {
    return ggml_norm_impl(ctx, a, eps, false);
}

struct ggml_tensor * ggml_norm_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        float                 eps) {
    return ggml_norm_impl(ctx, a, eps, true);
}

ggml_softmax

// ggml_soft_max

static struct ggml_tensor * ggml_soft_max_impl(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias,
        bool                  inplace) {
    GGML_ASSERT(ggml_is_contiguous(a));

    if (mask) {
        GGML_ASSERT(mask->type == GGML_TYPE_F16 || mask->type == GGML_TYPE_F32);
        GGML_ASSERT(ggml_is_contiguous(mask));
        GGML_ASSERT(mask->ne[0] == a->ne[0]);
        GGML_ASSERT(mask->ne[1] >= a->ne[1]);
        GGML_ASSERT(a->ne[2]%mask->ne[2] == 0);
        GGML_ASSERT(a->ne[3]%mask->ne[3] == 0);
    }

    if (max_bias > 0.0f) {
        GGML_ASSERT(mask);
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    float params[] = { scale, max_bias };
    ggml_set_op_params(result, params, sizeof(params));

    result->op     = GGML_OP_SOFT_MAX;
    result->src[0] = a;
    result->src[1] = mask;

    return result;
}

struct ggml_tensor * ggml_soft_max(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_soft_max_impl(ctx, a, NULL, 1.0f, 0.0f, false);
}

struct ggml_tensor * ggml_soft_max_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_soft_max_impl(ctx, a, NULL, 1.0f, 0.0f, true);
}

struct ggml_tensor * ggml_soft_max_ext(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias) {
    return ggml_soft_max_impl(ctx, a, mask, scale, max_bias, false);
}

struct ggml_tensor * ggml_soft_max_ext_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias) {
    return ggml_soft_max_impl(ctx, a, mask, scale, max_bias, true);
}

void ggml_soft_max_add_sinks(
        struct ggml_tensor * a,
        struct ggml_tensor * sinks) {
    if (!sinks) {
        a->src[2] = NULL;
        return;
    }

    GGML_ASSERT(a->op == GGML_OP_SOFT_MAX);
    GGML_ASSERT(a->src[2] == NULL);
    GGML_ASSERT(a->src[0]->ne[2] == sinks->ne[0]);
    GGML_ASSERT(sinks->type == GGML_TYPE_F32);

    a->src[2] = sinks;
}

This is the standard C idiom for returning a pointer to a struct: the return type of ggml_soft_max_inplace is struct ggml_tensor *, i.e. a pointer to a ggml_tensor structure. Because the source never gives struct ggml_tensor a typedef alias, the declaration must spell out struct ggml_tensor *; with a C++-style typedef struct ggml_tensor ggml_tensor;, one could simply write ggml_tensor *. The notation only looks odd because ggml sticks to the most traditional C style for compatibility with plain C compilers. The function body just forwards the argument a, together with the default parameters, to the internal implementation ggml_soft_max_impl and returns its result.
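
As a side note, the same idiom can be shown with a self-contained toy that has nothing to do with ggml; the struct and function names below are made up for illustration.

#include <stdio.h>

struct point { int x, y; };          // no typedef: the type is spelled "struct point"
typedef struct point point_t;        // optional alias, C++-style convenience

static struct point * move_right(struct point * p) { // returns a pointer to the struct
    p->x += 1;
    return p;
}

int main(void) {
    struct point a = { 0, 0 };
    point_t * b = move_right(&a);    // the alias and the struct tag name the same type
    printf("%d %d\n", b->x, b->y);
    return 0;
}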

ggml_rope

// ggml_rope

static struct ggml_tensor * ggml_rope_impl(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int sections[GGML_MROPE_SECTIONS],
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow,
bool inplace) {
GGML_ASSERT((mode & 1) == 0 && "mode & 1 == 1 is no longer supported");

GGML_ASSERT(ggml_is_vector(b));
GGML_ASSERT(b->type == GGML_TYPE_I32);

bool mrope_used = mode & GGML_ROPE_TYPE_MROPE;
if (mrope_used) {
GGML_ASSERT(a->ne[2] * 4 == b->ne[0]); // mrope expecting 4 position ids per token
} else {
GGML_ASSERT(a->ne[2] == b->ne[0]);
}

if (c) {
GGML_ASSERT(c->type == GGML_TYPE_F32);
GGML_ASSERT(c->ne[0] >= n_dims / 2);
}

struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

int32_t params[15] = { /*n_past*/ 0, n_dims, mode, /*n_ctx*/ 0, n_ctx_orig };
memcpy(params + 5, &freq_base, sizeof(float));
memcpy(params + 6, &freq_scale, sizeof(float));
memcpy(params + 7, &ext_factor, sizeof(float));
memcpy(params + 8, &attn_factor, sizeof(float));
memcpy(params + 9, &beta_fast, sizeof(float));
memcpy(params + 10, &beta_slow, sizeof(float));
if (mrope_used && sections) {
memcpy(params + 11, sections, sizeof(int32_t) * GGML_MROPE_SECTIONS);
} else {
memset(params + 11, 0, sizeof(int32_t) * GGML_MROPE_SECTIONS);
}
ggml_set_op_params(result, params, sizeof(params));

result->op = GGML_OP_ROPE;
result->src[0] = a;
result->src[1] = b;
result->src[2] = c;

return result;
}

struct ggml_tensor * ggml_rope(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
int n_dims,
int mode) {
return ggml_rope_impl(
ctx, a, b, NULL, n_dims, NULL, mode, 0, 10000.0f, 1.0f, 0.0f, 1.0f, 0.0f, 0.0f, false
);
}

struct ggml_tensor * ggml_rope_multi(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int sections[GGML_MROPE_SECTIONS],
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
return ggml_rope_impl(
ctx, a, b, c, n_dims, sections, mode, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow, false
);
}

struct ggml_tensor * ggml_rope_multi_inplace(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int sections[GGML_MROPE_SECTIONS],
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
return ggml_rope_impl(
ctx, a, b, c, n_dims, sections, mode, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow, true
);
}

struct ggml_tensor * ggml_rope_inplace(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
int n_dims,
int mode) {
return ggml_rope_impl(
ctx, a, b, NULL, n_dims, NULL, mode, 0, 10000.0f, 1.0f, 0.0f, 1.0f, 0.0f, 0.0f, true
);
}

struct ggml_tensor * ggml_rope_ext(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
return ggml_rope_impl(
ctx, a, b, c, n_dims, NULL, mode, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow, false
);
}

struct ggml_tensor * ggml_rope_ext_inplace(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
return ggml_rope_impl(
ctx, a, b, c, n_dims, NULL, mode, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow, true
);
}

struct ggml_tensor * ggml_rope_custom(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
int n_dims,
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
return ggml_rope_impl(
ctx, a, b, NULL, n_dims, NULL, mode, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow, false
);
}

struct ggml_tensor * ggml_rope_custom_inplace(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
int n_dims,
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
return ggml_rope_impl(
ctx, a, b, NULL, n_dims, NULL, mode, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow, true
);
}

// Apparently solving `n_rot = 2pi * x * base^((2 * max_pos_emb) / n_dims)` for x, we get
// `corr_dim(n_rot) = n_dims * log(max_pos_emb / (n_rot * 2pi)) / (2 * log(base))`
static float ggml_rope_yarn_corr_dim(int n_dims, int n_ctx_orig, float n_rot, float base) {
return n_dims * logf(n_ctx_orig / (n_rot * 2 * (float)M_PI)) / (2 * logf(base));
}

void ggml_rope_yarn_corr_dims(
int n_dims, int n_ctx_orig, float freq_base, float beta_fast, float beta_slow, float dims[2]
) {
// start and end correction dims
float start = floorf(ggml_rope_yarn_corr_dim(n_dims, n_ctx_orig, beta_fast, freq_base));
float end = ceilf(ggml_rope_yarn_corr_dim(n_dims, n_ctx_orig, beta_slow, freq_base));
dims[0] = MAX(0, start);
dims[1] = MIN(n_dims - 1, end);
}

// ggml_rope_back

struct ggml_tensor * ggml_rope_ext_back(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
struct ggml_tensor * result = ggml_rope_ext(
ctx, a, b, c, n_dims, mode, n_ctx_orig, freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow);
result->op = GGML_OP_ROPE_BACK;
return result;
}

struct ggml_tensor * ggml_rope_multi_back(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * c,
int n_dims,
int sections[4],
int mode,
int n_ctx_orig,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
struct ggml_tensor * result = ggml_rope_multi(
ctx, a, b, c, n_dims, sections, mode, n_ctx_orig, freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow);
result->op = GGML_OP_ROPE_BACK;
return result;
}

Hand-rolled Vector

#include <bits/stdc++.h>
using namespace std;
#define int long long

template <typename T>
class Vector {
    T* _data;
    size_t _size;
    size_t _capacity;

    // double the capacity (1 on the first allocation) and copy the elements over
    void expand() {
        size_t new_capacity = (_capacity == 0) ? 1 : _capacity * 2;
        T* new_data = new T[new_capacity];
        for (size_t i = 0; i < _size; i++) new_data[i] = _data[i];
        if (_data) delete[] _data;
        _data = new_data;
        _capacity = new_capacity;
        cout << "capacity expanded to " << _capacity << '\n';
    }

public:
    Vector() : _data(nullptr), _size(0), _capacity(0) {}
    ~Vector() {
        if (_data) {
            delete[] _data;
            _data = nullptr;
        }
    }
    void push_back(const T& value) {
        if (_size == _capacity) {
            expand();
        }
        _data[_size] = value;
        _size++;
    }
    T& operator[](size_t index) { return _data[index]; }
    size_t size() const { return _size; }
    size_t capacity() const { return _capacity; }
};

void solve() {
    Vector<int> v;
    for (int i = 0; i < 6; i++) {
        v.push_back(i);
        cout << "inserted " << i << ", size: " << v.size() << "\n";
    }
}

signed main() {
    solve();
    return 0;
}

CS336 Assignment1

BPE (Byte Pair Encoding): how it works, with a from-scratch implementation

BPE (Byte Pair Encoding) is a very popular subword tokenization algorithm. Originally a data-compression technique, it was later adopted widely in natural language processing, most notably in the tokenizers of large language models such as the GPT series, LLaMA, and RoBERTa.

The core idea of BPE

Starting from the most frequent pair of adjacent symbols in the corpus (initially single characters or bytes), repeatedly merge that pair into a larger subword unit, until the target vocabulary size is reached.

BPE training procedure (classic example)

Suppose we have the following small corpus (each word is suffixed with </w> to mark the end of the word):

low</w>:    5
lower</w>: 2
newest</w>: 6
widest</w>: 3

After splitting into characters:

l o w </w> ×5
l o w e r </w> ×2
n e w e s t </w> ×6
w i d e s t </w> ×3

Count the frequency of every adjacent pair → (e, s) and (s, t) both occur 9 times → merge either one (say es) → keep iterating → high-frequency subwords such as est, low, and lowest eventually emerge.

Two important BPE variants in practice

Original byte-level BPE (used by OpenAI GPT-2)

  • Operates on UTF-8 bytes
  • The base vocabulary is the 256 byte values plus the learned merge rules
  • Advantage: it can encode any Unicode text, so OOV (out-of-vocabulary) tokens never occur

SentencePiece BPE (used by Google, LLaMA, T5, ...)

  • Trained directly on raw text (no pre-tokenization)
  • Also offers a unigram mode (a variant of BPE)
  • Special control symbols can be added (e.g. ▁ to mark a space)

The greedy rule when applying BPE

When tokenizing, apply the merge rules greedily in the order they were learned, earliest first (merges learned earlier in training have higher priority).

For example: if the merge "u" + "n" → "un" was learned first and "un" + "able" → "unable" was learned later, then the string "unable" is merged step by step into "un" + "able" → "unable" rather than being split any other way.

Pros and cons of BPE

Pros:

  • Largely solves the OOV problem (especially for rare words, typos, and newly coined words)
  • Keeps common words intact (high frequency → merged early → a single token)
  • Rare words are split into subwords that still carry meaning
  • The vocabulary size is controllable

Cons:

  • The splits do not necessarily follow semantics or morphology (they are purely statistical)
  • Low-resource languages may end up with very fragmented tokenizations
  • Token efficiency can be lower than WordPiece or a Unigram LM for some languages

In one sentence: BPE is an unsupervised tokenization algorithm that builds a moderately sized, broad-coverage subword vocabulary by repeatedly merging the most frequent adjacent symbol pair in the corpus, and it is one of the cornerstones of today's mainstream LLM tokenizers. A small sketch of the greedy application rule follows.
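
The sketch below is a toy: the merge list is hypothetical and it works on plain strings rather than on bytes and token ids like the assignment code. The point is only that each rule is applied across the whole word before the next rule is considered, so a merge learned earlier always wins.

#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // symbols of the word "unable", split into characters
    std::vector<std::string> word = {"u", "n", "a", "b", "l", "e"};

    // hypothetical merges, in the order they were learned during training
    std::vector<std::pair<std::string, std::string>> merges = {
        {"u", "n"}, {"a", "b"}, {"l", "e"}, {"ab", "le"}, {"un", "able"},
    };

    for (const auto & m : merges) {
        std::vector<std::string> out;
        for (size_t i = 0; i < word.size(); ) {
            if (i + 1 < word.size() && word[i] == m.first && word[i + 1] == m.second) {
                out.push_back(m.first + m.second); // apply this merge rule
                i += 2;
            } else {
                out.push_back(word[i]);
                i += 1;
            }
        }
        word = std::move(out);
    }

    for (const auto & tok : word) std::cout << tok << ' '; // prints: unable
    std::cout << '\n';
}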

Implementation

def run_train_bpe(
    input_path: str | os.PathLike,
    vocab_size: int,
    special_tokens: list[str],
    **kwargs,
) -> tuple[dict[int, bytes], list[tuple[bytes, bytes]]]:
    """Given the path to an input corpus, train a BPE tokenizer and
    output its vocabulary and merges.

    Args:
        input_path (str | os.PathLike): Path to BPE tokenizer training data.
        vocab_size (int): Total number of items in the tokenizer's vocabulary (including special tokens).
        special_tokens (list[str]): A list of string special tokens to be added to the tokenizer vocabulary.
            These strings will never be split into multiple tokens, and will always be
            kept as a single token. If these special tokens occur in the input_path,
            they are treated as any other string.

    Returns:
        tuple[dict[int, bytes], list[tuple[bytes, bytes]]]:
            vocab:
                The trained tokenizer vocabulary, a mapping from int (token ID in the vocabulary)
                to bytes (token bytes)
            merges:
                BPE merges. Each list item is a tuple of bytes (<token1>, <token2>),
                representing that <token1> was merged with <token2>.
                Merges are ordered by order of creation.
    """
    # 1. Argument validation and initialization
    pat_str = kwargs.get("pat_str", GPT2_PRETOKENIZER_PATTERN)
    special_tokens = special_tokens or []
    unique_special_tokens: list[str] = []
    seen_specials: set[str] = set()

    # Deduplicate the special tokens while preserving their order
    for token in special_tokens:
        if not isinstance(token, str):
            msg = f"Expected special tokens to be strings, got {type(token)!r}"
            raise TypeError(msg)
        if token not in seen_specials:
            seen_specials.add(token)
            unique_special_tokens.append(token)

    special_tokens_bytes = [token.encode("utf-8") for token in unique_special_tokens]
    num_special_tokens = len(special_tokens_bytes)

    # The base vocabulary covers the 256 byte values
    if vocab_size < 2**8 + num_special_tokens:
        msg = "vocab_size must be at least 256 + number of special tokens"
        raise ValueError(msg)

    merges_target = vocab_size - num_special_tokens - 2**8
    pretokenizer = regex.compile(pat_str)

    # 2. Read the input file
    with open(input_path, "r", encoding="utf-8") as f:
        text = f.read()

    words: list[list[int]] = []
    word_frequencies: list[int] = []
    word_lookup: dict[str, int] = {}

    # 3. Pre-tokenization
    # Split on special tokens first so the regex cannot break them apart
    removable_specials = [token for token in unique_special_tokens if token]
    segments = [text]
    if removable_specials:
        escaped = [regex.escape(token) for token in removable_specials]
        split_pattern = regex.compile("|".join(escaped))
        segments = [segment for segment in split_pattern.split(text) if segment]

    for segment in segments:
        for match in pretokenizer.finditer(segment):
            token = match.group(0)
            if not token:
                continue

            idx = word_lookup.get(token)
            if idx is None:
                token_bytes = token.encode("utf-8")
                if not token_bytes:
                    continue
                idx = len(words)
                word_lookup[token] = idx
                words.append(list(token_bytes))
                word_frequencies.append(0)

            word_frequencies[idx] += 1

    # 4. Initialize the BPE statistics
    token_id_to_bytes: dict[int, bytes] = {i: bytes([i]) for i in range(256)}
    merges: list[tuple[bytes, bytes]] = []
    next_token_id = 256

    pair_stats: Counter[tuple[int, int]] = Counter()
    pair_indices: dict[tuple[int, int], set[int]] = {}
    word_pair_counters: list[Counter[tuple[int, int]]] = []

    # Initial pass: count the pairs in every word
    for idx, token_ids in enumerate(words):
        freq = word_frequencies[idx]
        if freq == 0 or len(token_ids) < 2:
            word_pair_counters.append(Counter())
            continue

        pair_counter = Counter(zip(token_ids[:-1], token_ids[1:]))
        word_pair_counters.append(pair_counter)

        for pair, count in pair_counter.items():
            pair_stats[pair] += count * freq
            pair_indices.setdefault(pair, set()).add(idx)

    # --- Internal helpers (closures) ---
    def remove_word_from_stats(word_idx: int) -> None:
        counter = word_pair_counters[word_idx]
        if not counter:
            return
        freq = word_frequencies[word_idx]
        for pair, count in counter.items():
            pair_stats[pair] -= count * freq
            if pair_stats[pair] <= 0:
                pair_stats.pop(pair, None)

            indices = pair_indices.get(pair)
            if indices is not None:
                indices.discard(word_idx)
                if not indices:
                    pair_indices.pop(pair, None)

    def add_word_to_stats(word_idx: int) -> None:
        tokens = words[word_idx]
        if len(tokens) < 2:
            word_pair_counters[word_idx] = Counter()
            return

        counter = Counter(zip(tokens[:-1], tokens[1:]))
        word_pair_counters[word_idx] = counter
        freq = word_frequencies[word_idx]
        for pair, count in counter.items():
            pair_stats[pair] += count * freq
            pair_indices.setdefault(pair, set()).add(word_idx)

    def merge_word(word_idx: int, pair: tuple[int, int], new_token_id: int) -> None:
        tokens = words[word_idx]
        if len(tokens) < 2:
            return

        merged: list[int] = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
                merged.append(new_token_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        words[word_idx] = merged

    # 5. Main BPE training loop
    for _ in range(max(0, merges_target)):
        if not pair_stats:
            break

        # Priority: higher count first; break ties on the byte contents for determinism
        def pair_priority(item: tuple[tuple[int, int], int]) -> tuple[int, bytes, bytes]:
            (left_id, right_id), count = item
            return count, token_id_to_bytes[left_id], token_id_to_bytes[right_id]

        best_pair, _ = max(pair_stats.items(), key=pair_priority)

        left_bytes = token_id_to_bytes[best_pair[0]]
        right_bytes = token_id_to_bytes[best_pair[1]]

        merges.append((left_bytes, right_bytes))

        new_token_id = next_token_id
        token_id_to_bytes[new_token_id] = left_bytes + right_bytes

        affected_words = pair_indices.pop(best_pair, set())

        # If no word is affected (should not happen, since the pair is still in the stats), skip it
        if not affected_words:
            next_token_id += 1
            pair_stats.pop(best_pair, None)
            continue

        # Update the statistics of the affected words
        for word_idx in sorted(affected_words):
            remove_word_from_stats(word_idx)
            merge_word(word_idx, best_pair, new_token_id)
            add_word_to_stats(word_idx)

        pair_stats.pop(best_pair, None)
        next_token_id += 1

    # 6. Build the final vocabulary
    vocab: dict[int, bytes] = {
        idx: token for idx, token in token_id_to_bytes.items() if idx < next_token_id
    }

    # Add the special tokens
    for token_bytes in special_tokens_bytes:
        if len(vocab) >= vocab_size:
            break
        vocab[next_token_id] = token_bytes
        next_token_id += 1

    return vocab, merges

ALL code

from __future__ import annotations

import builtins
import locale
import math
import os
from collections import Counter
from collections.abc import Iterable
from typing import IO, Any, BinaryIO

import numpy as np
import numpy.typing as npt
import regex
import tiktoken
import torch
import torch.nn.functional as F
from jaxtyping import Bool, Float, Int
from torch import Tensor
from torch.nn.utils import clip_grad_norm_


def _ensure_utf8_locale() -> None:
try:
preferred = locale.getpreferredencoding(False)
except Exception:
preferred = "utf-8"
if preferred.lower() != "utf-8":
locale.getpreferredencoding = lambda *_args, **_kwargs: "utf-8" # type: ignore[assignment]


_ensure_utf8_locale()

_ORIGINAL_OPEN = builtins.open


def _utf8_default_open(
file,
mode="r",
buffering=-1,
encoding: str | None = None,
errors: str | None = None,
newline: str | None = None,
closefd: bool = True,
opener=None,
):
if "b" not in mode and encoding is None:
encoding = "utf-8"
return _ORIGINAL_OPEN(file, mode, buffering, encoding, errors, newline, closefd, opener)


builtins.open = _utf8_default_open # type: ignore[assignment]


GPT2_PRETOKENIZER_PATTERN = (
r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)


def run_linear(
d_in: int,
d_out: int,
weights: Float[Tensor, " d_out d_in"],
in_features: Float[Tensor, " ... d_in"],
) -> Float[Tensor, " ... d_out"]:
"""
Given the weights of a Linear layer, compute the transformation of a batched input.

Args:
in_dim (int): The size of the input dimension
out_dim (int): The size of the output dimension
weights (Float[Tensor, "d_out d_in"]): The linear weights to use
in_features (Float[Tensor, "... d_in"]): The output tensor to apply the function to

Returns:
Float[Tensor, "... d_out"]: The transformed output of your linear module.
"""

if tuple(weights.shape) != (d_out, d_in):
msg = f"weights shape {tuple(weights.shape)} does not match ({d_out}, {d_in})"
raise ValueError(msg)

return F.linear(in_features, weights, bias=None)


def run_embedding(
vocab_size: int,
d_model: int,
weights: Float[Tensor, " vocab_size d_model"],
token_ids: Int[Tensor, " ..."],
) -> Float[Tensor, " ... d_model"]:
"""
Given the weights of an Embedding layer, get the embeddings for a batch of token ids.

Args:
vocab_size (int): The number of embeddings in the vocabulary
d_model (int): The size of the embedding dimension
weights (Float[Tensor, "vocab_size d_model"]): The embedding vectors to fetch from
token_ids (Int[Tensor, "..."]): The set of token ids to fetch from the Embedding layer

Returns:
Float[Tensor, "... d_model"]: Batch of embeddings returned by your Embedding layer.
"""

if tuple(weights.shape) != (vocab_size, d_model):
msg = f"weights shape {tuple(weights.shape)} does not match ({vocab_size}, {d_model})"
raise ValueError(msg)

token_ids = token_ids.to(torch.long)
return F.embedding(token_ids, weights)


def run_swiglu(
d_model: int,
d_ff: int,
w1_weight: Float[Tensor, " d_ff d_model"],
w2_weight: Float[Tensor, " d_model d_ff"],
w3_weight: Float[Tensor, " d_ff d_model"],
in_features: Float[Tensor, " ... d_model"],
) -> Float[Tensor, " ... d_model"]:
"""Given the weights of a SwiGLU network, return
the output of your implementation with these weights.

Args:
d_model (int): Dimensionality of the feedforward input and output.
d_ff (int): Dimensionality of the up-project happening internally to your swiglu.
w1_weight (Float[Tensor, "d_ff d_model"]): Stored weights for W1
w2_weight (Float[Tensor, "d_model d_ff"]): Stored weights for W2
w3_weight (Float[Tensor, "d_ff d_model"]): Stored weights for W3
in_features (Float[Tensor, "... d_model"]): Input embeddings to the feed-forward layer.

Returns:
Float[Tensor, "... d_model"]: Output embeddings of the same shape as the input embeddings.
"""
# Example:
# If your state dict keys match, you can use `load_state_dict()`
# swiglu.load_state_dict(weights)
# You can also manually assign the weights
# swiglu.w1.weight.data = w1_weight
# swiglu.w2.weight.data = w2_weight
# swiglu.w3.weight.data = w3_weight
if d_model <= 0 or d_ff <= 0:
raise ValueError("d_model and d_ff must be positive")

gate = F.linear(in_features, w1_weight, bias=None)
up = F.linear(in_features, w3_weight, bias=None)
activated = F.silu(gate) * up
return F.linear(activated, w2_weight, bias=None)


def run_scaled_dot_product_attention(
Q: Float[Tensor, " ... queries d_k"],
K: Float[Tensor, " ... keys d_k"],
V: Float[Tensor, " ... values d_v"],
mask: Bool[Tensor, " ... queries keys"] | None = None,
) -> Float[Tensor, " ... queries d_v"]:
"""
Given key (K), query (Q), and value (V) tensors, return
the output of your scaled dot product attention implementation.

Args:
Q (Float[Tensor, " ... queries d_k"]): Query tensor
K (Float[Tensor, " ... keys d_k"]): Key tensor
V (Float[Tensor, " ... values d_v"]): Values tensor
mask (Bool[Tensor, " ... queries keys"] | None): Mask tensor
Returns:
Float[Tensor, " ... queries d_v"]: Output of SDPA
"""
d_k = Q.shape[-1]
if d_k == 0:
raise ValueError("d_k must be positive")

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
if mask is not None:
fill = torch.finfo(scores.dtype).min
mask = mask.to(dtype=torch.bool, device=scores.device)
if mask.shape != scores.shape:
mask = mask.expand(scores.shape)
scores = scores.masked_fill(~mask, fill)

attention = torch.softmax(scores, dim=-1)
return torch.matmul(attention, V)


def _build_causal_mask(
batch_dims: tuple[int, ...], num_heads: int, seq_len: int, device: torch.device
) -> Bool[Tensor, " ..."]:
mask = torch.ones(seq_len, seq_len, dtype=torch.bool, device=device).tril()
view_shape = (1,) * len(batch_dims) + (1, seq_len, seq_len)
return mask.view(view_shape).expand(*batch_dims, num_heads, seq_len, seq_len)


def run_multihead_self_attention(
d_model: int,
num_heads: int,
q_proj_weight: Float[Tensor, " d_k d_in"],
k_proj_weight: Float[Tensor, " d_k d_in"],
v_proj_weight: Float[Tensor, " d_v d_in"],
o_proj_weight: Float[Tensor, " d_model d_v"],
in_features: Float[Tensor, " ... sequence_length d_in"],
) -> Float[Tensor, " ... sequence_length d_out"]:
"""
Given the key, query, and value projection weights of a naive unbatched
implementation of multi-head attention, return the output of an optimized batched
implementation. This implementation should handle the key, query, and value projections
for all heads in a single matrix multiply.
This function should not use RoPE.
See section 3.2.2 of Vaswani et al., 2017.

Args:
d_model (int): Dimensionality of the feedforward input and output.
num_heads (int): Number of heads to use in multi-headed attention.
max_seq_len (int): Maximum sequence length to pre-cache if your implementation does that.
q_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the Q projection
k_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the K projection
v_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the V projection
o_proj_weight (Float[Tensor, "d_model d_v"]): Weights for the output projection
in_features (Float[Tensor, "... sequence_length d_in"]): Tensor to run your implementation on.

Returns:
Float[Tensor, " ... sequence_length d_out"]: Tensor with the output of running your optimized, batched multi-headed attention
implementation with the given QKV projection weights and input features.
"""
if d_model % num_heads != 0:
raise ValueError("d_model must be divisible by num_heads")

head_dim = d_model // num_heads
batch_dims = tuple(in_features.shape[:-2])
seq_len = in_features.shape[-2]

def _project(weight: Tensor) -> Tensor:
proj = F.linear(in_features, weight, bias=None)
new_shape = (*batch_dims, seq_len, num_heads, head_dim)
proj = proj.reshape(new_shape)
permute_order = list(range(len(batch_dims))) + [len(batch_dims) + 1, len(batch_dims), len(batch_dims) + 2]
return proj.permute(permute_order)

q = _project(q_proj_weight)
k = _project(k_proj_weight)
v = _project(v_proj_weight)

mask = _build_causal_mask(batch_dims, num_heads, seq_len, in_features.device)
attn_output = run_scaled_dot_product_attention(q, k, v, mask=mask)
permute_order = list(range(len(batch_dims))) + [len(batch_dims) + 1, len(batch_dims), len(batch_dims) + 2]
attn_output = attn_output.permute(permute_order)
merged = attn_output.reshape(*batch_dims, seq_len, d_model)
return F.linear(merged, o_proj_weight, bias=None)


def run_multihead_self_attention_with_rope(
d_model: int,
num_heads: int,
max_seq_len: int,
theta: float,
q_proj_weight: Float[Tensor, " d_k d_in"],
k_proj_weight: Float[Tensor, " d_k d_in"],
v_proj_weight: Float[Tensor, " d_v d_in"],
o_proj_weight: Float[Tensor, " d_model d_v"],
in_features: Float[Tensor, " ... sequence_length d_in"],
token_positions: Int[Tensor, " ... sequence_length"] | None = None,
) -> Float[Tensor, " ... sequence_length d_out"]:
"""
Given the key, query, and value projection weights of a naive unbatched
implementation of multi-head attention, return the output of an optimized batched
implementation. This implementation should handle the key, query, and value projections
for all heads in a single matrix multiply.
This version of MHA should include RoPE.
In this case, the RoPE embedding dimension must be the head embedding dimension (d_model // num_heads).
See section 3.2.2 of Vaswani et al., 2017.

Args:
d_model (int): Dimensionality of the feedforward input and output.
num_heads (int): Number of heads to use in multi-headed attention.
max_seq_len (int): Maximum sequence length to pre-cache if your implementation does that.
theta (float): RoPE parameter.
q_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the Q projection
k_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the K projection
v_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the V projection
o_proj_weight (Float[Tensor, "d_model d_v"]): Weights for the output projection
in_features (Float[Tensor, "... sequence_length d_in"]): Tensor to run your implementation on.
token_positions (Int[Tensor, " ... sequence_length"] | None): Optional tensor with the positions of the tokens

Returns:
Float[Tensor, " ... sequence_length d_out"]: Tensor with the output of running your optimized, batched multi-headed attention
implementation with the given QKV projection weights and input features.
"""
if d_model % num_heads != 0:
raise ValueError("d_model must be divisible by num_heads")

head_dim = d_model // num_heads
batch_dims = tuple(in_features.shape[:-2])
seq_len = in_features.shape[-2]
device = in_features.device

def _project(weight: Tensor) -> Tensor:
proj = F.linear(in_features, weight, bias=None)
new_shape = (*batch_dims, seq_len, num_heads, head_dim)
proj = proj.reshape(new_shape)
permute_order = list(range(len(batch_dims))) + [len(batch_dims) + 1, len(batch_dims), len(batch_dims) + 2]
return proj.permute(permute_order)

q = _project(q_proj_weight)
k = _project(k_proj_weight)
v = _project(v_proj_weight)

if token_positions is None:
base = torch.arange(seq_len, device=device, dtype=torch.long)
view_shape = (1,) * len(batch_dims) + (seq_len,)
token_positions = base.view(view_shape)
else:
token_positions = torch.as_tensor(token_positions, dtype=torch.long, device=device)
target_shape = batch_dims + (seq_len,)
if token_positions.shape != target_shape:
missing = len(target_shape) - token_positions.ndim
if missing < 0:
raise ValueError("token_positions has too many dimensions for the provided input")
shape = (1,) * missing + tuple(token_positions.shape)
token_positions = token_positions.reshape(shape)
token_positions = token_positions.expand(target_shape)

rope_positions = token_positions.unsqueeze(-2).expand(*batch_dims, num_heads, seq_len)
q = run_rope(head_dim, theta, max_seq_len, q, rope_positions)
k = run_rope(head_dim, theta, max_seq_len, k, rope_positions)

mask = _build_causal_mask(batch_dims, num_heads, seq_len, device)
attn_output = run_scaled_dot_product_attention(q, k, v, mask=mask)
permute_order = list(range(len(batch_dims))) + [len(batch_dims) + 1, len(batch_dims), len(batch_dims) + 2]
attn_output = attn_output.permute(permute_order)
merged = attn_output.reshape(*batch_dims, seq_len, d_model)
return F.linear(merged, o_proj_weight, bias=None)


def run_rope(
d_k: int,
theta: float,
max_seq_len: int,
in_query_or_key: Float[Tensor, " ... sequence_length d_k"],
token_positions: Int[Tensor, " ... sequence_length"],
) -> Float[Tensor, " ... sequence_length d_k"]:
"""
Run RoPE for a given input tensor.

Args:
d_k (int): Embedding dimension size for the query or key tensor.
theta (float): RoPE parameter.
max_seq_len (int): Maximum sequence length to pre-cache if your implementation does that.
in_query_or_key (Float[Tensor, "... sequence_length d_k"]): Input tensor to run RoPE on.
token_positions (Int[Tensor, "... sequence_length"]): Tensor of shape (batch_size, sequence_length) with the token positions
Returns:
Float[Tensor, " ... sequence_length d_k"]: Tensor with RoPEd input.
"""
if d_k % 2 != 0:
raise ValueError("d_k must be even for RoPE")
if theta <= 0:
raise ValueError("theta must be positive")

x = in_query_or_key
device = x.device
dtype = x.dtype
seq_len = x.shape[-2]

if token_positions is None:
base = torch.arange(seq_len, device=device, dtype=torch.long)
view_shape = (1,) * (x.ndim - 2) + (seq_len,)
token_positions = base.view(view_shape)
else:
token_positions = torch.as_tensor(token_positions, dtype=torch.long, device=device)
expected_prefix = x.shape[:-1]
if token_positions.shape != expected_prefix:
missing = len(expected_prefix) - token_positions.ndim
if missing < 0:
raise ValueError("token_positions incompatible with input shape")
shape = (1,) * missing + tuple(token_positions.shape)
token_positions = token_positions.reshape(shape)
token_positions = token_positions.expand(expected_prefix)

half_dim = d_k // 2
freq_exponents = torch.arange(0, half_dim, device=device, dtype=torch.float32) / half_dim
inv_freq = torch.exp(-math.log(theta) * freq_exponents).to(dtype)
angles = token_positions.to(dtype).unsqueeze(-1) * inv_freq
cos = torch.cos(angles)
sin = torch.sin(angles)

reshaped = x.reshape(*x.shape[:-1], half_dim, 2)
x_even = reshaped[..., 0]
x_odd = reshaped[..., 1]
rotated_even = x_even * cos - x_odd * sin
rotated_odd = x_even * sin + x_odd * cos
prefix_shape = in_query_or_key.shape[:-1]
return torch.stack((rotated_even, rotated_odd), dim=-1).reshape(*prefix_shape, d_k)


def run_transformer_block(
d_model: int,
num_heads: int,
d_ff: int,
max_seq_len: int,
theta: float,
weights: dict[str, Tensor],
in_features: Float[Tensor, " batch sequence_length d_model"],
) -> Float[Tensor, " batch sequence_length d_model"]:
"""
Given the weights of a pre-norm Transformer block and input features,
return the output of running the Transformer block on the input features.

This function should use RoPE.
Depending on your implementation, you may simply need to pass the relevant args
to your TransformerBlock constructor, or you may need to initialize your own RoPE
class and pass that instead.

Args:
d_model (int): The dimensionality of the Transformer block input.
num_heads (int): Number of heads to use in multi-headed attention. `d_model` must be
evenly divisible by `num_heads`.
d_ff (int): Dimensionality of the feed-forward inner layer.
max_seq_len (int): Maximum sequence length to pre-cache if your implementation does that.
theta (float): RoPE parameter.
weights (dict[str, Tensor]):
State dict of our reference implementation.
The keys of this dictionary are:
- `attn.q_proj.weight`
The query projections for all `num_heads` attention heads.
Shape is (d_model, d_model).
The rows are ordered by matrices of shape (num_heads, d_k),
so `attn.q_proj.weight == torch.cat([q_heads.0.weight, ..., q_heads.N.weight], dim=0)`.
- `attn.k_proj.weight`
The key projections for all `num_heads` attention heads.
Shape is (d_model, d_model).
The rows are ordered by matrices of shape (num_heads, d_k),
so `attn.k_proj.weight == torch.cat([k_heads.0.weight, ..., k_heads.N.weight], dim=0)`.
- `attn.v_proj.weight`
The value projections for all `num_heads` attention heads.
Shape is (d_model, d_model).
The rows are ordered by matrices of shape (num_heads, d_v),
so `attn.v_proj.weight == torch.cat([v_heads.0.weight, ..., v_heads.N.weight], dim=0)`.
- `attn.output_proj.weight`
Weight of the multi-head self-attention output projection
Shape is (d_model, d_model).
- `ln1.weight`
Weights of affine transform for the first RMSNorm
applied in the transformer block.
Shape is (d_model,).
- `ffn.w1.weight`
Weight of the first linear transformation in the FFN.
Shape is (d_model, d_ff).
- `ffn.w2.weight`
Weight of the second linear transformation in the FFN.
Shape is (d_ff, d_model).
- `ffn.w3.weight`
Weight of the third linear transformation in the FFN.
Shape is (d_model, d_ff).
- `ln2.weight`
Weights of affine transform for the second RMSNorm
applied in the transformer block.
Shape is (d_model,).
in_features (Float[Tensor, "batch sequence_length d_model"]):
Tensor to run your implementation on.

Returns:
Float[Tensor, "batch sequence_length d_model"] Tensor with the output of
running the Transformer block on the input features while using RoPE.
"""
eps = 1e-5
batch_dims = tuple(in_features.shape[:-2])
seq_len = in_features.shape[-2]
device = in_features.device

base_positions = torch.arange(seq_len, device=device, dtype=torch.long)
view_shape = (1,) * len(batch_dims) + (seq_len,)
token_positions = base_positions.view(view_shape).expand(*batch_dims, seq_len)

attn_input = run_rmsnorm(d_model=d_model, eps=eps, weights=weights["ln1.weight"], in_features=in_features)
attn_output = run_multihead_self_attention_with_rope(
d_model=d_model,
num_heads=num_heads,
max_seq_len=max_seq_len,
theta=theta,
q_proj_weight=weights["attn.q_proj.weight"],
k_proj_weight=weights["attn.k_proj.weight"],
v_proj_weight=weights["attn.v_proj.weight"],
o_proj_weight=weights["attn.output_proj.weight"],
in_features=attn_input,
token_positions=token_positions,
)
residual = in_features + attn_output

ffn_input = run_rmsnorm(d_model=d_model, eps=eps, weights=weights["ln2.weight"], in_features=residual)
ffn_output = run_swiglu(
d_model=d_model,
d_ff=d_ff,
w1_weight=weights["ffn.w1.weight"],
w2_weight=weights["ffn.w2.weight"],
w3_weight=weights["ffn.w3.weight"],
in_features=ffn_input,
)
return residual + ffn_output


def run_transformer_lm(
vocab_size: int,
context_length: int,
d_model: int,
num_layers: int,
num_heads: int,
d_ff: int,
rope_theta: float,
weights: dict[str, Tensor],
in_indices: Int[Tensor, " batch_size sequence_length"],
) -> Float[Tensor, " batch_size sequence_length vocab_size"]:
"""Given the weights of a Transformer language model and input indices,
return the output of running a forward pass on the input indices.

This function should use RoPE.

Args:
vocab_size (int): The number of unique items in the output vocabulary to be predicted.
context_length (int): The maximum number of tokens to process at once.
d_model (int): The dimensionality of the model embeddings and sublayer outputs.
num_layers (int): The number of Transformer layers to use.
num_heads (int): Number of heads to use in multi-headed attention. `d_model` must be
evenly divisible by `num_heads`.
d_ff (int): Dimensionality of the feed-forward inner layer (section 3.3).
rope_theta (float): The RoPE $\\Theta$ parameter.
weights (dict[str, Tensor]):
State dict of our reference implementation. {num_layers} refers to an
integer between `0` and `num_layers - 1` (the layer index).
The keys of this dictionary are:
- `token_embeddings.weight`
Token embedding matrix. Shape is (vocab_size, d_model).
- `layers.{num_layers}.attn.q_proj.weight`
The query projections for all `num_heads` attention heads.
Shape is (num_heads * (d_model / num_heads), d_model).
The rows are ordered by matrices of shape (num_heads, d_k),
so `attn.q_proj.weight == torch.cat([q_heads.0.weight, ..., q_heads.N.weight], dim=0)`.
- `layers.{num_layers}.attn.k_proj.weight`
The key projections for all `num_heads` attention heads.
Shape is (num_heads * (d_model / num_heads), d_model).
The rows are ordered by matrices of shape (num_heads, d_k),
so `attn.k_proj.weight == torch.cat([k_heads.0.weight, ..., k_heads.N.weight], dim=0)`.
- `layers.{num_layers}.attn.v_proj.weight`
The value projections for all `num_heads` attention heads.
Shape is (num_heads * (d_model / num_heads), d_model).
The rows are ordered by matrices of shape (num_heads, d_v),
so `attn.v_proj.weight == torch.cat([v_heads.0.weight, ..., v_heads.N.weight], dim=0)`.
- `layers.{num_layers}.attn.output_proj.weight`
Weight of the multi-head self-attention output projection
Shape is ((d_model / num_heads) * num_heads, d_model).
- `layers.{num_layers}.ln1.weight`
Weights of affine transform for the first RMSNorm
applied in the transformer block.
Shape is (d_model,).
- `layers.{num_layers}.ffn.w1.weight`
Weight of the first linear transformation in the FFN.
Shape is (d_model, d_ff).
- `layers.{num_layers}.ffn.w2.weight`
Weight of the second linear transformation in the FFN.
Shape is (d_ff, d_model).
- `layers.{num_layers}.ffn.w3.weight`
Weight of the third linear transformation in the FFN.
Shape is (d_model, d_ff).
- `layers.{num_layers}.ln2.weight`
Weights of affine transform for the second RMSNorm
applied in the transformer block.
Shape is (d_model,).
- `ln_final.weight`
Weights of affine transform for RMSNorm applied to the output of the final transformer block.
Shape is (d_model, ).
- `lm_head.weight`
Weights of the language model output embedding.
Shape is (vocab_size, d_model).
in_indices (Int[Tensor, "batch_size sequence_length"]) Tensor with input indices to run the language model on. Shape is (batch_size, sequence_length), where
`sequence_length` is at most `context_length`.

Returns:
Float[Tensor, "batch_size sequence_length vocab_size"]: Tensor with the predicted unnormalized
next-word distribution for each token.
"""
if in_indices.shape[-1] > context_length:
raise ValueError("sequence length exceeds context length")

x = run_embedding(
vocab_size=vocab_size,
d_model=d_model,
weights=weights["token_embeddings.weight"],
token_ids=in_indices,
)

for layer_idx in range(num_layers):
prefix = f"layers.{layer_idx}."
layer_weights = {k[len(prefix) :]: v for k, v in weights.items() if k.startswith(prefix)}
x = run_transformer_block(
d_model=d_model,
num_heads=num_heads,
d_ff=d_ff,
max_seq_len=context_length,
theta=rope_theta,
weights=layer_weights,
in_features=x,
)

x = run_rmsnorm(d_model=d_model, eps=1e-5, weights=weights["ln_final.weight"], in_features=x)
logits = run_linear(
d_in=d_model,
d_out=vocab_size,
weights=weights["lm_head.weight"],
in_features=x,
)
return logits


def run_rmsnorm(
d_model: int,
eps: float,
weights: Float[Tensor, " d_model"],
in_features: Float[Tensor, " ... d_model"],
) -> Float[Tensor, " ... d_model"]:
"""Given the weights of a RMSNorm affine transform,
return the output of running RMSNorm on the input features.

Args:
d_model (int): The dimensionality of the RMSNorm input.
eps: (float): A value added to the denominator for numerical stability.
weights (Float[Tensor, "d_model"]): RMSNorm weights.
in_features (Float[Tensor, "... d_model"]): Input features to run RMSNorm on. Can have arbitrary leading
dimensions.

Returns:
Float[Tensor,"... d_model"]: Tensor of with the same shape as `in_features` with the output of running
RMSNorm of the `in_features`.
"""
if weights.shape != (d_model,):
msg = f"weights shape {tuple(weights.shape)} does not match ({d_model},)"
raise ValueError(msg)
if in_features.shape[-1] != d_model:
msg = f"Input features last dimension {in_features.shape[-1]} does not equal d_model {d_model}"
raise ValueError(msg)

variance = in_features.pow(2).mean(dim=-1, keepdim=True)
scale = torch.rsqrt(variance + eps)
return in_features * scale * weights


def run_silu(in_features: Float[Tensor, " ..."]) -> Float[Tensor, " ..."]:
"""Given a tensor of inputs, return the output of applying SiLU
to each element.

Args:
in_features(Float[Tensor, "..."]): Input features to run SiLU on. Shape is arbitrary.

Returns:
Float[Tensor,"..."]: of with the same shape as `in_features` with the output of applying
SiLU to each element.
"""
return F.silu(in_features)


def run_get_batch(
dataset: npt.NDArray, batch_size: int, context_length: int, device: str
) -> tuple[torch.Tensor, torch.Tensor]:
"""
Given a dataset (a 1D numpy array of integers) and a desired batch size and
context length, sample language modeling input sequences and their corresponding
labels from the dataset.

Args:
dataset (np.array): 1D numpy array of integer token IDs in the dataset.
batch_size (int): Desired batch size to sample.
context_length (int): Desired context length of each sampled example.
device (str): PyTorch device string (e.g., 'cpu' or 'cuda:0') indicating the device
to place the sampled input sequences and labels on.

Returns:
Tuple of torch.LongTensors of shape (batch_size, context_length). The first tuple item
is the sampled input sequences, and the second tuple item is the corresponding
language modeling labels.
"""
data = torch.as_tensor(dataset, dtype=torch.long)
if data.ndim != 1:
raise ValueError("dataset must be 1D")
if context_length <= 0:
raise ValueError("context_length must be positive")
if context_length >= data.shape[0]:
raise ValueError("context_length must be smaller than dataset length")

max_start = data.shape[0] - context_length
starts = torch.randint(0, max_start, (batch_size,))
offsets = torch.arange(context_length)
x = data[starts.unsqueeze(1) + offsets]
y = data[starts.unsqueeze(1) + offsets + 1]

target_device = torch.device(device)
return x.to(target_device), y.to(target_device)


def run_softmax(in_features: Float[Tensor, " ..."], dim: int) -> Float[Tensor, " ..."]:
"""
Given a tensor of inputs, return the output of softmaxing the given `dim`
of the input.

Args:
in_features (Float[Tensor, "..."]): Input features to softmax. Shape is arbitrary.
dim (int): Dimension of the `in_features` to apply softmax to.

Returns:
Float[Tensor, "..."]: Tensor of with the same shape as `in_features` with the output of
softmax normalizing the specified `dim`.
"""
shifted = in_features - in_features.max(dim=dim, keepdim=True).values
exps = shifted.exp()
return exps / exps.sum(dim=dim, keepdim=True)


def run_cross_entropy(
inputs: Float[Tensor, " batch_size vocab_size"], targets: Int[Tensor, " batch_size"]
) -> Float[Tensor, ""]:
"""Given a tensor of inputs and targets, compute the average cross-entropy
loss across examples.

Args:
inputs (Float[Tensor, "batch_size vocab_size"]): inputs[i][j] is the
unnormalized logit of jth class for the ith example.
targets (Int[Tensor, "batch_size"]): Tensor of shape (batch_size,) with the index of the correct class.
Each value must be between 0 and `num_classes - 1`.

Returns:
Float[Tensor, ""]: The average cross-entropy loss across examples.
"""
logits = inputs.to(torch.float32)
targets = targets.to(torch.long)
log_probs = logits.log_softmax(dim=-1)
return F.nll_loss(log_probs, targets, reduction="mean")


def run_gradient_clipping(parameters: Iterable[torch.nn.Parameter], max_l2_norm: float) -> None:
"""Given a set of parameters, clip their combined gradients to have l2 norm at most max_l2_norm.

Args:
parameters (Iterable[torch.nn.Parameter]): collection of trainable parameters.
max_l2_norm (float): a positive value containing the maximum l2-norm.

The gradients of the parameters (parameter.grad) should be modified in-place.
"""
clip_grad_norm_(parameters, max_l2_norm)


def get_adamw_cls() -> Any:
"""
Returns a torch.optim.Optimizer that implements AdamW.
"""
return torch.optim.AdamW


def run_get_lr_cosine_schedule(
it: int,
max_learning_rate: float,
min_learning_rate: float,
warmup_iters: int,
cosine_cycle_iters: int,
):
"""
Given the parameters of a cosine learning rate decay schedule (with linear
warmup) and an iteration number, return the learning rate at the given
iteration under the specified schedule.

Args:
it (int): Iteration number to get learning rate for.
max_learning_rate (float): alpha_max, the maximum learning rate for
cosine learning rate schedule (with warmup).
min_learning_rate (float): alpha_min, the minimum / final learning rate for
the cosine learning rate schedule (with warmup).
warmup_iters (int): T_w, the number of iterations to linearly warm-up
the learning rate.
cosine_cycle_iters (int): T_c, the number of cosine annealing iterations.

Returns:
Learning rate at the given iteration under the specified schedule.
"""
if warmup_iters < 0 or cosine_cycle_iters < 0:
raise ValueError("warmup_iters and cosine_cycle_iters must be non-negative")

if warmup_iters > 0 and it <= warmup_iters:
return max_learning_rate * (it / warmup_iters)

if cosine_cycle_iters <= 0:
return min_learning_rate

if it >= cosine_cycle_iters:
return min_learning_rate

cosine_span = max(cosine_cycle_iters - warmup_iters, 1)
progress = (it - warmup_iters) / cosine_span
progress = min(max(progress, 0.0), 1.0)
cosine = 0.5 * (1 + math.cos(math.pi * progress))
return min_learning_rate + (max_learning_rate - min_learning_rate) * cosine


def run_save_checkpoint(
model: torch.nn.Module,
optimizer: torch.optim.Optimizer,
iteration: int,
out: str | os.PathLike | BinaryIO | IO[bytes],
):
"""
Given a model, optimizer, and an iteration number, serialize them to disk.

Args:
model (torch.nn.Module): Serialize the state of this model.
optimizer (torch.optim.Optimizer): Serialize the state of this optimizer.
iteration (int): Serialize this value, which represents the number of training iterations
we've completed.
out (str | os.PathLike | BinaryIO | IO[bytes]): Path or file-like object to serialize the model, optimizer, and iteration to.
"""
state = {
"model": model.state_dict(),
"optimizer": optimizer.state_dict(),
"iteration": int(iteration),
}
torch.save(state, out)


def run_load_checkpoint(
src: str | os.PathLike | BinaryIO | IO[bytes],
model: torch.nn.Module,
optimizer: torch.optim.Optimizer,
) -> int:
"""
Given a serialized checkpoint (path or file-like object), restore the
serialized state to the given model and optimizer.
Return the number of iterations that we previously serialized in
the checkpoint.

Args:
src (str | os.PathLike | BinaryIO | IO[bytes]): Path or file-like object to serialized checkpoint.
model (torch.nn.Module): Restore the state of this model.
optimizer (torch.optim.Optimizer): Restore the state of this optimizer.
Returns:
int: the previously-serialized number of iterations.
"""
checkpoint = torch.load(src, map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
return int(checkpoint["iteration"])


class _BPETokenizer:
"""Simple GPT-2 style BPE tokenizer supporting streaming inputs."""

_STREAM_CHUNK_SIZE = 8192

def __init__(
self,
vocab: dict[int, bytes],
merges: list[tuple[bytes, bytes]],
special_tokens: list[str] | None,
) -> None:
self._pretokenizer = regex.compile(GPT2_PRETOKENIZER_PATTERN)

self._id_to_token_bytes: dict[int, bytes] = {}
self._token_bytes_to_id: dict[bytes, int] = {}
for token_id, token_bytes in vocab.items():
idx = int(token_id)
token_bytes = bytes(token_bytes)  # accept any bytes-like input (bytes, bytearray, ...)
self._id_to_token_bytes[idx] = token_bytes
self._token_bytes_to_id[token_bytes] = idx

self._pair_ranks: dict[tuple[bytes, bytes], int] = {}
for rank, pair in enumerate(merges):
if len(pair) != 2:
continue
left, right = (bytes(part) for part in pair)  # normalize both sides to bytes
self._pair_ranks[(left, right)] = rank

self._bpe_cache: dict[bytes, tuple[int, ...]] = {}

deduped_specials: list[str] = []
seen_specials: set[str] = set()
if special_tokens:
for token in special_tokens:
if not isinstance(token, str):
msg = f"Expected special tokens to be strings, got {type(token)!r}"
raise TypeError(msg)
if not token:
raise ValueError("Special tokens must be non-empty strings.")
if token in seen_specials:
continue
seen_specials.add(token)
deduped_specials.append(token)

self._special_tokens = deduped_specials
self._special_token_to_id: dict[str, int] = {}
self._special_regex: regex.Pattern[str] | None = None
self._special_prefixes: dict[int, set[str]] = {}
self._max_special_prefix_len = 0

if self._special_tokens:
regex_tokens = sorted(self._special_tokens, key=len, reverse=True)
pattern = "|".join(regex.escape(token) for token in regex_tokens)
self._special_regex = regex.compile(pattern)
for token in self._special_tokens:
token_bytes = token.encode("utf-8")
token_id = self._token_bytes_to_id.get(token_bytes)
if token_id is None:
msg = f"Special token {token!r} does not exist in the vocabulary."
raise ValueError(msg)
self._special_token_to_id[token] = token_id
for prefix_len in range(1, len(token)):
self._special_prefixes.setdefault(prefix_len, set()).add(token[:prefix_len])
if len(token) > 1:
self._max_special_prefix_len = max(self._max_special_prefix_len, len(token) - 1)

def encode(self, text: str) -> list[int]:
if not isinstance(text, str):
msg = f"Tokenizer.encode expects a string, got {type(text)!r}"
raise TypeError(msg)
return list(self._encode_from_chunks([text]))

def encode_iterable(self, iterable: Iterable[str] | IO[str]) -> Iterable[int]:
chunks = self._chunk_source(iterable)

def generator() -> Iterable[int]:
yield from self._encode_from_chunks(chunks)

return generator()

def decode(self, token_ids: Iterable[int]) -> str:
byte_segments: list[bytes] = []
for token_id in token_ids:
idx = int(token_id)
try:
token_bytes = self._id_to_token_bytes[idx]
except KeyError as exc:
raise KeyError(f"Unknown token id {idx}") from exc
byte_segments.append(token_bytes)
data = b"".join(byte_segments)
if not data:
return ""
try:
return data.decode("utf-8")
except UnicodeDecodeError:
# Decoding individual tokens may produce incomplete multi-byte sequences.
# Fall back to a byte-preserving decode so callers can still inspect tokens.
return data.decode("latin-1")

def _chunk_source(self, source: Iterable[str] | IO[str]) -> Iterable[str]:
read_method = getattr(source, "read", None)
if callable(read_method):
while True:
chunk = read_method(self._STREAM_CHUNK_SIZE)
if not chunk:
break
if not isinstance(chunk, str):
chunk = chunk.decode("utf-8")
if chunk:
yield chunk
return
for chunk in source:
if not isinstance(chunk, str):
msg = f"encode_iterable expects strings, got {type(chunk)!r}"
raise TypeError(msg)
if chunk:
yield chunk

def _encode_from_chunks(self, chunks: Iterable[str]) -> Iterable[int]:
for segment, is_special in self._split_on_special(chunks):
if not segment:
continue
if is_special:
yield self._special_token_to_id[segment]
continue
for match in self._pretokenizer.finditer(segment):
piece = match.group(0)
if not piece:
continue
token_bytes = piece.encode("utf-8")
if not token_bytes:
continue
yield from self._bpe(token_bytes)

def _split_on_special(self, chunks: Iterable[str]) -> Iterable[tuple[str, bool]]:
if not self._special_regex:
for chunk in chunks:
if chunk:
yield chunk, False
return

buffer = ""
for chunk in chunks:
if not chunk:
continue
buffer += chunk
while True:
match = self._special_regex.search(buffer)
if not match:
break
start, end = match.span()
if start:
yield buffer[:start], False
yield match.group(0), True
buffer = buffer[end:]
keep = self._pending_special_prefix_length(buffer)
if keep == 0:
if buffer:
yield buffer, False
buffer = ""
else:
safe_len = len(buffer) - keep
if safe_len > 0:
yield buffer[:safe_len], False
buffer = buffer[safe_len:]
if buffer:
yield buffer, False

def _pending_special_prefix_length(self, text: str) -> int:
if self._max_special_prefix_len == 0 or not text:
return 0
upto = min(len(text), self._max_special_prefix_len)
for length in range(upto, 0, -1):
suffix = text[-length:]
prefixes = self._special_prefixes.get(length)
if prefixes and suffix in prefixes:
return length
return 0

def _bpe(self, token_bytes: bytes) -> tuple[int, ...]:
cached = self._bpe_cache.get(token_bytes)
if cached is not None:
return cached

if token_bytes in self._token_bytes_to_id:
result = (self._token_bytes_to_id[token_bytes],)
self._bpe_cache[token_bytes] = result
return result

word = tuple(token_bytes[i : i + 1] for i in range(len(token_bytes)))
pairs = self._get_pairs(word)

while pairs:
best_pair = min(
pairs,
key=lambda pair: self._pair_ranks.get(pair, float("inf")),
)
if best_pair not in self._pair_ranks:
break
first, second = best_pair
new_word: list[bytes] = []
i = 0
while i < len(word):
if (
i < len(word) - 1
and word[i] == first
and word[i + 1] == second
):
new_word.append(word[i] + word[i + 1])
i += 2
else:
new_word.append(word[i])
i += 1
word = tuple(new_word)
if len(word) == 1:
break
pairs = self._get_pairs(word)

result = tuple(self._token_bytes_to_id[symbol] for symbol in word)
self._bpe_cache[token_bytes] = result
return result

@staticmethod
def _get_pairs(word: tuple[bytes, ...]) -> set[tuple[bytes, bytes]]:
pairs: set[tuple[bytes, bytes]] = set()
if len(word) < 2:
return pairs
prev = word[0]
for symbol in word[1:]:
pairs.add((prev, symbol))
prev = symbol
return pairs


def get_tokenizer(
vocab: dict[int, bytes],
merges: list[tuple[bytes, bytes]],
special_tokens: list[str] | None = None,
) -> Any:
"""Given a vocabulary, a list of merges, and a list of special tokens,
return a BPE tokenizer that uses the provided vocab, merges, and special tokens.

Args:
vocab (dict[int, bytes]): The tokenizer vocabulary, a mapping from int (token ID in the vocabulary)
to bytes (token bytes)
merges (list[tuple[bytes, bytes]]): BPE merges. Each list item is a tuple of bytes (<token1>, <token2>),
representing that <token1> was merged with <token2>.
Merges are ordered by order of creation.
special_tokens (list[str] | None): A list of string special tokens for the tokenizer. These strings will never
be split into multiple tokens, and will always be kept as a single token.

Returns:
A BPE tokenizer that uses the provided vocab, merges, and special tokens.
"""
if vocab is None:
raise ValueError("vocab must be provided.")
if merges is None:
raise ValueError("merges must be provided.")
return _BPETokenizer(vocab, merges, special_tokens or [])
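As a quick sanity check of the class above, the tokenizer can be driven with a purely byte-level vocabulary (no merges) plus one special token. The vocabulary below is made up for illustration and assumes the same module context (regex and GPT2_PRETOKENIZER_PATTERN) that _BPETokenizer already relies on.

# Minimal usage sketch with an illustrative byte-level vocabulary.
byte_vocab = {i: bytes([i]) for i in range(256)}
byte_vocab[256] = b"<|endoftext|>"

tok = get_tokenizer(byte_vocab, merges=[], special_tokens=["<|endoftext|>"])
ids = tok.encode("hi<|endoftext|>")
assert ids == [104, 105, 256]              # 'h', 'i', then the special token id
assert tok.decode(ids) == "hi<|endoftext|>"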



def run_train_bpe(
input_path: str | os.PathLike,
vocab_size: int,
special_tokens: list[str],
**kwargs,
) -> tuple[dict[int, bytes], list[tuple[bytes, bytes]]]:
"""Given the path to an input corpus, run train a BPE tokenizer and
output its vocabulary and merges.

Args:
input_path (str | os.PathLike): Path to BPE tokenizer training data.
vocab_size (int): Total number of items in the tokenizer's vocabulary (including special tokens).
special_tokens (list[str]): A list of string special tokens to be added to the tokenizer vocabulary.
These strings will never be split into multiple tokens, and will always be
kept as a single token. If these special tokens occur in the input_path,
they are treated as any other string.

Returns:
tuple[dict[int, bytes], list[tuple[bytes, bytes]]]:
vocab:
The trained tokenizer vocabulary, a mapping from int (token ID in the vocabulary)
to bytes (token bytes)
merges:
BPE merges. Each list item is a tuple of bytes (<token1>, <token2>),
representing that <token1> was merged with <token2>.
Merges are ordered by order of creation.
"""
# 1. 参数校验与初始化
pat_str = kwargs.get("pat_str", GPT2_PRETOKENIZER_PATTERN)
special_tokens = special_tokens or []
unique_special_tokens: list[str] = []
seen_specials: set[str] = set()

# 这里的逻辑是去重并保持顺序
for token in special_tokens:
if not isinstance(token, str):
msg = f"Expected special tokens to be strings, got {type(token)!r}"
raise TypeError(msg)
if token not in seen_specials:
seen_specials.add(token)
unique_special_tokens.append(token)

special_tokens_bytes = [token.encode("utf-8") for token in unique_special_tokens]
num_special_tokens = len(special_tokens_bytes)

# 基础词表大小为 256 (字节范围)
if vocab_size < 2**8 + num_special_tokens:
msg = "vocab_size must be at least 256 + number of special tokens"
raise ValueError(msg)

merges_target = vocab_size - num_special_tokens - 2**8
pretokenizer = regex.compile(pat_str)

# 2. 读取文件
with open(input_path, "r", encoding="utf-8") as f:
text = f.read()

words: list[list[int]] = []
word_frequencies: list[int] = []
word_lookup: dict[str, int] = {}

# 3. 预分词 (Pre-tokenization)
# 首先按特殊 token 切分,防止特殊 token 被正则拆散
removable_specials = [token for token in unique_special_tokens if token]
segments = [text]
if removable_specials:
escaped = [regex.escape(token) for token in removable_specials]
split_pattern = regex.compile("|".join(escaped))
segments = [segment for segment in split_pattern.split(text) if segment]

for segment in segments:
for match in pretokenizer.finditer(segment):
token = match.group(0)
if not token:
continue

idx = word_lookup.get(token)
if idx is None:
token_bytes = token.encode("utf-8")
if not token_bytes:
continue
idx = len(words)
word_lookup[token] = idx
words.append(list(token_bytes))
word_frequencies.append(0)

word_frequencies[idx] += 1

# 4. Initialize the BPE statistics (base byte-level vocabulary: one token per byte value, 0-255)
token_id_to_bytes: dict[int, bytes] = {i: bytes([i]) for i in range(256)}
merges: list[tuple[bytes, bytes]] = []
next_token_id = 256

pair_stats: Counter[tuple[int, int]] = Counter()
pair_indices: dict[tuple[int, int], set[int]] = {}
word_pair_counters: list[Counter[tuple[int, int]]] = []

# 初次统计所有单词中的 pair
for idx, token_ids in enumerate(words):
freq = word_frequencies[idx]
if freq == 0 or len(token_ids) < 2:
word_pair_counters.append(Counter())
continue

pair_counter = Counter(zip(token_ids[:-1], token_ids[1:]))
word_pair_counters.append(pair_counter)

for pair, count in pair_counter.items():
pair_stats[pair] += count * freq
pair_indices.setdefault(pair, set()).add(idx)

# --- 内部辅助函数 (闭包) ---
def remove_word_from_stats(word_idx: int) -> None:
counter = word_pair_counters[word_idx]
if not counter:
return
freq = word_frequencies[word_idx]
for pair, count in counter.items():
pair_stats[pair] -= count * freq
if pair_stats[pair] <= 0:
pair_stats.pop(pair, None)

indices = pair_indices.get(pair)
if indices is not None:
indices.discard(word_idx)
if not indices:
pair_indices.pop(pair, None)

def add_word_to_stats(word_idx: int) -> None:
tokens = words[word_idx]
if len(tokens) < 2:
word_pair_counters[word_idx] = Counter()
return

counter = Counter(zip(tokens[:-1], tokens[1:]))
word_pair_counters[word_idx] = counter
freq = word_frequencies[word_idx]
for pair, count in counter.items():
pair_stats[pair] += count * freq
pair_indices.setdefault(pair, set()).add(word_idx)

def merge_word(word_idx: int, pair: tuple[int, int], new_token_id: int) -> None:
tokens = words[word_idx]
if len(tokens) < 2:
return

merged: list[int] = []
i = 0
while i < len(tokens):
if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
merged.append(new_token_id)
i += 2
else:
merged.append(tokens[i])
i += 1
words[word_idx] = merged

# 5. BPE 训练主循环
for _ in range(max(0, merges_target)):
if not pair_stats:
break

# 定义优先级:优先频次高,频次相同比较字节内容(为了确定性)
def pair_priority(item: tuple[tuple[int, int], int]) -> tuple[int, bytes, bytes]:
(left_id, right_id), count = item
return count, token_id_to_bytes[left_id], token_id_to_bytes[right_id]

best_pair, _ = max(pair_stats.items(), key=pair_priority)

left_bytes = token_id_to_bytes[best_pair[0]]
right_bytes = token_id_to_bytes[best_pair[1]]

merges.append((left_bytes, right_bytes))

new_token_id = next_token_id
token_id_to_bytes[new_token_id] = left_bytes + right_bytes

affected_words = pair_indices.pop(best_pair, set())

# 如果没有单词受到影响(理论上不应发生,因为 stats 里有),直接跳过
if not affected_words:
next_token_id += 1
pair_stats.pop(best_pair, None)
continue

# 更新受影响单词的统计信息
for word_idx in sorted(affected_words):
remove_word_from_stats(word_idx)
merge_word(word_idx, best_pair, new_token_id)
add_word_to_stats(word_idx)

pair_stats.pop(best_pair, None)
next_token_id += 1

# 6. 构建最终词表
vocab: dict[int, bytes] = {
idx: token for idx, token in token_id_to_bytes.items() if idx < next_token_id
}

# 添加特殊 Token
for token_bytes in special_tokens_bytes:
if len(vocab) >= vocab_size:
break
vocab[next_token_id] = token_bytes
next_token_id += 1

return vocab, merges
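An end-to-end sketch tying run_train_bpe and get_tokenizer together on a tiny throwaway corpus; the file name, corpus text, and vocab_size are illustrative only.

from pathlib import Path

corpus = Path("tiny_corpus.txt")
corpus.write_text("low lower lowest low low<|endoftext|>", encoding="utf-8")

vocab, merges = run_train_bpe(
    input_path=corpus,
    vocab_size=300,                          # 256 byte tokens + a few merges + 1 special token
    special_tokens=["<|endoftext|>"],
)
tok = get_tokenizer(vocab, merges, special_tokens=["<|endoftext|>"])
ids = tok.encode("low lower<|endoftext|>")
print(ids, tok.decode(ids))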

ViT (Vision Transformer)

ViT (Vision Transformer) 是 Google 在 ICLR 2021 提出的里程碑式工作。它把 Transformer 架构直接搬到图像域,在大规模预训练上打破了 CNN 的统治。CLIP、LLaVA、Stable Diffusion 等多模态模型都以 ViT 作为视觉骨干,因此面试常考。

ViT 总览

一、核心思想:An Image is Worth 16×16 Words

  • 把图片均匀切成 Patch,把每个 Patch 当成一个 Token;整张图就对应一个 Token 序列。
  • 不再依赖卷积的局部归纳偏置和平移不变性,第一层自注意力就拥有全局视野。
  • 视觉和语言共享 Transformer 结构,图文特征更容易对齐。

二、架构流程(Pipeline)

假设输入图像尺寸为 H × W × C(如 224 × 224 × 3),Patch Size P = 16,Embedding 维度 D = 768

  1. Patch Partition
    将图像切成 N = (H × W) / P² = (224 × 224) / (16 × 16) = 14 × 14 = 196 个 Patch,每个 Patch 的形状为 P × P × C
  2. Linear Projection / Patch Embedding
    展平每个 Patch,并通过线性层映射到 D 维。工程中常用 Conv2d(kernel_size=stride=P) 直接完成切块 + 映射。
  3. Positional Embedding
    Transformer 对序列无序,需要向 Patch Embedding 中加可学习的 1D 位置编码,保留 Patch 的空间位置。
  4. Class Token
    在序列最前插入可学习的 [CLS] token,序列长度从 N 变为 N+1。分类时读取 [CLS] 的输出向量。
  5. Transformer Encoder
    堆叠 L 层 Pre-Norm Transformer:LN → MSA → LN → MLP(FFN),层间带残差连接。
  6. MLP Head
    最后再接一个 LN + Linear,输出分类 logits。

Patch Embedding 流程示意

三、ViT vs. CNN(面试高频题)

维度 CNN (ResNet) ViT (Transformer)
归纳偏置 强:先验地假设局部性与平移不变性 弱:没有结构先验,全靠数据学习
数据需求 在小数据集上易训练,表现稳 需要海量数据(JFT-300M 等),ImageNet-1K 上训练更难
感受野 局部 → 随层数加深逐步全局 天然全局,第一层即可关联所有 Patch
计算复杂度 O(H × W),与图像分辨率线性 O(N²),与 Patch 数平方成正比,分辨率高时显存压力大
多模态适配 特征空间与文本差距大,难对齐 与 LLM 架构一致,便于图文对齐(CLIP)

四、关键技术细节

  1. Positional-embedding extrapolation: training at 224² but evaluating at 384² changes the number of patches. The pretrained 2D positional embeddings are usually resized with bicubic interpolation to fit the new sequence length (see the sketch after this list).
  2. Weak performance on small data: lacking convolutional inductive biases, the model has to learn priors such as "neighboring pixels are correlated" from data, so it overfits easily on small datasets.
  3. Self-attention complexity: O(N² · D), where N is the number of tokens. This becomes too expensive at high resolution, motivating variants such as Swin and window attention that reduce the cost.
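A minimal sketch of the bicubic interpolation mentioned in point 1, assuming ViT-B/16-style shapes; the helper name and the random positional embeddings are illustrative.

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Bicubic-resize ViT positional embeddings, keeping the [CLS] slot untouched."""
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]        # [1, 1, D], [1, N, D]
    d = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_tok, patch_pos], dim=1)

pe_224 = torch.randn(1, 1 + 14 * 14, 768)                    # trained at 224x224 with 16x16 patches
pe_384 = resize_pos_embed(pe_224, old_grid=14, new_grid=24)  # evaluate at 384x384
print(pe_384.shape)                                          # torch.Size([1, 577, 768])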

五、常见变体(SOTA 储备)

  • Swin Transformer:窗口注意力 + 移位窗口,使复杂度近似线性 O(N),适合检测和分割。
  • MAE (Masked Autoencoders):ViT 自监督预训练范式,随机 Mask 75% Patch,让模型重建像素,预训练表现突出。
  • DeiT (Data-efficient Image Transformers):引入 Distillation Token,让 ViT 在 ImageNet-1K 这类中等规模数据上也能高效训练。

六、手撕代码:Patch Embedding(PyTorch)

import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
"""Image to Patch Embedding."""

def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
super().__init__()
self.img_size = img_size
self.patch_size = patch_size
self.n_patches = (img_size // patch_size) ** 2
# Conv2d 一次性完成切块与映射,避免手动 reshape
self.proj = nn.Conv2d(
in_chans, embed_dim, kernel_size=patch_size, stride=patch_size
)

def forward(self, x):
# x: [B, C, H, W]
x = self.proj(x) # [B, D, H/P, W/P]
x = x.flatten(2) # [B, D, N]
x = x.transpose(1, 2) # [B, N, D]
return x
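Continuing the sketch, a minimal module covering steps 3-4 of the pipeline above (class token plus learnable positional embeddings) could look like this; it reuses the PatchEmbed defined above, and the hyperparameters are illustrative.

import torch
import torch.nn as nn

class ViTEmbed(nn.Module):
    """Patch embedding + [CLS] token + learnable 1D positional embeddings."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.patch_embed = PatchEmbed(img_size, patch_size, in_chans, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + self.patch_embed.n_patches, embed_dim))

    def forward(self, x):
        x = self.patch_embed(x)                          # [B, N, D]
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # [B, 1, D]
        x = torch.cat([cls, x], dim=1)                   # [B, N+1, D]
        return x + self.pos_embed                        # ready for the Transformer encoder

print(ViTEmbed()(torch.randn(2, 3, 224, 224)).shape)     # torch.Size([2, 197, 768])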

七、总结与代表模型

  • ViT 是多模态大模型的“视觉骨干”,掌握 Input/Output 维度、Patching 机制与 CNN 的区别是面试必备。
  • LLaVA:CLIP-ViT-L/14。
  • Qwen-VL:ViT-bigG。
  • Stable Diffusion:CLIP-ViT-L/14。

CLIP (Contrastive Language-Image Pre-training)

CLIP 是 OpenAI 于 2021 年提出的双塔多模态模型,被称为“图文对齐的基石”。在多模态岗位面试中,它的核心思想、损失函数与工程细节几乎必问。

CLIP 双塔架构示意

一、核心思想

  • 使用图像编码器和文本编码器分别提取特征,再映射到统一语义空间。
  • 通过对比学习(Contrastive Learning)拉近正样本距离、推远负样本距离,从而实现 Zero-shot 分类。
  • 面试金句:CLIP 打通视觉与语言的语义壁垒,让模型“看图懂语义”。

二、架构细节

  1. Image Encoder(视觉塔):ResNet-50、ViT-B/16、ViT-L/14 等结构,输出 D 维视觉向量。
  2. Text Encoder(文本塔):Transformer 结构,输入加入 [SOS][EOS],取 [EOS] 位置作为句子表示。
  3. Projection Head(映射层):线性层映射至同一维度并 L2 归一化,无 Cross-Attention,推理高效。

三、训练目标:对比学习

  1. 数据规模:WIT-400M(4 亿图文对),弱监督规模决定上限。

  2. Similarity matrix: for the N pairs in a batch, compute image features {v_i^I} and text features {v_j^T}, then s_ij = (v_i^I · v_j^T) / (||v_i^I|| · ||v_j^T||), i.e. the cosine similarity after L2 normalization.

  3. InfoNCE / 对称 Cross Entropy

    • 对角线 (i, i) 为正样本,其余为负样本。
    • 行维度做 Softmax(Image→Text),列维度做 Softmax(Text→Image),两者平均。
    • 引入可学习温度 τ 控制分布尖锐度,通常约束 τ ≥ 0.01 避免梯度爆炸。
# image_encoder: ResNet / ViT
# text_encoder: Transformer
# W_i, W_t: 线性映射到共享空间
# t: learnable temperature

I_f = image_encoder(I) # [N, d_i]
T_f = text_encoder(T) # [N, d_t]
I_e = l2_normalize(I_f @ W_i, axis=1) # [N, D]
T_e = l2_normalize(T_f @ W_t, axis=1) # [N, D]
logits = (I_e @ T_e.T) * np.exp(t) # [N, N]
labels = np.arange(N)
loss_i = cross_entropy(logits, labels, axis=1)
loss_t = cross_entropy(logits.T, labels, axis=1)
loss = (loss_i + loss_t) / 2
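A runnable PyTorch version of the loss above, assuming the image and text features have already been projected to a shared dimension; the batch size, feature dimension, and the log(1/0.07) temperature initialization are illustrative.

import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over an [N, D] batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale.exp() * image_emb @ text_emb.t()   # [N, N] cosine similarities / tau
    labels = torch.arange(logits.shape[0], device=logits.device)
    loss_i = F.cross_entropy(logits, labels)                # image -> text
    loss_t = F.cross_entropy(logits.t(), labels)            # text -> image
    return (loss_i + loss_t) / 2

logit_scale = torch.nn.Parameter(torch.tensor(2.6593))      # log(1 / 0.07)
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512), logit_scale)
print(loss)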

四、Zero-shot 推理

  1. Prompt Engineering:将标签写成模板句子 A photo of a {label}.
  2. 使用文本塔编码所有 Prompt,缓存文本特征。
  3. 图片通过视觉塔得到特征,与所有文本特征计算余弦相似度,得分最高者即预测。
  4. 多模板取平均可显著提升 Zero-shot 表现。

五、面试常问问题

  • 为什么 Batch Size 极大? 对比学习依赖负样本,Batch 越大,负样本越多,特征越鲁棒;CLIP 训练 Batch 可达 32K。
  • 温度 τ 的作用? 调节 Softmax 尖锐度,CLIP 中为可学习标量,常在 log 域裁剪保障下界。
  • 有哪些局限? 不擅长计数/空间关系/OCR,输入分辨率 224×224 对小目标不敏感。
  • 相比 ImageNet 预训练优势? 数据量大、语言监督更丰富、对分布偏移更鲁棒。
  • 如何拓展到检测/分割? GLIP、Grounding DINO、RegionCLIP 通过区域对齐文本;结合 SAM 可做文本分割。

六、应用与地位

  • Stable Diffusion:使用 CLIP Text Encoder 解析 Prompt。
  • LLaVA / Qwen-VL:采用 CLIP ViT-L/14 作为视觉骨干再接 LLM。
  • CLIP + ViT 基本覆盖现阶段多模态视觉前端 80% 的面试考点。

拟合与泛化

过拟合 vs 欠拟合

  • 过拟合:训练集表现很好,但验证/测试集性能下降,说明模型记住了噪声或特例。
  • 欠拟合:训练集和验证集都表现糟糕,通常意味着模型容量不足或训练不充分。

如何判断

  • 绘制训练 Loss 与验证 Loss 曲线。
    • 训练 Loss 持续下降而验证 Loss 上升 → 过拟合。
    • 两者都停留在高位 → 欠拟合。
  • 对比训练/验证准确率或其他指标是否出现明显分叉。

缓解策略

  • 应对过拟合
    • 收集更多数据或做增强(翻转、裁剪、颜色抖动、Mixup/CutMix 等)。
    • 增加正则化(L1/L2、Dropout、Label Smoothing、数据噪声)。
    • 降低模型复杂度、使用 Early Stopping、应用 BatchNorm。
  • 应对欠拟合
    • 使用更大的模型或更强的结构。
    • 训练更久、采用更合适的学习率策略。
    • 降低正则化强度或改进特征。

数据准备与特征工程

数据集划分

  • 典型拆分:训练集/验证集/测试集(例如 8/1/1),保证各子集分布一致。
  • 数据量有限时可使用 k 折交叉验证轮流作为验证集。
  • 保持随机种子与分层抽样,避免类别不平衡导致的偏差。

预处理与特征工程

  • 数值特征常做标准化(Z-score)或归一化到固定区间,防止量纲影响梯度。
  • 图像常做均值方差归一化、直方图均衡、白化等;文本需要分词/Tokenizer、构建词典、转换为 Embedding。
  • 离散特征通过 one-hot、Embedding、目标编码等方式注入模型。

批处理与数据管线

  • DataLoader 通常负责 shuffle、batch、并行加载与缓存,保证训练稳定。
  • Prefetch、pin memory、mmap、TFRecord 之类技巧可提高吞吐。
  • 在线数据增强(在读取时随机变换)能避免存储大量增强样本。
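A small illustration of such a pipeline with torch.utils.data.DataLoader; the dataset is random and the settings are illustrative (in real training you would also raise num_workers / prefetch_factor, which requires a __main__ guard for worker processes).

import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
loader = DataLoader(
    ds,
    batch_size=64,
    shuffle=True,        # reshuffle every epoch
    pin_memory=True,     # faster host-to-GPU copies
    drop_last=True,
)
for x, y in loader:
    pass                 # forward/backward would go here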

前向传播与反向传播

计算图

  • 深度学习模型可视为由线性/非线性层组成的有向无环图,前向传播按拓扑顺序计算输出值。
  • 常见操作:矩阵乘、卷积、逐元素激活、拼接、归一化等。

反向传播

  • Based on the chain rule: if y = f(u) and u = g(x), then dy/dx = (dy/du) · (du/dx).
  • Starting from the gradient of the loss with respect to the output and multiplying by local gradients in reverse order along the compute graph yields the gradients of all parameters.

自动微分

  • 框架(PyTorch、TensorFlow、JAX)都会记录计算图并自动求导,开发者只需定义前向过程。
  • 明确 requires_grad/stop_gradient、合理释放梯度(zero grad),可以避免显存泄漏。
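A tiny autograd walkthrough of these points; the tensors are made up.

import torch

w = torch.randn(3, requires_grad=True)      # parameter tracked by autograd
x = torch.tensor([1.0, 2.0, 3.0])
loss = (w @ x - 1.0) ** 2                   # scalar loss; the graph is recorded on the fly
loss.backward()                             # chain rule applied automatically
print(w.grad)                               # dL/dw = 2 * (w @ x - 1) * x
w.grad.zero_()                              # reset before the next backward pass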

正则化

L1 与 L2

  • L1 regularization: add λ · Σ|w| to the loss; many weights are pushed exactly to 0, giving a sparse model that helps feature selection.
  • L2 regularization: add λ · Σw² to the loss; all weights shrink but stay non-zero, giving a smoother, more stable model.
  • Intuition: L1 "turns many parameters off entirely", while L2 "shrinks every parameter a little".

其他常见手段

  • Dropout:训练时随机屏蔽部分神经元,相当于做子网络集成,减少共适应。
  • BatchNorm / LayerNorm:稳定每层输入分布,允许更大学习率,并带来自然的正则化效果。
  • Label Smoothing: replace the 1 in the one-hot target with 1 − ε and spread ε/(K−1) over the other K−1 classes, reducing over-confidence.
  • Early Stopping / Weight Decay:监控验证集,当指标不再提升时提前停止;Weight Decay 与 L2 等价,常直接作用于优化器。
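One way these regularizers typically appear together in PyTorch; the architecture and hyperparameters are illustrative.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(256, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)   # decoupled L2
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)                            # softened targets

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()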

激活函数

Mathematical forms and key points of common activation functions:

  • Sigmoid: σ(x) = 1 / (1 + e^(−x)), outputs in (0, 1) and can be read as a probability; its gradient saturates at both ends.
  • Tanh: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)), outputs in (−1, 1); converges faster than Sigmoid but still saturates.
  • ReLU: max(0, x); simple and efficient, but neurons can die ("Dead ReLU").
  • Leaky ReLU: max(αx, x) with a small fixed α (e.g. 0.01), which mitigates Dead ReLU.
  • PReLU: same form as Leaky ReLU, but the negative-slope α is learned.
  • ELU: x for x > 0 and α(e^x − 1) for x ≤ 0, giving a smoother negative half.
  • GELU: x · Φ(x), where Φ is the standard normal CDF; common in Transformers.
  • Swish: x · σ(βx) (usually β = 1, i.e. SiLU); smoother gradients and slightly better performance than ReLU.

损失函数与指标

交叉熵

  • 标签为 one-hot 时等价于最大化正确类别的 log 概率。
  • 与 softmax 组合后的梯度稳定、本质在最小化真实分布与预测分布的 KL 散度。

其他损失函数

  • MSE: (1/n) Σ (y_i − ŷ_i)², sensitive to outliers, the default choice for regression.
  • MAE: (1/n) Σ |y_i − ŷ_i|, more robust to outliers but not differentiable at 0.
  • Huber: quadratic while the error is below a threshold δ, linear beyond it, combining the strengths of MSE and MAE.
  • Hinge / Multi-class Hinge: max(0, 1 − y·f(x)), used in max-margin classifiers such as SVMs.
  • Focal Loss: −(1 − p_t)^γ · log(p_t), where p_t is the predicted probability of the true class; the exponent γ down-weights easy examples, so it is common in detection and imbalanced classification.

常见指标

  • 分类:Accuracy、Precision、Recall、F1、ROC/AUC。
  • 回归:MSE/MAE、RMSE、R²;MSE 对离群点敏感,MAE 更鲁棒但在 0 点不可导。

优化与训练策略

Mini-Batch 必要性

  • Full-batch 梯度最精确但慢且耗显存;纯 SGD (batch=1) 更新快却噪声大。
  • Mini-batch 在效率与稳定性之间取折中,可利用 GPU 并行,梯度估计更平滑。

优化器

  • SGD:沿负梯度方向更新。
  • Momentum:引入动量项,积累之前的梯度方向,加速收敛并抑制震荡。
  • Adam:同时估计梯度的一阶/二阶矩,为不同参数分配自适应学习率,前期收敛快,但泛化有时略弱于 SGD+Momentum,可在后期切换。

学习率调度

  • Step/Exponential Decay:按固定间隔或指数下调。
  • Cosine Annealing:富有周期感,可配合 Warmup。
  • Warmup:训练初始从较小 LR 逐步升高,避免震荡。
  • Cyclic / OneCycle:先升后降,在 CV、NLP 任务中常见。
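As a sketch, the cosine-with-warmup schedule implemented earlier in run_get_lr_cosine_schedule can be attached to an optimizer via LambdaLR; setting the base learning rate to 1.0 makes the lambda return the absolute rate. The model and schedule hyperparameters here are illustrative.

import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0)    # base lr 1.0: the lambda returns the absolute lr
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda it: run_get_lr_cosine_schedule(
        it,
        max_learning_rate=3e-4,
        min_learning_rate=3e-5,
        warmup_iters=100,
        cosine_cycle_iters=1000,
    ),
)
for step in range(1000):
    optimizer.step()
    scheduler.step()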

梯度消失/爆炸

  • 原因:深层链式相乘、激活饱和、初始化不当。
  • 对策:使用 ReLU 家族、Xavier/He 初始化、残差结构、BatchNorm/LayerNorm、梯度裁剪。

典型网络模块

CNN

  • Convolution layers have three key advantages: parameter sharing, local receptive fields, and translation invariance.
  • Convolution output size: for input width W, kernel size K, padding P, and stride S, the output width is W_out = ⌊(W − K + 2P) / S⌋ + 1 (verified in the snippet below).
  • Pooling is used for downsampling and robustness: max pooling keeps the strongest response, average pooling is smoother, and global average pooling is common at the classification head.
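A quick numerical check of the output-size formula, using a 7×7 kernel with stride 2 and padding 3 on a 224×224 input.

import torch
import torch.nn as nn

# floor((224 - 7 + 2*3) / 2) + 1 = 112
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=2, padding=3)
print(conv(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 64, 112, 112])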

RNN / LSTM / GRU

  • Vanilla RNNs suffer from vanishing/exploding gradients on long sequences.
  • LSTMs maintain a cell state c_t through forget, input, and output gates, easing long-range dependencies.
  • GRUs merge the cell state into the hidden state and keep only update and reset gates, with fewer parameters and faster computation.

Attention 与 Transformer

  • Self-Attention: for each position i, compute the similarity between its Query and the Keys of all positions, then take a weighted sum of the Values.
  • Multi-Head: several (Q, K, V) projections run in parallel, capturing different kinds of relations.
  • Because attention itself is order-agnostic, learnable or sinusoidal positional encodings must be added.
  • Core formula: Attention(Q, K, V) = softmax(QK^T / √d_k) · V (a runnable version follows this list).
  • Transformers are built entirely on self-attention, process sequences in parallel, and scale readily to GPT, BERT, ViT, and other large models.
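A compact implementation of the core formula, with an optional causal mask; the shapes are illustrative.

import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    """q, k, v: [batch, heads, seq, d_k]; returns the attended values."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])     # [B, H, T, T]
    if causal:
        t = scores.shape[-1]
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))          # hide future positions
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 8, 16, 64)
print(scaled_dot_product_attention(q, k, v, causal=True).shape)   # torch.Size([2, 8, 16, 64])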

归一化与正则细节

  • BatchNorm:在通道维上对 mini-batch 标准化,再学习缩放/平移,提升收敛速度并具备轻微正则化;训练和推理需区分均值/方差的来源。
  • LayerNorm:对同一样本的特征维做标准化,与 batch 大小无关,适合 Transformer/NLP。
  • Dropout数据增强 搭配使用可明显提升泛化。
  • 权重初始化:Xavier/Glorot 适合近似线性激活,He 初始化匹配 ReLU 家族;良好的初始化能避免一开始就梯度消失。

一份把 CNN 与 Transformer 串起来的快速笔记,记录关键公式、训练直觉与二者之间的联系。

1. 深度网络训练循环

  1. 正向传播:输入沿着网络逐层计算得到输出。
  2. 计算损失:把输出与标签送入损失函数,得到标量损失。
  3. 反向传播:利用链式法则计算各层梯度。
  4. 参数更新:优化器使用梯度更新权重,循环往复。

2. CNN 基础组件

2.1 卷积层与特征提取

  • 卷积核在图像上滑动,通过局部感受野提取空间特征,可并行堆叠多组卷积层。
  • 示例:输入为 224×224×3,使用 64 个 7×7 卷积核、步长 2,可得到 112×112×64 的输出;空间分辨率减半、通道数等于卷积核个数。
  • 更深的卷积核(通道数更多)可以捕获更复杂的特征模式,输出 tensor 的空间尺寸由步长与 padding 控制。

2.2 ReLU(Rectified Linear Unit)

  • 引入非线性,提升模型表征能力。
  • 通过截断负值缓解梯度消失,使深层网络更易训练。

2.3 池化层

  • 常见的 2×2 最大池化会在每个窗口取最大值,输出 56×56×128 这样的结果(由 112×112×128 池化而来)。
  • 作用:降低空间分辨率、聚合局部信息、减少计算与过拟合风险。

2.4 全连接层与 Softmax

  • The feature maps produced by convolution and pooling are flattened into a vector before the fully connected layers.
  • Example: a 56×56×128 = 401408-dimensional input mapped to 4096 dimensions needs a 4096×401408 weight matrix and a 4096×1 bias: h = W·x + b.
  • For a 10-class task, one more 10×4096 linear layer produces the logits.
  • Softmax turns the logits z into a probability distribution: p_i = e^(z_i) / Σ_j e^(z_j), so that Σ_i p_i = 1.

2.5 梯度消失的来源

  • Sigmoid / saturating activations: the derivative peaks at only 0.25 and is nearly 0 in the saturated regions, so repeated multiplication makes the gradient decay exponentially.
  • Weights initialized too small: if W ≈ 0.01, backpropagation keeps multiplying by 0.01 and the gradient tends to 0; hence initialization schemes such as Xavier and He.
  • Very deep networks: the gradient passed back through L layers is a product of L per-layer Jacobians ∂h_l/∂h_(l−1); if each factor is slightly below 1, the product shrinks rapidly.
  • No skip connections: in a plain chain, the gradient must pass through every layer and cannot bypass poorly behaved intermediate layers.

3. ResNet 的核心思想

3.1 残差连接公式

  • Residual block output: y = F(x) + x, where F(x) is the residual branch (a few Conv/BN/ReLU layers) and x is the identity mapping.
  • Backpropagation: ∂L/∂x = ∂L/∂y · (1 + ∂F/∂x).

The "+1" term lets the gradient flow straight back to earlier layers instead of being squeezed toward 0 by repeated multiplication.

残差连接让梯度可直达前层

3.2 退化问题与 ResNet 的改进

  • In plain deep networks, adding layers can make the training error go up (the degradation problem).
  • In the paper, a 34-layer plain network reaches 28.54% top-1 error on ImageNet, worse than the 18-layer network's 27.94%; with residual connections the 34-layer network drops to 25.03%.
  • Reason: if some layers cannot improve the representation further, the residual branch can learn F(x) = 0, so the block degenerates to y = x and extra depth never destroys what has already been learned.
plain 网络与 ResNet 的对比

3.3 残差块结构示意

输入 x

├──▶ F(x):Conv → BN → ReLU → Conv → BN

└──────────────▶ +


y = F(x) + x
  • If the dimensions do not match, a 1×1 convolution or projection matrix reshapes x to the same shape as F(x) before the addition (see the sketch below).
  • The residual path lets information "skip" layers, so even if the intermediate convolutions are temporarily poorly trained, gradient flow is not blocked.
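A minimal PyTorch sketch of such a residual block, using a 1×1 projection on the shortcut when the shapes differ; the channel counts are illustrative, and the BatchNorm that full ResNets place on the projection shortcut is omitted for brevity.

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet-style basic block: y = ReLU(F(x) + shortcut(x))."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection when the residual branch changes the shape
        self.shortcut = (
            nn.Identity()
            if stride == 1 and in_ch == out_ch
            else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.shortcut(x))

print(BasicBlock(64, 128, stride=2)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])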

3.4 与 Transformer 的联系

  • Transformers extend the residual idea to the self-attention and feed-forward sublayers: each sublayer output is written as y = LayerNorm(x + Sublayer(x)).

  • This likewise lets gradients pass straight through the attention or FFN sublayer, stabilizing deeply stacked models.

4. Transformer 架构速记

4.1 整体结构

  • 经典 Transformer 采用编码器-解码器架构:每层由自注意力 + 前馈网络组成,堆叠多层后可建模长序列关系。
  • 解码端还包含编码器-解码器注意力,用于关注编码器输出。
Transformer 编码器-解码器宏观结构

4.2 自注意力(Scaled Dot-Product Attention)

  • The input is projected to queries Q, keys K, and values V, and attention is computed as Attention(Q, K, V) = softmax(QK^T / √d_k) · V.

  • Dot-product attention maps directly onto efficient matrix multiplication and is memory friendly.
缩放点积注意力计算流程
多头注意力并行关注不同子空间

4.3 多头注意力

  • Multi-head attention lets the model attend to different subspaces at once: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O, with head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V).

  • Each head has a small dimension, so the total cost is close to single-head attention while capturing dependencies at multiple scales.

4.4 自回归与注意力掩码

  • A language model factorizes autoregressively: p(x_1, ..., x_T) = Π_t p(x_t | x_1, ..., x_(t−1)).

  • To preserve this property, the decoder's self-attention adds a −∞ mask on future positions, so the softmax depends only on tokens that have already been generated.

4.5 Position-wise Feed-Forward Network

  • A two-layer MLP applied independently at every position, sharing weights across positions but not across layers: FFN(x) = max(0, x·W_1 + b_1)·W_2 + b_2.

  • It can be viewed as a convolution with kernel size 1; the inner dimension is typically d_ff = 2048 for d_model = 512 (about 4× d_model).
逐位置前馈网络结构示意

4.6 Embedding、Softmax 与参数共享

  • Token IDs are first mapped into a d_model-dimensional space by the embedding matrix; the same matrix can be reused for the output layer (weight sharing), and the decoder logits are passed through a Softmax to obtain a probability distribution.
  • For numerical stability, the input embeddings are usually scaled by √d_model.

4.7 位置编码(Positional Encoding)

  • A pure attention model has no notion of order, so extra vectors must inject positional information. The paper uses fixed sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).

  • Each dimension corresponds to a different frequency; relative positions can be expressed as linear transforms of these encodings, they generalize to longer sequences, and they add no parameters (a small implementation follows below).
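A small implementation of these fixed encodings; the maximum length and model dimension are illustrative.

import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sin/cos positional encodings of shape [max_len, d_model]."""
    position = torch.arange(max_len).unsqueeze(1)                                  # [max_len, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

print(sinusoidal_positional_encoding(128, 512).shape)   # torch.Size([128, 512])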

线性代数是机器学习、图形学、控制论乃至量子计算的底层语言。学习过程中如果只背公式,往往无法把抽象符号与具体场景联系起来。本文从零开始梳理关键概念、常见例子与学习路线,帮助你在“理解—计算—应用”之间建立桥梁。

1. 为什么线性代数如此重要

  • 数据表示:向量可以描述单个样本的特征,矩阵可以并行操作一批样本。
  • 线性变换:模型权重(全连接层、卷积核)本质都是线性映射。
  • 优化与分解:梯度、Hessian、奇异值分解 (SVD) 等都植根于线性代数。

2. 前置知识清单

  1. 代数基础:熟悉实数运算、因式分解、函数图像。
  2. 集合与映射:理解“输入 输出”的函数关系,知道多元函数的含义。
  3. 基础几何:二维平面、三维空间中的点、向量、角度与面积。
  4. 初等数列:能处理求和符号 与简单递推。

若上述内容尚不牢固,建议先配合高中代数/几何教材或可汗学院的基础课程复习。

3. 向量:既是箭头也是列表

  • Definition: an n-dimensional vector is written x = (x_1, x_2, ..., x_n)^T and is usually treated as a column vector.
  • Geometric intuition: a 2-D vector is an arrow in the plane; its length ||x|| = √(x_1² + x_2²) measures the arrow's "strength", and its direction is set by the coordinates.
  • 现实例子:影评情感向量 可表示“正面、负面、中性”特征贡献。
  • Basic operations
    • Addition: x + y = (x_1 + y_1, ..., x_n + y_n), equivalent to chaining two displacements.
    • Scalar multiplication: c·x stretches or shrinks the vector's length.
    • Dot product: x · y = Σ_i x_i y_i, a measure of similarity or projection.

示例:设用户喜好向量 表示“动作片”“爱情片”的偏好权重,电影 A 的向量为 ,电影 B 的向量为 。点积 说明用户与电影 A 的特征更接近,推荐系统就会优先推送电影 A。

4. 向量空间、线性组合与基

  • Vector space: a set closed under addition and scalar multiplication; for example, all 2-D vectors form R².
  • Linear combination: c_1·v_1 + c_2·v_2 + ... + c_k·v_k. If the linear combinations of a set of vectors cover the whole space, that set spans (generates) the space.
  • Basis and dimension: a minimal spanning set is called a basis. In 2-D the standard basis is e_1 = (1, 0), e_2 = (0, 1); the number of basis vectors is the dimension.
  • Concrete picture: if a drink's flavor space is described by "sweet" and "sour" vectors, every drink can be written as a linear combination of them; changing the basis is like describing the same space with "citrus" and "berry" instead.

5. 矩阵:线性变换的载体

  • Definition: an m×n matrix is a table with m rows and n columns; it can be read as a linear map from n-dimensional inputs to m-dimensional outputs.
  • Matrix action: y = A·x. The columns of the matrix record where the basis vectors land after the transformation.
  • Geometric examples
    • Scaling matrix [[2, 0], [0, 1/2]]: stretch by 2 horizontally, compress to half vertically.
    • Rotation matrix [[cos θ, −sin θ], [sin θ, cos θ]]: rotate counter-clockwise by θ around the origin.
  • Composition: applying two transforms in sequence is the same as multiplying their matrices, B·A (A first, then B).

A matrix can be viewed as a "coordinate-axis deformation machine". In 2-D the standard basis is e_1 = (1, 0), e_2 = (0, 1); a matrix A = [a_1 | a_2] maps e_1 to its first column a_1 and e_2 to its second column a_2. These two new vectors are the "deformed" axes. Any vector v = v_1·e_1 + v_2·e_2 is therefore mapped to A·v = v_1·a_1 + v_2·a_2.

This formula says that the columns of a matrix are the stretched or twisted coordinate axes: adding the original coefficients v_1, v_2 along the new axes gives the final coordinates.

Axis-view example: let S = [[1, k], [0, 1]] (a horizontal shear matrix). Then S·e_1 = (1, 0) and S·e_2 = (k, 1), so the x-axis stays fixed while the y-axis is tilted toward the direction (k, 1). Any point (x, y) is mapped to (x + k·y, y), which makes the "dragged along the new axis" effect easy to see.

Example: a pixel p in an image is first rotated by 45° and then scaled by 2. The combined matrix is M = S·R, where R is the 45° rotation and S = [[2, 0], [0, 2]], and the final coordinates are M·p; reading the product from right to left makes the "rotate first, then scale" order explicit.

6. 矩阵基本运算

Operation Expression Meaning
Addition (A + B)_ij = a_ij + b_ij Add element-wise
Scalar multiple (c·A)_ij = c·a_ij Multiply every element by a constant
Product (A·B)_ij = Σ_k a_ik · b_kj Compose transforms, or wire "features → outputs"
Transpose (A^T)_ij = a_ji Swap rows and columns; the dot product can be written x·y = x^T·y
Inverse A^(−1), with A·A^(−1) = I Undo the transform; exists only for invertible matrices

理解矩阵乘法的“行 × 列”视角十分关键:每一行代表一个输出变量如何聚合输入各分量。

7. 线性方程组与高斯消元

Writing a system of equations as A·x = b gives a uniform way to discuss how to solve it:

通过高斯消元(行变换)可化为上三角矩阵,再回代得到解。几何上,该例表示两条直线的交点;若系数矩阵行向量共线则无唯一交点。

示例 1(唯一解):上式增广矩阵 得到 。二者对应平面直线在点 相交。

示例 2(欠定/最小二乘):若

, 两行相同导致方程组无解。最小二乘解

表示找到距离两条“重合直线”最近的点。

8. 行列式:体积与可逆性的度量

  • The determinant det(A) measures how the volume of the unit cube is scaled by the transformation A.
  • det(A) = 0 means the transformation squashes space into a lower-dimensional subspace, so the matrix is not invertible.
  • 2-D example: a 2×2 matrix with determinant −2 scales areas by a factor of 2 and flips orientation.

三维示例:设

, 利用展开可得

这意味着单位立方体被拉伸为体积为 3 的平行六面体,并保持右手坐标系方向。

9. 特征值与特征向量

An eigenvector satisfies A·v = λ·v: the transformation leaves its direction unchanged and only scales it by λ.

  • Real-world use: in principal component analysis (PCA), the eigenvectors of the covariance matrix are the directions of maximum variance, and the eigenvalues give the variance along them (see the sketch below).
  • Computation: solve the characteristic equation det(A − λI) = 0 for the eigenvalues, then substitute back to find the eigenvectors.
  • Geometric meaning: find the axes that are "never twisted", which helps judge the stability of a system or the principal directions of data.
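A short PCA sketch along these lines, using synthetic data and the eigendecomposition of its covariance matrix.

import torch

torch.manual_seed(0)
x = torch.randn(500, 3) * torch.tensor([3.0, 1.0, 0.1])   # synthetic data with anisotropic variance
x = x - x.mean(dim=0)                                     # zero-mean the samples
cov = x.t() @ x / (x.shape[0] - 1)                        # covariance matrix
eigvals, eigvecs = torch.linalg.eigh(cov)                 # ascending eigenvalues, orthonormal eigenvectors
top2 = eigvecs[:, -2:]                                    # directions of largest variance
projected = x @ top2                                      # 2-D PCA projection
print(eigvals)
print(projected.shape)                                    # torch.Size([500, 2])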

10. 正交性、投影与最小二乘

  • Orthogonal vectors: u · v = 0, i.e. the angle between them is 90°.
  • Orthogonal matrix: Q^T·Q = I, preserving both lengths and angles.
  • Projection formula: proj_u(v) = ((v · u) / (u · u)) · u.
  • Least squares: when A·x = b has no exact solution, choose x̂ minimizing the residual ||A·x − b||², which solves the normal equations A^T·A·x̂ = A^T·b and amounts to projecting b onto the column space of A.

投影示例:令 表示“沿对角线移动”的方向, 表示某个三维信号。将 投影到 上得到 意味着信号在“对角线平面”上的分量是 ,剩余的 则是与 正交的噪声。

11. 常见矩阵分解

Decomposition Form Typical uses
LU A = L·U, with L lower-triangular and U upper-triangular Factor once, solve many right-hand sides
QR A = Q·R, with Q orthogonal and R upper-triangular Orthogonalization, numerically stable least squares
Eigendecomposition A = Q·Λ·Q^(−1) (symmetric matrices diagonalize orthogonally: A = Q·Λ·Q^T) Spectral clustering, PCA, dynamical-systems analysis
SVD A = U·Σ·V^T Dimensionality reduction, pseudo-inverse, low-rank approximation, recommender systems

分解的本质是把复杂映射拆解成“旋转 + 缩放 + 投影”等易理解的步骤。

奇异值分解 (SVD) 的 LaTeX 图示

SVD splits an arbitrary matrix into three factors, A = U·Σ·V^T:

  1. V^T: rotate/reflect in the input space so the data lines up with the "principal axes".
  2. Σ: keep the non-negative singular values σ_1 ≥ σ_2 ≥ ... ≥ 0 and apply a pure stretch or compression along the axes.
  3. U: place the deformed result back in the output space, rotating/reflecting it into the target coordinate system.

To stay compatible with Hexo's default MathJax setup, the three steps can be written as an "arrow chain": x → V^T·x → Σ·V^T·x → U·Σ·V^T·x.

This arrow chain shows the "rotate → scale → rotate again" pipeline: every point on the unit circle is first aligned by V^T, then stretched into an ellipse by Σ, and finally mapped into the output space by U. The largest singular value σ_1 is the maximum factor by which the matrix can amplify any direction; the larger it is, the more energy that direction carries, and rapidly decaying singular values mean the matrix is approximately low-rank.

Numerical picture: for a 2×2 matrix, V^T first rotates the unit circle, Σ stretches it into an ellipse whose semi-axes are the singular values σ_1 and σ_2, and U rotates the result into the output coordinates. For dimensionality reduction, keeping only the largest singular values and their corresponding columns gives the best low-rank approximation (a short sketch follows below).
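A brief low-rank-approximation sketch with torch.linalg.svd; the matrix is random and the kept rank k is illustrative.

import torch

torch.manual_seed(0)
a = torch.randn(8, 6)
u, s, vh = torch.linalg.svd(a, full_matrices=False)       # a == u @ torch.diag(s) @ vh
k = 2
a_k = u[:, :k] @ torch.diag(s[:k]) @ vh[:k, :]            # best rank-k approximation (Eckart-Young)
print(s)                                                  # singular values in descending order
print(torch.linalg.matrix_rank(a_k))                      # tensor(2)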

12. 线性代数与机器学习的衔接

  1. 数据标准化:零均值处理等价于把样本投影到均值向量的正交补。
  2. 特征工程:PCA/ICA 即寻找协方差矩阵的特征向量或独立基。
  3. 优化算法:梯度、牛顿法中的 Hessian、二阶近似全部依赖矩阵微积分。
  4. Regularization: L2 regularization limits a weight vector's Euclidean norm, while L1 regularization encourages sparse coefficients.
  5. 深度学习:注意力矩阵、卷积核展开、BatchNorm 统计量都可用线性映射描述。

Examples:
  • Data standardization: zero-centering house-price features (area, age, number of bedrooms) before fitting a linear regression lets the model work in a subspace with the shared offset removed.
  • PCA: reducing 784-dimensional MNIST images to 32 dimensions keeps only the 32 largest singular values and their directions, cutting storage while preserving the main stroke structure.
  • Deep learning: the Query-Key dot product in a Transformer measures how well two vectors align in a shared subspace; when singular values grow too large, spectral normalization is often used to limit how much the attention weights can be amplified.

13. 建议的学习路线

  1. 向量直觉阶段:手画二维、三维向量的加法、点积、投影,理解长度与角度的含义。
  2. 矩阵运算阶段:练习矩阵与向量、矩阵与矩阵的乘法,关注形状匹配与变换效果。
  3. 行列式与消元阶段:手算 2×2、3×3 行列式,掌握高斯消元和秩的概念。
  4. 谱分析阶段:从对称矩阵入手求特征值/特征向量,并用真实数据做 PCA。
  5. 分解与数值阶段:实现 Gram-Schmidt、QR、SVD,理解数值稳定性和条件数。
  6. 应用阶段:编程实现线性回归、低秩图像压缩、推荐系统嵌入,观察每一步的线性代数含义。

14. 练习与资源

  • 可视化课程:3Blue1Brown 的 Essence of Linear Algebra 动画。
  • 系统教材:Gilbert Strang 的 Introduction to Linear Algebra 及 MIT 公开课 18.06。
  • 编程练习:使用 NumPy/PyTorch 验证矩阵运算、SVD、最小二乘公式。
  • 自测题:自己构造小矩阵,判断其可逆性、特征值,并在坐标纸上描绘变换后的网格。

循序渐进地把抽象概念与“箭头如何移动”“数据方差指向哪里”等可视图像绑定,线性代数就能真正成为解决问题的工具而非考试公式。

GGML 学习笔记大纲

1. 矩阵乘法基础

Matrix multiplication is usually written C = A·B. The product is defined only when the number of columns of the left matrix A equals the number of rows of the right matrix B, and its elements are given by the formula below.

1.1 Form

For A of shape m×n and B of shape n×p, the product C = A·B has shape m×p.

1.2 General formula

c_ij = Σ_{k=1..n} a_ik · b_kj, i.e. entry (i, j) is the dot product of row i of A with column j of B.

1.3 Numerical example

[[1, 2], [3, 4]] · [[5, 6], [7, 8]] = [[1·5 + 2·7, 1·6 + 2·8], [3·5 + 4·7, 3·6 + 4·8]] = [[19, 22], [43, 50]]

2. ggml_tensor 结构

struct ggml_tensor {
enum ggml_type type;
struct ggml_backend_buffer * buffer;
int64_t ne[GGML_MAX_DIMS]; // number of elements
size_t nb[GGML_MAX_DIMS]; // stride in bytes:
// nb[0] = ggml_type_size(type)
// nb[1] = nb[0] * (ne[0] / ggml_blck_size(type)) + padding
// nb[i] = nb[i-1] * ne[i-1]

// compute data
enum ggml_op op;
// op params - allocated as int32_t for alignment
int32_t op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];
int32_t flags;
struct ggml_tensor * src[GGML_MAX_SRC];
// source tensor and offset for views
struct ggml_tensor * view_src;
size_t view_offs;
void * data;
char name[GGML_MAX_NAME];
void * extra; // extra things e.g. for ggml-cuda.cu
char padding[8];
};

Key points:
  • ne (number of elements) and nb (number of bytes) describe, per dimension, the element count and the byte stride.
  • op and op_params identify the operator node this tensor corresponds to and its parameters, used to build the compute graph.
  • view_src and view_offs let a view tensor share the underlying data, commonly used for slicing, reshape, and similar operations.

3. 常用命令

# GPT-2 单路推理
.\build\bin\Release\gpt-2.exe -m .\models\gpt-2-117M\ggml-model.bin -p "This is an example" -n 128 -t 8 --top_k 40 --top_p 0.9 --temp 0.8

# GPT-2 带批次生成
.\build\bin\Release\gpt-2-batched.exe -np 4 -m .\models\gpt-2-117M\ggml-model.bin -p "Hello my name is" -n 64

# GPT-2 内存分配(alloc 版本)
.\build\bin\Release\gpt-2-alloc.exe -m .\models\gpt-2-117M\ggml-model.bin -p "Sample prompt" -n 80

# GPT-J 推理
.\build\bin\Release\gpt-j.exe -m .\models\gpt-j-6B\ggml-model.bin -p "int main(int argc, char ** argv) {" -n 200 -t 8

# 模型量化示例(F16 -> Q4_0)
.\build\bin\Release\gpt-2-quantize.exe .\models\gpt-2-1558M\ggml-model-f16.bin .\models\gpt-2-1558M\ggml-model-q4_0.bin 2

# SAM 图像分割
.\build\bin\Release\sam.exe -i .\examples\sam\example.jpg -m .\examples\sam\ggml-model-f16.bin -t 8

# YOLOv3-tiny 目标检测
.\build\bin\Release\yolov3-tiny.exe -m .\examples\yolo\yolov3-tiny.gguf -i .\examples\yolo\dog.jpg

4. metadata 速查

字段名 含义示例
shape 矩阵维度,如 (3, 4) 表示 3 行 4 列
dtype 元素类型,例如 float64int32
nnz 稀疏矩阵中非零元素(number of non-zero entries)

创建 Markdown 文件

运行 Hexo 命令自动生成草稿: npx hexo new post "你的标题"

它会在 source/_posts/ 下生成 你的标题.md,带好 front‑matter。 或者直接在 source/_posts/ 里手动新建 yyyy-mm-dd-xxx.md,内容格式如下:

title: 新文章标题
date: 2025-06-06 10:00:00
categories:
- 分类名
tags:
- 标签1
- 标签2
cover: https://你的封面图 (可选)
sticky: 0 # 可选,越大越靠前
The post body starts here…

Local preview / build

npm run dev → open http://localhost:4000 in the browser to check the result. Once everything looks OK, run npm run build (optional, to check that generation completes without errors).

Commit and push

git add source/_posts/xxx.md
git commit -m "post: xxx"
git push origin main
推送后 GitHub Actions 会触发 “Build & Deploy Blog”,几分钟内博客自动更新。

未来想写的东西

TODO

如果你对这些感兴趣,欢迎来 GitHub 找我聊天。新的旅程开始啦!
