<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Calvin&#39;s Marbles</title>
  
  
  <link href="http://www.calvinneo.com/atom.xml" rel="self"/>
  
  <link href="http://www.calvinneo.com/"/>
  <updated>2026-02-24T13:36:23.908Z</updated>
  <id>http://www.calvinneo.com/</id>
  
  <author>
    <name>Calvin Neo</name>
    
  </author>
  
  <generator uri="https://hexo.io/">Hexo</generator>
  
  <entry>
    <title>呼和浩特和大同游记</title>
    <link href="http://www.calvinneo.com/2026/02/24/meet-in-hohhot-datong/"/>
    <id>http://www.calvinneo.com/2026/02/24/meet-in-hohhot-datong/</id>
    <published>2026-02-23T17:20:33.000Z</published>
    <updated>2026-02-24T13:36:23.908Z</updated>
    
    <content type="html"><![CDATA[<p>过年期间去了呼和浩特和大同。这是第一次依照 codex 做的旅游规划来玩的行程。</p><a id="more"></a><h1 id="D0"><a href="#D0" class="headerlink" title="D0"></a>D0</h1><p>我的行程在 <a href="https://github.com/CalvinNeo/CalvinSchedule" target="_blank" rel="noopener">https://github.com/CalvinNeo/CalvinSchedule</a> 中。</p><h1 id="D1-初一"><a href="#D1-初一" class="headerlink" title="D1 初一"></a>D1 初一</h1><p>因为我对象年三十值班，所以我们是定了初一下午的飞机飞呼和浩特。之所以去呼和浩特，是因为去长春的飞机票买晚了，于是在从大连出发玩东北，和从呼和浩特出发玩呼和浩特和大同之间权衡，最后选择了后者。</p><p>事实证明，呼和浩特是一个挺好的地方，首先它去大同比太原还要方便，其次，它的机场到高铁站很方便，所以我们直接定一个高铁站旁边的酒店就非常舒服了。不过据说呼和浩特的新机场要启用了，因此后面它就没太原爽了。</p><p>因为我们值机比较靠前，所以下飞机是比较快的，然后就直往地铁站跑，赶上了倒数第二班地铁。事实证明坐地铁还是对的，因为后面我们点了个鼎鼎眼镜烧烤，发现等了好久都没有骑手接单，最后还是加钱人家才送的。</p><h1 id="D2-初二"><a href="#D2-初二" class="headerlink" title="D2 初二"></a>D2 初二</h1><p>今天起来还是很感冒，还是流鼻涕。</p><p>早上 8 点，我们的小团才接我。这也是因为我们住在呼市最东边的缘故，所以不用起那么早。车上除了我全是女的，所以我就坐在副驾驶了。这个副驾驶的头枕非常靠后，所以我睡起来很不方便。</p><p>第一站是辉腾草原上面的马场。这个马场除了骑马和卡丁车（实际上没人玩）啥都没有，所以尽管我们没有定骑马的套餐，但逼到最后还是只能骑马。那边报价 140，而淘宝只要 120，所以我们用淘宝价还价骑了马，然后又加了 120 骑了快马。<br>大概流程就是先上一匹小马，然后几匹马一起被牵着走到跑马场。到那边老板就问要不要骑快马，如果骑，就换到大马身上。然后老板就会同时拉着你的缰绳和自己的缰绳把马跑起来，嘴上还嘚嘚嘚嘚的。一开始马是小跑，这个时候你得用腿用力夹住马的肚子。不过后面马就会撒开来快跑了，这个时候，就是颠屁股了，感觉整个人都要散架，必须要拉着马鞍才会稳一点，然后就跑完了。</p><p>总的来讲感觉价格偏贵。体验上的话，快马还是可以的，不过时间也很短。当然，再长一点，我呛风也受不了了。然后这地方的旱厕也是堪称一绝，里面的屎冻成了棍子，我也是第一次见。导游说你就路边上尿吧，我寻思这么大的风，不吹到身上么。</p><p>然后就是去吃饭，开车到了右翼中旗县城里面，就开了那么几家店。我们选了一家，我对象点了个羊杂，点了个驼饼。我吃了下，感觉羊杂还挺香的，驼饼感觉就是普通的肉饼吧，然后查了下，说是骆驼肉。</p><p>吃完饭，就去火山了，这个也是要开一大段路的。那一块其实有很多火山，有的火山形状像草帽山，我把它叫土帽山。不过我们实际去逛的那个叫南炼丹炉，是一个平顶的火山。相比我们在冰岛看到的，这个火山确实有点火山的感觉。</p><p>从火山回来，就是长达三个小时的回程了。我们让司机帮我们送到宽巷子，这样就可以直接吃东西了。下车就能看到杨老大焙子店，结果没开门。旁边的星月排满了队，所以我们就去清和园了。这边买芝士奶饼好多都是几十个买的，我们只买了两个感觉很亏。然后我们就去珠萨拉买了几个酸奶糕，是冰着的，感觉挺好吃。</p><p>吃完就准备打车去泽成冰煮羊了，这个店在附近有两家。我决定去万象城那一家，因为那边等的时候至少可以逛一逛。事实上感觉也是对的，因为那天晚上我感冒贼难受，加上又很干，嗓子不舒服。</p><h1 id="D3-初三"><a href="#D3-初三" class="headerlink" title="D3 初三"></a>D3 
初三</h1><p>今天需要早起去大同。因为我们在火车站旁边住，所以提前一个小时起床绰绰有余了。内蒙古的火车会说什么内蒙古好羊肉，内蒙古羊肉好，非常洗脑。反正一路上也是比较困，睡睡醒醒的。<br>从一路上看，内蒙古和山西确实不太一样，首先，内蒙古真的有很多风车。然后，内蒙古感觉是草原的感觉，是有黄草的，但山西的话会更有黄土高原的感觉。</p><p>下了大同站，想要拉屎。我对象不在车上拉，结果下了车就要排队。其实始发站的火车，我一般是喜欢在火车上拉屎的，因为一般都很干净。拉完屎还有五分钟，就跑到直通车那里，准备去云冈石窟。上了车，开个大概有一个多小时才到，真的是非常非常堵了。</p><p>云冈石窟人非常多，主要体现在：首先，它景区检票口开了好几个，但居然都要排队。其次，是它的讲解也要排队，我们想想还是旅途随身听算了。反正进去之后，就绕了一圈，看了一下它的一个寺庙的前中后殿，然后过了桥，就是石窟的主体。</p><p>因为我之前在车上已经预习过了，所以我大概知道这些石窟大概分为几个系列，然后哪些是比较精华的。反正一开始进去的第一窟和第二窟都是比较惨烈的，因为它们的开口很小，所以基本人只能站在石阶上往里面伸着头看，我似乎就看到几个石柱子，感觉没啥意思。</p><p>第五窟和第六窟是一起的，实际上我们排最长的也是第六窟，因为第五窟当时已经维修了。保安说要排一个多小时，但实际排了四十几分钟就进去了。这个窟实际上是有前窟和后窟的，云冈石窟大部分也都是这个式样。并且相比于莫高窟或者龙门石窟，云冈石窟的颜色没怎么掉，所以看起来更鲜艳，据说这个可能是因为之前有人补过颜色吧。</p><p>琥珀知道我们到了大同，就想跟我们面面基。不过他显然对大同的人流没有预期，搞了半天说要吃紫泥，结果我们去古城紫泥拿了个号，发现要等 290+ 桌。不过我们倒是在古城的凯鸽也排了个号，只有 29。不过考虑到它一直不叫号，并且古城也没啥可以逛的（人特别多），所以我们就准备出去找点东西吃。经过一系列的电话咨询，我们发现紫泥在南环路的一家店还是可以取号的，所以就骑自行车过去。另外，我们一进古城就发现东门排着队，应该是上城墙准备去看花灯的，所以我们也果断不去看灯展了。</p><p>骑车过去的路也是比较奇葩，主要是出了古城城门需要往右拐，结果交警让我们走上面走，结果骑着骑着就骑到了永泰门广场上了。这个广场是被拦着的，自行车没法进，我们也下不去。结果我们只能闯过一片烂泥地强行进去，然后又在前面的栅栏的地方，把自行车搬了出去。骑到紫泥那里，是个直梯，上去之后果然全是人，反正是取了个号，然后打算先去找点东西垫垫肚子。先骑车去旁边的和笙财，点了个沙棘冰淇淋，这个冰淇淋是真好吃。然后发现旁边就是老柴削面的总店，但是我对象还是想吃喜晋道，所以就骑车去旁边的喜晋道，中途路过大同一中，感觉这个学校真的是富丽堂皇啊。进去喜晋道，他家是一个非常大的门面，装修非常漂亮，但是一进去，就说没位置了，并且都不发号了。于是就灰溜溜排老柴削面。再过去，就发现老柴削面外面都全是人了。</p><h1 id="D4-初四"><a href="#D4-初四" class="headerlink" title="D4 初四"></a>D4 初四</h1><p>今天继续是早起的一天，我们需要去恒山景区。从酒店到南站有一定距离，还是扫了个车骑了过去。然后，发现百度导航真的是糟糕，直通车明明是车站的另一端，它给标错了。总之赶到了直通车那，说恒山的车堵在高架上，可能要晚点。然后我就去上了个厕所，果然是上厕所定律了，上到一半，车就来了，结果我赶快跑了回去。在车上迷迷糊糊睡到恒山，到了那个游客中心附近又开始堵车了。总之到了下面，就要坐摆渡车，我也看到了之前在小红书上看到的排队标志，所幸今天的人没有昨天多，不过悬空寺方向还是挺长的，所以我们决定先去爬恒山了。</p><p>去恒山的车会经过悬空寺，在经过一个隧道之前，能看到悬空寺，并且这是一个很好的从上往下看的机位，建议不要错过，因为返程的时候，需要倒着头向后来看，不是很方便。</p><p>到了恒山脚下，我决定还是坐摆渡车上山，爬上去，然后索道下来。中间我对象买了个帽子，然后就又去排上山的队。今天可能景区被骂优化过了，所以我们在排摆渡车之前就检票了，没有等很久。一辆车走了之后，等了一会，然后突然开过来五辆车，都并排停了。我们上了第一辆车的最后一排，我对象不太乐意，因为她有点晕车。</p><p>上了摆渡车，很快就到真武庙了，从这里就往上爬。我带了个 insta360 
的相机，可以看到，全程都很简单，我觉得都没有南京紫金山难爬。中间有个庙，爬上去的石阶比较陡，感觉应该是最困难的部分了，但其实也就那样。然后从庙出来，就是冲顶阶段了，过了一个叫氵麦极门的地方之后，就看到小红书上堵人的地方，我们不出意料也开始堵了。感觉就是从这里开始，台阶变得高了点，所以有的人就要歇歇了，加上石阶又变窄了，所以交通就阻塞了，好在这一段不是很长，很快就到了一个亭子那。那是一个三岔路，往左是索道下山方向，往右是登顶方向。我们登顶，好在那里石阶就宽很多了，基本都可以跑起来，所以跑了大概十几分钟，我也登顶了。总共算上等的时间是六十几分钟吧。总的来说，恒山其实没啥意思，主打一个五岳打卡。然后确实很多人都在峰顶的那个地理标识那边排队拍照，说那边一个人只有 20s 的拍照时间。</p><p>下到缆车的地方，看到人也不是很多，就打算坐缆车，结果下来之后发现别有洞天，还是排了不少的。不过其实是可以忍受的，实际上也就等了不到二十分钟。中间前面的京爷因为不知道什么事情吵起来了，然后又打起来了，然后又吵起来了，反正我们就翻栏杆排到了他们前面，感觉也是个乐子。下山就直奔摆渡车站，去古城的在排队，但是去悬空寺的不需要排队，我们就直接上了。</p><h1 id="D5-初五"><a href="#D5-初五" class="headerlink" title="D5 初五"></a>D5 初五</h1><p>今天算是比较奇妙的一天。首先，昨天晚上睡觉前跟我对象讨论了，说今天早上去应县木塔吧。然后晚一点我就想把直通车的票买了，结果一看，9:30 的票只有一张了。因为木塔下午人多，并且，从木塔回古城更顺，所以我们又只能买早上的，就很尴尬。我还不信邪，又刷了几遍，中间还错买了去恒山的，总归坐直通车是不行了。灵机一动，决定去搜下有没有其他的巴士，结果 902 好像要两个多小时，不过中途发现，应县是有高铁站的，并且高铁站也是有巴士去木塔景区的，所以我就决定买火车票了。结果 12306 晚上关门，没法买票，于是我只能定了早上的闹钟。早上起来，发现买票是需要在系统里面排队的，我等了几十秒发现还没刷出来，就又小眯了会。几分钟又醒了，发现刷出了两个不挨着的座位，果断付了款，结果再一开，火车票就卖空了。</p><p>结果上了高铁，发现这趟高铁也不算特别挤啊，不知道在哪里票都卖光了。不过我们一站就到了应县西站，然后门口就是蓝色巴士，五块钱就到木塔了。然后发现这个巴士似乎也接路上的村民，并且送到木塔之后，也会继续往前开的。</p><p>因为我们是高铁来的，所以应该比大部队要早，到的时候，木塔没啥人，跟之前小红书上看到的完全不一样。我们进去之后，逛了两次一层，分别是从左边和右边走的，所以对里面的雕塑和浮雕看的都比较清楚。实际上我们去的时候，木塔二层上是有工作人员在跑动的，只是不知道他们在做什么。站在木塔下面，如果角度不对，其实是不太看的出来木塔已经歪了的。但是确实能看到它的暗层，也确实能看到之前被敲掉的泥墙被换成的窗户，也理解那边的本地人为什么要把泥墙砸掉。</p><p>准备离开木塔的时候，风突然大了起来，并且沙子也多了。身边的游客在感叹，这地方还真的是太干旱了，风沙多，结果后来才知道，这是多年没见过的沙尘暴，是从蒙古国吹过来的。总而言之，离开的时候，回望木塔已经是灰黄笼罩的一片了。</p><p>前面那条街还在表演，看了会，就往回走，结果那妖风是越来越大了。走到南门出来的那个巷子中，推开一个铁门，然后就走到一片很乡下的地方。前面几个人躲着不肯走，我们过去一看，好家伙，前面黄沙都飞起来了。</p><p>在排队的时候，我要上厕所，结果用百度找了几个厕所，一个都没有用，还是问的别人。后面立即下了高德，发现高德是准的。上完厕所，就直接去喜晋道了，当时已经快排到我们了。我们点了个肉末的面，一个番茄面，几个小菜，鸡爪，串和沙棘汁。小菜方面，我觉得山西的醋确实香，配上大蒜末，所以那个蒜泥肘花就很好吃了。鸡爪感觉就是茶叶蛋的味道。凉拌黄花算是山西特色了，和其他店里没啥区别。肉末面和昨天吃的老柴总店也是比较类似的，不过没有那么咸，更有点汤面的感觉。番茄面比之前在全季早饭吃的稍微好点，主要它不太是那种面是面，浇头是浇头的感觉，相对来说入味点，番茄泥也弄得比较干净。</p><p>因为这两天走得屁股都肿了，所以吃完走到东门打了个车，就直接回去了。</p><h1 id="D6-初六"><a href="#D6-初六" class="headerlink" title="D6 初六"></a>D6 
初六</h1><p>今天在大同逛一下博物馆，就可以回呼和浩特了。早上起来退房，打车去博物馆。我们没有抢到预约，所以只能买特展的票。一开始我以为可以在门口给大爷看完就退了，结果发现这只是一个预检票，后面还有一道需要刷二维码的闸机呢。</p><p>进去之后，逛了一会一楼，上了个厕所，出来然后一个女的就问我们要不要拼团讲解，说是 100 块钱，看我们可能要急着走，就说 80 块钱算了，结果我们就定了。中间想先去看一半穆夏的特展，结果人家说只能进去一次，所以就放弃了，又下来。不过好在那讲解员拉客能力比较强，很快就齐活了，她先照着大厅里面的那个壁画讲了一遍，后面我们才知道，这个壁画并不被公开展示。然后我们就又从恐龙那边开始逛了。不过不同的讲解员的路线略有区别，所以我们这次就没有讲那个编织壶。我问了下，那个恐龙为什么那么完整，讲解员说这个叫什么天镇恐龙，当时挖出来也不是完整的，说是考古学家拼的。反正没懂我意思，因为有些恐龙是会缺几个骨头的，但是这几个恐龙是完整的。</p><p>二楼是比较精华的部分，即北魏展厅，因为大同曾是北魏的都城。其实它有两个展厅，讲解员只带我们逛了第一个。第一个中又分为两个主题，第一个主题是一个墓葬里面的发掘物，包含了漆画屏风以及它的附属，包括挂它的架子、柱子、石墩子都展示出来了。然后还有对应的墓志铭，以及一些陪葬的陶俑。第二个主题是丝绸之路，包含一些陶俑、壁画以及玻璃制品。这里的玻璃制品还是很漂亮的，是非常剔透的蓝色。</p><p>三楼是辽金和明清展厅，大同是辽金的陪都。然后，看到了一个非常大的鸱吻，说是从某个大殿上拿下来的。说鸱吻是龙和鲸鱼的后代，它长得就很像鲸鱼的尾巴。</p><p>讲解团解散之后，就去逛了下穆夏的特展。这个人是画版画的，不过眼睛画得很传神。另外，他也会画油画。</p><h1 id="D7-初七"><a href="#D7-初七" class="headerlink" title="D7 初七"></a>D7 初七</h1><p>今天早上打车去了大召寺。这个寺是藏传佛教的，我不太懂，感觉没啥意思。有一个乃琼庙，里面都是骷髅头，很吓人。</p><p>从大召寺出来，就是塞上老街，我们又去吃了泽成冰煮羊，吃完，就打车去内蒙古博物馆。这个车是真离谱，调个头就花了十几分钟，堵得要死，结果我们等了有 20min 才上车。</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;过年期间去了呼和浩特和大同。这是首个依照 codex 做的旅游规划玩的活动。&lt;/p&gt;</summary>
    
    
    
    
    <category term="游记" scheme="http://www.calvinneo.com/tags/游记/"/>
    
  </entry>
  
  <entry>
    <title>Vibe 一个桌游模拟器</title>
    <link href="http://www.calvinneo.com/2026/01/30/vibe-open-board-game/"/>
    <id>http://www.calvinneo.com/2026/01/30/vibe-open-board-game/</id>
    <published>2026-01-30T15:09:06.000Z</published>
    <updated>2026-02-06T12:44:47.458Z</updated>
    
    <content type="html"><![CDATA[<p>作为一个桌游爱好者，我打算用 Codex 去 Vibe 一个桌游模拟器，这样我可以尝试自定义规则和 Bot 强度。这篇文章我会介绍我 Vibe 的经验。</p><p>我的项目是 <a href="https://github.com/CalvinNeo/OpenBoardGame" target="_blank" rel="noopener">https://github.com/CalvinNeo/OpenBoardGame</a>。</p><a id="more"></a><h1 id="D0"><a href="#D0" class="headerlink" title="D0"></a>D0</h1><p>作为 Demo 实现了一个掼蛋游戏。</p><p>发现问题：</p><ul><li>AI 对规则理解非常不正确，例如缺少对三带二、同花顺炸弹的支持，并且也不能正确限制顺子的长度和一手牌数量的上限。</li><li>AI 对空间感不熟悉，一些提示文字和牌重合。</li></ul><h1 id="Jan28-Jan29"><a href="#Jan28-Jan29" class="headerlink" title="Jan 28 - Jan 29"></a>Jan 28 - Jan 29</h1><p>在这 2 天中，我大概用了 45 刀左右的额度。完成了 Cabo、骷髅牌、你画我猜、璀璨宝石四个游戏逻辑和 Bot 的开发。并且，我还支持了断开重连、房间管理等机制。我还优化了 UI 的美观度和便捷度。</p><p>AI 会理解错一些点。例如 <a href="https://github.com/CalvinNeo/OpenBoardGame/commit/ab26a9c00cd5d9720b39bf3e248b672881cb52ed" target="_blank" rel="noopener">https://github.com/CalvinNeo/OpenBoardGame/commit/ab26a9c00cd5d9720b39bf3e248b672881cb52ed</a> 这个修复 commit 就展示了 AI 对璀璨宝石最大数量的多次理解问题：</p><ol><li>一开始，它根本没有实现这个限制。</li><li>后面，它实现为只有超过 10 才不能拿，但是从 9 到 12 这个行为是被它允许的。</li><li>最后，才修改对了。</li></ol><p>AI 会漏掉一些情况。例如 <a href="https://github.com/CalvinNeo/OpenBoardGame/commit/c8e82736975325f0b9300b471524ec86b005129e#diff-794e220aafafcfac193a89abd6fb142d92255d544728f217d64e7a1162c79e28" target="_blank" rel="noopener">https://github.com/CalvinNeo/OpenBoardGame/commit/c8e82736975325f0b9300b471524ec86b005129e#diff-794e220aafafcfac193a89abd6fb142d92255d544728f217d64e7a1162c79e28</a> 这个修复 commit 展示了 AI 对用户离开规则的遗漏：</p><ol><li>先前，AI 处理了 in game 的情况。此时如果最后一个活人玩家离开游戏，那么这个房间可以被手动清理掉。</li><li>但是，它漏掉了 in lobby 的情况。此时房间并没有开始游戏，那么玩家离开房间（比如创建一个新的房间）不会导致该房间处于可以被清理的状态。</li></ol><p>经验：</p><ul><li>让 agent 缩小阅读的范围。例如可以告诉它“这是完全的前端修改，你不需要看后端代码或者其他游戏的代码”，这样它就可以更快解决问题，并且能节省不少额度。随着项目增大，这一点尤为有效，因为从 thinking 中可以发现它有倾向去学习其他代码是怎么做的。</li><li>让 AI 先整理信息，然后再写一个 design 征求意见非常重要。因为 AI 在实现的时候还是偏向于漏点东西的，这也可能是出于对问题的不正确理解。</li><li>逻辑比较独立的部分，可以让 AI 整理出测试。虽然目前也没看到 AI 会主动修改代码从而 break 掉测试，但这样会更有自信。</li></ul><h1 
id="Jan-30-Jan-31"><a href="#Jan-30-Jan-31" class="headerlink" title="Jan 30 - Jan 31"></a>Jan 30 - Jan 31</h1><p>这几天主要实现了出包魔法师、猜狐狸、截码战三个游戏。</p><h1 id="Feb-1-Feb-3"><a href="#Feb-1-Feb-3" class="headerlink" title="Feb 1 - Feb 3"></a>Feb 1 - Feb 3</h1><p>这几天主要实现了角斗士棋、Store&amp;Load、印象花语、AI 画物语四个游戏。主要是由 Gemini 生成游戏说明书，再由 Codex 生成 design。等我 Review 了之后，再实现代码。</p><p>角斗士棋和印象花语中都涉及到了拖拽旋转对象的设计。我发现 AI 在适配手机端上的拖拽是相对比较蠢的，需要手动告诉它怎么搞。</p><p>角斗士棋实现起来很简单，但是其实要打磨的地方很多：</p><ul><li>手机版如何精准地把方块放到格子里面就很困难。因为每个方块的焦点并不是方块的重心。后来加了上下左右移动的键，但是感觉还是很蠢。</li><li>还有一些奇怪的 bug。</li></ul><p>印象花语这个游戏，我对实体版的一些物理效果进行了模拟：</p><ul><li>因为印章可以印得深浅不一，所以我增加了点按时长来决定透明度的机制。并且我也增加了根据色彩学原理来模拟印章叠加的效果。</li><li>因为可以倾斜印章，使得只盖一半。所以我增加了一个 Mask 机制，可以去 Clip 印章。这个就好像我在按之前垫了一张废纸，这样章可以只盖一半在画布上。</li></ul><p>我觉得这一部分涉及到物理世界的原理，应该只有人类能意识到并且提出来。</p><p>Store&amp;Load 是非常有意思的点，AI 当时给我提供了一个完全不一样的解决方案。例如它执着于解决 Session 的问题，如何处理同名玩家和同名 Room。但是我提出了一个非常简单的做法，就是完全新建一个 Room，然后新玩家可以进去 Claim 老玩家的 Seat。这样做能够减少很多的复杂度。另外，我还发现新创建的房间并不支持 Auto Save 选项了，这个我觉得也说明 AI 本身缺乏引申的能力。</p><p>AI 画物语的实现是跟其他桌游完全不一样的，因为要涉及生成图。所以我开了 <a href="https://github.com/CalvinNeo/aidixitgen" target="_blank" rel="noopener">https://github.com/CalvinNeo/aidixitgen</a> 这个 repo。</p><p>一开始它是用的 pollinations 做的。然后我发现这个工具经常生成水母，然后问了它，它说是提示词的问题：</p><blockquote><p>为什么这个代码里面的 prompt 这么喜欢生成 jellyfish？<br>这是一个非常经典且有趣的 AI 生成现象，通常被称为 “模式坍塌” (Mode Collapse) 或 “安全区偏差”。出现大量水母（Jellyfish）、热气球、云朵、漂浮的岛屿，原因主要有三点：<br>关键词的“刻板印象”： 你的代码里用了 surreal（超现实）、dreamlike（梦幻）、whimsical（异想天开）这些词。 在 AI 的训练数据（Latent Space）里，水母是“梦幻”和“超现实”的超级代表物。它们透明、发光、漂浮，长得就像外星生物，完美符合 AI 对“梦境”的理解。只要你还要“唯美”和“超现实”，AI 就会懒惰地选择水母。<br>Prompt 缺乏约束： 我们在请求 AI 写创意时，给的指令太宽泛了（”Generate ONE single short… description”）。当指令太宽泛时，AI 会倾向于输出它认为“最稳妥、最符合题意”的高概率词汇。对于 Dixit 风格，水母就是那个“最稳妥”的答案。<br>风格修饰词的引导： 我们在画图时强制加了 soft colors（柔和色彩）和 floating（虽然没写但 dreamlike 隐含了漂浮感）。这也极大地增加了生成海洋生物或云端生物的概率。</p></blockquote><p>然后，我又发现它生成的东西太容易重复了，例如几乎生成了两个只是风格不同的钟。所以我后面又设计了双主体的方案，结果最后生成的结果是比较理想的了。</p><p>后来发现 pollinations 一直 502 了，所以就换 Hugging Face 了。</p><h1 
id="Feb-4-Feb-6"><a href="#Feb-4-Feb-6" class="headerlink" title="Feb 4 - Feb 6"></a>Feb 4 - Feb 6</h1><p>这几天主要实现了前端美化、Flip 7、德国心脏病、绝妙误解。</p><p>主要是由 Gemini 生成游戏说明书以及设计。然后由 Codex 去实现。但是我要求 Codex 在实现前先就不清楚的地方问我，而不是自己随便实现一版。事实证明，让 Codex 去问一下自己不知道的，而不代替我做决定，是很重要的。</p><p>AI 前端的主要问题：</p><ul><li>UI 直白<br>  例如用一个列表表示玩家信息。用一个表格表示当前状态。这对玩家而言体感不好，感觉是在上班。</li><li>没有交互设计<br>  特别是手机端玩家，操作的时候需要翻来翻去。</li><li>UI 可能存在 Bug<br>  例如在手机端会发现 Flip 7 的 Game 面板会非常小。</li></ul><p>德国心脏病的开发是非常典型的。主要包含几点：</p><ul><li>得到的开发计划是经典版的 Halli Galli，也就是牌上的水果一定是相同的。这个跟我们玩的不一样，所以后面让它开发了一个 DLC 一样的东西。</li><li>电脑根本没有给翻牌和按铃的等待时间，所以加上 bot 之后，基本 bot 都是秒按铃，秒翻牌，根本没法玩。即使没有 bot，我们也需要考虑人类的反应时间，以及各个网络的延迟。所以我这里要求加了等待 3s 的按铃时间，以及在点击翻牌后，有一个 1s 的倒计时，方便大家准备看新水果。</li><li>AI 生成的界面依然是列表，这个我让 AI 改成了围成一个圆。</li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;作为一个桌游爱好者，我打算用 Codex 去 Vibe 一个桌游模拟器，这样我可以尝试自定义规则和 Bot 强度。这篇文章我会介绍我 Vibe 的经验。&lt;/p&gt;
&lt;p&gt;我的项目是 &lt;a href=&quot;https://github.com/CalvinNeo/OpenBoardGame%E3%80%82&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://github.com/CalvinNeo/OpenBoardGame。&lt;/a&gt;&lt;/p&gt;</summary>
    
    
    
    
    <category term="数据结构" scheme="http://www.calvinneo.com/tags/数据结构/"/>
    
    <category term="VibeCoding" scheme="http://www.calvinneo.com/tags/VibeCoding/"/>
    
  </entry>
  
  <entry>
    <title>3FS 学习</title>
    <link href="http://www.calvinneo.com/2026/01/15/study-3fs/"/>
    <id>http://www.calvinneo.com/2026/01/15/study-3fs/</id>
    <published>2026-01-15T15:09:06.000Z</published>
    <updated>2026-01-15T09:26:56.813Z</updated>
    
    <content type="html"><![CDATA[<p>学习 3FS。</p><a id="more"></a><h1 id="设计"><a href="#设计" class="headerlink" title="设计"></a>设计</h1><p>主要概括了 design notes。</p><h2 id="需要解决的问题"><a href="#需要解决的问题" class="headerlink" title="需要解决的问题"></a>需要解决的问题</h2><p>OSS 方面：</p><ul><li>现在的 OSS 并不支持原子地移动一系列文件或者一整个目录，或者递归地删除整个目录。而 3FS 的场景（实际上数据库的场景也是这样）涉及要创建一个临时目录，然后对这个目录写入数据，最后将这个目录 move 到最终的位置。</li><li>3FS 需要广泛使用 symbolic 或者 hard 链接</li><li>提供一个熟悉的文件接口，我理解这也是 3FS 选择 FUSE 的原因</li></ul><p>FUSE 方面：</p><ul><li>在 <a href="/2025/03/09/learn-fuse/">Fuse 学习</a>一文中介绍了为什么 FUSE 不支持 Zero Copy。这也是 FUSE 的缺点之一。</li><li>FUSE 使用一个由 spin lock 保护的多线程共享的队列。3FS 团队的测试显示，在 400K 4KiB reads per second 的负载下，spin lock 上的 lock contention 成为了瓶颈。</li><li>Linux 5.x 上的 fuse 不支持对一个文件的并发写。所以很多需要更大带宽的程序会并发写多个文件。</li><li>对小的随机非对齐读性能不好，SSD 和 RDMA 网络的带宽没有被利用充分。</li></ul><p>将 client 实现为一个 VFS 内核模块则能解决上面说的问题，但也更有挑战性。内核的 bug 更难定位和修复。此外，升级的时候需要停掉所有访问这个 fs 的进程，或者重启。</p><p>因此，3FS 选择在 FUSE daemon 里面设计一套原生的 client，由它来支持异步的 Zero copy IO。其中，File meta operation 例如 open、close 等还是被 FUSE daemon 处理。但是在 open 的时候会把拿到的 fd 通过 native API 注册。然后就可以通过 native client 去读取数据了。</p><p>这个 API 类似于 io_uring，其中关键结构如下：</p><ul><li>Iov<br>  user process 和 native client 共享的内存</li><li>Ior<br>  user process 和 native client 通过这个 ring buffer 进行交互。具体方式类似于 io_uring。<br>  请求会被按照 io_depth 攒批执行，不同的 batch 的执行是并行的。</li></ul><h2 id="Metadata-存储"><a href="#Metadata-存储" class="headerlink" title="Metadata 存储"></a>Metadata 存储</h2><h3 id="chunk-的分布"><a href="#chunk-的分布" class="headerlink" title="chunk 的分布"></a>chunk 的分布</h3><p>文件以 chunk 为粒度，被条带化到多个 replication chain 中。<br>创建新文件的时候，会根据 stripe size，使用 round robin 的方式，选择一系列的 chain。选出来的这些 chain，会随机分给不同的 chunk 写入。</p><h3 id="存储-file-atributes"><a href="#存储-file-atributes" class="headerlink" title="存储 file attributes"></a>存储 file attributes</h3><p>为什么 3FS 的 inode 里的 length 会不准确？因为写路径走的是 CRAQ，而 inode 是在 metadata service 里面的。如果每次写操作完都更新一下 metadata，那么会多一次 metadata RTT，写放大严重，吞吐和延迟都会变差。<br>但是如果 metadata 迟迟不更新，例如是 100MB，而客户端写到 120MB
就挂了，此时，虽然数据已经通过 CRAQ 持久化到 chunk 存储了，但因为读的时候从 metadata 获得的长度偏小，所以还是在效果上丢失数据。<br>一种方式是按照 interval 上报更新 metadata，但这就存在不一致窗口。不过先考虑容灾问题，大概有两个方案：</p><ol><li>重启之后，从 chunk 存储中恢复数据，并由此更新 metadata。但是从 chunk 扫描数据恢复的代价很大</li><li>3FS 的设计是由 client 按照 interval 上报 max writer position，因为 client 上报的 position 一定是已经被 tail 提交了的，所以是安全的。但是如果 client 长期丢失，那么 gap 就得通过第一种方式补齐了。</li></ol><h2 id="Chunk-存储"><a href="#Chunk-存储" class="headerlink" title="Chunk 存储"></a>Chunk 存储</h2><p>Suppose there are 6 nodes: A, B, C, D, E, F. Each node has 1 SSD. Create 5 storage targets on each SSD: 1, 2, … 5. Then there are 30 targets in total: A1, A2, A3, …, F5. If each chunk has 3 replicas, a chain table is constructed as follows.</p><table><thead><tr><th align="center">Chain</th><th align="center">Version</th><th align="center">Target 1 (head)</th><th align="center">Target 2</th><th align="center">Target 3 (tail)</th></tr></thead><tbody><tr><td align="center">1</td><td align="center">1</td><td align="center"><code>A1</code></td><td align="center"><code>B1</code></td><td align="center"><code>C1</code></td></tr><tr><td align="center">2</td><td align="center">1</td><td align="center"><code>D1</code></td><td align="center"><code>E1</code></td><td align="center"><code>F1</code></td></tr><tr><td align="center">3</td><td align="center">1</td><td align="center"><code>A2</code></td><td align="center"><code>B2</code></td><td align="center"><code>C2</code></td></tr><tr><td align="center">4</td><td align="center">1</td><td align="center"><code>D2</code></td><td align="center"><code>E2</code></td><td align="center"><code>F2</code></td></tr><tr><td align="center">5</td><td align="center">1</td><td align="center"><code>A3</code></td><td align="center"><code>B3</code></td><td align="center"><code>C3</code></td></tr><tr><td align="center">6</td><td align="center">1</td><td align="center"><code>D3</code></td><td align="center"><code>E3</code></td><td align="center"><code>F3</code></td></tr><tr><td align="center">7</td><td 
align="center">1</td><td align="center"><code>A4</code></td><td align="center"><code>B4</code></td><td align="center"><code>C4</code></td></tr><tr><td align="center">8</td><td align="center">1</td><td align="center"><code>D4</code></td><td align="center"><code>E4</code></td><td align="center"><code>F4</code></td></tr><tr><td align="center">9</td><td align="center">1</td><td align="center"><code>A5</code></td><td align="center"><code>B5</code></td><td align="center"><code>C5</code></td></tr><tr><td align="center">10</td><td align="center">1</td><td align="center"><code>D5</code></td><td align="center"><code>E5</code></td><td align="center"><code>F5</code></td></tr></tbody></table><p>这里的 Version 是配置的 Version，节点下线会导致这个增大。</p><p>这里的一个 Chain 类似于一个 Raft Group 的概念。但是它也不是像 TiKV 一样跟某一段数据绑定的。一个 Chain 可以被多个 chain table 包含。引入 chain table 的概念，这样对于每一个 file，metadata service 就可以为它选一个 chain table，并根据这个 table 中的 Chain 去 strip 这个 file 的所有 chunk。</p><h3 id="Balanced-traffic-during-recovery"><a href="#Balanced-traffic-during-recovery" class="headerlink" title="Balanced traffic during recovery"></a>Balanced traffic during recovery</h3><p>如果一个节点 A 故障了，就需要由 Chain 中的其他节点来承担原来 A 的流量。而之前的 Chain table 中，A 节点基本上只和 B、C 玩。</p><p>在新的架构中，A 在 Chain 2 里和 B/D 在一起，在 Chain 5 里和 C/F 在一起。</p><h3 id="Data-replication"><a href="#Data-replication" class="headerlink" title="Data replication"></a>Data replication</h3><p>一个 Write request 可能是从 client 或者 Chain 的前驱发送出来的。一个节点收到 Write request 后的处理：</p><ul><li>校验 write request 中的 chain version。</li><li>通过 RDMA Read 去 pull 写入的数据。如果 client 或者前驱挂掉了，导致拿不到数据。写入就 abort。</li><li>Once the write data is fetched into local memory buffer, a lock for the chunk to be updated is acquired from a lock manager. Concurrent writes to the same chunk are blocked. 
All writes are serialized at the head target.</li><li>读取这个 Chunk 的 committed version，对它 apply change，然后将更新后的版本存储为 pending version。版本号是单调连续递增的。</li><li>If the service is the tail, the committed version is atomically replaced by the pending version and an acknowledgment message is sent to the predecessor. Otherwise, the write request is forwarded to the successor. When the committed version is updated, the current chain version is stored as a field in the chunk metadata.</li><li>When an acknowledgment message arrives at a storage service, the service replaces the committed version with the pending version and continues to propagate the message to its predecessor. The local chunk lock is then released.</li></ul><h1 id="实现"><a href="#实现" class="headerlink" title="实现"></a>实现</h1><h2 id="Chain-replication"><a href="#Chain-replication" class="headerlink" title="Chain replication"></a>Chain replication</h2><p>CRAQ 最大的特点是：复制链路是固定的，而 Raft 的复制链路因为存在 Quorum 是随机的。因此，对于单个 entry 的复制，Raft 的延迟可能会比 CRAQ 好，但是 CRAQ 的延迟和吞吐更稳定，可以被预测。</p><p>延迟维度的对比：</p><ol><li>Raft 最优<br> a. 条件：follower 延迟几乎一致、没有网络问题<br> b. commit latency 约等于 RTT</li><li>Raft 最劣<br> a. 条件：quorum 中某个 follower 抖动，例如出现了 Write Stall<br> b. commit latency 等于 max(quorum follower latency)，出现了木桶效应</li><li>CRAQ 最优<br> a. 条件：链上每个 node 的 hop_latency 相同，管道用满<br> b. commit latency 约等于 chain_length * hop_latency。因为 CRAQ 在 Tail 返回确认，所以 3 副本也是两次数据传输，和 Raft 的 1 个 RTT 是接近的。但是链长了，CRAQ 就会变慢。</li><li>CRAQ 最劣<br> a. 条件：链上每个 node 变慢<br> b. 存在木桶效应，每个 entry 的提交都被固定变慢。</li></ol><p>吞吐维度的对比：</p><ol><li>Raft 最优<br> a. 条件：同延迟。<br> b. Throughput ≈ follower replication rate</li><li>Raft 最劣<br> a. 条件：同延迟。<br> b. Throughput 约等于 1 / avg(max(follower latency))，这里取 avg 是因为抖动不均匀，每次的多数派不一定相同。</li><li>CRAQ 最优<br> a. 条件：链上每个 node 的 hop_latency 相同<br> b. Throughput ≈ min(hop throughput)，这个很简单，木桶效应嘛。</li><li>CRAQ 最劣<br> a. 条件：存在一个很慢的 node<br> b. 同上，但是木桶效应被放大很明显。</li></ol><p>CRAQ 的其他特点：</p><ol><li>没有 quorum 容错</li><li>failover 不需要重新选主，但需要重构链。需要根据挂的是哪一个来讨论。</li></ol><h1 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h1><ul><li><a href="https://github.com/deepseek-ai/3FS/blob/main/docs/design_notes.md" target="_blank" rel="noopener">https://github.com/deepseek-ai/3FS/blob/main/docs/design_notes.md</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;学习 3FS。&lt;/p&gt;</summary>
    
    
    
    
    <category term="数据库" scheme="http://www.calvinneo.com/tags/数据库/"/>
    
    <category term="aiinfra" scheme="http://www.calvinneo.com/tags/aiinfra/"/>
    
  </entry>
  
  <entry>
    <title>用英文写作计算机博客</title>
    <link href="http://www.calvinneo.com/2026/01/12/write-blogs-in-english/"/>
    <id>http://www.calvinneo.com/2026/01/12/write-blogs-in-english/</id>
    <published>2026-01-12T14:42:32.000Z</published>
    <updated>2026-01-18T19:06:49.071Z</updated>
    
    <content type="html"><![CDATA[<p>介绍下用英文写作计算机博客的一些经验。</p><a id="more"></a><h1 id="常见的表达"><a href="#常见的表达" class="headerlink" title="常见的表达"></a>常见的表达</h1><h2 id="避免直接翻译汉语词"><a href="#避免直接翻译汉语词" class="headerlink" title="避免直接翻译汉语词"></a>避免直接翻译汉语词</h2><p>少用汉语式的名词化表达，例如“执行 xx 行动”，“处理 xx 过程”。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">the release of the lock is performed</span><br><span class="line"></span><br><span class="line">===&gt;</span><br><span class="line"></span><br><span class="line">the lock is released</span><br></pre></td></tr></table></figure><p>不要想当然地把动词形容词化。如下，adaptive 的意思是自适应的，和我们实际要指代的“适配代码的行为”是不对应的。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">We need to **adapt** the code to the new behavior. However, this **adaptive** work is not easy.</span><br><span class="line"></span><br><span class="line">===&gt;</span><br><span class="line"></span><br><span class="line">... 
However, the adaptation work is not easy.</span><br></pre></td></tr></table></figure><h3 id="实词虚化、具体词抽象化"><a href="#实词虚化、具体词抽象化" class="headerlink" title="实词虚化、具体词抽象化"></a>实词虚化、具体词抽象化</h3><p>汉语中，很多虚词功能是通过复用实词来的。但是在英语中，如果有对应的虚词，就不要直接翻译汉语中的实词了。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">Based on this reason, ...</span><br><span class="line"></span><br><span class="line">===&gt;</span><br><span class="line"></span><br><span class="line">As a result, ...</span><br><span class="line">For this reason, ...</span><br></pre></td></tr></table></figure><p>又例如下面的“带来麻烦”，这里的带来没必要用 bring 这个实词</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">brings lots of trouble to investigate ...</span><br><span class="line"></span><br><span class="line">===&gt;</span><br><span class="line"></span><br><span class="line">made the issue difficult to investigate ...</span><br><span class="line">// Or</span><br><span class="line">significantly complicated the investigation to ...</span><br></pre></td></tr></table></figure><p>又例如下面的“定位到问题”</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">The problem is located by ...</span><br><span class="line"></span><br><span class="line">===&gt;</span><br><span class="line"></span><br><span 
class="line">The problem was also detected by ...</span><br></pre></td></tr></table></figure><h3 id="虚词实化"><a href="#虚词实化" class="headerlink" title="虚词实化"></a>虚词实化</h3><p>但是，一些汉语中的虚词，在英语中要实化。例如，“这会导致难以理解的代码”，就是</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">It could result in confusing codes.</span><br><span class="line"></span><br><span class="line">===&gt;</span><br><span class="line"></span><br><span class="line">It could generate confusing codes.</span><br></pre></td></tr></table></figure><p>这个原则甚至不限于词，对于任何表达都是这样。<code>This may cause problems</code>，建议直接具体一点说是什么 problems。如</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">This behavior can cause a deadlock when two tasks hold the mutex across an await point.</span><br><span class="line">// Or</span><br><span class="line">This design is problematic in terms of concurrency safety, because it allows a task to hold a mutex across an await point.</span><br></pre></td></tr></table></figure><h2 id="需要调整句子结构"><a href="#需要调整句子结构" class="headerlink" title="需要调整句子结构"></a>需要调整句子结构</h2><p>尽量避免使用形式主语 It is 或者 there is 等来拖长表达</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">It is hard to inspect the state.</span><br><span class="line"></span><br><span class="line">===&gt;</span><br><span class="line"></span><br><span class="line">Inspecting the state is 
difficult.</span><br></pre></td></tr></table></figure><p>但注意，非形式主语是可以用 it 的，如下所示，这好过说 “I think” 等</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">It indicates ...</span><br><span class="line">This implies ...</span><br><span class="line">The result shows ...</span><br></pre></td></tr></table></figure><p>从下面的例子中，能感觉到动名词前置的用法，更干练</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">I don&apos;t think the service itself should persist the updated configuration to the config file. Instead, this should be handled by the operator.</span><br><span class="line"></span><br><span class="line">===&gt;</span><br><span class="line"></span><br><span class="line">Persisting the updated configuration should be the responsibility of the operator, not the service itself.</span><br></pre></td></tr></table></figure><h2 id="语序"><a href="#语序" class="headerlink" title="语序"></a>语序</h2><p>副词顺序应该稳定在动词后。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">We must immediately do this.</span><br><span class="line"></span><br><span class="line">===&gt;</span><br><span class="line"></span><br><span class="line">We must do this immediately.</span><br></pre></td></tr></table></figure><h2 id="其他"><a href="#其他" class="headerlink" title="其他"></a>其他</h2><p>下面的表达，相比更能体现出不仅不能做之前说的，也不能做现在说的。而 Also 显得更像一个连接。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">Also, A can&apos;t do sth.</span><br><span class="line"></span><br><span class="line">===&gt;</span><br><span class="line"></span><br><span class="line">Moreover, A cannot do sth either.</span><br><span class="line">We can&apos;t ..., nor can we ... .</span><br></pre></td></tr></table></figure><h1 id="使用更恰当的单词"><a href="#使用更恰当的单词" class="headerlink" title="使用更恰当的单词"></a>使用更恰当的单词</h1><h2 id="使用更精确的词"><a href="#使用更精确的词" class="headerlink" title="使用更精确的词"></a>使用更精确的词</h2><p>避免使用含义过于宽泛的词，换用更精确的词，例如：</p><p>possible -&gt; feasible</p><h2 id="使用恰当的搭配"><a href="#使用恰当的搭配" class="headerlink" title="使用恰当的搭配"></a>使用恰当的搭配</h2><p>注意动词和名词的习惯搭配。</p><h2 id="辩证具体含义的差别"><a href="#辩证具体含义的差别" class="headerlink" title="辨析具体含义的差别"></a>辨析具体含义的差别</h2><p>如：</p><ul><li><a href="https://english.stackexchange.com/questions/277073/which-is-correct-confident-in-or-confident-of?newreg=42e56658dfef4cb1b187d36e10d24f6d" target="_blank" rel="noopener">be confident of 和 be confident in</a></li></ul><h1 id="宏观写法"><a href="#宏观写法" class="headerlink" title="宏观写法"></a>宏观写法</h1><p>先给结论，再给解释（Top-down writing）。</p><h1 id="特定场景的表达"><a href="#特定场景的表达" class="headerlink" title="特定场景的表达"></a>特定场景的表达</h1><h2 id="偏向数据分析"><a href="#偏向数据分析" class="headerlink" title="偏向数据分析"></a>偏向数据分析</h2><p>表达“A使用的内存占A的调用者的比重，相比B使用的内存占B调用者的比重是相近的”，下面几种说法从书面到口语排列。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">// 这里的 footprint 更专业术语一点。</span><br><span class="line">A’s memory footprint relative to its callers is similar 
to B’s.</span><br><span class="line"></span><br><span class="line">A and B exhibit similar memory-to-caller ratios.</span><br><span class="line"></span><br><span class="line">The proportion of memory used by A relative to its callers is similar to that of B.</span><br><span class="line"></span><br><span class="line">A and B have similar memory usage ratios with respect to their callers.</span><br></pre></td></tr></table></figure>]]></content>
    
    
    <summary type="html">&lt;p&gt;介绍下用英文写作计算机博客的一些经验。&lt;/p&gt;</summary>
    
    
    
    
    <category term="English" scheme="http://www.calvinneo.com/tags/English/"/>
    
  </entry>
  
  <entry>
    <title>太原游记</title>
    <link href="http://www.calvinneo.com/2026/01/04/traval-in-taiyuan/"/>
    <id>http://www.calvinneo.com/2026/01/04/traval-in-taiyuan/</id>
    <published>2026-01-03T18:20:13.000Z</published>
    <updated>2026-01-09T07:37:06.530Z</updated>
    
    <content type="html"><![CDATA[<p>总的来说，太原是类似于西安的城市，类似的气候，类似的历史文化。但就游玩体验而言，太原总体略优于西安，主要是：</p><ul><li>西安商业化太浓重了，例如大唐不夜城、城墙上骑车等。当然这也不是坏事，但如果玩的多了，就会觉得商业化严重的景点如同预制菜一样，不能说不好吃，但感觉容易腻</li><li>西安人太多了</li></ul><p>但太原的问题主要是：</p><ul><li>交通很不方便。虽然也看到它有专门的旅游线路，并且感觉是用心的，但体验上确实还有欠缺。</li></ul><a id="more"></a><h1 id="D1"><a href="#D1" class="headerlink" title="D1"></a>D1</h1><p>我们是早上 7.55 的飞机，所以定了个 5.10 的送机，相比打车要贵一点，但好在不需要担心到时候没有车。正好前一天是跨年夜，我们就去雨花台吃了点饭，逛了下雨花万象，就住在雨花了。因为我经典的失眠，导致那一天我实际上没怎么睡。</p><p>在机场躺大座睡了一个小时，然后登机门开了，风吹进来冷得要死，没法继续睡了。好在也快登机了。这鬼飞机是摆渡车，真的冷死，外面还下着雪。</p><p>到了机场，出门就能打车，打车秒打，司机就在停车场，很快就上了车。太原机场离市区是真的近，离太原南站也很近，所以是非常适合中转的城市，这很类似之前去过的张家界。</p><p>北齐壁画博物馆分为三个展厅：</p><ul><li>第一展厅是娄叡墓的壁画。比较有印象的一个是狩猎图，一个是升天图。有印象的细节一个是马一边跑一边被吓得拉屎。另一个是雷公，长得特别奇怪。</li><li>第二展厅徐显秀墓。因为整个博物馆实际上就是盖在这个墓上面的，所以比较有趣。我们可以看到当时发现这个墓的土丘上的盗洞。值得一提的是，虽然我们没法下到最下面的墓里面去，但是旁边有个清晰度贼高的 VR 可以看，不仅壁画很清楚，而且能看到总共五个盗洞。我想，为什么我们看不到棺椁，可能就是这么多盗洞把它们都偷走了吧。另外，后面在山西博物院中，我们能看到这个墓里面的壁画，但不知道是仿制品，还是移过来的。</li><li>第三展厅是拼好展，里面展示了不同的墓的壁画。</li></ul><p>从北齐壁画博物馆出来，打车非常难打。索性看到有个公交车，就上去，发现好像是个旅游专线，只到双塔公园和火车站。用微信乘车码就可以刷码，坐上去之后，司机一个人发了一张门票一样的东西，实际上就是旅游专线的车票，感觉挺用心的。</p><p>考虑了下，我们觉得先回宾馆比较好。一来，双塔公园离着纯阳宫和宾馆都很远，后面两者反而比较近。二来，我那个包确实太重了。于是打了辆车，直奔宾馆。</p><p>到了宾馆，准备出门，发现山西博物院突然放了一千多张票。于是打电话问，答复说只要是今天的时间预约的就可以来。接线的是个女人，用非常热情的语气表达了对我们的欢迎，非常点赞。所以，我们临时改了计划，去看山西博物院。</p><p>山西博物院就靠着我们的亚朵，中间隔了自然博物馆和图书馆。走过去的时候，门口就排了个小队了。进去之后，发现博物馆里面全是人。这个博物馆主要是四层，重点的是在 2/3 两层，另外有几个重要展物被放到了第 1 层和第 4 层。山西博物馆主要就是看各种青铜器。四楼有很多主体展厅，比如玉、应县木塔、钱币等。总体感觉这博物馆年代比陕历博要老，基本上全是青铜器。比较有印象的一个是那个猫头鹰，还有就是大蜗牛上骑着小蜗牛，以及博物馆徽标上的大鸟上骑小鸟的。有个叫雁鱼铜灯的文物好像被国博借走了，直接在下面放了个不知道是啥的青铜器掩耳盗铃。</p><p>从山西博物馆出来，快五点了，我们赶快打车去河东颐祥阁。事实证明这决定很正确，因为我们到的时候饭店还没开门。但是我们刚坐下来，一堆人就陆陆续续进来了，然后他们就要排队。我们在饭店点了非常多的东西，在此点评一下：</p><ul><li>涮肚 我老婆说这个麻辣烫很好，因为里面的麻酱没有麻酱味，她能接受。我觉得也不错。</li><li>芥末凉粉 很爽口。</li><li>风葫芦 感觉是炸的一个脆皮球，然后里面是很嫩的鸡蛋。我对象很喜欢，我觉得还行，但没有特别好吃。</li><li>黄米排骨 这个是最贵的，但吃起来感觉一般。</li><li>麻辣串 实际上就是豆干，但是加了他那个口感很椒麻胡辣的汤汁，很好吃。我对象不喜欢胡椒和辣口味，她不怎么喜欢。</li><li>羊腿 我觉得很好吃，我对象也觉得。而且她觉得不太膻。</li><li>槐花 
小学里课文学过，但实际上是第一次吃槐花，感觉挺好吃的。没想到是咸的，感觉有点吃野菜的感觉，但并不像野菜那样有茎的感觉。</li><li>野菜丸子 感觉就是北方的那种丸子。</li><li>绿豆糕 挺甜的，但是很好吃。</li><li>烧饼 有点像肉夹馍的馍，但是是很热的，吃起来很香。</li></ul><p>从颐祥阁出来，已经打包了一堆东西，我们打算去鼓楼街再去看看。太原的地铁在长风街这一段特别稀疏，但吃的又都在这里，所以我们走了好远上了地铁。从府西街站下来，我们拐进了帽儿巷。</p><p>刚去的时候，人是一般多，我老婆还说这个离长沙差远了。我们随便走走，就看到一家要排队的卖麻花的，说特别有名。买了黑芝麻和蜂蜜的，感觉就是小时候的味道，有点类似于我奶奶买的那个火腿肠面包。我记得那个面包皮我特别喜欢，就是这个麻花的味道。</p><p>我们后面又买了枣糕和碗托。碗托的那个面感觉挺好吃的，是糯糯的。而很早之前我对象在淘宝买的感觉就很干，跟那种硬胶水一样。不过我当时已经又累又撑，不太吃的下了。</p><p>我准备往地铁站走准备回去了，但是我老婆说，这个鼓楼街，我们还没见到鼓楼呢。于是走到横过来的一条街上，这条街就洋气很多了，都是一些洋人的建筑。再往前走，还有一个巨大的钟楼。这里人就开始特别多了，有种长沙的感觉。</p><p>在认一力吃了几道菜：</p><ul><li>羊肉水饺 点了两种，但感觉都一个味道。我老婆说羊肉味特别大，根本没法吃。我觉得还行，但是感觉很咸。感觉羊肉馅里面加了很多葱姜蒜。另外，第二天我从冰箱里面拿出来的时候，感觉它们散发着一股呕吐味。但加热之后，又能吃了，感觉还行。</li><li>沙棘醪糟 感觉挺好吃的，很清淡，不是特别的酸，也不怎么甜。但是很爽口。</li><li>头脑 头脑里面的羊肉挺好吃的，炖的烂。但是那个粥一样的东西就是一言难尽了。吃起来感觉就好像是熬的很浓的羊油和粥一起煮成的粘稠物。但是呢，它本身又不具备酸甜苦辣咸这些基础的味道，所以吃在嘴里就是一股腥味。</li></ul><p>从认一力出来，人就开始比肩接踵了，于是我们准备往回走了。当时我已经又撑又困到神志不清了，终于走到了柳南地铁站。坐地铁到太原理工大学站，发现到酒店还要走好远。</p><p>晚上回来困得要死，倒头就睡。中途 23 点的时候被老婆叫起来看，原来是酒店对面就放起了烟花。话说山西今年不禁止放烟花，所以大家过年都在放，感觉挺热闹的。汾河上的那条龙今天晚上也是一直亮着灯，还挺好看的。</p><h1 id="D2"><a href="#D2" class="headerlink" title="D2"></a>D2</h1><p>早上起的有点晚，睡了大概 10 
个多小时。吃了个酒店的自助餐，顺便把昨天的水饺热了下。</p><p>打车去晋祠，这一路挺远的，司机开得贼快。太原的早晨全是雾霾，路过还有几个大烟囱在冒着白气，可能是供暖工厂吧。到了晋祠，寻思着买了个天龙山的门票。晋祠一进去是非常大的免费公园，里面做了一堆假山雕塑啥的，有个唐太宗的雕塑后面还出现在太原的城市宣传画上，但是感觉雕的人脸都一样。</p><p>晋祠博物馆是挺有意思的，原本我以为它就是一个祠堂，拍照打卡走人这样，但里面挺大的，而且建筑的排布很好看，基本上处处是景。</p><p>游览这个景点，必须要有个讲解。我当时还是用的旅途随身听，感觉讲的是足够了。总而言之，这个晋祠原来是纪念一个儿子的，后来慢慢的，妈妈名气反而更大了，所以改为了主要祭奠妈妈，结果导致儿子的祠堂偏安在整个晋祠的最角落。</p><p>一进门就是晋祠最网红的孙悟空同款了，但其实它并不是祠堂，而是一个戏台，叫水镜台。很多人在正面拍照合影，走到背面有个康熙题的匾额。</p><p>然后，我们就按照旅途随身听的导览逛了，先逛了一遍关帝庙、岳飞庙等，我也认识了一堆歇山顶、硬山顶等屋顶的类型，从而判断规格。然后就到了儿子的唐叔虞的祠堂。这哥们属实惨，真的就在最角落，门口还贼小。要不是随身听，我可能都不会进去。</p><p>从唐叔虞祠堂出来，就看到前面一堆人。这里应该就是晋祠最核心的地方，也就是圣母堂了。圣母堂有几点比较独特：</p><ul><li>堂前的十字桥，据说是首创。这个桥在晋祠外围的免费公园中也有个拙劣的模仿。</li><li>七开间的规格，每个柱子上都有龙盘旋，这些龙长得都不一样，有的还很抽象。</li><li>圣母堂中的宋代雕塑，惟妙惟肖。</li><li>圣母堂旁边有个周朝的柏树，然后它被另一棵柏树架着。</li></ul><p>从圣母堂转了一圈出来，旁边还有一个苗裔娘娘堂，再绕出来，就到了难老泉。这个说是晋阳第一泉，感觉就是比较谦虚了，毕竟无锡还有个天下第二泉呢。这个亭子是北齐建的，所以可以看到斗拱是非常的大。虽然后面在嘉靖年返修了，但仍然是使用了北齐的手法。而它对面的亭子，斗拱就很小，一看就是明清的建筑。其实在晋祠中，分布着不同朝代的建筑，我们都可以从斗拱的大小，以及房顶上鸱吻头和尾巴的比例来分辨年代。晋祠中比较主要的建筑如圣母殿等都是在唐左右建设的。</p><p>难老泉会通过一个龙头流到旁边的一条小溪里面，然后很多人排队在那里接水。我们绕过那群人，就可以走到子乔祠、董寿平美术馆那一带，不过那些就不是重点了。</p><p>我去逛了逛美术馆，我老婆根本就没去逛，而是再绕回来到正面，到了圣母堂之前的一些建筑，如会仙桥、献殿等。比较有意思的是金人台，上面有个非常小的小楼，不知道那是干嘛的。献殿上面有个万历四年的匾额，我对象说是不是那个万历四年春，我说那是庆历四年春。</p><p>从晋祠出来，下一站就是天龙山。但我们必须走很长的路才能穿过外围的免费公园。我们还路过了晋文公艺术博物馆，不过没进去看，不知道里面怎么样。博物馆附近有很多大湖环绕，里面可以划船，但是现在是冬天，这些湖面都结冰了。博物馆前按照先秦的特色，搞了点夯土柱子。绕过博物馆，就到了出口，在这里进行了一些简单的修整，买了个烤红薯和上海阿姨，就去找景区交通。</p><p>公交站排队等了比较长的时间，感觉得有至少十分钟，车来了。因为我们觉得不太妙，所以当时就站的比较靠前，所以有座位坐，但没想到这个公交车是不给站的。所以我们眼睁睁看着后面的人上不来，司机跟他们说等下一班吧。这公交车开了，没一会就上了盘山公路。这路应该太原花了很多钱搞了个旅游公路，很多盘山公路被搞成了类似南浦大桥那样的展线，所谓网红桥。这些展线上停了一堆小轿车，人们在那里往低处拍照片。到了龙门站，我们下车，其实已经花了几十分钟了。</p><p>这天龙山其实很坑，它主要是看石窟的，分为东峰和西峰，但是我们去的时候东峰全封掉了，所以我们就直接往上走回去了，也没再去下面的天龙寺。具体到西峰石窟，里面也是丢的丢，乱涂乱画的也有。后面打了个车去国宝馆，里面看到了东峰石窟中第 8 
窟的佛头，说是被日本人之前抢走了的，后来被中国又追回来了。国宝馆旁边还有个数字馆，就是给你看看一些石窟的宣传片。</p><p>从国宝馆直接坐车到晋祠，下了车就打车去植物园。太原植物园很大，一进去是一个湖，湖面已经结冰。围湖种了一圈银杏，叶子很漂亮，不过走近一看发现是假的。因为是冬天，主要能逛的就是几个温室。这几个温室设计得都很好，造景很有一套，并且有高低不同的步道，可以方便游客游览观看不同高度的植物。热带雨林馆里面有个瀑布，大家挤在那里拍照。不过我最喜欢的还是沙生植物馆，一进门是三个大黄柱子，加上远处落日，夕阳照在上面的感觉，给我一种置身于沙漠或者火星表面的感觉。另外还有一个蝴蝶馆，以及一个园艺馆，反正都挺好看的。太原植物园还有萤火虫看，不过冬天就暂停了。</p><p>最终也没等到植物园那个网红天花板亮灯，而是五点二十不到就打车走了。当时打车很容易，不过刚开到市区就发现植物园门口堵红了。我觉得这里的交通设计有问题，汽车要去植物园门口，实际上要走很远到前面掉头才行。这也是太原的一个特点，快速路很多，但是跨越高速路，或者掉头，就很麻烦了。</p><p>去吃东北爱情麻辣拌，在一个非常荒芜的小巷中的破落小平房，里面墙上天花板上都是之前顾客的留言。只能选择加料，以及辣度。我加了麻花和什么的，我对象加了另外一种。这麻辣拌全是素的，全是碳水，要不就是豆制品。旁边一个京爷在吹牛逼说自己老爹是个煤矿里面的啥科长，然后后面北京就不让挖煤了 blabla，说房子贵得要死，幸亏买得早。</p><p>从麻辣拌店出来，不远就能走到附近的上帝炸鸡，因为我觉得尽管有外卖，但是没必要一定吃总店。路上还遇到一个卖碗托的路边摊子，不过这次卖的是保德碗托。味道很不一样，我看到她加了一种黄色的酱汁，我想这个应该导致了我们最终吃到的是有点酸味的。加上我们没有要很多辣油，就导致碗托并不腻。不过它的面感觉就和淘宝的荞麦面碗托一样，不如昨天的那个有糯劲。卖碗托的斜对面就是上帝炸鸡，我们点了一份大份的，吃起来感觉就是那种老派炸鸡的感觉，外皮特别脆，但是里面很多汁。不过我觉得得加点辣粉更好吃。</p><p>买完上帝炸鸡，就去利源沾片子。中途路过似乎是太原比较繁华的体育路街区，有个盒马。我们继续往北走就到了沾片子，在一个灯光很暗的小街上。我老婆先去的，然后出来告诉我要排位，留了个电话号码。然后我们就寻思附近转转，这附近可真没啥能转的。一些很拉胯的咖啡厅，泰山啤酒，一堆成人用品店。最后，我们回到那个店，发现人都换了一遍，显然老板看生意好，就没有叫我们。所以我们被迫在店里面站着等位，等的时候，后面最多来了四队人。老板让我们提前点菜，我们点了一堆：</p><ul><li>沾片子 吃起来比较素。我比较喜欢豇豆的那个，感觉保留了蔬菜的本味。</li><li>炒茄子 很好吃，茄子很脆，酱汁很香。</li><li>莜面</li><li>清徐灌肠 实际上也是素的，感觉挺油的</li></ul><p>这家店不光等位慢，做得也慢，出来了就九点多了。打车不太打得到，所以就准备坐地铁，路上还买了一盒草莓，大个的，才 30 块钱。好不容易坐上了地铁，然后下错站了，索性直接打车算了。不过这样晚上在汾河边散步的计划就泡汤了，实际上那天晚上步道的灯光包括那条龙也没开，所以就算了。</p><h1 id="D3"><a href="#D3" class="headerlink" title="D3"></a>D3</h1><p>早上七点五十就起来了，先把昨天的炸鸡拿到宾馆餐厅里面热热，顺便再吃吃他的刀削面和豆花。太钢汽水也是尝了尝别的味道，感觉都还可以。</p><p>吃完饭，就去汾河边散散步。这次见识到了传说中的自行车道，其实挺窄的，一个方向只能容纳一个人骑。并且这个自行车道的下匝道有点少，很多地方是高架。要散步的话，可以穿过自行车道再往河边走。</p><p>逛完回去已经快九点半了，赶快打车去晋商博物馆。晋商博物馆实际上就是原来的巡抚衙门，在民国时期也是山西的中枢所在，到了新中国，一度是省委和省政府的办公地，一直到 2017 
年改建为博物馆。这个博物馆确实透露着政府办公室的气息，特别是中间一栋苏联式样的办公楼，一进去就能闻到浓烈的复写纸的味道。</p><p>我发现山西这边的博物馆有一个喜好，就是搞大全套收集。例如山西博物院就搞了个钱币展，非常夸张地收集了先秦时期各个国家的钱币，以及后面历朝历代不同年号的钱币。晋商博物馆中更是收集了七八套大小不一的编钟，收集了鼎、盆、壶、斛等各种器皿。最特别的是，我第一次看到了一个东西叫灶。晋商博物馆挺长的，进去先是一个高楼，里面存了据说是山西第一巡抚诺敏的匾额。后面是2号楼和3号楼，两栋楼通过走廊连在一起。在后面是个会议室，和一个苏联样式的高楼。再往后有一个钟楼，但是不让过去了。比较幸运的是，东花园原来在修缮，但是这一次对我们开放了。进去可以看到民国时候的一些陈列，以及五几年的时候省委书记的住房。</p><p>从晋商博物院出来，就去吃了一诺铜火锅。我老婆照例点了一辈子都吃不完的菜，点评如下：</p><ul><li>铜锅 感觉一般</li><li>黄米凉糕 很好吃，但是不要把上面的糖拌进去，不然就太甜了</li><li>丸子串 一般</li><li>羊肉串 很好吃</li><li>涮肚 感觉很好吃，我老婆觉得麻酱很多，但是我觉得跟颐祥阁一模一样</li><li>蒜泥茄条 挺好吃的，酸酸甜甜</li><li>小酥肉 肉感觉一般，但是旁边的蘑菇还不错</li><li>烤饼 感觉很好吃，也带回去了</li><li>鸡包豆腐 感觉就是千页豆腐？一般</li><li>过油肉 很好吃</li><li>酸辣白菜 酸酸甜甜的很好吃</li></ul><p>吃完饭，才 12.40，感觉还有一段时间，因此就准备去纯阳宫。因为我在打车过来的路上就看到纯阳宫了，所以就直接骑车过去，风风火火，花了不到十分钟就骑到了。进去之后，我老婆终于要上厕所了，我先进去逛。它的展厅都好没意思，啥陈列都没有，就干放视频。走到最后，是个主题展，里面陈列了一个常阳天尊像，居然是一级国宝，并且是 195 文物。这玩意也没有被框起来，能够近距离观赏感觉挺好，大家也都很有素质，不上手摸。出门，上楼梯，二层的展览就很蠢了，有个展厅里面全是文物已借出。二楼可以通到前面的九宫八卦院，有很多人在那边拍照。</p><p>往回走，打开了旅途随身听，介绍了九宫八卦院，这个建筑布局很有意思，据说是全国唯一的。再往回，就是吕洞宾殿，也就是纯阳宫的主要建筑。再往回就是弥勒佛铜像。我老婆跟我说这里面有个西汉的石狮子，还有明代的铜狮子，居然就直接摆在那里，也没有保护，感觉山西还是富，人们和文物生活在一起。</p><p>再往回走，就快到了出口。此时左边是一个假山，上面是一个关羽像。关羽像没有胡子，因为这个像是在明朝的，而关羽有胡子的说法是明末根据三国演义才有的。右边是个碑廊，然后里面放着另一个 195 文物，涅槃变相碑。这玩意也没有被保护起来，我觉得还是不太好，毕竟这个是室外嘛，还是要有点防护为好。</p><p>我在看涅槃变相碑的时候，因为空间狭小，我的书包还把后面的一个壁画的说明牌给蹭下来了，幸亏没有伤着壁画，不过这也说明了这个陈列比较拥挤，容易碰到。</p><p>从纯阳宫出来立马打车赶往机场，不得不再次羡慕太原机场离市区实在是近。</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;总的来说，太原是类似于西安的城市，类似的气候，类似的历史文化。但在游玩体验来讲，太原总体略优于西安，主要是：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;西安商业化太浓重了，例如大唐不夜城、城墙上骑车等。当然这也不是坏事，但如果玩的多了，就会觉得商业化严重的景点如同预制菜一样，不能说不好吃，但感觉容易腻&lt;/li&gt;
&lt;li&gt;西安人太多了&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;但太原的问题主要是：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;交通很不方便。虽然也看到它有专门的旅游线路，并且感觉是用心的，但体验上确实还有欠缺。&lt;/li&gt;
&lt;/ul&gt;</summary>
    
    
    
    
    <category term="游记" scheme="http://www.calvinneo.com/tags/游记/"/>
    
  </entry>
  
  <entry>
    <title>TiCI 上线过程</title>
    <link href="http://www.calvinneo.com/2025/11/30/tici-go-online/"/>
    <id>http://www.calvinneo.com/2025/11/30/tici-go-online/</id>
    <published>2025-11-30T03:57:20.000Z</published>
    <updated>2026-01-09T08:11:04.729Z</updated>
    
    <content type="html"><![CDATA[<p>记录了 TiCI 上线过程中遇到的一些问题：</p><ul><li>针对这些问题的技术性解法和运维性的解法<br>  涉及到某些内部知识的将不予公开。</li><li>对于问题严重程度应该如何判断</li><li>如何为了达成上线的既定目标，设计临时性的缓解措施</li></ul><a id="more"></a><h1 id="POC-on-TiDB-X-for-Customer-U-late-Nov-2025"><a href="#POC-on-TiDB-X-for-Customer-U-late-Nov-2025" class="headerlink" title="POC on TiDB-X for Customer U, late Nov 2025"></a>POC on TiDB-X for Customer U, late Nov 2025</h1><h2 id="主要问题"><a href="#主要问题" class="headerlink" title="主要问题"></a>主要问题</h2><ol><li>支持 Keyspace</li><li>运维手段<br> 包含 Shard、Reader 和 Importer</li><li>支持 Security</li><li>openssl 的编译问题</li><li>arm 编译的问题</li><li>上线后出现的影响可用性的 bug</li><li>上线后出现的影响性能的 bug</li></ol><h2 id="Nov-21"><a href="#Nov-21" class="headerlink" title="Nov 21"></a>Nov 21</h2><p>讨论 import into 场景下，如果某个 worker 因为 OOM 重启，则因为目前缺少 Heartbeat 机制和 reschedule 机制，整个任务会因为这个小任务失败而停止。</p><p>另外，也提到了和 import into 流控相关的问题。</p><h2 id="Nov-23"><a href="#Nov-23" class="headerlink" title="Nov 23"></a>Nov 23</h2><p>遇到了 tokio Sender 报错 channel closed 问题。当时没有空查。</p><p>这个问题实际上就是协程因为哪里 unwrap 而 panic 了，而因为协程池是跑的消息循环，所以 panic 了不 join 也不知道。一直没找到是哪里，后来我加了点 panic hook，才在 Nov 30 定位到是服务发现的问题。</p><h2 id="Nov-26"><a href="#Nov-26" class="headerlink" title="Nov 26"></a>Nov 26</h2><ol><li>TLS options are only supported with HTTPS URLs<br> 这个错误很 plain</li><li>Client had asked for TLS connection but TLS support is disabled. 
Please enable one of the following features: [“native-tls-tls”, “rusttls-tls”]<br> 我们不能用 native，只能用内置的 rustls</li><li>undefined symbol: pthread_atfork<br> ARM 上的编译问题，有专门<a href="/2025/11/25/pthread_atfork/">文章</a>介绍</li><li>TLS error error:0A000086<br> 这个就是要启动的时候指定下用 ring 还是 aws_lc_rs</li></ol><h2 id="Nov-27"><a href="#Nov-27" class="headerlink" title="Nov 27"></a>Nov 27</h2><ol><li>排查 ARM 上的问题，这里做了两套方案，先设法用 x86 跑起来。事实证明这也是正确的，白天我们发现了更多的问题，最终因为 TLS 的问题也没有跑起来。但是晚上我们把 ARM 验证了下，发现不报原先的错了。</li><li>遇到一个新问题是我们的云上 operator 机制不支持改参数，我们挂的又是 ro 文件系统，导致很难验证</li><li>etcd client unavailable or unhealthy, attempting reconnect<br> 这个是 <code>kebab-case</code> 的锅，Rust 和 C++ 的格式不一样，写配置的同学把 ca-path 写成 ca_path 了。</li><li>后来发现，TiFlash CN 还是绕不开 TLS 的问题。所以还是得兼容。</li></ol><h2 id="Nov-28"><a href="#Nov-28" class="headerlink" title="Nov 28"></a>Nov 28</h2><ol><li>get meta channel failed times<br> 这个错误还是 etcd client 的 ca-path 没传对</li><li>到这里为止，整体看上去能跑了，但是查询报错，看日志感觉核心服务没起来<br> 首先 lsof 看了一下，发现服务器都没起来。<br> 因为没有 panic，直觉是哪里有个报错死在协程里面没传播出去。进而发现是 rust 的 grpc server 不能在 dns 格式的 url 上启动。不得不说我几天前的 advertise_addr 的改动很有先见之明 <a href="https://github.com/pingcap-inc/tici/pull/515" target="_blank" rel="noopener">https://github.com/pingcap-inc/tici/pull/515</a></li></ol><h2 id="Nov-29"><a href="#Nov-29" class="headerlink" title="Nov 29"></a>Nov 29</h2><ol><li>白天是在 import into 导入数据，晚上出现了一堆问题。</li><li>首先，是之前的那个 panic 导致协程退出的阴魂不散又来了。我紧急去掉了一堆 unwrap，并且加了 panic hook 打印错误日志。</li><li>然后，是 writer 那边没有对 meta 返回的错误码进行处理，但是加上了处理之后，一个集成测试不通过了。大家觉得要不就把这个测试禁用了吧，但是我很反对，因为这个测试是唯一一个带上真实的 worker 和 reader 跑完 e2e 的测试。并且我看了日志发现一个 writer node 被超时移出了，所以尽管 PR owner 认为他并没有改动到这一块的逻辑，我仍然认为他的修改导致一个严重问题暴露了。因此，我们应该先查问题。原因是即使我们强行合并了，也不能拿这个 commit 去跑生产。后来，确实发现了是在处理 heartbeat 的时候会等待一个异步任务结束，所以导致后面 heartbeat 消息循环直接卡死了。所以，这个问题确实会导致 writer 因为丢失心跳从而被全部 failover，进而整个集群没有 writer 可用的 critical 问题。不 approve 这个 PR 是完全正确的。</li></ol><h2 id="Nov-30"><a href="#Nov-30" class="headerlink" title="Nov 30"></a>Nov 30</h2><p>今天主要切换到性能方面的支持上：</p><ul><li>warmup 
的速度太慢了，所以想把这一块改成并发执行<br>  在这个处理之后，8 个节点，24 个并发，大概是不到两个小时就处理完毕了。</li><li>因为 worker 出现了异常重启，导致重启后 meta 向它发送了大量的 add shard message，导致触发了 grpc 的消息大小上限。临时通过增加上限来 workaround 了。实际上是可以通过拆成多个 heartbeat response 来解决。另外同事也提到，可以只发部分信息，其他的后续让主动请求，但这个可能会产生很多的 grpc 调用。</li></ul><h2 id="Dec-1"><a href="#Dec-1" class="headerlink" title="Dec 1"></a>Dec 1</h2><ul><li>发现查询速度比较低，原因是串行访问的所有 shard</li></ul><h2 id="Dec-2"><a href="#Dec-2" class="headerlink" title="Dec 2"></a>Dec 2</h2><ul><li>发现查询并发比较低，原因是查询的 filter 不能通过主键进行过滤，所以导致每次查询需要访问所有的 shard</li></ul><h2 id="Dec-3"><a href="#Dec-3" class="headerlink" title="Dec 3"></a>Dec 3</h2><ul><li>关于昨天的并发问题，认为一个 shard 中有多个 fragment，并且一个 fragment 中又有多个 segment，因此会影响查询的速度。通过 Manual compaction 和调大内存的方式，使得 fragment 和 segment 都变少。</li><li>因为目前没有自动 balance 机制，所以还需要 Manual reschedule。</li></ul><h2 id="Dec-5"><a href="#Dec-5" class="headerlink" title="Dec 5"></a>Dec 5</h2><p>功能侧：</p><ul><li>主要处理 duplicate fragment 的问题。我认为因为 apply compaction 需要根据 frag path 来定位，重名的 frag 会导致问题，所以应该由 meta 来禁用。</li></ul><p>测试侧：</p><ul><li>执行了 Compaction，并且后续自动执行了 Merge，发现查询 QPS 提高了 50%，但依然是比较差的</li><li>因此决定用 (token0_address, ts) 作为新的主键，和 TiKV 解耦</li></ul><h2 id="Dec-6"><a href="#Dec-6" class="headerlink" title="Dec 6"></a>Dec 6</h2><p>这一轮测试发现调整了新的主键之后，QPS 增加了很多，可以看出列存中使用不同的 Sharding key 的重要性：</p><ul><li>静态数据 QPS 2.6k+，P999 61.6 ms，且仍有弹性。其中 TiFlash CPU 是 2300%，内存是 8 * 91G。<br>  <a href="https://github.com/CalvinNeo/ue-bench/blob/master/src/main.rs" target="_blank" rel="noopener">https://github.com/CalvinNeo/ue-bench/blob/master/src/main.rs</a>  <figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> * <span class="keyword">FROM</span> ... <span class="keyword">WHERE</span> token0_address = ? <span class="keyword">AND</span> platform = ? <span class="keyword">AND</span> ts &lt; ? 
<span class="keyword">LIMIT</span> <span class="number">5</span></span><br></pre></td></tr></table></figure></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;记录了 TiCI 上线过程中遇到的一些问题：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;针对这些问题的技术性解法和运维性的解法&lt;br&gt;  涉及到某些内部知识的将不予公开。&lt;/li&gt;
&lt;li&gt;对于问题严重程度应该如何判断&lt;/li&gt;
&lt;li&gt;如何为了达成上线的既定目标，设计临时性的缓解措施&lt;/li&gt;
&lt;/ul&gt;</summary>
    
    
    
    
    <category term="数据库" scheme="http://www.calvinneo.com/tags/数据库/"/>
    
    <category term="Rust" scheme="http://www.calvinneo.com/tags/Rust/"/>
    
  </entry>
  
  <entry>
    <title>undefined symbol pthread_atfork</title>
    <link href="http://www.calvinneo.com/2025/11/25/pthread_atfork/"/>
    <id>http://www.calvinneo.com/2025/11/25/pthread_atfork/</id>
    <published>2025-11-24T18:20:13.000Z</published>
    <updated>2025-12-03T08:21:21.643Z</updated>
    
    <content type="html"><![CDATA[<p>在 x86 上可以跑，但是在 arm linux 上就报这个错误。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">/tiflash/tiflash: symbol lookup error: /tiflash/libtici_search_lib.so: undefined symbol: pthread_atfork</span><br></pre></td></tr></table></figure><a id="more"></a><p>首先，ldd 看到，链接的是本地的 <code>/lib64/libpthread.so.0</code>。</p><p>可以通过 <code>strings /lib64/libpthread.so.0 | grep &#39;^GLIBC_&#39;</code> 命令查询 GLIBC 的版本。</p><p>然后，nm 了一下 /tiflash/libtici_search_lib.so，结果是：</p><ol><li><p>x86 开发机</p> <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">nm libtici_search_lib.so | grep pthread</span><br><span class="line">0000000003c97e70 t __pthread_atfork</span><br><span class="line">0000000003c97e70 t pthread_atfork</span><br><span class="line">                 U pthread_attr_destroy</span><br><span class="line">                 U pthread_attr_getguardsize</span><br><span class="line">                 U pthread_attr_getstack</span><br><span class="line">                 U pthread_attr_init</span><br><span class="line">                 U pthread_attr_setstacksize</span><br></pre></td></tr></table></figure></li><li><p>arm tiflash:v2025.8.10-2-gc9e3144-centos7 镜像<br> <img src="/img/pthread_atfork/arm.jpg"></p></li><li><p>x86 tiflash:v2025.8.10-2-gc9e3144-centos7 镜像<br> 这个是 multi arch 镜像，但是 glibc 版本不一样</p></li></ol><p>那么这个符号在 <code>/lib64/libpthread.so.0</code> 里面有么？nm 了一下：</p><ul><li>x86 的版本是 2.34，显示这个 so 是没有 Debug info 的。难道被 trim 了么？ls 了一下这个文件，发现只有 15KiB 左右。后来了解到，在较新的 GLIBC 中，pthread 相关的被整合到了 libc.so 中，我 nm 了 libc.so 确实可以看到。</li><li>arm 版本是 
2.17，nm 了可以看到其他 pthread 符号，但是看不到 pthread_atfork。</li></ul><p><img src="/img/pthread_atfork/arm33.png"></p><p>原因是在大多数 Linux 发行版（使用 glibc 的系统）中，pthread_atfork 的符号并不在常规的共享库如 libpthread.so.0 中，而是通过链接器脚本特殊处理，其具体实现位于 libpthread_nonshared.a 这个静态归档文件中，而非 .so 文件里。</p><p>解决方案参考 <a href="https://github.com/pingcap/tiflash/pull/10571" target="_blank" rel="noopener">https://github.com/pingcap/tiflash/pull/10571</a>，强制使用 <code>-pthread</code> 而不是 <code>-lpthread</code> 即可。</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;在 x86 上可以跑，但是在 arm linux 上就报这个错误。&lt;/p&gt;
&lt;figure class=&quot;highlight plain&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre&gt;&lt;span class=&quot;line&quot;&gt;1&lt;/span&gt;&lt;br&gt;&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;line&quot;&gt;/tiflash/tiflash: symbol lookup error: /tiflash/libtici_search_lib.so: undefined symbol: pthread_atfork&lt;/span&gt;&lt;br&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/figure&gt;</summary>
    
    
    
    
    <category term="C++" scheme="http://www.calvinneo.com/tags/C/"/>
    
    <category term="Rust" scheme="http://www.calvinneo.com/tags/Rust/"/>
    
    <category term="GLIBC" scheme="http://www.calvinneo.com/tags/GLIBC/"/>
    
  </entry>
  
  <entry>
    <title>tokio channel 实现</title>
    <link href="http://www.calvinneo.com/2025/11/23/tokio_channel_src/"/>
    <id>http://www.calvinneo.com/2025/11/23/tokio_channel_src/</id>
    <published>2025-11-23T15:09:06.000Z</published>
    <updated>2025-12-12T14:50:31.126Z</updated>
    
    <content type="html"><![CDATA[<p>基于 tokio 1.46.0 版本</p><a id="more"></a><h1 id="mpsc"><a href="#mpsc" class="headerlink" title="mpsc"></a>mpsc</h1><p>mpsc 有 bounded 和 unbounded 两种形式。通过不同的 semaphore 来区别。</p><h2 id="Chan-结构"><a href="#Chan-结构" class="headerlink" title="Chan 结构"></a>Chan 结构</h2><p>对于 unbounded semaphore，其最低的 bit 表示这个 channel 有没有关闭。</p><figure class="highlight rust"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span 
class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// ===== impl Semaphore for (::Semaphore, capacity) =====</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">impl</span> Semaphore <span class="keyword">for</span> bounded::Semaphore &#123;</span><br><span class="line">    <span class="function"><span class="keyword">fn</span> <span class="title">add_permit</span></span>(&amp;<span class="keyword">self</span>) &#123;</span><br><span class="line">        <span class="keyword">self</span>.semaphore.release(<span class="number">1</span>);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">fn</span> <span class="title">add_permits</span></span>(&amp;<span class="keyword">self</span>, n: <span class="built_in">usize</span>) &#123;</span><br><span class="line">        <span class="keyword">self</span>.semaphore.release(n)</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">fn</span> <span class="title">is_idle</span></span>(&amp;<span class="keyword">self</span>) -&gt; <span class="built_in">bool</span> &#123;</span><br><span class="line">        <span class="keyword">self</span>.semaphore.available_permits() == <span class="keyword">self</span>.bound</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">fn</span> <span class="title">close</span></span>(&amp;<span class="keyword">self</span>) &#123;</span><br><span class="line">        <span 
class="keyword">self</span>.semaphore.close();</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">fn</span> <span class="title">is_closed</span></span>(&amp;<span class="keyword">self</span>) -&gt; <span class="built_in">bool</span> &#123;</span><br><span class="line">        <span class="keyword">self</span>.semaphore.is_closed()</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// ===== impl Semaphore for AtomicUsize =====</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">impl</span> Semaphore <span class="keyword">for</span> unbounded::Semaphore &#123;</span><br><span class="line">    <span class="function"><span class="keyword">fn</span> <span class="title">add_permit</span></span>(&amp;<span class="keyword">self</span>) &#123;</span><br><span class="line">        <span class="keyword">let</span> prev = <span class="keyword">self</span>.<span class="number">0</span>.fetch_sub(<span class="number">2</span>, Release);</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> prev &gt;&gt; <span class="number">1</span> == <span class="number">0</span> &#123;</span><br><span class="line">            <span class="comment">// Something went wrong</span></span><br><span class="line">            process::abort();</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">fn</span> <span class="title">add_permits</span></span>(&amp;<span class="keyword">self</span>, n: <span class="built_in">usize</span>) &#123;</span><br><span class="line">        <span class="keyword">let</span> prev = <span class="keyword">self</span>.<span class="number">0</span>.fetch_sub(n 
&lt;&lt; <span class="number">1</span>, Release);</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> (prev &gt;&gt; <span class="number">1</span>) &lt; n &#123;</span><br><span class="line">            <span class="comment">// Something went wrong</span></span><br><span class="line">            process::abort();</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">fn</span> <span class="title">is_idle</span></span>(&amp;<span class="keyword">self</span>) -&gt; <span class="built_in">bool</span> &#123;</span><br><span class="line">        <span class="keyword">self</span>.<span class="number">0</span>.load(Acquire) &gt;&gt; <span class="number">1</span> == <span class="number">0</span></span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">fn</span> <span class="title">close</span></span>(&amp;<span class="keyword">self</span>) &#123;</span><br><span class="line">        <span class="keyword">self</span>.<span class="number">0</span>.fetch_or(<span class="number">1</span>, Release);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">fn</span> <span class="title">is_closed</span></span>(&amp;<span class="keyword">self</span>) -&gt; <span class="built_in">bool</span> &#123;</span><br><span class="line">        <span class="keyword">self</span>.<span class="number">0</span>.load(Acquire) &amp; <span class="number">1</span> == <span class="number">1</span></span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>无论是 unbounded 还是 bounded，最终都是到 chan::Tx 和 chan::Rx 这两个结构里面。这两个类都持有一个 <code>Arc&lt;Chan&lt;T, S&gt;&gt;</code>。</p><p>我们会看到有两个 
Tx：</p><ol><li>chan::Tx 是一个 <code>Arc&lt;Chan&lt;T, S&gt;&gt;</code>，它实际上封装了 Chan，更上层一点</li><li>list::Tx 是一个 Block 的链表。它是 <code>Chan&lt;T, S&gt;</code> 的 field，更底层一点</li></ol><p>list::Tx 是一个无锁队列。这个队列的内存是以 Block 为基础分配的，每个 block 能装 <code>const BLOCK_CAP: usize = 32;</code> 个 Value。所以，Chan 中实现了解耦：</p><ul><li>list 模块只负责无锁队列的实现</li><li>Chan 的其他部分负责容量控制、notify/waker 的逻辑</li></ul><figure class="highlight rust"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">pub</span>(<span class="keyword">super</span>) <span class="class"><span class="keyword">struct</span> <span class="title">Chan</span></span>&lt;T, S&gt; &#123;</span><br><span class="line">    <span class="comment">/// Handle to the push half of the lock-free list.</span></span><br><span class="line">    tx: CachePadded&lt;list::Tx&lt;T&gt;&gt;,</span><br><span class="line"></span><br><span class="line">    <span class="comment">/// Receiver waker. 
Notified when a value is pushed into the channel.</span></span><br><span class="line">    rx_waker: CachePadded&lt;AtomicWaker&gt;,</span><br><span class="line"></span><br><span class="line">    <span class="comment">/// Notifies all tasks listening for the receiver being dropped.</span></span><br><span class="line">    notify_rx_closed: Notify,</span><br><span class="line"></span><br><span class="line">    <span class="comment">/// Coordinates access to channel's capacity.</span></span><br><span class="line">    semaphore: S,</span><br><span class="line"></span><br><span class="line">    <span class="comment">/// Tracks the number of outstanding sender handles.</span></span><br><span class="line">    <span class="comment">///</span></span><br><span class="line">    <span class="comment">/// When this drops to zero, the send half of the channel is closed.</span></span><br><span class="line">    tx_count: AtomicUsize,</span><br><span class="line"></span><br><span class="line">    <span class="comment">/// Tracks the number of outstanding weak sender handles.</span></span><br><span class="line">    tx_weak_count: AtomicUsize,</span><br><span class="line"></span><br><span class="line">    <span class="comment">/// Only accessed by `Rx` handle.</span></span><br><span class="line">    rx_fields: UnsafeCell&lt;RxFields&lt;T&gt;&gt;,</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="block-的实现"><a href="#block-的实现" class="headerlink" title="block 的实现"></a>block 的实现</h3><figure class="highlight rust"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span 
class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#[cfg(all(target_pointer_width = <span class="meta-string">"64"</span>, not(loom)))]</span></span><br><span class="line"><span class="keyword">const</span> BLOCK_CAP: <span class="built_in">usize</span> = <span class="number">32</span>;</span><br><span class="line"></span><br><span class="line"><span class="meta">#[repr(transparent)]</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">Values</span></span>&lt;T&gt;([UnsafeCell&lt;MaybeUninit&lt;T&gt;&gt;; BLOCK_CAP]);</span><br><span class="line"></span><br><span class="line"><span class="keyword">pub</span>(<span class="keyword">crate</span>) <span class="class"><span class="keyword">struct</span> <span class="title">Block</span></span>&lt;T&gt; &#123;</span><br><span class="line">    <span class="comment">/// The header fields.</span></span><br><span class="line">    header: BlockHeader&lt;T&gt;,</span><br><span class="line"></span><br><span class="line">    <span class="comment">/// Array containing values pushed into the block. 
Values are stored in a</span></span><br><span class="line">    <span class="comment">/// continuous array in order to improve cache line behavior when reading.</span></span><br><span class="line">    <span class="comment">/// The values must be manually dropped.</span></span><br><span class="line">    values: Values&lt;T&gt;,</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/// Extra fields for a `Block&lt;T&gt;`.</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">BlockHeader</span></span>&lt;T&gt; &#123;</span><br><span class="line">    <span class="comment">/// The start index of this block.</span></span><br><span class="line">    <span class="comment">///</span></span><br><span class="line">    <span class="comment">/// Slots in this block have indices in `start_index .. start_index + BLOCK_CAP`.</span></span><br><span class="line">    start_index: <span class="built_in">usize</span>,</span><br><span class="line"></span><br><span class="line">    <span class="comment">/// The next block in the linked list.</span></span><br><span class="line">    next: AtomicPtr&lt;Block&lt;T&gt;&gt;,</span><br><span class="line"></span><br><span class="line">    <span class="comment">/// Bitfield tracking slots that are ready to have their values consumed.</span></span><br><span class="line">    ready_slots: AtomicUsize,</span><br><span class="line"></span><br><span class="line">    <span class="comment">/// The observed `tail_position` value *after* the block has been passed by</span></span><br><span class="line">    <span class="comment">/// `block_tail`.</span></span><br><span class="line">    observed_tail_position: UnsafeCell&lt;<span class="built_in">usize</span>&gt;,</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>看 Block::grow，就是一个无锁链表的实现 </p><figure class="highlight rust"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">...</span><br><span class="line">        <span class="keyword">let</span> <span class="keyword">mut</span> curr = next;</span><br><span class="line"></span><br><span class="line">        <span class="comment">// <span class="doctag">TODO:</span> Should this iteration be capped?</span></span><br><span class="line">        <span class="keyword">loop</span> &#123;</span><br><span class="line">            <span class="keyword">let</span> actual = <span class="keyword">unsafe</span> &#123; curr.as_ref().try_push(&amp;<span class="keyword">mut</span> new_block, AcqRel, Acquire) &#125;;</span><br><span class="line"></span><br><span class="line">            curr = <span class="keyword">match</span> actual &#123;</span><br><span class="line">                <span class="literal">Ok</span>(()) =&gt; &#123;</span><br><span class="line">                    <span class="keyword">return</span> next;</span><br><span class="line">                &#125;</span><br><span class="line">                <span class="literal">Err</span>(curr) =&gt; curr,</span><br><span class="line">            &#125;;</span><br><span class="line"></span><br><span class="line">            crate::loom::thread::yield_now();</span><br><span class="line">        &#125;</span><br><span class="line">...</span><br></pre></td></tr></table></figure><h4 id="为什么是关于-Block-的无锁队列？"><a href="#为什么是关于-Block-的无锁队列？" 
class="headerlink" title="为什么是关于 Block 的无锁队列？"></a>为什么是关于 Block 的无锁队列？</h4><blockquote><p>为什么是关于 Block 的无锁队列，而不是关于 Value 的呢？</p></blockquote><ol><li>减少锁竞争和原子操作 (Reduced Contention and Atomic Operations)<br> 集中操作 (Batch Operations): 在高并发场景下，如果每次发送一个 Value 就需要对共享队列（如链表或原子指针）进行一次原子操作（例如 CAS - Compare-and-Swap）来添加节点，那么多个发送者 (Multi-Producer) 之间的竞争会非常激烈，导致性能瓶颈。<br> Block 机制: 使用 Block（块），每个 Block 中可以存放多个 Value。这样，发送者可以一次性分配一个 Block，并填充多个 Value。在将这个 Block 链接到队列末尾时，只需要进行一次原子操作来更新队列的尾指针。这大大减少了对核心共享数据结构（队列头/尾）的原子操作次数，从而降低了锁竞争和系统开销。<br> 局部性 (Locality): 一旦一个发送者成功获取并填充了一个 Block，它就可以在没有竞争的情况下，向该 Block 中写入若干条消息，这利用了 CPU 缓存的局部性原理。</li><li>优化内存分配 (Optimized Memory Allocation)<br> 批量分配: 每次发送一个 Value 就进行一次内存分配是低效的。Block 允许一次性分配一块较大的内存，用于存储多个 Value。<br> 更少的元数据: 如果每条 Value 都是一个独立的链表节点，那么每个节点都需要存储一个指针（指向下一个节点）作为元数据。在 Block 方案中，只有 Block 之间有链接指针，一个 Block 内部的多个 Value 可以紧凑存储，减少了内存开销。</li></ol><h2 id="channel-创建"><a href="#channel-创建" class="headerlink" title="channel 创建"></a>channel 创建</h2><h2 id="send-实现"><a href="#send-实现" class="headerlink" title="send 实现"></a>send 实现</h2><h3 id="unbounded"><a href="#unbounded" class="headerlink" title="unbounded"></a>unbounded</h3><p>unbounded 的 send 的实现，可以看到，最终调用了 Chan 的 send，后面会详细介绍。</p><figure class="highlight rust"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">pub</span> <span class="function"><span class="keyword">fn</span> <span class="title">send</span></span>(&amp;<span class="keyword">self</span>, message: T) -&gt; <span class="built_in">Result</span>&lt;(), SendError&lt;T&gt;&gt; &#123;</span><br><span class="line">    <span class="keyword">if</span> !<span class="keyword">self</span>.inc_num_messages() &#123;</span><br><span class="line"> 
       <span class="keyword">return</span> <span class="literal">Err</span>(SendError(message));</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">self</span>.chan.send(message);</span><br><span class="line">    <span class="literal">Ok</span>(())</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>inc_num_messages 的逻辑如下。消息计数的最低位被用作“接收端已关闭”的标志位（为 1 时发送直接失败），所以计数每次增加 2，不会影响这个标志位</p><figure class="highlight rust"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> curr &amp; <span class="number">1</span> == <span class="number">1</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">.compare_exchange(curr, curr + <span class="number">2</span>, AcqRel, Acquire)</span><br></pre></td></tr></table></figure><h3 id="bounded"><a href="#bounded" class="headerlink" title="bounded"></a>bounded</h3><p>bounded 的 send 则要先从 semaphore 申请容量，对应下面的 reserve_inner：</p><figure class="highlight rust"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line">async <span class="function"><span 
class="keyword">fn</span> <span class="title">reserve_inner</span></span>(&amp;<span class="keyword">self</span>, n: <span class="built_in">usize</span>) -&gt; <span class="built_in">Result</span>&lt;(), SendError&lt;()&gt;&gt; &#123;</span><br><span class="line">    crate::trace::async_trace_leaf().await;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> n &gt; <span class="keyword">self</span>.max_capacity() &#123;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">Err</span>(SendError(()));</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">match</span> <span class="keyword">self</span>.chan.semaphore().semaphore.acquire(n).await &#123;</span><br><span class="line">        <span class="literal">Ok</span>(()) =&gt; <span class="literal">Ok</span>(()),</span><br><span class="line">        <span class="literal">Err</span>(_) =&gt; <span class="literal">Err</span>(SendError(())),</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">pub</span> <span class="function"><span class="keyword">fn</span> <span class="title">max_capacity</span></span>(&amp;<span class="keyword">self</span>) -&gt; <span class="built_in">usize</span> &#123;</span><br><span class="line">    <span class="keyword">self</span>.chan.semaphore().bound</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">pub</span>(<span class="keyword">crate</span>) <span class="function"><span class="keyword">fn</span> <span class="title">acquire</span></span>(&amp;<span class="keyword">self</span>, num_permits: <span class="built_in">usize</span>) -&gt; Acquire&lt;<span class="symbol">'_</span>&gt; &#123;</span><br><span class="line">    Acquire::new(<span class="keyword">self</span>, num_permits)</span><br><span 
class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="Chan-的-send-实现"><a href="#Chan-的-send-实现" class="headerlink" title="Chan 的 send 实现"></a>Chan 的 send 实现</h3><p>Chan 的 send 的实现如下</p><figure class="highlight rust"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">fn</span> <span class="title">send</span></span>(&amp;<span class="keyword">self</span>, value: T) &#123;</span><br><span class="line">    <span class="comment">// Push the value</span></span><br><span class="line">    <span class="keyword">self</span>.tx.push(value);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Notify the rx task</span></span><br><span class="line">    <span class="keyword">self</span>.rx_waker.wake();</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>这个 rx_waker 是一个 AtomicWaker 对象。</p><figure class="highlight rust"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">pub</span>(<span class="keyword">crate</span>) <span class="class"><span class="keyword">struct</span> <span class="title">AtomicWaker</span></span> &#123;</span><br><span class="line">    state: AtomicUsize,</span><br><span class="line">    waker: UnsafeCell&lt;<span class="built_in">Option</span>&lt;std::task::Waker&gt;&gt;,</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h1 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h1>]]></content>
    
    
    <summary type="html">&lt;p&gt;基于 tokio 1.46.0 版本&lt;/p&gt;</summary>
    
    
    
    
    <category term="Rust" scheme="http://www.calvinneo.com/tags/Rust/"/>
    
  </entry>
  
  <entry>
    <title>“良定义”的状态</title>
    <link href="http://www.calvinneo.com/2025/11/20/well-defined-state/"/>
    <id>http://www.calvinneo.com/2025/11/20/well-defined-state/</id>
    <published>2025-11-20T15:09:06.000Z</published>
    <updated>2025-11-21T15:52:10.678Z</updated>
    
    <content type="html"><![CDATA[<p>我觉得“良定义”的状态需要具备：</p><ul><li>完备性</li><li>唯一性</li></ul><a id="more"></a><h1 id="Case-by-case"><a href="#Case-by-case" class="headerlink" title="Case by case"></a>Case by case</h1><h2 id="如何表示一个空区间？"><a href="#如何表示一个空区间？" class="headerlink" title="如何表示一个空区间？"></a>如何表示一个空区间？</h2><p>假设有一些 string，可以用 [s1, s2) 圈出其中的一部分。正因为是左闭右开区间，空集难以表示。这里只能通过</p><h2 id="如何表示“和上次一样”"><a href="#如何表示“和上次一样”" class="headerlink" title="如何表示“和上次一样”"></a>如何表示“和上次一样”</h2><h2 id="Option-lt-Vec-gt-还是-Vec？"><a href="#Option-lt-Vec-gt-还是-Vec？" class="headerlink" title="Option&lt;Vec&gt; 还是 Vec？"></a>Option&lt;Vec&lt;T&gt;&gt; 还是 Vec&lt;T&gt;？</h2><p>Vec 是 Option&lt;Vec&gt; 的上位类型。</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;我觉得“良定义”的状态需要具备：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;完备性&lt;/li&gt;
&lt;li&gt;唯一性&lt;/li&gt;
&lt;/ul&gt;</summary>
    
    
    
    
    <category term="数据结构" scheme="http://www.calvinneo.com/tags/数据结构/"/>
    
    <category term="编程思想" scheme="http://www.calvinneo.com/tags/编程思想/"/>
    
  </entry>
  
  <entry>
    <title>Efficient IO with io_uring 学习</title>
    <link href="http://www.calvinneo.com/2025/10/30/efficient-io-uring/"/>
    <id>http://www.calvinneo.com/2025/10/30/efficient-io-uring/</id>
    <published>2025-10-30T14:20:13.000Z</published>
    <updated>2025-11-06T15:16:59.866Z</updated>
    
    <content type="html"><![CDATA[<p>通过 <a href="https://kernel.dk/io_uring.pdf" target="_blank" rel="noopener">https://kernel.dk/io_uring.pdf</a> 简单学习下 io_uring。</p><a id="more"></a><h1 id="1-0-Introduction"><a href="#1-0-Introduction" class="headerlink" title="1.0 Introduction"></a>1.0 Introduction</h1><p>Linux 的读写 API 经历了：</p><ul><li><p>read</p></li><li><p>pread：增加了 offset</p></li><li><p>preadv：buffer 改为 iovec 数组的形式，支持分散读（scatter read）</p>  <figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">iovec</span></span></span><br><span class="line"><span class="class">&#123;</span></span><br><span class="line">    <span class="keyword">void</span> __user *iov_base;</span><br><span class="line">    <span class="keyword">__kernel_size_t</span> iov_len;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure></li><li><p>preadv2：加了 flags<br>  可以参考 <a href="https://www.man7.org/linux/man-pages/man2/preadv2.2.html" target="_blank" rel="noopener">https://www.man7.org/linux/man-pages/man2/preadv2.2.html</a></p></li></ul><p>但上述的 API 都是同步的。POSIX 有 <code>aio_</code> 系列的 API 标准，但是没啥人用，性能也不好。</p><p>Linux 有个 libaio，它和 POSIX 的 <code>aio_</code> 系列不是一个东西。但它也有问题：</p><ul><li><p>它要求 O_DIRECT，不然就和同步调用没啥区别。而 O_DIRECT 会 bypass cache，并且有严格的对齐要求，所以用途受限制。</p></li><li><p>即使满足 async 的所有条件，最终也不一定是 async 的。比如：</p><ul><li>如果要修改元数据，可能会 block</li><li>storage device 的 request slots 的数量是固定的<br>  这里的 request slots 表示 storage device 同时可以处理的并发数。<br>  传统存储协议如 SATA、SAS 中，只有一个命令队列，存放未完成的 io，它的长度就是 io depth。如果下层 storage device 的 request slots 数量小于 io depth，那么 io 请求就可能在 io 队列中等待。<br>  NVMe SSD 支持多个 Submission Queues (SQ) 和 Completion Queues (CQ)，每个 SQ 条目可对应一个正在执行的 I/O 命令。比如有 64 个 queue，每个 queue 深度是 
1024，那么理论上最多可并行执行 64 × 1024 = 65536 个命令。</li></ul></li><li><p>提交一个 io 需要复制 64 + 8 bytes。完成一个 io 需要复制 32 bytes。</p>  <figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">iocb</span> &#123;</span></span><br><span class="line">   __u64   aio_data;</span><br><span class="line">   __<span class="function">u32   <span class="title">PADDED</span><span class="params">(aio_key, aio_rw_flags)</span></span>;</span><br><span class="line">   __u16   aio_lio_opcode;</span><br><span class="line">   __s16   aio_reqprio;</span><br><span class="line">   __u32   aio_fildes;</span><br><span class="line">   __u64   aio_buf;</span><br><span class="line">   __u64   aio_nbytes;</span><br><span class="line">   __s64   aio_offset;</span><br><span class="line">   __u64   aio_reserved2;</span><br><span class="line">   __u32   aio_flags;</span><br><span class="line">   __u32   aio_resfd;</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">io_event</span> &#123;</span></span><br><span class="line">    __u64   data;</span><br><span class="line">    __u64   obj;</span><br><span class="line">    __s64   res;</span><br><span 
class="line">    __s64   res2;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>  Depending on your IO size, this can definitely be noticeable.<br>  IO always requires at least two system calls (submit + wait-for-completion), which in these post spectre/meltdown days is a serious slowdown.</p></li></ul><h1 id="2-0-Improving-the-status-quo"><a href="#2-0-Improving-the-status-quo" class="headerlink" title="2.0 Improving the status quo"></a>2.0 Improving the status quo</h1><p>一开始有一些改良 libaio 的工作：</p><ul><li>If you can extend and improve an existing interface, that’s preferable to providing a new one.</li><li>It’s a lot less work in general.</li></ul><p>libaio 主要有三个接口：</p><ul><li>io_setup</li><li>io_submit 用来提交一个 io</li><li>io_getevents 用来等待完成，并收获结果</li></ul><p>后面觉得，这种改良会把接口改得非常复杂，而且只能解决上面列出的一个问题。</p><h1 id="3-0-New-interface-design-goals"><a href="#3-0-New-interface-design-goals" class="headerlink" title="3.0 New interface design goals"></a>3.0 New interface design goals</h1><ul><li>Easy to use, hard to misuse.</li><li>Extendable. 希望这个接口不止支持 block oriented IO。对于网络，和非块存储设备，它都能适用。</li><li>Feature rich. Linux aio caters to a subset (of a subset) of applications. I did not want to create yet another<br>interface that only covered some of what applications need, or that required applications to reinvent the same<br>functionality over and over again (like IO thread pools).</li><li>Efficiency. While storage IO is mostly still block based and hence at least 512b or 4kb in size, efficiency at those<br>sizes is still critical for certain applications. Additionally, some requests may not even be carrying a data payload.<br>It was important that the new interface was efficient in terms of per-request overhead.</li><li>Scalability. While efficiency and low latencies are important, it’s also critical to provide the best performance<br>possible at the peak end. For storage in particular, we’ve worked very hard to deliver a scalable infrastructure. 
A<br>new interface should allow us to expose that scalability all the way back to applications.</li></ul><h1 id="4-0-Enter-io-uring"><a href="#4-0-Enter-io-uring" class="headerlink" title="4.0 Enter io_uring"></a>4.0 Enter io_uring</h1><p>首先是摘录作者的感言，性能必须从一开始，在<strong>设计接口</strong>的时候就考虑。</p><blockquote><p>Despite the ranked list of design goals, the initial design was centered around efficiency. Efficiency isn’t something that can be an afterthought, it has to be designed in from the start - you can’t wring it out of something later on once the interface is fixed.</p></blockquote><p>作者认为，新的设计要避免 submission 和 completion 事件在内核和用户空间之间的复制，也要避免 indirection，所以他由浅及深得出了下面几点：</p><ol><li>内核和用户空间需要 share 这些结构</li><li>因此，这些结构应该在内核和用户的共享内存中</li><li>因此，必须要去维护这里面的同步关系</li><li>如果要用锁，那么就肯定会有系统调用，系统调用肯定 overhead 就大了</li><li>因此，single producer single consumer ring buffer 是适合的</li></ol><p>考虑到对于 submission 事件，用户是生产者，内核是消费者；而 completion 事件则相反。所以需要两个队列：SQ 和 CQ。</p><h2 id="4-1-DATA-STRUCTURES"><a href="#4-1-DATA-STRUCTURES" class="headerlink" title="4.1 DATA STRUCTURES"></a>4.1 DATA STRUCTURES</h2><p>cqe 的后缀表示 Completion Queue Event。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">io_uring_cqe</span> &#123;</span></span><br><span class="line">   <span class="comment">// 从 submission 中透传过来</span></span><br><span class="line">   __u64 user_data;</span><br><span class="line">   __s32 res;</span><br><span class="line">   __u32 flags;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>sqe 则复杂很多</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 
class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">io_uring_sqe</span> &#123;</span></span><br><span class="line">   <span class="comment">// 操作类型，例如 IORING_OP_READV 表示向量读</span></span><br><span class="line">   __u8 opcode;</span><br><span class="line">   __u8 flags;</span><br><span class="line">   __u16 ioprio;</span><br><span class="line">   __s32 fd;</span><br><span class="line">   __u64 off;</span><br><span class="line">   <span class="comment">// 指向内存地址，如果是向量读写，则指向一个 iovec array 的地址</span></span><br><span class="line">   __u64 addr;</span><br><span class="line">   <span class="comment">// 表示长度，或者 iovec array 的长度</span></span><br><span class="line">   __u32 len;</span><br><span class="line">   <span class="keyword">union</span> &#123;</span><br><span class="line">      <span class="keyword">__kernel_rwf_t</span> rw_flags;</span><br><span class="line">      __u32 fsync_flags;</span><br><span class="line">      __u16 poll_events;</span><br><span class="line">      __u32 sync_range_flags;</span><br><span class="line">      __u32 msg_flags;   </span><br><span class="line">   &#125;;</span><br><span class="line">   __u64 user_data;</span><br><span class="line">   <span 
class="keyword">union</span> &#123;</span><br><span class="line">      __u16 buf_index;</span><br><span class="line">      <span class="comment">// 64 bytes 对齐</span></span><br><span class="line">      __u64 __pad2[<span class="number">3</span>];</span><br><span class="line">   &#125;;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><h2 id="4-2-COMMUNICATION-CHANNEL"><a href="#4-2-COMMUNICATION-CHANNEL" class="headerlink" title="4.2 COMMUNICATION CHANNEL"></a>4.2 COMMUNICATION CHANNEL</h2><p>SQ 和 CQ 的 indexing 是不太一样的，先从简单的 CQ 开始。</p><p>cqe 是一个内核和用户共享的 ring buffer，内核写会更新 tail，用户读会更新 head。ring buffer 的大小是 2 的幂，它的好处我在 <a href="/2018/07/23/redis_learn_object/">Redis底层对象实现原理分析</a>中有所解析。</p><p>如下所示，head 是可以自然溢出的。当然，正如我在 <a href="/2017/12/05/libutp%E6%BA%90%E7%A0%81%E7%AE%80%E6%9E%90/">libutp源码简析</a>或者<a href="https://github.com/calvinneo/atp" target="_blank" rel="noopener">ATP</a>中的实现那样，当 tail 比 head 小的时候，我们也可以认为发生了溢出。<br><code>cqring-&gt;cqes</code> 是被共享的结构。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">unsigned</span> head;</span><br><span class="line">head = cqring-&gt;head;</span><br><span class="line">read_barrier();</span><br><span class="line"><span class="keyword">if</span> (head != cqring-&gt;tail) &#123;</span><br><span class="line">   <span class="class"><span class="keyword">struct</span> <span class="title">io_uring_cqe</span> *<span 
class="title">cqe</span>;</span></span><br><span class="line">   <span class="keyword">unsigned</span> index;</span><br><span class="line">   index = head &amp; (cqring-&gt;mask);</span><br><span class="line">   cqe = &amp;cqring-&gt;cqes[index];</span><br><span class="line">   <span class="comment">/* process completed cqe here */</span></span><br><span class="line">   ...</span><br><span class="line">   <span class="comment">/* we've now consumed this entry */</span></span><br><span class="line">   head++;</span><br><span class="line">&#125;</span><br><span class="line">cqring-&gt;head = head;</span><br><span class="line">write_barrier();</span><br></pre></td></tr></table></figure><p>SQ 这边，就是用户生产，内核消费了。之前说到，SQ 的 indexing 不一样，它是有个 indirection 的。submission 的 ring buffer 中存放了 index，索引到 sqe 中的位置。例如下面的例子中，提交顺序是：sqe5 → sqe2 → sqe3。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">SQ array: [5, 2, 3]</span><br><span class="line">SQEs:     [sqe0, sqe1, sqe2, sqe3, sqe4, sqe5]</span><br></pre></td></tr></table></figure><p>在文章中，作者提出一个好处是可以将 request units 放到 internal structure 中，我理解就是后面看到的自定义的 <code>app_sq_ring</code>。另外，也能允许在一个操作中提交多个 sqe。我理解就是如下代码所示，先 fill sqe，再写 array 的操作。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">io_uring_sqe</span> *<span class="title">sqe</span>;</span></span><br><span 
class="line"><span class="keyword">unsigned</span> tail, index;</span><br><span class="line">tail = sqring-&gt;tail;</span><br><span class="line">index = tail &amp; (*sqring-&gt;ring_mask);</span><br><span class="line">sqe = &amp;sqring-&gt;sqes[index];</span><br><span class="line"><span class="comment">/* this call fills in the sqe entries for this IO */</span></span><br><span class="line">init_io(sqe);</span><br><span class="line"><span class="comment">/* fill the sqe index into the SQ ring array */</span></span><br><span class="line">sqring-&gt;<span class="built_in">array</span>[index] = index;</span><br><span class="line">tail++;</span><br><span class="line">write_barrier();</span><br><span class="line">sqring-&gt;tail = tail;</span><br><span class="line">write_barrier();</span><br></pre></td></tr></table></figure><p>只要 sqe 被内核消费了，application 就可以复用 sqe entry，即使内核还没有完全处理完毕，内核会在需要的时候复制这个结构。</p><p>这样，sqe 的生命周期就比较短，而 application 可能会发送更多的 submission，从而导致 CQ ring 可能溢出。所以默认下的 CQ ring 的大小是 SQ ring 的两倍。</p><p>Completion events 可能以任意顺序到达，它和 submission 的顺序是没有关系的。SQ 和 CQ 两个 ring 是独立运行的。但是每个 submission 事件和每个 completion 事件都能一一对应。</p><h1 id="5-0-io-uring-interface"><a href="#5-0-io-uring-interface" class="headerlink" title="5.0 io_uring interface"></a>5.0 io_uring interface</h1><p>下面介绍的是 io_uring 的“裸”接口，即 system call。</p><h2 id="io-uring-setup"><a href="#io-uring-setup" class="headerlink" title="io_uring_setup"></a>io_uring_setup</h2><p>entries 的取值是 1..=4096，表示 sqe 的数量，必须是 2 的幂。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">io_uring_setup</span><span class="params">(<span class="keyword">unsigned</span> entries, struct io_uring_params *params)</span></span>;</span><br></pre></td></tr></table></figure><p>params 如下</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">io_uring_params</span> &#123;</span></span><br><span class="line">   <span class="comment">// 由内核填写，表示支持多少个 sqe</span></span><br><span class="line">   __u32 sq_entries;</span><br><span class="line">   <span class="comment">// 由内核填写，表示支持多少个 cqe</span></span><br><span class="line">   __u32 cq_entries;</span><br><span class="line">   __u32 flags;</span><br><span class="line">   __u32 sq_thread_cpu;</span><br><span class="line">   __u32 sq_thread_idle;</span><br><span class="line">   __u32 resv[<span class="number">5</span>];</span><br><span class="line">   <span class="class"><span class="keyword">struct</span> <span class="title">io_sqring_offsets</span> <span class="title">sq_off</span>;</span></span><br><span class="line">   <span class="class"><span class="keyword">struct</span> <span class="title">io_cqring_offsets</span> <span class="title">cq_off</span>;</span></span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">io_sqring_offsets</span> &#123;</span></span><br><span class="line"> 
  __u32 head; <span class="comment">/* offset of ring head */</span></span><br><span class="line">   __u32 tail; <span class="comment">/* offset of ring tail */</span></span><br><span class="line">   __u32 ring_mask; <span class="comment">/* ring mask value */</span></span><br><span class="line">   __u32 ring_entries; <span class="comment">/* entries in ring */</span></span><br><span class="line">   __u32 flags; <span class="comment">/* ring flags */</span></span><br><span class="line">   __u32 dropped; <span class="comment">/* number of sqes not submitted */</span></span><br><span class="line">   __u32 <span class="built_in">array</span>; <span class="comment">/* sqe index array */</span></span><br><span class="line">   __u32 resv1;</span><br><span class="line">   __u64 resv2;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>io_uring_setup 返回的 int，实际上是一个 fd。如之前所说，对这个 fd 做 mmap，就能得到内核和用户共享的内存。而 sq_off 和 cq_off 则给出了 SQ 和 CQ 的各个字段在这块共享内存中的偏移。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">define</span> IORING_OFF_SQ_RING          0ULL</span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> IORING_OFF_CQ_RING  0x8000000ULL <span class="comment">// 偏移 128MB 处是 CQ ring</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> IORING_OFF_SQES    0x10000000ULL</span></span><br></pre></td></tr></table></figure><p>用户可以自定义 sq ring 的结构，这个结构中的每个字段都是一个指向共享内存中对应位置的指针</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span 
class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">app_sq_ring</span> &#123;</span></span><br><span class="line">   <span class="keyword">unsigned</span> *head;</span><br><span class="line">   <span class="keyword">unsigned</span> *tail;</span><br><span class="line">   <span class="keyword">unsigned</span> *ring_mask;</span><br><span class="line">   <span class="keyword">unsigned</span> *ring_entries;</span><br><span class="line">   <span class="keyword">unsigned</span> *flags;</span><br><span class="line">   <span class="keyword">unsigned</span> *dropped;</span><br><span class="line">   <span class="keyword">unsigned</span> *<span class="built_in">array</span>;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>如下面的 setup 所示，可以看到自定义的 sring 是如何通过 ptr 和 sq_off 组装起来的。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">struct app_sq_ring <span class="title">app_setup_sq_ring</span><span class="params">(<span class="keyword">int</span> ring_fd, struct io_uring_params *p)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">   <span class="class"><span class="keyword">struct</span> <span class="title">app_sq_ring</span> <span class="title">sqring</span>;</span></span><br><span class="line">   <span 
class="keyword">void</span> *ptr;</span></span><br><span class="line">   ptr = mmap(<span class="literal">NULL</span>, p-&gt;sq_off.<span class="built_in">array</span> + p-&gt;sq_entries * <span class="keyword">sizeof</span>(__u32), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,</span><br><span class="line">   ring_fd, IORING_OFF_SQ_RING);</span><br><span class="line">   sqring.head = ptr + p-&gt;sq_off.head;</span><br><span class="line">   sqring.tail = ptr + p-&gt;sq_off.tail;</span><br><span class="line">   sqring.ring_mask = ptr + p-&gt;sq_off.ring_mask;</span><br><span class="line">   sqring.ring_entries = ptr + p-&gt;sq_off.ring_entries;</span><br><span class="line">   sqring.flags = ptr + p-&gt;sq_off.flags;</span><br><span class="line">   sqring.dropped = ptr + p-&gt;sq_off.dropped;</span><br><span class="line">   sqring.<span class="built_in">array</span> = ptr + p-&gt;sq_off.<span class="built_in">array</span>;</span><br><span class="line">   <span class="keyword">return</span> sqring;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h2 id="io-uring-enter"><a href="#io-uring-enter" class="headerlink" title="io_uring_enter"></a>io_uring_enter</h2><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">io_uring_enter</span><span class="params">(</span></span></span><br><span class="line"><span class="function"><span class="params">   <span class="keyword">unsigned</span> <span class="keyword">int</span> fd, <span class="comment">// io_uring_setup 返回的那个 fd</span></span></span></span><br><span class="line"><span class="function"><span class="params">   <span class="keyword">unsigned</span> <span 
class="keyword">int</span> to_submit, <span class="comment">// tells the kernel that there are up to that amount of sqes ready to be consumed and submitted</span></span></span></span><br><span class="line"><span class="function"><span class="params">   <span class="keyword">unsigned</span> <span class="keyword">int</span> min_complete, <span class="comment">// asks the kernel to wait for completion of that amount of requests.</span></span></span></span><br><span class="line"><span class="function"><span class="params">   <span class="keyword">unsigned</span> <span class="keyword">int</span> flags, </span></span></span><br><span class="line"><span class="function"><span class="params">   <span class="keyword">sigset_t</span> *sig)</span></span>;</span><br></pre></td></tr></table></figure><p>可以发现，这个 syscall 可以同时 submit 和 wait for completion，这个也对应了本文作者之前提到的对 aio 的批评之一。</p><p>flags 中有一个 IORING_ENTER_GETEVENTS 标志位，设置它，则内核会 actively wait for min_complete events to be available。简单来说，如果希望 wait for completion，则必须设置这个 flag。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">define</span> IORING_ENTER_GETEVENTS (1U &lt;&lt; 0)</span></span><br></pre></td></tr></table></figure><h2 id="5-1-SQE-ORDERING"><a href="#5-1-SQE-ORDERING" class="headerlink" title="5.1 SQE ORDERING"></a>5.1 SQE ORDERING</h2><p>这一节主要讲了如何实现 fsync/fdatasync。</p><p>因为之前提到 SQ 和 CQ 是完全独立的，所以这样的机制需要额外的设计。并且因为写入是乱序的，所以我们在乎的是确定所有的写入已经完成。</p><p>io_uring 的机制是，支持 draining the submission side queue，直到之前的 completion 事件都已经结束。在这之前，application 会将后续写入入队。</p><p>通过 IOSQE_IO_DRAIN 这个 flag 来实现这个特性，它会 stall 住整个 SQ。因此，application 可以考虑使用多个 io_uring context，来保证不相关的写是并行的。</p><blockquote><p>io_uring supports draining the submission side queue until all previous completions have finished. 
This allows the application to queue the above mentioned sync operation and know that it will not start before all previous commands have completed.</p></blockquote><h2 id="5-2-LINKED-SQES"><a href="#5-2-LINKED-SQES" class="headerlink" title="5.2 LINKED SQES"></a>5.2 LINKED SQES</h2><p>所有连续的指定了 IOSQE_IO_LINK 的 io 请求会被串联起来执行，这些请求一定是按照顺序执行的。但是它们和没有指定 IOSQE_IO_LINK 这个 flag 的请求之间的关系是不确定的。</p><h2 id="5-3-TIMEOUT-COMMANDS"><a href="#5-3-TIMEOUT-COMMANDS" class="headerlink" title="5.3 TIMEOUT COMMANDS"></a>5.3 TIMEOUT COMMANDS</h2><h1 id="6-0-Memory-ordering"><a href="#6-0-Memory-ordering" class="headerlink" title="6.0 Memory ordering"></a>6.0 Memory ordering</h1><p>在 <a href="/2017/12/28/Concurrency-Programming-Compare/">并发编程重要概念及比较</a> 中，我们知道 memory order 主要是考虑读-写和写-写问题，如下所示：</p><blockquote><p>read_barrier(): Ensure previous writes are visible before doing subsequent memory reads.<br>write_barrier(): Order this write after previous writes.</p></blockquote><p>我们也知道，不同的 CPU 架构的乱序执行逻辑是不一样的，所以这里只是讨论概念。</p><p>考虑用户侧写入一个 sqe，并且通知 kernel 可以去消费了。这就包含了两个过程：</p><ul><li>填写 sqe 中的字段，并且将 sqe index 写入 SQ ring array</li><li>更新 SQ ring 队列的 tail</li></ul><p>这个操作可以简化成下面的伪代码，每一行代表一个内存操作。如果没有合适的 memory order，CPU 是有理由进行乱序执行的。也就是说，无法保证 write 7 是在最后执行的。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">1: sqe→opcode = IORING_OP_READV;</span><br><span class="line">2: sqe→fd = fd;</span><br><span class="line">3: sqe→off = 0;</span><br><span class="line">4: sqe→addr = &amp;iovec;</span><br><span class="line">5: sqe→len = 1;</span><br><span class="line">6: sqe→user_data = some_value;</span><br><span class="line">7: sqring→tail = sqring→tail + 1;</span><br></pre></td></tr></table></figure><p>所以，需要添加如下的 write 
barrier</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">1: sqe→opcode = IORING_OP_READV;</span><br><span class="line">2: sqe→fd = fd;</span><br><span class="line">3: sqe→off = 0;</span><br><span class="line">4: sqe→addr = &amp;iovec;</span><br><span class="line">5: sqe→len = 1;</span><br><span class="line">6: sqe→user_data = some_value;</span><br><span class="line"> write_barrier(); /* ensure previous writes are seen before tail write */</span><br><span class="line">7: sqring→tail = sqring→tail + 1;</span><br><span class="line"> write_barrier(); /* ensure tail write is seen */</span><br></pre></td></tr></table></figure><h1 id="7-0-liburing-library"><a href="#7-0-liburing-library" class="headerlink" title="7.0 liburing library"></a>7.0 liburing library</h1><p>通过这个库，可以：</p><ul><li>不需要写一堆 boiler plate code</li><li>不需要考虑 memory ordering 的问题</li><li>不需要考虑自己维护 ring buffer 的问题</li></ul><h2 id="7-1-LIBURING-IO-URING-SETUP"><a href="#7-1-LIBURING-IO-URING-SETUP" class="headerlink" title="7.1 LIBURING IO_URING SETUP"></a>7.1 LIBURING IO_URING SETUP</h2><h1 id="8-0-Advanced-use-cases-and-features"><a href="#8-0-Advanced-use-cases-and-features" class="headerlink" title="8.0 Advanced use cases and features"></a>8.0 Advanced use cases and features</h1><h2 id="8-1-FIXED-FILES-AND-BUFFERS"><a href="#8-1-FIXED-FILES-AND-BUFFERS" class="headerlink" title="8.1 FIXED FILES AND BUFFERS"></a>8.1 FIXED FILES AND BUFFERS</h2><h2 id="8-2-POLLED-IO"><a href="#8-2-POLLED-IO" class="headerlink" title="8.2 POLLED IO"></a>8.2 POLLED IO</h2><h2 id="8-3-KERNEL-SIDE-POLLING"><a href="#8-3-KERNEL-SIDE-POLLING" class="headerlink" title="8.3 KERNEL SIDE 
POLLING"></a>8.3 KERNEL SIDE POLLING</h2><h1 id="9-0-Performance"><a href="#9-0-Performance" class="headerlink" title="9.0 Performance"></a>9.0 Performance</h1><h1 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h1><ul><li><a href="https://man7.org/linux/man-pages/man2/io_submit.2.html" target="_blank" rel="noopener">https://man7.org/linux/man-pages/man2/io_submit.2.html</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;通过 &lt;a href=&quot;https://kernel.dk/io_uring.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://kernel.dk/io_uring.pdf&lt;/a&gt; 简单学习下 io_uring。&lt;/p&gt;</summary>
    
    
    
    
    <category term="Linux" scheme="http://www.calvinneo.com/tags/Linux/"/>
    
    <category term="FileSystem" scheme="http://www.calvinneo.com/tags/FileSystem/"/>
    
  </entry>
  
  <entry>
    <title>LLM 基础概念和核心问题整理</title>
    <link href="http://www.calvinneo.com/2025/10/25/on-llm/"/>
    <id>http://www.calvinneo.com/2025/10/25/on-llm/</id>
    <published>2025-10-24T18:20:13.000Z</published>
    <updated>2026-01-09T18:08:41.278Z</updated>
    
    <content type="html"><![CDATA[<p>主要关注：</p><ul><li>LLM 的基础原理</li><li>KVCache</li></ul><a id="more"></a><h1 id="Attention-机制"><a href="#Attention-机制" class="headerlink" title="Attention 机制"></a>Attention 机制</h1><p>NLP 对 token 序列 X 的三种编码方式：</p><ul><li><p>RNN<br>  RNN 是递归的结构，所以只能串行计算。</p>  <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">y_t = f(y_&#123;t-1&#125;, x_t)</span><br></pre></td></tr></table></figure></li><li><p>CNN<br>  CNN 能够并行计算，但是因为引入了窗口，所以只能看到局部信息。</p>  <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">y_t = f(x_&#123;t-1&#125;, x_t, x_&#123;t+1&#125;)</span><br></pre></td></tr></table></figure></li><li><p>Attention</p></li></ul><h2 id="Q、K、V"><a href="#Q、K、V" class="headerlink" title="Q、K、V"></a>Q、K、V</h2><p>从定义上看，对于 token 流</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">x1, x2, x3, ... 
xn</span><br></pre></td></tr></table></figure><p>每个 xi 通过三组线性变换生成：</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">Qi = xi * Wq</span><br><span class="line">Ki = xi * Wk</span><br><span class="line">Vi = xi * Wv</span><br></pre></td></tr></table></figure><p>考虑下面的句子</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">The animal didn’t cross the street because it was too tired.</span><br></pre></td></tr></table></figure><p>句子中的每一个 token，都有一个自己的 Q。用 <code>Wq</code> 可以提取出这个 Q，如下：</p><ul><li>animal → 它在问：后面有没有补充信息？</li><li>because → 它在问：因果关系是什么？</li><li>it → 它在问：我指的是谁？</li><li>tired → 它在问：谁在 tired？</li></ul><p>对于 K，则告诉了这个 token 可以回答什么样的 Q：</p><ul><li>animal → 我是一个名词、可能是指代目标</li><li>street → 我是一个地点名词</li><li>because → 我是因果连接词</li><li>tired → 我是状态形容词</li><li>cross → 我是动作</li></ul><p>对于 V，承载了语义信息的本体：</p><ul><li>animal → 动物这个实体的语义</li><li>street → 街道的概念</li><li>tired → 疲劳的状态语义</li><li>cross → 穿越动作</li></ul><h2 id="Self-Attention"><a href="#Self-Attention" class="headerlink" title="Self-Attention"></a>Self-Attention</h2><p>Self-Attention 指的是 Q、K、V 都来自同一组 token 的 attention。即</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">X → Q = XWq</span><br><span class="line">X → K = XWk</span><br><span class="line">X → V = XWv</span><br></pre></td></tr></table></figure><p>然后</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">Attention(Q, K, V)</span><br></pre></td></tr></table></figure><p>为什么叫 
Self？</p><ul><li>不是去外部文档查</li><li>不是去另一段文本查</li><li>而是在自己这段序列内部相互对齐</li></ul><p>对应的是 Cross-Attention：</p><ul><li>Q 来自 Decoder</li><li>K/V 来自 Encoder</li></ul><p>容易发现，Cross-Attention 更适合翻译或者对齐另一段文本。而 Self-Attention 更适合理解一句话内部关系。</p><h2 id="Multi-Head-Attention"><a href="#Multi-Head-Attention" class="headerlink" title="Multi-Head Attention"></a>Multi-Head Attention</h2><p>如果一个 token 同时想问多种不同类型的问题怎么办？引入多组问题呗。所以上面的三个矩阵会变成三组矩阵。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">Wq¹ Wk¹ Wv¹</span><br><span class="line">Wq² Wk² Wv²</span><br><span class="line">Wq³ Wk³ Wv³</span><br><span class="line">...</span><br></pre></td></tr></table></figure><p>还是对上面的例子而言</p><p>对 it 这个 token：</p><ul><li>Head1 问“我指代谁？”，会关注 animal（指代）</li><li>Head2 问“是否有因果关系？”，会关注 because（因果）</li><li>Head3 问“我与哪个动词相关？”，会关注 cross（动作）</li></ul><h1 id="KVCache"><a href="#KVCache" class="headerlink" title="KVCache"></a>KVCache</h1><h2 id="Why"><a href="#Why" class="headerlink" title="Why"></a>Why</h2><p>Token 是模型处理文本的最小离散单位。所以 LLM 并不是直接处理文字，而是直接处理 token。Token 是通过分词器从文本切出来的子串单位。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">&quot;Hello world&quot; → [15496, 2159]</span><br></pre></td></tr></table></figure><p>但是</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">&quot;refund&quot; = 1 token</span><br><span class="line">&quot;refunding&quot; = [&quot;refund&quot;, &quot;ing&quot;]</span><br></pre></td></tr></table></figure><p>不同的模型能够接受不同的上下文长度，因此，它们的 KVCache 也要更大：</p><ul><li>GPT-3.5 4k tokens</li><li>GPT-4   8k / 32k</li><li>Claude  100k</li></ul><p>那么 token 是不是特指用户 prompt输入的 token 
或者模型输出的 token 呢？其实根据下面的自回归生成，这两个是一个东西。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">prompt1 → prompt2 → prompt3 → output1 → output2 ...</span><br></pre></td></tr></table></figure><p>自回归生成：模型按顺序生成 token，每个 token 都只依赖之前已经生成的 token。<br>例如，下面的 token 序列中，x1 到 x3 是用户的 prompt 输入</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">x1 → x2 → x3 → x4 → ... → xT</span><br></pre></td></tr></table></figure><p>则 xT 生成的方式是</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">P(x_T | x_1, x_2, ..., x_&#123;T-1&#125;)</span><br></pre></td></tr></table></figure><p>由此可见，自回归生成导致推理是串行的。为什么要自回归生成呢？原因比较深入，可以理解为：</p><ul><li>语言本质是序列，唯一通用可行的分解方式就是 chain rule</li><li>自回归训练极其稳定<br>  输入是前缀，目标是预测下一个 token，loss 是交叉熵。不需要强化学习。</li><li>从历史演化角度来看，n-gram、RNN、LSTM 等都是自回归的</li></ul><p>因为自回归生成，所以导致了 KVCache 的出现。KVCache 把 O(n²) 变成 O(n)。</p><p>KVCache 具有如下的形式：</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">KVCache[layer][token_index] = (K, V)</span><br></pre></td></tr></table></figure><ul><li><p>layer 是 Transformer 架构的层<br>  现在的 LLM 大都是 Decoder only 的架构，所以这里的层如下所示。</p>  <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">Input 
tokens</span><br><span class="line">  ↓</span><br><span class="line">Embedding</span><br><span class="line">  ↓</span><br><span class="line">Decoder Block 1</span><br><span class="line">  ↓</span><br><span class="line">Decoder Block 2</span><br><span class="line">  ↓</span><br><span class="line">...</span><br><span class="line">  ↓</span><br><span class="line">Decoder Block N</span><br><span class="line">  ↓</span><br><span class="line">LM Head</span><br></pre></td></tr></table></figure><p>  不同层的 schema 都不同：</p><ul><li>第一层：按字形索引</li><li>中间层：按语法索引</li><li>高层：按语义索引</li></ul><p>  每一层内部包含：</p><ul><li>Masked Self-Attention<br>  让每个 token 从历史 token 中选择性地读取信息。<br>  当前 token 只能看到自己和之前的 token，不能看到未来的 token。</li><li>FFN<br>  这里就是前馈神经网络，目的是在单个 token 维度上做语义升维和重映射。可以看成是在理解输入的 token。</li><li>Residual / Norm</li></ul></li><li><p>token_index 表示这是第几个 token<br>  因为 attention 在第 t 步需要：当前 token 的 Q、对比所有历史 token 的 K、加权读取所有历史 token 的 V。所以这些 K 和 V 需要按照 token_index 来存储。</p></li><li><p>K 和 V<br>  K 是我提供什么信息给别人关注。<br>  V 是别人关注我时能读到什么内容。</p></li><li><p>为什么不需要缓存 Q？<br>  因为 Q 只在当前步骤被使用。<br>  在第 t 步，会用 <code>Q_t</code> 去访问 <code>K_{0..t-1}</code>, <code>V_{0..t-1}</code>。但是在未来，不会再去访问 <code>Q_t</code> 了。</p></li></ul><p>如果没有 KVCache，每生成一个新 token 需要重新计算所有历史 token 的 K/V 复杂度是 O(n²)</p><h1 id="相关场景"><a href="#相关场景" class="headerlink" title="相关场景"></a>相关场景</h1><ul><li>训练 infra<br>  分布式训练、参数同步、checkpoint、通信优化<br>  DeepSpeed, Megatron-LM, FSDP, NCCL, ZeRO</li><li>推理 infra<br>  模型加载、KV Cache 管理、动态批处理、并发调度<br>  vLLM, TensorRT-LLM, TGI, Ray Serve</li><li>模型存储与加载<br>  权重分片、lazy loading、权重格式<br>  Safetensors, GGUF, Tensor Parallel</li><li>向量检索<br>  向量数据库、索引结构、量化 FAISS, Milvus, ScaNN, HNSW, IVF</li><li>资源编排与调度<br>  相比传统 k8s 多了 GPU 调度、混部、弹性伸缩。<br>  相关技术：K8S, Ray, RunPod, vGPU</li><li>数据管线与特征存储<br>  数据清洗、分片、版本控制<br>  Petastorm, Delta Lake, Feature Store</li></ul><h1 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h1><ul><li><a 
href="https://zh.d2l.ai/chapter_attention-mechanisms/attention-cues.html" target="_blank" rel="noopener">https://zh.d2l.ai/chapter_attention-mechanisms/attention-cues.html</a></li><li><a href="https://transformers.run/c1/attention/" target="_blank" rel="noopener">https://transformers.run/c1/attention/</a></li><li><a href="https://zhuanlan.zhihu.com/p/338817680" target="_blank" rel="noopener">https://zhuanlan.zhihu.com/p/338817680</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;主要关注：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LLM 的基础原理&lt;/li&gt;
&lt;li&gt;KVCache&lt;/li&gt;
&lt;/ul&gt;</summary>
    
    
    
    
    <category term="机器学习" scheme="http://www.calvinneo.com/tags/机器学习/"/>
    
    <category term="LLM" scheme="http://www.calvinneo.com/tags/LLM/"/>
    
  </entry>
  
  <entry>
    <title>Persistent data structures</title>
    <link href="http://www.calvinneo.com/2025/10/06/persistent-data-structures/"/>
    <id>http://www.calvinneo.com/2025/10/06/persistent-data-structures/</id>
    <published>2025-10-05T18:20:13.000Z</published>
    <updated>2025-11-21T17:13:07.931Z</updated>
    
    <content type="html"><![CDATA[<p>在 rust 中，immutable 的数据结构的性质是非常好的。在大部分函数式语言中，都不允许存在 mutable 的数据。</p><p>如果要在不可变数据结构上进行修改，就需要 clone 一份出来。因此：</p><ul><li>对于一些较大的结构，希望能够尽量复用</li><li>如果此时只有一份引用，则可以直接获取 mut 引用就地修改</li></ul><p>所以有了 Persistent data structures 的概念：</p><ul><li>每一次修改该结构，都会保留之前的版本</li><li>历史的版本可以被查询</li><li>如果历史版本的数据也支持修改，则称为 Full persistence，否则称为 Partial persistence</li></ul><a id="more"></a><h1 id="实现方案"><a href="#实现方案" class="headerlink" title="实现方案"></a>实现方案</h1><h2 id="Copy-on-Write"><a href="#Copy-on-Write" class="headerlink" title="Copy on Write"></a>Copy on Write</h2><p>用一个数组存放所有的历史版本，非常 bruteforce。</p><h2 id="Fat-node"><a href="#Fat-node" class="headerlink" title="Fat node"></a>Fat node</h2><p>为每一个 field 维护历史记录。例如</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">Node</span> &#123;</span></span><br><span class="line">    <span class="keyword">int</span> value;</span><br><span class="line">    Node* left;</span><br><span class="line">    Node* right;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>会变成</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">FatNode</span> &#123;</span></span><br><span class="line">    <span class="built_in">vector</span>&lt;Pair&lt;version, value&gt;&gt; value_history;</span><br><span class="line">    <span class="built_in">vector</span>&lt;Pair&lt;version, Node*&gt;&gt; 
left_history;</span><br><span class="line">    <span class="built_in">vector</span>&lt;Pair&lt;version, Node*&gt;&gt; right_history;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>查询的过程就有点像是 MVCC 了。给定一个 version，去 lower_bound 找到最大的小于 version 的修改。</p><p>容易看到，Fat node 不支持 Full persistence。</p><h2 id="Split"><a href="#Split" class="headerlink" title="Split"></a>Split</h2><p>Fat Node 不能无限制增长，否则：</p><ul><li>历史太长</li><li>查询复杂度上升</li><li>节点缓存局部性变差</li></ul><p>因此，需要将节点 split 为两个节点：</p><ul><li>新节点记录最新版本的值</li><li>老节点保留早期历史</li><li>结构中指向该节点的指针，也在相应版本中被更新为指向新的节点</li></ul><h2 id="Path-copying"><a href="#Path-copying" class="headerlink" title="Path copying"></a>Path copying</h2><h1 id="常见的结构实现"><a href="#常见的结构实现" class="headerlink" title="常见的结构实现"></a>常见的结构实现</h1><h2 id="List"><a href="#List" class="headerlink" title="List"></a>List</h2><h2 id="Vec"><a href="#Vec" class="headerlink" title="Vec"></a>Vec</h2><h1 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h1><ul><li><a href="https://github.com/orium/rpds" target="_blank" rel="noopener">https://github.com/orium/rpds</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;在 rust 中，immutable 的数据结构的性质是非常好的。在大部分函数式语言中，都不允许存在 mutable 的数据。&lt;/p&gt;
&lt;p&gt;如果要在不可变数据结构上进行修改，就需要 clone 一份出来。因此：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;对于一些较大的结构，希望能够尽量复用&lt;/li&gt;
&lt;li&gt;如果此时只有一份引用，则可以直接获取 mut 引用就地修改&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;所以有了 Persistent data structures 的概念：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;每一次修改该结构，都会保留之前的版本&lt;/li&gt;
&lt;li&gt;历史的版本可以被查询&lt;/li&gt;
&lt;li&gt;如果历史版本的数据也支持修改，则称为 Full persistence，否则称为 Partial persistence&lt;/li&gt;
&lt;/ul&gt;</summary>
    
    
    
    
    <category term="Rust" scheme="http://www.calvinneo.com/tags/Rust/"/>
    
    <category term="数据结构" scheme="http://www.calvinneo.com/tags/数据结构/"/>
    
  </entry>
  
  <entry>
    <title>Zero-Copy 技术</title>
    <link href="http://www.calvinneo.com/2025/09/30/zero-copy/"/>
    <id>http://www.calvinneo.com/2025/09/30/zero-copy/</id>
    <published>2025-09-30T15:07:22.000Z</published>
    <updated>2026-01-14T18:35:35.431Z</updated>
    
    <content type="html"><![CDATA[<p>介绍 Linux 中的零拷贝技术。从 <a href="/2025/03/09/learn-fuse/">Fuse 学习</a> 中独立出来。</p><a id="more"></a><h1 id="read、write-接口"><a href="#read、write-接口" class="headerlink" title="read、write 接口"></a>read、write 接口</h1><p>从普通文件 read，涉及两次复制：</p><ul><li>从磁盘通过 DMA 读到内核的 page cache<br>  这里的 page cache 机制也是一种 kernel buffer，但专门提供给磁盘文件的。</li><li>从内核的 page cache 复制到 user buffer</li></ul><p>从套接口读数据：</p><ul><li>从网卡通过 DMA 直接写入 kernel buffer</li><li>从 kernel buffer 复制到 user buffer</li></ul><p>注意，在使用 DMA 之前，磁盘读出来的数据会放到一个寄存器里面，然后通过中断通知 CPU 把数据写到临时的内存中攒批，最后写到 page cache 中。但是该方式性能太差，早已经淘汰了。</p><p>读数据过程：</p><ul><li>调用 read() 函数陷入内核，第一次 context switch</li><li>DMA 控制器将数据从磁盘拷贝到 kernel buffer，这是第一次 DMA 拷贝</li><li>CPU 将数据从 kernel buffer 复制到 user buffer，这是第一次 CPU 拷贝</li><li>CPU 完成拷贝之后，read() 函数返回到用户态，第二次 context switch</li></ul><p>写过程类似。</p><h1 id="mmap"><a href="#mmap" class="headerlink" title="mmap"></a>mmap</h1><p>把 kernel space 的页映射到 user space，所以可以避免从 kernel space 到 user space 的一次复制。<br>关于 mmap 可以见 <a href="/2025/01/03/memory-context-knowledge/">内存领域知识</a>。</p><h1 id="sendfile"><a href="#sendfile" class="headerlink" title="sendfile"></a>sendfile</h1><h2 id="原始-sendfile"><a href="#原始-sendfile" class="headerlink" title="原始 sendfile"></a>原始 sendfile</h2><p>sendfile 将数据从磁盘读到内核的 page cache，然后将 page cache 复制到 socket 的 buffer 中。</p><p>它的好处是减少了 syscall 的次数。将 read + write 或者 mmap + write 打包了。<br>但是，仍然需要 2 次 DMA 拷贝和 1 次 CPU 拷贝。</p><h2 id="sendfile-DMA-优化"><a href="#sendfile-DMA-优化" class="headerlink" title="sendfile + DMA 优化"></a>sendfile + DMA 优化</h2><p>将从 page cache 到 socket buffer 的那一次 CPU 拷贝去掉了。DMA 可以直接从 page cache 拷贝数据到网卡里面。</p><h1 id="splice"><a href="#splice" class="headerlink" title="splice"></a>splice</h1><p>限制是 fd_in 和 fd_out 中，至少有一个是 pipe：</p><ul><li>如果 fd_in 是 pipe，那么 off_in 必须是 NULL</li><li>如果 fd_in 不是 pipe，且 off_in 是 NULL，那么 bytes are read from fd_in starting from the file offset, and the file offset is adjusted appropriately.</li><li>如果 fd_in 不是 pipe，且 
off_in 不是 NULL，off_in must point to a buffer which specifies the starting offset from which bytes will be read from fd_in; in this case, the file offset of fd_in is not changed, and the offset pointed to by off_in is adjusted appropriately instead.</li></ul><p>这里解释一下什么是 linux 中的管道：</p><ul><li>匿名管道（anonymous pipe）<br>  由父进程创建，用在具有亲缘关系的进程之间通信。<br>  通过 pipe() 系统调用创建，返回一对文件描述符：一个用于写，一个用于读。<br>  只存在于内存中，它不是一个磁盘上的文件，不能用 ls 查看，也没有 inode 号。</li><li>命名管道（named pipe，也叫 FIFO）<br>  具有名字的管道，可以存在于文件系统中，有路径。文件类型是 p，代表 pipe。<br>  通过 mkfifo 命令或者 mkfifo() 系统调用创建。<br>  可以实现非亲缘进程之间的通信。</li></ul><p>所有的匿名管道都支持 splice，通常借助匿名管道来实现 zero copy。此时，pipefd 就起到了中转管道的作用，它连接了两个彼此之间不支持零拷贝的 fd。我觉得是一个比较有意思的设计，通过匿名管道的中介，减少了不同 fd 之间实现相互 zero copy 的复杂度。</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">splice(file_fd, <span class="literal">NULL</span>, pipefd[<span class="number">1</span>], <span class="literal">NULL</span>, len, <span class="number">0</span>);</span><br><span class="line">splice(pipefd[<span class="number">0</span>], <span class="literal">NULL</span>, socket_fd, <span class="literal">NULL</span>, len, <span class="number">0</span>);</span><br></pre></td></tr></table></figure><p>一些命名管道也支持 splice，但是可能只是可读写，非零拷贝中转。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">ssize_t</span> splice(<span class="keyword">int</span> fd_in, <span class="keyword">off_t</span> *_Nullable off_in,</span><br><span class="line">              <span class="keyword">int</span> fd_out, <span class="keyword">off_t</span> *_Nullable off_out,</span><br><span class="line">              <span class="keyword">size_t</span> size, <span class="keyword">unsigned</span> <span class="keyword">int</span> 
flags);</span><br></pre></td></tr></table></figure><p>Flag 如下：</p><ul><li>SPLICE_F_MOVE<br>  Attempt to move pages instead of copying. 这里的 move 指的是内核页缓存中的物理页面的引用在 fd 之间进行转移。而不需要读出、复制到 user space、写入这样的流程了。<br>  注意，这个 flag 只是一个 hint。如果内核无法移动（例如 pipe buffer 并不指向整个页面），则还是需要复制。<br>  The initial implementation of this flag was buggy: therefore starting in Linux 2.6.21 it is a no-op (but is still permitted in a splice() call); in the future, a correct implementation may be restored.</li><li>SPLICE_F_NONBLOCK<br>  Do not block on I/O. This makes the splice pipe operations nonblocking, but splice() may nevertheless block because the file descriptors that are spliced to/from may block (unless they have the O_NONBLOCK flag set).</li><li>SPLICE_F_MORE<br>  More data will be coming in a subsequent splice. This is a helpful hint when the fd_out refers to a socket (see also<br>  the description of MSG_MORE in send(2), and the description of TCP_CORK in tcp(7)).</li><li>SPLICE_F_GIFT<br>  Unused for splice(); see vmsplice(2).</li></ul><h1 id="vmsplice"><a href="#vmsplice" class="headerlink" title="vmsplice"></a>vmsplice</h1><p>splice 主要是服务内核空间中的数据传输，原因是指定的都是 fd 或者 pipe，并不包含用户空间中内存的信息。<br>而 vmsplice 主要服务用户空间和管道之间的数据读写，它们都能实现零拷贝。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">define</span> _GNU_SOURCE         <span class="comment">/* See feature_test_macros(7) */</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;fcntl.h&gt;</span></span></span><br><span class="line"></span><br><span class="line"><span class="keyword">ssize_t</span> vmsplice(<span class="keyword">int</span> fd, <span class="keyword">const</span> struct iovec 
*iov,</span><br><span class="line">                <span class="keyword">size_t</span> nr_segs, <span class="keyword">unsigned</span> <span class="keyword">int</span> flags);</span><br></pre></td></tr></table></figure><p><code>iov</code> 是一个长度为 <code>nr_segs</code> 的数组，表示用户内存中的多段可能不连续的 buffer。</p><p>参数：</p><ul><li><p>SPLICE_F_MOVE<br>  Unused for vmsplice(); see splice(2).</p></li><li><p>SPLICE_F_NONBLOCK<br>  Do not block on I/O; see splice(2) for further details.</p></li><li><p>SPLICE_F_MORE<br>  Currently has no effect for vmsplice(), but may be implemented in the future; see splice(2).</p></li><li><p>SPLICE_F_GIFT<br>  The user pages are a gift to the kernel.<br>  表示用户程序不会修改这段 buffer，否则，page cache 和磁盘中的数据就可能不一致。<br>  将 pages gifting 给内核意味着后面的 splice SPLICE_F_MOVE 能够成功移动 pages。如果不指定，则后续的 splice SPLICE_F_MOVE 必须复制。<br>  数据必须要 page aligned。我理解这里指的是：</p><ul><li><code>iovec[i].iov_base</code> 需要对齐到页</li><li><code>iov_len</code> 需要是页大小的整数倍</li></ul><p>  如果不满足，则退化到 copy 的行为。</p></li></ul><h1 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h1><ul><li><a href="https://man7.org/linux/man-pages/man2/splice.2.html" target="_blank" rel="noopener">https://man7.org/linux/man-pages/man2/splice.2.html</a></li><li><a href="https://man7.org/linux/man-pages/man2/pipe.2.html" target="_blank" rel="noopener">https://man7.org/linux/man-pages/man2/pipe.2.html</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;介绍 Linux 中的零拷贝技术。从 &lt;a href=&quot;/2025/03/09/learn-fuse/&quot;&gt;Fuse 学习&lt;/a&gt; 中独立出来。&lt;/p&gt;</summary>
    
    
    
    
    <category term="Linux" scheme="http://www.calvinneo.com/tags/Linux/"/>
    
    <category term="FileSystem" scheme="http://www.calvinneo.com/tags/FileSystem/"/>
    
  </entry>
  
  <entry>
    <title>My Experience of Building a Hybrid Rust/C++ Project</title>
    <link href="http://www.calvinneo.com/2025/07/21/start-new-rust-project/"/>
    <id>http://www.calvinneo.com/2025/07/21/start-new-rust-project/</id>
    <published>2025-07-21T15:07:22.000Z</published>
    <updated>2026-01-29T19:11:06.591Z</updated>
    
<content type="html"><![CDATA[<p>Since April 2025, I have been actively contributing to a new Rust–C++ project. Through this work, I have gained many valuable insights. Although I cannot disclose most project details, there are numerous technical challenges worth discussing.</p><p>One of the most notable aspects of this project is that it has been developed alongside the rapid evolution of AI agents, which led us to encounter many pitfalls when practicing vibe coding.</p><a id="more"></a><h1 id="About-Vide-Coding-The-benefits-and-the-pitfalls"><a href="#About-Vide-Coding-The-benefits-and-the-pitfalls" class="headerlink" title="About Vibe Coding: The benefits and the pitfalls"></a>About Vibe Coding: The benefits and the pitfalls</h1><h2 id="Pitfalls-of-Vibe-Coding"><a href="#Pitfalls-of-Vibe-Coding" class="headerlink" title="Pitfalls of Vibe Coding"></a>Pitfalls of Vibe Coding</h2><p>In early 2025, at the initial stage of our project, one of our core contributors quickly prototyped a demo using Cursor, covering multiple modules such as the read scheduler, index writer, and meta service.</p><p>Traces of this early implementation can still be found in the following pull requests:</p><ul><li><a href="https://github.com/pingcap-inc/tici/pull/48" target="_blank" rel="noopener">https://github.com/pingcap-inc/tici/pull/48</a></li><li><a href="https://github.com/pingcap-inc/tici/pull/673" target="_blank" rel="noopener">https://github.com/pingcap-inc/tici/pull/673</a></li></ul><h2 id="Make-AI-agent-more-focused"><a href="#Make-AI-agent-more-focused" class="headerlink" title="Make the AI agent more focused"></a>Make the AI agent more focused</h2><h3 id="Why-focused-attention-matters"><a href="#Why-focused-attention-matters" class="headerlink" title="Why focused attention matters"></a>Why focused attention matters</h3><p>Agents have a limited attention budget. When a prompt blends background, implementation details, and review notes, attention is spread thin and the model drifts. 
Treating the prompt as code and slicing the work into sub goals keeps the highest-signal spec in focus, which improves determinism, reduces requirement misses, and lowers evaluation cost.</p><h3 id="How-to-draw-attention-of-AI-agent"><a href="#How-to-draw-attention-of-AI-agent" class="headerlink" title="How to draw the attention of an AI agent"></a>How to draw the attention of an AI agent</h3><p>We can treat prompts as code. This is not a new idea in the industry, but many teams apply it unevenly. A common practice is to store prompts as markdown or templates under version control, so they can be reviewed, diffed, and rolled back like any other artifact. Some teams go further and build a “prompt registry” or config service to version prompts outside the codebase, and pair it with evaluation suites that act like unit tests for prompts (golden outputs, A/B runs, regression checks). Others embed prompts directly in application code as constants, which makes deployment easy but tends to hide intent and lose reviewability. The shared direction is clear: treat prompts as first-class assets with explicit structure, reviews, and tests.</p><p>I wrote a PR, <a href="https://github.com/pingcap-inc/tici/pull/692/files" target="_blank" rel="noopener">https://github.com/pingcap-inc/tici/pull/692/files</a>, where the core file is <code>prompts/0001-gc-cdc.md</code>. That file is the prompt itself, and it is committed with the PR, so anyone can start a session by loading the same prompt. It becomes versioned, diffable, and reviewable like real code, and the team no longer depends on a hidden chat history. Also, we can divide the whole goal into several sub goals and let AI agents implement different sub goals sequentially or in parallel. 
And we don’t need to tell the AI the context every time.</p><p>I think this can effectively make the AI agent more focused on what it needs to do, so as to generate better code with fewer resources.</p><p>Reviewers can write feedback directly in the <code>Reviews</code> section of the prompt doc, and I can pull that back into the next round of vibe. This workflow makes context length much less of a concern because the prompt is the canonical spec. A minimal shape looks like this:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"># Goal</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">## Sub goal 1</span><br><span class="line"></span><br><span class="line">In this sub goal, you need to ...</span><br><span class="line"></span><br><span class="line">### Programming Style</span><br><span class="line"></span><br><span class="line">### Musts</span><br><span class="line"></span><br><span class="line">### Tests</span><br><span class="line"></span><br><span class="line">## Reviews</span><br></pre></td></tr></table></figure><h2 id="Rust-as-the-language-of-“Vide-Coding-Era”"><a href="#Rust-as-the-language-of-“Vide-Coding-Era”" class="headerlink" title="Rust as the language of the “Vibe Coding Era”"></a>Rust as the language of the “Vibe Coding Era”</h2><p>As Rust becomes increasingly adopted in the “vibe coding era”, it does offer stronger guarantees against concurrency and memory 
errors. However, it is still too early to say that Rust is THE ONE.</p><p>Such an AI-agent-native language will likely consist of at least three distinct sub-languages:</p><ul><li>One for expressing intent<br>  This is the most critical language, because developers need a more efficient way to understand what AI agents have actually done.</li><li>One for validating correctness<br>  This language corresponds to the intent language, and is designed to describe test workflows more efficiently. Developers can fine-tune this part of the code to guide AI agents to generate correct code for corner cases.</li><li>One for concrete implementation<br>  This language is responsible for the concrete implementation. Many AI agents, such as Codex, can already handle this layer well, as they rarely make trivial mistakes. Developers do not need to review this part of the code frequently, as other AI agents can handle the review instead.</li></ul><p>Only with this separation can we truly balance readability, robustness, and long-term maintainability. Furthermore, many of the current libraries can be rewritten to be more friendly to AI agents.</p><h2 id="Use-Skills"><a href="#Use-Skills" class="headerlink" title="Use Skills"></a>Use Skills</h2><p>I introduced several SKILLs into our new project. 
For example, <a href="https://gist.github.com/CalvinNeo/76811a9fbdd58d1bd271f17004051160" target="_blank" rel="noopener">ManualTest</a> enables AI agents to execute manual test cases automatically.</p><h1 id="Implement-tests"><a href="#Implement-tests" class="headerlink" title="Implement tests"></a>Implement tests</h1><h2 id="Hierarchy-of-tests"><a href="#Hierarchy-of-tests" class="headerlink" title="Hierarchy of tests"></a>Hierarchy of tests</h2><h3 id="The-problem"><a href="#The-problem" class="headerlink" title="The problem"></a>The problem</h3><p>Because each TiDB component is maintained in a separate repository, breaking changes in one component require coordinated adaptations across multiple repositories. Unfortunately, such adaptations cannot be performed atomically and are often non-trivial. While compilation flags or configuration options can sometimes be used to temporarily disable new features, this strategy is not always applicable. In particular, interface changes such as FFI definitions may break compatibility immediately. </p><p>In our project, an end-to-end (e2e) test starts a full cluster and asserts it from a client’s perspective, which in our case means sending SQL queries to the database service. Some of these tests are included in our CI pipeline. However, CI-based e2e tests cannot reliably detect adaptation issues. This creates a classic catch-22: resolving an adaptation problem requires updating all related components, yet the e2e tests cannot pass while you are still fixing the first component. As a result, most e2e tests are deferred to what we call the “daily tests”.</p><p>Nevertheless, we still need a subset of e2e tests in the CI pipeline. Although these tests may occasionally produce false positives due to compilation or adaptation issues, they provide valuable systematic checks to ensure that a new commit in one component does not break existing rules or behaviors. 
Deferring such checks to daily tests would be disastrous, as it makes bugs significantly harder to triage. When issues accumulate over time, the project can easily fall into a “bug jail,” where fixing new problems becomes increasingly expensive.</p><p>It is also worth noting that integration tests cannot practically detect all logical bugs. In many cases, module owners write integration tests mainly to verify that their own modules work with others, while overlooking the impact their changes may have on the system as a whole. This issue becomes even more critical when AI agents are used to refactor our code, as we need safeguards to ensure that unexpected behavior does not compromise the foundation of the project.</p><p>During the development and PoC stage of our project, several critical issues occurred because the tests were not correctly implemented, including:</p><ul><li>Module A uses the API of Module B in the wrong way. There is neither an integration test for Module A, nor is this scenario covered in the e2e tests.</li><li>Module C fails to verify a corner case, which is later caught by my <strong>embedded e2e test</strong> (introduced later). This kind of error is easy to ignore, because it passes all tests except one. However, that single test protects our system from an availability failure caused by a deadlock in Module C.</li><li>Another component changed its convention for constructing a field in an RPC request without informing us, which caused the system to malfunction at the SQL layer and made the issue difficult to investigate. This problem was also detected by my <strong>embedded e2e test</strong>.</li></ul><h3 id="The-layers-of-tests"><a href="#The-layers-of-tests" class="headerlink" title="The layers of tests"></a>The layers of tests</h3><p>We can organize the tests for Component A into several layers. 
Each layer targets specific categories of bugs, and issues detected at lower layers should not propagate to higher layers.</p><ul><li>Daily regression: full system tests for cross-component compatibility only. However, no bugs originating from Component A itself should reach this level. We must proactively investigate daily test failures to avoid falling into a bug jail.</li><li>E2e tests with real components: We may occasionally allow skipping tests at this layer, because cross-component checks can fail due to upstream changes, as discussed earlier. However, bugs that originate within Component A itself must not propagate to this level.</li><li>Integration tests:<ul><li>Embedded e2e: at the component boundary, using mocked RPC/status/FFI calls; required when interface semantics change; validates external behavior and isolates compatibility issues.</li><li>Module integration: per-module behavior with or without mocks; required for new features or refactors; may need test framework enhancements.</li></ul></li><li>Unit tests: unit or local integration tests within a single module; most of these can be handled by AI agents.</li></ul><h3 id="The-embeded-e2e-test"><a href="#The-embeded-e2e-test" class="headerlink" title="The embedded e2e test"></a>The embedded e2e test</h3><p>This idea is based on the observation that a component’s behavior is defined by how it communicates with other components, through RPC, FFI, shared memory, and similar mechanisms.</p><p>Therefore, mocking these communications in integration tests provides the following benefits:</p><ul><li>We don’t need to start a full cluster, so we won’t face the adaptation problem.</li><li>If an adaptation issue occurs, it can be easily reproduced at this level. This not only simplifies the debugging process, but also increases our confidence in the code.</li><li>This test treats our program as a black box, which makes it easier to implement because we do not need to understand how each module is implemented. 
These tests are expected to remain stable unless the interfaces or communication frameworks change.</li></ul><h3 id="Tests-as-the-Backbone-of-Vibe-Coding"><a href="#Tests-as-the-Backbone-of-Vibe-Coding" class="headerlink" title="Tests as the Backbone of Vibe Coding"></a>Tests as the Backbone of Vibe Coding</h3><p>In a Vibe Coding workflow, tests become the primary communication channel between intention and code. Among all types of tests, the embedded end-to-end (e2e) tests play a more and more important role.</p><p>Unlike unit tests, which specify local behavior, or integration tests, which usually verify a limited subsystem, my embedded e2e tests define system-level behavioral contracts. They describe what the system should do rather than how it should do it. This makes them naturally aligned with Test-Driven Development (TDD): they serve as executable specifications that drive the implementation.</p><h1 id="Systematic-choices"><a href="#Systematic-choices" class="headerlink" title="Systematic choices"></a>Systematic choices</h1><h2 id="Thread-or-coroutine"><a href="#Thread-or-coroutine" class="headerlink" title="Thread or coroutine?"></a>Thread or coroutine?</h2><p>Benefits of using tokio:</p><ul><li>Smaller memory cost, so we can create more coroutines.</li><li>Context switch is faster because there is no syscall.</li></ul><p>Pitfalls of using tokio:</p><ul><li>We cannot control the scheduling strategy of tokio’s runtime. For example, we cannot assign a priority to a specific task, nor can we limit the CPU quota of a particular class of tasks.</li><li>Switching to async code is often painful, as even the simplest function may become suspendable due to the use of <code>tokio::sync</code> locks.</li><li>It is hard to investigate deadlock / starvation problems.</li><li>Hard to use itertools. 
<code>futures::stream</code> can help, but it generates complex types.</li></ul><h3 id="Use-seperated-Runtime-for-different-task-pool"><a href="#Use-seperated-Runtime-for-different-task-pool" class="headerlink" title="Use separated Runtimes for different task pools?"></a>Use separated Runtimes for different task pools?</h3><p><code>Runtime</code> can only be created outside the “async context” of tokio. So if we need to use tuned <code>Runtime</code>s, we have to create them in advance. This involves a lot of refactoring.</p><h3 id="Propagate-the-panic-outward"><a href="#Propagate-the-panic-outward" class="headerlink" title="Propagate the panic outward"></a>Propagate the panic outward</h3><p>We must pay attention to panics inside the actor’s message loop: the handler, whether a thread or a coroutine, will only surface the panic when it is eventually joined, by which time the failure may have gone unnoticed for too long. What I recommend is to:</p><ul><li><p>Employ the <code>panic_hook</code> to capture the exact scene where things go wrong.</p>  <figure class="highlight rust"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">panic::set_hook(<span class="built_in">Box</span>::new(|info| &#123;</span><br><span class="line">    eprintln!(<span class="string">"Task panicked: &#123;&#125;"</span>, info);</span><br><span class="line">    <span class="built_in">println!</span>(<span class="string">"Task panicked: &#123;&#125;"</span>, info);</span><br><span class="line">&#125;));</span><br></pre></td></tr></table></figure></li><li><p>Eliminate <code>unwrap</code>s and <code>expect</code>s</p>  <figure class="highlight rust"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#![cfg_attr(not(test), deny(clippy::unwrap_used))]</span></span><br><span class="line"><span class="meta">#![cfg_attr(not(test), deny(clippy::expect_used))]</span></span><br></pre></td></tr></table></figure></li></ul><h2 id="Shared-Memory-or-Actor-model"><a href="#Shared-Memory-or-Actor-model" class="headerlink" title="Shared Memory or Actor model?"></a>Shared Memory or Actor model?</h2><p>If we use the coroutine runtime, we may need to decide how to handle race conditions.</p><h3 id="Why-are-“deadlock”s-so-hard-to-diagnose-when-using-coroutines"><a href="#Why-are-“deadlock”s-so-hard-to-diagnose-when-using-coroutines" class="headerlink" title="Why are “deadlock”s so hard to diagnose when using coroutines?"></a>Why are “deadlock”s so hard to diagnose when using coroutines?</h3><ol><li>There is neither a wait-for graph in the coroutine runtime nor one in the OS<br> <code>await</code> does not block a thread, so we can’t find anything with gdb/strace/perf.<br> Meanwhile, these “deadlocks” are hard to detect, because it appears that there is no CPU usage and no blocked thread, and the program is in a “vegetative state”.<br> Coroutine frameworks like <code>tokio</code> provide some o11y tools; however, they are hard to use and have performance overhead.</li><li>No actual “deadlock”<br> These stalls are mostly “waiting for a train at a bus stop” errors. For example, we may read from a channel which will never be written, which is an easy mistake to make when we bail on an error without calling <code>.send()</code> first.<br> So we recommend sending a <code>Result&lt;T&gt;</code>, and implementing a <code>Drop</code> trait that automatically sends <code>Err(Error::DropWithoutReport)</code> as a last-minute remedy.</li><li>No actual “stack”<br> Coroutines don’t carry a real stack. 
When they hit an await they yield a continuation, and that continuation may be resumed on the same or a different thread.</li></ol><h3 id="tokio-RwLock-or-std-sync-Mutex"><a href="#tokio-RwLock-or-std-sync-Mutex" class="headerlink" title="tokio::RwLock or std::sync::Mutex?"></a>tokio::RwLock or std::sync::Mutex?</h3><p>There is a common belief that we must always use tokio locks in asynchronous code. However, according to the <a href="https://docs.rs/tokio/latest/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use" target="_blank" rel="noopener">reference</a> of tokio, it is acceptable, and often better, to use synchronous locks such as <code>std::sync::Mutex</code> or <code>parking_lot::Mutex</code>.</p><p>I’d like to refer to these cases as “atomic access structures”, because they all follow this pattern:</p><figure class="highlight rust"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">Wrapped</span></span> &#123;</span><br><span class="line">    inner: Mutex&lt;<span class="built_in">String</span>&gt;,</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">impl</span> Wrapped &#123;</span><br><span class="line">    <span class="keyword">pub</span> <span class="function"><span class="keyword">fn</span> <span class="title">change_inner</span></span>(&amp;<span class="keyword">self</span>, s: <span class="built_in">String</span>) &#123;</span><br><span class="line">        *<span class="keyword">self</span>.inner.lock().expect(<span class="string">""</span>) = s;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The key point of this code is to avoid directly exposing the lock itself: we must not allow external callers to access it, and we must atomically release the lock after mutating the protected value. The underlying rationale is that we must not allow a coroutine to “sleep” while holding the lock, as this guarantees that no deadlocks will occur, because:</p><ul><li>If a coroutine holds the lock, it will not “sleep”, because the code <code>change_inner</code> is structured to avoid calling <code>.await</code> while the lock is held. Moreover, the executor thread will not sleep either, since it is not waiting on any condition.</li><li>If a coroutine does not hold the lock, it can eventually acquire it, because the current holder will release the lock promptly. And of course, the lock is released before any suspension point.</li></ul><p>For synchronous locks:</p><ul><li><code>std::sync::Mutex</code> supports poisoning. If a thread panics while holding the lock, future <code>lock()</code> calls return a <code>PoisonError</code>, forcing the caller to acknowledge that the protected state may be inconsistent. However, in most cases such an error can’t be handled meaningfully, so it will eventually lead to a panic, which is not elegant. There are also other choices: simply accepting the risk is similar to using <code>parking_lot::Mutex</code>, and resetting the state is not hazardous if there are other threads waiting for this lock.</li><li><code>parking_lot::Mutex</code> is faster and smaller in many workloads (especially uncontended or lightly contended), but it does not poison. This often turns a hard failure into a latent, harder-to-debug corruption.</li></ul><p>The lack of poisoning is exactly why <code>parking_lot::Mutex</code> can be unsafe at the <em>logic</em> level. If a panic happens after partially mutating the protected value, the invariant is already broken. 
So, unless you can guarantee panic-free critical sections or have a clear recovery path, prefer the standard mutex to make invariant breaks visible. I think the best practice is to abort if any thread panics, which can be easily done with panic hooks.</p><h3 id="Implementing-an-“incomplete”-actor-mode"><a href="#Implementing-an-“incomplete”-actor-mode" class="headerlink" title="Implementing an “incomplete” actor mode"></a>Implementing an “incomplete” actor mode</h3><p>In the traditional actor model, each actor node encapsulates its own private data. However, this model is difficult to implement because:</p><ul><li>To rebalance data across nodes, we must introduce new message types and corresponding handlers.</li><li>Inspecting the internal state of actor nodes is difficult.</li></ul><p>So, as a simpler alternative, we can:</p><ul><li>Use a concurrent hash map to store all data, with each actor node mutating a portion of the map.</li><li>Allow other components to read or inspect entries in the concurrent hash map. Such inspectors cannot mutate the entries, and their access must be atomic.</li></ul><p>A preferred candidate for the hash map is <code>DashMap</code>. Although this structure frees us from requiring <code>&amp;mut self</code>, most of its methods return a <code>Ref</code> or <code>RefMut</code> that holds a lock guard, so incorrect usage can lead to deadlocks. 
The following code shows a simple example.</p><figure class="highlight rust"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#[test]</span></span><br><span class="line"><span class="function"><span class="keyword">fn</span> <span class="title">test_dashmap</span></span>() &#123;</span><br><span class="line">    <span class="keyword">let</span> map = DashMap::new();</span><br><span class="line">    map.insert(<span class="number">1</span>, <span class="number">1</span>);</span><br><span class="line">    map.insert(<span class="number">2</span>, <span class="number">2</span>);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">for</span> entry <span class="keyword">in</span> map.iter() &#123;</span><br><span class="line">        <span class="built_in">println!</span>(<span class="string">"&#123;&#125; -&gt; &#123;&#125;"</span>, entry.key(), entry.value());</span><br><span class="line">        map.insert(<span class="number">3</span>, <span class="number">3</span>);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="built_in">println!</span>(<span class="string">"test end"</span>);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Here, <code>iter()</code> holds a shard lock while <code>insert</code> may try to acquire a write lock on the same shard, so the loop can deadlock.</p><p>There is a simple yet effective way to detect potential issues in our code: use <code>#[tokio::test]</code> instead of <code>#[tokio::test(flavor = &quot;multi_thread&quot;)]</code>. 
With the single-threaded runtime, the program will fail immediately if a coroutine “sleeps” while holding a lock.</p><h2 id="The-linking-problem"><a href="#The-linking-problem" class="headerlink" title="The linking problem"></a>The linking problem</h2><h3 id="FFI"><a href="#FFI" class="headerlink" title="FFI"></a>FFI</h3><p>TODO</p><h3 id="How-to-support-TLS"><a href="#How-to-support-TLS" class="headerlink" title="How to support TLS?"></a>How to support TLS?</h3><p>TODO</p><h2 id="Online-config-change"><a href="#Online-config-change" class="headerlink" title="Online config change"></a>Online config change</h2><p>There are some ways to update configs without restarting the program:</p><ul><li>For every actor, introduce a new <code>UpdateConfig</code> event, and handle it in the message loop.</li><li>Using <code>arc_swap</code>.</li></ul><p>I don’t think the service itself should persist the updated configuration to the config file. Instead, this should be handled by the operator.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;Since April 2025, I have been actively contributing to a new Rust–C++ project. Through this work, I have gained many valuable insights. Although I cannot disclose most project details, there are numerous technical challenges worth discussing.&lt;/p&gt;
&lt;p&gt;One of the most notable aspects of this project is that it has been developed alongside the rapid evolution of AI agents, which led us to encounter many pitfalls when practicing vibe coding.&lt;/p&gt;</summary>
    
    
    
    
    <category term="C++" scheme="http://www.calvinneo.com/tags/C/"/>
    
    <category term="arch" scheme="http://www.calvinneo.com/tags/arch/"/>
    
    <category term="Rust" scheme="http://www.calvinneo.com/tags/Rust/"/>
    
    <category term="Articles" scheme="http://www.calvinneo.com/tags/Articles/"/>
    
    <category term="VibeCoding" scheme="http://www.calvinneo.com/tags/VibeCoding/"/>
    
  </entry>
  
  <entry>
    <title>乒乓球训练纪实</title>
    <link href="http://www.calvinneo.com/2025/06/11/table-tennis/"/>
    <id>http://www.calvinneo.com/2025/06/11/table-tennis/</id>
    <published>2025-06-11T15:07:22.000Z</published>
    <updated>2026-02-12T06:53:56.197Z</updated>
    
    <content type="html"><![CDATA[<p>因为五一节打羽毛球把膝盖打出问题了，现在主要学习乒乓球了</p><a id="more"></a><h1 id="Day-1"><a href="#Day-1" class="headerlink" title="Day 1"></a>Day 1</h1><p>主要修正了下反手动作。</p><p>这里，我最重要的是不要架肘。打的时候，可以用左手稍微按着一点左边的大臂。<br>然后，乒乓球反手是小臂带动大臂，大臂基本上不用特别发力，而是让小臂画 1/4 的圆弧。<br>在打完之后，需要还原。击球时，击球点不要太到台内，仿佛把整个手都要伸出去一般。<br>击球的是高点期。注意，不要急，而是等球到位之后再打，实际上质量更好。<br>当球落点朝着左边或者右边偏斜的时候，可以考虑靠只移动上半身来接。<br>作为初学者，不需要手腕有特别大的扭转。</p><p>-&gt; 初学者，可以采用下蹲马步，然后再站起来这样。但是最终，是要一直处于一种扎马步的状态，主要是前脚掌着地。然后用一些垫步还是啥的完成重心转换。</p><h1 id="Day-2"><a href="#Day-2" class="headerlink" title="Day 2"></a>Day 2</h1><p>主要纠正了下正手动作。</p><p>我正手有几点问题：</p><ul><li>撅手腕，教练说撅着手腕可能是因为击打点太靠前，以至于要“等球：</li><li>大臂架着</li><li>大臂还喜欢架着往后拉，实际上应该是转腰，而不是动大臂</li><li>握拍应该虎口直接对着拍子的“侧棱”，类似于羽毛球一样，我可能会喜欢转一点</li></ul><h1 id="Day-3"><a href="#Day-3" class="headerlink" title="Day 3"></a>Day 3</h1><p>继续是正手和反手的练习，下雨，只有 1.5h。</p><p>提到了一些细节：</p><ul><li>反手可以尝试蹲起，然后站立这种击打方式</li></ul><p>录制了一些视频，视频中可以发现，还是喜欢撅着手腕。</p><h1 id="Day-4"><a href="#Day-4" class="headerlink" title="Day 4"></a>Day 4</h1><p>继续是正手和反手的练习，有事，只有 1.5h。</p><h1 id="Aug-21"><a href="#Aug-21" class="headerlink" title="Aug 21"></a>Aug 21</h1><h1 id="Sept-12"><a href="#Sept-12" class="headerlink" title="Sept 12"></a>Sept 12</h1><h1 id="Sept-19"><a href="#Sept-19" class="headerlink" title="Sept 19"></a>Sept 19</h1><p>因为重新看了下医生，说暂时不要运动，因此这次就练了下搓球。</p><p>搓球的话主要几点：</p><ul><li>腿要伸进去</li><li>重心一定要压低压低！可以感受到脸要贴着桌子的感觉</li><li>击球点要低，基本上是二跳要落下，就要完蛋的那个时候搓。但尽管如此，人是要先上去的，所以这里有一个停顿</li></ul><h1 id="Sept-26"><a href="#Sept-26" class="headerlink" title="Sept 26"></a>Sept 26</h1><p>本周停止</p><h1 id="Oct-22"><a href="#Oct-22" class="headerlink" title="Oct 22"></a>Oct 22</h1><h1 id="Oct-30"><a href="#Oct-30" class="headerlink" title="Oct 30"></a>Oct 30</h1><ul><li>握拍得贴着虎口。因为虎口很大，所以是得靠大拇指一侧，而不是食指一侧</li><li>搓球的时候，中心前压，不是说要驼背，而是腹部发力</li><li>打球的时候，可以尝试只用前脚掌，把重心放到前脚掌上</li><li>摆速的时候，手不要拉</li><li>搓球的时候，中心要低，但不要趴桌上</li></ul><h1 id="Nov-13"><a href="#Nov-13" class="headerlink" title="Nov 13"></a>Nov 
13</h1><p>正手虽然动作大，但是击球点靠后，所以空中时间长。</p><p>搓球亮板子是为了接触面积大。</p><h1 id="Nov-21"><a href="#Nov-21" class="headerlink" title="Nov 21"></a>Nov 21</h1><ul><li>搓球的时候，不要靠大臂往前戳。而是要以肘为轴心，小臂往前，大臂是不需要往前伸的</li></ul><h1 id="Dec-11"><a href="#Dec-11" class="headerlink" title="Dec 11"></a>Dec 11</h1><p>正手攻球打球有几点：</p><ul><li>上旋球，如果弧线比较低，也不需要提，直接打过去就行</li><li>攻球的时候，不仅要注意手不要翘，不要拉手。手腕最好跟搓球一样，能拱一点起来。整体来讲，就是要接触面积大。</li></ul><p>反手攻球，发现有时候会打下网，原因还是我手腕没有偏拱，而是有点把手翘起来的感觉。我在微信里面备注了个图，可以参考下。</p><p>另外，进一步看了正手搓球：</p><ul><li>重心还是要低，这里强调了一下，肘部也要低。我打球的时候，可能是提着肘，然后手在下面。其实应该肘和手都在下面。</li><li>可以想象，准备姿势，重心压低。这个时候，手臂是很贴近台面的。所以，正手搓球的时候，只需要把手臂打横就行了。</li><li>搓球也需要往前送，正手搓球不要往自己身体收前臂，而是往前送。</li><li>其实搓球搓起来，球不太容易很贴网的。</li></ul><h1 id="Jan-7"><a href="#Jan-7" class="headerlink" title="Jan 7"></a>Jan 7</h1><p>中间放假缺了一节课，另外和 fzh 又打了几场球，感觉正手搓球有点差了。主要体现在：</p><ul><li>伸手太靠近身体</li><li>身体的发力太多了，手上的动作少了</li><li>小臂不要翘上去，搞得像一个 V 字一样，还是要放下来一点</li></ul><p>在攻球方面，感觉膝盖好了不少。</p><p>然后初步学习一下反手拉球，感觉我自己的问题是：</p><ul><li>手肘架太高了，更像拧</li><li>可能是怕刮到球台，自己退台太多了</li></ul><p>这里的一个重点在于：</p><ul><li>手腕一定要引拍，让球拍对着自己</li><li>拉球完毕后，手腕应该恢复类似正常的攻球状态，不要往上翘。其实往上翘也是我之前攻球常见的一个错误问题</li></ul><p>我理解反手拉球随着你引拍是在身体左边、中间还是右边，动作的大小和幅度不太一样，但是框架是一样的，就是手腕要制造摩擦。</p><p>Jan 8 的时候又和 fzh 语音交流了一下。他的意思是反手拉球其实最重要的是能拉到别人搓过来的比较快并且贴身体的球。基本上退台是不会反手拉的。</p><h1 id="Jan-30"><a href="#Jan-30" class="headerlink" title="Jan 30"></a>Jan 30</h1><p>今天又是只有我一个人。</p><p>搓球：</p><ul><li>正手搓球不要在身体前面，而是要打开手臂，在身体右前方击球，原因是因为要有一致性。这样，站在那个位置，处理方式会有很多，不会给人感觉你就一定是搓球。不过打开手臂之后，还是要往前的，不能往身体方向收手臂。这个刻意注意一下就行。</li><li>另一种注意的办法是，搓球的时候可以有个停顿。这里的停顿不是说你就把拍子放在那里等球来了，我们肯定是要向着球制造摩擦的。但是就和羽毛球那样，得有一种“定”的感觉。或者可以认为伸手臂和往前搓实际上可以分解为两个动作。</li></ul><p>拉球：</p><ul><li>反手拉球直接跨步调整位置，不需要滑步。这是因为反手的地方就那么一块，滑步的动作太大了。</li></ul><p>另外，我觉得拉球我的问题主要是：</p><ul><li>不要边动边打球。脚步站好的时候，最好重心也调整好，比如该蹲那时候就蹲了，不要一边挥拍一边往下蹲，这样很难瞄准。</li><li>拍子要在下面，从下往上挥动。</li><li>拍子不要往上走，往上是通过手腕那个旋转来做的。手臂就是放松，自然展开就行。</li></ul><h1 id="Feb-12"><a href="#Feb-12" class="headerlink" title="Feb 12"></a>Feb 
12</h1><p>学了下劈长。感觉这个很讲究瞬间发力。</p><ul><li>如果小臂移动太多，那么就容易出界</li><li>如果手腕移动太多，比如最后往上翘了，那么也容易出界</li></ul><p>另外今天体验了一下摆短。我确实可以摆起来，但是是卸力把球托过去的，并没有搓很转，所以没有什么威胁。</p><p>今天尝试了下拉球，就是彻底把手放下来，然后就发现有个明显的好处，就是如果我手自然下垂，那么我就没必要很刻意地去内旋我的手腕了，反而更轻松。</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;因为五一节打羽毛球把膝盖打出问题了，现在主要学习乒乓球了&lt;/p&gt;</summary>
    
    
    
    
    <category term="运动" scheme="http://www.calvinneo.com/tags/运动/"/>
    
  </entry>
  
  <entry>
    <title>关西2</title>
    <link href="http://www.calvinneo.com/2025/04/08/kensai-2/"/>
    <id>http://www.calvinneo.com/2025/04/08/kensai-2/</id>
    <published>2025-04-08T12:06:11.000Z</published>
    <updated>2025-04-12T08:20:49.168Z</updated>
    
    <content type="html"><![CDATA[<p>趁着清明节又去了一趟关西。本以为是度假，但实际上累得要死。</p><a id="more"></a><h1 id="D1"><a href="#D1" class="headerlink" title="D1"></a>D1</h1><p>这一次从南京直飞大阪。机票照例没有提前多久定，但是两个人往返也才 4k 出头，相当便宜。宾馆就是贵得离谱了。京都的宾馆单人 1.5k，双人接近 4k 感觉简直在抢钱。我先在大阪定了 2 天 700 左右的。然后又订了一天姬路和大阪的。然后最后一天住哪不清楚，大概先这样。</p><p>关西机场入国变得麻烦多了，这次虽然没有坐小火车（吉祥航空），但入关排队感觉就花了将近一个小时了。在飞机上被发了入境单要填，我觉得挺麻烦的，现在都是电子化了啊。结果到了入境口发现有 abcd 四个 route，但完全不知道区别是什么。一个国际机场居然没有英文的说明，工作人员也只说日语和简单的英语。Anyway，他有个扫描护照的机器，感觉挺方便的。我用机器扫完，然后就再往前排队等人工。走到一半发现大家还有个 QR code 也不知道是啥，但是这些人扫完 QR 之后又要填一遍纸单子。。。</p><p>反正轮到我，我就说 I have no QR code but I’ve already finished the note, and I’ve already registered on that machine. 然后那人把我的纸收了就直接贴入境单了，很丝滑很快。这次不去京都，就直接走南海电车了，更便宜。坐到 namba 才 900。</p><p>吃完饭，就去附近的 apple 店买个表。现在日本的 apple 店居然不能退税了，只有部分非官方 retailer 可以退税，但是我又不知道哪里靠谱。46mm 的要 3200，国内才 2600。但是因为国内啥啥都没，没快充慢的一批，所以还是买了。总不能为这个专门跑一次 hk 吧。</p><p>去 711 给交通卡充值，发现只能用现金。幸亏我带了 15000 jpy 来，本意是上次玩没用掉，这次结果救了命。</p><p>晚上出去吃饭回来，发现大阪真的冷，幸亏穿了两件，不然真的冻死。</p><p>回来发现日本的窗子真的好隔音，薄薄的平开窗，毫无特点的铝合金，居然这么猛。</p><h1 id="D2-京都"><a href="#D2-京都" class="headerlink" title="D2 京都"></a>D2 京都</h1><p>因为地铁卡充值要现金感觉不方便，所以我今天就打算不用地铁卡，用 iPhone Wallet 了。这鬼东西又不能用信用卡，我弄了半天才弄了个储蓄卡上去。上地铁又刷坏了，墨迹了半天。到了大阪梅田站，她出不了站，刷了半天不知道怎么就出去了。结果到了 JR 大阪站发现进不去，问工作人员说这个是 osaka subway，我觉得很奇怪，因为 icoca 是通用的，工作人员也说我的 iPhone 是可以的。于是我觉得肯定是被锁了。无论如何，这里可以支付宝买纸票，于是我们就去京都了。</p><p>京都的公交车排队排了感觉有半个多小时，后面上了个临时的加班车。这边还不让带行李上公交车，幸亏我们都放在了大阪，就带了个小包。公交车里面非常拥挤，但我们进去的早，所以有个位置。公交车开的特别慢，感觉甚至不如走路。开到清水寺附近感觉花了大概二十多分钟。我们试图在五年阪下车，结果被堵住了，然后司机就不让我们下了。然后后面一个傻逼老外就说 you come here late, you have to follow their rules, it’s not your country 啥的。非常 
offensive。</p><p>清水寺我们没进去，我对象说没啥意思，我之前去过，感觉也没啥意思，人还特别多。门口的御守他也觉得贼丑。然后我们就顺着三年阪二年阪往下走，去找法观寺。我们都已经看到高台寺公园了，发现法观寺走过了，又绕回去找。结果法观寺就是我们来的路上的那个我觉得一般的唐代风格的塔，好像是京都最老的塔，也不让进去。然后我们又走回到高台寺。</p><p>高台寺的垂樱是挺好看的。</p><p>高台寺出来很快就到了八阪神社，里面和小吃街一样，我们找了下垂樱在哪里，就去吃那个鳗鱼饭了。</p><p>鳗鱼饭吃完出来，就在木屋町那条小路那边拍樱花，感觉比鸭川的樱花好看。</p><p>然后又绕回去看了下花见小路，照旧非常无聊。路尽头是什么建仁寺的，大家都没有兴趣看。</p><p>然后又走了一大段路去看顶法寺。顶法寺不要钱，里面的樱花很漂亮。还有几个小和尚的雕塑也很有意思。</p><p>晚上实在走不动了，就坐了京阪电车，因为阪急还要多走 300m。至于京都 JR，因为坐公交体验太差，完全不考虑了。我们坐的是 18:59 的京阪电车去的大阪。其实以后真的可以坐京阪电车到京都东边，比 JR 方便很多，还便宜。但这次京阪电车属实是个坑货。它终点站在大阪是 yodoyabashi 淀屋桥，但是又有一些车是到一个叫 yodo 淀的地方。然后我就被这个车丢在了 yodo。后面上了个准急的，几乎就是站站乐了，感觉总共花了大概一个多小时才到大阪。</p><p>回到大阪，找大阪地铁说明了情况，列车长问我们要不要 refund。我说不要，他就说 OK。然后就解锁了，结果发现之前坐的 namba 的南海的钱居然也没被扣。南海真的血亏啊。</p><h1 id="D3-姬路"><a href="#D3-姬路" class="headerlink" title="D3 姬路"></a>D3 姬路</h1><p>起来去姬路。这次发现他们电车同一个方向有两个轨道，一边是 local 一边是 express。上车前可以看站台上的屏幕，上车时可以看车的屏幕确认。上车前可以看 Google 或者 Apple 的 map 到达的时间确定自己是不是对应班次的车。上车后也可以听播报。基本上 Rapid、Express 的车都比较推荐，大阪、神户市区的大站基本都停。Limited Express 特急要特急券我没见过。</p><p>姬路是个小城市，我们的酒店在车站南边一点。应该是此行中比较大的酒店了。酒店不能提前办理入住，但是可以预先寄存。</p><p>姬路站北面正对姬路城，通过一个大道可以直接走过去。大道两侧不少店铺比较出名，我们喝了个咖啡，然后就立即前往姬路城了。</p><p>城里面樱花很漂亮，右边走还有个动物园。登城口在左边，要排队，但是队伍很快，等了大概半小时就进去了。买票 +50 yen 就能得到一个旁边的花园的门票。花园里面有个餐厅，我们去的时候已经关门了。姬路城主要就是两块，一个是西之丸庭院，可以脱掉鞋子登上去，类似于一个走廊，有多个城橹。从化妆橹可以下来，据说这个是给城主夫人化妆用的地方。西之丸庭院相比大阪城比较小巧，里面也有不少樱花。</p><p>从庭院出来就可以走到天守的口了。然后就是穿过一道道什么 yi 之门、wa 之门、ni 之门，然后走到天守的下面的小庭院内。然后就走过水之 X 门走到大天守里面。大天守有 6 层，第 2 层开始大排长队，人贼多。</p><h1 id="D4-神户"><a href="#D4-神户" class="headerlink" title="D4 神户"></a>D4 神户</h1><p>神户的酒店在三宫门口，应该是本次我找的最近、最便宜的酒店了，才 500。住宿体验很舒适。</p><p>放完东西，去神户动物园。我们应该是顶门到的，进去刷票的时候，我对象把票根弄掉了，检票员说必须要把票找回来。</p><p>在那个破落商店里面有一家后来知道叫 Yellow Submarine 的桌游超市，还挺大的。回来我去问了下有没有那个骰子游戏，他带我去找，然后翻了下，说 sold out 了。我以为附近南京町还有一家，走过去发现那是个别的地方，已经关门了。</p><h1 id="D5-奈良-大阪"><a href="#D5-奈良-大阪" class="headerlink" title="D5 奈良 - 大阪"></a>D5 奈良 - 大阪</h1><p>下了奈良站，在 5 号口附近找到一个柜子，600 yen 就可以放两个小箱子和一个包了。不过只接受 100 yen 
的硬币，我还要去旁边换。</p><p>出去吃了那个口水麻薯，一堆老外在那边拍照。去看了兴福寺，要钱，没意思没进去。兴福寺下来就能看到鹿，鹿很现实，看到我们没买饼就不磕头了。我们到最后也没买，主打白嫖。</p><p>旁边的奈良博物馆关门了。</p><p>我印象里上次去若草山有个大坡可以坐着休息，但这次去好像就是一些大草坪可以走，有椅子可以坐。大草坪也许是封起来了吧，因为当时看到大草坪上没有人，而且往若草山方向有被拦住。</p><p>在大阪这最后一晚住的酒店是最拉胯的。</p><h1 id="D6-大阪"><a href="#D6-大阪" class="headerlink" title="D6 大阪"></a>D6 大阪</h1>]]></content>
    
    
    <summary type="html">&lt;p&gt;趁着清明节又去了一趟关西。本以为是度假，但实际上累得要死。&lt;/p&gt;</summary>
    
    
    
    
    <category term="游记" scheme="http://www.calvinneo.com/tags/游记/"/>
    
  </entry>
  
  <entry>
    <title>Database paper part 7</title>
    <link href="http://www.calvinneo.com/2025/03/15/database-paper-7/"/>
    <id>http://www.calvinneo.com/2025/03/15/database-paper-7/</id>
    <published>2025-03-15T13:33:22.000Z</published>
    <updated>2025-03-15T14:00:06.214Z</updated>
    
    <content type="html"><![CDATA[<p>包含：</p><ul><li>We Ain’t Afraid of No File Fragmentation: Causes and Prevention of Its Performance Impact on Modern Flash SSDs</li></ul><a id="more"></a><h1 id="We-Ain’t-Afraid-of-No-File-Fragmentation-Causes-and-Prevention-of-Its-Performance-Impact-on-Modern-Flash-SSDs"><a href="#We-Ain’t-Afraid-of-No-File-Fragmentation-Causes-and-Prevention-of-Its-Performance-Impact-on-Modern-Flash-SSDs" class="headerlink" title="We Ain’t Afraid of No File Fragmentation: Causes and Prevention of Its Performance Impact on Modern Flash SSDs"></a>We Ain’t Afraid of No File Fragmentation: Causes and Prevention of Its Performance Impact on Modern Flash SSDs</h1><p><a href="https://www.usenix.org/system/files/fast24-jun.pdf" target="_blank" rel="noopener">https://www.usenix.org/system/files/fast24-jun.pdf</a></p><h2 id="Abstract"><a href="#Abstract" class="headerlink" title="Abstract"></a>Abstract</h2><blockquote><p>The primary cause of the degraded performance is <strong>not due to request splitting</strong> but stems from a significant increase in <strong>die-level collisions</strong>.</p></blockquote><p>如果在写连续的 file block 中，有其他的写入过来，那么这些 file block 就不会在连续的 die 上，从而产生 random die allocation。这种情况比如发生在 file overwrite 的时候。</p><blockquote><p>In SSDs, when other writes come between writes of neighboring file blocks, the file blocks are not placed on consecutive dies, resulting in random die allocation. This randomness escalates the chances of die-level collisions, causing deteriorated read performance later. We also reveal that this may happen when a file is overwritten.</p></blockquote><p>Evaluations with commercial SSDs and an SSD emulator indicate that our approach effectively curtails the read performance drop arising from both fragmentation and overwrites, all without the need for defragmentation. 
Representatively, when a 162 MB SQLite database was fragmented into 10,011 pieces, our approach limited the performance drop to 3.5%, while the conventional system experienced a 40% decline.</p><h2 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h2><p>To prevent performance degradation caused by fragmentation, file systems utilize various techniques [35], such as delayed allocation [23] and preallocation of data blocks [2], to maintain continuity among data blocks. </p><p>SSD 中没有磁头的物理移动，所以减少了顺序读和随机读之间的性能 gap。但 [4] 中说，SSD 上读 fragmented 的文件，也有 2-5 倍的性能损失。诸如 [13, 31, 42] 的文件只是认为这些性能损失的原因是 request splitting in the kernel I/O path due to fragmentation。</p><p>这篇文章指出 fragmentation 导致的性能损失实际上根因是 die-level collisions。而 die-level collisions 会减少 SSD 内部的并发度。</p><blockquote><p><strong>An SSD’s firmware allocates its flash memory pages in a round-robin manner across the flash memory dies based on the order in which they are written.</strong></p></blockquote><p>所以，如果发生了 fragmentation，那么 the pages storing contiguous file blocks 不能被放置在 contiguous dies 上，而是被分配在任意的 dies 上。</p><p>这个论文修改了 nvme 的协议，让 write 命令指定 page-to-die mapping。</p><blockquote><p>With these hints, the page for an appending write is mapped to the die following the die where the previous file block’s page was assigned to. 
In addition, the page for an overwrite operation to an existing file block, which also disrupts the page-to-die mapping pattern, is mapped to the same die where the original page was located.</p></blockquote><h2 id="Background-and-Motivation"><a href="#Background-and-Motivation" class="headerlink" title="Background and Motivation"></a>Background and Motivation</h2><h3 id="Old-Wisdom-on-File-Fragmentation"><a href="#Old-Wisdom-on-File-Fragmentation" class="headerlink" title="Old Wisdom on File Fragmentation"></a>Old Wisdom on File Fragmentation</h3><blockquote><p>In the HDD era, the primary and direct cause of performance degradation from file fragmentation was <strong>the seek time</strong> between dispersed sectors of the file.</p></blockquote><p>Fragmentation 对读取的影响更大，因为读取必须要等待完成，而写入则可以被 buffer。</p><p>Fragmentation 在三个层面影响性能：</p><ul><li>kernel I/O path<br>  Only a single command is required for the host to instruct the storage device to perform read or write operations on contiguous storage space.<br>  Thus, when a sequential read occurs for a file, the Linux kernel reads the data block mapping in the file’s inode, and for each contiguous data block region, it creates a bio (block I/O) data structure. This data structure is used to create the corresponding request data structure to be passed to the device driver, which then issues the command for the request to the device.<br>  Through this process, a single sequential file access may be <strong>split into multiple <code>bio</code>s</strong> and corresponding requests to the storage device, depending on the degree of file fragmentation.</li><li>storage device interface<br>  This request splitting is known to increase I/O execution time, as it increases the number of data structure creations and calls to underlying functions, including the device driver code.<br>  Specifically, the frequency of fetching, decoding, translating commands into storage media operations, and queuing media access operations increases. 
Therefore, file fragmentation also delays the processing time of the storage device controller.</li><li>storage media access</li></ul><p><img src="/img/dbpaper/ssd-die-collision/1.png"></p><p>ext4 为了减少 fragment 产生的优化：</p><ul><li>The <strong>delayed allocation</strong> technique used in the <strong>ext4</strong> file system performs data block allocation not at the write system call handling but <strong>at the time of page flush</strong>.</li><li>In addition, ext4 reserves a predefined window of free data blocks for each file’s inode. These reserved free blocks will be actually allocated to the file for its successive append writes.</li></ul><p>defragmentation 的手段：</p><ul><li>【Sato】Allocates contiguous free blocks to a temporary inode, copies the fragmented file data to the temporary inode, deletes the original file, and renames the temporary inode to the original’s.</li></ul><h3 id="File-Fragmentation-in-SSD-Era"><a href="#File-Fragmentation-in-SSD-Era" class="headerlink" title="File Fragmentation in SSD-Era"></a>File Fragmentation in SSD-Era</h3><p>很多学者和厂商说 SSD 不受 fragmentation 的影响，defragmentation 反而可能会损害 SSD 的寿命。</p><p>SSDs offer significantly higher performance than a single flash memory die (chip) because they operate multiple flash dies in parallel. 
</p><p>NVMe 有 65535 个命令队列，每个队列能 queue 65536 个 commands。总共约 65535 × 65536 ≈ 43 亿个，可以说非常大了。</p><blockquote><p>Specifically, NVMe SSDs offer 65,535 command queues, each capable of queueing 65,536 commands.</p></blockquote><blockquote><p>Even when fragmentation leads to smaller request sizes that cannot fully utilize die-level parallelism, smaller flash operations in the command queues can still be processed out-of-order, allowing most dies to be fully utilized.</p></blockquote><p>因此，很多学者认为 kernel I/O path 和 storage device interface 中的 request splitting 是影响性能的关键。</p><h3 id="Internals-of-Modern-Flash-SSDs"><a href="#Internals-of-Modern-Flash-SSDs" class="headerlink" title="Internals of Modern Flash SSDs"></a>Internals of Modern Flash SSDs</h3><blockquote><p>a die can only process one request at a time.</p></blockquote><p>FTL 会将需要写入的 page 存储到尽可能多的 die 中。</p><blockquote><p>To prevent die-level collisions for read operations, the flash translation layer (FTL) of an SSD’s firmware must perform physical page allocation in a manner that distributes the physical pages storing contiguous logical pages across as many dies as possible.</p></blockquote><p>所以，是 round-robin 地选择 die，而不是一股脑全写到一个 die 里面。</p><blockquote><p>For this purpose, the FTL of most modern SSDs selects a die in a round-robin manner when allocating a flash page for processing an incoming page write request.</p></blockquote><blockquote><p>Additionally, modern FTLs perform the valid page copy within the die where the page resides during the garbage collection (GC) process if the die has a sufficient number of free pages.</p></blockquote><p>For example, in Fig. 2, File A is evenly distributed across four dies since its four pages were written without interference. Thus, a sequential read of File A will be performed simultaneously on these four dies, resulting in a bandwidth of up to four times the flash die performance. </p><p>In contrast, assume that the writes to File B and File C were interleaved. 
As the die for storing a logical page is assigned in a round-robin manner according to the order of writes performed within the SSD, both the third and last pages of File B ended up being allocated to Die 3. As a result, the time to read File B is twice as long as that for reading an ideally-placed file of the same size, such as File A.</p><p><img src="/img/dbpaper/ssd-die-collision/2.png"></p>]]></content>
    
    
    <summary type="html">&lt;p&gt;包含：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We Ain’t Afraid of No File Fragmentation: Causes and Prevention of Its Performance Impact on Modern Flash SSDs&lt;/li&gt;
&lt;/ul&gt;</summary>
    
    
    
    
    <category term="数据库" scheme="http://www.calvinneo.com/tags/数据库/"/>
    
    <category term="论文阅读" scheme="http://www.calvinneo.com/tags/论文阅读/"/>
    
  </entry>
  
  <entry>
    <title>法语学习纪要</title>
    <link href="http://www.calvinneo.com/2025/03/13/french/"/>
    <id>http://www.calvinneo.com/2025/03/13/french/</id>
    <published>2025-03-13T15:09:06.000Z</published>
    <updated>2025-09-29T16:49:02.536Z</updated>
    
    <content type="html"><![CDATA[<p>在小绿鸟上学法语。</p><a id="more"></a><h1 id="常见变位"><a href="#常见变位" class="headerlink" title="常见变位"></a>常见变位</h1><ul><li>aller：去<ul><li>je vais</li><li>tu vas</li><li>il/elle va</li><li>nous allons</li><li>vous allez</li><li>ils/elles vont</li></ul></li><li>avoir：有<ul><li>j’ai</li><li>tu as</li><li>il/elle a</li><li>nous avons</li><li>vous avez</li><li>ils/elles ont</li></ul></li><li>être：是<ul><li>je suis</li><li>tu es</li><li>il/elle est</li><li>nous sommes</li><li>vous êtes</li><li>ils/elles sont</li></ul></li><li>venir：来<ul><li>je viens</li><li>tu viens</li><li>il/elle vient</li><li>nous venons</li><li>vous venez</li><li>ils/elles viennent</li></ul></li><li>faire：做<ul><li>je fais</li><li>tu fais</li><li>il/elle fait</li><li>nous faisons</li><li>vous faites</li><li>ils/elles font</li></ul></li><li>falloir：必须，这是个绝对无人称动词<ul><li>il faut</li></ul></li><li>devoir：必须<ul><li>je dois</li><li>tu dois</li><li>il/elle doit</li><li>nous devons</li><li>vous devez</li><li>ils/elles doivent</li></ul></li></ul><ul><li>aimer：喜欢<ul><li>j’aime</li><li>tu aimes</li><li>il/elle aime</li><li>nous aimons</li><li>vous aimez</li><li>ils/elles aiment</li></ul></li><li>parler: 说<ul><li>je parle</li><li>tu parles</li><li>il/elle parle</li><li>nous parlons</li><li>vous parlez</li><li>ils/elles parlent</li></ul></li><li>prendre：take<ul><li>je prends</li><li>tu prends</li><li>il/elle prend</li><li>nous prenons</li><li>vous prenez</li><li>ils/elles prennent</li></ul></li><li>sortir：出<ul><li>je sors</li><li>tu sors</li><li>il/elle sort</li><li>nous sortons</li><li>vous sortez</li><li>ils/elles sortent</li></ul></li><li>rentrer：come back in<ul><li>je rentre</li><li>tu rentres</li><li>il/elle rentre</li><li>nous rentrons</li><li>vous rentrez</li><li>ils/elles rentrent</li></ul></li><li>écouter：听<ul><li>j’écoute</li><li>tu écoutes</li><li>il/elle écoute</li><li>nous écoutons</li><li>vous écoutez</li><li>ils/elles écoutent</li></ul></li><li>lire：读<ul><li>je 
lis</li><li>tu lis</li><li>il/elle lit</li><li>nous lisons</li><li>vous lisez</li><li>ils/elles lisent</li></ul></li><li>écrire：写<ul><li>j’écris</li><li>tu écris</li><li>il/elle écrit</li><li>nous écrivons</li><li>vous écrivez</li><li>ils/elles écrivent</li></ul></li></ul><ul><li>payer：付费<ul><li>je paye</li><li>tu payes</li><li>il/elle paye</li><li>nous payons</li><li>vous payez</li><li>ils/elles payent</li></ul></li><li>finir：完成<ul><li>je finis</li><li>tu finis</li><li>il/elle finit</li><li>nous finissons</li><li>vous finissez</li><li>ils/elles finissent</li></ul></li><li>courir：跑<ul><li>je cours</li><li>tu cours</li><li>il/elle court</li><li>nous courons</li><li>vous courez</li><li>ils/elles courent</li></ul></li></ul><h1 id="人称代词"><a href="#人称代词" class="headerlink" title="人称代词"></a>人称代词</h1><p>重读人称代词：</p><ul><li>在 C’est 后作表语<br>  C’est moi.</li><li>在介词之后<br>  Je suis avec toi.</li><li>用于表强调</li><li>命令式<br>  Regarde-toi.</li></ul><ul><li>第一人称单数<ul><li>je</li><li>所有格的阳性 mon</li><li>所有格的阴性 ma</li><li>所有格的复数 mes</li><li>间接宾语 me</li><li>重读 moi</li></ul></li><li>第二人称单数<ul><li>tu</li><li>所有格的阳性 ton</li><li>所有格的阴性 ta</li><li>所有格的复数 tes</li><li>间接宾语 te</li><li>重读 toi</li></ul></li><li>第三人称单数<ul><li>il/elle</li><li>所有格的阳性 son</li><li>所有格的阴性 sa</li><li>所有格的复数 ses</li><li>间接宾语 lui</li><li>重读 lui/elle</li></ul></li><li>第一人称复数<ul><li>nous</li><li>所有格单数 notre</li><li>所有格复数 nos</li><li>间接宾语 nous</li><li>重读 nous</li></ul></li><li>第二人称复数<ul><li>vous</li><li>所有格单数 votre</li><li>所有格复数 vos</li><li>间接宾语 vous</li><li>重读 vous</li></ul></li><li>第三人称复数<ul><li>ils/elles</li><li>所有格单数 leur</li><li>所有格复数 leurs</li><li>间接宾语 leur</li><li>重读 eux/elles</li></ul></li><li>泛指代词<ul><li>on</li><li>重读 soi</li></ul></li></ul><h1 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h1><ul><li><a href="https://www.collinsdictionary.com/zh/conjugation/french-conjugation/aller" target="_blank" 
rel="noopener">https://www.collinsdictionary.com/zh/conjugation/french-conjugation/aller</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;在小绿鸟上学法语。&lt;/p&gt;</summary>
    
    
    
    
    <category term="法语" scheme="http://www.calvinneo.com/tags/法语/"/>
    
  </entry>
  
  <entry>
    <title>Fuse 学习</title>
    <link href="http://www.calvinneo.com/2025/03/09/learn-fuse/"/>
    <id>http://www.calvinneo.com/2025/03/09/learn-fuse/</id>
    <published>2025-03-09T14:48:32.000Z</published>
    <updated>2026-01-14T19:04:38.681Z</updated>
    
    <content type="html"><![CDATA[<p>看下 FUSE 的相关知识。</p><p>Filesystem In Userspace 也就是 fuse，是 linux 的一个内核模块。</p><a id="more"></a><p>Fuse 的优势在于：</p><ul><li>允许非管理员权限去 mount 一个文件系统，例如 overlayfs</li><li>允许不通过内核实现一个文件系统</li></ul><p>但是 Fuse 并不是在用户态访问文件系统，在调用文件系统时，依然需要陷入内核访问 VFS，并通过内核完成 IO。Fuse 在用户态做的是将内核传递的 fuse request 处理，并调用 VFS。</p><h1 id="To-FUSE-or-not-to-FUSE-Analysis-and-Performance-Characterization-of-the-FUSE-User-Space-File-System-Framework"><a href="#To-FUSE-or-not-to-FUSE-Analysis-and-Performance-Characterization-of-the-FUSE-User-Space-File-System-Framework" class="headerlink" title="To FUSE or not to FUSE - Analysis and Performance Characterization of the FUSE User-Space File System Framework"></a>To FUSE or not to FUSE - Analysis and Performance Characterization of the FUSE User-Space File System Framework</h1><h2 id="2"><a href="#2" class="headerlink" title="2"></a>2</h2><p>fuse 的内核部分是 fuse.ko，会注册三个文件系统的类型：fuse、fuseblk、fusectl，它们在 /proc/filesystems 中都可见。fuse 类型的文件系统并不需要下层的块设备，而 fuseblk 类型的文件系统则需要。</p><p><code>fuseblk</code> provides the following features:</p><ol><li>Locking the block device on mount and unlocking on release;</li><li>Sharing the file system for multiple mounts;</li><li>Allowing swap files to bypass the file system in accessing the underlying device;</li></ol><p>fuse 和 fuseblk 都是一些不同的 FUSE 文件系统的 proxy，所以后面统称为 FUSE。</p><p>几点说明：</p><ul><li>FUSE 文件系统的名字一般是 <code>[fuse|fuseblk].$NAME</code>。</li><li>/dev/fuse 是一个字符设备，被用来支持用户态的 FUSE daemon 和内核之间的通信。简单来说，用户态的 daemon 会从 /dev/fuse 中读出请求，进行处理，然后再写回到 /dev/fuse 中。</li></ul><p>这个经典的图，是 FUSE 的链路<br><img src="/img/dbpaper/fuse/fuseornot/2.1.png"></p><blockquote><p>When a user application performs some operation on a mounted FUSE file system, the VFS routes the operation to FUSE’s kernel (file system) driver. The driver allocates a FUSE request structure and puts it in a FUSE queue. 
At this point, the<br>process that submitted the operation is usually put in a wait state.<br>FUSE’s user-level daemon then picks the request from the kernel queue by reading from <code>/dev/fuse</code> and processes the request. Processing the request might require <strong>re-entering</strong> the kernel again: for example, in case of a stackable FUSE file system, the daemon submits operations to the underlying file system (e.g., Ext4); or in case of a block-based FUSE file system, the daemon reads or writes from the block device; and in case of a network or in-memory file system, the FUSE daemon might still need to re-enter the kernel to obtain certain system services (e.g., create a socket or get the time of day).<br>When done with processing the request, the FUSE daemon writes the response back to <code>/dev/fuse</code>; FUSE’s kernel driver then marks the request as completed, and wakes up the original user process which submitted the request.</p></blockquote><p>一些文件系统的操作并不需要和用户态的 FUSE daemon 交流，就可以完成。例如，读一个之前读过的文件，因为它的 page 已经被存在 page cache 里面了，所以就不需要再把请求 forward 给 FUSE driver 了。这里可能被 cache 的不仅包括 data，还包括一些 meta data。例如 stat 查询 inode 和 dentry 的信息，它们被存在 Linux 的 dcache 中，可以直接在内核态处理，而不需要调用 FUSE daemon 了。</p><p><img src="/img/dbpaper/fuse/fuseornot/t2.1.png"></p><h3 id="2-2-User-Kernel-Protocol"><a href="#2-2-User-Kernel-Protocol" class="headerlink" title="2.2 User-Kernel Protocol"></a>2.2 User-Kernel Protocol</h3><h3 id="2-3-Library-and-API-Levels"><a href="#2-3-Library-and-API-Levels" class="headerlink" title="2.3 Library and API Levels"></a>2.3 Library and API Levels</h3><p>High-level 的 API 允许开发者跳过 path-to-inode 映射。或者说 inode 在 high level API 中根本不存在了，high level API 只操作路径。FORGET inode method 根本就不需要了。</p><p>无论是 high 还是 low level，反正大概都是要实现 42 个方法。</p><p><img src="/img/dbpaper/fuse/fuseornot/2.2.png"></p><h3 id="2-4-Queues"><a href="#2-4-Queues" class="headerlink" title="2.4 Queues"></a>2.4 Queues</h3><p>一个请求在同一时间只能属于下面五个队列的其中一个。</p><p><img 
src="/img/dbpaper/fuse/fuseornot/2.3.png"></p><p>FORGET requests are sent when the inode is evicted, and these requests would queue up together with regular file system requests, if a separate forgets queue did not exist. 如果有大量的 FORGET 请求，就无法处理其他的文件系统请求了。这个行为是在一个有 3200 万个 inode 的节点上看到的，当所有的 inode 被从 icache evict 的时候，系统可能会 hang 大概 30min。</p><p>pending queue 中是同步请求。</p><p>当 daemon 从 /dev/fuse 中读取时，会按下面的顺序：</p><ul><li>最高优先级会处理 Interrupts 队列中的请求</li><li>FORGET 和非 FORGET 请求会被公平地选择，处理完 8 个非 FORGET 请求，会再处理 16 个 FORGET 请求。这也保障了 FORGET 请求不会堆积起来。</li></ul><p>请求处理的流程是：</p><ul><li>pending queue 中最老的请求会被传送到 user space，然后被 processing queue 立即处理。<br>  INTERRUPT 和 FORGET 请求并不会从 user daemon 获得回复，所以当 daemon 读取这些请求的时候，它们就终止了。</li><li>如果 pending queue 上没有请求，那么 FUSE daemon 就会阻塞在一个 read 调用上。</li><li>如果 daemon 回复了，对应的请求就会从 processing queue 中被移除，这个请求就完成了。同时，blocked user processes (e.g., the ones waiting for READ to complete) are notified that they can proceed.</li></ul><p>background queue 是对异步请求的。默认情况下，只有读是异步的，因为可以 read ahead。如果开启 writeback cache，则 write 也会走到 background queue 中。开启 writeback cache 后，从 user process 来的 write 会先聚集在 page cache 中，然后 bdflush 线程会醒来，去刷脏页。</p><p>background queue 中的请求会一点点汇到 pending queue 中，在 pending queue 中的异步请求的数量是根据 max_background 参数（默认 12）来调整的。目的是：</p><ul><li>避免异步请求影响同步请求</li><li>开启 multi-threaded 选项后，限制 user daemon 线程的数量</li></ul><h3 id="2-5-Splicing-and-FUSE-Buffers"><a href="#2-5-Splicing-and-FUSE-Buffers" class="headerlink" title="2.5 Splicing and FUSE Buffers"></a>2.5 Splicing and FUSE Buffers</h3><p>在初始设置中，FUSE daemon 需要从 /dev/fuse 中读取请求，并且把回复也写到这个设备中。这样在内核和用户态之间复制内存，对 read 和 write 请求是有害的，因为通常它们会包含很多数据。因此，FUSE 会使用 Linux kernel 提供的 splice 技术。</p><p>splice() 系列的系统调用可以在两个 in-kernel memory buffer 之间传递数据，这样就不需要到用户态拷贝一次了。</p><blockquote><p>例如 sendfile、mmap、splice 都是 Linux 中的零拷贝技术，主要是消除一些不需要的拷贝</p></blockquote><p>FUSE 将它的 buffer 表示为下面两种形式之一：</p><ul><li>The regular memory region identified by a pointer in the user daemon’s address space, or</li><li>The 
kernel-space memory pointed by a file descriptor of a pipe where the data resides.</li></ul><p>If a user-space file system implements the <code>write_buf()</code> method (in the low-level API), then FUSE first splices the data from <code>/dev/fuse</code> to a Linux pipe and then passes the data directly to this method as a buffer containing a file descriptor of the pipe. FUSE splices only <code>WRITE</code> requests and only the ones that contain <strong>more than a single page of data</strong>.</p><p>Similar logic applies to the replies to READ requests if the <code>read_buf()</code> method is implemented. However, the <code>read_buf()</code> method is only present in the high-level API; for the low-level API, the file-system developer has to differentiate between splice and non-splice flows inside the read method itself.</p><blockquote><p>下面这段话讲解了为什么 header 肯定得 copy</p></blockquote><p>If the library is compiled with splice support, the kernel supports it, and appropriate commandline parameters are set, then splice() is always called for every request (including the request’s header). However, the header of every single request needs to be examined, for example to identify the request’s type and size. This examination is not possible if the FUSE buffer has only the file descriptor of a pipe where the data resides. So, for every request the header is then read from the pipe using regular read() calls (i.e., small, at most 80 bytes, memory copying is always performed). FUSE then splices the requested data if its size is larger than a single page (excluding the header): therefore only big writes are spliced. 
For reads, replies larger than two pages are spliced.</p><h4 id="扩展"><a href="#扩展" class="headerlink" title="扩展"></a>扩展</h4><p>注意，上文中得到的那个 pipe，实际上是可以方便地把数据通过 pipe 的形式往下游传递的。应该是对应了第二种 buffer 的形式。</p><p><a href="https://man7.org/linux/man-pages/man2/vmsplice.2.html" target="_blank" rel="noopener">vmsplice</a> 并不能把管道里面的数据移动到用户态内存中，所以如果需要在用户态对数据进行处理，例如需要加解密或者加解压，则得 Copy。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/** Write contents of buffer to an open file</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * Similar to the write() method, but data is supplied in a</span></span><br><span class="line"><span class="comment"> * generic buffer.  
Use fuse_buf_copy() to transfer data to</span></span><br><span class="line"><span class="comment"> * the destination.</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * Unless FUSE_CAP_HANDLE_KILLPRIV is disabled, this method is</span></span><br><span class="line"><span class="comment"> * expected to reset the setuid and setgid bits.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">int</span> (*write_buf) (<span class="keyword">const</span> <span class="keyword">char</span> *, struct fuse_bufvec *buf, <span class="keyword">off_t</span> off,</span><br><span class="line">          struct fuse_file_info *);</span><br></pre></td></tr></table></figure><h3 id="2-6-Notifications"><a href="#2-6-Notifications" class="headerlink" title="2.6 Notifications"></a>2.6 Notifications</h3><p>FUSE 可以通过回复内核的 request 向内核传递消息。但有时候，它需要主动向内核传递消息，比如 poll 调用时，如果事件发生了，FUSE 需要主动通知内核。</p><p><img src="/img/dbpaper/fuse/fuseornot/t2.2.png"></p><h3 id="2-7-Multithreading"><a href="#2-7-Multithreading" class="headerlink" title="2.7 Multithreading"></a>2.7 Multithreading</h3><p>如果 pending queue 里面有多个请求，FUSE 会启动新的线程。<strong>每个线程处理一个请求</strong>，我理解这里是内核线程。处理完后，会检查是否有超过 10 个线程，如果是，则线程退出。</p><p>There is no explicit upper limit on the number of threads created by the FUSE library.<br>The implicit limit arises due to two factors.</p><ul><li>First, by default, only 12 asynchronous requests (max background parameter) can be in the pending queue at one time.</li><li>Second, the number of synchronous requests in the pending queue is constrained by the number of user processes that submit requests.</li><li>In addition, for every INTERRUPT and FORGET requests, a new thread is invoked. 
</li></ul><p>Therefore, the total number of FUSE daemon threads is at most (12 + number of processes with outstanding I/O + number of interrupts + number of forgets).</p><h3 id="2-8-Linux-Write-back-Cache-and-FUSE"><a href="#2-8-Linux-Write-back-Cache-and-FUSE" class="headerlink" title="2.8 Linux Write-back Cache and FUSE"></a>2.8 Linux Write-back Cache and FUSE</h3><h4 id="2-8-1-Linux-Page-Cache"><a href="#2-8-1-Linux-Page-Cache" class="headerlink" title="2.8.1 Linux Page Cache"></a>2.8.1 Linux Page Cache</h4><p>Page cache 的作用是减少 disk IO，行为是将数据存在 RAM 里面。</p><p>对于读，就是很简单的 cache。对于写，是三个策略：</p><ul><li>no-write 直接写 disk，invalidate cache</li><li>write-through 写 disk 也写 cache</li><li>write-back 只写 cache，后续异步写 disk。这些中间状态的 page 也被称为 dirty page</li></ul><p>刷 dirty page 发生在三种情况：</p><ul><li>free memory 低于阈值</li><li>dirty data 的存在时间超过某个阈值</li><li>用户调用了 sync 或者 fsync</li></ul><p>All of the above three tasks are performed by the group of flusher threads. First, flusher threads flush dirty data to disk when the amount of free memory in the system drops below a threshold value.<br>This is done by a flusher thread calling a function <code>bdi_writeback_all</code>, which continues to write data to disk until the following two conditions are true:</p><ol><li>The specified number of pages has been written out; and</li><li>The amount of free memory is above the threshold.</li></ol><p>在 2.6 内核前，内核中有 bdflush 和 kupdated 两个线程，它们的作用和现在的 flusher 一样：</p><ul><li>bdflush 负责在 free memory 变少的时候，后台 writeback dirty page</li><li>kupdated 负责周期性地 writeback dirty page</li></ul><blockquote><p>The major disadvantage in bdflush was that it consisted of one thread. This led to congestion during heavy page writeback where the single bdflush thread blocked on a single slow device. </p></blockquote><p>在 2.6 内核中，引入了 pdflush 这种线程，它们会根据 io load 调整为 2 到 8 之间的数量。The pdflush threads were not associated with any specific disk, but instead they were global to all disks. 
The downside of pdflush was that it can easily bottleneck on congested disks, starving other devices from getting service.</p><p>Therefore, a per-spindle flushing method was introduced to improve performace. The flusher threads replaced the pdflush threads in the 2.6.32 kernel.</p><p>The 2.6.32 kernel solved this problem by enabling multiple flusher threads to exists where each thread individually flushes dirty pages to a disk, allowing different threads to flush data at different rates to different disks. This also introduced the concept of per-backing device info (BDI) structure which maintains the per-device (disk) information like dirty list, read ahead size, flags, and B.D.Mn.R and B.D.Mx.R which are discussed in the next section.</p><h4 id="2-8-2-Page-Cache-Parameters"><a href="#2-8-2-Page-Cache-Parameters" class="headerlink" title="2.8.2 Page Cache Parameters"></a>2.8.2 Page Cache Parameters</h4><p>Global Background Ratio (G.B.R): The percentage of Total Available Memory filled with dirty pages at which the background kernel flusher threads wake up and start writing the dirty pages out. The processes that generate dirty pages are not throttled at this point. G.B.R can be changed by the user at /proc/sys/vm/dirty_background_ratio. By default this value is set to 10%.</p><p>Global Dirty Ratio (G.D.R): The percentage of Total Available Memory that can be filled with dirty pages before the system starts to throttle incoming writes. When the system gets to this point, all new I/O’s get blocked and the dirty data is written to disk until the amount of dirty pages in the system falls below this G.D.R. This value can be changed by the user at /proc/sys/vm/dirty ratio. By default this value is set to 20%.</p><p>Global Background Threshold (G.B.T): The absolute number of pages in the system that, when crossed, the background kernel flusher thread will start writing out the dirty data. 
This is obtained from the following formula:<br>G.B.T = TotalAvailableMemory × G.B.R</p><p>Global Dirty Threshold (G.D.T): The absolute number of pages that can be filled with dirty pages before the system starts to throttle incoming writes. This is obtained from the following formula:<br>G.D.T = TotalAvailableMemory × G.D.R</p><p>The next two parameters mainly limit the share of the page cache that different devices may use:</p><p>BDI Min Ratio (B.Mn.R): Generally, each device is given a part of the page cache that relates to its current average write-out speed in relation to the other devices. This parameter gives the minimum percentage of the G.D.T (page cache) that is available to the file system. This value can be changed by the user at <code>/sys/class/bdi/&lt;bdi&gt;/min_ratio</code> after the mount, where <code>&lt;bdi&gt;</code> is either a device number for block devices, or the value of st_dev on non-block-based file systems which set their own BDI information (e.g., a FUSE file system). By default this value is set to 0%.</p><p>BDI Max Ratio (B.Mx.R): The maximum percentage of the G.D.T that can be given to the file system (100% by default). This limits the particular file system to use no more than the given percentage of the G.D.T. It is useful in situations where we want to prevent one file system from consuming all or most of the page cache.</p><p>BDI Dirty Threshold (B.D.T): The absolute number of write-back cache pages that can be allotted to a particular device. This is similar to the G.D.T but for a particular BDI device. As a system runs, B.D.T fluctuates between the lower limit (G.D.T × B.Mn.R) and the upper limit (G.D.T × B.Mx.R).</p><p>BDI Background Threshold (B.B.T): When this absolute number of pages, a per-device fraction of the G.D.T, is crossed, the background kernel flusher thread starts writing out the data. This is similar to the G.B.T but for a particular file system using BDI.
B.B.T = B.D.T × G.B.T / G.D.T</p><p>NR_FILE_DIRTY: The total number of pages in the system that are dirty. This parameter is incremented/decremented by the VFS (page cache).</p><p>NR_WRITEBACK: The total number of pages in the system that are currently under write-back. This parameter is incremented/decremented by the VFS (page cache).</p><p>BDI_RECLAIMABLE: The total number of pages belonging to all the BDI devices that are dirty. A file system that supports BDI is responsible for incrementing/decrementing the value of this parameter.</p><p>BDI_WRITEBACK: The total number of pages belonging to all the BDI devices that are currently under write-back. A file system that supports BDI is responsible for incrementing/decrementing the value of this parameter.</p><h2 id="Implementations"><a href="#Implementations" class="headerlink" title="Implementations"></a>Implementations</h2><p>The implementation is mainly a stackfs.</p><p>It covers the following aspects:</p><ul><li>inode</li><li>lookup</li><li>session information<br>  mainly the per-FUSE-session/connection data management</li><li>directories</li><li>file create and open</li></ul><blockquote><p>Stackfs assigns its inode the number equal to the address of the inode structure in memory (by type-casting)</p></blockquote><p>That is, a Stackfs inode number is numerically equal to the address of its inode struct, whereas FUSE's high-level API has to maintain a mapping from FUSE inode numbers to the locations of the inode structs.</p><h2 id="Methodology"><a href="#Methodology" class="headerlink" title="Methodology"></a>Methodology</h2><p>FUSE has evolved significantly over the years and added several useful optimizations: a writeback cache, zero-copy via splicing, and multi-threading.
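</p><p>One optimization evaluated below raises the maximum size of a FUSE request from 4 KiB to 128 KiB. Some quick arithmetic shows why that matters, since each request costs a user/kernel round trip through /dev/fuse (the numbers here are assumptions for illustration, not measurements from the paper):</p>

```python
# Back-of-the-envelope request counts for writing a large file through
# FUSE at two maximum request sizes. Illustrative only.
FILE_SIZE = 1 * 1024 * 1024 * 1024  # assume a 1 GiB sequential write

def requests_needed(max_request_bytes, file_size=FILE_SIZE):
    # Each WRITE request carries at most max_request_bytes of payload,
    # and each one is a round trip through /dev/fuse.
    return -(-file_size // max_request_bytes)  # ceiling division

legacy = requests_needed(4 * 1024)       # 4 KiB requests -> 262144 trips
optimized = requests_needed(128 * 1024)  # 128 KiB requests -> 8192 trips
```

<p>Raising the request size cuts the number of round trips 32-fold for large sequential I/O, which is one reason a fully optimized configuration can outperform the baseline.</p><p>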
In our experience, some in the storage community tend to pre-judge FUSE’s performance—assuming it is poor—mainly due to not having enough information about the improvements FUSE has made over the years.</p><p>To compare the FUSE optimizations of recent years, two configurations are used:</p><ul><li>StackfsBase</li><li>StackfsOpt -&gt; includes all the FUSE optimizations<ul><li>writeback cache</li><li>maximum size of a single FUSE request raised from 4KiB to 128KiB</li><li>the user daemon runs multi-threaded, i.e. fuse_session_loop_mt</li><li>splicing is activated for all operations, which should refer to the splice zero-copy technique</li></ul></li></ul><h1 id="一些补充"><a href="#一些补充" class="headerlink" title="一些补充"></a>Additional Notes</h1><h2 id="FUSE-和-zero-copy"><a href="#FUSE-和-zero-copy" class="headerlink" title="FUSE 和 zero copy"></a>FUSE and Zero Copy</h2><p>FUSE communicates as shown in the earlier figure. Some steps in FUSE can be zero-copied, as the chapter Splicing and FUSE Buffers explains:</p><ul><li>for WRITE, splice is enabled, with restrictions, for data larger than one page</li><li>for READ, reading the header directly still cannot be avoided</li></ul><p>FUSE cannot achieve full zero copy, mainly because everything has to go through /dev/fuse.</p><p>The /dev/fuse device behaves much like an RPC channel: the kernel writes commands such as READ and OPEN into it, the daemon reads those commands out, and the daemon writes its responses back into /dev/fuse.</p><p>A command looks roughly like:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">[ header | payload ]</span><br></pre></td></tr></table></figure><p>Why mmap cannot be used for zero copy: the FUSE daemon is an ordinary user process, so by design it is allowed to crash, be attacked, or hang. With mmap, the kernel would find it hard to control lifetimes; for example, when the FUSE daemon dies, someone has to clean up the mmapped files.</p><h1 id="一些知识"><a href="#一些知识" class="headerlink" title="一些知识"></a>Background Knowledge</h1><h2 id="零拷贝"><a href="#零拷贝" class="headerlink" title="零拷贝"></a>Zero Copy</h2><p><a href="/2025/09/30/zero-copy/">See Zero-Copy Techniques</a></p><h1 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h1><ul><li><a href="https://www.fsl.cs.sunysb.edu/docs/fuse/bharath-msthesis.pdf" target="_blank" rel="noopener">https://www.fsl.cs.sunysb.edu/docs/fuse/bharath-msthesis.pdf</a><br>  the full thesis</li><li><a href="https://github.com/0voice/kernel_awsome_feature/blob/main/Fuse/FUSE.pdf" 
target="_blank" rel="noopener">https://github.com/0voice/kernel_awsome_feature/blob/main/Fuse/FUSE.pdf</a></li><li><a href="https://github.com/libfuse/libfuse/blob/master/example/passthrough.c" target="_blank" rel="noopener">https://github.com/libfuse/libfuse/blob/master/example/passthrough.c</a><br>  a minimal demo</li><li><a href="https://github.com/0voice/kernel_awsome_feature/blob/main/%E8%87%AA%E5%88%B6%E6%96%87%E4%BB%B6%E7%B3%BB%E7%BB%9F%20%E2%80%94%2003%20Go%E5%AE%9E%E6%88%98%EF%BC%9Ahello%20world%20%E7%9A%84%E6%96%87%E4%BB%B6%E7%B3%BB%E7%BB%9F.md" target="_blank" rel="noopener">https://github.com/0voice/kernel_awsome_feature/blob/main/%E8%87%AA%E5%88%B6%E6%96%87%E4%BB%B6%E7%B3%BB%E7%BB%9F%20%E2%80%94%2003%20Go%E5%AE%9E%E6%88%98%EF%BC%9Ahello%20world%20%E7%9A%84%E6%96%87%E4%BB%B6%E7%B3%BB%E7%BB%9F.md</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;看下 FUSE 的相关知识。&lt;/p&gt;
&lt;p&gt;Filesystem In Userspace 也就是 fuse，是 linux 的一个内核模块。&lt;/p&gt;</summary>
    
    
    
    
    <category term="Linux" scheme="http://www.calvinneo.com/tags/Linux/"/>
    
    <category term="数据库" scheme="http://www.calvinneo.com/tags/数据库/"/>
    
    <category term="FileSystem" scheme="http://www.calvinneo.com/tags/FileSystem/"/>
    
  </entry>
  
  <entry>
    <title>Excerpt from Yes Minister</title>
    <link href="http://www.calvinneo.com/2025/02/15/from-ym/"/>
    <id>http://www.calvinneo.com/2025/02/15/from-ym/</id>
    <published>2025-02-15T15:42:32.000Z</published>
    <updated>2025-06-11T13:49:27.241Z</updated>
    
    <content type="html"><![CDATA[<p>Excerpts from the novel of Yes Minister, including metaphors, grammar notes and funny paragraphs.</p><a id="more"></a><h1 id="Editor’s-Note"><a href="#Editor’s-Note" class="headerlink" title="Editor’s Note"></a>Editor’s Note</h1><blockquote><p>Years of political training and experience had taught Hacker to use twenty words where one would do, to dictate millions of words where mere thousands would suffice, and to use language to blur and fudge issues and events so that they became incomprehensible to others. Incomprehensibility can be a haven for some politicians, for therein lies temporary safety.</p></blockquote><blockquote><p>But his natural gift for the misuse of language, though invaluable to an active politician, was not an asset to a would-be author.</p></blockquote><h1 id="OPEN-GOVERMENT"><a href="#OPEN-GOVERMENT" class="headerlink" title="OPEN GOVERMENT"></a>OPEN GOVERNMENT</h1><p>The TV series adds a scene in which a BBC interview call is picked up by mistake, which I think works very well.</p><blockquote><p>‘Then why don’t you marry him?’ she asked. ‘I now pronounce you man and political adviser. Whom politics has joined let no wife put asunder.’</p></blockquote><h1 id="THE-OFFICIAL-VISIT"><a href="#THE-OFFICIAL-VISIT" class="headerlink" title="THE OFFICIAL VISIT"></a>THE OFFICIAL VISIT</h1><blockquote><p>‘Shall we scramble?’ he said.<br>‘Where to?’ I said, then felt rather foolish as I realised what he was talking about. Then I realised it was another of Bernard’s draft suggestions: what’s the point of scrambling a phone conversation about something that’s just been on the television news?</p></blockquote><p>Scramble can mean to clamber or climb, but it can also mean to encode a radio or phone signal so that only the intended listeners can make out the call.</p><blockquote><p>And if the new president is Marxist-backed, who better to win him over to our side than Her Majesty?</p></blockquote><p>That is: if the new president has Marxist backing, who better than Her Majesty to talk him round to our side?</p><blockquote><p>there was one 747 that belonged to nine different African airlines in one month.
They called it the mumbo-jumbo.</p></blockquote><h1 id="THE-ECONOMY-DRIVE"><a href="#THE-ECONOMY-DRIVE" class="headerlink" title="THE ECONOMY DRIVE"></a>THE ECONOMY DRIVE</h1><blockquote><p>I was forced to move on to the next two white elephants.</p></blockquote><p>That is: I had to move on to the next two expensive, useless undertakings.</p><blockquote><p>‘Government buildings do not need fire safety clearance’<br>‘Why?’<br>‘Perhaps,’ Humphrey offered, ‘because Her Majesty’s Civil Servants are not easily inflamed.’</p></blockquote><blockquote><p>Frank chimed in eagerly, ‘Yes, that would get rid of ninety civil servants at a stroke.’<br>‘Or indeed,’ said Sir Humphrey, ‘at a strike.’</p></blockquote><p>Later comes the classic smile at versus smile on exchange.</p><h1 id="BIG-BROTHER"><a href="#BIG-BROTHER" class="headerlink" title="BIG BROTHER"></a>BIG BROTHER</h1><blockquote><p>The local party, the constituency, my family, all of them are proud of me for getting into the Cabinet – yet they are all resentful that I have less time to spend on them and are keen to remind me that I’m nothing special, just their local MP, and that I mustn’t get ‘too big for my boots’. They manage both to grovel and patronise me simultaneously. It’s hard to know how to handle it.</p></blockquote><blockquote><p>And he assumes, rightly, that the Minister has too much else to do. [The whole process is called Creative Inertia – Ed.]</p></blockquote><blockquote><p>He also warned me of the ‘Three Varieties of Civil Service Silence’, which would be Humphrey’s last resort if completely cornered:<br>The silence when they do not want to tell you the facts: Discreet Silence.<br>The silence when they do not intend to take any action: Stubborn Silence.<br>The silence when you catch them out and they haven’t a leg to stand on. They imply that they could vindicate themselves completely if only they were free to tell all, but they are too honourable to do so: Courageous Silence.</p></blockquote><blockquote><p>I explained to her that the Opposition aren’t really the opposition.
They’re just called the Opposition. But, in fact, they are the opposition in exile. The Civil Service are the opposition in residence.</p></blockquote><blockquote><p>In the second place, if there had been investigations, which there haven’t or not necessarily, or I am not at liberty to say if there have, there would have been a project team which, had it existed, on which I cannot comment, would now be disbanded if it had existed and the members returned to their original departments, had there indeed been any such members.</p></blockquote><p>A classic string of subjunctives.</p><blockquote><p>But they’ve convinced me that they can. Indeed my Permanent Secretary is staking his reputation on it. And, if not, heads will roll.</p></blockquote><p>Hacker understands the press: this is the classic tactic of using the media to force someone’s hand. In the later transport supremo affair, the PM uses the same trick to push Hacker into the job.</p><h1 id="THE-WRITING-ON-THE-WALL"><a href="#THE-WRITING-ON-THE-WALL" class="headerlink" title="THE WRITING ON THE WALL"></a>THE WRITING ON THE WALL</h1><blockquote><p>Woe betide any Minister who lifts the phone to try to sort out a foreign trade deal, for instance.</p></blockquote><p>‘Woe betide X’ means things will go badly for X; betide means to happen or befall.</p><blockquote><p>‘With respect, Minister,’ countered Sir Humphrey (untruthfully), ‘how do you know it says the opposite if it is totally unintelligible?’</p></blockquote><p>This time it is Humphrey rather than Bernard who picks at the logic.</p><blockquote><p>Hacker was beginning to understand Civil Service code language. Other examples are:<br>‘I think we have to be very careful.’ Translation: We are not going to do this.<br>‘Have you thought through all the implications?’ Translation: You are not going to do this.<br>‘It is a slightly puzzling decision.’ Translation: Idiotic!<br>‘Not entirely straightforward.’ Translation: Criminal.<br>‘With the greatest possible respect, Minister . . 
.’ Translation: Minister, that is the silliest idea I’ve ever heard</p></blockquote><p>This is the Editor’s summary, and I find it very amusing.</p><blockquote><p>If a purely hypothetical Minister were to be unhappy with a departmental draft of evidence to a committee, and if the hypothetical Minister were to be planning to replace it with his own hypothetical draft worked out with his own political advisers at his party HQ, and if this Minister was planning to bring in his own draft so close to the final date for evidence that there would be no time to redraft it, and if the hypothetical Private Secretary were to be aware of this hypothetical draft – in confidence – should the hypothetical Private Secretary pass on the information to the Perm. Sec. of the hypothetical Department?</p></blockquote><p>A classic passage of hypotheticals.</p><blockquote><p>‘We shall always support you as your standard-bearer, Minister but not as your pall-bearer.’</p></blockquote><blockquote><p>‘If you must do this damn silly thing,’ he said, ‘don’t do it in this damn silly way.’</p></blockquote><p>Two rare occasions of Humphrey speaking completely plainly.</p><blockquote><p>Bernard assured me that I didn’t really need to know much about the proposal because his information on the grapevine, through the Private Office network, was that the proposal would go through on the nod.</p></blockquote><p>Why does grapevine also mean hearsay?</p><blockquote><p>Donald Hughes, rubbing salt in the wound, apparently described it as ‘approbation, elevation and castration, all in one stroke’. It seems he suggested that I should take the title Lord Hacker of Kamikaze.</p></blockquote><blockquote><p>I told Humphrey I was appalled.<br>‘You’re appalled?’ he said. ‘I’m appalled.’<br>Bernard said he was appalled, too. And, there’s no doubt about it, the situation is appalling.</p></blockquote><blockquote><p>Industrial Harmony. 
That means strikes.</p></blockquote><p>The Editor adds a note here, and it is quite amusing.</p><blockquote><p>You’ll probably spend the rest of your career in the Vehicle Licensing Centre in Swansea.</p></blockquote><p>Quite a grudge against Swansea.</p><blockquote><p>Then Humphrey proposed that we work together on this. This was a novel suggestion, to say the least.</p></blockquote><blockquote><p>‘I’m awfully sorry to quibble again, Minister, but you can’t actually stop things before they start,’ intervened Bernard, the wet-hen-in-chief. He’s really useless in a crisis.</p></blockquote><blockquote><p>‘Same reason,’ came the reply. ‘It’s just like the United Nations. The more members it has, the more arguments you can stir up, and the more futile and impotent it becomes.’</p></blockquote><p>Rather insightful, I think.</p><blockquote><p>Then I had an idea. I suddenly realised that Martin will be on my side. I can’t imagine why I didn’t think of it before. He’s Foreign Secretary – and, to my certain knowledge, Martin is genuinely pro Europe. (Humphrey calls him ‘naïf’). Also I ran his campaign against the PM, and he only stands to lose if I’m squeezed out.</p></blockquote><blockquote><p>I agreed, and remarked that this Europass thing is the worst disaster to befall the government since I was made a member of the Cabinet. [We don’t think that Hacker actually meant what he seems to be saying here Ed.]</p></blockquote><blockquote><p>It’s awarded to the statesman who has made the biggest contribution to European unity since Napoleon. 
[That’s if you don’t count Hitler – Ed.]</p></blockquote><p>The Editor’s notes in these passages are all great fun.</p><blockquote><p>when you’ve got them by the balls, their hearts and minds will follow.</p></blockquote><h1 id="THE-RIGHT-TO-KNOW"><a href="#THE-RIGHT-TO-KNOW" class="headerlink" title="THE RIGHT TO KNOW"></a>THE RIGHT TO KNOW</h1><blockquote><p>Sir Humphrey replied that I need not look far – Private Secretaries who could not occupy their Ministers were a threatened species.</p></blockquote><blockquote><p>‘Almost anything can be attacked as a loss of amenity and almost anything can be defended as not a significant loss of amenity. One must appreciate the significance of significant.’</p></blockquote><blockquote><p>Humphrey suggested I look inside them. I did, and to my utter astonishment I saw that there were a handful of signatures in each book, about a hundred altogether at the most. A very cunning ploy – a press photo of a petition of six fat books is so much more impressive than a list of names on a sheet of Basildon Bond.</p></blockquote><blockquote><p>Those civil servants are always kowtowing to Daddy, but they never take any real notice of him.</p></blockquote><p>Kowtow comes from the Chinese 磕头, to knock one’s head to the ground in respect.</p><blockquote><p>She told me she’d been out with the trots. I was momentarily sympathetic and suggested she saw the doctor. Then I realised she meant the Trotskyites. I’d been slow on the uptake because I didn’t know she was a Trotskyite. Last time we talked she’d been a Maoist.</p></blockquote><blockquote><p>I noted that Lucy was giving out the press release at five p.m. Very professional. Misses the evening papers, which not too many people read, and therefore makes all the dailies. 
She’s learned something from being a politician’s daughter.</p></blockquote><h1 id="JOBS-FOR-THE-BOYS"><a href="#JOBS-FOR-THE-BOYS" class="headerlink" title="JOBS FOR THE BOYS"></a>JOBS FOR THE BOYS</h1><blockquote><p>you scratch my back, I’ll scratch yours.</p></blockquote><h1 id="THE-COMPASSIONATE-SOCIETY"><a href="#THE-COMPASSIONATE-SOCIETY" class="headerlink" title="THE COMPASSIONATE SOCIETY"></a>THE COMPASSIONATE SOCIETY</h1><p>I am rather curious why the DAA is held responsible for a hospital’s extravagance.</p><blockquote><p>I informed Bernard that most of our journalists are so amateur that they would have grave difficulty in finding out that today is Thursday.<br>‘It’s actually Wednesday, Minister,’ he said.</p></blockquote><blockquote><p>Sir Humphrey preferred to write in margins where possible, but, if not possible, simulated margins made him feel perfectly comfortable.</p></blockquote><blockquote><p>We can infer from this note that Mr Bernard Woolley – as he then was – mentioned the matter of St Edward’s Hospital to Sir Humphrey, although when we challenged Sir Bernard – as he now is – on this point he had no recollection of doing so – Ed.</p></blockquote><h1 id="THE-DEATH-LIST"><a href="#THE-DEATH-LIST" class="headerlink" title="THE DEATH LIST"></a>THE DEATH LIST</h1><h1 id="DOING-THE-HONOURS"><a href="#DOING-THE-HONOURS" class="headerlink" title="DOING THE HONOURS"></a>DOING THE HONOURS</h1><p>I am also curious why the DAA gets to meddle in education matters.</p><blockquote><p>Chat over the port and walnuts</p></blockquote><p>This usually describes a relaxed social occasion where people chat over port and walnuts after dinner.</p><blockquote><p>He explained that home students were to be avoided at all costs! Anything but home students.</p></blockquote><blockquote><p>Sir William Guthrie, OM, FRS, FBA, Ph.D, MC, MA (Oxon)<br>Group Captain Christopher Venables, DSC, MA<br>Sir Humphrey Appleby, KCB, MVO, MA (Oxon)<br>Bernard Woolley, MA (Cantab)<br>The Rt Hon. James Hacker, PC, MP, BSc. 
(Econ)<br>Sir Arnold Robinson, GCMG, CVO, MA (Oxon)</p></blockquote><p>Cantab refers to Cambridge.</p><blockquote><p>In fact, the only time a civil servant is known to have refused a knighthood was in 1496. This was because he already had one.</p></blockquote><blockquote><p>Just as incomes policies have always been manipulated by those that control them: for instance, the 1975 Pay Policy provided exemptions for Civil Service increments and lawyers’ fees. Needless to say, the policy was drafted by civil servants and parliamentary draftsmen, i.e. lawyers.</p></blockquote><blockquote><p>Quis custodiet ipsos custodes?</p></blockquote><blockquote><p>And how did the civil servants get away with creating these remarkably favourable terms of service for themselves? Simply by keeping a low profile. They have somehow managed to make people feel that discussing the matter at all is in rather poor taste.</p></blockquote><p>A very interesting observation: keep a low profile, and make any discussion of the matter feel in rather poor taste.</p><blockquote><p>Cut no ice with me</p></blockquote><p>An idiom: something has no effect on me.</p><blockquote><p>The penny dropped</p></blockquote><p>An idiom: someone finally, or suddenly, understands something they had not grasped before.</p><blockquote><p>‘There is no reason,’ he said, stabbing the air with his finger, ‘to change a system which has worked well in the past.’<br>‘But it hasn’t,’ I said.<br>‘We have to give the present system a fair trial,’ he stated. This seemed quite reasonable on the face of it. But I reminded him that the Most Noble Order of the Garter was founded in 1348 by King Edward III. ‘Surely it must be getting towards the end of its trial period?’ I said.<br>So Humphrey tried a new tack. He said that to block honours pending economies might create a dangerous precedent. What he means by ‘dangerous precedent’ is that if we do the right thing now, then we might be forced to do the right thing again next time. 
And on that reasoning nothing should ever be done at all.</p></blockquote><blockquote><p>‘How do they award the Thistle?’ I asked.<br>‘A committee sits on it,’ said Bernard</p></blockquote><p>A pun: the committee ‘sits on it’.</p><blockquote><p>‘As you know,’ he said, ‘the letters JB are the highest honour in the Commonwealth.’<br>I didn’t know.<br>Humphrey eagerly explained. ‘Jailed by the British. Gandhi, Nkrumah, Makarios, Ben-Gurion, Kenyatta, Nehru, Mugabe – the list of world leaders is endless and contains several of our students.’</p></blockquote><blockquote><p>although the Cabinet Secretary is theoretically primus inter pares he is in reality very much primus. It seems that all Permanent Secretaries are equal but some are more equal than others.</p></blockquote><blockquote><p>thin end of the wedge</p></blockquote><blockquote><p>Perhaps Appleby is not an absolutely first-rank candidate to succeed one as Cabinet Secretary. Not really able in every department. Might do better in a less arduous job, such as chairman of a clearing bank or as an EEC official.</p></blockquote><blockquote><p>‘Of course,’ said Bernard, ‘but it’s years and years since the Department of Transport had a Permanent Secretary from Cambridge.’</p></blockquote><p>It seems Bernard is the only Cambridge man.</p><h1 id="THE-GREASY-POLE"><a href="#THE-GREASY-POLE" class="headerlink" title="THE GREASY POLE"></a>THE GREASY POLE</h1><p>“The Greasy Pole” is an idiomatic expression used to describe the difficult and often slippery route to advancement in one’s career or profession. It is particularly used in contexts where success is hard to achieve and the path to the top is fraught with challenges and obstacles.</p><blockquote><p>‘Simple, Minister,’ he explained. ‘It means “with” or “after”, or sometimes “beyond” – it’s from the Greek, you know.’<br>[Like all Permanent Secretaries, Sir Humphrey Appleby was a generalist. Most of them studied classics, history, PPE or modern languages. 
Of course you might expect the Permanent Secretary at the Department of Administrative Affairs to have a degree in business administration, but of course you would be wrong – Ed.]<br>Then he went on to explain that metadioxin means ‘with’ or ‘after’ dioxin, depending on whether it’s with the accusative or the genitive: with the accusative it’s ‘beyond’ or ‘after’, with the genitive it’s ‘with’ as in Latin, where the ablative is used for words needing a sense of with to precede them.<br>Bernard added – speaking for the first time in the whole meeting – that of course there is no ablative in Greek, as I would doubtless recall.<br>I told him I recalled no such thing, and later today he wrote me a little memo, explaining all the above Greek and Latin grammar.</p></blockquote><blockquote><p>‘Well,’ he said eventually, ‘inert means that . . . it’s not . . . ert.’</p></blockquote><blockquote><p>I searched desperately for an analogy, ‘It’s like Littler and Hitler,’ I explained. ‘We’re not saying that you’re like Hitler because your name sounds similar.’</p></blockquote><blockquote><p>Stage two: Discredit the evidence that you are not publishing<br>This is, of course, much easier than discrediting evidence that you do publish. You do it indirectly, by press leaks. You say:<br>(a) that it leaves important questions unanswered<br>(b) that much of the evidence is inconclusive<br>(c) that the figures are open to other interpretations<br>(d) that certain findings are contradictory<br>(e) that some of the main conclusions have been questioned<br>Points (a) to (d) are bound to be true. In fact, all of these criticisms can be made of a report without even reading it. There are, for instance, always some questions unanswered – such as the ones they haven’t asked. As regards (e), if some of the main conclusions have not been questioned, question them! 
Then they have.</p></blockquote><p>Here the book reproduces two BBC reports, one supporting and one opposing the same thing, but they are too blurry to make out.</p><blockquote><p>‘The public,’ said Sir Humphrey, ‘are ignorant and misguided.’<br>‘What do you mean?’ I demanded. ‘It was the public who elected me.’</p></blockquote><blockquote><p>He was very bitter. And very insulting. ‘Must you always be so concerned with climbing the greasy pole?’<br>I faced the question head on. ‘Humphrey,’ I explained, ‘the greasy pole is important. I have to climb it.’<br>‘Why?’<br>‘Because,’ I said, ‘it’s there.’</p></blockquote><blockquote><p>I asked him how it felt, going from the Commons to the Lords.<br>‘It’s like being moved from the animals to the vegetables,’ he replied.</p></blockquote><h1 id="THE-DEVIL-YOU-KNOW"><a href="#THE-DEVIL-YOU-KNOW" class="headerlink" title="THE DEVIL YOU KNOW"></a>THE DEVIL YOU KNOW</h1><blockquote><p>He sighed. ‘Well, Minister, I’m afraid that this is the penalty we have to pay for trying to pretend that we are Europeans. Believe me, I fully understand your hostility to Europe.’</p></blockquote><blockquote><p>I reminded Humphrey that the typical Common Market official is said to have the organising capacity of the Italians, the flexibility of the Germans and the modesty of the French. He tops all that up with the imagination of the Belgians, the generosity of the Dutch, and the intelligence of the Irish. Finally, for good measure, he has the European spirit of Mr Anthony Wedgwood Benn.</p></blockquote><h1 id="THE-BED-OF-NAILS"><a href="#THE-BED-OF-NAILS" class="headerlink" title="THE BED OF NAILS"></a>THE BED OF NAILS</h1><p>Another expression for 如坐针毡 (sitting on pins and needles): on tenterhooks.</p><h1 id="常见单词"><a href="#常见单词" class="headerlink" title="常见单词"></a>Common Words</h1>]]></content>
    
    
    <summary type="html">&lt;p&gt;The novel of Yes Minister. Including metaphors, grammar issues and funny paragraphs.&lt;/p&gt;</summary>
    
    
    
    
    <category term="文学" scheme="http://www.calvinneo.com/tags/文学/"/>
    
  </entry>
  
</feed>
