A quick study of io_uring, based on https://kernel.dk/io_uring.pdf.

1.0 Introduction

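Linux's read/write APIs evolved as follows (using the read side as the example):

read
pread: adds an offset argument

preadv: takes an array of iovec buffers instead of a single buffer, i.e. it supports scatter reads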

struct iovec
{
    void __user *iov_base;
    __kernel_size_t iov_len;
};

preadv2: adds a flags argument
See https://www.man7.org/linux/man-pages/man2/preadv2.2.html for details.
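To make the scatter-read idea concrete, here is a minimal sketch of calling preadv with two buffers. The helper names scatter_read and scatter_read_demo, and the temp file path, are made up for this illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Scatter-read from file offset `off`: the first len_a bytes land in buf_a,
 * the following len_b bytes in buf_b. Returns bytes read, or -1 on error. */
ssize_t scatter_read(int fd, off_t off, void *buf_a, size_t len_a,
                     void *buf_b, size_t len_b)
{
    assert(buf_a != NULL && buf_b != NULL);
    struct iovec iov[2] = {
        { .iov_base = buf_a, .iov_len = len_a },
        { .iov_base = buf_b, .iov_len = len_b },
    };
    return preadv(fd, iov, 2, off);
}

/* Hypothetical demo: write 10 bytes to a temp file, then read them back
 * into two separate 5-byte buffers. Returns 0 on success, -1 otherwise. */
int scatter_read_demo(void)
{
    char path[] = "/tmp/preadv_demoXXXXXX";
    char a[5], b[5];
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file disappears once fd is closed */
    int ok = write(fd, "helloworld", 10) == 10 &&
             scatter_read(fd, 0, a, sizeof a, b, sizeof b) == 10 &&
             memcmp(a, "hello", 5) == 0 && memcmp(b, "world", 5) == 0;
    close(fd);
    return ok ? 0 : -1;
}
```

A single call fills both buffers in order, which is what "scatter" means here. The offset is still one plain off_t.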

All of the APIs above are synchronous. POSIX does define the aio_ family of APIs, but few people use them and their performance is poor.

Linux has its own native aio interface, used through libaio, which is not the same thing as the POSIX aio_ family. It has problems too:

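It requires O_DIRECT; otherwise it behaves much like a synchronous call. O_DIRECT bypasses the page cache and imposes strict alignment requirements, which limits where it can be used.
Even when every condition for async is met, the submission is not guaranteed to be async. For example:
- If metadata has to be updated, the submission may block
- The storage device has a fixed number of request slots
  Here, request slots means the number of requests the storage device can process concurrently.
  Traditional storage protocols such as SATA and SAS have a single command queue holding outstanding IO, and its length is the io depth. If the underlying storage device has fewer request slots than the io depth, IO requests may have to wait in the queue.
  NVMe SSDs support multiple Submission Queues (SQ) and Completion Queues (CQ), and each SQ entry can correspond to one in-flight I/O command. For example, with 64 queues of depth 1024 each, in theory up to 64 × 1024 = 65536 commands can be outstanding at once.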

Submitting one IO requires copying 64 + 8 bytes. Completing one IO requires copying 32 bytes.

struct iocb {
   __u64   aio_data;
   __u32   PADDED(aio_key, aio_rw_flags);
   __u16   aio_lio_opcode;
   __s16   aio_reqprio;
   __u32   aio_fildes;
   __u64   aio_buf;
   __u64   aio_nbytes;
   __s64   aio_offset;
   __u64   aio_reserved2;
   __u32   aio_flags;
   __u32   aio_resfd;
};

struct io_event {
    __u64   data;
    __u64   obj;
    __s64   res;
    __s64   res2;
};

Depending on your IO size, this can definitely be noticeable.
IO always requires at least two system calls (submit + wait-for-completion), which in these post spectre/meltdown days is a serious slowdown.
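The copy sizes quoted above can be checked against the two structs. The sketch below uses local mirror structs (iocb_mirror and io_event_mirror are names invented here) so that it does not depend on any kernel header. It also assumes, as my own reading rather than something the paper spells out, that the extra 8 bytes are the pointer to the iocb that io_submit receives on a 64-bit system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the libaio ABI structs above, using fixed-width types.
 * aio_key and aio_rw_flags are written out as the two 32-bit halves that
 * the PADDED() macro produces (their order depends on endianness). */
struct iocb_mirror {
    uint64_t aio_data;
    uint32_t aio_key;
    int32_t  aio_rw_flags;
    uint16_t aio_lio_opcode;
    int16_t  aio_reqprio;
    uint32_t aio_fildes;
    uint64_t aio_buf;
    uint64_t aio_nbytes;
    int64_t  aio_offset;
    uint64_t aio_reserved2;
    uint32_t aio_flags;
    uint32_t aio_resfd;
};

struct io_event_mirror {
    uint64_t data;
    uint64_t obj;
    int64_t  res;
    int64_t  res2;
};

static_assert(sizeof(struct iocb_mirror) == 64, "iocb is 64 bytes");
static_assert(sizeof(struct io_event_mirror) == 32, "io_event is 32 bytes");

/* Bytes copied to submit one request: the iocb itself plus the pointer to
 * it (assumed to be the "+ 8" on a 64-bit system). */
size_t aio_submit_copy_bytes(void)
{
    return sizeof(struct iocb_mirror) + sizeof(struct iocb_mirror *);
}

/* Bytes copied back to user space for one completion. */
size_t aio_complete_copy_bytes(void)
{
    return sizeof(struct io_event_mirror);
}
```

For a 4 KiB read, that is roughly 100 bytes of bookkeeping copied per IO, on top of the two system calls.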

2.0 Improving the status quo

Initially there was some work on improving libaio:

If you can extend and improve an existing interface, that’s preferable to providing a new one.
It’s a lot less work in general.

libaio has three main interfaces:

io_setup: creates an aio context
io_submit: submits IO
io_getevents: waits for completions and reaps the results

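Later it became clear that this kind of improvement would make the interface very complicated, and that it would solve only one of the problems listed above.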

3.0 New interface design goals

Easy to use, hard to misuse.
Extendable. The interface should not be limited to block oriented IO; it should work equally well for networking and for non-block storage devices.
Feature rich. Linux aio caters to a subset (of a subset) of applications. I did not want to create yet another interface that only covered some of what applications need, or that required applications to reinvent the same functionality over and over again (like IO thread pools).
Efficiency. While storage IO is mostly still block based and hence at least 512b or 4kb in size, efficiency at those sizes is still critical for certain applications. Additionally, some requests may not even be carrying a data payload. It was important that the new interface was efficient in terms of per-request overhead.
Scalability. While efficiency and low latencies are important, it’s also critical to provide the best performance possible at the peak end. For storage in particular, we’ve worked very hard to deliver a scalable infrastructure. A new interface should allow us to expose that scalability all the way back to applications.

4.0 Enter io_uring

First, a remark from the author: performance has to be considered from the very beginning, when the interface is being designed.

Despite the ranked list of design goals, the initial design was centered around efficiency. Efficiency isn’t something that can be an afterthought, it has to be designed in from the start - you can’t wring it out of something later on once the interface is fixed.

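The author argues that the new design must avoid copying submission and completion events between the kernel and user space, and must also avoid indirection. From there he derives the following points step by step:

The kernel and user space need to share these structures
Therefore, the structures should live in memory shared between the kernel and the application
Therefore, access to them has to be synchronized
Sharing a lock with the kernel would require system calls, and system calls carry a large overhead
Therefore, a single producer single consumer ring buffer is a good fit

For submission events, the application is the producer and the kernel is the consumer; for completion events it is the other way around. So two queues are needed: the SQ and the CQ.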
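The reasoning above ends at a lock-free single producer single consumer ring. As a rough sketch of that idea only (this is not io_uring's actual layout; spsc_ring, ring_push and ring_pop are hypothetical names), a ring of ints built on C11 atomics could look like this:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)
static_assert((RING_ENTRIES & RING_MASK) == 0, "size must be a power of two");

/* Hypothetical SPSC ring: only the producer writes tail and only the
 * consumer writes head, so no lock is needed. */
struct spsc_ring {
    _Atomic unsigned head;              /* advanced by the consumer */
    _Atomic unsigned tail;              /* advanced by the producer */
    int slots[RING_ENTRIES];
};

/* Producer side. Returns false when the ring is full. */
bool ring_push(struct spsc_ring *r, int value)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;
    r->slots[tail & RING_MASK] = value;
    /* release: the slot write becomes visible before the new tail */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
bool ring_pop(struct spsc_ring *r, int *value)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *value = r->slots[head & RING_MASK];
    /* release: the slot read is finished before the slot is handed back */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, neither side ever needs a system call to synchronize with the other. In io_uring, one such ring carries submissions and a second one carries completions.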

4.1 DATA STRUCTURES

cqe stands for Completion Queue Event.

struct io_uring_cqe {
   // passed through unchanged from the submission
   __u64 user_data;
   __s32 res;
   __u32 flags;
};

The sqe is considerably more complex:

struct io_uring_sqe {
   // operation type, e.g. IORING_OP_READV for a vectored read
   __u8 opcode;
   __u8 flags;
   __u16 ioprio;
   __s32 fd;
   __u64 off;
   // address of the buffer; for vectored reads/writes, the address of an iovec array
   __u64 addr;
   // length of the buffer, or the number of entries in the iovec array
   __u32 len;
   union {
      __kernel_rwf_t rw_flags;
      __u32 fsync_flags;
      __u16 poll_events;
      __u32 sync_range_flags;
      __u32 msg_flags;
   };
   __u64 user_data;
   union {
      __u16 buf_index;
      // pads the struct out to 64 bytes
      __u64 __pad2[3];
   };
};
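As an illustration of how these fields are used, the sketch below fills in an sqe for a vectored read. It assumes the kernel UAPI header linux/io_uring.h is installed; prep_readv is a helper name invented for this example, not a liburing function.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Prepare one vectored read. `iov` must stay valid until the request
 * completes; `tag` is a caller-chosen value echoed back in cqe->user_data. */
void prep_readv(struct io_uring_sqe *sqe, int fd, const struct iovec *iov,
                unsigned nr_iov, uint64_t offset, uint64_t tag)
{
    assert(sqe != NULL && iov != NULL);
    memset(sqe, 0, sizeof(*sqe));      /* clear every field we do not set */
    sqe->opcode = IORING_OP_READV;
    sqe->fd = fd;
    sqe->off = offset;                 /* file offset to read from */
    sqe->addr = (uint64_t)(uintptr_t)iov;
    sqe->len = nr_iov;                 /* number of iovec entries */
    sqe->user_data = tag;
}
```

Only the sqe is filled in here. Nothing is submitted until its index is placed in the SQ ring and the tail is advanced.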

4.2 COMMUNICATION CHANNEL

The SQ and the CQ are indexed slightly differently, so let's start with the simpler one, the CQ.

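The CQ is a ring buffer shared between the kernel and the application: the kernel advances tail when it writes, and the application advances head when it reads. The ring size is a power of two; I discussed the benefit of this in my post analyzing how Redis implements its underlying objects.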

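As shown below, head is allowed to overflow naturally. As in my libutp source walkthrough and my ATP implementation, when tail is smaller than head we can also conclude that an overflow has occurred.
cqring->cqes is the shared structure.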

unsigned head;
head = cqring->head;
read_barrier();
if (head != cqring->tail) {
   struct io_uring_cqe *cqe;
   unsigned index;
   index = head & (cqring->mask);
   cqe = &cqring->cqes[index];
   /* process completed cqe here */
   ...
   /* we've now consumed this entry */
   head++;
}
cqring->head = head;
write_barrier();
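Here is a small sketch of why letting the counters overflow is safe, as long as the ring size is a power of two. ring_pending and ring_index are hypothetical helper names.

```c
#include <assert.h>

/* Number of entries waiting between head and tail. Unsigned subtraction is
 * modular, so the result stays correct after tail wraps around to zero. */
unsigned ring_pending(unsigned head, unsigned tail)
{
    return tail - head;
}

/* Slot index for a free-running counter; `entries` must be a power of two. */
unsigned ring_index(unsigned counter, unsigned entries)
{
    assert(entries != 0 && (entries & (entries - 1)) == 0);
    return counter & (entries - 1);
}
```

With head two steps below the maximum unsigned value and tail already wrapped to 2, ring_pending still reports 4 entries. The mask keeps every index inside the ring without a modulo operation.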

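On the SQ side, the application produces and the kernel consumes. As mentioned, SQ indexing is different: there is an indirection. The submission ring buffer holds indices that point at positions in the sqe array. In the example below, the submission order is sqe5 → sqe2 → sqe3.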

SQ array: [5, 2, 3]
SQEs: [sqe0, sqe1, sqe2, sqe3, sqe4, sqe5]
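The indirection in this example can be sketched as a walk over the index array. struct sq_view and submission_order are simplified, hypothetical names that only model the two arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of the SQ-side index ring. */
struct sq_view {
    const unsigned *array;   /* ring of indices into the sqe pool */
    unsigned mask;           /* ring entries - 1 */
    unsigned head, tail;     /* free-running counters */
};

/* Write the sqe pool slots into `out` in the order they were submitted.
 * Returns how many slots were written (at most `cap`). */
size_t submission_order(const struct sq_view *sq, unsigned *out, size_t cap)
{
    size_t n = 0;
    assert(sq != NULL && out != NULL);
    for (unsigned pos = sq->head; pos != sq->tail && n < cap; pos++)
        out[n++] = sq->array[pos & sq->mask];
    return n;
}
```

With array = [5, 2, 3], head = 0 and tail = 3, the consumer visits pool slots 5, 2 and 3 in that order, no matter where those sqes sit in the pool.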

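In the paper, the author says one benefit of this is that applications can embed request units inside their own internal structures; I take this to mean something like the custom app_sq_ring shown later. It also keeps the ability to submit multiple sqes in one operation. My understanding is that this corresponds to the code below: fill in the sqe first, then write its index into the array.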

struct io_uring_sqe *sqe;
unsigned tail, index;
tail = sqring->tail;
index = tail & (*sqring->ring_mask);
sqe = &sqring->sqes[index];
/* this call fills in the sqe entries for this IO */
init_io(sqe);
/* fill the sqe index into the SQ ring array */
sqring->array[index] = index;
tail++;
write_barrier();
sqring->tail = tail;
write_barrier();

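As soon as the kernel has consumed an sqe, the application can reuse that sqe entry, even if the kernel has not finished processing the request; the kernel makes a copy of the structure when it needs one.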

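As a result, an sqe's lifetime is short, and the application may issue more submissions than the SQ ring can hold at once, which could overflow the CQ ring. This is why the CQ ring defaults to twice the size of the SQ ring.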

Completion events can arrive in any order, unrelated to submission order. The SQ and CQ rings run independently of each other, but every submission event corresponds to exactly one completion event.

5.0 io_uring interface

This section covers the "raw" io_uring interface, i.e. the system calls.

io_uring_setup

entries is the number of sqes. It must be a power of 2 in the range 1..=4096.

int io_uring_setup(unsigned entries, struct io_uring_params *params);

params is defined as follows:

struct io_uring_params {
   // filled in by the kernel: the number of sqes supported
   __u32 sq_entries;
   // filled in by the kernel: the number of cqes supported
   __u32 cq_entries;
   __u32 flags;
   __u32 sq_thread_cpu;
   __u32 sq_thread_idle;
   __u32 resv[5];
   struct io_sqring_offsets sq_off;
   struct io_cqring_offsets cq_off;
};

struct io_sqring_offsets {
   __u32 head; /* offset of ring head */
   __u32 tail; /* offset of ring tail */
   __u32 ring_mask; /* ring mask value */
   __u32 ring_entries; /* entries in ring */
   __u32 flags; /* ring flags */
   __u32 dropped; /* number of sqes not submitted */
   __u32 array; /* sqe index array */
   __u32 resv1;
   __u64 resv2;
};

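The int returned by io_uring_setup is actually a file descriptor that refers to the new io_uring instance. As mentioned earlier, the rings live in memory shared between the kernel and the application; the application gains access to that memory by calling mmap on this fd. sq_off and cq_off give the offsets of the SQ and CQ ring fields inside those mappings.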
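Since this is a raw system call, an application that does not use liburing has to invoke it through syscall(). The following is a minimal sketch, assuming the installed kernel headers define __NR_io_uring_setup; app_io_uring_setup is a name invented here.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw syscall. Returns the ring fd, or -1 with
 * errno set (for example when the kernel or a sandbox does not allow
 * io_uring). */
int app_io_uring_setup(unsigned entries, struct io_uring_params *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));          /* no flags: default ring setup */
    return (int)syscall(__NR_io_uring_setup, entries, p);
}
```

On success the kernel has filled in p->sq_entries, p->cq_entries and the two offset structs, which are then used to mmap the rings.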

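#define IORING_OFF_SQ_RING          0ULL // mmap offset of the SQ ring, which includes the sqe index array
#define IORING_OFF_CQ_RING  0x8000000ULL // 128 MiB: mmap offset of the CQ ring
#define IORING_OFF_SQES    0x10000000ULL // 256 MiB: mmap offset of the sqe array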

The application can define its own structure for the SQ ring, in which every field is a pointer to a location in the shared memory:

struct app_sq_ring {
   unsigned *head;
   unsigned *tail;
   unsigned *ring_mask;
   unsigned *ring_entries;
   unsigned *flags;
   unsigned *dropped;
   unsigned *array;
};

The setup code below shows how this custom ring structure is assembled from ptr and sq_off.

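struct app_sq_ring app_setup_sq_ring(int ring_fd, struct io_uring_params *p)
{
   struct app_sq_ring sqring;
   char *ptr; /* char * so that byte offsets can be added portably */
   /* map the SQ ring: its header fields plus the trailing index array */
   ptr = mmap(NULL, p->sq_off.array + p->sq_entries * sizeof(__u32),
              PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
              ring_fd, IORING_OFF_SQ_RING);
   /* NOTE: real code must check ptr against MAP_FAILED before using it */
   sqring.head = (unsigned *)(ptr + p->sq_off.head);
   sqring.tail = (unsigned *)(ptr + p->sq_off.tail);
   sqring.ring_mask = (unsigned *)(ptr + p->sq_off.ring_mask);
   sqring.ring_entries = (unsigned *)(ptr + p->sq_off.ring_entries);
   sqring.flags = (unsigned *)(ptr + p->sq_off.flags);
   sqring.dropped = (unsigned *)(ptr + p->sq_off.dropped);
   sqring.array = (unsigned *)(ptr + p->sq_off.array);
   return sqring;
}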
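The SQ ring is only one of three regions to map. As a sketch, the mmap lengths for all three can be computed from the params. This assumes the default 64-byte sqe and 16-byte cqe layouts and the linux/io_uring.h header; the three function names are hypothetical.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stddef.h>

/* Length of the SQ ring mapping (mmap offset IORING_OFF_SQ_RING). */
size_t sq_ring_map_len(const struct io_uring_params *p)
{
    assert(p != NULL);
    return p->sq_off.array + p->sq_entries * sizeof(__u32);
}

/* Length of the CQ ring mapping (mmap offset IORING_OFF_CQ_RING). */
size_t cq_ring_map_len(const struct io_uring_params *p)
{
    assert(p != NULL);
    return p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);
}

/* Length of the sqe array mapping (mmap offset IORING_OFF_SQES). */
size_t sqes_map_len(const struct io_uring_params *p)
{
    assert(p != NULL);
    return p->sq_entries * sizeof(struct io_uring_sqe);
}
```

In each ring mapping the array comes last, so the length is the array's offset plus the array's size.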

io_uring_enter

int io_uring_enter(
   unsigned int fd, // the fd returned by io_uring_setup
   unsigned int to_submit, // tells the kernel that there are up to that amount of sqes ready to be consumed and submitted
   unsigned int min_complete, // asks the kernel to wait for completion of that amount of requests.
   unsigned int flags,
   sigset_t *sig); // a pointer in the actual syscall; the paper's listing omits the *

Note that this syscall can both submit and wait for completions in a single call, which addresses one of the author's earlier criticisms of aio.

One of the flags is IORING_ENTER_GETEVENTS. When it is set, the kernel will actively wait for min_complete events to be available. Put simply, if you want to wait for completions, you must set this flag.

#define IORING_ENTER_GETEVENTS (1U << 0)
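A submit-and-wait call through the raw syscall might look roughly like the sketch below. It assumes the kernel headers define __NR_io_uring_enter and passes no signal mask; app_submit_and_wait is an invented name.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <linux/io_uring.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

static_assert(IORING_ENTER_GETEVENTS == (1U << 0), "flag value as listed");

/* Submit `to_submit` sqes and, if wait_nr > 0, also block until at least
 * that many completions are available. Returns the number of sqes
 * consumed, or -1 with errno set. */
int app_submit_and_wait(int ring_fd, unsigned to_submit, unsigned wait_nr)
{
    unsigned flags = wait_nr ? IORING_ENTER_GETEVENTS : 0;
    /* the last two arguments are the signal mask and its size: unused */
    return (int)syscall(__NR_io_uring_enter, ring_fd, to_submit, wait_nr,
                        flags, NULL, (size_t)0);
}
```

Calling it with to_submit = 0 and wait_nr > 0 only waits, and with wait_nr = 0 it only submits. A single entry point covers both halves of the old submit/getevents pair.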

5.1 SQE ORDERING

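This section mainly covers how to order a sync operation such as fsync/fdatasync after a series of earlier writes.

Because the SQ and the CQ are completely independent, as noted earlier, this needs an extra mechanism. And because writes can complete out of order, what we care about is knowing that all of the writes have completed.

io_uring's mechanism is to drain the submission side queue: a request marked this way is not started until all previously submitted requests have completed. The application can therefore queue the sync operation right after the writes and know that it will not start early.

This is enabled with the IOSQE_IO_DRAIN flag, which stalls the entire SQ. An application may therefore consider using more than one io_uring context, for example a separate one for integrity writes, so that unrelated writes can still proceed in parallel.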

io_uring supports draining the submission side queue until all previous completions have finished. This allows the application queue the above mentioned sync operation and know that it will not start before all previous commands have completed.
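As a sketch of how the drain flag is applied, the helper below prepares an fsync sqe that acts as a barrier behind earlier writes. It assumes the linux/io_uring.h header; prep_fsync_drain is a hypothetical name.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <string.h>

/* Prepare a barrier-style fsync: because of IOSQE_IO_DRAIN it is not
 * started until every previously submitted request has completed. */
void prep_fsync_drain(struct io_uring_sqe *sqe, int fd, __u64 tag)
{
    assert(sqe != NULL);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_FSYNC;
    sqe->fd = fd;
    sqe->flags = IOSQE_IO_DRAIN;
    sqe->user_data = tag;
}
```

The application would queue its write sqes first and this sqe last. When the fsync's cqe arrives, all of the earlier writes are known to have completed.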

5.2 LINKED SQES

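Consecutive IO requests marked with IOSQE_IO_LINK are chained together and are guaranteed to execute in order: the next request in the chain is not started until the previous one has completed. A chain starts at the first sqe with IOSQE_IO_LINK set and ends at the first following sqe without it. How a chain is ordered relative to requests outside it is unspecified.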
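To make the chain boundary rule concrete, here is a small sketch that finds where a chain ends, given the flags of the sqes in submission order. It relies on the paper's definition that the first sqe without IOSQE_IO_LINK is still the last member of the chain; chain_end is a hypothetical helper.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stddef.h>

/* Given the flags of n sqes in submission order, return the index of the
 * last sqe in the chain that starts at `start`. An sqe without
 * IOSQE_IO_LINK is a chain of length one. */
size_t chain_end(const __u8 *flags, size_t n, size_t start)
{
    size_t i = start;
    assert(flags != NULL && start < n);
    while (i + 1 < n && (flags[i] & IOSQE_IO_LINK))
        i++;
    return i;
}
```

For flags [LINK, LINK, 0, 0, LINK, 0] there are two chains, covering sqes 0 to 2 and sqes 4 to 5, and sqe 3 stands alone.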

5.3 TIMEOUT COMMANDS

6.0 Memory ordering

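From my post comparing the important concepts in concurrent programming, we know that memory ordering is mainly about read-write and write-write ordering, as shown below: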

read_barrier(): Ensure previous writes are visible before doing subsequent memory reads.
write_barrier(): Order this write after previous writes.

We also know that different CPU architectures reorder memory operations in different ways, so only the concepts are discussed here.

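Consider the application side writing an sqe and then notifying the kernel that it can be consumed. This involves two steps:

Fill in the sqe fields and write the sqe index into the SQ ring array
Update the tail of the SQ ring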

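This can be reduced to the pseudocode below, where each line is one memory operation. Without proper memory ordering, the CPU is free to reorder these operations. In other words, there is no guarantee that write 7 becomes visible last.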

1: sqe->opcode = IORING_OP_READV;
2: sqe->fd = fd;
3: sqe->off = 0;
4: sqe->addr = &iovec;
5: sqe->len = 1;
6: sqe->user_data = some_value;
7: sqring->tail = sqring->tail + 1;

So write barriers need to be added, as follows:

1: sqe->opcode = IORING_OP_READV;
2: sqe->fd = fd;
3: sqe->off = 0;
4: sqe->addr = &iovec;
5: sqe->len = 1;
6: sqe->user_data = some_value;
 write_barrier(); /* ensure previous writes are seen before tail write */
7: sqring->tail = sqring->tail + 1;
 write_barrier(); /* ensure tail write is seen */
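One possible way to express these barriers in portable C is with C11 fences. The mapping below is an assumption made for illustration, not how liburing or the kernel defines read_barrier and write_barrier; demo_sq, publish and consume are invented names, and a single int stands in for the sqe fields.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumed mapping of the paper's barrier names onto C11 fences. */
#define read_barrier()  atomic_thread_fence(memory_order_acquire)
#define write_barrier() atomic_thread_fence(memory_order_release)

struct demo_sq {
    int slot;                    /* stands in for the sqe fields */
    _Atomic unsigned tail;       /* shared with the consumer */
};

/* Producer: fill the entry, then publish it by bumping tail. */
void publish(struct demo_sq *sq, int value)
{
    assert(sq != NULL);
    unsigned tail = atomic_load_explicit(&sq->tail, memory_order_relaxed);
    sq->slot = value;            /* corresponds to writes 1..6 */
    write_barrier();             /* entry is visible before the tail update */
    atomic_store_explicit(&sq->tail, tail + 1, memory_order_relaxed);
}

/* Consumer: returns 1 and stores the entry if tail moved past `seen`. */
int consume(struct demo_sq *sq, unsigned seen, int *value)
{
    assert(sq != NULL && value != NULL);
    if (atomic_load_explicit(&sq->tail, memory_order_relaxed) == seen)
        return 0;
    read_barrier();              /* read the entry only after seeing tail */
    *value = sq->slot;
    return 1;
}
```

The release fence before the tail store pairs with the acquire fence after the tail load, which is what guarantees that a consumer observing the new tail also observes the filled-in entry.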
