An introduction to zero-copy techniques in Linux, split out from my FUSE study notes.

The read and write interfaces

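A read from a regular file involves two copies:

- The data is read from disk into the kernel's page cache via DMA.
  The page cache is itself a kind of kernel buffer, but one dedicated to disk-backed files.
- The data is copied from the kernel's page cache into the user buffer.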

A read from a socket:

- The NIC writes the data directly into a kernel buffer via DMA.
- The data is copied from the kernel buffer into the user buffer.

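Note: before DMA was used (programmed I/O, PIO), data read from the disk landed in a device register. An interrupt then told the CPU to move it into a temporary memory area to accumulate a batch, which was finally written to the page cache. That approach performed too poorly and was abandoned long ago.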

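The read path, step by step:

- The call to read() traps into the kernel: the first context switch (strictly speaking, a switch from user mode to kernel mode).
- The DMA controller copies the data from disk into the kernel buffer: the first DMA copy.
- The CPU copies the data from the kernel buffer into the user buffer: the first CPU copy.
- Once the CPU has finished the copy, read() returns to user mode: the second context switch.

The write path is similar.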
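The read and write steps above can be sketched as a plain copy loop. This is a minimal illustration, not code from the original notes: the function name `copy_read_write` and the 64 KiB buffer size are assumptions chosen for the example.

```c
#include <errno.h>
#include <unistd.h>

/* Copy everything from in_fd to out_fd with read() + write().
 * Every chunk crosses the user/kernel boundary twice: the kernel copies it
 * into buf on read() and copies it out of buf again on write().
 * Returns the number of bytes copied, or -1 on error. */
ssize_t copy_read_write(int in_fd, int out_fd)
{
    char buf[64 * 1024];                      /* the user buffer */
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);   /* kernel -> user copy */
        if (n == 0)
            return total;                           /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (ssize_t off = 0; off < n; ) {          /* handle short writes */
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off)); /* user -> kernel copy */
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

Each loop iteration costs two syscalls and two CPU copies, which is the baseline the techniques below try to improve on.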

mmap

mmap maps kernel-space pages into user space, which avoids the one copy from kernel space to user space.
For more on mmap, see my notes on memory.
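A minimal sketch of the mmap + write combination, assuming the input is a regular file. The helper name `copy_mmap_write` is hypothetical. Note that write() still copies from the mapped pages into the destination's kernel buffer, so only the kernel-to-user copy is saved.

```c
#include <errno.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send a whole regular file to out_fd using mmap() + write().
 * The file's page cache pages are mapped into this process, so no
 * separate user buffer is filled by a read() call.
 * Returns the number of bytes written, or -1 on error. */
ssize_t copy_mmap_write(int in_fd, int out_fd)
{
    struct stat st;
    if (fstat(in_fd, &st) < 0)
        return -1;
    if (st.st_size == 0)
        return 0;                         /* mmap() rejects a zero length */

    size_t len = (size_t)st.st_size;
    char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, in_fd, 0);
    if (p == MAP_FAILED)
        return -1;

    size_t off = 0;
    while (off < len) {
        ssize_t w = write(out_fd, p + off, len - off);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            munmap(p, len);
            return -1;
        }
        off += (size_t)w;
    }
    munmap(p, len);
    return (ssize_t)len;
}
```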

sendfile

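Original sendfile

sendfile reads the data from disk into the kernel's page cache, then copies it from the page cache into the socket buffer.

Its benefit is fewer syscalls: it bundles read + write (or mmap + write) into a single call.
It still needs 2 DMA copies and 1 CPU copy, however.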

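sendfile + DMA optimization

This removes the CPU copy from the page cache to the socket buffer: the DMA engine copies the data straight from the page cache to the NIC. It relies on the NIC supporting scatter-gather DMA. Only descriptors (address and length) are placed in the socket buffer.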
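From user space, both sendfile variants look the same. A minimal sketch follows, with a hypothetical helper name `copy_sendfile`. Whether the kernel also avoids the CPU copy depends on the kernel and the NIC and cannot be observed from this call. On Linux 2.6.33 and later, out_fd may be any file, not only a socket.

```c
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send a whole regular file to out_fd with sendfile().
 * The data never enters a user buffer.
 * Returns the number of bytes sent, or -1 on error. */
ssize_t copy_sendfile(int in_fd, int out_fd)
{
    struct stat st;
    if (fstat(in_fd, &st) < 0)
        return -1;

    off_t offset = 0;   /* advanced by sendfile(); in_fd's own offset is untouched */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;      /* the file shrank while we were sending */
    }
    return (ssize_t)offset;
}
```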

splice

The restriction is that at least one of fd_in and fd_out must be a pipe:

- If fd_in is a pipe, off_in must be NULL.
- If fd_in is not a pipe and off_in is NULL, bytes are read from fd_in starting from the file offset, and the file offset is adjusted appropriately.
- If fd_in is not a pipe and off_in is not NULL, off_in must point to a buffer which specifies the starting offset from which bytes will be read from fd_in; in this case, the file offset of fd_in is not changed, and the offset pointed to by off_in is adjusted appropriately instead.

The same rules apply to fd_out and off_out.
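The offset rules can be checked with a small wrapper. This is a sketch only: `splice_file_at` is a hypothetical name, and the code assumes a 64-bit Linux target where off_t matches the offset type splice() expects.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move up to len bytes from a regular file into the write end of a pipe.
 *
 * off != NULL: read starts at *off, *off is advanced by the number of
 *              bytes moved, and the file's own offset is left alone.
 * off == NULL: read starts at the file's current offset, which is advanced.
 *
 * The pipe side always takes a NULL offset.
 * Returns the number of bytes moved, or -1 on error. */
ssize_t splice_file_at(int file_fd, off_t *off, int pipe_wr, size_t len)
{
    return splice(file_fd, off, pipe_wr, NULL, len, 0);
}
```

For example, with `off_t off = 4` and `len = 3`, a successful call leaves `off == 7` while `lseek(file_fd, 0, SEEK_CUR)` still reports the old position.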

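A quick explanation of pipes in Linux:

- Anonymous pipe
  - Typically created by a parent process and used for communication between related processes.
  - Created with the pipe() system call, which returns a pair of file descriptors: pipefd[0] for reading and pipefd[1] for writing.
  - Exists only in memory: it is not a file on disk and has no path, so ls cannot show it. (It does have an inode on the kernel's internal pipefs; that is the number shown in the pipe:[...] links under /proc/<pid>/fd.)
- Named pipe (also called a FIFO)
  - A pipe with a name: it lives in the filesystem and has a path. Its file type is p, for pipe.
  - Created with the mkfifo command or the mkfifo() function.
  - Allows communication between unrelated processes.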
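Both kinds of pipe report the same file type through fstat(). The sketch below uses two hypothetical helpers, `is_pipe` and `open_new_fifo`. Opening a FIFO with O_RDWR so that open() does not wait for a peer is Linux-specific behavior.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if fd refers to a pipe or FIFO, 0 if it does not, -1 on error. */
int is_pipe(int fd)
{
    struct stat st;
    if (fstat(fd, &st) < 0)
        return -1;
    return S_ISFIFO(st.st_mode) ? 1 : 0;
}

/* Create a named pipe at path and open it.
 * O_RDWR keeps open() from blocking until a peer shows up (Linux-specific).
 * Returns the new fd, or -1 on error. */
int open_new_fifo(const char *path)
{
    if (mkfifo(path, 0600) < 0)
        return -1;
    return open(path, O_RDWR);
}
```

`is_pipe` returns 1 for both ends of a pipe() pair and for an fd from `open_new_fifo`, and 0 for a regular file.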

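Every anonymous pipe supports splice, and anonymous pipes are the usual way to get zero copy. The pipe acts as a relay connecting two fds that do not support zero copy between each other directly. I find this an interesting design: with the anonymous pipe as the intermediary, each kind of fd only has to support splicing to and from a pipe, which is much simpler than supporting zero copy between every pair of fd types.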

splice(file_fd, NULL, pipefd[1], NULL, len, 0);
splice(pipefd[0], NULL, socket_fd, NULL, len, 0);

Some named pipes also support splice, but there it may amount to an ordinary read/write relay rather than a zero-copy one.
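A fuller version of the two-call relay, as a sketch. The name `splice_copy` and the 64 KiB chunk size (the default pipe capacity on Linux) are assumptions for the example. A single splice() may move fewer bytes than requested, so both halves are looped.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (64 * 1024)   /* assumed to fit the default pipe capacity */

/* Copy in_fd to out_fd through an anonymous pipe using splice().
 * Neither in_fd nor out_fd needs to be a pipe; the pipe is the relay.
 * Returns the number of bytes moved, or -1 on error. */
ssize_t splice_copy(int in_fd, int out_fd)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    ssize_t total = 0;
    int failed = 0;
    while (!failed) {
        /* Stage 1: source -> pipe. */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, CHUNK, 0);
        if (n == 0)
            break;                       /* end of input */
        if (n < 0) {
            failed = 1;
            break;
        }
        /* Stage 2: pipe -> destination, draining everything staged. */
        while (n > 0) {
            ssize_t w = splice(p[0], NULL, out_fd, NULL, (size_t)n, 0);
            if (w <= 0) {
                failed = 1;
                break;
            }
            n -= w;
            total += w;
        }
    }
    close(p[0]);
    close(p[1]);
    return failed ? -1 : total;
}
```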


ssize_t splice(int fd_in, off_t *_Nullable off_in,
              int fd_out, off_t *_Nullable off_out,
              size_t size, unsigned int flags);

The flags are:

- SPLICE_F_MOVE
  Attempt to move pages instead of copying. "Move" here means that references to the physical pages are transferred between fds inside the kernel, with no need for the read, copy to user space, write sequence.
  Note that this flag is only a hint: pages may still be copied if the kernel cannot move them from the pipe, or if the pipe buffers do not refer to full pages.
  The initial implementation of this flag was buggy: therefore starting in Linux 2.6.21 it is a no-op (but is still permitted in a splice() call); in the future, a correct implementation may be restored.
- SPLICE_F_NONBLOCK
  Do not block on I/O. This makes the splice pipe operations nonblocking, but splice() may nevertheless block because the file descriptors that are spliced to/from may block (unless they have the O_NONBLOCK flag set).
- SPLICE_F_MORE
  More data will be coming in a subsequent splice. This is a helpful hint when the fd_out refers to a socket (see also the description of MSG_MORE in send(2), and the description of TCP_CORK in tcp(7)).
- SPLICE_F_GIFT
  Unused for splice(); see vmsplice(2).
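As an illustration of SPLICE_F_NONBLOCK, the sketch below drains a pipe without waiting when it is empty. The name `drain_pipe_nonblock` is hypothetical. splice(2) documents EAGAIN for the case where the operation would block and SPLICE_F_NONBLOCK is set.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Move up to len bytes from the read end of a pipe to out_fd without
 * blocking on an empty pipe.
 * Returns the number of bytes moved, 0 if nothing was available,
 * or -1 on any other error. */
ssize_t drain_pipe_nonblock(int pipe_rd, int out_fd, size_t len)
{
    ssize_t n = splice(pipe_rd, NULL, out_fd, NULL, len, SPLICE_F_NONBLOCK);
    if (n < 0 && errno == EAGAIN)
        return 0;        /* pipe is empty right now; try again later */
    return n;
}
```

The flag only covers the pipe side. If out_fd itself can block, for example a socket without O_NONBLOCK, the call may still wait.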

vmsplice

splice mainly serves data transfer within kernel space, while vmsplice serves reads and writes between user-space memory and a pipe. Both can achieve zero copy.

#define _GNU_SOURCE         /* See feature_test_macros(7) */
#include <fcntl.h>

ssize_t vmsplice(int fd, const struct iovec *iov,
                size_t nr_segs, unsigned int flags);

iov is an array of nr_segs entries describing several, possibly non-contiguous, buffers in user memory.
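A minimal sketch of passing several buffers in one call. `vmsplice_two` is a hypothetical helper. Without SPLICE_F_GIFT the pipe may reference the caller's pages, so the buffers should stay unmodified until the reader has consumed the data.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>

/* Gather two separate user buffers into the write end of a pipe with a
 * single vmsplice() call.
 * Returns the number of bytes transferred to the pipe, or -1 on error. */
ssize_t vmsplice_two(int pipe_wr, const void *a, size_t a_len,
                     const void *b, size_t b_len)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)a, .iov_len = a_len },
        { .iov_base = (void *)b, .iov_len = b_len },
    };
    return vmsplice(pipe_wr, iov, 2, 0);
}
```

The reader sees the two segments back to back, in array order.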

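Flags:

- SPLICE_F_MOVE
  Unused for vmsplice(); see splice(2).
- SPLICE_F_NONBLOCK
  Do not block on I/O; see splice(2) for further details.
- SPLICE_F_MORE
  Currently has no effect for vmsplice(), but may be implemented in the future; see splice(2).
- SPLICE_F_GIFT
  The user pages are a gift to the kernel.
  The application must never modify this memory afterwards; otherwise the page cache and the on-disk data may differ.
  Gifting pages to the kernel means that a subsequent splice SPLICE_F_MOVE can successfully move the pages. Without this flag, a subsequent splice SPLICE_F_MOVE must copy them.
  The data must be page aligned. My understanding is that this means:
  - iovec[i].iov_base must be aligned to a page boundary
  - iov_len must be a multiple of the page size
  If these conditions are not met, the call falls back to copying.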
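A sketch of gifting one page-aligned page, following the alignment reading above. `gift_page` is a hypothetical helper. The page comes from an anonymous mmap(), which is always page aligned, and it is never written again after the call.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/* Allocate one page, fill it with the byte `fill`, and gift it to the
 * write end of a pipe.
 * Returns the number of bytes transferred, or -1 on error. */
ssize_t gift_page(int pipe_wr, unsigned char fill)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* Anonymous mappings start on a page boundary, and the length is
     * exactly one page, so both alignment conditions hold. */
    void *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    memset(buf, fill, page);

    struct iovec iov = { .iov_base = buf, .iov_len = page };

    /* From here on the page is the kernel's: buf must not be modified.
     * The mapping is deliberately not reused by this sketch. */
    return vmsplice(pipe_wr, &iov, 1, SPLICE_F_GIFT);
}
```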

Zero-copy techniques