【输入 / 输出优化】浅谈 C++ IO优化（快读快写）

_Kagamine_Rin_ · 2025-07-22 00:44:40 · 科技·工程

IO 优化是很实用 & 简单的常数优化，应该也是大部分 OIer 学习或接触卡常的第一种卡常数方式。当读入数据超过 4MB 时，读入优化可以节省好几倍时间。

一、读入部分：

该部分的评测结果均以 P10815 【模板】快速读入为例。

1. `cin` / `cout` 的优化（关闭同步和解除绑定）

`ios::sync_with_stdio(0)`

这个函数是一个「是否兼容 stdio」的开关，C++ 为了兼容 C，保证程序在使用了 printf 和 cout 的时候不发生混乱，将输出流绑到了一起。同步的输出流是线程安全的。

这其实是 C++ 为了兼容而采取的保守措施，也是使 cin/cout 速度较慢的主要原因。但我们可以在使用 cin/cout 之前将 stdio 解除绑定，能够明显地提升 cin/cout 的 IO 速度，但是在这样做之后要注意不能同时使用 cin 和 scanf，也不能同时使用 cout 和 printf，但是可以同时使用 cin 和 printf，也可以同时使用 scanf 和 cout（即不同的输入或输出方式不能混合使用）。

`cin.tie(0),cout.tie(0)`

tie 是将两个 stream（istream，ostream）绑定的参数，如果括号里填空参数（nullptr），则返回当前的输出流指针。

在默认的情况下 cin.tie() 绑定的是 &std::cout，每次进行格式化输入的时候都要调用 cout.flush() 清空输出缓冲区，这样会减慢 IO 速度。可以通过 cin.tie(nullptr) 来解除绑定，进一步加快执行效率。

需要注意的是，如果你需要在每次输入之前输出结果（如交互题），在进行 cin 和 cout 解绑之后，每次都需要调用 cout.flush() 或 cout << flush。如果输出时刚好需要换行，也可以调用 cout << endl。

下面的程序是一个例子：

cout << "Please input an integer:";
cout << "\n", cout.flush(); // 刷新缓冲区，也可以换成 cout << "\n" << flush; 或 cout << endl;
cin >> x;

使用示例：

#include <iostream>
using namespace std;
int main() {
    ios::sync_with_stdio(0), cin.tie(0), cout.tie(0);
// Your code begins here.
    int n, x, ans = 0;
    cin >> n;
    while (n --) cin >> x, ans += x;
    cout << ans;
// End.
    return 0;
}

测试结果：读入 10^7 个整数需要 643 ms 左右，无法在 2500 ms 内读入 10^8 个整数。

2. `C` 风格输入的优化

`getchar()`

getchar 是用来读入 1 byte 的数据并将其转换为 char 类型的函数（但实际上的返回值是 int），且速度很快，故可以用「读入字符——转换为整型」来代替缓慢的读入。

我们需要考虑输入的组成部分：整数由两部分组成——符号（一般是负号 -）和数字，整数之间有换行和空格。

10 进制整数中是不含空格或除 0~9 和（正）负号外的其他字符的，因此在读入不应存在于整数中的字符（通常为空格 / 换行）时，就可以判定已经读入结束。

C 和 C++ 语言分别在 ctype.h （C/C++）和 cctype（C++）头文件中，提供了函数 isdigit, 这个函数会检查传入的参数是否为十进制数字字符，是则返回一个非零整数，否则返回零。所以我们判定一个字符是否是十进制数字字符时可以使用 isdigit(c) 代替 c >= '0' && c <= '9'，也可以使用 !isdigit(c) 代替 c < '0' || c > '9'（而且 isdigit 的速度比 c >= '0' && c <= '9' 更快）。

下面是使用示例：

#include <cstdio>
#include <cctype>
int readint() {
    int k = 0, f = 1, c = getchar(); // 实际上变量 c 没有必要转成 char
    for (; !isdigit(c); c = getchar()) if (c == '-') f = -1; // 由于 isdigit 的接收值也是 int，所以确实没有必要转成 char
    for (; isdigit(c); c = getchar()) k = k * 10 + (c ^ 48); // 也可以写 (c & 15)
    return k * f;
}
int main() {
    int n = readint(), ans = 0;
    while (n --) ans += readint();
    printf("%d", ans);
}

评测结果：读入 10^7 个整数需要 290 ms 左右，但仍然无法在 2500 ms 内读入 10^8 个整数。

`getchar_unlocked` 函数：

普通的 getchar 带线程锁，然而 OI 用不上多线程（~~你用多线程就算作弊~~），所以我们在日常使用中可以把它关闭，以达到提高输入效率的目的，故可用更快的 getchar_unlocked 代替。

线程锁：

在多线程环境中，多个线程可能同时访问共享资源（举例如文件、内存、全局变量），线程锁在这个时候用于强制互斥访问：某个线程访问资源前先上锁，其他线程就阻塞在一边排队等待；操作完成后释放锁，允许其他线程访问。这可以防止 Data Race 导致程序崩溃或数据错误。

getchar 本身是线程安全的，它每调用一次就要先上锁，再读一个字符，再开锁，来保证安全。但是如果只有一个线程访问标准输入（事实上 OI 中整个程序只有一个线程），这个频繁开关锁的过程就没有必要了。

所以使用不带线程锁直接读入的 getchar_unlocked 可以加快读入速度。

注意：getchar_unlocked 需要在 Linux 下才能通过编译，Windows 下需要使用 _getchar_nolock

使用实例：

getchar_unlocked() 和 getchar() 在功能上基本一致，所以只需要将上述代码中的 getchar 替换成 getchar_unlocked 即可。

#include <cstdio>
#include <cctype>
int readint() {
    int k = 0, f = 1, c = getchar_unlocked();
    for (; !isdigit(c); c = getchar_unlocked()) if (c == '-') f = -1;
    for (; isdigit(c); c = getchar_unlocked()) k = k * 10 + (c ^ 48);
    return k * f;
}
int main() {
    int n = readint(), ans = 0;
    while (n --) ans += readint();
    printf("%d", ans);
}

评测结果：读入 10^7 个整数需要 182 ms 左右，并能在 1950 ms 内读入 10^8 个整数。

更方便本地运行的宏定义：

由于 getchar_unlocked() 在 Windows 系统上没有定义，所以在本地无法运行。因此我们可以在代码中添加以下宏定义方便自己进行调试。

#ifdef __unix__
#define gc getchar_unlocked
#else
#define gc _getchar_nolock
#endif

该代码在常见的 Linux 和 Windows 环境中都不会 CE。

3. `fread` 快读

3.1 介绍

通过 fread 可以实现更快的读入。

fread 能将需要的文件部分读入内存缓冲区。

由于 fread 是整段整段读取，所以比 getchar() 要快的多。

fread 类似于参数为 "%s" 的 scanf，不过它更为快速，而且可以一次性读入若干个字符（包括空格、换行，甚至可以是不可见字符），如果缓存区足够大，甚至可以一次性读入整个文件。

以下是 fread 的函数原型（C）：

size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);

函数 fread 会从 stream 流中读取 nmemb 个长度为 size 字节大小的数据，并将它们存储在 ptr 给定的位置，返回读取的字节数。

如 fread(buf, 1, 114514, stdin); 就是从标准输入流中读取 114514 个 1 字节大小的字符，存储在 buf 里。

3.2 使用

我们可以重定义一个 getchar() 函数（以下是 gc 函数）。可以一次性读入 2^{20} 个字符到 buf，然后从 buf 中读入一个 char，也就是头指针向后移动一位。

char buf[1<<20], *p1, *p2;
char gc() { // getchar
    if (p1 == p2) { // 读入的 2^20 个字符已经被读完
        p1 = buf;
        p2 = buf + fread(buf, 1, 1<<20, stdin); // 再读 2^20 个字符
        if (p1 == p2) return EOF; // p1 == p2，说明没有读入成功，即读到文件结束符
    }
    return *p1++; // 读入的 2^20 个字符没有用完
}
int readint() { // 普通的快读，仅更改 getchar() 为 gc()
    int k = 0, f = 1, c = gc();
    for (; !isdigit(c); c = gc()) if (c == '-') f = -1;
    for (; isdigit(c); c = gc()) k = k * 10 + (c ^ 48);
    return k * f;
}

当然上面的写法是为了方便理解，实际上多次调用函数很影响效率，甚至添加 inline 也是无效的，导致上述代码甚至没有普通快读快，所以一般我们把它写成 #define 形式。

#include <cctype>
#include <cstdio>
#include <cstdint>
uint8_t buf[1<<20], *p1, *p2;
#define gc() (p1==p2 && (p2=(p1=buf)+fread(buf,1,1<<20,stdin)),*p1++)
int readint() {
    int k = 0, f = 1, c = gc();
    for (; !isdigit(c); c = gc()) if (c == '-') f = -1;
    for (; isdigit(c); c = gc()) k = k * 10 + (c ^ 48);
    return k * f;
}
int main() {
    int n = readint(), ans = 0;
    while (n --) ans += readint();
    printf("%d", ans);
}

评测结果：读入 10^7 个整数需要 137 ms 左右，并能在 1450 ms 内读入 10^8 个整数。

注：

调试的时候，输入完成后请手动输入 EOF（Linux 上为 Ctrl + D，Windows 上为 Ctrl + Z）。
使用此快读请不要使用其他输入方式（包括 getchar、scanf、cin），否则会因为缓存问题产生错误。（除非你非常了解原理，而且清楚自己在做什么）

4. `mmap` 快读

4.1 介绍

mmap 是 Linux 系统函数，它的功能是将文件一次性地映射到内存中（注意不是拷贝）。在一些场景下有更优的速度。

注意 mmap 不能在 Windows 系统中通过编译，同时也不建议在正式的 OI 赛制的比赛中使用（但在很多卡常大赛中使用的频率很高）。且在使用前需要引入头文件 fcntl.h，unistd.h，sys/stat.h 与 sys/mman.h。

读入示例：首先要获取文件描述符 fd，然后通过 fstat 获取文件信息以得到文件大小，此后通过 char *pc = (char *) mmap(NULL, state.st_size, PROT_READ, MAP_PRIVATE, fd, 0); 将指针 *pc 指向我们的文件。可以直接用 *pc ++ 替代 getchar()。

当我们要提交不使用文件操作的题目时，可以将 fd 设为 0，表示从 stdin 读入。但是，对 stdin 使用 mmap 是极其危险的行为，同时不能在终端输入，所以不建议使用 mmap 从标准输入中读入。

mmap 函数的使用示例（该代码实现了读入一个整数并输出）：

#include <bits/stdc++.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#define gc() (*pc++)
char *pc;
int readint() {
    int k = 0, f = 1, c = gc(); // getchar 返回值都是 int，当然没有必要转成 char
    for (; !isdigit(c); c = gc()) if (c == '-') f = -1;
    for (; isdigit(c); c = gc()) k = k * 10 + (c ^ 48); // 也可以写 (c & 15)
    return k * f;
}

int main() {
    int fd = open("*.in", O_RDONLY);
    struct stat state;
    fstat(fd, &state);
    pc = (char *)mmap(NULL, state.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    printf("%d", readint());
}

4.2 `mmap` 在快读上的使用

4.2.1（基础用法）

直接照搬 fread，然后把 fread 改为 mmap，但此时我们不需要 buf 数组，而是使用指针（char* 或 uint8_t*）代替。

#include <cctype>
#include <cstdio>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
uint8_t* str;
#define gc() (*str ++)
int readint() {
    int k = 0, f = 1; uint8_t c = gc();
    for (; !isdigit(c); c = gc()) if (c == '-') f = -1;
    for (; isdigit(c); c = gc()) k = k * 10 + (c ^ 48);
    return k * f;
}
int main() {
    struct stat s;
    fstat(0,&s);
    str=(uint8_t*)mmap(NULL,s.st_size,1,2,0,0); // PROT_READ, MAP_PRIVATE 分别是 1,2
// ↑以上是 mmap 进行读入操作
    int n = readint(), ans = 0;
    while (n --) ans += readint();
    printf("%d", ans);
}

评测结果：读入 10^7 个整数需要 124 ms 左右，并能在 1280 ms 内读入 10^8 个整数。

4.2.2 打表优化（非常规优化方式）

可以发现我们现在读入字符仍然是一个一个字符读入的（因为是 *str ++，每次只移动一个字节）。如果我们每次读入的字符大于一个，可以更加节省时间。

考虑到一个 char / uint8_t 的范围为 -128\sim 127 或 0\sim 255，一共是 256 种取值，所以两个 char 组合一共有 65536 种情况。

可以发现 65536 是一个比较小的数字，开一个长度为 65536 的数组进行打表也是完全可行的。所以我们可以打出两个字符拼成的数字是什么，然后使用下标访问得到答案。

打表代码：

uint32_t a[65536+1];
memset(a, -1, sizeof a);
for (int i = 48; i <= 57; ++ i)
    for (int j = 48; j <= 57; ++ j)
        a[i << 8 | j] = (j ^ 48) * 10 + (i ^ 48);

读入整数：

int v, f; // v: 输入数字的绝对值；f：输入数字的符号（0 正；1 负）
for (; n --; ) {
    c+=f=*c==45; // 等价于 { f = 0; if (*c == '-') f = 1, ++ c; } // 读到负号，则记录，并跳过负号
    v=0;
    if (~a[*(uint16_t*)(c)]) v=a[*(uint16_t*)c],c+=2; // 一次读入两个数字
    if (~a[*(uint16_t*)(c)]) v=v*100+a[*(uint16_t*)c],c+=2;
    if (~a[*(uint16_t*)(c)]) v=v*100+a[*(uint16_t*)c],c+=2;
    if (~a[*(uint16_t*)(c)]) v=v*100+a[*(uint16_t*)c],c+=2;
    if (~a[*(uint16_t*)(c)]) v=v*100+a[*(uint16_t*)c],c+=2;
    if (*c>47) v=v*10+((*c++)^48); // 一次读入一个数字（如果长度是奇数就会执行到这里）
    ans+=(f?-v:v);
    while (isspace(*c)) ++c; // 跳过空格/换行
}

需要注意的是，这里一次性得到两个字符需要使用指针强制转换，因为 *c<<16|*c 中会增加两次位运算。

完整代码：

#include<stdio.h>
#include<ctype.h>
#include<stdint.h>
#include<string.h>
#include<sys/stat.h>
#include<sys/mman.h>

int main(){
    int a[0x10000+1]; // 打表数组（0x10000 == 65536）
    memset(a,-1,0x40000);
    for (int i=48;i<=57;++i)
        for (int j=48;j<=57;++j)
            a[i<<8|j]=(j^48)*10+(i^48);
    struct stat s;
    fstat(0,&s);
    char* c=(char*)mmap(0, s.st_size, PROT_READ, MAP_SHARED, 0, 0);
    int n = 0, ans=0;
    while (isdigit(*c)) n=n*10+((*c++)^48);
    while (isspace(*c)) ++c; // 跳过空格和换行
    int v, f; // v: 输入数字的绝对值；f：输入数字的符号（0 正；1 负）
    for (; n --; ) {
        c+=f=*c==45; // 等价于 { f = 0; if (*c == '-') f = 1, ++ c; } // 读到负号，则记录，并跳过负号
        v=0;
        if (~a[*(uint16_t*)(c)]) v=a[*(uint16_t*)c],c+=2; // 一次读入两个数字
        if (~a[*(uint16_t*)(c)]) v=v*100+a[*(uint16_t*)c],c+=2;
        if (~a[*(uint16_t*)(c)]) v=v*100+a[*(uint16_t*)c],c+=2;
        if (~a[*(uint16_t*)(c)]) v=v*100+a[*(uint16_t*)c],c+=2;
        if (~a[*(uint16_t*)(c)]) v=v*100+a[*(uint16_t*)c],c+=2;
        if (*c>47) v=v*10+((*c++)^48); // 一次读入一个数字（如果长度是奇数就会执行到这里）
        ans+=(f?-v:v);
        while (isspace(*c)) ++c; // 跳过空格/换行
    } printf("%d", ans);
}

评测结果：读入 10^7 个整数需要 63 ms 左右，并能在 626 ms 内读入 10^8 个整数。

4.2.3 其它优化

我们的优化并没有到达极限。

我们发现 a 数组中的数字是由 12336 个 -1 + 每两个字符对应的数字 + 50786 个 -1 组成的，所以我们可以将 a 数组定义成 const 类型。但很显然这样会增加代码长度，所以我们需要使用 #define 减小代码长度。
在 5 个 if 语句中我们可以发现越前面的语句越容易满足，所以我们可以使用 __builtin_expected(CONDITION,0/1) 告诉编译器条件 CONDITION 更偏向 False(0)/True(1)（根据题目的输入进行优化）。
发现输入的数字 \in[-n,n]，其中 n_{max}=10^8，说明输入数字长度最多为 9，所以可以去除一个 if。
发现第二行输入的数中间都是一个空格，所以可以用一个 ++c 代替 while。

根据优化 1，我们可以像这样初始化 a 数组。

#define N1 -1,-1,-1,-1,-1,-1,-1,-1,-1,-1
#define N2 N1,N1,N1,N1,N1,N1,N1,N1,N1,N1
#define N3 N2,N2,N2,N2,N2,N2,N2,N2,N2,N2
#define N4 N3,N3,N3,N3,N3,N3,N3,N3,N3,N3
#define D(a) a,1##a,2##a,3##a,4##a,5##a,6##a,7##a,8##a,9##a
#define S246 N2,N2,N1,N1,N1,N1,-1,-1,-1,-1,-1,-1
const int _mp[0x10000]={
N4,N3,N3,N2,N2,N2,N1,N1,N1,-1,-1,-1,-1,-1,-1, // 12336个-1
D(0),S246,D(1),S246,D(2),S246,D(3),S246,D(4),S246, // 数字组0-4
D(5),S246,D(6),S246,D(7),S246,D(8),S246,D(9), // 数字组5-9
N4,N4,N4,N4,N4,N2,N2,N2,N2,N2,N2,N2,N2,N1,N1,N1,N1,N1,N1,N1,N1,
-1,-1,-1,-1,-1,-1
};
#undef N1
#undef N2
#undef N3
#undef N4
#undef D
#undef S246

根据优化 2，我们可以把第一个 `if` 语句写成 `if (__builtin_expect(~a[*(uint16_t*)(c)], 1)) v=a[*(uint16_t*)c],c+=2;` 所以我们得到了以下代码（但是用 `and` 短路代替了常规的 `if`）： ```cpp #include<stdio.h> #include<ctype.h> #include<stdint.h> #include<string.h> #include<sys/stat.h> #include<sys/mman.h> #define N1 -1,-1,-1,-1,-1,-1,-1,-1,-1,-1 #define N2 N1,N1,N1,N1,N1,N1,N1,N1,N1,N1 #define N3 N2,N2,N2,N2,N2,N2,N2,N2,N2,N2 #define N4 N3,N3,N3,N3,N3,N3,N3,N3,N3,N3 #define D(a) a,1##a,2##a,3##a,4##a,5##a,6##a,7##a,8##a,9##a #define S246 N2,N2,N1,N1,N1,N1,-1,-1,-1,-1,-1,-1 const int _mp[0x10000]={ N4,N3,N3,N2,N2,N2,N1,N1,N1,-1,-1,-1,-1,-1,-1, // 12336个-1 D(0),S246,D(1),S246,D(2),S246,D(3),S246,D(4),S246, // 数字组0-4 D(5),S246,D(6),S246,D(7),S246,D(8),S246,D(9), // 数字组5-9 N4,N4,N4,N4,N4,N2,N2,N2,N2,N2,N2,N2,N2,N1,N1,N1,N1,N1,N1,N1,N1, -1,-1,-1,-1,-1,-1 }; #undef N1 #undef N2 #undef N3 #undef N4 #undef D #undef S246 int main(){ struct stat s; fstat(0,&s); uint8_t* c=(uint8_t*)mmap(NULL,s.st_size,1,2,0,0); int v,f,n=0,ans=0; while (isdigit(*c)) n=n*10+(15&*c++); while (isspace(*c)) ++c; for (;n--;++c){ c+=f=(*c==45); v=0; __builtin_expect(~_mp[*(uint16_t*)(c)], 1) && (v=_mp[*(uint16_t*)(c)],c+=2), (~_mp[*(uint16_t*)(c)]) && (v=v*100+_mp[*(uint16_t*)(c)],c+=2), (~_mp[*(uint16_t*)(c)]) && (v=v*100+_mp[*(uint16_t*)(c)],c+=2), (~_mp[*(uint16_t*)(c)]) && (v=v*100+_mp[*(uint16_t*)(c)],c+=2), isdigit(*c) && (v=v*10+(15&*c++)), ans+=(f?-v:v); } printf("%d", ans); return 0; } ``` [评测结果（C++）](https://www.luogu.com.cn/record/225933271)：读入 $10^7$ 个整数需要 $53$ ms 左右，并能在 $464$ ms 内读入 $10^8$ 个整数。 [评测结果（C）](https://www.luogu.com.cn/record/226615350)：读入 $10^7$ 个整数需要 $52$ ms 左右，并能在 $451$ ms 内读入 $10^8$ 个整数。 ~~然后就拿下了这道题的最优解（除旧数据和套数据点外）~~ #### 4.2.4 可能的继续优化方向以下是可能能够提升速度的方向： 1. 找到 $3$ 字节整数类型（`int24`），然后将打表一次两个字符改为三个字符。 2. 减少类型转换次数。 3. 有更快的函数能替代 `isdigit/isspace`（已知比较运算符不可能更快）。 4. ~~更多玄学做法/人类智慧~~。 # 二、输出部分 **注意：** 1. 由于在 `int` 范围内，`-2147483648 == -(-2147483648)`，所以在输出这个数的时候会出现问题，需要进行特判，或者转换成 `uint32_t` 后再输出（为方便理解，以下采用前者）。 2. 部分非递归快写还需要特判 `0`。该部分的评测结果均以 [U588066 Fast Write](https://www.luogu.com.cn/problem/U588066) 为例。 ## 1. `C++` 风格输出的优化见「一、1、`cin` / `cout` 的优化（关闭同步和解除绑定）」 [评测结果](https://www.luogu.com.cn/record/227085184)：输出 $10^7$ 个长度为 $9\sim 11$ 的数字需要 $561$ ms。 ## 2. `C` 风格的输出优化先看 `printf` 函数的 [测试结果](https://www.luogu.com.cn/record/227085359)：输出 $10^7$ 个长度为 $9\sim 11$ 的数字需要 $659$ ms。可见 `printf` 的输出性能不是特别理想，以下是代替 `printf` 的方法。 ### `putchar(x)/putchar_unlocked(x)` 与 `getchar` 类似，`putchar` 是用来输出 1 byte 的数据并将其转换为 `char` 类型进行输出（但实际上的接收值的类型是 `int`），且速度很快，故可以用逐位输出来代替 `printf`。我们可以每次对一个正整数除以 10，然后记录余数，存放在数组里，最后逆序输出（也可以逆序存放，顺序输出），同时不要忘记要**特判 `0` 和 `-2147483648`**。 ### `puts(_String)` `puts` 是一个用于快速输出 C 风格字符串（字符数组）的函数，接收值为 `const char*`，当输出到 `\0` 时停止。考虑到我们要进行多次 `putchar` 操作，则我们可以把这些操作合并成一个 `puts` 操作。 ```c #include <stdio.h> #include <string.h> #include <stdint.h> typedef int i32; uint8_t fr[17]; inline char* to_str(int x) { if (x == 0) return "0"; if (x == -2147483648) return "-2147483648"; uint8_t *tot = fr + 17; if (x < 0) putchar(45), x = -x; while(x) *(-- tot) = x % 10 | 48, x /= 10; return (char*)tot; } main() { int n, x; scanf("%d%d", &n, &x); fputs(to_str(n), stdout); putchar(32); puts(to_str(x)); for (int i = 1; i <= n; ++ i) { x ^= x << 13; x ^= x >> 17; x ^= x << 5; puts(to_str(x)); } } ``` [测试结果](https://www.luogu.com.cn/record/227085872)：输出 $10^7$ 个长度为 $9\sim 11$ 的数字需要 $365$ ms。 ### 循环展开优化考虑到一个 `int` 类型的变量最大为 10 位数，所以我们可以使用 10 个 `if` 语句代替 `while` 循环。 ```c inline char* to_str(int x) { // Like this. if (x == 0) return "0"; uint8_t *tot = fr + 17; (x < 0) && (putchar(45), x = -x); (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10); return (char*)tot; } ``` 但实际上可能是因为分支预测，这段代码和上一段代码的运行时间基本一致，但下文的优化需要使用到循环展开。 ## 3. `fwrite` 在快写上的应用与 `fread` 相似，用于实现更快的快写。函数原型： ```c size_t fwrite(const void * buffer, size_t size, size_t count, FILE * stream); ``` `fwrite` 函数将 count 个对象写入给定的输出流，每个对象的大小为 `size` 个字节。与 `puts` 不同的其中一点，`puts` 在输出到 `\0` 的时候就会停止输出，而 `fwrite` 可以指定输出范围，即使遇到 `\0` 也可以继续输出，并且不会像 `puts` 一样输出完成后还会加一个多余的 `\n`。在上个版本的快写上进行修改，并结合 `fread` 快读的思想，我们同样需要创建一个 `obuf` 数组，用于存储输出内容。当存储到达一定量时，一次性输出并清空缓冲区。 ```c #include <stdio.h> #include <string.h> #include <stdint.h> #define BUF (1 << 22) #define putchar(x) (now ++[0] = (x)) // 重定义 putchar typedef int i32; uint8_t obuf[BUF + 17], // 需要多预留一点空间，防止缓冲区溢出引起错误 *now = obuf; uint8_t fr[17]; inline void write(int x) { if (!x) return(void)(putchar(48)); if (x == -2147483648) return(void)(memmove(now, "-2147483648", 11), now += 11); uint8_t* tot = fr + 17; (x < 0) && (now ++[0] = 45, x = -x); (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10), (x) && ((-- tot)[0] = x % 10 | 48, x /= 10); memmove(now, tot, fr + 17 - tot); // 内存复制 now += fr + 17 - tot; } main() { int n, x; scanf("%d%d", &n, &x); write(n), putchar(32), write(x), putchar(10); for (int i = 1; i <= n; ++ i) { x ^= x << 13; x ^= x >> 17; x ^= x << 5; write(x), putchar(10); if (now - obuf > BUF) { // 存储到达一定量，清空缓冲区 fwrite(obuf, 1, now - obuf, stdout); now = obuf; } } fwrite(obuf, 1, now - obuf, stdout); // 在程序结束时也要输出 } ``` [测试结果](https://www.luogu.com.cn/record/227087909)：输出 $10^7$ 个长度为 $9\sim 11$ 的数字需要 $225$ ms。 ## 4. 更多优化（包括打表优化与宏定义封装） 1. 与上文的快读中一样可以采用打表优化，不同的是，这次我们可以一次打四个数字。 2. 使用 `__builtin_expect(...,0/1)` 对第一个（伪）`if` 语句进行优化。 3. 考虑到 `inline` 的内联不够彻底，我们可以直接使用宏定义。根据优化 1，我们可以写出以下打表代码： ```cpp int fnd[10000]; for (int n = 0000; n <= 9999; ++ n) fnd[n] = (n/1000%10+'0')|((n/100%10+'0')<<8)|((n/10%10+'0')<<16)|((n%10+'0')<<24) ``` 然后使用宏定义初始化列表： ```cpp #define D(n) ((n)/1000%10+'0')|(((n)/100%10+'0')<<8)|(((n)/10%10+'0')<<16)|(((n)%10+'0')<<24) #define D10(s) D((s)+0),D((s)+1),D((s)+2),D((s)+3),D((s)+4),D((s)+5),D((s)+6),D((s)+7),D((s)+8),D((s)+9) #define D100(s) D10((s)+0),D10((s)+10),D10((s)+20),D10((s)+30),D10((s)+40),D10((s)+50),D10((s)+60),D10((s)+70),D10((s)+80),D10((s)+90) #define D1000(s) D100((s)+0),D100((s)+100),D100((s)+200),D100((s)+300),D100((s)+400),D100((s)+500),D100((s)+600),D100((s)+700),D100((s)+800),D100((s)+900) const unsigned fnd[] = { D1000(0),D1000(1000),D1000(2000),D1000(3000),D1000(4000), D1000(5000),D1000(6000),D1000(7000),D1000(8000),D1000(9000) }; ``` $650$ 左右个字符就完成初始化了。最终代码如下： ```cpp #include <stdio.h> #include <string.h> #include <stdint.h> typedef int i32; #define K 11 // Fnd list #define D(n) ((n)/1000%10+'0')|(((n)/100%10+'0')<<8)|(((n)/10%10+'0')<<16)|(((n)%10+'0')<<24) #define D10(s) D((s)+0),D((s)+1),D((s)+2),D((s)+3),D((s)+4),D((s)+5),D((s)+6),D((s)+7),D((s)+8),D((s)+9) #define D100(s) D10((s)+0),D10((s)+10),D10((s)+20),D10((s)+30),D10((s)+40),D10((s)+50),D10((s)+60),D10((s)+70),D10((s)+80),D10((s)+90) #define D1000(s) D100((s)+0),D100((s)+100),D100((s)+200),D100((s)+300),D100((s)+400),D100((s)+500),D100((s)+600),D100((s)+700),D100((s)+800),D100((s)+900) // End #define _Write(y) {\ int $ = (y);\ uint8_t* tot = fr + K;\ ($ < 0) && (now ++[0] = 45, $ = -$);\ ($ > 999) && (*(int*)(tot-=4) = fnd[$%10000], $ /= 10000),\ ($ > 999) && (*(int*)(tot-=4) = fnd[$%10000], $ /= 10000),\ ($ > 9) && (*(uint16_t*)(tot-=2) = fnd[$%100] >> 16, $ /= 100),\ ($) && ((-- tot)[0] = $ % 10 | 48, $ /= 10);\ memmove(now, tot, fr + K - tot);\ now += fr + K - tot;\ } #define write(x) \ if (__builtin_expect(x == 0, 0)) now++[0] = 48;\ else if (__builtin_expect(x == -2147483648, 0)) memmove(now, "-2147483648", 11), now += 11;\ else _Write(x) #define putchar(x) (now++[0] = (x)); int main() { uint8_t fr[K]; uint8_t obuf[(1 << 22) + 17]; const int fnd[] = { D1000(0),D1000(1000),D1000(2000),D1000(3000),D1000(4000), D1000(5000),D1000(6000),D1000(7000),D1000(8000),D1000(9000) }; uint8_t* now = obuf; int n, x; scanf("%d%d", &n, &x); printf("%d %d\n", n, x); for (int i = 1; i <= n; ++ i) { x ^= x << 13; x ^= x >> 17; x ^= x << 5; write(x) putchar(10) (now - obuf > (1 << 22)) && (fwrite(obuf, 1, now - obuf, stdout), now = obuf); } fwrite(obuf, 1, now - obuf, stdout); } ``` [测试结果](https://www.luogu.com.cn/record/227067631)：输出 $10^7$ 个长度为 $9\sim 11$ 的数字需要 $159$ ms。 # 三、 IO 优化的弊端 IO 优化虽然能够显著提升程序的运行速度，但并非没有缺点。 1. IO 优化可能会导致程序可读性下降，尤其是使用 `fread`、`mmap` 或宏定义形式的函数会导致调试困难。 2. 容易引入隐蔽 Bug，比如缓冲区溢出，没有特判 `0` 或 `-2147483648`（以 32 位整型变量为例），或者将 EOF（文件结束符）“吃掉”。 3. 优化可能适得其反，比如递归形式的普通快写在很多时候会比去除同步流的 `cout` 更慢。 4. 其它低级错误导致代码挂掉。 # 四、 THE END 完结撒花。如果有问题/更好的做法欢迎指出。 upd: 2025/7/26 1. 增加了快速输出。 2. 更改了部分宏定义和代码缩进。 3. 更改了文章结构。 upd: 2025/10/16 1. 修改了部分代码和少量错别字。 2. 修改了少量不详细的地方。