关于我在无意间被浮点数的指令集变成废柴这件事

bochibochi · 2025-06-21 12:11:18 · 科技·工程

先看一段代码（修改自我的乘法取模方式果然有问题）

#include <chrono>
#include <iostream>
#include <random>
using namespace std;
using cl = chrono::high_resolution_clock;
mt19937 rng(123);
const int N = 3e4, V = 29, mod = 1e9 + 7;
int Mod = mod, a[N + 10], b[N + 10];
inline int mul1(int a, int b) {
    int ret = a * b - (int(double(a) / Mod * b + 0.5)) * Mod;
    return ret + (ret < 0) * Mod;
}
inline int mul2(int a, int b) { return (long long)a * b % mod; }
int main() {
    for (int i = 1; i <= N; i++) {
        a[i] = rng() & (1 << V) - 1;
        b[i] = rng() & (1 << V) - 1;
    }
    int ans = 0;
    auto Start = cl::now();
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++) ans ^= mul1(a[i], b[j]);
    cout << (cl::now() - Start).count() / 1e9 << endl;
    Start = cl::now();
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++) ans ^= mul2(a[i], b[j]);
    cout << (cl::now() - Start).count() / 1e9 << endl;
    cout << ans << endl;
    return 0;
}

mul1 函数是用 double 手动算 a*b 对 mod 取模的结果。

mul2 函数是用 long long 直接算 a*b%mod。

在我的电脑上，两种方式分别用时（测 5 次取平均）：

0.427343
0.759699

发现 double 居然比 long long 快？！

然而，我们简单修改以上代码，将 main 函数改为

@@ -19,11 +19,11 @@
     int ans = 0;
     auto Start = cl::now();
     for (int i = 1; i <= N; i++)
-        for (int j = 1; j <= N; j++) ans ^= mul(a[i], b[j]);
+        for (int j = 1; j <= N; j++) ans ^= mul(ans & ((1 << V) - 1), b[j]);
     cout << (cl::now() - Start).count() / 1e9 << endl;
     Start = cl::now();
     for (int i = 1; i <= N; i++)
-        for (int j = 1; j <= N; j++) ans ^= Mul(a[i], b[j]);
+        for (int j = 1; j <= N; j++) ans ^= Mul(ans & ((1 << V) - 1), b[j]);
     cout << (cl::now() - Start).count() / 1e9 << endl;
     cout << ans << endl;
     return 0;

再次运行，得到了如下的结果：

7.33341
2.80966

这次两种方式都显著变慢了，但 double 的方案远慢于 long long。

为什么会有这个现象呢？

对于第一份代码中的 mul1，我们在开 -O2 优化之后编译器会进行一定的自动向量化。具体来说，因为内层循环体每次执行不会依赖上一次的结果，运算可以并行执行，因此被编译器向量化了。

什么是向量化？你可以简单理解为，现代 cpu 可以一次指令操作数组中的多个数据，而不是一次处理一个数据，例如 avx2 通常可以同时操作 256位，相当于 4 个 double 或 8 个 float。

至于 mul2，没有被向量化是因为 cpu 并没有提供一条指令能同时对多个整形变量做除法或取模。

对于第二份代码，由于它每次循环之间有数据依赖，所以没办法并行化，因此两个函数都显著变慢。mul1 慢的更多是因为浮点类型运算本来就比整形慢。

可以通过添加编译选项 -fno-tree-vectorize 来禁用任何给定循环的向量化。禁用后，第一份代码运行时间变为

0.655899
0.753851

发现 mul1 仍然略微比 mul2 要快，这是因为 cpu 仍然可以对 mul1 进行一定程度的并行化。

如果你想测试向量化真正速度，可以加入编译选项 -march=native -mtune=native，根据本机支持的指令集，进行极致的优化。

0.119438
0.688546

mul1 只用了惊人的 0.12 秒就跑完了（取决于 cpu 可能有很大差异）。

写在最后

感觉一般能自动向量化的场景并不多，所以还是老老实实用 long long 取模吧（）

测试所用的机器配置

AMD Ryzen 7 7700 @ 3.8~5.3GHz
Ubuntu clang version 20.1.5 (++20250430014744+ebfae55af454-1~exp1~20250430014805.108)
Ubuntu jammy 22.04
Linux version 6.6.87.2-microsoft-standard-WSL2

编译选项是

-O2 -std=c++23 -stdlib=libc++