万能的字符串 Hash

PineappleSummer · 2024-11-04 16:15:17 · 算法·理论

修改记录：

2024/11/06 初稿

转载请在文章显著处注明出处。

众所周知，字符串 Hash 是一种很好的暴力，但是，可以通过 Hash 实现 Trie 模板、KMP 模板，以及 Manacher 模板。

字符串哈希模板为 P3370 【模板】字符串哈希，在此不做讲解。

Hash 实现 Trie

如果对 Trie 树有了解，一定知道 Trie 树上的一个节点代表了一个字符串前缀，Trie 树上的一条边代表了一个字符。

Trie 树上每个节点代表的字符串前缀是独一无二的，而一个字符串前缀所对应的哈希值也是独一无二的，考虑将 Trie 树上的节点替换为一个哈希值。

来看 Trie 树模版题 P8306 【模板】字典树。

给定 n 个模式串 s_1, s_2, \dots, s_n 和 q 次询问，每次询问给定一个文本串 t_i，请回答 s_1 \sim s_n 中有多少个字符串 s_j 满足 t_i 是 s_j 的前缀。

记 f_h 为哈希值为 h 的前缀出现的次数，对于每个模式串，计算出其每个前缀的哈希值 h，让 f_h\gets f_h+1。

对于每次查询，记查询串的哈希值为 H，则该字符串作为前缀的次数为 f_H。

将 unordered_map 平均时间复杂度视为 \mathcal O(1)，该做法时间复杂度 \mathcal O(\sum |s|)。

对于 Hash，使用了自然溢出，CF 上可能卡掉。

这里给出 Hash 实现 Trie 的参考代码，可以通过 P8306 【模板】字典树：

#include <bits/stdc++.h>
using namespace std;
typedef unsigned long long ull;
#define umap unordered_map
const int base = 233;
int n, q, ans;
string str;
umap <ull, int> f;
void solve () {
    cin >> n >> q; f.clear ();
    for (int i = 1; i <= n; ++i) {
        cin >> str; ull h = 0;
        for (auto j : str) h = h * base + (ull) j, f[h] ++;
    }
    while (q--) {
        cin >> str; ull h = 0; ans = 0;
        for (auto i : str) h = h * base + (ull) i;
        cout << f[h] << '\n';
    }
}
signed main () {
    ios::sync_with_stdio (false);
    cin.tie (0); cout.tie (0);

    int T; cin >> T;
    while (T--) solve ();
    return 0;
}

Hash 实现 KMP

P3375 【模板】KMP

给出两个字符串 s_1 和 s_2，若 s_1 的区间 [l, r] 子串与 s_2 完全相同，则称 s_2 在 s_1 中出现了，其出现位置为 l。
现在请你求出 s_2 在 s_1 中所有出现的位置。对于字符串 s_2，你还需要求出对于其每个前缀 s' 的最长 border t' 的长度。

首先看第一问，求 s_2 在 s_1 中出现的位置，是很好处理的。

设 s_1 的长度为 n，s_2 的长度为 m。我们可以先求出 s_2 整个字符串的哈希值，然后对于 1\le i\le n-m+1，计算 s_1 中 [i,i+m-1] 的串的哈希值是否等于 s_2 的哈希值，如果相等，输出该位置。时间复杂度 \mathcal O(n+m)。

下文提到的 s[:i] 指 s 中 1\sim i 位字符组成的前缀。

对于第二问，求出 s_2 每个前缀的 border，需要利用一个性质：前缀 s[:i] 的 border 至多比前缀 s[:i-1] 的 border 长读大 1。证明显然。

设 j 为 s[:i-1] 的 border，那么我们判断 s[:j+1] 与 s[i-j:i] 是否相同，如果相同，跳出循环，否则让 j\gets j-1，继续判断，直到满足条件或者 j 为负数。

那么退出循环时一定满足 s[:j+1] 与 s[i-j:i] 相同，让 j\gets j+1，则前缀 s[:i] 的 border 就为 j。

由于 j 每次至多增加 1，因此 j 总共最多减少 m，所以第二问时间复杂度 \mathcal O(m)。

这里给出 Hash 实现 KMP 的参考代码，可以通过 P3375 【模板】KMP：

#include <bits/stdc++.h>
using namespace std;
typedef unsigned long long ull;
const int N = 1e6 + 10;
const ull base = 233;
string s, t;
int n, m;
ull g[N], h[N], bs[N];
ull get1 (int l, int r) { return g[r] - g[l - 1] * bs[r - l + 1]; }
ull get2 (int l, int r) { return h[r] - h[l - 1] * bs[r - l + 1]; }
signed main () {
    ios::sync_with_stdio (false);
    cin.tie (0); cout.tie (0);

    cin >> s >> t;
    n = s.size (), m = t.size ();
    s = " " + s; t = " " + t;
    bs[0] = 1;
    for (int i = 1; i < N; ++i) bs[i] = bs[i - 1] * base;
    for (int i = 1; i <= n; ++i) g[i] = g[i - 1] * base + (ull) s[i];
    for (int i = 1; i <= m; ++i) h[i] = h[i - 1] * base + (ull) t[i];
    for (int i = 1; i <= n - m + 1; ++i)
        if (get1 (i, i + m - 1) == h[m]) cout << i << '\n'; // 求出出现位置
    cout << 0 << ' ';
    for (int i = 2, j = 0; i <= m; ++i) {
        while (j >= 0 && get2 (1, j + 1) != get2 (i - j, i)) j--;
        j ++; cout << j << ' ';
    }
    return 0;
}

Hash 实现 Manacher

P3805 【模板】manacher

给出一个只由小写英文字符 \texttt a,\texttt b,\texttt c,\ldots\texttt y,\texttt z 组成的字符串 S，求 S 中最长回文串的长度。字符串长度为 n。1\le n\le 1.1\times 10^7。

哈希最大的优点是 \mathcal O(1) 判断两个子串是否相同。

设 j 为右端点为 i-1 时最长的回文子串的左端点。考虑右端点变为 i 时，j 如何移动。

可以发现 j 如果向左移，至多移动 1 个长度，也就是说，j 最小变为 j-1。

而 j 具体是多少呢，我们可以先将 j 移动到 j-1，然后判断 [j,i] 是否是一个回文串，如果不是，让 j\gets j+1，直到 [j,i] 成为一个回文串或者 j=i。答案每次对 i-j+1 取 \max 即可。

每次 j 至多减 1，因此至多向左移动 n 次，至多向右移动 n 次，时间复杂度 \mathcal O(n)。

对于如何判断一个子串是否为回文串，可以正着做一遍再反着做一遍哈希，看看子串前一半和后一半是否相同即可。

这里给出一份 Hash 实现 Manacher 的参考代码，可以通过 P3805 【模板】manacher：

#include <bits/stdc++.h>
using namespace std;
typedef long long LL;
typedef unsigned long long ull;
const int N = 1.1e7 + 10;
const int base = 233;
int n, r[N], ans;
string s;
ull h1[N], h2[N], bs[N];
ull get1 (int l, int r) { return h1[r] - h1[l - 1] * bs[r - l + 1]; }
ull get2 (int l, int r) { return h2[l] - h2[r + 1] * bs[r - l + 1]; }
signed main ()
{
    ios::sync_with_stdio (false);
    cin.tie (0); cout.tie (0);

    cin >> s; n = s.size (); s = " " + s;
    bs[0] = 1;
    for (int i = 1; i <= n; i++) {
        h1[i] = h1[i - 1] * base + ull (s[i]);
        bs[i] = bs[i - 1] * base;
    }
    for (int i = n; i >= 1; i--) h2[i] = h2[i + 1] * base + ull (s[i]); // 反向哈希
    for (int i = 1, j = 1; i <= n; i++) {
        if (j != 1) j--;
        while (j < i) {
            int len = (i - j + 1) / 2;
            if (get2 (j, j + len - 1) == get1 (i - len + 1, i)) // 是回文子串
                break;
            j++;
        }
        ans = max (ans, i - j + 1); // 更新答案
    }
    cout << ans << '\n';
    return 0;
}

一些 Hash 好题

P4824 [USACO15FEB] Censoring S

简要题意：给出字符串 s 和 t，需要从 s 中不断删除 t，后面字符会向前补全，求最终的 s。

每次删除时一定有完整的 t 串，可以用栈维护当前未被删除的字符，如果以栈顶为结尾的长度为 |t| 的子串等于 t 串，弹出栈顶 |t| 次。

时间复杂度 \mathcal O(|s|+|t|)。

[ABC377G] Edit to Match

这题 Hash 比 Trie 实现更加简单，推荐参考我的题解。

将 s_i 删空的代价为 s_i 的长度。

如果想要将 s_i 删掉末尾几个字符，再在末尾添上几个字符以达到和 s_1\sim s_{i-1} 中的一个相同，那么删掉末尾几个字符时一定成为 s_1\sim s_{i-1} 中的一个字符串的前缀。

设 f_t 为前缀 t 需要在末尾添上几个字符变成一个出现过的字符串的最小代价。对于每个字符串 s_i，枚举前缀，更新一下 f_t。

对于 s_i 的每个前缀 t，将 s_i 删为 t 的代价为 |s_i|-|t|，然后将 t 补全为 s_1\sim s_{i-1} 中的一个字符串的代价为 f_t。答案对 |s_i|-|t|+f_t 取最小值即可。

对于 f_t，t 可以使用字符串哈希，然后用 unordered_map 存一下。

大家对文章有什么好的建议欢迎在下面评论，有相关题目也可以提供，我会在修改时添加上贡献者，谢谢！