CF2181L LLM Training
Description
You are given a text dataset. Your task is to train an LLM (Large Language Model) and find the minimal possible loss. No kidding.
A text dataset is an array of texts $ t_1, t_2, \ldots, t_n $ . Each text $ t_i $ is a sequence of tokens. We define the set of tokens $ T $ as the set of all tokens that appear in at least one text $ t_i $ . Additionally, for each text $ t_i $ , there is a set of positions $ L_i \subseteq \{1, 2, \ldots, |t_i|\} $ . The token $ t_i[j] $ is generated by the LLM if $ j \in L_i $ and is written by the user if $ j \notin L_i $ .
Let us define an LLM with context size $ k $ as a probabilistic model $ P_k $ that defines the probability distribution of the next token of a sequence, depending on a context $ w $ — a sequence of length between $ 0 $ and $ k $ (inclusive) whose elements are from $ T $ . Thus the probabilistic model $ P_k $ is a large table of probabilities $ P_k(\text{next} | w) $ , defined for every context $ w \in T^{*} $ , $ 0 \leq |w| \leq k $ , and every token $ \text{next} \in T $ . The conditions $ 0 \leq P_k(\text{next} | w) \leq 1 $ and $ \sum\limits_{\text{next} \in T} P_k(\text{next} | w) = 1 $ must be satisfied.
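For intuition (not part of the formal statement), one can think of $ P_k $ as a lookup table. Below is a minimal Python sketch of such a table together with a validity check; the names (`Model`, `is_valid_model`) are illustrative assumptions, and a real model must define a distribution for every context of length at most $ k $ , not only the ones listed.

```python
from typing import Dict, Tuple

Context = Tuple[str, ...]                 # a sequence of 0..k tokens from T
Model = Dict[Context, Dict[str, float]]   # P_k(next | w) stored as nested dicts

def is_valid_model(P: Model, eps: float = 1e-9) -> bool:
    """Check the two conditions from the statement for every stored context:
    each probability lies in [0, 1] and they sum to 1 over all tokens in T."""
    for dist in P.values():
        if any(p < 0.0 or p > 1.0 for p in dist.values()):
            return False
        if abs(sum(dist.values()) - 1.0) > eps:
            return False
    return True

# Toy example with T = {"a", "b"} and k = 1 (only illustrative contexts shown):
P1: Model = {
    (): {"a": 0.5, "b": 0.5},        # empty context, used for the first token
    ("a",): {"a": 0.25, "b": 0.75},
    ("b",): {"a": 1.0, "b": 0.0},
}
assert is_valid_model(P1)
```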
The loss function of an LLM with context size $ k $ is the following function defined for $ P_k $ :
$$$
\mathcal{L}_k(P_k) \,\, = \,\, \sum_{i=1}^{n} \,\, \sum_{j\in L_i} \, -\log_2 P_k\!\left( \underbrace{t_i[j]}_{\text{next token}} \ \middle|\ \underbrace{t_i[\max(1,j-k)\,..\,j-1]}_{\text{context}} \right)
$$$
Here $ t_i[l\,..\,r] = t_i[l] t_i[l+1] \ldots t_i[r] $ is the substring from the $ l $ -th to the $ r $ -th token, and $ t_i[1\,..\,0] $ is the empty string. So, for each text and for each token generated by the LLM, we add to the loss the negative logarithm (base 2) of the probability that this token is generated, given the substring of the previous $ k $ tokens (or the whole preceding prefix, if it is shorter than $ k $ ). If the probability is zero, we assume that the negative logarithm is $ +\infty $ . This loss function is known as the (base 2) Cross Entropy Loss over the LLM-generated positions. The smaller the value $ \mathcal{L}_k(P_k) $ , the better the LLM $ P_k $ is.
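As an illustration of this definition (a sketch only, with hypothetical names such as `loss_k`), the loss of a given model could be evaluated as follows, using the same truncated context $ t_i[\max(1,j-k)\,..\,j-1] $ and treating zero probability as $ +\infty $ :

```python
import math
from typing import Dict, List, Set, Tuple

def loss_k(P: Dict[Tuple[str, ...], Dict[str, float]],  # P_k(next | w)
           texts: List[List[str]],                      # t_1, ..., t_n
           llm_positions: List[Set[int]],               # L_1, ..., L_n (1-based)
           k: int) -> float:
    total = 0.0
    for t, L in zip(texts, llm_positions):
        for j in L:                                     # only LLM-generated tokens
            # context t_i[max(1, j-k) .. j-1], converted to 0-based slicing
            context = tuple(t[max(1, j - k) - 1 : j - 1])
            p = P.get(context, {}).get(t[j - 1], 0.0)
            total += math.inf if p == 0.0 else -math.log2(p)
    return total
```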
For each $ 0 \leq k < \max\limits_{i=1..n} |t_i| $ , calculate the minimum possible loss $ \mathcal{L}_k(P_k) $ that can be obtained by some $ P_k $ — an LLM with context size $ k $ . It can be proved that this minimum is reachable and is not infinite.
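One way to see why the minimum is finite and reachable (a standard cross-entropy fact, stated here only as a side remark): the loss splits into independent groups by context. Fix a context $ w $ , and suppose that among the LLM-generated positions whose context equals $ w $ the token $ x $ occurs $ c_x $ times, with $ C = \sum_x c_x $ . Then, by Gibbs' inequality,

$$$
\sum_{x:\,c_x>0} -c_x \log_2 P_k(x \mid w) \;\ge\; \sum_{x:\,c_x>0} c_x \log_2 \frac{C}{c_x},
$$$

with equality for $ P_k(x \mid w) = c_x / C $ , which is a valid probability distribution; summing these finite per-context minima over all contexts shows that the overall minimum is finite and attained.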
Input Format
The first line contains a single integer $ n $ ( $ 1 \leq n \leq 10^5 $ ) — the number of texts in the dataset. Text descriptions follow.
The first line of the $ i $ -th text description contains a single integer $ m_i $ ( $ 1 \leq m_i \leq 3 \cdot 10^5 $ ) — the length of $ t_i $ ( $ m_i = |t_i| $ ).
The next line contains $ m_i $ strings $ t_{i}[1] $ , $ t_{i}[2] $ , $ \ldots $ , $ t_{i}[m_i] $ ( $ 1 \leq |t_{i}[j]| \leq 5 $ ) — the tokens of the text $ t_i $ . Each token consists of characters with ASCII codes from $ 33 $ to $ 126 $ (printable characters).
The next line contains a string $ \ell_i $ of $ m_i $ letters $ \texttt{U} $ and $ \texttt{L} $ , which encodes the set $ L_i $ . All positions with the letter $ \texttt{L} $ are generated by the LLM, and all positions with the letter $ \texttt{U} $ are written by the user. So $ L_i = \{j\,|\,\ell_{i}[j] = \texttt{L}\} $ . It is guaranteed that the last token is generated by the LLM, so $ \ell_{i}[m_i] = \texttt{L} $ .
It is guaranteed that the sum of $ m_i $ for all $ i $ ( $ 1 \le i \le n $ ) does not exceed $ 3 \cdot 10^5 $ .
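The following is a small parsing sketch (illustrative only; function and variable names are assumptions, not fixed by the problem) that reads a dataset in this format from standard input:

```python
import sys

def read_dataset():
    data = sys.stdin.read().split()
    pos = 0
    n = int(data[pos]); pos += 1
    texts, llm_positions = [], []
    for _ in range(n):
        m = int(data[pos]); pos += 1
        tokens = data[pos:pos + m]; pos += m   # t_i[1..m_i]
        flags = data[pos]; pos += 1            # string of 'U'/'L' of length m_i
        texts.append(tokens)
        llm_positions.append({j + 1 for j, c in enumerate(flags) if c == 'L'})
    return texts, llm_positions
```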
Output Format
Print $ M = \max\limits_{i=1..n} m_i $ real numbers: for each $ k = 0, 1, \ldots, M-1 $ , print the minimum possible loss $ \mathcal{L}_k(P_k) $ over all possible $ P_k $ — LLMs with context size $ k $ .
Your answers will be accepted if their absolute or relative errors do not exceed $ 10^{-6} $ ; formally, if $ p $ is your answer and $ q $ is the jury's answer, the following should hold: $ \frac{|p - q|}{\max\{1, |q|\}} \le 10^{-6} $ .
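For reference, the acceptance criterion above amounts to the following check (a sketch; the function name is not part of the statement):

```python
def accepted(p: float, q: float) -> bool:
    # absolute-or-relative error at most 1e-6, as in the formula above
    return abs(p - q) / max(1.0, abs(q)) <= 1e-6
```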