P9060 [Ynoi2002] Goedel Machine

Background

# 1. Preface

The Goedel Machine [1] is a learnable general-purpose problem solver proposed by Jürgen Schmidhuber, the father of LSTM. This article describes and discusses this model, and explains that it has many important properties that we believe are necessary for Artificial General Intelligence.

# 2. Related Concepts

## Learning

Learning means that a model modifies its own parameters (the same model behaves differently with different parameters). The method used to modify itself is called a learning algorithm.

## Meta-learning

Meta-learning means that a model can, through learning, modify its own learning algorithm. There is a lot of evidence suggesting that humans have relatively strong meta-learning abilities.

### Meta-learning Hierarchy

A non-learnable method has no learning ability: for example, a program that solves OI problems, or a World of Tanks script that follows a fixed route and stops when it sees an enemy. Common machine learning methods do not have meta-learning ability: naive neural network methods learn via the non-learnable gradient descent algorithm, support vector machines have a deterministic learning algorithm, and so on. Neural networks specially designed with meta-learning features do have meta-learning ability, but they do not have meta-meta-learning ability, meaning they cannot learn the algorithm that learns their learning algorithm.

Suppose you want to design a learning algorithm. At the beginning, you want your algorithm to be learnable, so you design an initial learning algorithm. Then you want your initial learning algorithm to be learnable, so you design a meta-learning algorithm. Then you want your meta-learning algorithm to be learnable, so you design a meta-meta-learning algorithm. At each step, you must design a new learning algorithm to learn the previous one.
When designing the learning algorithm at level $i$, this algorithm can optimize the previous $i-1$ levels, but optimizing level $i$ itself requires methods at level $i+1$ and above. The hierarchy is not necessary; it is artificially designed, and once you have designed a hierarchical structure, you need to design infinitely many layers of meta-learning algorithms.

If you keep recursing like this, you find a problem with existing machine learning methods: because you cannot manually design infinitely many layers of meta-learning algorithms, after some layer the model must become non-learnable. Consequently, manual design can always keep making the model stronger, since a model with level-$i$ meta-learning designed is no weaker than one with only the first $i-1$ levels designed, but this process never converges. Such a model lacks a kind of **unimprovability**: the property that, given enough computational resources, the model is optimal.

The Goedel Machine and similar constructions try to handle all levels of meta-learning with the same mechanism, turning meta-learning at any level into a single ordinary learning problem. To achieve this, we need a sufficiently powerful optimizer. Weak optimizers, such as neural networks with gradient descent, face the problem that you cannot use gradient descent to optimize the process of gradient descent itself; that is, you cannot directly use the learning algorithm to handle optimization at the meta-learning level. Therefore, meta-learning features must be designed by hand: you need to design an optimizer for meta-learning tasks, and that optimizer cannot be optimized by the methods designed so far, so you must keep designing new optimization algorithms. This hierarchical meta-learning requiring infinite design is one of the biggest problems of neural network methods.
### Non-hierarchical Meta-learning / Infinite-layer Meta-learning

The Goedel Machine has no hierarchical meta-learning design, because its architecture optimizes all optimization problems in the same way, and this optimization method can optimize itself. Therefore, the Goedel Machine has non-hierarchical meta-learning ability, which can also be called infinite-layer meta-learning. Given enough computational resources, the Goedel Machine is no weaker than any other computable machine learning model.

In fact, some meta-learning algorithms are recursive: there exists a meta-learning algorithm such that level $i$ can optimize levels $i-1$ and below, and the algorithm is the same for every $i$, so only finite manual design is needed. We believe such methods have problems of their own and are not as good as non-hierarchical meta-learning algorithms; we will not expand on this here. Meta-learning and learning are actually indistinguishable, which will be discussed later.

## Reinforcement Learning

A model interacts with the environment and optimizes the reward provided by the environment.

### Markov Decision Process

Time is divided into discrete time steps. At each moment, the model obtains input information from the environment (including the reward provided by the environment), and can choose operations to execute from a finite set of actions on the environment. The environment computes its next state from its current state and the actions taken by the model.

## Artificial General Intelligence / Full Artificial Intelligence / Artificial Broad Intelligence

These terms are Chinese translations of Artificial General Intelligence. The term denotes an AI model that can reach top human level in all domains, or even surpass it. Artificial General Intelligence **has no precise definition**; different people have different views on what abilities it should have.
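The Markov decision process described above can be sketched as a minimal interaction loop. The environment, action set, and reward rule below are hypothetical illustrations, not part of any formal definition:

```python
import random

random.seed(0)  # reproducible toy run

class ToyEnv:
    """A toy environment: the next state depends only on the current state
    and the chosen action (the Markov property)."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state = (self.state + action) % 5
        reward = 1.0 if self.state == 0 else 0.0  # reward comes from the environment
        return self.state, reward

env = ToyEnv()
actions = [0, 1, 2]              # finite action set
total_reward = 0.0
state = env.state
for t in range(100):             # discrete time steps
    action = random.choice(actions)    # a non-learning random policy
    state, reward = env.step(action)   # environment computes the next state
    total_reward += reward             # the model optimizes accumulated reward
```

A learning method would replace the random policy with one that adapts its parameters to increase `total_reward`.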
## Self-programming / Self-modification / Self-improvement

These three terms refer to one model ability: a model with this ability can improve its performance by modifying itself. Generally, improving performance means being able to use more data in supervised learning tasks, or obtaining higher expected reward in reinforcement learning tasks.

## Self-reference

A model is fully self-referential if it can internally describe all of its own properties. Although humans are not fully self-referential models, humans can describe some properties about themselves. Self-reference is often used to construct paradoxes, such as:

1. Is "This sentence is false" true or false?
2. The barber paradox (a popular form of Russell's paradox): a barber shaves all and only those who do not shave themselves; does the barber shave himself?
3. The halting problem: a Turing machine that does not halt if it determines that it halts.
4. Gödel's incompleteness theorem.

The Goedel Machine is a model that can be fully self-referential. The author of the Goedel Machine believes that the essence of "consciousness" is self-reference, i.e., a model being aware of its own existence. The learning algorithm of the Goedel Machine can be fully self-referential, meaning it can use the learning algorithm to learn the learning algorithm itself, and even completely overwrite the initial learning algorithm.

## Computability

When space is unlimited and word length is unlimited, the set of problems that a program can finish in finite time is the set of computable problems.

## Turing Machine

In this article, we directly treat a Turing machine as code written in C++, with unlimited space and unlimited word length.

### Universal Turing Machine

In C++, it is obviously possible to simulate and execute C++ code. Suppose we write a piece of C++ code A that can execute any other C++ code. Then, when we want to execute C++ code B, we can take B as the input of A and execute A. Such code A is called a universal Turing machine.
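The same construction is immediate in an interpreted language: a program that takes another program as plain data and runs it plays the role of code A. A minimal sketch, where the convention that the payload reads `x` and writes `y` is an assumption of the sketch rather than part of any formal definition:

```python
def universal(program_source, input_data):
    """A 'universal' program: takes any other program as data and runs it.

    The payload communicates via a shared namespace: it reads its input
    from `x` and writes its output to `y` (an assumption of this sketch).
    """
    namespace = {"x": input_data}
    exec(program_source, namespace)  # the program-as-data is executed
    return namespace.get("y")

# Payload program B is passed to A as plain data (a string):
program_b = "y = x * 2 + 1"
result = universal(program_b, 20)  # run B on input 20; result == 41
```

This is the program/data indistinguishability discussed in the next section: B is data to `universal`, yet it is also the program that does the actual work.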
## Indistinguishability

Many human-designed hierarchical structures are indistinguishable, such as:

1. Data and programs are indistinguishable. Here we also explain the Chinese Room argument. The Chinese Room is a black box in which you put a person who does not know Chinese together with a translator. People outside provide Chinese articles to the inside; the person inside feeds each article into the translator, directly gets the translated result, and outputs it. From the outside, you cannot determine whether the person inside knows Chinese. This argument is essentially equivalent to the following: we have a universal Turing machine and a piece of Artificial General Intelligence code. If we treat the universal Turing machine as the model (program) and treat the Artificial General Intelligence code as data to execute (for a universal Turing machine, the program of a Turing machine can be data, and a program given as data can be executed), then we can regard the universal Turing machine under the given data as an Artificial General Intelligence. That is, we can move the non-trivial part, the Artificial General Intelligence, freely between the program side and the data side, but we must solve the essential problem on one of the two sides.
2. The data in supervised learning and the model's prior (initially introduced knowledge) are indistinguishable; in reinforcement learning, human knowledge and the model's reward are indistinguishable. The loss function of a supervised learning model can embed arbitrarily complex information, including all information in the dataset; in this case, the model's loss is part of the model's prior. The reward function of a reinforcement learning model can embed arbitrarily complex information, including priors about the environment and human knowledge. For example, AlphaGo uses the MCTS search algorithm as an exploration prior; such a prior (search) is a kind of human knowledge.
3. A model's prior about itself and a model's prior about the environment are indistinguishable: this will be discussed later.
4. Learning and meta-learning are indistinguishable: suppose the model is divided into a learning algorithm and the main body of the model. If the learning algorithm is modified after learning, this can be regarded as meta-learning. But if we treat the whole model as a whole, learning just changes some parameters of the model, which is ordinary learning. This indistinguishability also shows that artificially imposing a hierarchy on meta-learning is unnatural; meta-learning optimization should be treated as ordinary learning.

Because of indistinguishability, we generally measure a model's ability by its function rather than a specific structure. For example, in the Chinese Room argument, we can consider a human with a translator as knowing Chinese, because the pair behaves the same as a person who knows Chinese, while the human without the translator does not know Chinese. Similarly, due to indistinguishability, discussions about "understanding" and "consciousness" are meaningless, because we measure a model's ability by its function rather than a specific structure.

# 3. Introduction

Almost all known algorithms for solving problems (including proof searchers, supervised learning, reinforcement learning, etc.) lack meta-learning, for the reasons described in the meta-learning hierarchy section. Humans need to continuously design new and better algorithms (algorithms with more layers of meta-learning) and provide computational resources to test them. How can we eliminate the need for humans?

For convenience, we define learning as the model modifying its own parameters. The method used to modify itself is called a learning algorithm.
The simplest idea is to let the machine perform arbitrarily complex learning, meaning the learning method must be strong enough: for example, selecting the most efficient learning algorithm from all computable programs and using it to modify itself. To achieve this, the author designed a class of optimal, fully self-referential, general-purpose problem solvers: the Goedel Machine. The Goedel Machine interacts with a partially observable environment and can in principle modify itself without limit (learn), the only restriction being its own computability. The initial algorithm of the Goedel Machine can completely rewrite itself, and this rewriting ability guarantees that the Goedel Machine is globally optimal.

## Structure of the Goedel Machine

The Goedel Machine mainly consists of an axiom system, a proof search program, an action program, and an objective function.

# 4. Environment and Formalized Objective

The model resides in human-designed hardware, which can be a universal Turing machine, a space-limited Turing machine, or an abstract electronic computer. The model is a singleton; only one instance exists at any time. The model's life consists of discrete time steps $1,2,\dots$ and ends at time $T$, which may be unobservable to the model. The value of any time-varying variable $Q$ at time $t$ is denoted $Q(t)$.

At each time step, the hardware executes one basic operation, changing the hardware state $s$. Without loss of generality, $s\in \mathcal S \subseteq B^*$, where $B^*$ denotes the set of all binary strings. At each time step the environment state may also change; the environment state is denoted $Env\in \mathcal E$, where $\mathcal E$ is the set of considered environments, discussed later. The hardware has a hard-coded transition function $F:\mathcal S\times \mathcal E \rightarrow \mathcal S$.
For any $t>1$, $s(t)=F(s(t-1),Env(t-1))$ denotes the hardware state after the first $t-1$ steps have ended and before time $t$ begins. $Env(t)$ denotes the environment at time $t$, obtained by interacting with the environment using the variables in $s(t-1)$ that describe environment changes; if the environment is random, it is sampled according to the corresponding probabilities.

Define four special variables $time, x, y, p$:

1. $time$ denotes the current time.
2. $x$ denotes the input from the environment to the Goedel Machine. For any $t>1$, $x(t)\neq x(t-1)$ may occur if and only if the Goedel Machine outputs a specific string to request input from the environment. This avoids the case where each time step has only constant computational resources, making it impossible to fully observe the input (for example, copying the entire input may require more computational resources than the per-step budget).
3. $y$ denotes the output from the Goedel Machine to the environment; $y(t)$ is a binary string. The output may affect the environment, for example a control command that operates a robot in the environment, or a command to turn on a TV.
4. $p(1)$ denotes the initial software of the Goedel Machine, i.e., the computation process that maps environment inputs to environment outputs. The initial $p(1)$ can be any computation representable by a Turing machine, and it need not be optimal.

At each time $1\le t\le T$, the model's goal is to maximize expected return. For a fixed $s$ and $Env$, the utility function $u:\mathcal S\times \mathcal E\rightarrow \mathcal R$ is

$$u(s,Env)=E_{\mu} \bigg [\sum _{\tau=time} ^T r(\tau)\bigg | s,Env \bigg ],$$

where $\mathcal R$ denotes the real numbers, $r(t)$ denotes the real-valued reward obtained at time $t$, and $E_{\mu}\big(\cdot \big|\cdot\big)$ denotes conditional expectation with respect to $\mu$, the distribution over all possible environments.
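As a toy illustration of this utility, when the environment distribution $\mu$ can be sampled, the expectation can be estimated by Monte Carlo rollouts. All functions below are hypothetical stand-ins for the formal objects:

```python
import random

def estimate_utility(rollout_reward, sample_env, time_now, T, n_samples=10000):
    """Monte Carlo estimate of u(s, Env) = E_mu[ sum_{tau=time}^{T} r(tau) ].

    `sample_env` draws an environment from mu; `rollout_reward(env, tau)`
    returns the reward r(tau) under that environment. Both are
    hypothetical stand-ins for the formal machinery.
    """
    total = 0.0
    for _ in range(n_samples):
        env = sample_env()
        total += sum(rollout_reward(env, tau) for tau in range(time_now, T + 1))
    return total / n_samples

# Toy setting: two equally likely environments with constant per-step reward.
random.seed(0)
sample_env = lambda: random.choice([1.0, 0.0])
rollout_reward = lambda env, tau: env
u_hat = estimate_utility(rollout_reward, sample_env, time_now=1, T=10)
# u_hat is close to 0.5 * 10 = 5.0
```

Of course, the Goedel Machine must reason about this expectation by proof rather than by sampling, which is exactly where the difficulties discussed later arise.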
Note that $s$ also represents the model's internal environment, because the hardware behavior is completely determined by the model, while $Env$ represents the external environment. $time=time(s)$ denotes the time of state $s$. The reward function can also be modified so that the model tends to choose improvements with lower variance. Most computer science problems can be formalized in this form.

# 5. Basic Idea of the Goedel Machine

To make our hardware a self-referential Goedel Machine, we need to design the corresponding software (i.e., runnable code) $p$, which needs the ability to modify itself. The initial code $p(1)$ contains a currently non-optimal problem solver $e(1)$ for interacting with the environment; this problem solver can be any traditional reinforcement learning algorithm. It also contains a universal proof searcher. The proof searcher continuously searches over pairs (program, proof) until, for some program A, it finds a proof B showing that after using program A to modify $p$, the resulting program has a larger expected return than $p$ had before the modification. The model then uses program A to modify $p$. This modification is not restricted to $e$; it may also affect the proof searcher in $p$.

## Self-modification Axiom

At time $t_1$, we use the found program A to modify $s(t_1)$ if and only if we have found a proof B showing

$$u[A(s(t_1)),Env(t_1)]>u[s(t_1),Env(t_1)],$$

where the hardware state $s$ includes the software $p$, and $A(s(t_1))$ denotes using program A to learn (modify) the model's software at time $t_1$. The Goedel Machine also needs many other axioms to become strong enough, such as axioms about the environment and axioms related to probability. We use $\mathcal A$ to denote an axiom system strong enough for the Goedel Machine.
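The search-then-switch loop above can be sketched conceptually. The pair enumeration and the `provably_better` check below are toy stand-ins for the formal proof machinery, not Schmidhuber's actual construction:

```python
import itertools

def enumerate_pairs():
    """Dovetail over all (candidate program, candidate proof) pairs.

    Both are modeled as integers here; a real Goedel Machine enumerates
    binary strings interleaved with proof-search steps.
    """
    for n in itertools.count():
        for a in range(n + 1):
            yield a, n - a

def provably_better(program, proof, state):
    """Toy stand-in for proof checking: does `proof` establish that
    rewriting `state` with `program` increases the utility u?"""
    return proof == program * 2 and program > state  # hypothetical criterion

def goedel_machine_step(state):
    # Search pairs until a proof of improvement is found, then self-modify.
    for program, proof in enumerate_pairs():
        if provably_better(program, proof, state):
            return program  # the self-rewrite: new software replaces the old
    # unreachable in this toy; a real machine may search forever

new_state = goedel_machine_step(3)
```

In the real construction the rewritten software includes the searcher itself, so a successful self-modification can change how all later searching is done.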
## Gödel's Incompleteness Theorem

An axiom system is consistent if and only if there is no proposition $P$ such that both $P$ and not-$P$ are provable in it. Gödel's incompleteness theorem says that for any effectively axiomatized system, if it is consistent and can define the natural numbers with addition, multiplication, and equality, then there must exist a proposition $P$ such that neither $P$ nor not-$P$ is provable in the system.

## Limitations and Advantages of the Goedel Machine

Gödel's incompleteness theorem implies that any sufficiently strong axiom system contains unprovable propositions. This does not make the Goedel Machine weak: if a proposition useful to the Goedel Machine is unprovable in its axiom system, then that proposition is also unprovable for humans working in the same axiom system, so humans cannot use it either.

A bigger limitation of the Goedel Machine is that in real environments, time and space resources are limited. If the initial axiom design is not good enough, then even with extremely large resources the machine may not find enough effective self-modification programs, leaving it an inefficient searcher even after consuming enormous resources.

The Goedel Machine needs many probability-related axioms, because observations in finite time can only produce finite data. With finite data, no proof can guarantee that it will not contradict later observations, so one can only assign a confidence to a proof rather than guarantee that it is certainly correct. In general environments, it is hard for the Goedel Machine to conduct strictly controlled experiments, hard to collect i.i.d. data, and hard to discover environmental changes on large time scales within a short time. This makes it hard to prove useful conclusions via strict probabilistic inference, so even after adding ordinary probability axioms the machine may still prove nothing useful.
These issues make the Goedel Machine look less like a formal system or a proof searcher. If we treat the initial software $p$ as the model's initial learning algorithm and treat self-modification as learning, the Goedel Machine looks more like a program searcher. In this view, an inconsistent or too-weak axiom system corresponds to an incomplete initial learning algorithm, such as a neural network.

The advantage of the Goedel Machine is that it can make arbitrary changes to its initial algorithm, and no other existing method can do this. For example, neural networks cannot learn their own gradient descent learning method; neural networks with meta-learning cannot learn their own meta-learning algorithm; searchers cannot learn their own search strategy; and so on. The essential difference is that the Goedel Machine's learning is stronger than that of these models: strong enough to completely overwrite the initial learning algorithm, while other models can only modify part of it.

# 6. Global Optimality Theorem

## Self-modification Axiom and Global Optimality

For any formalized utility function $u$, assuming the Goedel Machine's axiom system $\mathcal A$ is consistent, any self-modification program $p$ derived from the self-modification axiom is globally optimal: the maximum expected return of the Goedel Machine before executing self-modification $p$ is lower than after executing it. This self-modification looks like a greedy algorithm, greedily improving in a better direction each time. Why is it globally optimal? We can prove that a program $p$ satisfying the self-modification axiom is, with respect to the current state of the Goedel Machine, the best modification among all possible programs. The proof idea is simple. After finding $p$, the Goedel Machine has two options:

1. Execute the self-modification program $p$.
2. Do not execute $p$, and continue searching for other programs to execute.
The second option accounts for all programs that might be found in the future, and the Goedel Machine has proven that the first option is better than the second, so program $p$ is the best among all programs under axiom system $\mathcal A$. In practice, optimality here must account for the expected cost of finding each program: there may be self-modifications that work better than $p$ but take a long time to find, or executing $p$ may let the model find other self-modifications faster. It is exactly for these reasons that the Goedel Machine proves that executing $p$ yields a larger expected return.

## $O()$-optimality

The Goedel Machine is $O()$-optimal on any task, meaning its computational resource usage differs from that of the provably theoretically optimal solution (if one exists) by only a constant factor. This is because the initial algorithm of the Goedel Machine guarantees that searching over all programs and proofs is $O()$-optimal, and the Goedel Machine cannot be weaker than its initial algorithm, so it is $O()$-optimal as well.

$O()$-optimality is not a very strong property. Suppose a problem's optimal solution is program A. An $O()$-optimal model only needs to spend constant computational resources to find program A and prove it better than the model's previous solution, thereby achieving $O()$-optimality. Comparing two proof searchers is similar: one proof searcher can spend constant time finding the other and proving its optimality. An $O()$-optimal optimizer requires a huge constant startup cost: such an optimizer performs very poorly at the beginning and must search for a very long time before becoming strong enough. We believe this process is similar to evolution in nature, and we discuss their relationship later.

# 7. Bias-optimal Searcher (an optimal searcher under a given prior)
A bias-optimal searcher is defined as follows. We are given a prior probability $P(p|r)$, the probability that problem $r$ should choose solution $p$, and a black-box function that can determine within time $t(p,r)$ whether $p$ is a solution of $r$. Within a finite time budget $T$, for any problem $r$: if there exists a solution $p$ satisfying $t(p,r)\le P(p|r)\times T$, then a bias-optimal searcher finds a solution to the problem. A near-bias-optimal searcher relaxes the condition to $t(p,r)\le P(p|r)\times T/n$, where $n>1$ is the approximation ratio. Bias-optimal searchers have known constructions, such as the Optimal Ordered Problem Solver, which we do not explain here. The searcher's concern is how to allocate computational resources according to the given prior; $p$ can be a program, and $t(p,r)$ its running time. The Goedel Machine only needs a bias-optimal searcher as its initial proof searcher to be $O()$-optimal, because a bias-optimal searcher is $O()$-optimal.

# 8. Defects of the Goedel Machine and Related Discussion

## Prediction Rather Than Proof

Since limited interaction leaves uncertainty about the environment, a finite history of interaction should not suffice to prove reliable environment properties over infinite time. If you try to construct a formal system stronger than the original axiom system and use it for proofs, you cannot prove the consistency of the new system; you can only use limited experience to support the claim that it is useful (for example, by using it normally until a contradiction is found). If you try to add assumptions about the environment, each added assumption yields a new formal system, but the assumption itself cannot be proven in the initial axiom system, and the same argument applies.
From this perspective, using proofs to infer the environment is similar to ordinary meta-learning; the "proof" here is better described as a prediction of the future based on existing data. On the other hand, a formal system can prove propositions about infinity. Such propositions cannot be obtained from finite data, but may produce verifiable and useful implications. If such a construction is useful, the initial learning algorithm should be able to build such a formal system and use it for proofs, without regarding the formal system as absolutely reliable. From this perspective, a formal system is similar to an ordinary program.

## Computational Resource Limits

The Goedel Machine assumes an ideal environment: the model survives for time $T$ and optimizes the sum of rewards over its lifetime.

### Finite Lifetime

A finite lifetime restricts the model, preventing it from reaching global optimality. Due to the inefficiency of the Goedel Machine, we believe that, realistically, currently available compute is not enough to optimize the Goedel Machine to be strong enough; it may even be unable to find any provable improvement. With a finite lifetime, it is in fact difficult for the Goedel Machine, under finite data, to give expected gains within time $T$ over all possible environments, unless it makes overly strong assumptions about the environment. This is because the maximum expected return designed for the Goedel Machine requires computing, for every possible environment, the probability that the environment occurs, and then computing the expected reward by weighting. Assuming finitely many possible environments is obviously too strong, because it would make the model not Turing-complete and degenerate it into a DFA; when the possible environments are infinite, this expectation cannot be computed in finite time.
Even if you write code for a Goedel Machine satisfying the properties described in this article, without overly strong assumptions about the environment the expected return is uncomputable, so under finite computational resources the Goedel Machine will get stuck at the first operation and exhaust all of its compute. There are some modifications that do not require computing expected return, such as rewriting `a=a+1` as `++a`, but without overly strong assumptions about the environment it is hard for the Goedel Machine to prove that any slightly more complex modification is better.

### Infinite Lifetime

An infinite lifetime breaks the utility function, because an infinite time horizon makes

$$u(s,Env)=E_{\mu} \bigg [\sum _{\tau=time} ^\infty r(\tau)\bigg | s,Env \bigg ]$$

unbounded, and two expected rewards that are both infinite cannot be compared. This is why existing reinforcement learning methods introduce a reward discount factor $\gamma<1$: if each reward is bounded by $r_{\max}$, then the discounted return is at most $r_{\max}/(1-\gamma)$, so the expected reward of each state is bounded. An infinite lifetime may also make the reward at each moment $r(\tau)$ unbounded. With an infinite lifetime, the model in fact cannot optimize reward alone; it must also optimize its own survival and computational resources, because if it needs humans to provide infinite computational resources, it cannot be independent of humans and cannot "eliminate the need for humans".

## Startup Inefficiency

If we introduce a large number of probability priors into the axiom system and make the expected return computable, the Goedel Machine still faces the problem of an overly inefficient startup. Simply designed general methods, such as the Goedel Machine and some other $O()$-optimal proof search methods, usually require a very long time before becoming sufficiently good. For example, the optimal method for a specific task takes time $t$, while the general method takes time $a+t\times b$, where $a$ is a huge constant.
We call $a$ the **startup constant**: the cost of finding the optimal method during the startup phase. $b$ is the constant overhead of simulating the optimal method (this is generally not serious; under reasonable assumptions we can achieve $b=1$). To obtain a practical general method, the main design difficulty is reducing the startup constant $a$. Therefore, we first construct a simple general method that allows a larger startup constant $a$, and then reduce $a$ through specific design.

### Evolution

Evolution is a process that naturally generates arbitrarily complex structures, and the organisms it generates can in turn generate arbitrarily complex structures by affecting the environment. Evolution is very different from known optimization processes. Known optimization processes have explicit objectives, such as neural network methods optimizing a loss via gradient descent, or reinforcement learning methods maximizing expected return; evolution does not optimize any preset objective. Evolution can be viewed from two perspectives:

1. Evolution is a dynamical system evolving along the time axis under deterministic rules. Evolution has no objective, but it can generate arbitrarily complex structures. From this perspective, there is no agent, only an ever-evolving environment, whose evolution is complex enough to generate arbitrarily complex structures, such as the Game of Life. An example shows the complexity of evolution: how can a flying gold block be generated in the universe? Evolution first generated humans in the environment; then humans built the Voyager probe, which carries a gold-plated record made by humans, and the generation of humans and of gold is itself an extremely complex process if expanded. Genetic algorithms are not evolution, because they cannot build arbitrarily complex structures in the environment and cannot simulate the complex interactions among organisms.
2. Evolution is a process in which agents constantly interact with the environment: computation and interaction consume their computational resources, and they can obtain computational resources from the environment. When an agent's computational resources reach $0$, it dies, so it needs to optimize its survival time.

The boundary between agent and environment is indistinguishable. An agent can treat other agents as part of the environment; modeling other agents (e.g., game theory) treats them as a more predictable part of the environment and studies that part's properties. This shows that agents and the environment are indistinguishable, and the boundary of an agent in the environment is fuzzy. For example: humans eat food from the environment; is the eaten food part of the human? Are humans part of the human population in the environment? Humans created robots by operating on the environment; are robots part of humans?

Humans are complex structures produced after a very long time of evolution. If we treat evolution as a method for generating Artificial General Intelligence, then evolution generates Artificial General Intelligence the way it generates a flying gold block: by first generating humans and then generating these complex structures through humans. The process of generating humans is the huge startup constant $a$ mentioned above. By introducing suitable human-designed priors into the model without breaking completeness, we should be able to obtain an Artificial General Intelligence that needs only a smaller startup constant and achieves good performance on most tasks.

### Incompleteness of Neural Networks

Neural networks are an example of introducing human-designed priors into a model while breaking completeness. Breaking completeness makes neural networks unable to solve all problems in a general way, but lets them solve some specific problems with a very low startup constant.
The incompleteness of neural networks manifests in three ways:

1. **The searched computation graphs are limited.** Neural networks search a relatively static class of computation graphs. Despite dynamic attempts such as MoE methods, Transformer sparsification, and external-memory methods (similar to Neural Turing Machines), the searchable program space remains limited. Although neural networks are Turing-complete, static computation graphs are computationally inefficient: computing a multiplication with a Transformer requires a long autoregressive process to emit intermediate results, while the multiplication itself needs only a small amount of computation. Because their structure is fixed, neural networks cannot improve their structure through search (neural architecture search, a form of meta-learning, is discussed later).

2. **They cannot traverse the state space.** Common gradient-descent methods require the learning rate to decay gradually, otherwise training does not converge, so each random initialization explores only a small part of the state space. A better neural-network-style method is to randomly initialize many networks, train all of them to convergence, and at test time average their outputs or take the maximum likelihood; this reduces variance because it explores the state space more broadly.

3. **Limited meta-learning ability.** Naive neural-network methods are based on gradient descent with a fixed per-step learning rate, so they have no meta-learning ability. Even networks with meta-learning features only learn the parameters of gradient descent and cannot substantially modify their own learning algorithm. Neural-network methods work well on known formal problems because they include many human-designed tricks specialized for such problems, but meta-learning corresponds to unknown problems.
Limited learning ability causes neural networks to learn poorly on such unknown structured problems, and failing to meta-learn well further limits their learning ability. Algorithm engineers who study neural networks hope to achieve Artificial General Intelligence with them and keep designing more general models, but we believe this approach is unreasonable.

Consider a Transformer model. To avoid the quadratic cost of attention, suppose it implements an excellent attention strategy that accesses only $O(1)$ attention positions per step yet obtains the information that originally required attending to $\Theta(n)$ positions. Suppose further that such a Transformer has learned correct hard memory access, and that training and testing are not separated: its training method is also strong, not limited to gradient descent, free of catastrophic forgetting, and able to simulate any dynamic computation graph with constant overhead. Such a model would be an Artificial General Intelligence. In it, the neural-network part degenerates into a for loop, because the Transformer becomes an ever-growing functional memory: each step needs only constant-scale computation, reading from $O(1)$ memory positions and writing $O(1)$ information. The true complexity shifts to designing an excellent attention strategy and training it.

We believe automated theorem proving and automatic code generation are sufficiently complex tasks. A model that far surpasses human performance on these two tasks should be able, like the Goedel Machine, to explore potentially effective improvements and modify itself through proofs and self-programming. Recently, neural-network-style methods have made great progress, such as AI image generation, AlphaCode, ChatGPT, and DreamerV3.
Nevertheless, we believe they cannot achieve Artificial General Intelligence through neural networks, because the learning algorithms of all these models are essentially weak and hard to improve. We hope the reasoning given in this article is wrong, and that they can achieve Artificial General Intelligence and realize the technology countless people dream of.

## What Is Good?

If we could define Artificial General Intelligence and prove the correctness of a specific implementation, then we could use a proof searcher to find and run an implementation of Artificial General Intelligence; this proof searcher is itself an implementation of Artificial General Intelligence. The time the searcher takes to find Artificial General Intelligence is a huge constant, and it is part of the startup constant of the proof searcher viewed as an Artificial General Intelligence. But before such an implementation is found, we do not know that any specific implementation with provable correctness exists; note that we may be able to prove existence yet be unable to prove that an arbitrary instance is a witness of that existence. Moreover, the formal definition itself depends on human experience and is not guaranteed to be correct; if it is wrong, humans must discover and fix it, so this does not "eliminate the need for humans".

Under a definition based on permanent survival, the existence and correctness of Artificial General Intelligence are likely unprovable, because we do not understand the environment's properties well enough and judging permanent survival requires observation over infinite time. Under a definition based on optimizing long-term reward in an ideal environment, proofs seem easier. For the Goedel Machine, the complexity of Artificial General Intelligence shifts to judging which modifications are beneficial in the future, i.e. "what modifications are good?"; the rest of the design is relatively trivial.
We believe the problem of the Goedel Machine is precisely that it does not solve this problem. Relying on an initial axiom system to directly prove that the modified model obtains a larger expected return than before is too inefficient, and may not even be possible, because the paper gives no concrete implementation of an axiom system that can compute expected return over all environments. If we knew how to judge what is good, then by the global optimality theorem we would only need a proof searcher that searches all possible programs with a completeness guarantee (an exploration mechanism), judges whether each modification is good, and applies the good ones (a learning mechanism), and we could achieve Artificial General Intelligence. So what is good? The editor would also really like to know. Currently we have only a few vague designs, and even after filling in the details they cannot fully solve this problem.

# 9. Q&A

1. Q: Do formal proofs have meaning in the real world?
   A: Yes. You need to introduce uncertainty and probability-related axioms, as well as properties of the real world, into the Goedel Machine.

2. Q: Will the Goedel Machine use a self-modification that destroys itself? Will it get huge rewards by modifying its own reward function?
   A: The Goedel Machine only applies a self-modification when it has proved that the modification increases the maximum expected reward under the current reward function. If a self-modification may cause self-destruction, the probability-related axioms apply: the model has some probability of self-destruction and some probability of obtaining a larger expected reward. If the Goedel Machine modifies its own reward function, then it must have proved that using the new reward function increases the maximum expected reward under the original reward function.

3. Q: Automated proving is very hard. How can the Goedel Machine work?
   A: Human work today already uses automated proof searchers extensively.
   Since proofs and programs are equivalent, neural networks can also be regarded as a kind of proof searcher. These existing proof searchers are basically not $O()$-optimal: they are designed to handle specific tasks with a small startup constant on those tasks, but on tasks outside their design range they may require an unboundedly large startup constant.

4. Q: Does the "No Free Lunch Theorem" imply that there is no general problem solver?
   A: No. The "No Free Lunch Theorem" constructs a special kind of problem in which the data is ambiguous. Problems we usually handle do not have such bad properties, and humans cannot perform well on such problems either.

# 10. Conclusion

The self-improvement property of models has long been considered important. For example, Turing [2] observed that rather than spending enormous computational resources designing thousands of functional modules, it is better to directly design a human baby and let it learn in the environment. This property is self-improvement [3], i.e. learning. The importance of learning is obvious: as the environment changes, a model that cannot learn will certainly be eliminated by it, and if Artificial General Intelligence were achieved by humans continuously modifying AI functional modules, humans would need to live forever.

Later, researchers found that the quality of the learning algorithm significantly affects model performance. Neural networks initially required too much compute and were beaten by models such as SVMs, but with improvements in compute, model structure, and training tricks they have become the most widely used models today. However, we believe the learning algorithm of neural networks is essentially weak, and Artificial General Intelligence needs a sufficiently strong learning algorithm. For such a learning algorithm, rather than spending enormous computational resources to design it directly, it is better to design a general meta-learning mechanism.
However, current AI research does not pursue generality. Many models are designed for specific tasks and cannot work on others. Of the remaining models, some can convert any input and output into sequences and solve tasks generically via sequence prediction, but these models lack sufficient meta-learning ability and cannot learn a sufficiently complex learning algorithm, so they cannot solve non-trivial tasks such as program generation and automated proving. Current research on general models mainly improves performance by adding tricks, such as using higher-quality data or reinforcement learning from human feedback (RLHF). We believe these tricks would also help a Goedel-Machine-like Artificial General Intelligence by reducing its startup constant, but using them to enhance an incomplete model (such as a neural network) is unreasonable: we should first design a complete model, and then use these tricks to train it.

The Goedel Machine is a model with a non-hierarchical meta-learning design and an $O()$-optimal learning algorithm. But we believe its design does not fully resolve the complexity of Artificial General Intelligence, because its startup constant is too high and the design of the axiom system $\mathcal A$, while non-trivial, is not given by the author. Nevertheless, we still consider the Goedel Machine a very meaningful model and an important milestone in AI. A non-hierarchical meta-learning model avoids the problems of the hierarchy, but it shifts all complexity into learning. Learning means not only learning from our existing tasks and data, but also learning from arbitrarily complex structures generated during the model's own learning process. Solving the learning problem is equivalent to writing the code for Artificial General Intelligence, and the core issue of learning is what is good.
If a model can judge what is good, it can choose beneficial self-modifications and gradually become stronger. So what is good? The editor would also really like to know. Currently we have only a few rather vague designs, and even after completing the details they still cannot fully solve this problem.

# 11. Group Chat

If you are interested in related content, you can join the group 756872300 to discuss.

# 12. References

[1] Jürgen Schmidhuber (2006). Goedel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements. https://arxiv.org/pdf/cs/0309048v5.pdf

[2] Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59, 433–460. https://doi.org/10.1093/mind/LIX.236.433

[3] McCarthy, J., Minsky, M. L., Rochester, N., & Shannon, C. E. (1955). A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence. http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html

Description

Since you cannot design a Goedel Machine, you decide to first solve a data structure problem. You are given a sequence $a_1\cdots a_n$ of length $n$ and must answer $m$ queries. The $i$-th query gives an interval $[l_i,r_i]$: over all $2^{r_i-l_i+1}-1$ non-empty subsets of positions in $[l_i,r_i]$, take the greatest common divisor of the corresponding values, and compute the product of these greatest common divisors. Since the answer may be very large, for each query output the result modulo $998244353$.
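To make the quantity being asked for concrete, here is a brute-force reference sketch that enumerates every non-empty subset directly. It is exponential in the interval length, so it is only usable for checking tiny cases; the function name `subset_gcd_product` and the 1-indexed $(l, r)$ convention are our own, not part of the problem.

```python
from math import gcd
from itertools import combinations

MOD = 998244353

def subset_gcd_product(a, l, r):
    """Product of the gcds of all non-empty subsets of a[l..r]
    (1-indexed, inclusive), modulo MOD.

    Exponential brute force over all 2^(r-l+1) - 1 subsets; a real
    solution to the problem must be far faster.
    """
    xs = a[l - 1:r]  # convert the 1-indexed interval to a 0-indexed slice
    result = 1
    for k in range(1, len(xs) + 1):
        for comb in combinations(xs, k):
            g = 0
            for v in comb:
                g = gcd(g, v)  # gcd(0, v) == v, so g ends as gcd of comb
            result = result * g % MOD
    return result
```

For example, on $a = (2, 4, 6)$ with query $[1, 3]$, the seven subset gcds are $2, 4, 6, 2, 2, 2, 2$, and `subset_gcd_product([2, 4, 6], 1, 3)` returns their product $768$. One plausible direction for a faster solution (stated here only as a hint, not as the intended approach) is to work with divisor counts: the number of non-empty subsets whose gcd is divisible by $d$ is $2^{c_d}-1$, where $c_d$ counts multiples of $d$ in the interval, and the exact-gcd exponents follow by inversion.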

Input Format

The first line contains two positive integers $n,m$, with the same meaning as in the statement. The next line contains $n$ positive integers describing the sequence $a_1\cdots a_n$. The next $m$ lines each contain $l_i,r_i$, describing the $i$-th query.

Output Format

Output $m$ lines. For each query, output the answer modulo $998244353$.

Explanation/Hint

Idea: ouuan&lk; Solution: ccz181078; Code: ouuan&lk; Data: ouuan&lk.

- For $10\%$ of the test data, $n,m\le 10$.
- For another $10\%$ of the test data, $n,m\le 1000$.
- For another $20\%$ of the test data, $1\le a_i\le 1000$.
- For another $10\%$ of the test data, for all $1\le i