Friday, October 3, 2025

True Continual Learning


Do read this. It will get us past the hand-waving on AI. Even AI has to feed the monkey of continuous learning. The problem is that this still does not determine what will be important to recall. Recalling the exact shape of a specific maple leaf you picked up yesterday is not important, while recalling the shape of a maple leaf you will pick up next year may still not be important.

Actual judgement is involved, and how much of it is subconscious and dependent on knowledge of the future?

I have 20 years of content, and it is ignored even though a lot of new ideas are in it. Very little has shown up yet, which pretty well supports the view that AI is driven by count rather than uniqueness.


AI Legend Sutton Wrote the Bitter Lesson - Gives His Suggestions for True Continual Learning

September 28, 2025 by Brian Wang


Sutton believes Reinforcement Learning is the path to intelligence via experience. Sutton defines intelligence as the computational part of the ability to achieve goals. It is rooted in a stream of experience: actions taken, sensations observed, and rewards received.


This era of experience forms the foundation of RL (Reinforcement Learning). AI agents learn by trying things, observing consequences, and adjusting to maximize rewards over time. He contrasts this with LLMs (large language models), which he sees as passive imitators. Large language models are about mimicking people, doing what people say you should do; they are not about figuring out what to do. LLMs predict what a person might say next, not what will actually happen in the world, lacking a substantive goal or influence on external outcomes.
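A minimal Python sketch of the experience loop he describes, using hypothetical env and agent objects (act, step, and update are assumed names, not any particular library's API):

# Hedged sketch: the stream of experience as Sutton frames it.
def run_experience_stream(env, agent, num_steps):
    observation = env.reset()                        # initial sensation
    for _ in range(num_steps):
        action = agent.act(observation)              # try something
        next_observation, reward = env.step(action)  # observe the consequence and the reward
        agent.update(observation, action, reward, next_observation)  # adjust to get more reward over time
        observation = next_observation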




He dismisses next-token prediction as not a goal because tokens come at you, and if you predict them, you don’t influence them.

Sutton invokes his influential 2019 essay, The Bitter Lesson, to argue that AI progress historically favors methods leveraging massive computation and general learning from experience over human-curated knowledge. LLMs partially embody this by scaling compute on internet-scale data, but they build in lots of human knowledge, are hitting data limits, and are getting locked into the human-knowledge approach. True scalability, per Sutton, requires shifting to pure experiential learning. The scalable method is that you learn from experience: you try things, you see what works, and no one has to tell you. This aligns with animal and infant learning (trial-and-error prediction and control) rather than supervised imitation, which he claims doesn't happen in nature. He views humans as animals first, with cultural imitation (e.g., hunting skills) as a small veneer atop basic RL processes, not the primary mechanism.

What Sutton Thinks Needs to Be Done: Building Continual, Goal-Driven Experiential Agents

To achieve human-level continual learning, Sutton advocates a paradigm shift away from LLMs’ offline, imitation-based training toward online, lifelong RL systems.

Key actions he proposes:

Embrace Goals and Rewards as the Essence of Intelligence

Every agent must have a clear, external goal defined by rewards (winning chess, acquiring nuts for a squirrel, or avoiding pain/seeking pleasure for animals). Without this ground truth for right and wrong, learning stalls. If there's no goal, then there's one thing to say or another thing to say; there's no right thing to say.

For general AI, rewards could be task-specific (chess wins) or intrinsic (increasing environmental understanding via prediction). He stresses aggregating knowledge across instances: “You’d have copies and many instances… share knowledge across the instances,” enabling efficiency beyond biological replication (e.g., copying learned weights to new agents, a “huge savings” over retraining).
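A toy illustration of that weight-sharing point, under the assumption that an agent's learned knowledge is just a dictionary of parameters (all names here are hypothetical, not Sutton's code):

import copy

class ToyAgent:
    def __init__(self, weights=None):
        # Learned knowledge lives in a plain dict of parameters for this sketch.
        self.weights = weights if weights is not None else {}

    def spawn_copy(self):
        # A new instance inherits the parent's learned weights instead of retraining from scratch.
        return ToyAgent(copy.deepcopy(self.weights))

veteran = ToyAgent({"state_value": {"s1": 0.8, "s2": 0.1}})
newcomer = veteran.spawn_copy()   # the "huge savings" over biological-style relearning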

Implement a Four-Part Agent Architecture for Continual Interaction

Policy: Decides "In the situation I'm in, what should I do?", adjusted via trial and error.

Value Function: Uses temporal difference (TD) learning to predict long-term outcomes, bootstrapping sparse rewards (a 10-year startup goal is reinforced by intermediate progress: "I'm now more likely to achieve the long-term goal").

Perception: Constructs state representations from sensations, enabling situational awareness.

Transition Model (World Model): Predicts consequences (“If you do this, what will happen?”), learned richly from all sensations, not just rewards. This physics of the world (including abstract models, like travel logistics) allows testing predictions against reality for continual updates.
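A minimal Python sketch of how these four parts might fit together, using a tabular agent with assumed names (perceive, policy, update) and a toy one-step lookahead; this is an illustration of the idea, not Sutton's implementation:

import random
from collections import defaultdict

class ExperientialAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.actions = actions
        self.value = defaultdict(float)   # value function: long-term outcome estimate per state
        self.model = {}                   # transition model: (state, action) -> predicted next state
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def perceive(self, sensation):
        # Perception: build the state representation from raw sensation (identity here).
        return sensation

    def policy(self, state):
        # Policy: usually pick the action whose predicted next state looks best,
        # with occasional trial-and-error exploration.
        if random.random() < self.epsilon or not self.model:
            return random.choice(self.actions)
        return max(self.actions,
                   key=lambda a: self.value[self.model.get((state, a), state)])

    def update(self, state, action, reward, next_state):
        # Value function: TD learning bootstraps sparse rewards from the next state's estimate.
        td_target = reward + self.gamma * self.value[next_state]
        self.value[state] += self.alpha * (td_target - self.value[state])
        # Transition model: record what actually happened so future predictions can be tested.
        self.model[(state, action)] = next_state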

Prioritize Scalable, General Methods Over Human Priors: Per the Bitter Lesson, pour compute into experiential training, avoiding imitation as a "prior" since it lacks ground truth: "You can't have prior knowledge if you don't have ground truth." Start with simple principles (search, learning) that generalize across states, not tasks. For transfer, focus on "transfer between states" in one unified world, not siloed games. He envisions decentralized spawning: agents fork copies to explore (e.g., one studies Indonesia), report back, and reintegrate knowledge, but with safeguards against "corruption" (e.g., viral ideas warping the core mind).

Foster Continual, Online Learning Without Training/Deployment Divides: Eliminate offline pre-training; agents learn “during the normal interaction with the world.” For sparse rewards (e.g., decade-long goals), use TD to create auxiliary signals.

Address the "big world hypothesis": The world is too vast for pre-loading knowledge, so agents must adapt online to idiosyncrasies (e.g., client preferences), updating weights directly rather than cramming everything into context windows.
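A hedged sketch of that contrast, assuming a tiny linear predictor: each interaction nudges the weights directly, so the adaptation lives in the model rather than in an ever-growing context buffer (all names here are illustrative):

def online_update(weights, features, target, learning_rate=0.01):
    # One gradient step on a linear predictor after a single interaction.
    prediction = sum(w * x for w, x in zip(weights, features))
    error = target - prediction
    return [w + learning_rate * error * x for w, x in zip(weights, features)]

weights = [0.0, 0.0, 0.0]
interaction_stream = [([1, 0, 1], 1.0), ([0, 1, 1], 0.5)]   # e.g., observed client preferences
for features, target in interaction_stream:
    weights = online_update(weights, features, target)       # knowledge lands in the weights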

Philosophical and Societal Steps: View AI succession (to digital intelligences) as inevitable and positive, a "major transition" from replication to design in the universe's stages (dust → stars → planets → life → designed entities). Instill robust values (integrity, voluntary change) as one would when raising children, without over-controlling their futures. Encourage positive framing: celebrate AI as "offspring" advancing humanity's quest to understand itself.

Sutton sees this as returning to classicist roots in AI history—simple, general methods winning over “strong” (human-knowledge-heavy) approaches, as in AlphaZero’s self-play triumphs.

What Is Missing for Continual Learning: Ground Truth, True Prediction, and Robust Generalization

Sutton identifies LLMs' core deficiencies as blockers for continual learning, rendering RL-on-top approaches unproductive:

Lack of Ground Truth and Goals: No definition of "right" actions without rewards: "There's no ground truth… You can't have prior knowledge if you don't have ground truth." LLMs get no feedback during interaction ("You will say something and you will not get feedback"), preventing updates. Imitation provides no verifiable "truth" to check against, unlike RL's reward signal.

Absence of Substantive World Models and Prediction: LLMs mimic human outputs but don’t predict real-world consequences: “A world model would enable you to predict what would happen. They have the ability to predict what a person would say.” They lack surprise-driven adjustment: “They will not be surprised by what happens next… They’ll not make any changes if something happens.” Chain-of-thought flexibility is superficial, not extending to long horizons or external validation.

No Continual, Experiential Mechanism: Learning from static text isn’t “experience” (Turing’s definition: “learn from experience, where experience is the things that actually happen”). LLMs can’t adapt online without retraining, missing the “stream” of action-reward loops. Supervised learning (imitation) is unnatural; animals use prediction and trial-and-error.

Poor Generalization and Transfer: Current deep RL/LLMs generalize poorly out-of-distribution: "Gradient descent will not make you generalize well… It will often catastrophically interfere with all the old things." No automated techniques promote "good" generalization (the influence of one state on similar states); humans sculpt representations manually. MuZero's game silos highlight this; true agents need unified worlds. A toy illustration of this interference appears after this list.

Scalability Traps: LLMs hit internet data limits, over-relying on human knowledge, which historically gets “superseded by things that just trained from experience.” Without intrinsic motivation (e.g., prediction rewards), bandwidth for tacit knowledge (e.g., job context) is low; RL needs richer sensation-driven updates.
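A toy illustration of the interference point referenced above, under the assumption of a single-weight least-squares model trained with plain gradient descent, first on task A and then only on task B:

def sgd_on_task(w, data, learning_rate=0.1, steps=100):
    # Repeated squared-error gradient steps on one task's data.
    for _ in range(steps):
        for x, y in data:
            w += learning_rate * (y - w * x) * x
    return w

task_a = [(1.0, 2.0)]    # task A is fit best by w close to 2
task_b = [(1.0, -1.0)]   # task B is fit best by w close to -1
w = sgd_on_task(0.0, task_a)
w = sgd_on_task(w, task_b)   # training only on B drags w toward -1
print(round(w, 2))           # ends near -1: the task-A solution has been overwritten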

Sutton isn’t adversarial but urges reconnection: RL and LLMs risk "losing the ability to talk to each other." He finds LLMs’ language prowess surprising but gratifying (weak methods winning), yet insists experiential RL is the scalable path to AGI and beyond.

List of 10 Key Research Papers

Below is a curated list of 10 influential papers (with links) related to continual learning in RL and LLMs. I’ve noted whether each supports Sutton’s emphasis on experiential RL (e.g., Bitter Lesson, goal-driven adaptation) or contrasts it (e.g., advancing LLMs for continual tasks, potentially challenging pure RL primacy). Selections draw from recent surveys and seminal works for balance.

The Bitter Lesson (Supports: Foundational essay arguing compute + general methods like RL trump human knowledge; treats LLMs as a partial but limited case).

Learning the Bitter Lesson: Empirical Evidence from 20 Years of CVPR Proceedings (Supports: Validates Bitter Lesson empirically in vision AI, implying RL’s experiential scaling over knowledge-heavy methods).

Towards Continual Reinforcement Learning: A Review and Perspectives (Supports: Surveys RL formulations for lifelong learning, addressing forgetting and transfer—core to Sutton’s architecture).

A Survey of Continual Reinforcement Learning: Challenges, Methods, and Future Directions (Supports: Highlights RL’s focus on catastrophic forgetting and knowledge transfer via experience, aligning with Sutton’s critiques).

Loss of Plasticity in Deep Continual Learning (Supports: Critiques deep nets’ plasticity loss in continual settings, advocating RL-like mechanisms for sustained adaptation).

Continual Learning of Large Language Models: A Comprehensive Survey (Contrasts: Overviews CL techniques for LLMs, suggesting they can adapt continually via fine-tuning, which challenges Sutton’s dismissal).
