LLMs and Agents

Agents: From Inference-Time Scaffolding to Inference-Time Compute

In under three years, LLM-based agents sprinted from useful tools to powerful coding and computer-use systems. This is a survey of one specific arc: how "reasoning" and "autonomy" moved from external scaffolding, to imitation learning, to self-improvement (first SFT-based, then RL-based), eventually arriving at models that perform these behaviors natively within their own reasoning trace.

Arc at a glance

1. External scaffolding: ReAct, Reflexion, ToT
2. Imitation learning: Orca, FireAct
3. Self-improvement loops: STaR, ReST-EM (SFT)
4. RL optimization: Quiet-STaR, GRPO, R1

From scaffolded behavior to learned reasoning behavior

Autonomy and reasoning through external scaffolding

The progression from passive language models to agentic systems didn’t happen in one leap. Early progress came from wrapping models in externalized scaffolding: explicit traces (so the model could “think” in text), interaction loops (so it could act and observe), evaluators (so it could tell success from failure), and memory (so it could improve across trials).

ReAct (Yao et al., 2022)

The ReAct paper introduces an inductive bias: it represents problem-solving as an explicit Thought -> Action -> Observation loop. A "thought" in this context is a language step that doesn't change the environment but updates the agent's working context (illustrated in the interactive example below), while an "action" queries or manipulates the environment and returns a new observation. This scaffold makes long-horizon behavior more reliable because each added thought (a plan, a subgoal status, a retrace of failed directions) reshapes what the model conditions on next, and each observation re-anchors the trajectory in external evidence.

A key ReAct insight is that much of the relevant competence already exists in a strong pretrained LLM. ReAct largely changes how that competence is exercised at inference time by adding structured state and closed-loop feedback rather than changing the weights.

In the same era, other methods (e.g., Self-Refine, which iteratively generates feedback and revises a draft) reinforced a similar theme: adding a structured inference-time loop, whether via environment observations or self-critique, can improve reliability across benchmarks without additional training. In hindsight, these works were early hints of inference-time scaling.
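The loop itself is small. Below is a minimal sketch of a ReAct-style loop, assuming an `llm` callable that emits Thought/Action text and a `tools` dict; the stubbed model and `search` tool are mine, purely for illustration, not from the paper:

```python
import re

def react_loop(question, llm, tools, max_steps=5):
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(context)                      # model emits Thought (+ Action)
        context += step + "\n"
        match = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if not match:                            # no action -> final answer
            return step
        name, arg = match.groups()
        observation = tools[name](arg)           # act on the environment
        context += f"Observation: {observation}\n"  # re-anchor the trajectory
    return context                               # step budget exhausted

# Toy stand-ins for illustration (a real agent would call an LLM here):
def fake_llm(context):
    if "Observation" not in context:
        return "Thought: I should look this up.\nAction: search[attention]"
    return "Thought: I have enough now.\nFinal Answer: a weighted sum over token representations"

tools = {"search": lambda q: f"attention weighs the relevance of each token to {q}"}
print(react_loop("What is the attention mechanism?", fake_llm, tools))
```

Note that the model's weights never change; everything interesting happens in what the next call is conditioned on.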

Interactive Example: ReAct

Interactive widget: given the question "What is the attention mechanism?", the agent loops Think -> Act -> Observe, picking tools from the environment (search, fetch_paper, calculator) and deciding at each step whether to loop or stop before emitting a final answer. A trace accumulates as the agent runs.

Reflexion (Shinn et al., 2023)

The Reflexion paper did not introduce a new type of autonomous agent; rather, it introduced a framework for incorporating memory of lessons learned in previous episodes into an agentic system, significantly enhancing the capabilities of existing agents. Essentially, Reflexion sits on top of an existing actor (e.g., one of the agents they used was ReAct) and augments it with memory so it can learn from errors made in previous episodes.

Reflexion agent architecture diagram

The framework decomposes an LLM agent into three roles repeated over episodes (trials). The Actor, which can be any LLM-based agent (e.g., ReAct), interacts with the environment to produce a trajectory - the ordered record of a single attempt (actions, observations, and, when using ReAct, intermediate thoughts). This trajectory is the agent's short-term memory: a working context that is continually appended to.

After the episode ends, an Evaluator assigns a pass or fail to the answer the Actor produced. Depending on the task, this evaluator can be programmatic (e.g., unit tests for code, exact-match checks) or an LLM-as-a-Judge when semantic judgment is required. If the answer fails, a Self-Reflection model reads the trajectory (and any prior memory) and writes a compact "lesson learned". Reflexion's long-term memory is not the full trajectory: it stores only these distilled reflection snippets in a long-term (across-attempts) memory buffer. The next episode starts with a fresh trajectory, but the Actor is prompted with the recent reflections from long-term episodic memory, so it can avoid repeating the same mistake and improve across attempts.
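The trial loop is compact in code. A minimal sketch, assuming `actor`, `evaluator`, and `reflector` callables - these names and the toy stubs below are mine, not the paper's API:

```python
def reflexion(task, actor, evaluator, reflector, max_trials=3):
    reflections = []                            # long-term memory: distilled lessons
    for _ in range(max_trials):
        trajectory = actor(task, reflections)   # short-term memory: one attempt
        if evaluator(trajectory):               # e.g. unit tests pass
            return trajectory
        # store only the compact lesson, never the full trajectory
        reflections.append(reflector(trajectory))
    return None

# Toy stubs: the actor fixes its code once a lesson about duplicates exists.
def actor(task, reflections):
    return "sorted(set(xs))[-2]" if reflections else "sorted(xs)[-2]"

def evaluator(code):                            # programmatic check: a unit test
    return eval(code, {"xs": [5, 5, 3]}) == 3   # 2nd largest unique element is 3

def reflector(trajectory):
    return "Failed on duplicate values: deduplicate before indexing."

print(reflexion("return 2nd largest unique element", actor, evaluator, reflector))
```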

Interactive Example: Reflexion

Interactive widget: for the coding task "Return 2nd largest unique element", an Actor LLM(task + reflections) writes code, an Evaluator run_tests(code) checks it, and a Self-Reflect LLM(trajectory + results) writes lessons into long-term memory; each retry is prompted with the accumulated reflections until the tests pass. Short-term memory holds the current episode's trajectory, and a trial history accumulates across attempts.

Tree of Thoughts (Yao et al., 2023)

Tree of Thoughts essentially generalizes the line of CoT papers by treating a chain of thought as a modular entity that can be split into smaller parts (thoughts), with new thoughts branching from previous ones. This framing naturally represents reasoning as search over a tree, which makes it possible to apply standard search algorithms like DFS and BFS. The LLM plays two roles here: thought generator (expanding the tree with new nodes) and evaluator (scoring each thought as sure/likely/impossible to decide whether to continue or prune).

The interactive example below shows DFS, where the agent explores one branch deeply, going down to the child with the highest evaluator score at each step. If the evaluation drops below a threshold, it backtracks - and crucially, the pruned branch’s thoughts are removed from the model’s context. The next generation is conditioned only on the parent’s state, so the dead-end reasoning is no longer in the prompt.

One interesting angle is the relationship to MDPs: in some tasks the current state incorporates all the information needed to make the next decision (so it is naturally Markov), while in others the current state is not enough and you essentially need to create a new state by rolling the whole trajectory (from root to the current node) into it, thereby making it Markov. So state management in ToT really depends on the task.

On the Game of 24, GPT-4 with standard Chain-of-Thought prompting solved only 4% of tasks, while ToT with BFS reached 74%.
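The DFS variant can be sketched in a few lines, with `propose` as the thought generator and `score` as the evaluator - both names are mine, standing in for LLM calls. The Game-of-24-style toy below restricts operations to +, ×, and absolute difference for brevity:

```python
from itertools import combinations

def dfs(state, propose, score, is_goal, threshold=0.5, depth=0, max_depth=4):
    if is_goal(state):
        return state
    if depth == max_depth:
        return None
    # Expand with the thought generator, visit children best-first. A pruned
    # branch never re-enters context: each call conditions only on `state`.
    for child in sorted(propose(state), key=score, reverse=True):
        if score(child) < threshold:
            continue                      # evaluator says "impossible": prune
        found = dfs(child, propose, score, is_goal, threshold, depth + 1, max_depth)
        if found is not None:
            return found
    return None                           # dead end: backtrack

# Toy Game-of-24 states: tuples of remaining numbers (ops restricted for brevity).
def propose(nums):
    children = []
    for a, b in combinations(nums, 2):
        rest = list(nums); rest.remove(a); rest.remove(b)
        for v in (a + b, a * b, abs(a - b)):
            children.append(tuple(rest) + (v,))
    return children

score = lambda s: 1.0                     # trivial evaluator for the demo
is_goal = lambda s: s == (24,)
print(dfs((4, 9, 10, 13), propose, score, is_goal))  # -> (24,)
```

With a constant evaluator nothing is ever pruned; a real ToT run would have the LLM score each candidate thought.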

Interactive Example: Tree of Thoughts - Game of 24 with DFS

Use 4, 9, 10, 13 with +, −, ×, ÷ to make 24

4, 9, 10, 13
├─ 4 + 9 = 13     left: 10, 13, 13
│  ├─ 13 − 10 = 3     left: 3, 13     (dead end, backtrack)
│  └─ 13 × 13 = 169   left: 10, 169   (dead end, backtrack)
└─ 13 − 9 = 4     left: 4, 4, 10
   └─ 10 − 4 = 6     left: 4, 6
      └─ 4 × 6 = 24 ✓

Interactive widget: DFS explores the tree, evaluates thoughts, and backtracks from dead ends; a search log records each step.

LLM Compiler

The LLM Compiler paper essentially implemented a plan-and-execute pattern with a feedback loop. At the first stage, a function-calling planner analyzes the request and tries to create a DAG of tasks to be executed (tasks themselves might be function calls, subsequent LLM calls, etc.), scheduling them in a way that allows parallel execution of non-dependent tasks. The significance here, compared to a simple ReAct loop, was agent efficiency: not just choosing the right actions, but choosing the right execution order to minimize latency. The model controls both what to do and when to do it. If the execution of some tasks fails, the agent retries and replans.

LLM Compiler: planning a dependency graph and executing in parallel
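The scheduling idea can be sketched in a few lines: given a planner-produced DAG, dispatch every task whose dependencies are already satisfied in parallel. The toy DAG below - two independent lookups feeding a combine step - is my own illustration, not the paper's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_dag(tasks, deps):
    """tasks: name -> callable(results); deps: name -> set of prerequisites.
    Assumes the dependency graph is acyclic."""
    results, done = {}, set()
    with ThreadPoolExecutor() as pool:
        while len(done) < len(tasks):
            # every task whose dependencies are all done runs in the same wave
            ready = [t for t in tasks if t not in done and deps[t] <= done]
            futures = {t: pool.submit(tasks[t], results) for t in ready}
            for t, f in futures.items():
                results[t] = f.result()
                done.add(t)
    return results

# toy plan: two independent lookups execute in parallel, then a combine step
tasks = {
    "search_a": lambda r: 2,
    "search_b": lambda r: 3,          # independent of search_a: same wave
    "combine":  lambda r: r["search_a"] + r["search_b"],
}
deps = {"search_a": set(), "search_b": set(), "combine": {"search_a", "search_b"}}
print(execute_dag(tasks, deps)["combine"])  # -> 5
```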

Many other papers used this plan-and-execute pattern; one interesting paper applying it to the construction of ML pipelines was HuggingGPT. In this paper the LLM interprets a user's natural-language request, breaks it down into actionable subtasks, selects appropriate expert models from Hugging Face for each subtask, runs those models to complete each piece, and then integrates and summarizes the outputs back to the user.

HuggingGPT: LLM orchestrating specialist models

From scaffold-based to model-based reasoning

Two papers that many people point to as an inspiration for the development of internalized reasoning are the chain-of-thought papers: few-shot CoT prompting (Wei et al., 2022), which showed that exemplars containing intermediate reasoning steps dramatically improve performance on reasoning tasks, and zero-shot CoT (Kojima et al., 2022), which showed that simply appending "Let's think step by step" elicits similar behavior without exemplars.

In combination, these papers showed that models already had some latent reasoning capabilities and the prompt simply activates them.

Learning to Reason by Imitation

The natural follow-up to the discovery of CoT and the success of harness-based reasoning agents described in the previous section was imitation learning. In fact, in the original ReAct paper the authors describe experiments with fine-tuning: taking 3,000 trajectories with correct answers generated by ReAct (with PaLM-540B) and fine-tuning smaller models "to decode trajectories (all thoughts, actions, observations) conditioned on input questions/claims". Results on multi-hop reasoning benchmarks like HotpotQA showed significant improvements.

Orca took a similar approach - it distills a teacher's answer style that includes step-by-step reasoning into a smaller model. Training samples represent triplets: (system_message, user_query, teacher_response). The system message is specifically crafted with zero-shot CoT or similar prompts to force the teacher (GPT-4) to produce reasoning traces (e.g. "think step-by-step and justify"), and the paper introduced diversity of those system messages. Orca is trained with standard SFT where the Categorical Cross Entropy loss is computed only on the teacher tokens. So Orca learns the mapping from input = (system + user query) => output = (teacher_reasoning + teacher_answer).
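The masking itself is the standard causal-LM trick: prompt positions get an ignore label so that only teacher tokens contribute to the loss. A minimal sketch - the token IDs are toy values, and -100 is the ignore index that Hugging Face-style loss functions conventionally skip:

```python
IGNORE = -100  # positions with this label are skipped by the CE loss

def build_labels(prompt_ids, response_ids):
    """Concatenate (system + user) prompt and teacher response; mask the prompt."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# toy IDs: prompt = system + user query, response = teacher reasoning + answer
ids, labels = build_labels([101, 7, 8], [42, 43, 44])
print(labels)  # -> [-100, -100, -100, 42, 43, 44]
```

So gradients flow only through the (teacher_reasoning + teacher_answer) span, which is exactly the mapping Orca learns.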

The key result of the paper is that at inference time the model produces reasoning traces by default, without needing the system prompt, even though during fine-tuning it saw a wide diversity of zero-shot CoT-like prompts. In fact, the authors explicitly state that for some evaluations they used an empty system message, which also shows that Orca didn't overfit on the CoT-style system prompts because it elicits reasoning even without them. While Orca reports large improvements on BigBench-Hard, AGIEval and several other benchmarks, they report them in the no-CoT setting - which is fair as a measure of default behavior, but it doesn't isolate the causal effect of explanation-tuning. A more informative ablation would evaluate Vicuna and Orca under matched CoT-style system prompts (e.g., "think step-by-step") using the same prompt template and parsing, so you can separate gains from prompting versus gains from fine-tuning on explanation traces.

FireAct (Chen et al., 2023) extended this imitation approach to full agentic trajectories. Rather than distilling from a single teacher using a single prompting strategy, FireAct generated Think->Act->Observe traces from multiple models (GPT-4, GPT-3.5) using multiple agent methods (ReAct, CoT, Reflexion-style) and fine-tuned a smaller model on the combined set. The key finding was that diversity of source agents matters - the student trained on mixed trajectories generalized better than one trained on any single agent type, learning to reason and act more flexibly than the individual strategies it was trained on.

FireAct: distilling diverse agent trajectories from multiple teachers into a smaller model

These imitation approaches proved that models could be fine-tuned to replicate the teacher's reasoning - producing explicit reasoning tokens at inference time, without needing a scaffold prompt. But this paradigm has no self-improvement loop: improvement comes from iterating on the scaffolding and teacher setup, which relies on human engineering rather than computation - in many ways the opposite of Sutton's bitter lesson.

Learning Through Self-Improvement Loop

STaR (Zelikman et al., 2022) introduced a self-improvement loop that removes the teacher entirely. The idea: the model generates its own reasoning traces, keeps the ones that lead to correct answers, and fine-tunes on them. Then repeats. Each iteration produces a better model that generates better traces in the next round.

The algorithm has one additional clever mechanism: rationalization. When the model gets a question wrong, STaR gives it a hint (the correct answer) and asks it to generate a rationale that arrives at that answer. If the rationalized trace is correct, it gets added to the training set too. This way the model learns not only from problems it already solves, but also bootstraps from problems it can solve with a hint - gradually expanding its reasoning capability.
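One STaR iteration can be sketched with stand-in `generate` and `finetune` callables - these names and the toy stubs below are mine, not the paper's code:

```python
def star_iteration(base_model, dataset, generate, finetune):
    keep = []
    for x, y in dataset:
        rationale, answer = generate(base_model, x)          # plain attempt
        if answer != y:
            # rationalization: regenerate with the correct answer as a hint
            rationale, answer = generate(base_model, x, hint=y)
            if answer != y:
                continue                                     # still wrong: drop
        keep.append((x, rationale, y))
    # fine-tune from the original base model, not the previous checkpoint
    return finetune(base_model, keep)

# toy stubs: the "model" doubles its input; hinted generations always succeed
def generate(model, x, hint=None):
    return ("hinted steps", hint) if hint is not None else ("steps", 2 * x)

kept = star_iteration("M0", [(1, 2), (2, 5)], generate, lambda m, k: k)
print(len(kept))  # -> 2 (one solved directly, one via rationalization)
```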

One interesting detail in the paper is that they outline an alternative to rationalization: generating multiple trajectories at higher temperatures to force diversity, then selecting the successful ones. However, they found this "substantially increases the likelihood of a correct answer despite incorrect reasoning". What I find interesting is that for LLMs of the era when STaR was written, the same was probably true for rationalization - the LLM would produce plausible-sounding but incorrect or incomplete reasoning for the hinted answer.

Interactive Example: STaR - One Iteration

Task: answer arithmetic word problems using step-by-step reasoning

Example A - Rationale Generation
Q: A store sells 4 apples at $3 each and 2 oranges at $5 each. What is the total?
Prompt to model: few_shot_examples + xi
Rationale r̂i: 4 × $3 = $12 for apples. 2 × $5 = $10 for oranges. Total = $12 + $10 = $22.
Answer ŷi: $22
Check: ŷi = $22 vs yi = $22   ✓ match → keep (xi, r̂i, yi)

Example B - Rationale Generation
Q: A train travels 60 mph for 2.5 hours, then 40 mph for 1.5 hours. Total distance?
Prompt to model: few_shot_examples + xi
Rationale r̂i: 60 × 2.5 = 150 miles. 40 × 2.5 = 100 miles. Total = 250 miles.
Answer ŷi: 250 miles
Check: ŷi = 250 vs yi = 210   ✗ mismatch → try rationalization

Rationalization (hint: yi = 210)
Prompt to model: few_shot_examples + xi + add_hint(yi)
Rationale r̂rati: 60 × 2.5 = 150 miles. 40 × 1.5 = 60 miles. Total = 210 miles.
Answer ŷrati: 210 miles
Check: ŷrati = 210 vs yi = 210   ✓ match → keep (xi, r̂rati, yi)

Filter and combine the kept traces, fine-tune M on them, then repeat with the improved model.

Interactive widget: steps through one iteration of STaR's self-improvement loop; an algorithm log records each step.

A year after STaR, ReST-EM (Singh et al., 2023) took a different approach to the same self-improvement idea. Instead of the rationalization approach, it simply samples many candidate solutions per problem (32-64), filters for correct outcomes, and fine-tunes on the surviving traces. The paper was not focused on reasoning specifically - it framed this as a general self-training method based on expectation-maximization: filtering is the expectation step, while the maximization step is SFT on the filtered set of trajectories. One detail worth noting: both papers fine-tune from the original base model at every iteration, not from the previous iteration's checkpoint.1
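The generate-filter-train loop, sketched with stand-in `sample`, `verify`, and `sft` callables (the names and toy stubs are mine); note that the M-step restarts from the base model each round:

```python
def rest_em(base_model, problems, sample, verify, sft, iters=3, k=32):
    model = base_model
    for _ in range(iters):
        # E-step: sample k candidates per problem, keep verifiably correct ones
        kept = [(x, sol) for x in problems
                         for sol in sample(model, x, k)
                         if verify(x, sol)]
        # M-step: SFT on the filtered set, fine-tuning from the base model
        model = sft(base_model, kept)
    return model

# toy stubs for illustration
sample = lambda m, x, k: [x + d for d in range(k)]   # fake candidate "solutions"
verify = lambda x, sol: sol == 2 * x                 # fake outcome checker
sft    = lambda base, kept: f"{base}+{len(kept)}ex"  # fake fine-tuning step
print(rest_em("M0", [1, 2, 3], sample, verify, sft, iters=1, k=4))  # -> M0+3ex
```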

Both STaR and ReST-EM use SFT as the optimization mechanism, but there is an RL interpretation hiding behind both. ReST-EM's generate-filter-train loop can be viewed as a REINFORCE-without-baseline style algorithm: with binary reward, REINFORCE only gets gradient signal from correct trajectories - which is exactly what ReST-EM's filtering does. However, the major difference is that instead of a single policy gradient step, ReST-EM performs full SFT on the filtered set. The STaR paper makes this connection even more explicitly, framing the self-improvement loop as a policy-gradient style algorithm. Though because rationalizations are generated conditioned on the correct answer (a hint the model wouldn't have at test time), they are off-policy - so the algorithm resembles REINFORCE structurally but can't really be considered REINFORCE proper.
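The equivalence is easy to see concretely: with binary reward and no baseline, the REINFORCE update sums r_i · ∇log p(τ_i), so zero-reward trajectories drop out and what remains is exactly the negative log-likelihood gradient over the filtered set. A toy check, using a scalar stand-in for the per-trajectory gradient:

```python
# scalar stand-in for the per-trajectory gradient ∇log p(τ)
grad_logp = lambda traj: len(traj)

def reinforce_grad(trajs, rewards):
    # REINFORCE without baseline: sum_i r_i * ∇log p(τ_i)
    return sum(r * grad_logp(t) for t, r in zip(trajs, rewards))

def filtered_sft_grad(trajs, rewards):
    # ReST-EM style: filter to correct trajectories, then NLL gradient
    return sum(grad_logp(t) for t, r in zip(trajs, rewards) if r == 1)

trajs, rewards = ["abc", "de", "fghi"], [1, 0, 1]
print(reinforce_grad(trajs, rewards) == filtered_sft_grad(trajs, rewards))  # -> True
```

The remaining difference is step size and staleness: REINFORCE takes a single on-policy gradient step, while ReST-EM performs full SFT epochs on the filtered set.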

One of the subsequent papers that made the use of REINFORCE explicit was Quiet-STaR (Zelikman et al., 2024), which used REINFORCE to train hidden thinking tokens. Around the same time, DeepSeekMath (Shao et al., 2024) introduced Group Relative Policy Optimization (GRPO) and trained reasoning using what is now called RLVR - reinforcement learning from verifiable rewards.

What made these results striking - and what DeepSeek-R1 (DeepSeek-AI, 2025) later demonstrated even more dramatically - is that reasoning emerged naturally from the training process itself. In R1-Zero, the model is trained with RL directly from a base model (no SFT warmup). The authors report that long chain-of-thought traces and reflection-like behaviors (including an "aha moment" where the model pauses to re-check its work) emerged during training, apparently because such behaviors were instrumentally useful for earning reward, rather than being explicitly supervised as step-by-step solutions.

This is the approach that scaled, and variations of which are widely used today. We won't go in-depth on Quiet-STaR, DeepSeekMath, or R1 here - we will try to have a separate blog post dedicated to them.

Internalized Agentic Behavior: Reasoning and Acting in the Same Trace

In current thinking models, action and reasoning can be performed within reasoning traces, and many of the functionalities of early agentic systems that relied on scaffolding are now internalized in the model itself.

To illustrate, consider an extremely simple setup: two plain Python functions - get_weather and get_city_traffic - registered as tools, and a single user message: "I am heading downtown. What's the weather and how bad is traffic?" No orchestration loop or scaffolding: the model receives the tools, decides which to call and in what order inside its own reasoning trace. Once results from a tool call are returned, reasoning continues based on the environment feedback. In this simple setup you can see similarities to the plan-and-execute and ReAct patterns from the first section. It is easy to experiment and see for yourself that backtracking is also natively supported by the model.

Decoded Trace: Reasoning + Tool Use in One Generation

def run_with_tools(messages, tool_map, max_turns=8):
    for _ in range(max_turns):
        inputs = tokenizer.apply_chat_template(
            messages, tools=TOOLS,
            add_generation_prompt=True,
            return_tensors="pt",
            return_dict=True,
            reasoning_effort="medium",
        ).to(model.device)

        output_ids = model.generate(
            **inputs, max_new_tokens=2048,
            temperature=0.1, do_sample=True)

        # decode only the newly generated tokens, keeping special tokens
        # so the tool-call markup survives
        text = tokenizer.decode(
            output_ids[0][len(inputs["input_ids"][0]):],
            skip_special_tokens=False)

        tool_call_match = _TOOLCALL_RE.search(text)
        if tool_call_match:
            tool_name = tool_call_match.group(1)
            args = json.loads(tool_call_match.group(2) or "{}")
            tool_output = tool_map[tool_name](**args)

            # record the call and its result, then let the model continue
            messages.append({"role": "assistant",
                "tool_calls": [{"type": "function",
                    "function": {"name": tool_name,
                        "arguments": args}}]})
            messages.append({"role": "tool",
                "content": tool_output
                    if isinstance(tool_output, str)
                    else json.dumps(tool_output)})
            continue

        return text  # no tool call: this is the final answer

Interactive widget: steps through the decoded output, showing reasoning and tool use interleaving in a single trace.

You can see in this simple example how all of this works without scaffolding. The model decides how much to think before acting and whether to act at all. A ReAct agent always executed the Think -> Act -> Observe cycle because the scaffold enforced it. A reasoning model might think for twenty steps, realize the answer is already derivable from context, and never call a tool. Or it might call three tools in quick succession because its internal reasoning identified a dependency chain.

Conclusion

Scaffolding still exists today - Claude Code, Codex, and many production agents run inside substantial harnesses. But the role of that scaffolding has shifted. It is less about forcing multi-step reasoning through an explicit loop - much of that is now handled by the model itself. Instead, the scaffolding increasingly provides the medium in which the model operates: tools, file system access, a scratchpad where the model can write down intermediate thoughts and manage its own context across turns. You see less and less of the early-agent pattern where the scaffold dictated how to think - it just provides the environment and lets the model decide.


  1. The improvement comes from the training data getting better each round (more correct traces, covering harder problems), not from accumulated weight changes. Resetting to the base model each time keeps the pre-trained knowledge intact and avoids compounding distribution drift across iterations.