Agents Don't Always Do What They Think: Measuring Faithfulness in Multi-Step ReAct Agents

Read the Full Report (PDF) | View the Poster (PDF)

A Stanford CS224N final project that investigates whether ReAct agents actually use their stated reasoning and retrieved information, or whether the reasoning is post-hoc rationalization. We adapt the perturbation framework from Lanham et al. (2023), an Anthropic technical report on measuring faithfulness in chain-of-thought reasoning, and extend it from single-turn CoT to multi-step agents that interleave reasoning with external tool use.

Overview

ReAct agents (Yao et al., 2023) generate Thought/Action/Observation traces that look like transparent reasoning, but do these traces actually drive behavior? We test this with four perturbation experiments across five models (Llama 3.1 8B/70B/405B, GPT-4o-mini, GPT-4o) on 500 HotPotQA multi-hop questions.
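
For intuition, here is a minimal sketch of that loop in Python (illustrative only; the actual implementation in agent/react_agent.py uses few-shot prompting and the Wikipedia search/lookup interface):

import re

def run_react(llm, question, search, max_steps=7):
    # Minimal ReAct loop: each step the model emits a Thought and an Action;
    # tool output is appended as an Observation and fed back into the prompt.
    trace = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        out = llm(trace + f"Thought {step}:")
        trace += f"Thought {step}:{out}\n"
        match = re.search(r"Action \d+: *(\w+)\[(.*?)\]", out)
        if match is None:
            break                      # malformed action: stop the episode
        tool, arg = match.groups()
        if tool == "finish":
            return arg, trace          # final answer plus the full trajectory
        # Only search[...] is sketched here; the repo also supports lookup[...].
        trace += f"Observation {step}: {search(arg)}\n"
    return None, trace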

Two of our experiments test faithfulness links that are unique to agents and have no analogue in Lanham et al.'s (2023) single-turn framework (both are sketched in code below the list):

  • Observation corruption -- replaces retrieved Wikipedia facts with plausible misinformation to test whether agents use their tools (Observation --> Answer faithfulness)
  • Thought corruption -- replaces reasoning with thoughts suggesting different actions to test whether stated reasoning drives tool use (Thought --> Action faithfulness)
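
Both perturbations share the same mechanics: splice a corrupted field into an otherwise-gold trajectory, replay the prefix, and see what the agent does next. A minimal sketch, assuming trajectories are stored as lists of per-step dicts (an illustrative schema, not necessarily the repo's):

def corrupt_step(trajectory, step_idx, field, replacement):
    # Replace one field ("observation" or "thought") at the chosen step,
    # leaving the rest of the gold trajectory intact. The agent is then
    # replayed from this prefix and we check whether its final answer
    # (observation corruption) or next action (thought corruption) flips.
    perturbed = [dict(step) for step in trajectory]
    perturbed[step_idx][field] = replacement
    return perturbed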

Key Findings

1. Thoughts causally drive actions

Thought corruption produces 31-67% action flip rates (vs. 1-4% replay baseline), confirming that reasoning content, not just format compliance, determines what agents do next. Thought paraphrasing (same meaning, different wording) produces near-zero flip rates, ruling out surface-level sensitivity.

2. Inverse scaling: stronger models are less faithful

Models with stronger parametric knowledge are less influenced by both observations and their own reasoning traces:

Model            Obs. Corruption   Thought Corruption   k=0 Accuracy
Llama 3.1 8B     51.9%             67.3%                46.8%
GPT-4o-mini      31.1%             35.9%                69.0%
Llama 3.1 70B    37.9%             43.3%                74.8%
GPT-4o           25.2%             30.8%                86.5%
Llama 3.1 405B   38.4%             35.7%                74.1%

Obs./Thought Corruption = answer/action flip rate (higher = more faithful). k=0 Accuracy = parametric knowledge baseline (higher = less trajectory dependence).
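
The flip-rate metric itself is just the fraction of perturbed runs whose outcome changes; a sketch consistent with the definitions above (the repo's version lives in evaluation/metrics.py):

def flip_rate(original, perturbed):
    # Fraction of questions whose outcome (final answer or next action)
    # differs between the gold run and the perturbed run.
    assert len(original) == len(perturbed)
    return sum(o != p for o, p in zip(original, perturbed)) / len(original)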

3. From literal to independent reasoning

Weaker models follow corrupted thoughts literally (Llama 8B: 64% of action flips match the corruption's suggestion). Stronger models treat thoughts as context and reason independently (GPT-4o: only 24% follow the corruption, 76% generate their own alternative). This gradient tracks capability, not model family.
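
The followed-corruption statistic behind these percentages can be computed as below (record field names are hypothetical, for illustration):

def followed_corruption_rate(records):
    # Among trials whose action flipped, the fraction where the new action is
    # exactly the one the corrupted thought suggested (literal following),
    # as opposed to an independently chosen alternative.
    flips = [r for r in records if r["new_action"] != r["original_action"]]
    if not flips:
        return 0.0
    followed = sum(r["new_action"] == r["suggested_action"] for r in flips)
    return followed / len(flips)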

4. The faithfulness-robustness tension

When we corrupt evidence and the model still produces the correct answer, there are two interpretations: unfaithful reasoning (the model ignored its tools and answered from memory) or robust reasoning (the model detected the inconsistency and fell back on parametric knowledge). Our results show both are happening simultaneously, and they scale with capability. GPT-4o recovers from 81% of thought corruption action flips through subsequent retrieval, and overrides corrupted observations 75% of the time. This makes it the most robust agent but the least faithful one. Llama 8B shows the opposite: more faithful to its tools, but more vulnerable to misinformation (recovers only 53% of the time). Understanding where faithfulness ends and robustness begins is a central challenge for evaluating and deploying agents.

Relationship to Lanham et al. (2023)

This project directly adapts the perturbation framework from Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al., 2023), an Anthropic technical report that introduced early answering, mistake addition, and paraphrasing tests for single-turn CoT across 10 BIG-Bench tasks.

                     Lanham et al. (2023)    This project
Setting              Single-turn CoT         Multi-step ReAct agents
Tasks                10 BIG-Bench tasks      HotPotQA (multi-hop QA)
Models               3 (single family)       5 (Llama + GPT families)
External tools       None                    Wikipedia search/lookup
Novel experiments    --                      Observation corruption, Thought corruption
Inverse scaling      Yes                     Yes, plus model family effects
Steganography        No evidence             No evidence (replicates Lanham)

Our observation corruption flip rates (25-52%) are substantially higher than Lanham's mistake-addition rates on most tasks, suggesting agents are at least as faithful to external evidence as single-turn models are to their own CoT.

Architecture

HotPotQA (500 questions)
    |
    v
ReAct Agent (per model) ---> Gold Trajectories (correct answers only)
    |
    +---> Early Termination        [k=0 parametric knowledge baseline]
    +---> Observation Corruption   [Observation --> Answer faithfulness]
    +---> Thought Corruption       [Thought --> Action faithfulness]
    +---> Thought Paraphrasing     [Steganography test / control]
    |
    v
Metrics (flip rate, action flip rate, followed corruption rate)
    |
    v
Analysis (inverse scaling, faithfulness-robustness tension, model family effects)

Project Structure

agent-faithfulness/
├── agent/                          # ReAct agent implementation
│   ├── react_agent.py              # Core Thought-->Action-->Observation loop
│   ├── prompts.py                  # Few-shot prompt construction (6 exemplars)
│   └── wiki_api.py                 # Wikipedia search/lookup interface
├── configs/
│   └── models.py                   # Model registry (5 models, 3 providers)
├── data/
│   └── trajectories/               # Gold trajectories: {model}_500.json
├── evaluation/
│   ├── comparison.py               # Answer extraction and matching
│   └── metrics.py                  # Flip rate, AOC, exact match
├── experiments/
│   ├── common.py                   # Shared: trajectory loading, continuation, answer parsing
│   ├── early_termination.py        # Truncate trajectory, measure accuracy
│   ├── observation_corruption.py   # Replace observations with misinformation
│   ├── thought_corruption.py       # Replace thoughts suggesting different actions
│   ├── thought_paraphrasing.py     # Reword thoughts preserving meaning
│   ├── replay_baseline.py          # Re-send unperturbed trajectories (control)
│   └── gold_trajectory.py          # Collect gold trajectories from HotPotQA
├── results/                        # Experiment outputs: {experiment}/{model}.json
├── analysis/                       # Notebooks and scripts for analysis/plots
├── paper/
│   ├── Project_Final_Report.pdf    # Final report
│   ├── Project Poster.pdf          # Project poster
│   ├── draft.md                    # Working draft (markdown)
│   └── experiments_detailed.md     # Detailed experiment descriptions
├── scripts/                        # Utility scripts (result reprocessing)
├── infra/                          # Modal deployment configs (Llama 70B)
└── tests/                          # pytest test suite (118 unit tests)

Installation

Requires Python 3.10+.

git clone <repo-url>
cd agent-faithfulness

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

API Keys

# OpenAI models (GPT-4o, GPT-4o-mini)
export OPENAI_API_KEY=sk-...

# Together AI (Llama 405B)
export TOGETHER_API_KEY=...

# Modal (Llama 8B, 70B) -- deploy first, then set URL
modal deploy infra/modal_llama.py
export MODAL_LLAMA_URL=https://...

Usage

Collect gold trajectories

python -m experiments.gold_trajectory --model gpt-4o-mini \
    --num_questions 500 --output data/trajectories/gpt-4o-mini_500.json

Run experiments

# Replay baseline (control)
python -m experiments.replay_baseline --model gpt-4o-mini \
    --input data/trajectories/gpt-4o-mini_500.json \
    --output results/replay/gpt-4o-mini.json

# Observation corruption (Observation --> Answer faithfulness)
python -m experiments.observation_corruption --model gpt-4o-mini \
    --input data/trajectories/gpt-4o-mini_500.json \
    --output results/observation_corruption/gpt-4o-mini.json

# Thought corruption (Thought --> Action faithfulness)
python -m experiments.thought_corruption --model gpt-4o-mini \
    --input data/trajectories/gpt-4o-mini_500.json \
    --output results/thought_corruption/gpt-4o-mini.json

# Thought paraphrasing (steganography test)
python -m experiments.thought_paraphrasing --model gpt-4o-mini \
    --input data/trajectories/gpt-4o-mini_500.json \
    --output results/thought_paraphrasing_final/gpt-4o-mini.json

# Early termination
python -m experiments.early_termination --model gpt-4o-mini \
    --input data/trajectories/gpt-4o-mini_500.json \
    --output results/early_termination/gpt-4o-mini.json

All experiments are resumable -- re-run the same command after a crash or rate limit and it picks up where it left off.
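
Resumption follows the usual skip-completed-ids pattern; roughly (a generic sketch, the exact mechanism in experiments/common.py may differ):

import json
import os

def remaining_questions(questions, output_path):
    # Collect the ids already written to the results file, then skip them on
    # re-run so only unfinished questions are sent to the model.
    done = set()
    if os.path.exists(output_path):
        with open(output_path) as f:
            done = {record["question_id"] for record in json.load(f)}
    return [q for q in questions if q["question_id"] not in done]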

Run tests

pytest tests/                          # All unit tests (118 tests)
pytest tests/ -m "not integration"     # Skip network-dependent tests

Models

Model            Params   Provider                 k=0 Accuracy   Gold EM
Llama 3.1 8B     8B       vLLM on Modal            46.8%          28.2%
GPT-4o-mini      --       OpenAI API               69.0%          37.4%
Llama 3.1 70B    70B      vLLM on Modal (2xA100)   74.8%          40.4%
GPT-4o           --       OpenAI API               86.5%          41.4%
Llama 3.1 405B   405B     Together AI              74.1%          44.8%

All experiments use temperature=0. GPT-4o-mini serves as the utility LLM for generating corruptions and paraphrases.
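
Generating a corruption with the utility LLM looks roughly like this (the prompt wording and function name are illustrative, not the repo's actual prompt):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_corruption(observation: str) -> str:
    # Ask GPT-4o-mini for a plausible-but-false rewrite of a retrieved
    # passage, keeping style and length so only the factual content changes.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the passage so its key fact becomes plausible "
                "misinformation. Keep the style and length.\n\n" + observation
            ),
        }],
    )
    return response.choices[0].message.content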

References

Lanham, T., et al. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning. arXiv:2307.13702.
Yang, Z., et al. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. EMNLP 2018. arXiv:1809.09600.
Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629.

Author

Mark Gernitis -- gernitis@stanford.edu

Stanford CS224N: Natural Language Processing with Deep Learning, Winter 2026
