A Stanford CS224N final project that investigates whether ReAct agents actually use their stated reasoning and retrieved information, or whether the reasoning is post-hoc rationalization. We adapt the perturbation framework from Lanham et al. (2023), an Anthropic technical report on measuring faithfulness in chain-of-thought reasoning, and extend it from single-turn CoT to multi-step agents that interleave reasoning with external tool use.
ReAct agents (Yao et al., 2023) generate Thought/Action/Observation traces that look like transparent reasoning, but do these traces actually drive behavior? We test this with four perturbation experiments across five models (Llama 3.1 8B/70B/405B, GPT-4o-mini, GPT-4o) on 500 HotPotQA multi-hop questions.
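For reference, the control flow being perturbed can be sketched in a few lines. This is a minimal toy ReAct loop, not the repo's implementation; `fake_llm` and `fake_wiki_search` are stand-ins for the real model call and Wikipedia tool:

```python
import re

def fake_llm(prompt: str) -> str:
    """Stand-in for a model call; a real agent would query an LLM here."""
    if "Observation 1" not in prompt:
        return ("Thought 1: I should look up the capital of France.\n"
                "Action 1: Search[France]")
    return ("Thought 2: France's capital is Paris, so I can answer.\n"
            "Action 2: Finish[Paris]")

def fake_wiki_search(entity: str) -> str:
    """Stand-in for the Wikipedia search tool."""
    return f"{entity} is a country in Europe. Its capital is Paris."

def react_loop(question: str, max_steps: int = 5) -> str:
    """Interleave Thought/Action/Observation until the agent emits Finish[answer]."""
    prompt = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        completion = fake_llm(prompt)
        prompt += completion + "\n"
        action = re.search(r"Action \d+: (\w+)\[(.*?)\]", completion)
        kind, arg = action.group(1), action.group(2)
        if kind == "Finish":
            return arg                       # final answer
        obs = fake_wiki_search(arg)          # external tool call
        prompt += f"Observation {step}: {obs}\n"
    return "no answer"

print(react_loop("What is the capital of France?"))  # -> Paris
```

Each perturbation experiment edits one component of this loop (a Thought or an Observation) in a stored gold trajectory and measures whether downstream behavior changes.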
Two of our experiments test faithfulness links unique to agents, with no analogue in Lanham et al.'s (2023) single-turn framework:
- Observation corruption -- replaces retrieved Wikipedia facts with plausible misinformation to test whether agents use their tools (Observation --> Answer faithfulness)
- Thought corruption -- replaces reasoning with thoughts suggesting different actions to test whether stated reasoning drives tool use (Thought --> Action faithfulness)
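Concretely, both perturbations amount to swapping one field of a step in a stored gold trajectory and replaying the prefix. A minimal sketch, assuming a hypothetical trajectory schema of per-step dicts with `thought`/`action`/`observation` keys (the repo's actual JSON format may differ):

```python
import copy

def corrupt_step(trajectory, step, field, replacement):
    """Return a perturbed copy of a gold trajectory with one field swapped.

    field="observation" -> observation corruption (Observation -> Answer)
    field="thought"     -> thought corruption (Thought -> Action)
    """
    assert field in ("thought", "observation")
    perturbed = copy.deepcopy(trajectory)   # never mutate the gold trajectory
    perturbed[step][field] = replacement
    return perturbed

gold = [
    {"thought": "I should search for where the band formed.",
     "action": "Search[The Band]",
     "observation": "The Band formed in Toronto, Canada."},
]
bad = corrupt_step(gold, 0, "observation", "The Band formed in Oslo, Norway.")
print(bad[0]["observation"])   # corrupted copy
print(gold[0]["observation"])  # gold trajectory is untouched
```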
Thought corruption produces 31-67% action flip rates (vs. 1-4% replay baseline), confirming that reasoning content, not just format compliance, determines what agents do next. Thought paraphrasing (same meaning, different wording) produces near-zero flip rates, ruling out surface-level sensitivity.
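Flip rate here is simply the fraction of perturbed replays whose outcome (final answer, or next action) differs from the unperturbed run. A sketch with hypothetical names:

```python
def flip_rate(baseline, perturbed):
    """Fraction of items whose outcome changed after perturbation."""
    assert len(baseline) == len(perturbed)
    flips = sum(b != p for b, p in zip(baseline, perturbed))
    return flips / len(baseline)

# Toy example: 2 of 4 next-actions change after thought corruption.
gold_actions = ["Search[A]", "Lookup[x]", "Search[B]", "Finish[y]"]
new_actions  = ["Search[A]", "Search[C]", "Search[B]", "Finish[z]"]
print(flip_rate(gold_actions, new_actions))  # -> 0.5
```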
Models with stronger parametric knowledge are less influenced by both observations and their own reasoning traces:
| Model | Obs. Corruption | Thought Corruption | k=0 Accuracy |
|---|---|---|---|
| Llama 3.1 8B | 51.9% | 67.3% | 46.8% |
| GPT-4o-mini | 31.1% | 35.9% | 69.0% |
| Llama 3.1 70B | 37.9% | 43.3% | 74.8% |
| GPT-4o | 25.2% | 30.8% | 86.5% |
| Llama 3.1 405B | 38.4% | 35.7% | 74.1% |
Obs./Thought Corruption = answer/action flip rate (higher = more faithful). k=0 Accuracy = parametric knowledge baseline (higher = less trajectory dependence).
Weaker models follow corrupted thoughts literally (Llama 8B: 64% of action flips match the corruption's suggestion). Stronger models treat thoughts as context and reason independently (GPT-4o: only 24% follow the corruption, 76% generate their own alternative). This gradient tracks capability, not model family.
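The followed-vs-independent split can be computed by comparing each flipped action against the action the corrupted thought suggested. A sketch (names and pair format are illustrative, not the repo's):

```python
def followed_corruption_rate(flipped):
    """flipped: list of (new_action, suggested_action) pairs for flipped steps.

    Returns the fraction of flips where the agent took exactly the action
    the corrupted thought suggested (literal following), as opposed to
    generating its own alternative.
    """
    if not flipped:
        return 0.0
    followed = sum(new == suggested for new, suggested in flipped)
    return followed / len(flipped)

flips = [("Search[Oslo]", "Search[Oslo]"),      # followed the corruption
         ("Lookup[capital]", "Search[Oslo]")]   # independent alternative
print(followed_corruption_rate(flips))  # -> 0.5
```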
When we corrupt evidence and the model still produces the correct answer, there are two interpretations: unfaithful reasoning (the model ignored its tools and answered from memory) or robust reasoning (the model detected the inconsistency and fell back on parametric knowledge). Our results show both are happening simultaneously, and they scale with capability. GPT-4o recovers from 81% of thought corruption action flips through subsequent retrieval, and overrides corrupted observations 75% of the time. This makes it the most robust agent but the least faithful one. Llama 8B shows the opposite: more faithful to its tools, but more vulnerable to misinformation (recovers only 53% of the time). Understanding where faithfulness ends and robustness begins is a central challenge for evaluating and deploying agents.
This project directly adapts the perturbation framework from Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al., 2023), an Anthropic technical report that introduced early answering, mistake addition, and paraphrasing tests for single-turn CoT across 10 BIG-Bench tasks.
| | Lanham et al. (2023) | This project |
|---|---|---|
| Setting | Single-turn CoT | Multi-step ReAct agents |
| Tasks | 10 BIG-Bench tasks | HotPotQA (multi-hop QA) |
| Models | 3 (single family) | 5 (Llama + GPT families) |
| External tools | None | Wikipedia search/lookup |
| Novel experiments | -- | Observation corruption, Thought corruption |
| Inverse scaling | Yes | Yes, plus model family effects |
| Steganography | No evidence | No evidence (replicates Lanham) |
Our observation corruption flip rates (25-52%) are substantially higher than Lanham's mistake-addition rates on most tasks, suggesting agents are at least as faithful to external evidence as single-turn models are to their own CoT.
HotPotQA (500 questions)
|
v
ReAct Agent (per model) ---> Gold Trajectories (correct answers only)
|
+---> Early Termination [k=0 parametric knowledge baseline]
+---> Observation Corruption [Observation --> Answer faithfulness]
+---> Thought Corruption [Thought --> Action faithfulness]
+---> Thought Paraphrasing [Steganography test / control]
|
v
Metrics (flip rate, action flip rate, followed corruption rate)
|
v
Analysis (inverse scaling, faithfulness-robustness tension, model family effects)
agent-faithfulness/
├── agent/ # ReAct agent implementation
│ ├── react_agent.py # Core Thought-->Action-->Observation loop
│ ├── prompts.py # Few-shot prompt construction (6 exemplars)
│ └── wiki_api.py # Wikipedia search/lookup interface
├── configs/
│ └── models.py # Model registry (5 models, 3 providers)
├── data/
│ └── trajectories/ # Gold trajectories: {model}_500.json
├── evaluation/
│ ├── comparison.py # Answer extraction and matching
│ └── metrics.py # Flip rate, AOC, exact match
├── experiments/
│ ├── common.py # Shared: trajectory loading, continuation, answer parsing
│ ├── early_termination.py # Truncate trajectory, measure accuracy
│ ├── observation_corruption.py # Replace observations with misinformation
│ ├── thought_corruption.py # Replace thoughts suggesting different actions
│ ├── thought_paraphrasing.py # Reword thoughts preserving meaning
│ ├── replay_baseline.py # Re-send unperturbed trajectories (control)
│ └── gold_trajectory.py # Collect gold trajectories from HotPotQA
├── results/ # Experiment outputs: {experiment}/{model}.json
├── analysis/ # Notebooks and scripts for analysis/plots
├── paper/
│ ├── Project_Final_Report.pdf # Final report
│ ├── Project Poster.pdf # Project poster
│ ├── draft.md # Working draft (markdown)
│ └── experiments_detailed.md # Detailed experiment descriptions
├── scripts/ # Utility scripts (result reprocessing)
├── infra/ # Modal deployment configs (Llama 70B)
└── tests/ # pytest test suite (118 unit tests)
git clone <repo-url>
cd agent-faithfulness
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# OpenAI models (GPT-4o, GPT-4o-mini)
export OPENAI_API_KEY=sk-...
# Together AI (Llama 405B)
export TOGETHER_API_KEY=...
# Modal (Llama 8B, 70B) -- deploy first, then set URL
modal deploy infra/modal_llama.py
export MODAL_LLAMA_URL=https://...

python -m experiments.gold_trajectory --model gpt-4o-mini \
--num_questions 500 --output data/trajectories/gpt-4o-mini_500.json

# Replay baseline (control)
python -m experiments.replay_baseline --model gpt-4o-mini \
--input data/trajectories/gpt-4o-mini_500.json \
--output results/replay/gpt-4o-mini.json
# Observation corruption (Observation --> Answer faithfulness)
python -m experiments.observation_corruption --model gpt-4o-mini \
--input data/trajectories/gpt-4o-mini_500.json \
--output results/observation_corruption/gpt-4o-mini.json
# Thought corruption (Thought --> Action faithfulness)
python -m experiments.thought_corruption --model gpt-4o-mini \
--input data/trajectories/gpt-4o-mini_500.json \
--output results/thought_corruption/gpt-4o-mini.json
# Thought paraphrasing (steganography test)
python -m experiments.thought_paraphrasing --model gpt-4o-mini \
--input data/trajectories/gpt-4o-mini_500.json \
--output results/thought_paraphrasing_final/gpt-4o-mini.json
# Early termination
python -m experiments.early_termination --model gpt-4o-mini \
--input data/trajectories/gpt-4o-mini_500.json \
--output results/early_termination/gpt-4o-mini.json

All experiments are resumable -- re-run the same command after a crash or rate limit and it picks up where it left off.
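Resumability can be as simple as keying results by question id and skipping ids already present in the output file. A sketch of the pattern, not the repo's actual implementation:

```python
import json
import os

def run_resumable(questions, output_path, run_one):
    """Process questions, skipping any id already saved to output_path."""
    done = {}
    if os.path.exists(output_path):
        with open(output_path) as f:
            done = json.load(f)             # reload prior progress
    for q in questions:
        if q["id"] in done:
            continue                        # completed before the crash
        done[q["id"]] = run_one(q)
        with open(output_path, "w") as f:   # checkpoint after every item
            json.dump(done, f)
    return done
```

Checkpointing after every item trades a little I/O for losing at most one in-flight question to a crash or rate-limit abort.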
pytest tests/ # All unit tests (118 tests)
pytest tests/ -m "not integration" # Skip network-dependent tests

| Model | Params | Provider | k=0 Accuracy | Gold EM |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | vLLM on Modal | 46.8% | 28.2% |
| GPT-4o-mini | -- | OpenAI API | 69.0% | 37.4% |
| Llama 3.1 70B | 70B | vLLM on Modal (2xA100) | 74.8% | 40.4% |
| GPT-4o | -- | OpenAI API | 86.5% | 41.4% |
| Llama 3.1 405B | 405B | Together AI | 74.1% | 44.8% |
All experiments use temperature=0. GPT-4o-mini serves as the utility LLM for generating corruptions and paraphrases.
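The corruption-generation step is a single utility-LLM call per observation. The prompt below is our illustrative wording, not the repo's actual prompt:

```python
def corruption_prompt(observation: str) -> str:
    """Build a prompt asking a utility LLM (here, GPT-4o-mini) to rewrite a
    retrieved passage into plausible misinformation: same topic and style,
    altered key facts."""
    return (
        "Rewrite the passage below so it stays on the same topic and in the "
        "same style, but changes the key facts to plausible misinformation. "
        "Keep roughly the same length.\n\n"
        f"Passage: {observation}\n\nRewritten passage:"
    )

prompt = corruption_prompt("The Eiffel Tower was completed in 1889.")
```

Running the utility LLM at temperature=0 keeps corruptions reproducible across replays of the same trajectory.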
- Lanham et al. Measuring Faithfulness in Chain-of-Thought Reasoning. Anthropic Technical Report, 2023.
- Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023.
- Turpin et al. Language Models Don't Always Say What They Think. NeurIPS, 2023.
- Huang et al. Learning to Seek Evidence: A Verifiable Reasoning Agent with Causal Faithfulness Analysis. arXiv, 2025.
- Yang et al. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. EMNLP, 2018.
Mark Gernitis -- gernitis@stanford.edu
Stanford CS224N: Natural Language Processing with Deep Learning, Winter 2026