Agents Don't Always Do What They Think: Measuring Faithfulness in Multi-Step ReAct Agents

Read the Full Report (PDF) | View the Poster (PDF)

A Stanford CS224N final project that investigates whether ReAct agents actually use their stated reasoning and retrieved information, or whether the reasoning is post-hoc rationalization. We adapt the perturbation framework from Lanham et al. (2023), an Anthropic technical report on measuring faithfulness in chain-of-thought reasoning, and extend it from single-turn CoT to multi-step agents that interleave reasoning with external tool use.

Overview

ReAct agents (Yao et al., 2023) generate Thought/Action/Observation traces that look like transparent reasoning, but do these traces actually drive behavior? We test this with four perturbation experiments across five models (Llama 3.1 8B/70B/405B, GPT-4o-mini, GPT-4o) on 500 HotPotQA multi-hop questions.
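
For intuition, here is a minimal sketch of that loop in Python (illustrative only; the actual implementation in agent/react_agent.py uses few-shot prompting and the Wikipedia search/lookup interface):

import re

def run_react(llm, question, search, max_steps=7):
    # Minimal ReAct loop: each step the model emits a Thought and an Action;
    # tool output is appended as an Observation and fed back into the prompt.
    trace = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        out = llm(trace + f"Thought {step}:")
        trace += f"Thought {step}:{out}\n"
        match = re.search(r"Action \d+: *(\w+)\[(.*?)\]", out)
        if match is None:
            break                      # malformed action: stop the episode
        tool, arg = match.groups()
        if tool == "finish":
            return arg, trace          # final answer plus the full trajectory
        # Only search[...] is sketched here; the repo also supports lookup[...].
        trace += f"Observation {step}: {search(arg)}\n"
    return None, trace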

Two of our experiments test faithfulness links that are unique to agents and have no analogue in Lanham et al.'s (2023) single-turn framework (both are sketched in code below the list):

  • Observation corruption -- replaces retrieved Wikipedia facts with plausible misinformation to test whether agents use their tools (Observation --> Answer faithfulness)
  • Thought corruption -- replaces reasoning with thoughts suggesting different actions to test whether stated reasoning drives tool use (Thought --> Action faithfulness)
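
Both perturbations share the same mechanics: splice a corrupted field into an otherwise-gold trajectory, replay the prefix, and see what the agent does next. A minimal sketch, assuming trajectories are stored as lists of per-step dicts (an illustrative schema, not necessarily the repo's):

def corrupt_step(trajectory, step_idx, field, replacement):
    # Replace one field ("observation" or "thought") at the chosen step,
    # leaving the rest of the gold trajectory intact. The agent is then
    # replayed from this prefix and we check whether its final answer
    # (observation corruption) or next action (thought corruption) flips.
    perturbed = [dict(step) for step in trajectory]
    perturbed[step_idx][field] = replacement
    return perturbed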

Key Findings

1. Thoughts causally drive actions

Thought corruption produces 31-67% action flip rates (vs. 1-4% replay baseline), confirming that reasoning content, not just format compliance, determines what agents do next. Thought paraphrasing (same meaning, different wording) produces near-zero flip rates, ruling out surface-level sensitivity.

2. Inverse scaling: stronger models are less faithful

Models with stronger parametric knowledge are less influenced by both observations and their own reasoning traces:

Model            Obs. Corruption   Thought Corruption   k=0 Accuracy
Llama 3.1 8B     51.9%             67.3%                46.8%
GPT-4o-mini      31.1%             35.9%                69.0%
Llama 3.1 70B    37.9%             43.3%                74.8%
GPT-4o           25.2%             30.8%                86.5%
Llama 3.1 405B   38.4%             35.7%                74.1%

Obs./Thought Corruption = answer/action flip rate (higher = more faithful). k=0 Accuracy = parametric knowledge baseline (higher = less trajectory dependence).
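
The flip-rate metric itself is just the fraction of perturbed runs whose outcome changes; a sketch consistent with the definitions above (the repo's version lives in evaluation/metrics.py):

def flip_rate(original, perturbed):
    # Fraction of questions whose outcome (final answer or next action)
    # differs between the gold run and the perturbed run.
    assert len(original) == len(perturbed)
    return sum(o != p for o, p in zip(original, perturbed)) / len(original)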

3. From literal to independent reasoning

Weaker models follow corrupted thoughts literally (Llama 8B: 64% of action flips match the corruption's suggestion). Stronger models treat thoughts as context and reason independently (GPT-4o: only 24% follow the corruption, 76% generate their own alternative). This gradient tracks capability, not model family.
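
The followed-corruption statistic behind these percentages can be computed as below (record field names are hypothetical, for illustration):

def followed_corruption_rate(records):
    # Among trials whose action flipped, the fraction where the new action is
    # exactly the one the corrupted thought suggested (literal following),
    # as opposed to an independently chosen alternative.
    flips = [r for r in records if r["new_action"] != r["original_action"]]
    if not flips:
        return 0.0
    followed = sum(r["new_action"] == r["suggested_action"] for r in flips)
    return followed / len(flips)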

4. The faithfulness-robustness tension

When we corrupt evidence and the model still produces the correct answer, there are two interpretations: unfaithful reasoning (the model ignored its tools and answered from memory) or robust reasoning (the model detected the inconsistency and fell back on parametric knowledge). Our results show both are happening simultaneously, and they scale with capability. GPT-4o recovers from 81% of thought corruption action flips through subsequent retrieval, and overrides corrupted observations 75% of the time. This makes it the most robust agent but the least faithful one. Llama 8B shows the opposite: more faithful to its tools, but more vulnerable to misinformation (recovers only 53% of the time). Understanding where faithfulness ends and robustness begins is a central challenge for evaluating and deploying agents.

Relationship to Lanham et al. (2023)

This project directly adapts the perturbation framework from Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al., 2023), an Anthropic technical report that introduced early answering, mistake addition, and paraphrasing tests for single-turn CoT across 10 BIG-Bench tasks.

                     Lanham et al. (2023)    This project
Setting              Single-turn CoT         Multi-step ReAct agents
Tasks                10 BIG-Bench tasks      HotPotQA (multi-hop QA)
Models               3 (single family)       5 (Llama + GPT families)
External tools       None                    Wikipedia search/lookup
Novel experiments    --                      Observation corruption, Thought corruption
Inverse scaling      Yes                     Yes, plus model family effects
Steganography        No evidence             No evidence (replicates Lanham)

Our observation corruption flip rates (25-52%) are substantially higher than Lanham's mistake-addition rates on most tasks, suggesting agents are at least as faithful to external evidence as single-turn models are to their own CoT.

Architecture

HotPotQA (500 questions)
    |
    v
ReAct Agent (per model) ---> Gold Trajectories (correct answers only)
    |
    +---> Early Termination        [k=0 parametric knowledge baseline]
    +---> Observation Corruption   [Observation --> Answer faithfulness]
    +---> Thought Corruption       [Thought --> Action faithfulness]
    +---> Thought Paraphrasing     [Steganography test / control]
    |
    v
Metrics (flip rate, action flip rate, followed corruption rate)
    |
    v
Analysis (inverse scaling, faithfulness-robustness tension, model family effects)

Project Structure

agent-faithfulness/
├── agent/                          # ReAct agent implementation
│   ├── react_agent.py              # Core Thought-->Action-->Observation loop
│   ├── prompts.py                  # Few-shot prompt construction (6 exemplars)
│   └── wiki_api.py                 # Wikipedia search/lookup interface
├── configs/
│   └── models.py                   # Model registry (5 models, 3 providers)
├── data/
│   └── trajectories/               # Gold trajectories: {model}_500.json
├── evaluation/
│   ├── comparison.py               # Answer extraction and matching
│   └── metrics.py                  # Flip rate, AOC, exact match
├── experiments/
│   ├── common.py                   # Shared: trajectory loading, continuation, answer parsing
│   ├── early_termination.py        # Truncate trajectory, measure accuracy
│   ├── observation_corruption.py   # Replace observations with misinformation
│   ├── thought_corruption.py       # Replace thoughts suggesting different actions
│   ├── thought_paraphrasing.py     # Reword thoughts preserving meaning
│   ├── replay_baseline.py          # Re-send unperturbed trajectories (control)
│   └── gold_trajectory.py          # Collect gold trajectories from HotPotQA
├── results/                        # Experiment outputs: {experiment}/{model}.json
├── analysis/                       # Notebooks and scripts for analysis/plots
├── paper/
│   ├── Project_Final_Report.pdf    # Final report
│   ├── Project Poster.pdf          # Project poster
│   ├── draft.md                    # Working draft (markdown)
│   └── experiments_detailed.md     # Detailed experiment descriptions
├── scripts/                        # Utility scripts (result reprocessing)
├── infra/                          # Modal deployment configs (Llama 70B)
└── tests/                          # pytest test suite (118 unit tests)

Installation

Requires Python 3.10+.

git clone <repo-url>
cd agent-faithfulness

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

API Keys

# OpenAI models (GPT-4o, GPT-4o-mini)
export OPENAI_API_KEY=sk-...

# Together AI (Llama 405B)
export TOGETHER_API_KEY=...

# Modal (Llama 8B, 70B) -- deploy first, then set URL
modal deploy infra/modal_llama.py
export MODAL_LLAMA_URL=https://...

Usage

Collect gold trajectories

python -m experiments.gold_trajectory --model gpt-4o-mini \
    --num_questions 500 --output data/trajectories/gpt-4o-mini_500.json

Run experiments

# Replay baseline (control)
python -m experiments.replay_baseline --model gpt-4o-mini \
    --input data/trajectories/gpt-4o-mini_500.json \
    --output results/replay/gpt-4o-mini.json

# Observation corruption (Observation --> Answer faithfulness)
python -m experiments.observation_corruption --model gpt-4o-mini \
    --input data/trajectories/gpt-4o-mini_500.json \
    --output results/observation_corruption/gpt-4o-mini.json

# Thought corruption (Thought --> Action faithfulness)
python -m experiments.thought_corruption --model gpt-4o-mini \
    --input data/trajectories/gpt-4o-mini_500.json \
    --output results/thought_corruption/gpt-4o-mini.json

# Thought paraphrasing (steganography test)
python -m experiments.thought_paraphrasing --model gpt-4o-mini \
    --input data/trajectories/gpt-4o-mini_500.json \
    --output results/thought_paraphrasing_final/gpt-4o-mini.json

# Early termination
python -m experiments.early_termination --model gpt-4o-mini \
    --input data/trajectories/gpt-4o-mini_500.json \
    --output results/early_termination/gpt-4o-mini.json

All experiments are resumable -- re-run the same command after a crash or rate limit and it picks up where it left off.
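
Resumption follows the usual skip-completed-ids pattern; roughly (a generic sketch, the exact mechanism in experiments/common.py may differ):

import json
import os

def remaining_questions(questions, output_path):
    # Collect the ids already written to the results file, then skip them on
    # re-run so only unfinished questions are sent to the model.
    done = set()
    if os.path.exists(output_path):
        with open(output_path) as f:
            done = {record["question_id"] for record in json.load(f)}
    return [q for q in questions if q["question_id"] not in done]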

Run tests

pytest tests/                          # All unit tests (118 tests)
pytest tests/ -m "not integration"     # Skip network-dependent tests

Models

Model            Params   Provider                 k=0 Accuracy   Gold EM
Llama 3.1 8B     8B       vLLM on Modal            46.8%          28.2%
GPT-4o-mini      --       OpenAI API               69.0%          37.4%
Llama 3.1 70B    70B      vLLM on Modal (2xA100)   74.8%          40.4%
GPT-4o           --       OpenAI API               86.5%          41.4%
Llama 3.1 405B   405B     Together AI              74.1%          44.8%

All experiments use temperature=0. GPT-4o-mini serves as the utility LLM for generating corruptions and paraphrases.
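
Generating a corruption with the utility LLM looks roughly like this (the prompt wording and function name are illustrative, not the repo's actual prompt):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_corruption(observation: str) -> str:
    # Ask GPT-4o-mini for a plausible-but-false rewrite of a retrieved
    # passage, keeping style and length so only the factual content changes.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the passage so its key fact becomes plausible "
                "misinformation. Keep the style and length.\n\n" + observation
            ),
        }],
    )
    return response.choices[0].message.content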

References

Lanham, T., et al. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning. arXiv:2307.13702.
Yang, Z., et al. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. EMNLP 2018. arXiv:1809.09600.
Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629.

Author

Mark Gernitis -- gernitis@stanford.edu

Stanford CS224N: Natural Language Processing with Deep Learning, Winter 2026
