[STG-2173] feat(evals): add EDGAR 10-Q multi-company extraction agent eval by shrey150 · Pull Request #2191 · browserbase/stagehand

shrey150 · 2026-06-04T03:54:12Z

What

Adds a long-horizon agent eval to packages/evals: agent/edgar_10q.

For three companies (Snowflake / Datadog / MongoDB) the agent must:

1. Find each company's most recent 10-Q on SEC EDGAR.
2. Open the actual primary document (not the filing index/cover/exhibit).
3. Extract quarterly revenue, YoY growth, RPO, and the top risk factor.
4. Return a 3-company comparison table.

It exercises long-horizon navigation (unknown number of pages, nested filing index → primary doc) plus multi-document synthesis.

Why

We needed a realistic long-horizon eval to measure task closure: does the agent actually finish and return the synthesized result, rather than just doing the work? Scoring is objective: the agent's final answer must contain the correct quarterly revenue for all three companies (ground truth verified against SEC XBRL, data.sec.gov/api/xbrl/companyconcept). Scoring the final answer (not intermediate tool output) is deliberate. An agent that extracts the data but reports only "task complete" has not finished the job, and final-answer scoring catches that.
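To make the final-answer scoring concrete, here is a minimal sketch of such a check. It is not the scorer in this PR: the function names, the unit parsing, and the 0.5% tolerance are all assumptions made for illustration. Only the three revenue figures come from the PR description.

```typescript
// Ground-truth quarterly revenue in thousands of USD, as quoted in this PR
// (a dated snapshot, verified 2026-06).
const GROUND_TRUTH_REVENUE_K: Record<string, number> = {
  SNOW: 1390951,
  DDOG: 1006426,
  MDB: 687616,
};

// Hypothetical unit table: converts a unit suffix into thousands of USD.
const UNIT_TO_THOUSANDS: Record<string, number> = {
  k: 1,
  thousand: 1,
  m: 1000,
  million: 1000,
  b: 1000000,
  billion: 1000000,
};

// Pull every numeric figure out of free text, normalized to thousands of USD.
function extractAmountsInThousands(text: string): number[] {
  const pattern = /(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion|k|m|b)?\b/gi;
  const amounts: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const value = Number(match[1].replace(/,/g, ""));
    const unit = match[2] ? match[2].toLowerCase() : undefined;
    // Assumption: a figure with no unit suffix is raw dollars.
    amounts.push(unit ? value * UNIT_TO_THOUSANDS[unit] : value / 1000);
  }
  return amounts;
}

// Pass only if every company's revenue appears somewhere in the final answer,
// within a relative tolerance that allows for rounding (assumed: 0.5%).
function scoreFinalAnswer(
  finalAnswer: string,
  tolerance: number = 0.005,
): { passed: boolean; missing: string[] } {
  const amounts = extractAmountsInThousands(finalAnswer);
  const missing = Object.keys(GROUND_TRUTH_REVENUE_K).filter((ticker) => {
    const expected = GROUND_TRUTH_REVENUE_K[ticker];
    return !amounts.some((a) => Math.abs(a - expected) / expected <= tolerance);
  });
  return { passed: missing.length === 0, missing };
}
```

Under these assumptions, an answer of just "task complete" fails with all three tickers missing, while an answer that states the three figures passes. The sketch does not verify that each figure sits next to the right company name, so a stricter scorer would match per row.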

Notes for reviewers

- Follows the `defineBenchTask` pattern (auto-discovered from `tasks/bench/agent/`, no registration).
- No changeset: `@browserbasehq/stagehand-evals` is `private: true` (not published).
- Ground-truth drift: the task targets the "MOST RECENT" 10-Q, so the hardcoded figures are a dated snapshot (verified 2026-06). They must be refreshed when newer filings post. Alternative: pin the instruction to specific filing periods so ground truth never drifts. Happy to switch if preferred (flagged in a code comment).
- Draft for review.
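One way to make the drift risk visible is to store the verification date next to the figures and flag the snapshot once it is older than roughly one filing cycle. The sketch below is an illustration only and is not part of this PR: the type, the function names, and the 90-day threshold are assumptions, and the day of month in `verifiedOn` is a placeholder because the PR only states 2026-06.

```typescript
// Hypothetical shape for a dated ground-truth snapshot.
interface GroundTruthSnapshot {
  verifiedOn: string; // ISO date; the day of month here is a placeholder
  revenueThousandsUsd: Record<string, number>;
}

const SNAPSHOT: GroundTruthSnapshot = {
  verifiedOn: "2026-06-01",
  revenueThousandsUsd: { SNOW: 1390951, DDOG: 1006426, MDB: 687616 },
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days elapsed between verification and `now`.
function snapshotAgeDays(snapshot: GroundTruthSnapshot, now: Date): number {
  const verified = new Date(snapshot.verifiedOn).getTime();
  return Math.floor((now.getTime() - verified) / MS_PER_DAY);
}

// 10-Qs are filed quarterly, so a snapshot older than about one quarter
// (assumed threshold: 90 days) has probably been superseded by a newer filing.
function isLikelyStale(
  snapshot: GroundTruthSnapshot,
  now: Date,
  maxAgeDays: number = 90,
): boolean {
  return snapshotAgeDays(snapshot, now) > maxAgeDays;
}
```

A check like this only warns that a refresh is probably due. Pinning the instruction to specific filing periods, as suggested above, removes the drift rather than detecting it.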

E2E Test Matrix

| Command / flow | Observed output | Confidence / sufficiency |
| --- | --- | --- |
| `evals run agent/edgar_10q -e browserbase -t 5 --agent-mode hybrid` (Opus 4.8) | 5/5 passed (100%); correct revenue for SNOW/DDOG/MDB in every trial | Proves the task runs end-to-end on a live Browserbase cloud session and the scorer validates real output |
| Same task, Opus 4.7 vs 4.8 (cumulative across batches) | 4.8 task-closure 9/9; 4.7 ~2/8 | Proves the eval discriminates: it separates "extracted the data" from "returned the answer," which is its purpose |
| `pnpm build:esm` then `evals list` | exit 0; `agent/edgar_10q` auto-discovered | Compiles cleanly and registers with no config edit |

Ground-truth revenue independently confirmed against SEC XBRL (SNOW $1,390,951K, DDOG $1,006,426K, MDB $687,616K).
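For anyone refreshing these figures, the sketch below shows one way to query the SEC companyconcept endpoint and pick out the latest single-quarter value. It is a hedged illustration, not code from this PR: the helper names and the 100-day cutoff are assumptions, and the CIK and us-gaap tag are left as parameters because the right revenue tag can differ by company. The filter on period length matters because a 10-Q can report year-to-date totals alongside the three-month figure.

```typescript
// Subset of the fields on each fact in the companyconcept response
// (found under units.USD in the returned JSON).
interface XbrlFact {
  start?: string; // period start, present for duration facts such as revenue
  end: string; // period end (ISO date)
  val: number; // reported value in USD
  form: string; // e.g. "10-Q" or "10-K"
  filed: string; // filing date (ISO date)
}

// Build the companyconcept URL; the API expects a zero-padded 10-digit CIK.
// Requests to data.sec.gov should also send a descriptive User-Agent header.
function buildCompanyConceptUrl(cik: number, tag: string): string {
  const paddedCik = ("0000000000" + String(cik)).slice(-10);
  return (
    "https://data.sec.gov/api/xbrl/companyconcept/CIK" +
    paddedCik +
    "/us-gaap/" +
    tag +
    ".json"
  );
}

// Return the most recent single-quarter fact reported on a 10-Q.
// Facts spanning more than ~100 days are treated as year-to-date and skipped.
function latestQuarterlyFact(facts: XbrlFact[]): XbrlFact | undefined {
  const MS_PER_DAY = 24 * 60 * 60 * 1000;
  const quarterly = facts.filter((fact) => {
    if (fact.form !== "10-Q" || fact.start === undefined) return false;
    const days = (Date.parse(fact.end) - Date.parse(fact.start)) / MS_PER_DAY;
    return days <= 100;
  });
  // ISO dates sort lexicographically; newest period end first, then newest filing.
  quarterly.sort((a, b) => {
    if (a.end !== b.end) return a.end < b.end ? 1 : -1;
    if (a.filed !== b.filed) return a.filed < b.filed ? 1 : -1;
    return 0;
  });
  return quarterly[0];
}
```

The returned `val` is in whole dollars, so divide by 1,000 before comparing it with the figures above, which are quoted in thousands.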

Linear: STG-2173

🤖 Generated with Claude Code

Summary by cubic

Adds agent/edgar_10q, a long-horizon eval that tests multi-company 10‑Q extraction from SEC EDGAR and scores agents on returning the correct quarterly revenue. Addresses Linear STG-2173 by measuring task closure with objective, final-answer scoring.

New Features
- Added packages/evals/tasks/bench/agent/edgar_10q.ts, auto-discovered via defineBenchTask.
- Requires finding each company’s most recent 10‑Q, opening the primary document, extracting revenue, YoY growth, RPO, and top risk; outputs a 3‑company table for SNOW, DDOG, and MDB.
- Scorer checks the final answer for correct revenue across all three (ground truth verified against SEC XBRL; figures are a 2026‑06 snapshot).

Written for commit 59501a7. Summary will update on new commits.

Long-horizon agent eval: for SNOW/DDOG/MDB, find the most recent 10-Q on EDGAR, open the primary document, extract revenue/YoY/RPO/top risk, and return a comparison table. Objective scoring requires the correct quarterly revenue for all three companies in the final answer (ground truth verified against SEC XBRL). Useful for measuring long-horizon task closure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

changeset-bot · 2026-06-04T03:54:17Z

⚠️ No Changeset found

Latest commit: 59501a7

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


shrey150 wants to merge 1 commit into main from shrey/edgar-10q-eval
