Add CLAUDE.md with AI agent instructions and quick reference#4195

Merged
supriyar merged 9 commits into main from gh/supriyar/1/head
Mar 31, 2026
Conversation

supriyar (Contributor) commented Mar 27, 2026

Stack from ghstack (oldest at bottom):

Replace empty placeholder with structured documentation for AI coding
assistants (Claude Code, Cursor, Copilot). Includes config class table,
granularity reference, deprecated API warnings, and pointers to in-repo
docs for architecture details.

Comparison: Old CLAUDE.md vs New CLAUDE.md
Instructions+Scripts for repro available in https://github.com/supriyar/torchao-eval
Setup:

  • Subject model: Claude Sonnet
  • Judge model: Claude Opus

Sonnet/Opus Results (61 prompts, final)


pytorch-bot commented Mar 27, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4195

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 10 Pending

As of commit 15beca9 with merge base 79159f2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

supriyar added a commit that referenced this pull request Mar 27, 2026

ghstack-source-id: e0fe747
Pull Request resolved: #4195
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 27, 2026
@supriyar supriyar added the module: not user facing Use this tag if you don't want this PR to show up in release notes label Mar 27, 2026
Comparison: Old CLAUDE.md vs New CLAUDE.md
Setup:
  - Subject model: Claude Haiku (weaker model, more likely to show improvement from context)
  - Judge model: Claude Sonnet (scores responses 0-3 against rubrics)
  - Prompts: 48 questions across 12 categories (getting started, config classes, float8 training, QAT, sparsity, optimizers, architecture, integrations, development, use cases, gotchas, comparisons)
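
The subject/judge loop described in this setup can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the actual harness: `run_subject`, `run_judge`, and the rubric format here are assumptions; the real scripts live in the torchao-eval repo linked in this thread.

```python
# Hypothetical sketch of the LLM-as-judge eval loop: a subject model
# answers each prompt, and a judge model grades the answer 0-3 against
# a rubric. run_subject/run_judge are stand-ins for model API calls.

def score_prompts(prompts, rubrics, run_subject, run_judge):
    """Collect a 0-3 judge score for every prompt."""
    scores = []
    for prompt, rubric in zip(prompts, rubrics):
        answer = run_subject(prompt)               # subject model answers
        raw = run_judge(prompt, answer, rubric)    # judge grades 0-3
        scores.append(max(0, min(3, int(raw))))    # clamp to rubric range
    return scores

def summarize(scores):
    """Mean score plus counts of perfect (3) and wrong (0) answers."""
    return {
        "mean": sum(scores) / len(scores),
        "perfect": scores.count(3),
        "wrong": scores.count(0),
    }
```

Running the same prompt set twice, once with the old CLAUDE.md and once with the new one as system context, and diffing the two `summarize` outputs gives the before/after numbers quoted below.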

[results table image]
Prompts that went from 0 to 3 (completely wrong to perfect):
  - MXFP8 dense training, MXFP8 MoE training, NVFP4 inference - all 0->3
  - ExecuTorch, sparsity, PyTorch version, config comparison - all 0->3
  - Int4WeightOnlyConfig vs Int8DynamicActivation difference - 0->3
  - torchao vs bitsandbytes comparison - 1->3

+32% improvement; wrong answers dropped from 13 to 2. The new CLAUDE.md has the biggest impact on architecture questions, MX/NVFP4 formats, and comparison questions: the areas where the old empty CLAUDE.md gave the model nothing.
CLAUDE.md Outdated
pytest test/prototype/mx_formats/
```

## Coding Style
Contributor:
I think we should set up hooks for running pre-commit/linters: https://code.claude.com/docs/en/hooks, this gets code style for free without spending tokens
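
For illustration, a `.claude/settings.json` fragment along these lines could run the repo's linters after every file edit. This is a sketch based on the hooks docs linked above; the matcher and command here are assumptions, not a tested config:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "pre-commit run --all-files"
          }
        ]
      }
    ]
  }
}
```

With something like this in place, formatting happens deterministically on each edit instead of relying on the model to remember style rules.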

Contributor:
you mean pre-commit run? we do have these I think

Contributor (Author):
I think Vasiliy is referring to claude code hooks for auto-format on edit? That can be a follow-up PR.

Will remove lint instructions from here


## Commit Messages

- Do not commit without explicit request from the user
Contributor:
I have this in my personal CLAUDE.md, but this seems specific to personal preference? maybe leave out of repo-wide one?

Contributor (Author):
I had originally added this since I saw it in the pytorch/pytorch CLAUDE.md file. vLLM also has similar instructions in its repo for agent-authored commits.

Contributor:
if pytorch and vllm have it, makes sense, thanks


vkuzo commented Mar 31, 2026

> +32% improvement; wrong answers dropped from 13 to 2. The new CLAUDE.md has the biggest impact on architecture questions, MX/NVFP4 formats, and comparison questions - the areas where the old empty CLAUDE.md gave the model nothing

I think this is a good way to eval. Can we check the eval into source control so it's easy to iterate on and measure future improvements? Doesn't have to be in torchao, can be in separate repo if that's easier.


vkuzo commented Mar 31, 2026

also, for the eval, should we just use Opus 4.6 for everything? IMO better to optimize for the best available model, and even doing this things can go out of date really quickly. I'm not sure it's worth spending a lot of time on evals of older models

supriyar (Contributor Author) commented:

> also, for the eval, should we just use Opus 4.6 for everything? IMO better to optimize for the best available model, and even doing this things can go out of date really quickly. I'm not sure it's worth spending a lot of time on evals of older models

Sonnet/Opus Results (61 prompts, final)


I ran both Opus/Opus and Sonnet/Opus on 61 prompts. The issue with Opus as subject is that it times out on ~26% of prompts (16/61) with the 56K-char system prompt, even with a 180s timeout. Opus generates verbose responses that exceed the time limit; timeouts score 0 and corrupt the data.

Sonnet as subject had zero timeouts and scored comparably on the prompts where Opus didn't time out. Results with Sonnet subject + Opus judge: 2.51 -> 2.74 (+9%), with 47/61 perfect scores with the new CLAUDE.md vs 37/61 without.
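
As a sanity check, the reported delta follows from the two mean scores quoted above:

```python
# Re-deriving the reported improvement from the quoted mean scores.
old_mean, new_mean = 2.51, 2.74
gain = (new_mean - old_mean) / old_mean
print(f"{gain:+.0%}")  # +9%
```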

Eval repro: https://github.com/supriyar/torchao-eval

@supriyar supriyar changed the base branch from gh/supriyar/1/base to main March 31, 2026 23:21
supriyar (Contributor Author) commented:

@pytorchbot merge

pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

  • Facebook CLA Check

Dig deeper by viewing the pending checks on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: superuser

@supriyar supriyar merged commit 7c1b138 into main Mar 31, 2026
35 of 38 checks passed