Disable npm audit to prevent ETIMEDOUT build failures #66465

Merged
wtgodbe merged 9 commits into dotnet:main from wtgodbe:wtgodbe/disable-npm-audit on Apr 25, 2026

@wtgodbe wtgodbe commented Apr 24, 2026

Problem

npm ci in npm 11.x runs npm audit by default and treats audit network failures as hard errors. The AzDO feed's /npm/v1/security/advisories/bulk endpoint intermittently times out, causing builds to fail with exit code -1 and:

FetchError: request to https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-public-npm/npm/registry/-/npm/v1/security/advisories/bulk failed, reason: ETIMEDOUT

This was captured by the verbose diagnostics added in #66447 (build 1394430). The diagnostics confirmed:

  • All packages restored from cache successfully
  • Disk space was fine (86GB free)
  • The failure is solely the audit POST timing out

What this fixes

This fixes the exit code -1 / ETIMEDOUT npm failures caused by audit timeouts. It does not fix the separate exit code 57005 (0xDEAD) failures, which have a different unknown root cause that we haven't yet captured with diagnostics.
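For triage, the two failure signatures described above can be told apart mechanically. The following is a minimal sketch, not part of this PR: the function name and category labels are hypothetical, and it assumes the caller already has the npm exit code and the captured output as a string.

```python
# Hypothetical triage helper: the names and category labels are illustrative only.
AUDIT_TIMEOUT = "audit-timeout"          # exit code -1 with ETIMEDOUT on the audit endpoint
EXIT_57005 = "exit-57005-unexplained"    # separate failure, root cause not yet captured
UNKNOWN = "unknown"

AUDIT_ENDPOINT_FRAGMENT = "/npm/v1/security/advisories/bulk"


def classify_npm_failure(exit_code: int, output: str) -> str:
    """Map an npm exit code plus captured output to one of the categories above."""
    if exit_code == 0xDEAD:  # 0xDEAD is 57005 in decimal
        return EXIT_57005
    if exit_code == -1 and "ETIMEDOUT" in output and AUDIT_ENDPOINT_FRAGMENT in output:
        return AUDIT_TIMEOUT
    return UNKNOWN
```

Only the first category is addressed by disabling the audit; anything classified as 57005 still needs separate investigation.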

Fix

Add audit=false to .npmrc. The security audit is non-essential for CI builds.
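A CI guard could verify that the setting stays in place. This is a rough sketch under stated assumptions: it handles only simple `key=value` lines and `#` or `;` comments, not the full .npmrc syntax, and the helper names are hypothetical.

```python
def parse_npmrc(text: str) -> dict[str, str]:
    """Parse simple key=value lines from .npmrc content (simplified, not the full format)."""
    settings: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def audit_disabled(text: str) -> bool:
    """True when the content explicitly sets audit=false; a missing key counts as enabled."""
    return parse_npmrc(text).get("audit", "true").lower() == "false"


sample = "; CI settings\naudit=false\n"
print(audit_disabled(sample))
```

Treating a missing key as "enabled" is deliberate here, so the guard would fail if the line were ever dropped from the file.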

Relates to #62807

Copilot AI review requested due to automatic review settings April 24, 2026 19:39
@github-actions github-actions Bot added the needs-area-label Used by the dotnet-issue-labeler to label those issues which couldn't be triaged automatically label Apr 24, 2026
Copilot AI left a comment

Pull request overview

Disables npm audit during installs by setting audit=false in the repo’s root .npmrc, preventing intermittent CI failures caused by timeouts when npm 11.x tries to reach the AzDO feed’s security advisories endpoint.

Changes:

  • Add audit=false to .npmrc so npm ci no longer performs audit requests during installs.

@wtgodbe wtgodbe force-pushed the wtgodbe/disable-npm-audit branch from 328f880 to a18c084 Compare April 24, 2026 20:19
@wtgodbe wtgodbe requested a review from a team as a code owner April 24, 2026 20:19
@wtgodbe wtgodbe force-pushed the wtgodbe/disable-npm-audit branch from a18c084 to 540da99 Compare April 24, 2026 20:57
wtgodbe commented Apr 24, 2026

Investigation update — 57005 is Windows-only

We've confirmed that exit code 57005 occurs exclusively on Windows agents (windows.vs2026preview.scout.amd64.open). Diagnostics from build 1395670 show:

  • Disk: plenty free (147GB on C:, 89GB on D:)
  • Windows Defender: RealTimeProtectionEnabled is blank (likely disabled)
  • npm debug log: ends abruptly mid-extraction — no error, no exit line
  • The node.exe process is killed externally during concurrent tarball extraction

This PR now also adds NODE_OPTIONS=--report-uncaught-exception, which will generate a Node.js diagnostic report (a JSON file) if Node.js crashes, giving us the signal, stack trace, and heap state.

Additional diagnostics to try if the .report file doesn't yield answers

  1. Windows Event Log — add Get-WinEvent -FilterHashtable @{LogName='Application';Level=1,2;StartTime=(Get-Date).AddMinutes(-5)} -MaxEvents 10 after failure to check if the OS/job object killed the process
  2. Process exit monitoring — wrap npm ci in PowerShell to capture the actual exit source (cmd.exe vs node.exe vs npm)
  3. Reduce concurrency — try npm ci --maxsockets=10 to see if limiting parallel extractions avoids the crash (would confirm a concurrency bug in Node/npm on Windows)
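The process exit monitoring idea in item 2 can be sketched as a thin wrapper. The list suggests PowerShell; this sketch uses Python instead, purely for illustration. It records only the exit code of the direct child process, so it cannot by itself distinguish cmd.exe from node.exe, and the function name and result keys are hypothetical.

```python
import subprocess
import sys
import time


def run_and_record(cmd: list[str]) -> dict:
    """Run a command and record its exit code, duration, and the tail of stderr."""
    started = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.monotonic() - started
    code = proc.returncode
    return {
        "exit_code": code,
        # Mask to 32 bits so a code such as 57005 shows up as 0xdead.
        "exit_code_hex": hex(code & 0xFFFFFFFF),
        "seconds": elapsed,
        "stderr_tail": proc.stderr[-2000:],
    }


if __name__ == "__main__":
    # Stand-in child process; in CI this would be the npm ci invocation.
    result = run_and_record([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(result["exit_code"], result["exit_code_hex"])
```

Logging the hex form alongside the decimal code makes sentinel values like 0xDEAD easier to spot in build output.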

@wtgodbe wtgodbe force-pushed the wtgodbe/disable-npm-audit branch from 540da99 to bc1960e Compare April 24, 2026 21:11
Add audit=false to .npmrc to prevent ETIMEDOUT failures when the
AzDO feed's security advisories endpoint times out (exit code -1).

Add NODE_OPTIONS=--report-uncaught-exception to generate a
Node.js diagnostic report (a JSON file) if the node process crashes
during npm ci. This will help diagnose the Windows-only exit code
57005 (0xDEAD) failures where the process is killed mid-extraction
with no error output.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wtgodbe wtgodbe force-pushed the wtgodbe/disable-npm-audit branch from bc1960e to 471a9a4 Compare April 24, 2026 21:11
- Add maxsockets=10 to .npmrc to reduce resource pressure during extraction
- After npm ci failure on Windows, check:
  - Windows Event Log for application errors in the last 5 minutes
  - WER crash dump folder for recent .dmp files
  - Node.js diagnostic report directory for crash reports

Investigating exit code 57005 (0xDEAD) crashes that only occur on Windows CI agents.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wtgodbe commented Apr 24, 2026

Latest push: maxsockets + Windows crash diagnostics

Concurrency reduction

Added maxsockets=10 to .npmrc — this limits npm to 10 concurrent HTTP connections (default is 15). If the 57005 crash is caused by resource exhaustion during parallel tarball extraction (file handles, memory spikes from decompression), this should help.

New Windows crash diagnostics

After npm ci failure on Windows, we now automatically check:

  1. Windows Event Log — Application error events in last 5 minutes (catches OS-level crash reports)
  2. WER crash dumps — Checks %LOCALAPPDATA%\CrashDumps and %TEMP% for recent .dmp files
  3. Node.js diagnostic reports — Checks artifacts/log/node-reports/ for JSON crash reports from --report-uncaught-exception

Key finding from log analysis

Both 57005 failures produced identical log file sizes (518,908 bytes, 2802 lines) despite different timestamps and packages. The npm debug log is truncated mid-extraction at the same byte offset. This suggests the process is being killed at a deterministic resource threshold, not at a random point.
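The comparison behind this finding can be reproduced on any set of downloaded debug logs. A minimal sketch, assuming the logs are available as local files; the helper names are hypothetical and the check looks only at byte size and newline count, not at content.

```python
from pathlib import Path


def log_fingerprint(path: str) -> tuple[int, int]:
    """Return (size in bytes, number of newline characters) for a log file."""
    data = Path(path).read_bytes()
    return len(data), data.count(b"\n")


def truncated_at_same_point(paths: list[str]) -> bool:
    """True when every log has the same byte size and the same line count."""
    fingerprints = {log_fingerprint(p) for p in paths}
    return len(fingerprints) == 1
```

Matching sizes across runs with different timestamps are what make the truncation look deterministic; a content diff would be the natural next check.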

If maxsockets doesn't fix it, next steps

  • Add process memory monitoring (background PowerShell job polling WorkingSet64)
  • Try --max-old-space-size=4096 to increase Node.js heap limit
  • Try npm ci --foreground-scripts to reduce worker thread parallelism
  • Investigate if the CI agent's job object has a memory limit

wtgodbe commented Apr 24, 2026

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

The &quot; XML entities for double quotes inside the Command attribute
were closing the MSBuild attribute early. Use string concatenation
with single quotes instead to avoid nested double quotes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wtgodbe commented Apr 24, 2026

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

wtgodbe commented Apr 25, 2026

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

wtgodbe and others added 2 commits April 24, 2026 17:53
Node.js v24.15.0 (released Apr 16) causes intermittent native crashes
in node.exe during npm ci tarball extraction on Windows CI agents.
The crash produces exit code 57005 (0xDEAD) with a Windows Event Log
'Application Error' entry confirming a fault in node.exe.

The first 57005 occurrence (build 1381847) appeared ~4 hours after the
v24.15.0 release. Zero crashes occurred on v24.14.1.

Pin to 24.14.1 until the upstream issue is identified and fixed.
Also update cache keys to avoid stale node_modules from 24.x builds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
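A build script could refuse to run on the affected release as a safety net alongside the pin. This is a sketch only: it assumes the version string comes from `node --version` (for example `v24.15.0`), and the constant and function names are hypothetical.

```python
# Versions taken from the commit message above; names are illustrative.
KNOWN_BAD = (24, 15, 0)
PINNED = (24, 14, 1)


def parse_node_version(text: str) -> tuple[int, ...]:
    """Turn a string such as 'v24.15.0' into a tuple such as (24, 15, 0)."""
    parts = text.strip().lstrip("v").split(".")
    return tuple(int(p) for p in parts[:3])


def is_known_bad(text: str) -> bool:
    """True when the reported Node.js version is the release linked to the crashes."""
    return parse_node_version(text) == KNOWN_BAD


print(is_known_bad("v24.15.0"), is_known_bad("v24.14.1"))
```

The check matches one exact version rather than a range, since only v24.15.0 is implicated here and later releases may or may not carry the same fault.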
@github-actions github-actions Bot added area-infrastructure Includes: MSBuild projects/targets, build scripts, CI, Installers and shared framework and removed needs-area-label Used by the dotnet-issue-labeler to label those issues which couldn't be triaged automatically labels Apr 25, 2026
The post-job cache save fails when node_modules doesn't exist
(e.g., after a build failure before npm ci runs). This shouldn't
fail the overall job.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wtgodbe commented Apr 25, 2026

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

wtgodbe commented Apr 25, 2026

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

wtgodbe commented Apr 25, 2026

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

wtgodbe and others added 3 commits April 24, 2026 19:47
Root cause identified: Node.js v24.15.0 native crash on Windows.
Fixed by pinning to 24.14.1. Remove all temporary diagnostics:
- Verbose npm ci output
- Environment logging (node/npm version, disk, registry)
- npm debug log capture and artifact upload
- Windows Event Log / WER / Node.js report checks
- NODE_OPTIONS=--report-uncaught-exception

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Code check doesn't need Node.js. This also avoids the Cache
node_modules post-job failure when node_modules doesn't exist.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wtgodbe wtgodbe merged commit c4db230 into dotnet:main Apr 25, 2026
25 checks passed
@dotnet-policy-service dotnet-policy-service Bot added this to the 11.0-preview5 milestone Apr 25, 2026
Labels

area-infrastructure Includes: MSBuild projects/targets, build scripts, CI, Installers and shared framework

3 participants