
RunPod infrastructure costs: charged $8.65 for pods that never ran training #821

@oleksiivinogradov


Problem

Participants using RunPod to evaluate their submissions are being charged for GPU provisioning/boot time, even when pods never reach a usable state. This makes iterating on submissions prohibitively expensive for individual contributors.

My Experience

Across five pod-creation attempts over two days (March 25-26, 2026), I was charged $8.65 without completing a single training run:

| Pod ID | Cloud Tier | Timeout | Result |
| --- | --- | --- | --- |
| dh821zsbo1s1ee | SECURE | ~15 min (manual kill) | Never booted |
| vrz82l0ml9qans | SECURE | ~15 min (manual kill) | Never booted |
| tx6ibgui70rl5u | ALL | ~15 min (manual kill) | Never booted |
| owotqbnqk6el4v | ALL | 120 s (auto-kill) | Never booted |
| gesb3y7hq454zq | ALL | 180 s (auto-kill) | Never booted |
  • Starting balance: $25.00
  • Ending balance: $16.35
  • Pods that completed training: 0 out of 5
  • Current spend rate: $0.00/hr (confirmed no pods running)
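For scale, the per-attempt cost implied by the numbers above can be worked out directly. A minimal sketch; the per-attempt billed time is inferred from the $21.52/hr on-demand rate, not confirmed against RunPod's actual billing breakdown:

```python
# Rough cost accounting for the five failed pod launches (figures from this report).
starting_balance = 25.00
ending_balance = 16.35
attempts = 5
hourly_rate = 21.52  # $/hr on-demand for 8x H100 SXM, as quoted by RunPod

total_charged = starting_balance - ending_balance            # $8.65
cost_per_attempt = total_charged / attempts                  # $1.73
billed_minutes_per_attempt = cost_per_attempt / hourly_rate * 60  # roughly 4.8 min

print(f"Total charged: ${total_charged:.2f}")
print(f"Per failed attempt: ${cost_per_attempt:.2f} "
      f"(~{billed_minutes_per_attempt:.1f} min of billed time)")
```

If billing really does start at provisioning, each attempt burned roughly five minutes of billed 8xH100 time before the pod was even reachable.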

Key Issue

RunPod's own API reported stockStatus: "High" for 8xH100 SXM at the time of each launch, yet pods consistently failed to boot within 3 minutes. Billing begins during the "provisioning" phase, before the container is usable, so participants are charged for infrastructure they never actually use.

πŸ” H100 SXM x8 β†’ Stock: 🟒 High ($21.52/hr on-demand)

Despite "High" availability, no pod ever reached SSH-ready state.
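The auto-kill timeouts in the table above were enforced by a simple watchdog; a generic sketch is below. The `get_status` and `terminate` callables are hypothetical stand-ins for whatever API client you use (e.g. RunPod's pod query/terminate calls), not part of any official SDK:

```python
import time

def watch_pod(get_status, terminate, timeout_s=180, poll_s=5,
              clock=time.monotonic, sleep=time.sleep):
    """Poll a pod until it is usable or the boot timeout elapses.

    get_status: () -> str, e.g. "PROVISIONING" or "RUNNING" (assumed states)
    terminate:  () -> None, kills the pod so billing stops
    Returns True if the pod became usable, False if it was auto-killed.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if get_status() == "RUNNING":
            return True
        sleep(poll_s)
    terminate()  # never booted in time: kill it so charges stop accruing
    return False
```

Injecting `clock` and `sleep` keeps the watchdog testable without waiting out real timeouts; in production the `time.monotonic`/`time.sleep` defaults apply.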

Impact on Competition Fairness

  • The competition requires 8xH100 GPUs for official evaluation (10-minute wallclock constraint)
  • Individual participants with limited budgets ($25-50) can exhaust their credits just attempting to boot pods, before any training happens
  • This creates an uneven playing field where only participants with large cloud budgets or institutional backing can afford to iterate on submissions
  • OpenAI covers 100% of RunPod's evaluation costs, but individual participants testing their code bear the full risk of failed provisioning charges

Suggestions

  1. Provide official evaluation infrastructure: a shared evaluation endpoint where participants can submit their train_gpt.py and receive val_bpb results without managing cloud GPU provisioning themselves
  2. Document recommended cloud providers with reliable 8xH100 availability and fair billing (no charges during provisioning)
  3. Provide evaluation credits to active participants so failed provisioning attempts don't block participation
  4. Add a local evaluation mode: even a rough approximation on smaller hardware (e.g., a single-GPU short run) would help participants validate their code before committing to expensive 8xH100 runs
