[Feature]: Scheduler protocol + Slurm scheduler + generic BatchBackend #20

Description

@aldbr

User Story

As the WMS consumer,
I want BatchBackend(SSHTransport, Slurm) and BatchBackend(LocalTransport, Slurm) working
end-to-end against the containerised Slurm stacks,
So that the N+M composition promise of IC-ADR-001 §3 is proven on the first real scheduler.

Feature Description

  • Scheduler protocol finalised: submit_cmd(spec), parse_status(raw), kill_cmd(ids),
    stages_own_files flag (+ stage_inputs/collect_outputs hooks where needed). Evaluate the
    ADR's open question — promote staging to a Stager collaborator — and record the outcome.
  • _slurm internal module (Tier C): sbatch script/option generation from SubmissionSpec
    (incl. count → --array for identical fan-out), squeue/sacct parsing, native-state →
    JobStatus map (port DIRAC's mapping: PENDING/SUSPENDED/CONFIGURING→waiting, COMPLETED→done,
    CANCELLED/PREEMPTED→aborted…).
  • Slurm scheduler consuming _slurm; registered via intercede.schedulers.
  • Generic BatchBackend (Tier B): satisfies JobBackend + OutputRetriever
    (destructive=False) + Cancellable + Purgeable + LoadReporter; sandbox staging via
    Transport.put/get when stages_own_files is false; workdir layout per job; JobHandle
    carries host/workdir routing (replaces DIRAC's ssh<batch>:// ref encoding).
  • Phase 2 wiring: the integration contract suite (submit → status → fetch → kill, fetch-twice,
    partial-failure maps) runs against ssh-slurm and local-slurm stacks via markers.
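
To make the shape of the Scheduler protocol and the native-state map concrete, here is a minimal sketch, not the agreed design. The method names and the `stages_own_files` flag come from the list above. The `SubmissionSpec` fields, the `JobStatus` member names other than `UNKNOWN`, the return types, and the choice of `squeue --noheader --format="%i %T"` as the status source are assumptions made for illustration. Only the states named in this issue are mapped; the rest of DIRAC's mapping is left out on purpose.

```python
from __future__ import annotations

import enum
from dataclasses import dataclass
from typing import Dict, List, Protocol, Sequence


class JobStatus(enum.Enum):
    # Member names are assumed; the issue only names waiting/done/aborted/UNKNOWN.
    WAITING = "waiting"
    DONE = "done"
    ABORTED = "aborted"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class SubmissionSpec:
    # Hypothetical fields; the real SubmissionSpec is not shown in this issue.
    script_path: str
    count: int = 1


class Scheduler(Protocol):
    """Pure command builder/parser; it never talks to a host itself."""

    stages_own_files: bool

    def submit_cmd(self, spec: SubmissionSpec) -> List[str]: ...
    def parse_status(self, raw: str) -> Dict[str, JobStatus]: ...
    def kill_cmd(self, ids: Sequence[str]) -> List[str]: ...


# Only the native states listed in the issue; anything else falls back to UNKNOWN.
_SLURM_STATES = {
    "PENDING": JobStatus.WAITING,
    "SUSPENDED": JobStatus.WAITING,
    "CONFIGURING": JobStatus.WAITING,
    "COMPLETED": JobStatus.DONE,
    "CANCELLED": JobStatus.ABORTED,
    "PREEMPTED": JobStatus.ABORTED,
}


class Slurm:
    stages_own_files = False

    def submit_cmd(self, spec: SubmissionSpec) -> List[str]:
        cmd = ["sbatch", "--parsable"]
        if spec.count > 1:
            # Identical fan-out becomes one array job instead of N submissions.
            cmd.append(f"--array=0-{spec.count - 1}")
        cmd.append(spec.script_path)
        return cmd

    def parse_status(self, raw: str) -> Dict[str, JobStatus]:
        # Assumes one "<jobid> <STATE>" pair per line, as produced by
        # squeue --noheader --format="%i %T".
        statuses: Dict[str, JobStatus] = {}
        for line in raw.splitlines():
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            job_id, state = parts
            statuses[job_id] = _SLURM_STATES.get(state, JobStatus.UNKNOWN)
        return statuses

    def kill_cmd(self, ids: Sequence[str]) -> List[str]:
        return ["scancel", *ids]
```

Because the scheduler only builds argument lists and parses text, the same `Slurm` object could be handed to either transport, which is what the N+M composition relies on. Whether staging stays as hooks on this protocol or moves to a Stager collaborator is left open here, as in the ADR.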

Definition of Done

  • Contract suite green on ssh-slurm and local-slurm stacks in CI
  • count > 1 submits via a single --array (asserted, not N submits)
  • fetch_output streams an arbitrary output sandbox (not just stdout/stderr) with bounded
    materialisation (size/count/timeout/path-containment)
  • LoadReporter.counts() from squeue; unknown ids → JobStatus.UNKNOWN
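
The bounded-materialisation item above could be checked along these lines. This is a rough sketch under stated assumptions: `plan_materialisation`, `SandboxLimitError`, and the `(relative_name, size)` entry shape are hypothetical names, not part of any existing API. It covers the count, size, and path-containment bounds; the timeout bound, and counting the bytes actually received while streaming, are not shown.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


class SandboxLimitError(Exception):
    """Raised when an output sandbox would exceed a configured bound."""


def plan_materialisation(
    entries: Iterable[Tuple[str, int]],
    dest: Path,
    *,
    max_files: int,
    max_bytes: int,
) -> List[Path]:
    """Validate (relative_name, declared_size) entries before writing anything.

    Returns the resolved target paths, all strictly inside ``dest``.
    """
    root = dest.resolve()
    total = 0
    targets: List[Path] = []
    for index, (rel_name, size) in enumerate(entries, start=1):
        if index > max_files:
            raise SandboxLimitError(f"more than {max_files} files")
        total += size
        if total > max_bytes:
            raise SandboxLimitError(f"more than {max_bytes} bytes")
        # resolve() normalises ".." components; an absolute rel_name replaces
        # root entirely, so both escape attempts fail the containment check.
        target = (root / rel_name).resolve()
        if root not in target.parents:
            raise SandboxLimitError(f"{rel_name!r} escapes {root}")
        targets.append(target)
    return targets
```

Validating the whole listing before the first byte is written means a rejected sandbox leaves nothing half-materialised on the consumer side.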

Alternatives Considered

  • Per-combination classes (SSHSlurmBackend, LocalSlurmBackend…) — N×M explosion; rejected.
  • Porting DIRAC's remote-executed SLURM.py driver — requires remote python; dropped (see
    issue-10).

Additional Context

This issue is the template for every further scheduler.
