Skip to content

giovtorres/slurm-docker-cluster

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

68 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Slurm Docker Cluster

Slurm Docker Cluster is a multi-container Slurm cluster designed for rapid deployment using Docker Compose. This repository simplifies the process of setting up a robust Slurm environment for development, testing, or lightweight usage.

๐Ÿ Quick Start

Requirements: Docker and Docker Compose

git clone https://github.com/giovtorres/slurm-docker-cluster.git
cd slurm-docker-cluster
cp .env.example .env    # optional: edit to change version, enable GPU, etc.

# Option A: Pull pre-built image from Docker Hub (fastest)
docker pull giovtorres/slurm-docker-cluster:latest
docker tag giovtorres/slurm-docker-cluster:latest slurm-docker-cluster:25.11.2

# Option B: Build from source
make build

# Start the cluster
make up
make status             # verify nodes are idle
make test               # run full test suite
make help               # see all available commands

Supported Slurm versions: 25.11.x, 25.05.x (last two Major.Minor releases)

Supported architectures (auto-detected): AMD64, ARM64

๐Ÿ“ฆ What's Included

Containers:

  • mysql - Job and cluster database
  • slurmdbd - Database daemon for accounting
  • slurmctld - Controller for job scheduling
  • slurmrestd - REST API daemon (HTTP/JSON access)
  • c1, c2 - CPU compute nodes (dynamically scalable)
  • g1 - (optional) GPU compute node with NVIDIA support (dynamically scalable)
  • elasticsearch - (optional) indexing jobs
  • kibana - (optional) visualization for elasticsearch

Persistent volumes:

  • Configuration (etc_slurm)
  • Logs (var_log_slurm)
  • Job files (slurm_jobdir)
  • Database (var_lib_mysql)
  • Authentication (etc_munge)

๐Ÿ–ฅ๏ธ Using the Cluster

# Access controller
make shell

# Inside controller:
sinfo                          # View cluster status
sbatch --wrap="hostname"       # Submit job
squeue                         # View queue
sacct                          # View accounting

# Or run example jobs
make run-examples

๐Ÿ“ˆ Scaling

Compute nodes use Slurm's dynamic registration (slurmd -Z) and self-register with sequential hostnames (c1, c2, c3... for CPU; g1, g2... for GPU). Scale up or down at any time without rebuilding.

Scale CPU Workers

# Scale to 5 CPU workers (default is 2)
make scale-cpu-workers N=5

# Or set the default count in .env
CPU_WORKER_COUNT=4
make up

Scale GPU Workers

# Scale to 3 GPU workers (requires GPU_ENABLE=true)
make scale-gpu-workers N=3

Verify with make status.

๐Ÿ“Š Monitoring

REST API

Query cluster via REST API (version auto-detected: v0.0.44 for 25.11.x, v0.0.42 for 25.05.x):

# Get JWT Token
JWT_TOKEN=$(docker exec slurmctld scontrol token 2>&1 | grep "SLURM_JWT=" | cut -d'=' -f2)

# Get nodes
docker exec slurmrestd curl -s -H "X-SLURM-USER-TOKEN: $JWT_TOKEN" \
  http://localhost:6820/slurm/v0.0.42/nodes | jq .nodes

# Get partitions
docker exec slurmrestd curl -s -H "X-SLURM-USER-TOKEN: $JWT_TOKEN" \
  http://localhost:6820/slurm/v0.0.42/partitions | jq .partitions

Elasticsearch and Kibana (Optional)

Enable job completion monitoring and visualization:

# 1. Setting ELASTICSEARCH_HOST in .env enables the monitoring profile
ELASTICSEARCH_HOST=http://elasticsearch:9200

# 2. Start cluster (monitoring auto-enabled)
make up

# 3. Access Kibana at http://localhost:5601
# After loading, click: Elasticsearch โ†’ Index Management โ†’ slurm โ†’ Discover index

# 4. Query job completions directly
docker exec elasticsearch curl -s "http://localhost:9200/slurm/_search?pretty"

# Test monitoring
make test-monitoring

Indexed data: Job ID, user, partition, state, times, nodes, exit code

๐ŸŽฎ GPU Support (NVIDIA)

Enable optional NVIDIA GPU support using NVIDIA's official CUDA base images:

# 1. One-time host setup (add NVIDIA repo and install nvidia-container-toolkit)
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
  | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# 2. Enable GPU in .env (uses NVIDIA's official CUDA base images)
GPU_ENABLE=true
BUILDER_BASE=nvidia/cuda:13.1.1-devel-rockylinux9
RUNTIME_BASE=nvidia/cuda:13.1.1-base-rockylinux9

# 3. Build with GPU support
make rebuild

# 4. Verify GPU detection
docker exec g1 nvidia-smi

# Test GPU functionality
make test-gpu

Note: GPU testing is not included in CI (GitHub-hosted runners have no GPUs). Run make test-gpu manually on a host with an NVIDIA GPU and nvidia-container-toolkit installed.

๐Ÿ”„ Cluster Management

make down     # Stop cluster (keeps data)
make clean    # Remove all containers and volumes
make rebuild  # Clean, rebuild, and restart
make logs     # View container logs

Note: If ELASTICSEARCH_HOST is set in .env, monitoring containers are automatically managed.

๐Ÿณ Docker Hub

Pre-built multi-arch images (amd64 + arm64) are published on each GitHub release:

# CPU images
docker pull giovtorres/slurm-docker-cluster:latest
docker pull giovtorres/slurm-docker-cluster:25.11.2          # latest build for this Slurm version
docker pull giovtorres/slurm-docker-cluster:25.11.2-2.1.0   # pinned to a specific release

# GPU images (built on nvidia/cuda base)
docker pull giovtorres/slurm-docker-cluster:latest-gpu
docker pull giovtorres/slurm-docker-cluster:25.11.2-gpu
docker pull giovtorres/slurm-docker-cluster:25.11.2-gpu-2.1.0

โš™๏ธ Advanced

Version Management

make set-version VER=25.05.6   # Switch Slurm version
make version                   # Show current version
make build-all                 # Build all supported versions
make test-all                  # Test all versions

Configuration Updates

# Live edit (persists across restarts)
docker exec -it slurmctld vi /etc/slurm/slurm.conf
make reload-slurm

# Push local changes
vi config/25.05/slurm.conf
make update-slurm FILES="slurm.conf"

# Permanent changes
make rebuild

Multi-Architecture Builds

# Cross-platform build (uses QEMU emulation)
docker buildx build --platform linux/arm64 \
  --build-arg SLURM_VERSION=25.05.6 \
  --load -t slurm-docker-cluster:25.05.6 .

๐Ÿ“š Documentation

  • Commands: Run make help for all available commands
  • Examples: Job scripts in examples/ directory

๐Ÿค Contributing

Contributions are welcomed! Fork this repo, create a branch, and submit a pull request.

๐Ÿ“„ License

This project is licensed under the MIT License.