
containers: prepare environmentd and clusterd for distroless migration#35859

Draft
jasonhernandez wants to merge 11 commits into main from
jason/sec-236-distroless-migration

Conversation

@jasonhernandez
Contributor

Summary

Move bash entrypoint logic into Rust binaries so environmentd and clusterd can run in distroless container images:

  • clusterd: Auto-detect the Kubernetes FQDN from /etc/hostname (replacing `hostname --fqdn`, which isn't available in distroless images), auto-detect the StatefulSet ordinal from the HOSTNAME env var, and toggle LD_PRELOAD for eatmydata
  • environmentd: Toggle LD_PRELOAD for eatmydata, and sleep forever after a graceful exit
  • Dockerfile.distroless: New distroless Dockerfile variants for both services based on distroless-prod-base
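
For reviewers who want a concrete picture of the clusterd detection steps listed above, here is a minimal sketch. It is not the code in this PR: the helper names `hostname_from_contents` and `statefulset_ordinal` are hypothetical, and it only assumes the standard StatefulSet pod naming scheme `<statefulset-name>-<ordinal>`. Whether the value in /etc/hostname is fully qualified depends on how the pod is configured, which this sketch does not address.

```rust
use std::{env, fs};

/// Hypothetical helper: return the first non-empty line of a hostname
/// file's contents, with surrounding whitespace removed.
fn hostname_from_contents(contents: &str) -> Option<&str> {
    contents.lines().map(str::trim).find(|line| !line.is_empty())
}

/// Hypothetical helper: extract the trailing ordinal from a StatefulSet
/// pod name of the form `<statefulset-name>-<ordinal>`.
fn statefulset_ordinal(hostname: &str) -> Option<u64> {
    // If the name is fully qualified, only the first DNS label carries the ordinal.
    let pod_name = hostname.split('.').next()?;
    let (_, suffix) = pod_name.rsplit_once('-')?;
    // Accept plain ASCII digits only (rejects "", "+3", "x1", and so on).
    if suffix.is_empty() || !suffix.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    suffix.parse().ok()
}

fn main() {
    // Reading the file directly avoids shelling out to `hostname`.
    let file_contents = fs::read_to_string("/etc/hostname").ok();
    let hostname = file_contents.as_deref().and_then(hostname_from_contents);
    println!("hostname from /etc/hostname: {hostname:?}");

    let ordinal = env::var("HOSTNAME")
        .ok()
        .and_then(|h| statefulset_ordinal(&h));
    println!("ordinal from HOSTNAME: {ordinal:?}");
}
```

Returning `Option` from both helpers lets the caller fall back to explicitly passed configuration when detection fails, for example when running outside Kubernetes.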

Motivation

environmentd and clusterd are the last major services still on Ubuntu-based containers. The blockers were:

  1. Bash entrypoint scripts (solved here — logic moved to Rust)
  2. System ssh binary for tunnels (solved by SEC-236 static OpenSSH PR)
  3. tini for PID 1 (Kubernetes shareProcessNamespace or container runtime --init handles this)

Distroless images are ~60% smaller, have no shell for attackers to exploit, and are required for FIPS compliance (no uncontrolled system crypto libraries).

What's NOT in this PR

Part of SEC-236.

Test plan

  • cargo check -p mz-clusterd -p mz-environmentd passes
  • cargo fmt clean
  • CI compilation check
  • Requires manual testing with distroless image build + Kubernetes deployment

🤖 Generated with Claude Code

@github-actions
Contributor

github-actions bot commented Apr 3, 2026

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@jasonhernandez jasonhernandez force-pushed the jason/sec-236-distroless-migration branch 2 times, most recently from 63a74f4 to ef9e7db Compare April 3, 2026 06:17
Move bash entrypoint logic into Rust binaries so environmentd and
clusterd can run in distroless containers without a shell:

clusterd:
- Auto-detect Kubernetes FQDN from /etc/hostname (replaces `hostname --fqdn`)
- Auto-detect StatefulSet ordinal from HOSTNAME env var
- Configure LD_PRELOAD for eatmydata (CI only, no-op in distroless)

environmentd:
- Configure LD_PRELOAD for eatmydata
- Sleep forever after graceful exit (keeps container alive for debugging)

Also add Dockerfile.distroless variants for both services that use the
distroless-prod-base image and expect a static `ssh` binary to be
copied in for SSH tunnel support.

Part of SEC-236.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@jasonhernandez jasonhernandez force-pushed the jason/sec-236-distroless-migration branch from ef9e7db to 748f09e Compare April 3, 2026 06:19
jasonhernandez and others added 10 commits April 2, 2026 23:24
Replace the Ubuntu-based Dockerfiles with distroless variants directly,
delete the now-unnecessary bash entrypoint scripts, and remove the
explicit LD_PRELOAD=libeatmydata.so from the mzcompose clusterd service
(the MZ_EAT_MY_DATA env var triggers the Rust-side LD_PRELOAD logic
which is harmless when libeatmydata.so is absent in distroless).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
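
The commit above relies on the MZ_EAT_MY_DATA toggle being harmless when libeatmydata.so is absent. A minimal sketch of one way such a decision could be made follows. It is not the code in this PR: the function `preload_value`, the library path in `EATMYDATA_LIB`, and the choice to treat only the value `1` as "enabled" are all assumptions for illustration. Note that the dynamic linker reads LD_PRELOAD at process start, so a computed value only takes effect for child processes or after a re-exec; the sketch stops at computing the value.

```rust
use std::env;
use std::path::Path;

/// Hypothetical library location; the real path depends on the image.
const EATMYDATA_LIB: &str = "/usr/lib/libeatmydata.so";

/// Hypothetical helper: decide what LD_PRELOAD should become.
/// Returns `None` when nothing should change: the toggle is off, the
/// library is missing, or the library is already listed.
fn preload_value(toggle: Option<&str>, lib_exists: bool, current: Option<&str>) -> Option<String> {
    // Assumption: only MZ_EAT_MY_DATA=1 enables the toggle.
    if toggle != Some("1") || !lib_exists {
        return None;
    }
    match current {
        // LD_PRELOAD entries may be separated by colons or spaces.
        Some(cur)
            if cur
                .split(|c: char| c == ':' || c == ' ')
                .any(|entry| entry == EATMYDATA_LIB) =>
        {
            None
        }
        // Prepend to any existing preload list.
        Some(cur) if !cur.is_empty() => Some(format!("{EATMYDATA_LIB}:{cur}")),
        _ => Some(EATMYDATA_LIB.to_string()),
    }
}

fn main() {
    let toggle = env::var("MZ_EAT_MY_DATA").ok();
    let current = env::var("LD_PRELOAD").ok();
    let lib_exists = Path::new(EATMYDATA_LIB).exists();

    match preload_value(toggle.as_deref(), lib_exists, current.as_deref()) {
        Some(value) => println!("would set LD_PRELOAD={value}"),
        None => println!("LD_PRELOAD left unchanged"),
    }
}
```

Keeping the decision in a pure function makes the "library absent" case a plain no-op that can be checked without touching the real environment.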
Copy libeatmydata.so from a Debian image into the distroless base so
that CI tests using MZ_EAT_MY_DATA=1 continue to benefit from fsync
elision. The library is inert in production (MZ_EAT_MY_DATA is unset).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The jobs image only contains Rust binaries (persistcli, mz-catalog-debug)
with no shell or tool dependencies.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The mzbuild system expects a Dockerfile next to every mzbuild.yml.
Include the static OpenSSH build Dockerfile so the pipeline can
resolve the openssh-static image dependency.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The git clone of aws-lc from GitHub fails with "server certificate
verification failed" because the ubuntu:noble base image doesn't
include CA certificates by default.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The jobs image is used in CI tests with mzcompose's idle feature which
overrides the entrypoint to ["sleep", "infinity"]. Distroless images
don't have the sleep binary, so keep this CI-only image on Ubuntu.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Change the static OpenSSH build to use plain AWS-LC by default (faster,
no Go dependency) with FIPS mode available via --build-arg AWS_LC_FIPS=1.

AWS-LC is a drop-in replacement for OpenSSL that's faster and smaller.
FIPS 140-3 validation is an additional layer only needed for compliance
builds, not for all builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
zlib.net is unreliable in CI — the download has failed twice. Use the
GitHub releases mirror which is more stable.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
AWS-LC (like BoringSSL) doesn't define BN_FLG_CONSTTIME. OpenSSH
V_9_9_P2 uses it in ssh-rsa.c. Define it to 0 via CFLAGS — the
constant is only used with BN_set_flags which AWS-LC already shims
to a no-op.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
