
feat: objectstore drainer #20222

Merged
jujubot merged 21 commits into juju:main from SimonRichardson:objectstore-drainer
Aug 7, 2025
Conversation

@SimonRichardson
Member

@SimonRichardson SimonRichardson commented Jul 16, 2025

Requires #20291 to land first.

Note

We require #20291 to land first as it rewrites the objectstore tests to better handle the test worker. This makes it much cleaner to apply the fixes to the tests on top of that.


This wires up the draining of models from the file-backed objectstore to the s3-backed objectstore. This sits on top of all the work that took place at the sprint: the objectstore fortress, drainer, and facade. Once we spot a drain in progress, we use the raw underlying types:

  • The hash-based filestore, which provides only raw access to the disk
  • The s3 client, which only pushes files directly from disk to s3-compatible object stores.

There is complexity in ensuring the ordering of the process: each model must be drained from files to s3, and then, importantly, all currently existing objectstore workers must be flushed so that they pick up the new s3 configuration. Once that is done, we complete the process and unlock the guard.

If there is an error during the process, we report the error and require manual intervention.

It is required that all files are backed up before performing this process, as once a file has been drained, it is removed from the prior location.

Note

This currently doesn't drain the controller agents; that will follow in another PR. Trying to move into HA after the draining process will fail, as there are no file-backed object stores.

Warning

This doesn't allow s3-to-s3 draining; it requires changes to how we store the underlying data, which in turn requires modifications to the controller charm.


This pull request introduces several updates to enhance object store management and expand functionality for retrieving model namespaces. Key changes include improvements to the object store drainer configuration, the addition of a new GetModelNamespaces method, and updates to object store metadata handling.

Object Store Enhancements:

  • Updated objectstoredrainer.ManifoldConfig in manifolds.go to include additional dependencies and services, such as GetControllerService, GetGuardService, and NewDrainerWorker, to improve object store draining capabilities. (cmd/jujud-controller/agent/machine/manifolds.go, cmd/jujud-controller/agent/machine/manifolds.goR751-R765)
  • Added a new ObjectStoreFlusher interface for flushing object store workers. (core/objectstore/objectstore.go, core/objectstore/objectstore.goR81-R86)
  • Updated the SelectFileHash function to replace the private selectFileHash function, ensuring consistent hash selection across object store implementations. (internal/objectstore/base.go, [1]; internal/objectstore/fileobjectstore.go, [2]; internal/objectstore/s3objectstore.go, [3])
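As a rough illustration of the two items above; the `ObjectStoreFlusher` method set and the `SelectFileHash` selection rule shown here are assumptions paraphrased from the description, not the merged code:

```go
package main

import (
	"context"
	"fmt"
)

// ObjectStoreFlusher mirrors the interface described above; the exact
// method set in core/objectstore is an assumption here.
type ObjectStoreFlusher interface {
	// FlushWorkers restarts every objectstore worker so that it
	// re-reads the controller's objectstore configuration.
	FlushWorkers(ctx context.Context) error
}

// Metadata carries both hashes: juju indexes by SHA-384, but SHA-256
// is also recorded for s3 compatibility.
type Metadata struct {
	SHA256 string
	SHA384 string
}

// SelectFileHash sketches the promoted helper: prefer the SHA-384 hash
// used for on-disk file names, falling back to SHA-256. The real
// selection logic in internal/objectstore may differ.
func SelectFileHash(m Metadata) string {
	if m.SHA384 != "" {
		return m.SHA384
	}
	return m.SHA256
}

func main() {
	fmt.Println(SelectFileHash(Metadata{SHA256: "abc", SHA384: "def"}))
}
```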

Model Namespace Retrieval:

  • Added a GetModelNamespaces method to the State interface and implemented it in the State struct, allowing retrieval of all model namespaces from the database. (domain/controller/service/service.go, [1] [2]; domain/controller/state/state.go, [3])
  • Extended unit tests to cover the GetModelNamespaces method, including scenarios where no namespaces are found. (domain/controller/service/service_test.go, [1]; domain/controller/state/state_test.go, [2])

Schema and Dependency Updates:

  • Modified the database schema for object_store_drain_phase_type to correct phase type ordering and added a foreign key constraint for phase_type_id. (domain/schema/controller/sql/0014-objectstore-metadata.sql, domain/schema/controller/sql/0014-objectstore-metadata.sqlL43-R51)
  • Updated dependencies in manifolds_test.go to reflect new requirements for the object-store-drainer manifold. (cmd/jujud-controller/agent/machine/manifolds_test.go, [1] [2])

These changes collectively improve the robustness and functionality of the system, particularly in managing object store operations and retrieving model-related data.

QA steps

Setup s3 (I'm using minio, but anything will suffice, even another Juju model with an s3 compatible application):

docker run -d \
   -p 9000:9000 \
   -p 9001:9001 \
   --user $(id -u):$(id -g) \
   --name minio1 \
   -e "MINIO_ROOT_USER=ROOTUSER" \
   -e "MINIO_ROOT_PASSWORD=CHANGEME123" \
   -v ${HOME}/.local/minio/data:/data \
   quay.io/minio/minio server /data --console-address ":9001"

MinIO is now a community edition, so the browser console is basic.

$ docker exec -it <container> bash
$ mc alias set foo http://localhost:9000 ROOTUSER CHANGEME123
$ mc admin info foo
$ mc admin accesskey create foo
# Note down the access keys

Bootstrap:

Note

Don't use localhost for the container address, otherwise it can't access it. Use your own address from ip a.

$ juju bootstrap lxd test
$ juju add-model default
$ juju deploy ubuntu

Take note of the controller UUID; you can inspect that model in the minio CLI/dashboard.

juju controller-config "object-store-type=s3" "object-store-s3-endpoint=<endpoint>" "object-store-s3-static-key=<key>" "object-store-s3-static-secret=<secret>"

Watch the logs for "object store draining completed successfully".

Make sure that the blobs have been removed:

$ juju ssh -m controller 0
$ juju_object_store_contents

There should only be the controller left.

Also check that the objectstore workers were flushed:

  object-store:
    inputs:
    - agent
    - trace
    - object-store-services
    - lease-manager
    - object-store-s3-caller
    - api-remote-caller
    - upgrade-database-flag
    report:
      workers:
        77f5809c-41c1-4a5c-8b7d-19f3ba7cd9c2:
          report:
            namespace: 77f5809c-41c1-4a5c-8b7d-19f3ba7cd9c2
            path: /var/lib/juju/objectstore/77f5809c-41c1-4a5c-8b7d-19f3ba7cd9c2
            rootBucket: juju-dcbe59b4-2752-4bb3-8e7d-e54c7cf603a1
            type: s3-object-store
          started: "2025-07-18 16:01:35"
          state: started
        a262007b-ae56-4812-8f4a-e029f3afd3e9:
          report:
            namespace: a262007b-ae56-4812-8f4a-e029f3afd3e9
            path: /var/lib/juju/objectstore/a262007b-ae56-4812-8f4a-e029f3afd3e9
            rootBucket: juju-dcbe59b4-2752-4bb3-8e7d-e54c7cf603a1
            type: s3-object-store
          started: "2025-07-18 16:01:35"
          state: started
        controller:
          report:
            namespace: controller
            path: /var/lib/juju/objectstore/controller
            rootBucket: juju-dcbe59b4-2752-4bb3-8e7d-e54c7cf603a1
            type: s3-object-store
          started: "2025-07-18 16:01:35"
          state: started

@jujubot jujubot added the 4.0 label Jul 16, 2025
@SimonRichardson SimonRichardson force-pushed the objectstore-drainer branch 4 times, most recently from f1252ab to c1ff598 Compare July 18, 2025 15:02
SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Jul 28, 2025
This adds the implementation for file objectstore removal code. This
should work in both HA and non-HA modes, as the witness for the object
store should happen everywhere.

The s3 objectstore is a lot more difficult and requires more thought.
It will be wise to wait for [1], which cleans up the s3 objectstore,
which might help the implementation of this.

 1. juju#20222
SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Jul 29, 2025
@SimonRichardson SimonRichardson marked this pull request as ready for review July 29, 2025 10:48
@SimonRichardson SimonRichardson force-pushed the objectstore-drainer branch 2 times, most recently from 3b47ad1 to f83acda Compare July 29, 2025 10:54
SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Jul 29, 2025
@SimonRichardson SimonRichardson force-pushed the objectstore-drainer branch 2 times, most recently from dcd55cb to 06f3e9c Compare August 4, 2025 07:23
@manadart manadart left a comment

Some nits, but this is very cogent, and works perfectly.

ShouldRestart: func(err error) bool {
return true
},
RestartDelay: time.Second * 10,
Member

Quite long, isn't it?

Member Author

Yeah, probably. I really want to get juju/worker#41 in at some point, which gives us the best of both worlds.

Member Author

All our workers restart after 10 seconds... I'm happy to revisit this as a whole.

@hpidcock hpidcock left a comment

Nice. Few little bits.

Comment on lines +190 to +193
// We need to compute the sha256 hash here, juju by default uses SHA384,
// but s3 defaults to SHA256.
// If the reader is a Seeker, then we can seek back to the beginning of
// the file, so that we can read it again.
Member

Feels like the object store model should compute and store both when an object is first stored.
Even though we only index by SHA-384, having the other values precomputed and stored in metadata can prove useful, like here.

Member Author

That's a big change, happy to make it. I'll do it in a follow-up.

func (w *drainWorker) Report() map[string]any {
return map[string]any{
"namespace": w.namespace,
"rootBucket": w.rootBucket,
Member

Nice to have, but can we please get some statistics here?
Objects drained.
Objects to be drained.
Current object draining (hash, size).

This will help very much in seeing if the drainer is stuck or just slowly progressing.

Member Author

Yes, good plan.
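A sketch of what such a report might look like; the field and report-key names (`objectsDrained`, `objectsToBeDrained`, `current`) are illustrative, not necessarily what eventually landed:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// drainWorker extends the Report() shown above with the drain
// statistics the reviewer asked for, so an operator can tell a stuck
// drainer from a slow one.
type drainWorker struct {
	namespace   string
	rootBucket  string
	total       int64
	drained     atomic.Int64
	currentHash string
	currentSize int64
}

func (w *drainWorker) Report() map[string]any {
	drained := w.drained.Load()
	return map[string]any{
		"namespace":          w.namespace,
		"rootBucket":         w.rootBucket,
		"objectsDrained":     drained,
		"objectsToBeDrained": w.total - drained,
		"current": map[string]any{
			"hash": w.currentHash,
			"size": w.currentSize,
		},
	}
}

func main() {
	w := &drainWorker{namespace: "controller", rootBucket: "juju-bucket", total: 10}
	w.drained.Store(4)
	r := w.Report()
	fmt.Println(r["objectsDrained"], r["objectsToBeDrained"])
}
```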

This is the very bare bones for the objectstore drainer.
We require this because we need to migrate all objects for every
model.
The drainer needs to operate on a model namespace and not globally.
This begins the wire up of the manifold. This will allow us to start
testing the draining implementation.
We're still missing the draining tests, but that will follow.

We now handle models being drained.
We need to ensure we have all the dependencies in the worker manifold
when we create the worker.
This reuses the drainer tests from the prior implementation that worked
and moved them to this location.
Stop the agentconfigupdater bouncing the objectstore when it detects
a different s3 type. This is brute force; we want to drain before the
bounce.
When the objectstore has finished draining, we need to flush the
objectstore. This does that before completing.
When draining from file to objectstore, we need to drain the files and
reload the objectstore workers. This wires everything up so it all
slots together nicely.
If there is an error setting the objectstore type, handle that error.
Now that we need to consult the objectstore type every time the worker
comes up, we need to account for that in tests.
When we get a change request, record the objectstore type, otherwise
it won't be correctly written.
Return an error with a not supported tag, not a NotSupportedf one with
a trailing string.
It's easier to read the non-generic type than the generic one.
If the worker already exists, we'll hang and prevent the worker from
completing. This causes no end of issues.

The solution is to crash the worker and cause it to start again.
If we end up with duplicated namespaces it will cause issues, so
prevent that from happening.
Check the objectstore type in the change config closure, we can
guarantee that it's not changed whilst we set it.
@SimonRichardson
Member Author

/merge

@jujubot jujubot merged commit 2c77ae4 into juju:main Aug 7, 2025
20 of 22 checks passed
@SimonRichardson SimonRichardson deleted the objectstore-drainer branch August 7, 2025 11:22
jujubot added a commit that referenced this pull request Aug 7, 2025
…iner

#20363

~~Requires #20222 to land first.~~

----

> [!NOTE]
> Fixes one of the issues from the parent change request. In order to move into HA, you need the agents moving. This changeset does that.

This was originally left out of the objectstore drainer as it was
becoming rather large. Instead, draining the agents' objectstore is
actually rather trivial. We don't allow it to run in parallel with
other models, just because that's hard to manage. Yet we can do it
upfront and then drain that, before moving on.

-----

This pull request introduces several enhancements and fixes related to object store management, controller state services, and database schema consistency. The most significant changes include the addition of a new method to retrieve all model namespaces, improvements to object store interfaces and reporting, and updates to dependency wiring and schema constraints.

**Controller State and Service Improvements:**

* Added a new `GetModelNamespaces` method to the controller state and service layers, allowing retrieval of all model namespaces. This includes interface, implementation, and test coverage.
* The `ObjectStoreServices` struct now exposes a `Controller()` method to access the controller service, facilitating service composition.

**Object Store Enhancements:**

* Introduced the `ObjectStoreFlusher` interface for flushing object store workers.
* Promoted `selectFileHash` to `SelectFileHash` for broader accessibility and updated all usages accordingly.
* Improved the `Report()` methods for both file and S3 object stores to include a `type` field, aiding diagnostics and introspection.

**Dependency and Manifold Wiring:**

* Expanded the configuration and dependency injection for the `object-store-drainer` manifold, adding several new dependencies and configuration fields to support more robust operation.

**Database Schema Fixes:**

* Corrected the values and foreign key constraint in the `object_store_drain_phase_type` table to ensure referential integrity and accurate phase representation.

**Cleanup and Refactoring:**

* Removed unused or redundant code related to object store type changes in the agent config updater, simplifying logic and reducing unnecessary agent restarts.

These changes collectively improve the maintainability, observability, and correctness of the object store and controller service subsystems.


## QA steps

Setup s3 (I'm using minio, but anything will suffice, even another Juju model with an s3 compatible application):


```sh
docker run -d \
  -p 9000:9000 \
  -p 9001:9001 \
  --user $(id -u):$(id -g) \
  --name minio1 \
  -e "MINIO_ROOT_USER=ROOTUSER" \
  -e "MINIO_ROOT_PASSWORD=CHANGEME123" \
  -v /home/jenkins/.local/minio/data:/data \
  quay.io/minio/minio server /data --console-address ":9001"
```

MinIO is now a community edition, so the browser console is basic.

```sh
$ docker exec -it <container> bash
$ mc alias set foo http://localhost:9000 ROOTUSER CHANGEME123
$ mc admin info foo
$ mc admin accesskey create foo
# Note down the access keys
```

Bootstrap:

> [!NOTE]
> Don't use localhost for the container address, otherwise it can't access it. Use your own address from `ip a`.

```sh
$ juju bootstrap lxd test
$ juju add-model default
$ juju deploy juju-qa-test
```

Take note of the controller UUID; you can inspect that model in the minio CLI/dashboard.

```sh
juju controller-config "object-store-type=s3" "object-store-s3-endpoint=<endpoint>" "object-store-s3-static-key=<key>" "object-store-s3-static-secret=<secret>"
```

Move into HA...

```sh
$ juju add-unit -m controller "controller" -n 2 
```
