
feat: objectstore drainer #20222

Merged
jujubot merged 21 commits into juju:main from SimonRichardson:objectstore-drainer
Aug 7, 2025
Conversation

@SimonRichardson
Member

@SimonRichardson SimonRichardson commented Jul 16, 2025

Requires #20291 to land first.

Note

We require #20291 to land first as it rewrites the objectstore tests to better handle the test worker. This makes it much cleaner to apply the fixes to the tests on top of that.


This wires up the draining of models from the file-backed objectstore to the s3-backed objectstore. This sits on top of all the work that took place at the sprint: the objectstore fortress, drainer, and facade. Once we spot a drain in progress, we use the raw underlying types:

  • The hash-based filestore, which provides only raw access to the disk
  • The s3 client, which only pushes files directly from disk to s3-compatible object stores.

There is complexity in ensuring the ordering of the process: each model must be drained from files to s3, and then, importantly, all currently existing objectstore workers must be flushed so that they pick up the new s3 configuration. Once that is done, we complete the process and unlock the guard.

If there is an error during the process, we report the error and require manual intervention.

It is required that all files are backed up before performing this process, as once a file has been drained, it is removed from the prior location.

Note

This currently doesn't drain the controller agents; that will follow in another PR. Trying to move into HA after the draining process will fail, as there are no file-backed object stores.

Warning

This doesn't allow s3-to-s3 draining; it requires changes to how we store the underlying data, which in turn requires modifications to the controller charm.


This pull request introduces several updates to enhance object store management and expand functionality for retrieving model namespaces. Key changes include improvements to the object store drainer configuration, the addition of a new GetModelNamespaces method, and updates to object store metadata handling.

Object Store Enhancements:

  • Updated objectstoredrainer.ManifoldConfig in manifolds.go to include additional dependencies and services, such as GetControllerService, GetGuardService, and NewDrainerWorker, to improve object store draining capabilities. (cmd/jujud-controller/agent/machine/manifolds.go, cmd/jujud-controller/agent/machine/manifolds.goR751-R765)
  • Added a new ObjectStoreFlusher interface for flushing object store workers. (core/objectstore/objectstore.go, core/objectstore/objectstore.goR81-R86)
  • Updated the SelectFileHash function to replace the private selectFileHash function, ensuring consistent hash selection across object store implementations. (internal/objectstore/base.go, [1]; internal/objectstore/fileobjectstore.go, [2]; internal/objectstore/s3objectstore.go, [3])
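As a rough illustration of the two items above; the `ObjectStoreFlusher` method set and the `SelectFileHash` selection rule shown here are assumptions paraphrased from the description, not the merged code:

```go
package main

import (
	"context"
	"fmt"
)

// ObjectStoreFlusher mirrors the interface described above; the exact
// method set in core/objectstore is an assumption here.
type ObjectStoreFlusher interface {
	// FlushWorkers restarts every objectstore worker so that it
	// re-reads the controller's objectstore configuration.
	FlushWorkers(ctx context.Context) error
}

// Metadata carries both hashes: juju indexes by SHA-384, but SHA-256
// is also recorded for s3 compatibility.
type Metadata struct {
	SHA256 string
	SHA384 string
}

// SelectFileHash sketches the promoted helper: prefer the SHA-384 hash
// used for on-disk file names, falling back to SHA-256. The real
// selection logic in internal/objectstore may differ.
func SelectFileHash(m Metadata) string {
	if m.SHA384 != "" {
		return m.SHA384
	}
	return m.SHA256
}

func main() {
	fmt.Println(SelectFileHash(Metadata{SHA256: "abc", SHA384: "def"}))
}
```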

Model Namespace Retrieval:

  • Added a GetModelNamespaces method to the State interface and implemented it in the State struct, allowing retrieval of all model namespaces from the database. (domain/controller/service/service.go, [1] [2]; domain/controller/state/state.go, [3])
  • Extended unit tests to cover the GetModelNamespaces method, including scenarios where no namespaces are found. (domain/controller/service/service_test.go, [1]; domain/controller/state/state_test.go, [2])

Schema and Dependency Updates:

  • Modified the database schema for object_store_drain_phase_type to correct phase type ordering and added a foreign key constraint for phase_type_id. (domain/schema/controller/sql/0014-objectstore-metadata.sql, domain/schema/controller/sql/0014-objectstore-metadata.sqlL43-R51)
  • Updated dependencies in manifolds_test.go to reflect new requirements for the object-store-drainer manifold. (cmd/jujud-controller/agent/machine/manifolds_test.go, [1] [2])

These changes collectively improve the robustness and functionality of the system, particularly in managing object store operations and retrieving model-related data.

QA steps

Setup s3 (I'm using minio, but anything will suffice, even another Juju model with an s3 compatible application):

docker run -d \
   -p 9000:9000 \
   -p 9001:9001 \
   --user $(id -u):$(id -g) \
   --name minio1 \
   -e "MINIO_ROOT_USER=ROOTUSER" \
   -e "MINIO_ROOT_PASSWORD=CHANGEME123" \
   -v ${HOME}/.local/minio/data:/data \
   quay.io/minio/minio server /data --console-address ":9001"

MinIO is now a community edition, so the browser console is basic.

$ docker exec -it <container> bash
$ mc alias set foo http://localhost:9000 ROOTUSER CHANGEME123
$ mc admin info foo
$ mc admin accesskey create foo
# Note down the access keys

Bootstrap:

Note

Don't use localhost for the container address, otherwise it can't access it. Use your own address from ip a.

$ juju bootstrap lxd test
$ juju add-model default
$ juju deploy ubuntu

Take note of the controller UUID; you can inspect that model in the minio CLI/dashboard.

juju controller-config "object-store-type=s3" "object-store-s3-endpoint=<endpoint>" "object-store-s3-static-key=<key>" "object-store-s3-static-secret=<secret>"

Watch the logs for "object store draining completed successfully".

Make sure that the blobs have been removed:

$ juju ssh -m controller 0
$ juju_object_store_contents

There should only be the controller left.

Also check that the objectstore workers were flushed:

  object-store:
    inputs:
    - agent
    - trace
    - object-store-services
    - lease-manager
    - object-store-s3-caller
    - api-remote-caller
    - upgrade-database-flag
    report:
      workers:
        77f5809c-41c1-4a5c-8b7d-19f3ba7cd9c2:
          report:
            namespace: 77f5809c-41c1-4a5c-8b7d-19f3ba7cd9c2
            path: /var/lib/juju/objectstore/77f5809c-41c1-4a5c-8b7d-19f3ba7cd9c2
            rootBucket: juju-dcbe59b4-2752-4bb3-8e7d-e54c7cf603a1
            type: s3-object-store
          started: "2025-07-18 16:01:35"
          state: started
        a262007b-ae56-4812-8f4a-e029f3afd3e9:
          report:
            namespace: a262007b-ae56-4812-8f4a-e029f3afd3e9
            path: /var/lib/juju/objectstore/a262007b-ae56-4812-8f4a-e029f3afd3e9
            rootBucket: juju-dcbe59b4-2752-4bb3-8e7d-e54c7cf603a1
            type: s3-object-store
          started: "2025-07-18 16:01:35"
          state: started
        controller:
          report:
            namespace: controller
            path: /var/lib/juju/objectstore/controller
            rootBucket: juju-dcbe59b4-2752-4bb3-8e7d-e54c7cf603a1
            type: s3-object-store
          started: "2025-07-18 16:01:35"
          state: started

@jujubot jujubot added the 4.0 label Jul 16, 2025
@SimonRichardson SimonRichardson force-pushed the objectstore-drainer branch 4 times, most recently from f1252ab to c1ff598 Compare July 18, 2025 15:02
SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Jul 28, 2025
This adds the implementation for file objectstore removal code. This
should work in both HA and non-HA modes, as the witness for the object
store should happen everywhere.

The s3 objectstore is a lot more difficult and requires more thought.
It will be wise to wait for [1], which cleans up the s3 objectstore,
which might help the implementation of this.

 1. juju#20222
SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Jul 29, 2025
@SimonRichardson SimonRichardson marked this pull request as ready for review July 29, 2025 10:48
@SimonRichardson SimonRichardson force-pushed the objectstore-drainer branch 2 times, most recently from 3b47ad1 to f83acda Compare July 29, 2025 10:54
SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Jul 29, 2025
@SimonRichardson SimonRichardson force-pushed the objectstore-drainer branch 2 times, most recently from dcd55cb to 06f3e9c Compare August 4, 2025 07:23
@manadart manadart left a comment

Some nits, but this is very cogent, and works perfectly.

ShouldRestart: func(err error) bool {
return true
},
RestartDelay: time.Second * 10,
Member

Quite long, isn't it?

Member Author

Yeah, probably. I really want to get juju/worker#41 in at some point, which gives us the best of both worlds.

Member Author

All our workers restart after 10 seconds... I'm happy to revisit this as a whole.

@hpidcock hpidcock left a comment

Nice. Few little bits.

Comment on lines +190 to +193
// We need to compute the sha256 hash here, juju by default uses SHA384,
// but s3 defaults to SHA256.
// If the reader is a Seeker, then we can seek back to the beginning of
// the file, so that we can read it again.
Member

Feels like the object store model should compute and store both when an object is first stored.
Even though we only index by SHA-384, having the other values precomputed and stored in metadata can prove useful, like here.

Member Author

That's a big change, happy to make it. I'll do it in a follow-up.

func (w *drainWorker) Report() map[string]any {
return map[string]any{
"namespace": w.namespace,
"rootBucket": w.rootBucket,
Member

Nice to have, but can we please get some statistics here?
Objects drained.
Objects to be drained.
Current object draining (hash, size).

This will help very much in seeing if the drainer is stuck or just slowly progressing.

Member Author

Yes, good plan.
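A sketch of what such a report might look like; the field and report-key names (`objectsDrained`, `objectsToBeDrained`, `current`) are illustrative, not necessarily what eventually landed:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// drainWorker extends the Report() shown above with the drain
// statistics the reviewer asked for, so an operator can tell a stuck
// drainer from a slow one.
type drainWorker struct {
	namespace   string
	rootBucket  string
	total       int64
	drained     atomic.Int64
	currentHash string
	currentSize int64
}

func (w *drainWorker) Report() map[string]any {
	drained := w.drained.Load()
	return map[string]any{
		"namespace":          w.namespace,
		"rootBucket":         w.rootBucket,
		"objectsDrained":     drained,
		"objectsToBeDrained": w.total - drained,
		"current": map[string]any{
			"hash": w.currentHash,
			"size": w.currentSize,
		},
	}
}

func main() {
	w := &drainWorker{namespace: "controller", rootBucket: "juju-bucket", total: 10}
	w.drained.Store(4)
	r := w.Report()
	fmt.Println(r["objectsDrained"], r["objectsToBeDrained"])
}
```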

This is the very bare bones for the objectstore drainer.
We require this because we need to migrate all objects for every
model.
The drainer needs to operate on a model namespace and not globally.
This begins the wire up of the manifold. This will allow us to start
testing the draining implementation.
We're still missing the draining tests, but that will follow.

We now handle models being drained.
We need to ensure we have all the dependencies in the worker manifold
when we create the worker.
This reuses the drainer tests from the prior implementation that worked
and moved them to this location.
Stop the agentconfigupdater bouncing the objectstore when it detects
a different s3 type. This is brute force; we want to drain before the
bounce.
When the objectstore has finished draining, we need to flush the
objectstore. This does that before completing.
When draining from file to objectstore, we need to drain the files and
reload the objectstore workers. This wires everything up so it all
slots together nicely.
If there is an error setting the objectstore type, handle that error.
Now that we need to consult the objectstore type every time the worker
comes up, we need to account for that in tests.
When we get a change request, record the objectstore type, otherwise
it won't be correctly written.
Return an error with a not supported tag, not a NotSupportedf one with
a trailing string.
It's easier to read the non-generic type than the generic one.
If the worker already exists, we'll hang and prevent the worker from
completing. This causes no end of issues.

The solution is to crash the worker and cause it to start again.
If we end up with duplicated namespaces it will cause issues, so
prevent that from happening.
Check the objectstore type in the change config closure, we can
guarantee that it's not changed whilst we set it.
@SimonRichardson
Member Author

/merge

@jujubot jujubot merged commit 2c77ae4 into juju:main Aug 7, 2025
20 of 22 checks passed
@SimonRichardson SimonRichardson deleted the objectstore-drainer branch August 7, 2025 11:22
jujubot added a commit that referenced this pull request Aug 7, 2025
…iner

#20363

~~Requires #20222 to land first.~~

----

> [!NOTE]
> Fixes one of the issues from the parent change request. In order to move into HA, you need the agents moving. This changeset does that.

This was originally left out of the objectstore drainer as it was
becoming rather large. Instead, draining the agents' objectstore is
actually rather trivial. We don't allow it to run in parallel with
other models, just because that's hard to manage. Yet we can do it
upfront and then drain that, before moving on.

-----

This pull request introduces several enhancements and fixes related to object store management, controller state services, and database schema consistency. The most significant changes include the addition of a new method to retrieve all model namespaces, improvements to object store interfaces and reporting, and updates to dependency wiring and schema constraints.

**Controller State and Service Improvements:**

* Added a new `GetModelNamespaces` method to the controller state and service layers, allowing retrieval of all model namespaces. This includes interface, implementation, and test coverage.
* The `ObjectStoreServices` struct now exposes a `Controller()` method to access the controller service, facilitating service composition.

**Object Store Enhancements:**

* Introduced the `ObjectStoreFlusher` interface for flushing object store workers.
* Promoted `selectFileHash` to `SelectFileHash` for broader accessibility and updated all usages accordingly.
* Improved the `Report()` methods for both file and S3 object stores to include a `type` field, aiding diagnostics and introspection.

**Dependency and Manifold Wiring:**

* Expanded the configuration and dependency injection for the `object-store-drainer` manifold, adding several new dependencies and configuration fields to support more robust operation.

**Database Schema Fixes:**

* Corrected the values and foreign key constraint in the `object_store_drain_phase_type` table to ensure referential integrity and accurate phase representation.

**Cleanup and Refactoring:**

* Removed unused or redundant code related to object store type changes in the agent config updater, simplifying logic and reducing unnecessary agent restarts.

These changes collectively improve the maintainability, observability, and correctness of the object store and controller service subsystems.


## QA steps

Setup s3 (I'm using minio, but anything will suffice, even another Juju model with an s3 compatible application):


```sh
docker run -d \
  -p 9000:9000 \
  -p 9001:9001 \
  --user $(id -u):$(id -g) \
  --name minio1 \
  -e "MINIO_ROOT_USER=ROOTUSER" \
  -e "MINIO_ROOT_PASSWORD=CHANGEME123" \
  -v /home/jenkins/.local/minio/data:/data \
  quay.io/minio/minio server /data --console-address ":9001"
```

MinIO is now a community edition, so the browser console is basic.

```sh
$ docker exec -it <container> bash
$ mc alias set foo http://localhost:9000 ROOTUSER CHANGEME123
$ mc admin info foo
$ mc admin accesskey create foo
# Note down the access keys
```

Bootstrap:

> [!NOTE]
> Don't use localhost for the container address, otherwise it can't access it. Use your own address from `ip a`.

```sh
$ juju bootstrap lxd test
$ juju add-model default
$ juju deploy juju-qa-test
```

Take note of the controller UUID; you can inspect that model in the minio CLI/dashboard.

```sh
juju controller-config "object-store-type=s3" "object-store-s3-endpoint=<endpoint>" "object-store-s3-static-key=<key>" "object-store-s3-static-secret=<secret>"
```

Move into HA...

```sh
$ juju add-unit -m controller "controller" -n 2 
```
