fix: Add waitready command to verify cluster ready #683

Open
johnramsden wants to merge 1 commit into canonical:main from johnramsden:john/CEPH-1590-wait-ready
@johnramsden
Member

Description

When an operator runs a command before the cluster is up, it can fail unexpectedly because bootstrap has not finished or the microcluster daemon is not yet available. This is particularly problematic in CI and scripting.

Add a waitready subcommand, similar to lxd waitready: https://manpages.debian.org/unstable/lxd/lxd.waitready.1

To confirm the cluster is up, we check that the microcluster daemon is ready and that Ceph is ready (ceph -s).

On failure, for example when the cluster has not been bootstrapped, we get a message like the following:

microceph waitready --timeout 30
Error: ceph not ready: timed out waiting for Ceph to become ready: context deadline exceeded
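A message of this shape is what layered error wrapping over an expired context typically produces in Go. The following is a minimal sketch, not the code in this PR: `waitUntil` and `demo` are hypothetical names, and the always-failing check stands in for an unbootstrapped cluster.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitUntil is a hypothetical helper: it calls check once per interval until
// check succeeds or ctx is cancelled or expires.
func waitUntil(ctx context.Context, interval time.Duration, check func() error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := check(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Wrap ctx.Err() so callers can still match it with errors.Is.
			return fmt.Errorf("timed out waiting for Ceph to become ready: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// demo waits on a check that never succeeds (assumption: this models a
// cluster that was never bootstrapped) and wraps the result once more.
func demo() error {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	err := waitUntil(ctx, 10*time.Millisecond, func() error {
		return errors.New("not bootstrapped")
	})
	return fmt.Errorf("ceph not ready: %w", err)
}

func main() {
	err := demo()
	fmt.Println("Error:", err)
	// The deadline error remains detectable through both wrapping layers.
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```

Because each layer wraps with `%w`, a script-facing caller could distinguish a timeout from other failures with `errors.Is(err, context.DeadlineExceeded)` rather than matching on the message text.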

When running the following, waitready should block until the cluster is ready, and status should then succeed:

sudo microceph cluster bootstrap &
sudo microceph waitready
sudo microceph status
[1] 35966
MicroCeph deployment summary:
- microceph (10.56.203.112)
  Services: mds, mgr, mon
  Disks: 0

Fixes #653

Type of change

Delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How has this been tested?

Added tests demonstrating that waitready waits and times out before bootstrap, and that waiting succeeds after bootstrap.

Contributor checklist

Please check that you have:

  • self-reviewed the code in this PR
  • added code comments, particularly in less straightforward areas
  • checked and added or updated relevant documentation
  • checked and added or updated relevant release notes
  • added tests to verify effectiveness of this change

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@johnramsden
Member Author

One note: I'm not sure whether ceph -s is completely sufficient, or whether there's anything else we want to wait on.

Collaborator

@sabaini left a comment

Hey @johnramsden thank you, lgtm in general, two comments inline

// It retries every second until success or the context is cancelled/expired.
func WaitForCephReady(ctx context.Context) error {
	for {
		_, err := common.ProcessExec.RunCommand("ceph", "-s")
Collaborator

Hm, a hanging ceph -s could block this forever. Not sure how often this occurs in practice, but for robustness we could use the cephRunContext() function and pass in the ctx.
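As a rough illustration of this suggestion, the sketch below ties each attempt to the context so that a hung child process is killed when the deadline passes. The signature of cephRunContext() is not shown in this thread, so the standard library's `exec.CommandContext` is used as a stand-in; `runWithContext`, `waitForCommand`, and `demoHang` are hypothetical names, and `sleep 5` plays the role of a hanging `ceph -s` (this assumes a Unix-like system).

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithContext is a stand-in for a context-aware helper (assumption: it is
// NOT the project's cephRunContext). The child is killed once ctx is done.
func runWithContext(ctx context.Context, name string, args ...string) error {
	return exec.CommandContext(ctx, name, args...).Run()
}

// waitForCommand retries the command every interval until it succeeds or
// ctx is cancelled or expires.
func waitForCommand(ctx context.Context, interval time.Duration, name string, args ...string) error {
	for {
		if err := runWithContext(ctx, name, args...); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for %s: %w", name, ctx.Err())
		case <-time.After(interval):
		}
	}
}

// demoHang runs a command that outlives the deadline and reports how long
// the wait actually took.
func demoHang() (time.Duration, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	start := time.Now()
	err := waitForCommand(ctx, 50*time.Millisecond, "sleep", "5")
	return time.Since(start), err
}

func main() {
	elapsed, err := demoHang()
	fmt.Println("returned before the command finished:", elapsed < 2*time.Second)
	fmt.Println("Error:", err)
}
```

Without the context on the command itself, the loop would only notice the deadline between attempts, so a single blocked attempt could outlast the timeout.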

}

ctx := context.Background()
if c.flagTimeout > 0 {
Collaborator

Minor nit: should we error out if operators pass in a negative timeout value?
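One possible shape for that validation is sketched below. It assumes the timeout flag is an integer number of seconds and that zero means "wait indefinitely"; neither is confirmed by the hunk above, and `contextForTimeout` is a hypothetical helper name.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// contextForTimeout is a hypothetical helper. Assumptions: the flag is in
// whole seconds, 0 means no deadline, and negative values are rejected.
func contextForTimeout(seconds int) (context.Context, context.CancelFunc, error) {
	if seconds < 0 {
		return nil, nil, fmt.Errorf("invalid timeout %d: must be zero or positive", seconds)
	}
	if seconds == 0 {
		// No deadline: the caller can still cancel explicitly.
		ctx, cancel := context.WithCancel(context.Background())
		return ctx, cancel, nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(seconds)*time.Second)
	return ctx, cancel, nil
}

func main() {
	for _, s := range []int{-5, 0, 30} {
		ctx, cancel, err := contextForTimeout(s)
		if err != nil {
			fmt.Println("Error:", err)
			continue
		}
		_, hasDeadline := ctx.Deadline()
		fmt.Println(s, "deadline set:", hasDeadline)
		cancel()
	}
}
```

Rejecting negative values up front avoids silently treating a typo such as `--timeout -30` as "no timeout".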

Development

Successfully merging this pull request may close these issues.

microceph wait-ready a way to wait for the microceph cluster untill it's up