Production deploy can cause site-wide downtime #316

@zwolf

Description

During a production deploy of the http-frontend this week, we started getting service-down alerts. Both new http-frontend pods were failing their kubernetes liveness and readiness probes, so they were deemed unhealthy and restarted, which caused traffic routing to fail. Neither pod showed anything unusual in its logs, and both seemed to be running fine apart from the k8s probe failure events. The deployment's rollout strategy isn't defined, which should mean it uses the "rolling update" default behind the HPA. So if the new pods were failing their health checks, why did the originals ever get shut down?
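For reference, when `spec.strategy` is omitted from a Deployment, kubernetes behaves as if the following had been written (these are the documented defaults, not something from our manifests):

```yaml
# Equivalent of leaving spec.strategy unset on a Deployment:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%   # up to a quarter of desired replicas may be down mid-rollout
      maxSurge: 25%         # up to a quarter extra replicas may be created above desired count
```

With only two replicas, `maxUnavailable: 25%` rounds down to 0, so in theory an old pod shouldn't be terminated before a new one reports Ready — which is part of why this outage is surprising.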

Maybe we can adjust the deployment to ensure that new pods are passing their health checks before they're promoted. We could also use staging as a more robust test environment for verifying that traffic can be pushed through the nginx service, though that assumes the problem would manifest in staging at all.
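A sketch of what that adjustment could look like — the probe path, port, and timings below are placeholders and would need to match whatever the http-frontend actually serves:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: http-frontend
spec:
  minReadySeconds: 30        # a new pod must stay Ready this long before it counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0      # never terminate an old pod before a replacement is Ready
      maxSurge: 1
  template:
    spec:
      containers:
        - name: http-frontend
          readinessProbe:
            httpGet:
              path: /healthz   # assumption: substitute the real health endpoint
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
```

`maxUnavailable: 0` plus `minReadySeconds` makes the rollout strictly gated on the new pods holding a Ready state, so a deploy with flapping probes would stall (and be visible) rather than take down serving capacity.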
