Skip to content

Conversation

@chmill-zz
Copy link
Contributor

What this PR does / why we need it:

this is adding node-exporter into the vhdbuild by default. At the end of the vhdbuild we systemctl disable to allow a fresh start during cse letting the node-exporter-startup script run and gather up node specific details needed.

Which issue(s) this PR fixes:

Fixes #

@chmill-zz chmill-zz changed the title Nodeexportershift feat: node-exporter into vhd build Jan 22, 2026
@github-actions github-actions bot added the components This pull request updates cached components on Linux or Windows VHDs label Jan 22, 2026
@chmill-zz chmill-zz force-pushed the nodeexportershift branch 2 times, most recently from 68a2704 to d494735 Compare January 23, 2026 21:59
@@ -0,0 +1,5 @@
tls_server_config:
cert_file: "/etc/kubernetes/certs/kubeletserver.crt"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these paths aren't necessarily correct - they depend on whether kubelet serving certificate rotation is enabled - when it's disabled these paths are correct, however when it's enabled both cert_file and key_file should point towards: /var/lib/kubelet/pki/kubelet-server-current.pem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting this is copy paste from the aks-vm-extension repo and what is on my node today. I don't see any where it's touched, just a static file.

root@aks-sys-41317600-vmss000000:/etc/node-exporter.d# cat web-config.yml
tls_server_config:
  cert_file: "/etc/kubernetes/certs/kubeletserver.crt"
  key_file: "/etc/kubernetes/certs/kubeletserver.key"
  client_auth_type: "RequireAndVerifyClientCert"
  client_ca_file: "/etc/kubernetes/certs/ca.crt"

i think we could address this in the node-exporter-startup.sh and check for the existence of /var/lib/kubelet/pki/kubelet-server-current.pem and if it exists use it. And if not use /etc/kubernetes/certs/kubeletserver.crt

Copy link
Contributor

@cameronmeissner cameronmeissner Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's tricky - kubeelet-server-currnet.pem won't exist until after kubelet requests it from the control plane after CSE exists, though we can check IMDS during CSE to see whether the feature itself will be enabled (like we currently do in configureKubeletServing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alrighty. Adjusted the service to wait for kubelet. Startup script to check the imds cache file first. If that's not around then call imds ourselves. And at the very end check if /var/lib/kubelet/pki/kubelet-server-current.pem exists and everything else said no for some reason.

configureNodeExporter() {
echo "Configuring Node Exporter"
# Check for skip file to determine if node-exporter was installed on this VHD
if [ ! -f /etc/node-exporter.d/skip_vhd_node_exporter ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the skip file isn't there, meaning we don't want to skip, we end up skipping? this seems backwards

Copy link
Contributor

@cameronmeissner cameronmeissner Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, rather than commit and manage an empty flag file in version control, can we instead pivot off some other node exporter asset file's existence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skip file is for the extension logic as a signal that the component is agentbaker managed and extension should skip all managing.

existing nodes should have all components from extension previously. If we were to check for node-exporter running it should always be there vhd or extension install originally and we wouldn't be able to tell which. The skip file becomes a clear sign that this was vhd baked and extension should ignore.

once we get to a point where extension isn't needed anymore we can easily remove it

# Check for the skip sentinel file - this is essential for e2e testing
# If this file doesn't exist, e2e tests will silently skip node-exporter validation. Which is bad.
local skip_file="/etc/node-exporter.d/skip_vhd_node_exporter"
if [ ! -f "$skip_file" ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, this seems backwards

@@ -0,0 +1,32 @@
#!/bin/sh

if [ "$(cat /etc/os-release | grep ^ID= | cut -c 4-)" = "flatcar" ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the flatcar check? it looks like we don't install node exporter on flatcar according to what's currently in install-dependencies.sh

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah for the initial setup in AB i'm just looking at getting ubuntu and mariner working. Plan is to get all distros and this is copy paste from existing extension so eventually needed

[ -n "${VHD_LOGS_FILEPATH:-}" ] && echo " - node-exporter ${NODE_EXPORTER_VERSION}-${NODE_EXPORTER_REVISION}" >> "${VHD_LOGS_FILEPATH}"

rm -rf "${NODE_EXPORTER_BUILD_ROOT}"
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A random question. Would you prefer to write this logic in bash or golang?

@chmill-zz chmill-zz force-pushed the nodeexportershift branch 8 times, most recently from 5d91c2a to 16b29eb Compare January 27, 2026 16:05
…cker

Packer file provisioner cannot upload directories with nested
subdirectories. Replace directory upload with individual file
entries for all node-exporter baseline files.
Copilot AI review requested due to automatic review settings January 27, 2026 23:51
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds node-exporter to the VHD build process by default for Ubuntu, Mariner, and Azure Linux distributions. The node-exporter service is installed during VHD build but disabled, then enabled and started during CSE (Cluster Service Extension) provisioning. The implementation includes automatic TLS configuration using kubelet certificates, with support for both static and rotating certificates.

Changes:

  • Added node-exporter installation to VHD builder configurations for all supported platforms (base, arm64, mariner, flatcar, CVM variants)
  • Implemented installation script with package version management from components.json
  • Created startup script with automatic TLS configuration and certificate detection logic
  • Added CSE integration to enable and start node-exporter during cluster provisioning
  • Included VHD build tests and E2E validation for node-exporter functionality

Reviewed changes

Copilot reviewed 24 out of 88 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
vhdbuilder/packer/vhd-image-builder-*.json (8 files) Added node-exporter file provisioning steps to all VHD builder variants
vhdbuilder/packer/install-node-exporter.sh New installation script handling architecture detection, package download, and baseline asset deployment
vhdbuilder/packer/install-dependencies.sh Integrated node-exporter installation into dependency installation flow
vhdbuilder/packer/imagecustomizer/azlosguard/azlosguard.yml Added node-exporter assets to OSGuard image customizer configuration
parts/linux/cloud-init/artifacts/node-exporter/* New node-exporter configuration files, systemd services, and startup script
parts/linux/cloud-init/artifacts/cse_*.sh CSE integration: added configureNodeExporter function and error code
vhdbuilder/packer/test/linux-vhd-content-test.sh Added testNodeExporter function to validate installation in VHD builds
parts/common/components.json Added node-exporter package definitions with version information for Ubuntu, Mariner, and Azure Linux
e2e/validators.go Added ValidateNodeExporter function for E2E testing
pkg/agent/testdata/CustomizedImage/CustomData Updated test data with node-exporter installation

Comment on lines +811 to +828
configureNodeExporter() {
echo "Configuring Node Exporter"
# Check for skip file to determine if node-exporter was installed on this VHD
if [ ! -f /etc/node-exporter.d/skip_vhd_node_exporter ]; then
echo "Node Exporter assets not found on this VHD (missing /etc/node-exporter.d/skip_vhd_node_exporter); skipping configuration."
return 0
fi

if ! systemctlEnableAndStart node-exporter 30; then
echo "Failed to start node-exporter service"
return $ERR_NODE_EXPORTER_START_FAIL
fi
if ! systemctlEnableAndStart node-exporter-restart.path 30; then
echo "Failed to start node-exporter-restart.path"
return $ERR_NODE_EXPORTER_START_FAIL
fi
echo "Node Exporter started successfully"
}
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configureNodeExporter function returns early (line 816) when the skip file is not found, but it doesn't indicate this in logs during actual failures. The function returns error code ERR_NODE_EXPORTER_START_FAIL (lines 821, 825) when services fail to start, but this will cause CSE to fail even though node-exporter is optional/non-critical.

Consider whether node-exporter failures should be fatal to CSE. If node-exporter is meant to be optional, these failures should be logged as warnings but not cause CSE to exit with an error code. If it's meant to be required, the current behavior is correct but should be clearly documented.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cse shouldn't fail if the skip file doesn't exist. This avoids a scenario with an old vhd but new cse. which i've heard is possible though awkward. If the skip files doesn't exist then aks-extension will install node-exporter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

components This pull request updates cached components on Linux or Windows VHDs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants