-
Notifications
You must be signed in to change notification settings - Fork 244
feat: node-exporter into vhd build #7704
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
0d52887 to
60cc001
Compare
f0f23e8 to
eb8e316
Compare
68a2704 to
d494735
Compare
d494735 to
277e340
Compare
277e340 to
1ae3542
Compare
| @@ -0,0 +1,5 @@ | |||
| tls_server_config: | |||
| cert_file: "/etc/kubernetes/certs/kubeletserver.crt" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these paths aren't necessarily correct - they depend on whether kubelet serving certificate rotation is enabled - when it's disabled these paths are correct, however when it's enabled both cert_file and key_file should point towards: /var/lib/kubelet/pki/kubelet-server-current.pem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting this is copy paste from the aks-vm-extension repo and what is on my node today. I don't see any where it's touched, just a static file.
root@aks-sys-41317600-vmss000000:/etc/node-exporter.d# cat web-config.yml
tls_server_config:
cert_file: "/etc/kubernetes/certs/kubeletserver.crt"
key_file: "/etc/kubernetes/certs/kubeletserver.key"
client_auth_type: "RequireAndVerifyClientCert"
client_ca_file: "/etc/kubernetes/certs/ca.crt"
i think we could address this in the node-exporter-startup.sh and check for the existence of /var/lib/kubelet/pki/kubelet-server-current.pem and if it exists use it. And if not use /etc/kubernetes/certs/kubeletserver.crt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's tricky - kubeelet-server-currnet.pem won't exist until after kubelet requests it from the control plane after CSE exists, though we can check IMDS during CSE to see whether the feature itself will be enabled (like we currently do in configureKubeletServing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alrighty. Adjusted the service to wait for kubelet. Startup script to check the imds cache file first. If that's not around then call imds ourselves. And at the very end check if /var/lib/kubelet/pki/kubelet-server-current.pem exists and everything else said no for some reason.
...ux/cloud-init/artifacts/node-exporter/baseline/etc/systemd/system/node-exporter-restart.path
Outdated
Show resolved
Hide resolved
parts/linux/cloud-init/artifacts/node-exporter/node-exporter-startup.sh
Outdated
Show resolved
Hide resolved
| configureNodeExporter() { | ||
| echo "Configuring Node Exporter" | ||
| # Check for skip file to determine if node-exporter was installed on this VHD | ||
| if [ ! -f /etc/node-exporter.d/skip_vhd_node_exporter ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the skip file isn't there, meaning we don't want to skip, we end up skipping? this seems backwards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, rather than commit and manage an empty flag file in version control, can we instead pivot off some other node exporter asset file's existence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skip file is for the extension logic as a signal that the component is agentbaker managed and extension should skip all managing.
existing nodes should have all components from extension previously. If we were to check for node-exporter running it should always be there vhd or extension install originally and we wouldn't be able to tell which. The skip file becomes a clear sign that this was vhd baked and extension should ignore.
once we get to a point where extension isn't needed anymore we can easily remove it
| # Check for the skip sentinel file - this is essential for e2e testing | ||
| # If this file doesn't exist, e2e tests will silently skip node-exporter validation. Which is bad. | ||
| local skip_file="/etc/node-exporter.d/skip_vhd_node_exporter" | ||
| if [ ! -f "$skip_file" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, this seems backwards
| @@ -0,0 +1,32 @@ | |||
| #!/bin/sh | |||
|
|
|||
| if [ "$(cat /etc/os-release | grep ^ID= | cut -c 4-)" = "flatcar" ]; then | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need the flatcar check? it looks like we don't install node exporter on flatcar according to what's currently in install-dependencies.sh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah for the initial setup in AB i'm just looking at getting ubuntu and mariner working. Plan is to get all distros and this is copy paste from existing extension so eventually needed
| [ -n "${VHD_LOGS_FILEPATH:-}" ] && echo " - node-exporter ${NODE_EXPORTER_VERSION}-${NODE_EXPORTER_REVISION}" >> "${VHD_LOGS_FILEPATH}" | ||
|
|
||
| rm -rf "${NODE_EXPORTER_BUILD_ROOT}" | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A random question. Would you prefer to write this logic in bash or golang?
5d91c2a to
16b29eb
Compare
…cker Packer file provisioner cannot upload directories with nested subdirectories. Replace directory upload with individual file entries for all node-exporter baseline files.
16b29eb to
eeef055
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds node-exporter to the VHD build process by default for Ubuntu, Mariner, and Azure Linux distributions. The node-exporter service is installed during VHD build but disabled, then enabled and started during CSE (Cluster Service Extension) provisioning. The implementation includes automatic TLS configuration using kubelet certificates, with support for both static and rotating certificates.
Changes:
- Added node-exporter installation to VHD builder configurations for all supported platforms (base, arm64, mariner, flatcar, CVM variants)
- Implemented installation script with package version management from components.json
- Created startup script with automatic TLS configuration and certificate detection logic
- Added CSE integration to enable and start node-exporter during cluster provisioning
- Included VHD build tests and E2E validation for node-exporter functionality
Reviewed changes
Copilot reviewed 24 out of 88 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| vhdbuilder/packer/vhd-image-builder-*.json (8 files) | Added node-exporter file provisioning steps to all VHD builder variants |
| vhdbuilder/packer/install-node-exporter.sh | New installation script handling architecture detection, package download, and baseline asset deployment |
| vhdbuilder/packer/install-dependencies.sh | Integrated node-exporter installation into dependency installation flow |
| vhdbuilder/packer/imagecustomizer/azlosguard/azlosguard.yml | Added node-exporter assets to OSGuard image customizer configuration |
| parts/linux/cloud-init/artifacts/node-exporter/* | New node-exporter configuration files, systemd services, and startup script |
| parts/linux/cloud-init/artifacts/cse_*.sh | CSE integration: added configureNodeExporter function and error code |
| vhdbuilder/packer/test/linux-vhd-content-test.sh | Added testNodeExporter function to validate installation in VHD builds |
| parts/common/components.json | Added node-exporter package definitions with version information for Ubuntu, Mariner, and Azure Linux |
| e2e/validators.go | Added ValidateNodeExporter function for E2E testing |
| pkg/agent/testdata/CustomizedImage/CustomData | Updated test data with node-exporter installation |
| configureNodeExporter() { | ||
| echo "Configuring Node Exporter" | ||
| # Check for skip file to determine if node-exporter was installed on this VHD | ||
| if [ ! -f /etc/node-exporter.d/skip_vhd_node_exporter ]; then | ||
| echo "Node Exporter assets not found on this VHD (missing /etc/node-exporter.d/skip_vhd_node_exporter); skipping configuration." | ||
| return 0 | ||
| fi | ||
|
|
||
| if ! systemctlEnableAndStart node-exporter 30; then | ||
| echo "Failed to start node-exporter service" | ||
| return $ERR_NODE_EXPORTER_START_FAIL | ||
| fi | ||
| if ! systemctlEnableAndStart node-exporter-restart.path 30; then | ||
| echo "Failed to start node-exporter-restart.path" | ||
| return $ERR_NODE_EXPORTER_START_FAIL | ||
| fi | ||
| echo "Node Exporter started successfully" | ||
| } |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The configureNodeExporter function returns early (line 816) when the skip file is not found, but it doesn't indicate this in logs during actual failures. The function returns error code ERR_NODE_EXPORTER_START_FAIL (lines 821, 825) when services fail to start, but this will cause CSE to fail even though node-exporter is optional/non-critical.
Consider whether node-exporter failures should be fatal to CSE. If node-exporter is meant to be optional, these failures should be logged as warnings but not cause CSE to exit with an error code. If it's meant to be required, the current behavior is correct but should be clearly documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cse shouldn't fail if the skip file doesn't exist. This avoids a scenario with an old vhd but new cse. which i've heard is possible though awkward. If the skip files doesn't exist then aks-extension will install node-exporter
What this PR does / why we need it:
this is adding node-exporter into the vhdbuild by default. At the end of the vhdbuild we systemctl disable to allow a fresh start during cse letting the node-exporter-startup script run and gather up node specific details needed.
Which issue(s) this PR fixes:
Fixes #