-
Notifications
You must be signed in to change notification settings - Fork 244
fix(localdns): wait for resolv.conf update after networkctl reload to prevent race condition #7749
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
… prevent race condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a race condition that occurs after calling networkctl reload to update DNS configuration. Previously, the code would proceed immediately after the reload command without waiting for systemd-resolved to actually update the /run/systemd/resolve/resolv.conf file, potentially causing subsequent operations to work with stale DNS information.
Changes:
- Added
wait_for_dns_config_applied()function that polls resolv.conf to verify DNS configuration changes have been applied - Integrated the wait function after both networkctl reload calls to ensure DNS changes are complete before proceeding
- Added comprehensive test coverage for the new function
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/localdns.sh | Implements the new wait_for_dns_config_applied() function and integrates it after networkctl reload calls in disable_dhcp_use_clusterlistener and cleanup_iptables_and_dns |
| spec/parts/linux/cloud-init/artifacts/localdns_spec.sh | Adds comprehensive test coverage for wait_for_dns_config_applied with tests for success cases, timeout cases, edge cases, and partial IP matching |
Update log messages to use Error: prefix when wait_for_dns_config_applied fails, since these are failure conditions (return 1), not warnings. Updated corresponding test assertion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
…ad of being interpreted as regex wildcards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
| # Arguments: | ||
| # $1: expected_dns_ip - The DNS IP that should appear in resolv.conf. | ||
| # $2: should_contain - "true" if the IP should be present, "false" if it should be absent. | ||
| # $3: max_wait_seconds - Maximum time to wait for the change (default: 10). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is 10 seconds necessary?
| } | ||
|
|
||
| # Disable DNS provided by DHCP and point the system at localdns. | ||
| disable_dhcp_use_clusterlistener() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do not wait inside here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait right before replacing vnet dns ip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call wait before here
~# Replace AzureDNSIP in corefile with VNET DNS ServerIPs.
~# ---------------------------------------------------------------------------------------------------------------------
replace_azurednsip_in_corefile || exit $ERR_LOCALDNS_FAIL
| local current_dns | ||
| current_dns=$(awk '/^nameserver/ {print $2}' "$RESOLV_CONF" 2>/dev/null | paste -sd' ') | ||
|
|
||
| if [ "$should_contain" = "true" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to do this. You should just check if the resolv.conf nameserver is still LOCALDNS_NODE_LISTENER_IP. And you only need to wait for this result at start up time. Don't wait for this at shut down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shut down can just leave the the DHCP running async, because you don't need to read the value from resolv.conf when shutting down.
… prevent race condition
What this PR does / why we need it:
Immediately after networkctl reload, DNS settings may not have propagated from systemd-networkd (via DHCP) to systemd-resolved yet. As a result, /run/systemd/resolve/resolv.conf can still reflect the previous upstream DNS servers when replace_azurednsip_in_corefile runs.
This happens because networkctl reload only triggers a reload request over D-Bus; it does not wait for systemd-networkd to finish reprocessing configuration, re-acquire DHCP leases, or update systemd-resolved.
Which issue(s) this PR fixes:
Fixes #
to test: shellspec --shell bash --format d spec/parts/linux/cloud-init/artifacts/localdns_spec.sh