Network settings config upgrade in start script is brittle #907

@leighmcculloch

What problem does your feature solve?

The network settings config upgrade logic in the start script is brittle. The upgrade_soroban_config function uses stellar-core get-settings-upgrade-txs to generate transactions, submits them via curl to core's HTTP endpoint, and confirms they were applied by polling the global ledger.transaction.count metric.

For example:

quickstart/start

Lines 672 to 718 in 6357b28

upgrade_output="$(echo $NETWORK_ROOT_SECRET_KEY \
  | stellar-core get-settings-upgrade-txs \
    "$NETWORK_ROOT_ACCOUNT_ID" \
    "$seq_num" \
    "$NETWORK_PASSPHRASE" \
    --xdr `stellar-xdr encode --type ConfigUpgradeSet < "$config_file_path"` \
    --signtxs)"
let line_count=$(echo "$upgrade_output" | wc -l)
echo "$upgrade_output" | { \
  TX_COUNT="`curl -s http://localhost:11626/metrics | jq -r '.metrics."ledger.transaction.count".count'`"
  TX_COUNT=$((TX_COUNT+1))
  # If the line count is 9 instead of 7, a version of core is being used where the restore op is being returned
  if [ $line_count = 9 ] ; then
    read tx;
    read txid;
    echo "upgrades: soroban config: restore contract: $txid .. $(curl -sG 'http://localhost:11626/tx' --data-urlencode "blob=$tx" | jq -r '.status')";
    while [ "`curl -s http://localhost:11626/metrics | jq -r '.metrics."ledger.transaction.count".count'`" != "$TX_COUNT" ]; do sleep 1; done
    TX_COUNT=$((TX_COUNT+1))
  fi
  read tx; \
  read txid; \
  echo "upgrades: soroban config: install contract: $txid .. $(curl -sG 'http://localhost:11626/tx' --data-urlencode "blob=$tx" | jq -r '.status')"; \
  while [ "`curl -s http://localhost:11626/metrics | jq -r '.metrics."ledger.transaction.count".count'`" != "$TX_COUNT" ]; do sleep 1; done
  TX_COUNT=$((TX_COUNT+1)); \
  read tx; \
  read txid; \
  echo "upgrades: soroban config: deploy contract: $txid .. $(curl -sG 'http://localhost:11626/tx' --data-urlencode "blob=$tx" | jq -r '.status')"; \
  while [ "`curl -s http://localhost:11626/metrics | jq -r '.metrics."ledger.transaction.count".count'`" != "$TX_COUNT" ]; do sleep 1; done
  TX_COUNT=$((TX_COUNT+1)); \
  read tx; \
  read txid; \
  echo "upgrades: soroban config: upload config: $txid .. $(curl -sG 'http://localhost:11626/tx' --data-urlencode "blob=$tx" | jq -r '.status')"; \
  while [ "`curl -s http://localhost:11626/metrics | jq -r '.metrics."ledger.transaction.count".count'`" != "$TX_COUNT" ]; do sleep 1; done
  TX_COUNT=$((TX_COUNT+1)); \
  read key; \
  echo "upgrades: soroban config: set config with key: $key";
  OUTPUT="$(curl -sG 'http://localhost:11626/upgrades?mode=set&upgradetime=1970-01-01T00:00:00Z' --data-urlencode "configupgradesetkey=$key")"
  echo "$OUTPUT"; \
  if [ "$OUTPUT" == "Error setting configUpgradeSet" ]; then
    echo "!!!!! Unable to upgrade Soroban Config Settings. Stopping all services. !!!!!"
    kill_supervisor
  fi
}
echo "upgrades: soroban config done"

The script reads transactions and transaction IDs from stdout line-by-line, submits each via curl, then waits for the global transaction count metric to increment:

while [ "`curl -s http://localhost:11626/metrics | jq -r '.metrics."ledger.transaction.count".count'`" != "$TX_COUNT" ]; do sleep 1; done

This is brittle in several ways:

  • Transaction confirmation by global counter: The script does not verify that the specific transaction succeeded, only that the total transaction count reached the expected value. Because the loop waits for an exact match, an unrelated transaction that pushes the count past that value leaves the loop waiting indefinitely, and if a transaction fails but is still counted, it is treated as confirmed.
  • Output format coupling: The script checks whether the output has 9 lines rather than 7 (if [ $line_count = 9 ]) to detect whether a restore operation is included, which couples it tightly to the exact output format of stellar-core get-settings-upgrade-txs, a format that can change between versions.
  • Pipe-based parsing of stdout: The entire flow reads tx blobs and tx IDs via read from a piped subshell, which is fragile and hard to debug when something goes wrong.
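
To illustrate the first point, here is a minimal sketch of a wait loop that cannot spin forever. It accepts any count at or above the target and gives up after a fixed number of polls. The names wait_for_count_at_least, core_tx_count, and POLL_INTERVAL are hypothetical and not part of the start script. The sketch still relies on the global counter, so it removes the hang but does not confirm a specific transaction.

```shell
# Sketch only: poll until the reported count is at least $1, giving up after $2 polls.
# The remaining arguments are the command that prints the current count.
wait_for_count_at_least() {
  local target="$1"
  local max_polls="$2"
  shift 2
  local polls=0
  local current=""
  while [ "$polls" -lt "$max_polls" ]; do
    current="$("$@" 2>/dev/null)" || current=""
    case "$current" in
      ''|*[!0-9]*)
        # Fetch failed or returned a non-number (for example "null"): keep polling.
        ;;
      *)
        if [ "$current" -ge "$target" ]; then
          return 0
        fi
        ;;
    esac
    polls=$((polls + 1))
    sleep "${POLL_INTERVAL:-1}"
  done
  echo "gave up waiting for count >= $target (last seen: ${current:-none})" >&2
  return 1
}

# Hypothetical helper: fetches the same metric the start script polls today.
core_tx_count() {
  curl -s http://localhost:11626/metrics \
    | jq -r '.metrics."ledger.transaction.count".count'
}
```

In the start script this might be called as wait_for_count_at_least "$TX_COUNT" 60 core_tx_count || kill_supervisor, so that a stuck upgrade stops the services instead of hanging the container.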

@sisuresh and I have noticed some recent flaky build failures that may be related to this brittleness:

Related: #906, #555

What would you like to see?

Replace the brittle shell-based transaction submission and confirmation logic with something more robust. This could be part of a small Rust CLI tool (#906) that handles transaction submission and confirmation directly, or another approach that avoids relying on polling global metrics and parsing stdout line counts.
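
Whatever language ends up hosting this logic, one way to avoid the line-count check is to treat the output as any number of "tx, txid" pairs followed by one trailing key line. The bash sketch below is for comparison with the current script only. It assumes the output keeps that pair-then-key shape, which is inferred from the 7-line and 9-line cases the script already handles, and process_upgrade_output and the handler argument are hypothetical names.

```shell
# Sketch only: walk "tx, txid" pairs followed by a single trailing key line.
# $1 = name of a handler invoked as: handler <tx> <txid>
# Reads the tool output on stdin and prints the trailing key on stdout.
# Handlers should log to stderr so their output does not mix with the key.
process_upgrade_output() {
  local handler="$1"
  local tx=""
  local txid=""
  local key=""
  while IFS= read -r tx; do
    if IFS= read -r txid; then
      # Detach stdin so the handler cannot consume the remaining lines.
      "$handler" "$tx" "$txid" </dev/null || return 1
    else
      # An unpaired final line is taken to be the config upgrade set key.
      key="$tx"
    fi
  done
  if [ -z "$key" ]; then
    echo "no config upgrade set key found in output" >&2
    return 1
  fi
  printf '%s\n' "$key"
}
```

A caller could then write key="$(echo "$upgrade_output" | process_upgrade_output submit_and_confirm)", where submit_and_confirm is a hypothetical handler. The trade-off is that the step names (restore, install, deploy, upload) are no longer known by position, so log lines would identify each step by its txid instead.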

What alternatives are there?

  • Improve the shell script: Add retries, check transaction results directly via the /tx endpoint response, and make the output parsing more resilient. This improves reliability but still leaves the fundamental brittleness of doing this in bash.
  • Use stellar-cli: Ship stellar-cli with quickstart and use it for transaction submission. The downside is that stellar-cli is further downstream and harder to keep in sync with unreleased stellar-core changes.
  • Build into a small Rust CLI: As proposed in #906 (Add a small Rust CLI tool to the quickstart image for non-trivial startup logic), a minimal Rust tool could handle this logic more robustly with proper error handling and transaction result checking.
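
As a rough illustration of the first alternative, the sketch below acts on the status returned by the /tx endpoint instead of only echoing it. It assumes the status strings PENDING, DUPLICATE, ERROR, and TRY_AGAIN_LATER from stellar-core's HTTP tx command; that list should be checked against the core version in use. The function names and RETRY_INTERVAL are hypothetical.

```shell
# Sketch only: map a /tx status string to an action.
# Returns 0 = accepted by core, 2 = retry later, 1 = give up.
tx_status_action() {
  case "$1" in
    PENDING|DUPLICATE) return 0 ;;
    TRY_AGAIN_LATER)   return 2 ;;
    *)                 return 1 ;;  # ERROR, empty, or anything unrecognized
  esac
}

# Submit one transaction blob, retrying while core asks to try again later.
# $1 = tx blob, $2 = max attempts; remaining args = command that submits
# the blob (passed as its last argument) and prints the status string.
submit_tx() {
  local blob="$1"
  local max_attempts="$2"
  shift 2
  local attempt=1
  local status=""
  local rc=0
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$("$@" "$blob")" || status=""
    rc=0
    tx_status_action "$status" || rc=$?
    case "$rc" in
      0) echo "$status"; return 0 ;;
      1) echo "submission rejected with status: ${status:-none}" >&2; return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "${RETRY_INTERVAL:-1}"
  done
  echo "still TRY_AGAIN_LATER after $max_attempts attempts" >&2
  return 1
}

# Hypothetical helper: the same curl call the start script makes today.
core_submit() {
  curl -sG 'http://localhost:11626/tx' --data-urlencode "blob=$1" | jq -r '.status'
}
```

Usage could look like submit_tx "$tx" 5 core_submit || kill_supervisor. A PENDING status only means core accepted the transaction into its queue, so a separate check is still needed to confirm that it was applied successfully.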
