> running-tests
running tests at various levels from smoke tests to full suite to randomized tests
Overview
This skill is for running tests systematically, starting with fast/focused tests and progressing to slower/broader tests. This ordering allows failures to be caught early, minimizing wasted time.
This skill is designed to be run as a subagent to avoid cluttering the invoking agent's context. The output is either confirmation that all tests passed, or a report of failures.
Required Inputs (Before Launching Subagent)
Since subagents cannot ask for clarification, the invoking agent must gather this information before launching:
- Changed files/modules: Which files or modules were changed, so the subagent can identify appropriate smoke tests and focused tests.
- Test levels to run: Which levels to execute. Options:
  - "smoke only" - just Level 1
  - "through focused" - Levels 1-2
  - "through full suite" - Levels 1-3 (usually sufficient for small changes)
  - "through full suite with tx-meta" - Levels 1-3 plus tx-meta baseline check
  - "through sanitizers" - Levels 1-4 (for memory/concurrency-sensitive code)
The subagent prompt should include: "Run tests <levels> for changes in <files/modules>."
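As a minimal sketch, the invoking agent could fill in that prompt template with a small shell helper. The function name build_subagent_prompt and the example paths are hypothetical; only the sentence template comes from this skill.

```sh
# Hypothetical helper: fill in the subagent prompt template.
# $1 = test levels (e.g. "through focused"); remaining args = changed files/modules.
build_subagent_prompt() {
    levels=$1
    shift
    # "$*" joins the remaining arguments with single spaces.
    printf 'Run tests %s for changes in %s.\n' "$levels" "$*"
}

# Example with illustrative paths:
build_subagent_prompt "through full suite" src/ledger/LedgerTxn.cpp src/bucket/
```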
Test Output Control
To reduce noise and keep agent context manageable, always use these flags:
# Recommended flags for quiet output
--ll fatal # Only log fatal errors (not info/debug messages)
-r simple # Use simple reporter (minimal output)
--disable-dots # Don't print progress dots
--abort # Stop on first failure (don't run remaining tests)
Example:
./stellar-core test --ll fatal -r simple --disable-dots --abort "test name"
If you need more detail to diagnose a failing test, raise the log level from
fatal to info, debug, or trace, for example with --ll debug or --ll trace.
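One way to keep the quiet flags consistent while still allowing a louder log level during diagnosis is a small wrapper like the sketch below. The function name quiet_test_flags is hypothetical; the flags themselves are the ones listed in this section. The commands are printed rather than executed.

```sh
# Hypothetical helper: print the recommended quiet flags.
# $1 = optional log level override (default: fatal).
quiet_test_flags() {
    printf '%s\n' "--ll ${1:-fatal} -r simple --disable-dots --abort"
}

# Default quiet invocation:
echo "./stellar-core test $(quiet_test_flags) \"test name\""
# Same test with debug logging while diagnosing a failure:
echo "./stellar-core test $(quiet_test_flags debug) \"test name\""
```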
Protocol Versions
Many tests are protocol-specific and can behave differently across protocol versions. Use these flags to control which protocol versions are tested:
--version <N> # Run tests for a specific protocol version
--all-versions # Run tests for all supported protocol versions
For focused testing during development, test with the current protocol version,
which is the default. The full test suite should eventually be run with
--all-versions.
Deterministic Random Number Generator
Tests use a deterministic PRNG. By default, the seed varies, but you can set a specific seed for reproducibility:
--rng-seed <N> # Use a specific RNG seed for reproducibility
This is useful for reproducing failures or for baseline checks that require consistent output.
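As a sketch of how a failure might be reproduced, the helper below prints a command that re-runs a single test with a fixed seed and, optionally, a specific protocol version. The function name repro_cmd is hypothetical, and it assumes the seed of the failing run is already known.

```sh
# Hypothetical helper: print a command that re-runs one test deterministically.
# $1 = test name, $2 = RNG seed, $3 = optional protocol version.
repro_cmd() {
    cmd="./stellar-core test --ll fatal -r simple --disable-dots --abort --rng-seed $2"
    if [ -n "${3:-}" ]; then
        cmd="$cmd --version $3"
    fi
    printf '%s "%s"\n' "$cmd" "$1"
}

repro_cmd "exact test name" 12345
```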
Test Levels
Tests are run in order of increasing cost. Stop at the first failure.
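The ordering rule can be sketched as a loop that runs each level's command in turn and stops at the first non-zero exit. The function name run_in_order is hypothetical, and the example uses stand-in commands rather than real test invocations.

```sh
# Hypothetical runner: execute each argument as a shell command, in order,
# and stop at the first one that exits non-zero.
run_in_order() {
    for step in "$@"; do
        echo "running: $step"
        if ! sh -c "$step"; then
            echo "stopped at first failure: $step"
            return 1
        fi
    done
    echo "all steps passed"
}

# Stand-in commands; real use would pass the Level 1, 2, and 3 commands.
run_in_order "true" "true"
```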
Level 1: Smoke Tests
Run 2-3 specific tests that are most likely to catch breakage in the changed code. These should complete in seconds.
To identify smoke tests:
- Find tests in the same file/module as the changed code
- Pick tests that directly exercise the modified functions
- Prefer fast tests over slow ones
# Run a specific test by name (use quotes for exact match)
./stellar-core test --ll fatal -r simple --disable-dots --abort "exact test name"
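To find candidate smoke tests in a test file, one option is to list the names declared with the Catch2 TEST_CASE macro. The sketch below is a rough text match, not a parser; the function name list_test_cases is hypothetical, and it assumes each declaration puts the macro and the quoted name on the same line.

```sh
# Hypothetical helper: list Catch2 test names declared in the given files.
# Matches TEST_CASE("name", ...) and macro variants whose names begin with
# TEST_CASE. Names containing escaped quotes are not handled.
list_test_cases() {
    sed -n 's/^[[:space:]]*TEST_CASE[A-Z_]*("\([^"]*\)".*/\1/p' "$@"
}

# Usage (path is a placeholder):
#   list_test_cases path/to/ModuleTests.cpp
```

Each printed name can then be passed in quotes to ./stellar-core test as an exact match.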
Level 2: Focused Unit Tests
Run all tests in the test file(s) related to the change. This typically takes a few minutes.
# Run tests matching a tag pattern
./stellar-core test --ll fatal -r simple --disable-dots --abort "[ModuleName*]"
# Run tests from a specific area
./stellar-core test --ll fatal -r simple --disable-dots --abort "[ledgertxn]"
# Combine tags (AND logic - must match all)
./stellar-core test --ll fatal -r simple --disable-dots --abort "[tx][soroban]"
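Picking a tag from a changed path could be automated roughly as follows. This is only a sketch: the function name tag_for_path is hypothetical, and the directory-to-tag pairs are assumptions for illustration that should be checked against the actual test files.

```sh
# Hypothetical mapping from a changed file's path to one of the bracketed
# tags used in this skill. The pairs below are illustrative assumptions.
tag_for_path() {
    case "$1" in
        src/ledger/LedgerTxn*) echo "[ledgertxn]" ;;
        src/bucket/*)          echo "[bucket]" ;;
        src/herder/*)          echo "[herder]" ;;
        src/overlay/*)         echo "[overlay]" ;;
        src/crypto/*)          echo "[crypto]" ;;
        src/transactions/*)    echo "[tx]" ;;
        *)                     return 1 ;;  # unknown area: choose tags by hand
    esac
}

tag_for_path src/bucket/BucketList.cpp
```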
Example Test Names by Area
Ledger/Transaction tests:
- "[ledgertxn]" - LedgerTxn operations
- "[tx][payment]" - Payment transaction tests
- "[tx][createaccount]" - CreateAccount tests
- "[tx][offers]" - Offer/DEX tests
- "[tx][soroban]" - Soroban (smart contract) transaction tests
Bucket/BucketList tests:
- "[bucket]" - General bucket tests
- "[bucketlist]" - BucketList specific tests
- "[bucketmergemap]" - Bucket merge map tests
Herder tests:
- "[herder]" - General herder tests
- "[txset]" - Transaction set tests
- "[transactionqueue]" - Transaction queue tests
- "[quorumintersection]" - Quorum intersection tests
- "[upgrades]" - Protocol upgrade tests
Overlay/Network tests:
- "[overlay]" - Overlay network tests
- "[flood]" - Transaction flooding tests
- "[PeerManager]" - Peer management tests
Crypto/Utility tests:
- "[crypto]" - Cryptography tests
- "[decoder]" - Base32/64 encoding tests
- "[timer]" - VirtualClock timer tests
- "[cache]" - Cache implementation tests
Soroban-specific tests:
- "[soroban]" - All Soroban tests
- "[soroban][archival]" - State archival tests
- "[soroban][upgrades]" - Soroban upgrade tests
Level 3: Full Unit Test Suite
Run the complete unit test suite. This may take 10-30 minutes.
Basic Execution
make check
Or directly with quiet output:
./stellar-core test --ll fatal -r simple --disable-dots --abort
Parallel Execution (faster)
For faster execution, use parallel partitions via make check:
# Run with partitions equal to CPU cores
NUM_PARTITIONS=$(nproc) make check
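nproc is part of GNU coreutils and may be missing on some systems (macOS, for example). A portable fallback could look like the sketch below; the function name cpu_count is hypothetical, and the final command is printed rather than executed.

```sh
# Print the number of online CPUs, trying nproc first and falling back to
# getconf or sysctl where nproc is not installed. Defaults to 1.
cpu_count() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    elif n=$(getconf _NPROCESSORS_ONLN 2>/dev/null); then
        echo "$n"
    elif n=$(sysctl -n hw.ncpu 2>/dev/null); then
        echo "$n"
    else
        echo 1
    fi
}

echo "NUM_PARTITIONS=$(cpu_count) make check"
```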
Full Protocol Coverage
The full test suite should be run with all protocol versions:
ALL_VERSIONS=1 NUM_PARTITIONS=$(nproc) make check
SQLite-Only Testing (No Postgres)
To test with SQLite only (faster, no Postgres dependency):
./configure --disable-postgres --enable-ccache --enable-sdfprefs
make clean && make -j $(nproc)
NUM_PARTITIONS=$(nproc) make check
Level 3b: Transaction Metadata Baseline Check
This validates that transaction test execution produces the same metadata hashes as fixed baselines stored in the repository. This catches unintended changes to transaction semantics.
Important: Always use --rng-seed 12345 for baseline checks to ensure
deterministic results.
# Check transaction tests against current protocol baseline
./stellar-core test "[tx]" --all-versions --rng-seed 12345 --ll fatal \
--disable-dots --abort -r simple --check-test-tx-meta test-tx-meta-baseline-current
For next-protocol testing (when preparing protocol upgrades):
./stellar-core test "[tx]" --all-versions --rng-seed 12345 --ll fatal \
--disable-dots --abort -r simple --check-test-tx-meta test-tx-meta-baseline-next
If an intentional change alters transaction metadata, the check fails and indicates which baselines differ; those baselines then need to be updated.
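Since the two baseline commands differ only in the baseline directory, they can be generated from one helper. This is a sketch: the function name tx_meta_check_cmd is hypothetical, the command is printed rather than run, and --disable-dots is included to match the recommended quiet flags.

```sh
# Hypothetical helper: print the tx-meta baseline check command.
# $1 = "current" (default) or "next".
tx_meta_check_cmd() {
    case "${1:-current}" in
        current|next) ;;
        *) echo "unknown baseline: $1" >&2; return 1 ;;
    esac
    printf '%s\n' "./stellar-core test \"[tx]\" --all-versions --rng-seed 12345 --ll fatal --disable-dots --abort -r simple --check-test-tx-meta test-tx-meta-baseline-${1:-current}"
}

tx_meta_check_cmd next
```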
Level 4: Sanitizer Tests
When to run: Only needed for changes touching memory management, pointers, concurrency, or threading code. Skip for simple logic changes, config changes, or test-only changes.
Run tests with sanitizers enabled to catch memory errors and undefined behavior. This requires reconfiguring and rebuilding.
Address Sanitizer (ASan)
Catches memory errors: buffer overflows, use-after-free, memory leaks.
./configure --enable-asan --enable-ccache --enable-sdfprefs
make clean && make -j $(nproc)
./stellar-core test --ll fatal -r simple --disable-dots --abort
Thread Sanitizer (TSan)
Catches data races and threading issues.
./configure --enable-threadsanitizer --enable-ccache --enable-sdfprefs
make clean && make -j $(nproc)
./stellar-core test --ll fatal -r simple --disable-dots --abort
Undefined Behavior Sanitizer (UBSan)
Catches undefined behavior such as signed integer overflow and null pointer dereference.
./configure --enable-undefinedcheck --enable-ccache --enable-sdfprefs
make clean && make -j $(nproc)
./stellar-core test --ll fatal -r simple --disable-dots --abort
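The three sanitizer builds differ only in one configure flag, which the sketch below selects from a short name. The function name sanitizer_configure_cmd and the short names asan, tsan, and ubsan are hypothetical; the configure flags are the ones shown above. The command is printed, not run.

```sh
# Hypothetical helper: print the configure command for a sanitizer build.
# $1 = asan | tsan | ubsan
sanitizer_configure_cmd() {
    case "$1" in
        asan)  flag=--enable-asan ;;
        tsan)  flag=--enable-threadsanitizer ;;
        ubsan) flag=--enable-undefinedcheck ;;
        *) echo "unknown sanitizer: $1" >&2; return 1 ;;
    esac
    echo "./configure $flag --enable-ccache --enable-sdfprefs"
}

sanitizer_configure_cmd tsan
```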
Level 5: Extra Checks Build
When to run: Only for changes to core data structures or when Level 4 sanitizers found something suspicious. Usually overkill.
Run with C++ standard library debugging enabled. Slower but catches more issues.
./configure --enable-extrachecks --enable-ccache --enable-sdfprefs
make clean && make -j $(nproc)
./stellar-core test --ll fatal -r simple --disable-dots --abort
Build Verification
Before running tests at Levels 4-5, also verify the build succeeds with
--disable-tests (the production configuration):
./configure --disable-tests --enable-ccache --enable-sdfprefs
make clean && make -j $(nproc)
This doesn't run tests but ensures the production build works.
Interpreting Failures
When a test fails:
- Identify the failing test: Note the exact test name and file
- Capture the failure output: Save the error message and stack trace
- Determine if it's a real failure: Check if the test is flaky or if this is a genuine regression
- Locate the relevant code: Find where in the changed code the failure originates
Common Failure Patterns
- Assertion failure: A test assertion didn't hold; check the condition
- Crash/segfault: Memory error; run with ASan for more details
- Timeout: Test took too long; may indicate infinite loop or deadlock
- Sanitizer error: Memory or threading bug; the sanitizer output shows where
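A first-pass triage of captured output into these patterns might look like the sketch below. The function name classify_failure is hypothetical, and the grep patterns are assumptions about typical sanitizer, crash, and Catch2 output rather than an exhaustive list; the result is a hint for the report, not a diagnosis.

```sh
# Hypothetical classifier: read test output on stdin and print the first
# matching failure pattern. Sanitizer reports are checked first because
# they can also mention crashes.
classify_failure() {
    out=$(cat)
    if printf '%s\n' "$out" | grep -Eq 'AddressSanitizer|ThreadSanitizer|LeakSanitizer|runtime error:'; then
        echo "sanitizer error"
    elif printf '%s\n' "$out" | grep -Eiq 'segmentation fault|SIGSEGV'; then
        echo "crash/segfault"
    elif printf '%s\n' "$out" | grep -Eiq 'timed out|timeout'; then
        echo "timeout"
    elif printf '%s\n' "$out" | grep -Eq 'REQUIRE\(|CHECK\(|FAILED'; then
        echo "assertion failure"
    else
        echo "unknown"
    fi
}

printf 'REQUIRE( result == TRANSACTION_REJECTED )\n' | classify_failure
```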
Output Format
Report the results:
## Test Results: PASS
All test levels completed successfully:
- Level 1 (Smoke): 3 tests, 2.1s
- Level 2 (Focused): 47 tests, 1m 12s
- Level 3 (Full Suite): 1,234 tests, 18m 45s
- Level 3b (TX Meta Baseline): OK
Build verification:
- --disable-tests: OK
Or on failure:
## Test Results: FAIL
Failed at Level 2 (Focused Unit Tests)
**Failing test:** `LedgerManagerTests.processTransactionRejectsEmpty`
**File:** src/ledger/LedgerManagerTests.cpp:142
**Error:**
REQUIRE( result == TRANSACTION_REJECTED )
with expansion:
TRANSACTION_SUCCESS == TRANSACTION_REJECTED
**Analysis:** The test expects empty transactions to be rejected, but the
new code path is allowing them through. See LedgerManager.cpp:98 where the
empty check appears to be missing.
Levels completed before failure:
- Level 1 (Smoke): 3 tests, 2.1s ✓
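The per-level timing in these reports can be collected with a small wrapper such as the sketch below. The function names format_duration and timed_level are hypothetical. It uses date +%s, so timings are whole seconds rather than the sub-second values in the sample report, and it does not capture test counts.

```sh
# Format a whole number of seconds as "Xm Ys" (or "Ys" under a minute).
format_duration() {
    secs=$1
    if [ "$secs" -ge 60 ]; then
        echo "$((secs / 60))m $((secs % 60))s"
    else
        echo "${secs}s"
    fi
}

# Hypothetical wrapper: run a command, then print a report line with timing.
# $1 = level label; remaining args = command to run.
timed_level() {
    label=$1
    shift
    start=$(date +%s)
    if "$@"; then status=0; else status=$?; fi
    end=$(date +%s)
    if [ "$status" -eq 0 ]; then mark=OK; else mark=FAILED; fi
    echo "- $label: $mark, $(format_duration $((end - start)))"
    return "$status"
}

# Stand-in command; real use would pass a ./stellar-core test invocation.
timed_level "Level 1 (Smoke)" true
```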
Choosing the Right Test Level
For most changes (logic fixes, new features, refactors):
- Run through Level 3 (full suite) with --all-versions
- Run Level 3b (tx-meta baseline) for transaction-related changes
- Skip Levels 4-5 unless the change touches memory/threading
For memory-sensitive changes (pointers, allocations, C++ containers):
- Run through Level 4 (at least ASan)
For concurrency changes (threading, async, locks):
- Run through Level 4 (especially TSan)
For test-only changes or documentation:
- Levels 1-2 are usually sufficient
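The guidance above can be summarized as a lookup from the kind of change to one of the test-level options from the Required Inputs section. This is a sketch only: the function name levels_for_change and the change-kind keywords are hypothetical.

```sh
# Hypothetical helper: map a kind of change to a test-level option.
levels_for_change() {
    case "$1" in
        logic|feature|refactor) echo "through full suite" ;;
        transaction)            echo "through full suite with tx-meta" ;;
        memory|concurrency)     echo "through sanitizers" ;;
        test-only|docs)         echo "through focused" ;;
        *) echo "unknown change kind: $1" >&2; return 1 ;;
    esac
}

levels_for_change concurrency
```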
ALWAYS
- ALWAYS run tests in order of increasing cost
- ALWAYS stop at the first failure (use --abort flag)
- ALWAYS use --ll fatal -r simple --disable-dots for quiet output
- ALWAYS capture and report failure details
- ALWAYS run full suite with --all-versions before considering complete
- ALWAYS use --rng-seed 12345 for tx-meta baseline checks
- ALWAYS report timing for each level
- ALWAYS identify the specific test and location of failures
NEVER
- NEVER skip smoke tests and go straight to full suite
- NEVER continue to later levels after a failure
- NEVER report "tests failed" without specifics
- NEVER assume a test failure is flaky without evidence
- NEVER run verbose output that floods the context
- NEVER run tests without having built first
- NEVER run sanitizer or extra-checks builds (Levels 4-5) for trivial changes (it's overkill)
Completion
Report to the invoking agent:
- Overall result: PASS or FAIL
- For PASS: Summary of all levels completed with timing
- For FAIL: Detailed failure report with analysis
- Any observations (slow tests, warnings, etc.)