> subsystem-summary-of-bucket
read this skill for a token-efficient summary of the bucket subsystem
curl "https://skillshub.wtf/stellar/stellar-core/subsystem-summary-of-bucket?format=md"

Bucket Subsystem Technical Summary
Overview
The bucket subsystem implements a log-structured merge tree (LSM-tree) data structure called the BucketList. It maintains a canonical, hash-verifiable representation of all ledger state. There are two BucketList instances: the LiveBucketList (current ledger state) and the HotArchiveBucketList (recently evicted entries). Each is organized into 11 temporal levels (level 0 being youngest/smallest, level 10 being oldest/largest), where older levels are exponentially larger and change less frequently.
The system is designed to:
- Provide a single canonical hash of all ledger entries without rehashing the entire database on each ledger close.
- Enable efficient "catch-up" via incremental bucket downloads from history archives.
- Support point and bulk lookups of ledger entries via indexed bucket files (BucketListDB).
Key Classes and Data Structures
Bucket Types (CRTP Hierarchy)
- `BucketBase<BucketT, IndexT>` — Abstract CRTP base for immutable, sorted, hashed containers of XDR entries. Holds a filename, hash, size, and optional index (`shared_ptr<IndexT>`). Provides the core `merge()` and `mergeInternal()` static methods. Buckets are designed to be held in `shared_ptr` and shared across threads.
- `LiveBucket` — Stores `BucketEntry` (INITENTRY, LIVEENTRY, DEADENTRY, METAENTRY). Supports shadows, INIT/DEAD annihilation, and in-memory level-0 merges. Has an optional `mEntries` vector for in-memory-only buckets. Index type: `LiveBucketIndex`. Key methods:
  - `fresh()` — Creates a new bucket from init/live/dead entry vectors; sorts, hashes, and writes it to disk.
  - `freshInMemoryOnly()` — Creates a bucket that exists only in memory (for the level-0 "snap" that immediately merges).
  - `mergeInMemory()` — Merges two in-memory buckets without a `FutureBucket`; used only for level 0.
  - `merge()` (inherited) — File-based merge of two buckets with optional shadows.
  - `maybePut()` — Shadow-aware write: elides entries shadowed by newer buckets (pre-v12) and preserves INIT/DEAD lifecycle entries (v11+).
  - `mergeCasesWithEqualKeys()` — Handles INIT/DEAD annihilation and INIT+LIVE→INIT promotion.
  - `convertToBucketEntry()` — Converts raw ledger entry vectors into a sorted `BucketEntry` vector.
  - `isTombstoneEntry()` — Returns true for DEADENTRY.
- `HotArchiveBucket` — Stores `HotArchiveBucketEntry` (HOT_ARCHIVE_ARCHIVED, HOT_ARCHIVE_LIVE, HOT_ARCHIVE_METAENTRY). No shadow support. Index type: `HotArchiveBucketIndex`. HOT_ARCHIVE_LIVE acts as the tombstone (restored entries). Key methods:
  - `fresh()` — Creates a bucket from archived entries and restored keys.
  - `maybePut()` — Always writes (no shadow logic).
  - `isTombstoneEntry()` — Returns true for HOT_ARCHIVE_LIVE.
BucketList Structure
- `BucketListBase<BucketT>` — Abstract templated base for the BucketList data structure. Contains a vector of `BucketLevel<BucketT>`. Defines the temporal-leveling algorithm: level sizes are powers of 4 (`levelSize(i) = 4^(i+1)`), each split into curr and snap halves. Key methods:
  - `addBatchInternal()` — Main entry point: adds a batch of entries at a ledger close. Walks levels top-down, calling `snap()` and `prepare()` on levels that should spill. Level 0 uses `prepareFirstLevel()` for in-memory merges.
  - `levelShouldSpill()` — Returns true when a level needs to snapshot curr→snap and merge snap into the next level.
  - `restartMerges()` — Restarts merges after deserialization (catchup or restart). For v12+ merges, reconstructs from current BucketList state; for older merges, uses serialized hashes.
  - `resolveAnyReadyFutures()` — Non-blocking resolution of completed merges.
  - `getHash()` — Returns the concatenated hash of all level hashes (each level = hash of curr + snap).
  - Static methods: `levelSize()`, `levelHalf()`, `sizeOfCurr()`, `sizeOfSnap()`, `oldestLedgerInCurr()`, `oldestLedgerInSnap()`, `keepTombstoneEntries()`, `bucketUpdatePeriod()`.
- `BucketLevel<BucketT>` — A single level in the BucketList. Holds `mCurr`, `mSnap` (both `shared_ptr<BucketT>`), and `mNextCurr` (a `std::variant<FutureBucket<BucketT>, shared_ptr<BucketT>>`). Key methods:
  - `prepare()` — Starts an async merge via `FutureBucket` (used for levels 1+).
  - `prepareFirstLevel()` — Specialization for level 0: does an in-memory merge if possible (`LiveBucket::mergeInMemory`), falls back to `prepare()` otherwise.
  - `commit()` — Resolves any pending merge and sets the result as the new curr.
  - `snap()` — Moves curr to snap and resets curr to an empty bucket.
- `LiveBucketList` — Extends `BucketListBase<LiveBucket>`. Adds eviction-related methods (`updateStartingEvictionIterator`, `updateEvictionIterAndRecordStats`, `checkIfEvictionScanIsStuck`) and `addBatch()`, which calls `addBatchInternal()` with init/live/dead entry vectors. Also `maybeInitializeCaches()` for the index random-eviction caches.
- `HotArchiveBucketList` — Extends `BucketListBase<HotArchiveBucket>`. Simpler `addBatch()` with archived/restored entry vectors.
BucketManager
`BucketManager` — Singleton owner of the BucketList instances, bucket files, and merge futures. Thread-safe for bucket file operations via `mBucketMutex`. Key responsibilities:
- Bucket file management: `adoptFileAsBucket()` moves temp files into the bucket directory, deduplicating by hash. `forgetUnreferencedBuckets()` GCs unreferenced buckets. `cleanupStaleFiles()` removes orphaned files.
- Merge future tracking: `mLiveBucketFutures`/`mHotArchiveBucketFutures` (hash maps of `MergeKey → shared_future`) track running merges. `mFinishedMerges` (`BucketMergeMap`) records completed merge input→output mappings for reattachment.
- Batch ingestion: `addLiveBatch()` and `addHotArchiveBatch()` feed new entries from ledger close into the BucketList.
- Snapshotting: `snapshotLedger()` computes the `bucketListHash` for the LedgerHeader.
- Eviction: `startBackgroundEvictionScan()` launches an async eviction scan on a snapshot; `resolveBackgroundEvictionScan()` applies the results.
- State management: `assumeState()` loads the BucketList from a `HistoryArchiveState`. `loadCompleteLedgerState()` materializes the full ledger into a map.
- Index management: `maybeSetIndex()` sets the index for a bucket, handling race conditions on startup.
- Owns: `LiveBucketList`, `HotArchiveBucketList`, `BucketSnapshotManager`, `TmpDirManager`, bucket maps (`mSharedLiveBuckets`, `mSharedHotArchiveBuckets`), merge future maps, `BucketMergeMap`.
Merge Infrastructure
- `FutureBucket<BucketT>` — Wraps a `std::shared_future<shared_ptr<BucketT>>` representing an in-progress or completed merge. Has 5 states: `FB_CLEAR`, `FB_HASH_OUTPUT`, `FB_HASH_INPUTS`, `FB_LIVE_OUTPUT`, `FB_LIVE_INPUTS`. Serializable via cereal (saves/loads hash strings). Key lifecycle:
  - Constructed with live inputs → `FB_LIVE_INPUTS`, immediately calls `startMerge()`.
  - `startMerge()` checks for an existing merge via `BucketManager::getMergeFuture()` (reattachment). If none, it creates a `packaged_task` that calls `BucketT::merge()` and posts it to a background worker thread.
  - `mergeComplete()` polls the future. `resolve()` blocks to get the result → `FB_LIVE_OUTPUT`.
  - Serialized state can be `FB_HASH_INPUTS` or `FB_HASH_OUTPUT`. `makeLive()` reconstitutes from hashes.
- `MergeKey` — Identifies a merge by its inputs: `keepTombstoneEntries`, `inputCurrHash`, `inputSnapHash`, `shadowHashes`. Used as the key in the merge future/finished maps.
- `BucketMergeMap` — Bidirectional weak mapping of merge input→output and output→input. Stores `MergeKey → Hash`, `Hash → Hash` (input→output multimap), and `Hash → MergeKey` (output→input multimap). Used for merge reattachment: if a merge's output bucket still exists, a pre-resolved future can be synthesized instead of re-running the merge.
- `MergeInput<BucketT>` (abstract), `FileMergeInput<BucketT>`, `MemoryMergeInput<BucketT>` — Adapters providing a uniform interface over either `BucketInputIterator` pairs (file-based merge) or in-memory `vector<EntryT>` pairs. Methods: `isDone()`, `oldFirst()`, `newFirst()`, `equalKeys()`, `getOldEntry()`, `getNewEntry()`, `advanceOld()`, `advanceNew()`.
I/O Iterators
- `BucketInputIterator<BucketT>` — Reads entries sequentially from a bucket file via `XDRInputFileStream`. Auto-extracts the leading METAENTRY. Provides `operator*`, `operator++`, `pos()`, `seek()`, `size()`.
- `BucketOutputIterator<BucketT>` — Writes entries to a temp file, computing a running SHA-256 hash. `put()` buffers one entry to deduplicate adjacent same-key entries. `getBucket()` finalizes the file, calls `BucketManager::adoptFileAsBucket()`, and returns the new bucket. Respects `keepTombstoneEntries` to elide tombstones at the bottom level. Writes a METAENTRY at the start if the protocol version is ≥ 11.
Snapshot & Query Layer (BucketListDB)
- `BucketListSnapshotData<BucketT>` — Immutable snapshot of a BucketList: a vector of `Level{curr, snap}` (shared_ptrs to const buckets) plus a `LedgerHeader`. Thread-safe to share.
- `SearchableBucketListSnapshot<BucketT>` — Provides lookup functionality over a snapshot. Each instance owns mutable file stream caches (`mStreams`) for I/O. Key methods:
  - `load(LedgerKey)` — Point lookup: iterates buckets newest to oldest and returns the first match via index lookup + file read. Returns the `LoadT` (LedgerEntry for live, HotArchiveBucketEntry for hot archive).
  - `loadKeysFromBucket()` — Bulk scan: uses the index `scan()` iterator for sequential multi-key lookup within a bucket.
  - `loadKeysInternal()` — Loads keys from all buckets; supports historical snapshots.
  - `loopAllBuckets()` — Iterates all non-empty buckets (curr, snap) across levels, calling a function. Stops early on `Loop::COMPLETE`.
  - `getBucketEntry()` — Single-key lookup via the index: CACHE_HIT returns the cached entry, FILE_OFFSET reads from disk, NOT_FOUND skips.
- `SearchableLiveBucketListSnapshot` — Extends the base with live-specific queries:
  - `loadKeys()` — Bulk load with a timer.
  - `loadPoolShareTrustLinesByAccountAndAsset()` — Two-step query: index lookup for PoolIDs, then bulk trustline load.
  - `loadInflationWinners()` — Legacy inflation vote counting.
  - `scanForEviction()` — Background eviction scan: iterates a bucket region, collecting expired entries.
  - `scanForEntriesOfType()` — Iterates entries of a given `LedgerEntryType` using type range bounds.
- `SearchableHotArchiveBucketListSnapshot` — Hot archive queries: `loadKeys()`, `scanAllEntries()`.
- `BucketSnapshotManager` — Thread-safe boundary between main-thread BucketList mutations and read-only snapshots. Holds canonical snapshots behind a `SharedMutex`. Key methods:
  - `updateCurrentSnapshot()` — Called by the main thread after BucketList changes. Takes an exclusive lock and rotates historical snapshots.
  - `copySearchableLiveBucketListSnapshot()` / `copySearchableHotArchiveBucketListSnapshot()` — Creates a new `Searchable*Snapshot` with fresh stream caches pointing to the current snapshot data.
  - `maybeCopySearchableBucketListSnapshot()` — Refreshes a snapshot only if a newer one is available (shared lock).
  - `maybeCopyLiveAndHotArchiveSnapshots()` — Atomically refreshes both live and hot archive snapshots for consistency.
Index System
- `LiveBucketIndex` — Wraps either an `InMemoryIndex` (small buckets) or a `DiskIndex<LiveBucket>` (large buckets), selected based on config (`BUCKETLIST_DB_INDEX_CUTOFF`). Additionally owns an optional `RandomEvictionCache` for ACCOUNT entries. Version: `BUCKET_INDEX_VERSION = 6`. Key methods:
  - `lookup(LedgerKey)` — Returns an `IndexReturnT` (CACHE_HIT, FILE_OFFSET, or NOT_FOUND).
  - `scan(IterT, LedgerKey)` — Sequential scan for bulk loads.
  - `getPoolIDsByAsset()` — Returns PoolIDs for asset-based trustline queries.
  - `maybeInitializeCache()` — Lazily initializes the random eviction cache, sized proportionally to the bucket's share of total accounts.
  - `typeNotSupported()` — Returns true for the OFFER type (offers are loaded from SQL during catchup, not BucketListDB).
- `HotArchiveBucketIndex` — Always uses `DiskIndex<HotArchiveBucket>` (no in-memory index, no cache). Version: `BUCKET_INDEX_VERSION = 0`.
- `DiskIndex<BucketT>` — Persisted range-based index, serialized via cereal and loaded on startup if the version and page size match. Contains:
  - `RangeIndex` (`vector<pair<RangeEntry, streamoff>>`) — Maps key ranges to file offsets (page boundaries).
  - `BinaryFuseFilter16` — Bloom-filter-like structure for quick negative lookups.
  - `AssetPoolIDMap` — Asset→PoolID mapping (LiveBucket only).
  - `BucketEntryCounters` — Per-type entry counts and sizes.
  - `typeRanges` — Map of `LedgerEntryType → (startOffset, endOffset)` for type-specific scans.
- `InMemoryIndex` — For small buckets. Uses `InMemoryBucketState` (an `unordered_set<InternalInMemoryBucketEntry>`) to store all entries in memory. `InternalInMemoryBucketEntry` uses type erasure to allow lookup by `LedgerKey` in a set of `BucketEntry` (C++20 heterogeneous-lookup workaround).
- `IndexReturnT` — Variant return type from index queries: `IndexPtrT` (cache hit), `std::streamoff` (file offset), or `std::monostate` (not found).
- `BucketIndexUtils` — Free functions: `createIndex()` builds a new index from a bucket file; `loadIndex()` loads a persisted index from disk; `getPageSizeFromConfig()`.
Comparison and Ordering
- `LedgerEntryIdCmp` — Compares `LedgerEntry` or `LedgerKey` by identity (type, then type-specific key fields). Used for sorted sets and merge ordering.
- `BucketEntryIdCmp<BucketT>` — Compares `BucketEntry` or `HotArchiveBucketEntry` by their embedded ledger keys. METAENTRY sorts below all others. Handles cross-type comparisons (LIVEENTRY vs DEADENTRY, ARCHIVED vs LIVE).
Catchup Support
`BucketApplicator` — Applies a `LiveBucket` to the database during history catchup. Processes entries in scheduler-friendly batches (`LEDGER_ENTRY_BATCH_COMMIT_SIZE`). Only applies offers (seeks to the offer range using the type index). Tracks `seenKeys` to avoid applying shadowed entries. Handles pre-v11 entries that lack INITENTRY.
Eviction
- `EvictionResultEntry` — A single eviction candidate: the `LedgerEntry`, its `EvictionIterator` position, and `liveUntilLedger`.
- `EvictionResultCandidates` — Collection of eviction candidates from a background scan, with validity checks against archival settings.
- `EvictedStateVectors` — Final eviction output: `deletedKeys` (temp entries + TTLs) and `archivedEntries` (persistent entries).
- `EvictionStatistics` — Tracks eviction cycle metrics (entry age, cycle period).
- `EvictionMetrics` — Medida metrics for eviction (entries evicted, bytes scanned, blocking/background time).
Utility Types
- `MergeCounters` — Fine-grained counters for merge operations (entry types processed, shadow elisions, reattachments, annihilations). Not published via medida; used for internal tracking and testing.
- `BucketEntryCounters` — Per-`LedgerEntryTypeAndDurability` counts and sizes. Stored in indexes, aggregated across buckets.
- `LedgerEntryTypeAndDurability` — Finer-grained enum distinguishing TEMPORARY vs PERSISTENT CONTRACT_DATA.
Key Control Flows
Ledger Close (addBatch)
- `BucketManager::addLiveBatch()` → `LiveBucketList::addBatch()` → `addBatchInternal()`.
- Walk levels top-down (10→1): if `levelShouldSpill(ledger, i-1)`, then `levels[i-1].snap()` + `levels[i].commit()` + `levels[i].prepare()`.
- Level 0: `prepareFirstLevel()` — creates a fresh in-memory bucket, merges it with curr in memory via `LiveBucket::mergeInMemory()`, and commits immediately (synchronous, no background thread).
- Levels 1+: `prepare()` creates a `FutureBucket`, which launches a background merge task via `app.postOnBackgroundThread()`.
- `resolveAnyReadyFutures()` non-blockingly resolves any completed merges. `BucketSnapshotManager::updateCurrentSnapshot()` creates new immutable snapshot data.
Background Merge (FutureBucket::startMerge)
- Constructs a `MergeKey` from the inputs. Checks `BucketManager::getMergeFuture()` for reattachment.
- If no existing future, creates a `packaged_task` that calls `BucketT::merge()`.
- `merge()` opens `BucketInputIterator`s on the old and new buckets, creates a `BucketOutputIterator`, then calls `mergeInternal()`.
- `mergeInternal()` dispatches: if the entries have different keys, the lesser key is accepted via `maybePut()`; if the keys are equal, `mergeCasesWithEqualKeys()` handles INIT/DEAD annihilation, lifecycle promotion, and shadow checks.
- `BucketOutputIterator::getBucket()` finalizes → `BucketManager::adoptFileAsBucket()` → file rename into the bucket dir, index construction, merge tracking.
Point Lookup (BucketListDB)
- The caller obtains a `SearchableLiveBucketListSnapshot` from `BucketSnapshotManager`.
- `load(LedgerKey)` iterates all buckets via `loopAllBuckets()`.
- For each bucket, `getBucketEntry()` calls `index.lookup()` → returns CACHE_HIT (cached BucketEntry), FILE_OFFSET (seek + read page), or NOT_FOUND (skip bucket).
- The first non-null result wins (newer buckets shadow older ones).
Eviction Scan
- `BucketManager::startBackgroundEvictionScan()` posts a task to the eviction background thread.
- The task calls `SearchableLiveBucketListSnapshot::scanForEviction()`, which iterates through a region of the bucket file, collecting candidates whose TTL entries have expired.
- The main thread calls `resolveBackgroundEvictionScan()`: validates candidates against current ledger state, evicts up to `maxEntriesToArchive`, and updates the eviction iterator in the network config.
Threading Model
- Main thread: Owns `BucketManager`, `LiveBucketList`, `HotArchiveBucketList`. Calls `addBatch()`, `snapshotLedger()`, `forgetUnreferencedBuckets()`. Updates canonical snapshots in `BucketSnapshotManager`.
- Worker threads (via `app.postOnBackgroundThread()`): Run `FutureBucket` merges. Call `BucketT::merge()` and `adoptFileAsBucket()`. Acquire `mBucketMutex` for file operations and future maps.
- Eviction background thread: Runs `scanForEviction()` on a snapshot.
- Query threads (Soroban/overlay): Use `SearchableBucketListSnapshot` copies from `BucketSnapshotManager`. Each snapshot has its own file stream cache. Snapshot data is immutable and shared via `shared_ptr`.
- Synchronization:
  - `mBucketMutex` (RecursiveMutex): Guards the bucket file maps, future maps, and finished-merge map. Must be acquired AFTER `LedgerManagerImpl::mLedgerStateMutex`.
  - `mSnapshotMutex` (SharedMutex in `BucketSnapshotManager`): Exclusive for `updateCurrentSnapshot()`, shared for copying snapshots.
  - `mCacheMutex` (shared_mutex in `LiveBucketIndex`): Guards the random eviction cache.
Ownership Relationships
BucketManager
├── LiveBucketList (unique_ptr)
│ └── vector<BucketLevel<LiveBucket>>
│ ├── mCurr: shared_ptr<LiveBucket>
│ │ ├── mFilename, mHash, mSize
│ │ ├── mIndex: shared_ptr<LiveBucketIndex const>
│ │ │ ├── DiskIndex<LiveBucket> (or InMemoryIndex)
│ │ │ └── RandomEvictionCache (optional)
│ │ └── mEntries: unique_ptr<vector<BucketEntry>> (level 0 only)
│ ├── mSnap: shared_ptr<LiveBucket>
│ └── mNextCurr: variant<FutureBucket<LiveBucket>, shared_ptr<LiveBucket>>
│ └── FutureBucket holds shared_future + input/output bucket refs
├── HotArchiveBucketList (unique_ptr)
│ └── vector<BucketLevel<HotArchiveBucket>> (same structure)
├── BucketSnapshotManager (unique_ptr)
│ ├── mCurrLiveSnapshot: shared_ptr<BucketListSnapshotData<LiveBucket>>
│ ├── mCurrHotArchiveSnapshot: shared_ptr<BucketListSnapshotData<HotArchiveBucket>>
│ └── historical snapshot maps
├── mSharedLiveBuckets: map<Hash, shared_ptr<LiveBucket>>
├── mSharedHotArchiveBuckets: map<Hash, shared_ptr<HotArchiveBucket>>
├── mLiveBucketFutures: map<MergeKey, shared_future<shared_ptr<LiveBucket>>>
├── mHotArchiveBucketFutures: map<MergeKey, shared_future<shared_ptr<HotArchiveBucket>>>
├── mFinishedMerges: BucketMergeMap
├── TmpDirManager (unique_ptr)
└── Config (copy, thread-safe)
Key Data Flows
- Ledger close → BucketList: `LedgerManager` → `BucketManager::addLiveBatch(initEntries, liveEntries, deadEntries)` → `LiveBucketList::addBatch()` → spill cascade through levels → background merges.
- BucketList → Snapshot: After `addBatch()`, the main thread calls `BucketSnapshotManager::updateCurrentSnapshot()` → creates new immutable `BucketListSnapshotData` from current BucketList state.
- Snapshot → Query: Background threads call `BucketSnapshotManager::copySearchableLiveBucketListSnapshot()` → get a `SearchableLiveBucketListSnapshot` with fresh file streams → `load()`/`loadKeys()` for point and bulk queries.
- Eviction flow: Main thread → `startBackgroundEvictionScan()` → eviction thread scans a snapshot → `resolveBackgroundEvictionScan()` on the main thread applies evictions to `AbstractLedgerTxn` → evicted persistent entries flow to `addHotArchiveBatch()` → `HotArchiveBucketList`.
- Catchup flow: `HistoryManager` downloads bucket files → `BucketManager::assumeState(HistoryArchiveState)` → sets curr/snap on each level, restarts merges → `BucketApplicator` applies offers to the database.
- Merge reattachment: On restart, `FutureBucket::makeLive()` reconstitutes from hashes → `startMerge()` checks `BucketManager::getMergeFuture()` → if a finished merge exists in `BucketMergeMap`, synthesizes a pre-resolved future; otherwise relaunches the background merge.
> related_skills --same-repo
> validating-a-change
comprehensive validation of a change to ensure it is correct and ready for a pull request
> regenerating a technical summary of stellar-core
Instructions for regenerating the full set of subsystem and whole-system technical summary skill documents for stellar-core
> subsystem-summary-of-work
read this skill for a token-efficient summary of the work subsystem
> subsystem-summary-of-util
read this skill for a token-efficient summary of the util subsystem