EIP-4444: Bound Historical Data in Execution Clients

Prune historical data in clients older than one year


Metadata
Status: StagnantStandards Track: NetworkingCreated: 2021-11-02
Authors
George Kadianakis (@asn-d6), lightclient (@lightclient), Alex Stokes (@ralexstokes)

Abstract


Clients must stop serving historical headers, bodies, and receipts older than one year on the p2p layer. Clients may locally prune this historical data.

Motivation


Historical blocks and receipts currently occupy more than 400GB of disk space (and growing!). Therefore, to validate the chain, users must typically have a 1TB disk.

Historical data is not necessary for validating new blocks, so once a client has synced the tip of the chain, historical data is only retrieved when requested explicitly over the JSON-RPC or when a peer attempts to sync the chain. By pruning the history, this proposal reduces the disk requirements for users. Pruning history also allows clients to remove code that processes historical blocks. This means that execution clients don't need to maintain code paths that deal with each upgrade's compounding changes.

Finally, this change will result in less bandwidth usage on the network as clients adopt more lightweight sync strategies based on the PoS weak subjectivity assumption.

Specification


ParameterValueDescription
HISTORY_PRUNE_EPOCHS82125A year in beacon chain epochs

Clients SHOULD NOT serve headers, block bodies, and receipts that are older than HISTORY_PRUNE_EPOCHS epochs on the p2p network.

Clients MAY locally prune headers, block bodies, and receipts that is older than HISTORY_PRUNE_EPOCHS epochs.

Bootstrapping and syncing

This EIP impacts the way clients bootstrap and sync. Clients will not be able to full sync using devp2p since historical data will no longer be served.

Clients MUST use a valid Weak Subjectivity Checkpoint to bootstrap from a more recent view of the chain. For the purpose of syncing, clients treat weak subjectivity checkpoints as the genesis block. We call this method "checkpoint sync".

For the purposes of this proposal, we assume clients always start with a configured and valid weak subjectivity checkpoint. The way this is achieved is out-of-scope for this proposal.

Rationale


This proposal forces clients to stop serving old historical data over p2p. We make this explicit to force clients to seek historical data from other sources, instead of relying on the optional behavior of some clients which would result in quality degradation.

Why a year?

This proposal sets HISTORY_PRUNE_EPOCHS to 82125 epochs (one earth year). This constant is large enough to provide sufficient room for the Weak Subjectivity Period to grow, and it's also small enough so as to not occupy too much disk space.

Backwards Compatibility


Preserving historical data

This proposal impacts nodes that make use of historical data (e.g. web3 applications that display history of blocks, transactions or accounts). Preserving the history of Ethereum is fundamental and we believe there are various out-of-band ways to achieve this.

Historical data can be packaged and shared via torrent magnet links or over networks like IPFS. Furthermore, systems like the Portal Network or The Graph can be used to acquire historical data.

Clients should allow importing and exporting of historical data. Clients can provide scripts that fetch/verify data and automatically import them.

Full syncing from genesis

Full syncing will no longer be possible over the p2p network. However, we do want to allow interested parties to do so on their own.

We suggest that a specialized "full sync" client is built. The client is a shim that pieces together different releases of execution engines and can import historical blocks to validate the entire Ethereum chain from genesis and generate all other historical data.

It's important to also note that although archive nodes with "state sync" functionality are in development, full sync is currently the only reliable way to bootstrap them. Clients that wish to continue providing archive support would need the ability to import historical blocks retrieved out-of-band and retain support for each historical upgrade of the network.

User experience

This proposal impacts the UX for setting up applications that use historical data. Hence we suggest that clients introduce this change in two phases:

In the first phase, clients don't prune historical data by default. They introduce a command line option similar to geth's --txlookuplimit that users can use if they want to prune historical data.

In the second phase, history is pruned by default and the command line option is removed. Clients assume that users will find and import data in an out-of-band way.

JSON-RPC changes

After this proposal is implemented, certain JSON-RPC endpoints (e.g. like getBlockByHash) won't be able to tell whether a given hash is invalid or just too old. Other endpoints like getLogs will simply no longer have the data the user is requesting. The way this regression should be handled by applications or clients is out-of-scope for this proposal.

Security Considerations


Relying on weak subjectivity

With the move to PoS, it's essential for security to use valid weak subjectivity checkpoints because of long-range attacks.

This proposal relies on the weak subjectivity assumption and assumes that clients will not bootstrap with an invalid or old WS checkpoint.

This proposal also assumes that the weak subjectivity period will never be longer than HISTORY_PRUNE_EPOCHS. If that were to happen, clients with an old weak subjectivity checkpoint would never be able to "checkpoint sync" since the p2p network would not be able to provide the required data.

Centralization/censorship risk

There are censorship/availability risks if there is a lack of incentives to keep historical data.

It's important that Ethereum historical data are preserved and seeded by independent organizations, and their availability should be checked frequently. We consider these mechanisms as out-of-scope for this proposal.

Furthermore, there is a risk that more dapps will rely on centralized services for acquiring historical data because it will be harder to setup a full node.

Confusion with other proposals

Because there are a number of alternative proposals for reducing the execution client's footprint on disk, we've decided to enforce a specific pronouncination of the EIP. When pronouncing the EIP number, it MUST be pronounced EIP "four fours". All other pronounciations are incorrect.

Copyright


Copyright and related rights waived via CC0.