EIP-1057: ProgPoW, a Programmatic Proof-of-Work


Metadata
Status: Stagnant
Standards Track: Core
Created: 2018-05-02
Authors
Greg Colvin (greg@colvin.org), Andrea Lanfranchi (@AndreaLanfranchi), Michael Carter (@bitsbetrippin), IfDefElse (ifdefelse@protonmail.com)

Simple Summary


A new Proof-of-Work algorithm to replace Ethash that utilizes almost all parts of commodity GPUs.

Abstract


ProgPoW is a proof-of-work algorithm designed to close the efficiency gap available to specialized ASICs. It utilizes almost all parts of commodity hardware (GPUs), and comes pre-tuned for the most common hardware utilized in the Ethereum network.

This document presents an overview of the algorithm and examines what it means to be “ASIC-resistant.” Next, we compare existing PoW designs by analyzing how each algorithm executes in hardware. Finally, we present the detailed implementation by walking through the code.

Motivation


Ever since the first bitcoin mining ASIC was released, many new Proof of Work algorithms have been created with the intention of being “ASIC-resistant”. The goal of “ASIC-resistance” is to resist the centralization of PoW mining power such that these coins couldn’t be so easily manipulated by a few players.

Ethereum's approach is to incentivize a geographically-distributed community of miners with a low barrier to entry on commodity hardware. As stated in the Yellow Paper:

11.5. Mining Proof-of-Work. The mining proof-of-work (PoW) exists as a cryptographically secure nonce that proves beyond reasonable doubt that a particular amount of computation has been expended in the determination of some token value n. It is utilised to enforce the blockchain security by giving meaning and credence to the notion of difficulty (and, by extension, total difficulty). However, since mining new blocks comes with an attached reward, the proof-of-work not only functions as a method of securing confidence that the blockchain will remain canonical into the future, but also as a wealth distribution mechanism.

For both reasons, there are two important goals of the proof-of-work function; firstly, it should be as accessible as possible to as many people as possible. The requirement of, or reward from, specialised and uncommon hardware should be minimised. This makes the distribution model as open as possible, and, ideally, makes the act of mining a simple swap from electricity to Ether at roughly the same rate for anyone around the world.

Secondly, it should not be possible to make super-linear profits, and especially not so with a high initial barrier. Such a mechanism allows a well-funded adversary to gain a troublesome amount of the network’s total mining power and as such gives them a super-linear reward (thus skewing distribution in their favour) as well as reducing the network security...

... While ASICs exist for a proof-of-work function, both goals are placed in jeopardy. Because of this, a proof-of-work function that is ASIC-resistant (i.e. difficult or economically inefficient to implement in specialised compute hardware) has been identified as the proverbial silver bullet.

It is from these premises that Ethash was designed as an ASIC-resistant proof-of-work:

Two directions exist for ASIC resistance; firstly make it sequential memory-hard, i.e. engineer the function such that the determination of the nonce requires a lot of memory and bandwidth such that the memory cannot be used in parallel to discover multiple nonces simultaneously. The second is to make the type of computation it would need to do general-purpose; the meaning of “specialised hardware” for a general-purpose task set is, naturally, general purpose hardware and as such commodity desktop computers are likely to be pretty close to “specialised hardware” for the task. For Ethereum 1.0 we have chosen the first path.

5 years of experience with the Ethereum blockchain have demonstrated the success of our approach. This success cannot be taken for granted.

  • 11 years of experience with PoW Blockchains have shown a centralization in hardware development, resulting in a few companies controlling the lifecycle of new hardware with limited distribution.
  • New ASICs for Ethash are providing higher efficiency than GPUs, such as the Antminer E3.
  • As much as 40% of the Ethereum network may now be secured by ASICs.

ProgPoW restores Ethash's ASIC-resistance by extending Ethash with a GPU-specific approach to the second path: making the “specialised hardware” for the PoW task commodity hardware.

ProgPoW Overview

The design goal of ProgPoW is to have the algorithm’s requirements match what is available on commodity GPUs: If the algorithm were to be implemented on a custom ASIC there should be little opportunity for efficiency gains compared to a commodity GPU.

The main elements of the algorithm are:

  • Changes keccak_f1600 (with 64-bit words) to keccak_f800 (with 32-bit words) to reduce impact on total power
  • Increases mix state.
  • Adds a random sequence of math in the main loop.
  • Adds reads from a small, low-latency cache that supports random addresses.
  • Increases the DRAM read from 128 bytes to 256 bytes.

The random sequence changes every PROGPOW_PERIOD (about 2 to 12 minutes depending on the configured value). When mining, source code is generated for the random sequence and compiled on the host CPU. The GPU then executes the compiled code, in which the math to perform and the mix state to use are already resolved.

While a custom ASIC to implement this algorithm is still possible, the efficiency gains available are minimal. The majority of a commodity GPU is required to support the above elements. The only optimizations available are:

  • Remove the graphics pipeline (displays, geometry engines, texturing, etc)
  • Remove floating point math
  • A few ISA tweaks, like instructions that exactly match the merge() function

These would result in minimal, roughly 1.1-1.2x, efficiency gains. This is much less than the 2x for Ethash or 50x for Cryptonight.

Rationale for PoW on Commodity Hardware

With the growth of large mining pools, the control of hashing power has been delegated to the top few pools to provide a steadier economic return for small miners. While some have made the argument that large centralized pools defeat the purpose of “ASIC resistance,” it’s important to note that ASIC based coins are even more centralized for several reasons.

  1. No natural distribution: There isn’t an economic purpose for ultra-specialized hardware outside of mining and thus no reason for most people to have it.
  2. No reserve group: Thus, there’s no reserve pool of hardware or reserve pool of interested parties to jump in when coin price is volatile and attractive for manipulation.
  3. High barrier to entry: Initial miners are those rich enough to invest capital and ecological resources on the unknown experiment a new coin may be. Thus, initial coin distribution through mining will be very limited causing centralized economic bias.
  4. Delegated centralization vs implementation centralization: While pool centralization is delegated, hardware monoculture is not: only the limited buyers of this hardware can participate, so there isn’t even the possibility of divesting control on short notice.
  5. No obvious decentralization of control even with decentralized mining: Once large custom ASIC makers get into the game, designing back-doored hardware is trivial. ASIC makers have no incentive to be transparent or fair in market participation.

While the goal of “ASIC resistance” is valuable, the entire concept of “ASIC resistance” is a bit of a fallacy. CPUs and GPUs are themselves ASICs. Any algorithm that can run on a commodity ASIC (a CPU or GPU) by definition can have a customized ASIC created for it with slightly less functionality. Some algorithms are intentionally made to be “ASIC friendly” - where an ASIC implementation is drastically more efficient than the same algorithm running on general purpose hardware. The protection that this offers when the coin is unknown also makes it an attractive target for a dedicated mining ASIC company as soon as it becomes useful.

Therefore, ASIC resistance is: the efficiency difference of specialized hardware versus hardware that has a wider adoption and applicability. A smaller efficiency difference between custom and general hardware means higher resistance and a better algorithm. This efficiency difference is the proper metric to use when comparing the quality of PoW algorithms. Efficiency could mean absolute performance, performance per watt, or performance per dollar - they are all highly correlated. If a single entity creates and controls an ASIC that is drastically more efficient, they can gain 51% of the network hashrate and possibly stage an attack.

Review of Existing PoW Algorithms

SHA256

  • Potential ASIC efficiency gain ~ 1000X

The SHA algorithm is a sequence of simple math operations - additions, logical ops, and rotates.

To process a single op on a CPU or GPU requires fetching and decoding an instruction, reading data from a register file, executing the instruction, and then writing the result back to a register file. This takes significant time and power.

A single op implemented in an ASIC takes a handful of transistors and wires. This means every individual op takes negligible power, area, or time. A hashing core is built by laying out the sequence of required ops.

The hashing core can execute the required sequence of ops in much less time, and using less power or area, than doing the same sequence on a CPU or GPU. A bitcoin ASIC consists of a number of identical hashing cores and some minimal off-chip communication.

Scrypt and NeoScrypt

  • Potential ASIC efficiency gain ~ 1000X

Scrypt and NeoScrypt are similar to SHA in the arithmetic and bitwise operations used. Unfortunately, popular coins such as Litecoin only use a scratchpad size between 32 KB and 128 KB for their PoW mining algorithm. This scratchpad is small enough to trivially fit on an ASIC next to the math core. The implementation of the math core would be very similar to SHA, with similar efficiency gains.

X11 and X16R

  • Potential ASIC efficiency gain ~ 1000X

X11 (and similar X##) require an ASIC that has 11 unique hashing cores pipelined in a fixed sequence. Each individual hashing core would have similar efficiency to an individual SHA core, so the overall design will have the same efficiency gains.

X16R requires the multiple hashing cores to interact through a simple sequencing state machine. Each individual core will have similar efficiency gains and the sequencing logic will take minimal power, area, or time.

The Baikal BK-X is an existing ASIC with multiple hashing cores and a programmable sequencer. It has been upgraded to enable new algorithms that sequence the hashes in different orders.

Equihash

  • Potential ASIC efficiency gain ~ 100X

The ~150 MB of state is large but possible on an ASIC. The binning, sorting, and comparing of bit strings could be implemented on an ASIC at extremely high speed.

Cuckoo Cycle

  • Potential ASIC efficiency gain ~ 100X

The amount of state required on-chip is not clear as there are Time/Memory Tradeoff attacks. A specialized graph traversal core would have similar efficiency gains to a SHA compute core.

CryptoNight

  • Potential ASIC efficiency gain ~ 50X

Compared to Scrypt, CryptoNight does much less compute and requires a full 2 MB of scratchpad (there is no known Time/Memory Tradeoff attack). The large scratchpad will dominate the ASIC implementation and limit the number of hashing cores, limiting the absolute performance of the ASIC. An ASIC will consist almost entirely of just on-die SRAM.

Ethash

  • Potential ASIC efficiency gain ~ 2X

Ethash requires external memory due to the large size of the DAG. However that is all that it requires - there is minimal compute that is done on the result loaded from memory. As a result a custom ASIC could remove most of the complexity, and power, of a GPU and be just a memory interface connected to a small compute engine.

Specification


ProgPoW can be tuned using the following parameters. The proposed settings have been tuned for a range of existing, commodity GPUs:

  • PROGPOW_PERIOD: Number of blocks before changing the random program
  • PROGPOW_LANES: The number of parallel lanes that coordinate to calculate a single hash instance
  • PROGPOW_REGS: The register file usage size
  • PROGPOW_DAG_LOADS: Number of uint32 loads from the DAG per lane
  • PROGPOW_CACHE_BYTES: The size of the cache
  • PROGPOW_CNT_DAG: The number of DAG accesses, defined as the outer loop of the algorithm (64 is the same as Ethash)
  • PROGPOW_CNT_CACHE: The number of cache accesses per loop
  • PROGPOW_CNT_MATH: The number of math operations per loop

The values of these parameters have been tweaked between the original version and the version proposed here for Ethereum adoption. See this medium post for details.

Parameter                0.9.2      0.9.3
PROGPOW_PERIOD           50         10
PROGPOW_LANES            16         16
PROGPOW_REGS             32         32
PROGPOW_DAG_LOADS        4          4
PROGPOW_CACHE_BYTES      16x1024    16x1024
PROGPOW_CNT_DAG          64         64
PROGPOW_CNT_CACHE        12         11
PROGPOW_CNT_MATH         20         18

DAG Parameter            0.9.2      0.9.3
ETHASH_DATASET_PARENTS   256        256

The random program changes every PROGPOW_PERIOD blocks (default 10, roughly 2 minutes) to ensure the hardware executing the algorithm is fully programmable. If the program only changed every DAG epoch (roughly 5 days) certain miners could have time to develop hand-optimized versions of the random sequence, giving them an undue advantage.

Sample code is written in C++; this should be kept in mind when evaluating the code in the specification. All numerics are computed using unsigned 32-bit integers. Any overflows are trimmed off before proceeding to the next computation. Languages that use numerics not fixed to bit lengths (such as Python and JavaScript) or that only use signed integers (such as Java) will need to keep their languages' quirks in mind. The extensive use of 32-bit data values aligns with modern GPUs' internal data architectures.

ProgPoW uses a 32-bit variant of FNV1a for merging data. The existing Ethash uses a similar variant of FNV1 for merging, but FNV1a provides better distribution properties.
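The merge step described above can be sketched as follows. This is a minimal sketch, assuming the standard 32-bit FNV prime and offset basis; the constant names are illustrative, and the test vectors file remains the authority on exact values.

```cpp
#include <cstdint>

// Standard 32-bit FNV constants (assumed here; names are illustrative).
constexpr uint32_t FNV_PRIME = 0x01000193;
constexpr uint32_t FNV_OFFSET_BASIS = 0x811c9dc5;

// FNV1a-style merge on whole 32-bit words: XOR the new data in first,
// then multiply by the prime. Unsigned overflow wraps modulo 2^32.
uint32_t fnv1a(uint32_t h, uint32_t d)
{
    return (h ^ d) * FNV_PRIME;
}
```

For contrast, the FNV1-style merge in Ethash multiplies first and XORs second, i.e. (h * FNV_PRIME) ^ d.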

Test vectors can be found in the test vectors file.


ProgPoW uses KISS99 for random number generation. This is the simplest (fewest instructions) random generator that passes the TestU01 statistical test suite. A more complex random number generator like Mersenne Twister can be efficiently implemented on a specialized ASIC, providing an opportunity for efficiency gains.
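A minimal sketch of KISS99 as published by Marsaglia, combining two multiply-with-carry generators, a 3-shift register, and a linear congruential generator. The struct name kiss99_t and its field names are assumptions made for illustration.

```cpp
#include <cstdint>

// Generator state: two multiply-with-carry words, a shift register, and an LCG.
struct kiss99_t
{
    uint32_t z, w, jsr, jcong;
};

// Advances the state in place and returns the next 32-bit value.
uint32_t kiss99(kiss99_t &st)
{
    st.z = 36969 * (st.z & 65535) + (st.z >> 16);  // multiply-with-carry, high half
    st.w = 18000 * (st.w & 65535) + (st.w >> 16);  // multiply-with-carry, low half
    uint32_t mwc = (st.z << 16) + st.w;
    st.jsr ^= (st.jsr << 17);                      // 3-shift register
    st.jsr ^= (st.jsr >> 13);
    st.jsr ^= (st.jsr << 5);
    st.jcong = 69069 * st.jcong + 1234567;         // linear congruential step
    return (mwc ^ st.jcong) + st.jsr;
}
```

Each call needs only a handful of integer multiplies, shifts, and XORs, which is what keeps the generator cheap to evaluate both on the host CPU and in generated GPU code.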

Test vectors can be found in the test vectors file.


The fill_mix function populates an array of int32 values used by each lane in the hash calculations.
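One way fill_mix could look is sketched below: FNV1a expands the 64-bit per-hash seed into a per-lane KISS99 state, and KISS99 then fills the lane's registers. The order in which the seed halves and lane id are folded in is an assumption and should be checked against the reference implementation and test vectors.

```cpp
#include <cstdint>

constexpr uint32_t PROGPOW_REGS = 32;
constexpr uint32_t FNV_PRIME = 0x01000193;
constexpr uint32_t FNV_OFFSET_BASIS = 0x811c9dc5;

struct kiss99_t { uint32_t z, w, jsr, jcong; };

uint32_t fnv1a(uint32_t h, uint32_t d) { return (h ^ d) * FNV_PRIME; }

uint32_t kiss99(kiss99_t &st)
{
    st.z = 36969 * (st.z & 65535) + (st.z >> 16);
    st.w = 18000 * (st.w & 65535) + (st.w >> 16);
    uint32_t mwc = (st.z << 16) + st.w;
    st.jsr ^= (st.jsr << 17);
    st.jsr ^= (st.jsr >> 13);
    st.jsr ^= (st.jsr << 5);
    st.jcong = 69069 * st.jcong + 1234567;
    return (mwc ^ st.jcong) + st.jsr;
}

// Fills one lane's registers from the per-hash seed and the lane id.
void fill_mix(uint64_t seed, uint32_t lane_id, uint32_t mix[PROGPOW_REGS])
{
    // FNV1a expands the shared seed into a per-lane generator state
    // (assumed order: low seed word, high seed word, then lane id twice).
    kiss99_t st;
    st.z = fnv1a(FNV_OFFSET_BASIS, static_cast<uint32_t>(seed));
    st.w = fnv1a(st.z, static_cast<uint32_t>(seed >> 32));
    st.jsr = fnv1a(st.w, lane_id);
    st.jcong = fnv1a(st.jsr, lane_id);
    // KISS99 expands the per-lane state into the full register file.
    for (uint32_t i = 0; i < PROGPOW_REGS; i++)
        mix[i] = kiss99(st);
}
```

Because the lane id enters the generator state, every lane starts from different register contents even though all lanes share the same seed.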

Test vectors can be found in the test vectors file.


Like Ethash, Keccak is used to seed the sequence per-nonce and to produce the final result. The keccak-f800 variant is used as the 32-bit word size matches the native word size of modern GPUs. The implementation is a variant of SHAKE with width=800, bitrate=576, capacity=224, output=256, and no padding. The result of keccak is treated as a 256-bit big-endian number - that is, result byte 0 is the MSB of the value.

As with Ethash, the input and output of the keccak function are fixed and relatively small. This means only a single "absorb" and "squeeze" phase are required. For a pseudo-code implementation of the keccak_f800_round function see the Round[b](A,RC) function in the "Pseudo-code description of the permutations" section of the official Keccak specs.
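The single absorb and squeeze can be sketched as below. This is a sketch under stated assumptions: the round function is the standard Keccak-f permutation narrowed to 32-bit lanes (22 rounds, rotation offsets reduced modulo 32, round constants truncated to 32 bits), hash32_t is an illustrative 8-word container, and the caller is assumed to have already placed the input words in the 25-word state.

```cpp
#include <cstdint>

struct hash32_t { uint32_t uint32s[8]; };

// Rotate left; only used here with n in 1..31.
static inline uint32_t rotl32(uint32_t x, unsigned n) { return (x << n) | (x >> (32 - n)); }

// One round of Keccak-f[800] over 25 lanes of 32 bits.
void keccak_f800_round(uint32_t st[25], int r)
{
    static const uint32_t RC[22] = {
        0x00000001, 0x00008082, 0x0000808A, 0x80008000, 0x0000808B, 0x80000001,
        0x80008081, 0x00008009, 0x0000008A, 0x00000088, 0x80008009, 0x8000000A,
        0x8000808B, 0x0000008B, 0x00008089, 0x00008003, 0x00008002, 0x00000080,
        0x0000800A, 0x8000000A, 0x80008081, 0x00008080};
    static const int ROT[24] = {1, 3, 6, 10, 15, 21, 28, 4, 13, 23, 2, 14,
                                27, 9, 24, 8, 25, 11, 30, 18, 7, 29, 20, 12};
    static const int PI[24] = {10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4,
                               15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1};
    uint32_t bc[5];

    // Theta: mix column parities into every lane.
    for (int i = 0; i < 5; i++)
        bc[i] = st[i] ^ st[i + 5] ^ st[i + 10] ^ st[i + 15] ^ st[i + 20];
    for (int i = 0; i < 5; i++) {
        uint32_t t = bc[(i + 4) % 5] ^ rotl32(bc[(i + 1) % 5], 1);
        for (int j = 0; j < 25; j += 5) st[j + i] ^= t;
    }
    // Rho and Pi: rotate each lane and move it to its new position.
    uint32_t t = st[1];
    for (int i = 0; i < 24; i++) {
        int j = PI[i];
        uint32_t tmp = st[j];
        st[j] = rotl32(t, ROT[i]);
        t = tmp;
    }
    // Chi: non-linear step along each row.
    for (int j = 0; j < 25; j += 5) {
        for (int i = 0; i < 5; i++) bc[i] = st[j + i];
        for (int i = 0; i < 5; i++) st[j + i] ^= (~bc[(i + 1) % 5]) & bc[(i + 2) % 5];
    }
    // Iota: break symmetry with the round constant.
    st[0] ^= RC[r];
}

// Single permutation over a pre-filled state, then squeeze 8 words (256 bits).
hash32_t keccak_f800_progpow(uint32_t state[25])
{
    for (int r = 0; r < 22; r++)
        keccak_f800_round(state, r);
    hash32_t ret;
    for (int i = 0; i < 8; i++)
        ret.uint32s[i] = state[i];
    return ret;
}
```

How the header, nonce or seed, digest, and any padding words are laid out in the state before the call is deliberately left to the caller in this sketch, since the text describes more than one layout.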


The inner loop uses FNV and KISS99 to generate a random sequence from the prog_seed. This random sequence determines which mix state is accessed and what random math is performed.

Since the prog_seed changes only once per PROGPOW_PERIOD (10 blocks or about 2 minutes), it is expected that, while mining, progPowLoop will be evaluated on the CPU to generate source code for that period's sequence. The source code will be compiled on the CPU before running on the GPU. You can see an example sequence and generated source code in kernel.cu.
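The per-period setup can be sketched as follows. This is a minimal sketch, assuming the program generator is seeded from prog_seed via FNV1a and that the register access order is produced with a Fisher-Yates shuffle; the names progPowInit, mix_seq_dst, mix_seq_src, and prog_seed_for_block are illustrative and should be checked against the reference implementation.

```cpp
#include <cstdint>
#include <utility>

constexpr int PROGPOW_PERIOD = 10;
constexpr int PROGPOW_REGS = 32;
constexpr uint32_t FNV_PRIME = 0x01000193;
constexpr uint32_t FNV_OFFSET_BASIS = 0x811c9dc5;

struct kiss99_t { uint32_t z, w, jsr, jcong; };

uint32_t fnv1a(uint32_t h, uint32_t d) { return (h ^ d) * FNV_PRIME; }

uint32_t kiss99(kiss99_t &st)
{
    st.z = 36969 * (st.z & 65535) + (st.z >> 16);
    st.w = 18000 * (st.w & 65535) + (st.w >> 16);
    uint32_t mwc = (st.z << 16) + st.w;
    st.jsr ^= (st.jsr << 17);
    st.jsr ^= (st.jsr >> 13);
    st.jsr ^= (st.jsr << 5);
    st.jcong = 69069 * st.jcong + 1234567;
    return (mwc ^ st.jcong) + st.jsr;
}

// Hypothetical helper: the program seed is the block number divided by the period.
uint64_t prog_seed_for_block(uint64_t block_number) { return block_number / PROGPOW_PERIOD; }

// Seeds the program generator and shuffles the register access order.
kiss99_t progPowInit(uint64_t prog_seed, int mix_seq_dst[PROGPOW_REGS], int mix_seq_src[PROGPOW_REGS])
{
    kiss99_t prog_rnd;
    prog_rnd.z = fnv1a(FNV_OFFSET_BASIS, static_cast<uint32_t>(prog_seed));
    prog_rnd.w = fnv1a(prog_rnd.z, static_cast<uint32_t>(prog_seed >> 32));
    prog_rnd.jsr = fnv1a(prog_rnd.w, static_cast<uint32_t>(prog_seed));
    prog_rnd.jcong = fnv1a(prog_rnd.jsr, static_cast<uint32_t>(prog_seed >> 32));
    // Start from the identity order, then shuffle (Fisher-Yates) so every
    // register is used as a destination once before any is reused.
    for (int i = 0; i < PROGPOW_REGS; i++) {
        mix_seq_dst[i] = i;
        mix_seq_src[i] = i;
    }
    for (int i = PROGPOW_REGS - 1; i > 0; i--) {
        int j = kiss99(prog_rnd) % (i + 1);
        std::swap(mix_seq_dst[i], mix_seq_dst[j]);
        j = kiss99(prog_rnd) % (i + 1);
        std::swap(mix_seq_src[i], mix_seq_src[j]);
    }
    return prog_rnd;
}
```

Everything here depends only on prog_seed, never on the nonce, which is why a miner can evaluate it once per period on the CPU and bake the resulting sequence into the compiled kernel.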

Test vectors can be found in the test vectors file.


The math operations that merge values into the mix data are chosen to maintain entropy.
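A sketch of such a merge function follows. The four operations, the multiplier 33, and the way the rotation amount is derived from the selector are assumptions to be checked against the test vectors file. The idea is that, if a already has high entropy, every branch keeps it even when b has little (which is why an operation like a & b is not a candidate).

```cpp
#include <cstdint>

// Rotates that are well defined for any n, including multiples of 32.
static inline uint32_t ROTL32(uint32_t x, uint32_t n) { return (x << (n & 31)) | (x >> ((32 - n) & 31)); }
static inline uint32_t ROTR32(uint32_t x, uint32_t n) { return (x >> (n & 31)) | (x << ((32 - n) & 31)); }

// Merges new data b into the existing value a; r selects the operation.
uint32_t merge(uint32_t a, uint32_t b, uint32_t r)
{
    switch (r % 4)
    {
    case 0: return (a * 33) + b;
    case 1: return (a ^ b) * 33;
    // The rotation amount is 1..31, so a rotate by 0 (a no-op) never occurs.
    case 2: return ROTL32(a, ((r >> 16) % 31) + 1) ^ b;
    default: return ROTR32(a, ((r >> 16) % 31) + 1) ^ b;
    }
}
```

Multiplying by an odd constant and rotating are both invertible on 32-bit words, so none of the branches can collapse distinct values of a onto the same result for a fixed b.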

Test vectors can be found in the test vectors file.


The math operations chosen for the random math are ones that are easy to implement in CUDA and OpenCL, the two main programming languages for commodity GPUs. The mul_hi, min, clz, and popcount functions match the corresponding OpenCL functions. ROTL32 matches the OpenCL rotate function. ROTR32 is rotate right, which is equivalent to rotate(i, 32-v).
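The random math step might be sketched like this. The set of operations follows the functions named above; the number of cases, their order, and the masking of rotation amounts to 5 bits are assumptions. The helpers mul_hi, clz, and popcount are written out portably here to mirror the OpenCL built-ins of the same names (clz of 0 returns 32, as in OpenCL).

```cpp
#include <algorithm>
#include <cstdint>

static inline uint32_t ROTL32(uint32_t x, uint32_t n) { return (x << (n & 31)) | (x >> ((32 - n) & 31)); }
static inline uint32_t ROTR32(uint32_t x, uint32_t n) { return (x >> (n & 31)) | (x << ((32 - n) & 31)); }

// High 32 bits of the 64-bit product.
static inline uint32_t mul_hi(uint32_t a, uint32_t b)
{
    return static_cast<uint32_t>((static_cast<uint64_t>(a) * b) >> 32);
}

// Count of leading zero bits; returns 32 for x == 0.
static inline uint32_t clz(uint32_t x)
{
    uint32_t n = 0;
    for (uint32_t bit = 0x80000000u; bit != 0 && !(x & bit); bit >>= 1) n++;
    return n;
}

// Count of set bits.
static inline uint32_t popcount(uint32_t x)
{
    uint32_t n = 0;
    for (; x != 0; x &= x - 1) n++;
    return n;
}

// Random math between two input values; r selects the operation.
uint32_t math(uint32_t a, uint32_t b, uint32_t r)
{
    switch (r % 11)
    {
    case 0: return a + b;
    case 1: return a * b;
    case 2: return mul_hi(a, b);
    case 3: return std::min(a, b);
    case 4: return ROTL32(a, b & 31);
    case 5: return ROTR32(a, b & 31);
    case 6: return a & b;
    case 7: return a | b;
    case 8: return a ^ b;
    case 9: return clz(a) + clz(b);
    default: return popcount(a) + popcount(b);
    }
}
```

Unlike the merge operations, several of these (min, and, or, clz, popcount) lose information, which is acceptable because their result is only ever folded into the mix state through a merge.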

Test vectors can be found in the test vectors file.


The flow of the inner loop is:

  • Lane (loop % LANES) is chosen as the leader for that loop iteration
  • The leader's mix[0] value modulo the number of 256-byte DAG entries is used to select where to read from the full DAG
  • Each lane reads DAG_LOADS sequential words, using (lane ^ loop) % LANES as the starting offset within the entry.
  • The random sequence of math and cache accesses is performed
  • The DAG data read at the start of the loop is merged at the end of the loop

prog_seed and loop come from the outer loop, corresponding to the current program seed (which is block_number/PROGPOW_PERIOD) and the loop iteration number. mix is the state array, initially filled by fill_mix. dag is the bytes of the Ethash DAG grouped into 32 bit unsigned ints in little-endian format. On little-endian architectures this is just a normal int32 pointer to the existing DAG.

DAG_BYTES is set to the number of bytes in the current DAG, which is generated identically to the existing Ethash algorithm.
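The first three steps of the flow above (choosing the leader lane, selecting the 256-byte DAG entry, and assigning each lane its words) can be sketched as below. The function name load_dag_entry and its parameter list are hypothetical; only the addressing arithmetic is taken from the description.

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint32_t PROGPOW_LANES = 16;
constexpr uint32_t PROGPOW_REGS = 32;
constexpr uint32_t PROGPOW_DAG_LOADS = 4;

// Loads the 256-byte DAG entry for one loop iteration, split across lanes.
void load_dag_entry(uint32_t loop,
                    uint32_t mix[PROGPOW_LANES][PROGPOW_REGS],
                    const uint32_t *dag,
                    uint64_t dag_bytes,
                    uint32_t dag_entry[PROGPOW_LANES][PROGPOW_DAG_LOADS])
{
    // One entry is LANES * DAG_LOADS words: 16 * 4 * 4 bytes = 256 bytes.
    const uint64_t entry_bytes = PROGPOW_LANES * PROGPOW_DAG_LOADS * sizeof(uint32_t);
    const uint32_t num_entries = static_cast<uint32_t>(dag_bytes / entry_bytes);

    // The leader lane rotates every iteration; its mix[0] picks the entry.
    const uint32_t dag_addr_base = mix[loop % PROGPOW_LANES][0] % num_entries;

    for (uint32_t l = 0; l < PROGPOW_LANES; l++) {
        // XOR with the loop counter shuffles which slice of the entry each
        // lane reads, while still covering the whole entry exactly once.
        size_t dag_addr_lane =
            static_cast<size_t>(dag_addr_base) * PROGPOW_LANES + (l ^ loop) % PROGPOW_LANES;
        for (uint32_t i = 0; i < PROGPOW_DAG_LOADS; i++)
            dag_entry[l][i] = dag[dag_addr_lane * PROGPOW_DAG_LOADS + i];
    }
}
```

Since mix[0] of the leader lane depends on DAG data merged in the previous iteration, the address of each read cannot be known before the previous read completes.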

Test vectors can be found in the test vectors file.


The flow of the overall algorithm is:

  • A keccak hash of the header + nonce to create a digest of 256 bits from keccak_f800 (padding is consistent with custom one in ethash)
  • Use first two words of digest as seed to generate initial mix data
  • Loop multiple times, each time hashing random loads and random math into the mix data
  • Hash all the mix data into a single 256-bit value
  • A final keccak hash using carry-over digest from initial data + mix_data final 256 bit value (padding is consistent with custom one in ethash)
  • When mining this final value is compared against a hash32_t target
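The step that hashes all the mix data into a single 256-bit value can be sketched as a two-level FNV1a reduction: each lane folds its registers into one word, and the lane words are then folded into eight output words. The function name reduce_mix and the hash32_t layout are assumptions made for this sketch.

```cpp
#include <cstdint>

constexpr uint32_t PROGPOW_LANES = 16;
constexpr uint32_t PROGPOW_REGS = 32;
constexpr uint32_t FNV_PRIME = 0x01000193;
constexpr uint32_t FNV_OFFSET_BASIS = 0x811c9dc5;

struct hash32_t { uint32_t uint32s[8]; };

uint32_t fnv1a(uint32_t h, uint32_t d) { return (h ^ d) * FNV_PRIME; }

// Reduces the full mix state (LANES x REGS words) to a 256-bit digest.
hash32_t reduce_mix(uint32_t mix[PROGPOW_LANES][PROGPOW_REGS])
{
    // Level 1: each lane folds its registers into a single 32-bit word.
    uint32_t digest_lane[PROGPOW_LANES];
    for (uint32_t l = 0; l < PROGPOW_LANES; l++) {
        digest_lane[l] = FNV_OFFSET_BASIS;
        for (uint32_t i = 0; i < PROGPOW_REGS; i++)
            digest_lane[l] = fnv1a(digest_lane[l], mix[l][i]);
    }
    // Level 2: lanes l and l + 8 are folded into output word l % 8.
    hash32_t digest;
    for (uint32_t i = 0; i < 8; i++)
        digest.uint32s[i] = FNV_OFFSET_BASIS;
    for (uint32_t l = 0; l < PROGPOW_LANES; l++)
        digest.uint32s[l % 8] = fnv1a(digest.uint32s[l % 8], digest_lane[l]);
    return digest;
}
```

The resulting digest is what feeds the final keccak pass, and every register of every lane influences it.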

Security Considerations


This proposal has been software and hardware audited:

Least Authority, in their findings, suggest a change to DAG generation (modifying ETHASH_DATASET_PARENTS from 256 to 512) in order to mitigate vulnerability to a "Light Evaluation" attack. With this change, the DAG memory file used by ProgPoW would no longer be compatible with the one used by Ethash (epoch length and size increase ratio remain the same though).

We do not recommend implementing this fix at this time. Ethash will not be exploitable for years, and it's not clear ProgPoW will ever be exploitable. It's better to deploy the audited code.

After the completion of the audits, a clever finding by Kik disclosed a vulnerability that bypasses ProgPoW's memory hardness. The vulnerability is present in Ethash as well but is near-impossible to exploit. In ProgPoW it is not possible to exploit: the attack assumes the ability to create variants of the candidate block's header hash in a fashion similar to bitcoin, which is not possible in Ethereum. An attacker would need modified block headers, would need customized nodes able to accept the modified block headers, and would need to use extraNonce/extraData as entropy, which isn’t the standard. The required brute-force search would also be difficult to accomplish in one block time. Even if supported by a customized node, the propagation of such mined blocks would be immediately blocked by other peers, as the header hash is invalid.

The authors have since found another vulnerability similar to Kik's, but it adds too much overhead to be ASIC-friendly. See Lanfranchi's full explanation here. To completely prevent such exploits we could change the condition modifying the input state of the last keccak pass from

  • header (256 bits) +
  • seed for mix initiator (64 bits) +
  • mix from main loop (256 bits)
  • no padding

to

  • digest from initial keccak (256 bits) +
  • mix from main loop (256 bits) +
  • padding

thus widening the constraint targeted by brute-force or linear keccak searches from 64 to 256 bits.

This fix is available as a PR to the reference implementation. Again, we do not recommend implementing this fix at this time. Kik's vulnerability and others like it cannot be exploited now and likely never will be. It's better to deploy the audited code.

Note that these vulnerabilities cannot be exploited to deny service, double spend, or otherwise damage the network. They could at worst give their deployer an efficiency advantage over other miners.

Test Cases


The random sequence generated for block 30,000 (prog_seed 3,000) can be seen in kernel.cu.

The algorithm run on block 30,000 produces the following digest and result:


Additional test vectors can be found in the test vectors file.

Machine-readable test vectors (T.B.D)

Implementation


The reference ProgPoW mining implementation is located at the @ifdefelse ProgPOW repository.

Copyright


Copyright and related rights waived via CC0.

The reference ProgPoW mining implementation located at ProgPOW is a derivative of ethminer so retains the GPL license.