Why reading data from the Ethereum blockchain is hard and how to speed it up
What is the content of a blockchain?
A blockchain is an immutable, append-only distributed database. Working with a blockchain requires more than understanding these unique features: one also needs to understand what data is saved on the blockchain. What does blockchain data actually look like, i.e. what is the content and how is it formatted?
As the word actually describes, a blockchain is a chain of blocks, and it works similarly for the existing major chains like Bitcoin, Ethereum and other Proof-of-Work networks. To simplify, in this article the focus is on the Ethereum blockchain only. Technical details that are not actual data but mainly support the functioning of the network (e.g. uncles etc.) are intentionally left out.
Each block consists of the header and the transactions mined with the block. The chain is actually formed by hashing various information of the previous block and including it in the new block header.
Besides this hash of the previous header, an Ethereum block header contains the state root as well as the roots of the transactions and the receipts (see here for details).
Of course there is also a block number (the increasing, so-called “block height”), a timestamp and information about the gas limit and gas usage. In addition, there are some pieces related to the mining process (like the miner’s address, difficulty, nonce etc.).
Check out this post for some diagrams and more details.
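To make the "chain of blocks" idea concrete, here is a minimal sketch of how headers link to each other. It is a toy model under stated assumptions: the `ToyHeader` fields are a small subset of a real header, the values are made up, and SHA-256 over JSON stands in for the Keccak-256 hash over the RLP-encoded header that Ethereum actually uses.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ToyHeader:
    """Hypothetical, heavily reduced block header (not the real layout)."""
    number: int
    parent_hash: str
    timestamp: int
    transactions_root: str


def header_hash(header: ToyHeader) -> str:
    # Stand-in hashing: SHA-256 over a JSON dump of the header fields.
    encoded = json.dumps(asdict(header), sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def verify_chain(headers: List[ToyHeader]) -> bool:
    # Every block must reference the hash of the header right before it.
    return all(
        child.parent_hash == header_hash(parent)
        for parent, child in zip(headers, headers[1:])
    )


# Made-up example values, for illustration only.
genesis = ToyHeader(0, "0" * 64, 1000, "aa")
block1 = ToyHeader(1, header_hash(genesis), 1015, "bb")
block2 = ToyHeader(2, header_hash(block1), 1030, "cc")
chain = [genesis, block1, block2]
```

Changing any field of an earlier header changes its hash, so `verify_chain` fails for every chain that still contains the old children. That is what makes the data effectively immutable.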
Now let’s have a look at the transactions that are bundled into each block from the pool of pending transactions. Besides the (account) nonce, the gas price and the gas limit, a transaction comprises a receiving “to” address, a sending “from” address, an ETH value denominated in wei and a data field. The transaction can then be sent to the network and tracked by its 256-bit transaction ID, which is the hash of the transaction.
For more details and an example, check out this informative article by CodeTract.
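The transaction fields listed above can be sketched as a simple record. This is an illustration only: the class name, the field names and the example addresses are assumptions, and signature fields are left out. The one hard fact used is that 1 ETH equals 10^18 wei.

```python
from dataclasses import dataclass
from decimal import Decimal

WEI_PER_ETH = 10**18  # 1 ETH = 10^18 wei


@dataclass(frozen=True)
class Transaction:
    """Hypothetical record holding the fields named in the text."""
    nonce: int
    gas_price: int  # wei per unit of gas
    gas_limit: int
    to: str
    sender: str     # the "from" address ("from" is a Python keyword)
    value: int      # amount in wei
    data: str       # hex string, "0x" for a plain value transfer


def wei_to_eth(wei: int) -> Decimal:
    # Decimal avoids the rounding issues of binary floats.
    return Decimal(wei) / Decimal(WEI_PER_ETH)


def max_fee_wei(tx: Transaction) -> int:
    # Upper bound of the fee: every unit of the gas limit is used.
    return tx.gas_price * tx.gas_limit


# Made-up example transaction.
tx = Transaction(
    nonce=0,
    gas_price=20 * 10**9,
    gas_limit=21_000,
    to="0x" + "11" * 20,
    sender="0x" + "22" * 20,
    value=15 * 10**17,
    data="0x",
)
```

Keeping values as integers in wei and converting only for display mirrors how the chain itself stores amounts.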
How is smart contract interaction data stored in Ethereum?
As opposed to Bitcoin, the Ethereum blockchain offers more than the simple transfer of value via end-to-end transactions. Running Turing-complete code is the key differentiator for Ethereum, and therefore the most interesting data lies in the smart contract interactions of the “world computer”.
To be able to run code, Ethereum provides a virtual machine called the Ethereum Virtual Machine (EVM). It abstracts the underlying computer so smart contracts can run on every computer that runs an Ethereum node. A smart contract is just a fancy word for a program written in a programming language and compiled for the EVM.
So let’s have a closer look at these interactions now: Smart contracts generate logs by firing events when a function that emits them is called by an external account (transaction) or by another smart contract (internal transaction). Events can therefore be described generally as asynchronous triggers with data. Asynchronous, because the log is only written once the originating transaction has been mined into a block. More details on events and logs can be found in the Solidity documentation.
The most important use case for events is to provide smart contract return values for a user interface. Logs can also be used as a cheaper form of storage — as described in more detail here.
In order to understand how the function is called, one must look at what makes up the (optional) data field in a transaction. It could be arbitrary data, but most often it is actually a function call to a smart contract.
The transaction is targeting the smart contract by using its address in the “to” field.
In order to know which function the transaction is calling within the smart contract, the functions of the contract must be known beforehand so that a lookup table can be built. The first 4 bytes (32 bits) of the transaction data field are the function selector, i.e. the first 4 bytes of the Keccak-256 hash of the function signature. The selector is followed by the arguments, with each statically sized argument padded to 256 bits. In essence this means that the data fields of function calls are encoded. To use and interpret them, they need to be decoded.
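A minimal sketch of such a decoding step could look like this. The lookup table is hard-coded with the widely published ERC-20 selectors for `transfer` and `approve` instead of being computed, because Python's standard library has no Keccak-256 (`hashlib.sha3_256` uses a different padding). The function name `decode_call` and the example calldata are assumptions, and only the static types `address` and `uint256` are handled.

```python
from typing import List, Tuple

# selector (first 4 bytes of the Keccak-256 signature hash) -> (name, types)
KNOWN_SELECTORS = {
    "a9059cbb": ("transfer", ["address", "uint256"]),
    "095ea7b3": ("approve", ["address", "uint256"]),
}


def decode_call(data_hex: str) -> Tuple[str, List[object]]:
    raw = data_hex[2:] if data_hex.startswith("0x") else data_hex
    selector, args_hex = raw[:8].lower(), raw[8:]
    if selector not in KNOWN_SELECTORS:
        raise ValueError(f"unknown function selector 0x{selector}")
    name, types = KNOWN_SELECTORS[selector]

    # Each static argument occupies one 32-byte word (64 hex characters).
    words = [args_hex[i:i + 64] for i in range(0, len(args_hex), 64)]
    if len(words) < len(types):
        raise ValueError("calldata is shorter than the function expects")

    decoded: List[object] = []
    for type_name, word in zip(types, words):
        if type_name == "address":
            decoded.append("0x" + word[-40:])  # last 20 bytes of the word
        elif type_name == "uint256":
            decoded.append(int(word, 16))
        else:
            raise ValueError(f"type {type_name} is not handled in this sketch")
    return name, decoded


# Made-up calldata: transfer(<address of 20 times 0xab>, 1000)
example_data = "0xa9059cbb" + "00" * 12 + "ab" * 20 + format(1000, "064x")
```

Dynamic types such as `string` or `bytes` use offsets into the data and need more work. In practice a contract's ABI and an ABI decoding library take care of this.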
For details on how to interpret the topic and data fields in a transaction receipt log see here.
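As an illustration of topics and data, here is a hedged sketch that decodes an ERC-20 `Transfer(address,address,uint256)` log. The first topic is the well-known Keccak-256 hash of that event signature; the two indexed addresses sit in the following topics and the non-indexed value sits in the data field. The function name and the sample log are assumptions. The dictionary keys follow the shape of logs in a JSON-RPC transaction receipt.

```python
TRANSFER_TOPIC = (
    "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"
)


def decode_transfer_log(log: dict) -> dict:
    topics = log["topics"]
    if len(topics) != 3 or topics[0].lower() != TRANSFER_TOPIC:
        raise ValueError("not an ERC-20 Transfer log")
    return {
        "token": log["address"],       # the contract that emitted the event
        "from": "0x" + topics[1][-40:],  # indexed arguments are 32-byte words
        "to": "0x" + topics[2][-40:],
        "value": int(log["data"], 16),   # non-indexed argument
    }


# Made-up log entry for illustration.
sample_log = {
    "address": "0x" + "cc" * 20,
    "topics": [
        TRANSFER_TOPIC,
        "0x" + "00" * 12 + "aa" * 20,
        "0x" + "00" * 12 + "bb" * 20,
    ],
    "data": "0x" + format(2500, "064x"),
}
```

Note that the decoder needs to know which arguments are indexed. This is again information that comes from the contract's ABI, not from the log itself.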
Reading from the Ethereum blockchain is hard
So now that you know what is stored on the Ethereum blockchain, I think it has become clear how difficult it is to extract actionable insights out of it. Actually there are several things that make it particularly hard:
- hexadecimal hashes instead of human readable text
- sequential nature of the data
- slow JSON-RPC interface
The most obvious difficulty is the fact that pretty much everything consists of hexadecimal hashes instead of clear text labels. For account addresses this can be seen as a feature allowing pseudonymity. But with regards to smart contract interactions, the data needs to be converted into a human readable format.
Another aspect is the sequential nature of the data. Only very few questions can be answered with a single query. Most often, you’ll have to traverse the chain with multiple requests, even for simple tasks such as displaying the transaction history of an account. Now think about calculating an average gas price over the history of the blockchain, or monitoring the current state of a specific token…
And lastly, consider the interface for querying the data: it is slow. Before you can even start, you’ll need to set up an archive node (e.g. Geth or Parity) in order to have all historical data available. Its size is about 1.7 TB as of November 2018, and you need to store it on SSDs in order to get it running reliably. Even then, syncing the Ethereum Mainnet takes several days or even weeks. Once the node is in sync, you can only query it through the JSON-RPC API. This is slow, particularly when considering the multitude of calls you need to make due to the sequential nature of the data.
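To give an impression of what "a multitude of calls" means, the following sketch builds the JSON-RPC requests for a range of blocks and computes an average gas price from the answers. `eth_getBlockByNumber` is a standard JSON-RPC method and quantities are hex-encoded. The helper names, the block range and the sample response are assumptions, and nothing is sent over the network here. Batching several requests into one array is part of JSON-RPC 2.0, but how well a given node handles large batches is not guaranteed.

```python
import json
from typing import List


def block_request(number: int, request_id: int,
                  full_transactions: bool = True) -> dict:
    # One request per block; the block number is passed as a hex string.
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getBlockByNumber",
        "params": [hex(number), full_transactions],
    }


def batch_payload(first: int, last: int) -> str:
    # A JSON-RPC 2.0 batch is simply a JSON array of request objects.
    return json.dumps([block_request(n, n) for n in range(first, last + 1)])


def average_gas_price(blocks: List[dict]) -> float:
    # Gas prices arrive as hex strings inside each transaction object.
    prices = [
        int(tx["gasPrice"], 16)
        for block in blocks
        for tx in block["transactions"]
    ]
    return sum(prices) / len(prices) if prices else 0.0


payload = batch_payload(6_000_000, 6_000_002)

# Made-up, heavily reduced response data for illustration.
sample_blocks = [
    {"transactions": [{"gasPrice": hex(10**9)}, {"gasPrice": hex(3 * 10**9)}]},
]
```

Even with batching, an average over the whole chain history means requesting millions of blocks one range at a time, which is exactly the bottleneck described above.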
How to speed up reading blockchain data
In order to monitor smart contract development, either on a test net or on the Ethereum Mainnet, one needs to set up a node and create some sort of index, in other words a database.
In doing so, there are essentially two options: a) index the whole blockchain or b) limit the amount of data that is drained from the node into the index. Which approach to choose is a trade-off between resource consumption and flexibility. Either way, a filter mechanism for the relevant data becomes necessary.
Depending on the amount and type of data that you want to access, you should also think carefully about how to query your blockchain index. The query language and the database system are, of course, mostly interdependent, so you’ll need to consider both in tandem.
Popular choices tend to be SQL (e.g. for PostgreSQL database) or Elasticsearch Query DSL (for Elasticsearch) as many developers are familiar with their query syntax.
Obviously, these architectural considerations become more complex if you want to share such a database index across your teams or departments, or even between several business entities. This would at least require authentication, likely also authorization and possibly even accounting.
Let’s assume you have made up your mind about these considerations, have acquired the infrastructure and set up the systems. Now you’ll do classical extract, transform and load (ETL) processes.
Extract relevant data from the node
First, you have to extract the data relevant for you from the node, for example everything concerning certain smart contracts, or the complete blockchain history starting from a specific point in time. Of particular interest is the question of how fast you want to retrieve incoming blocks. While it is obviously beneficial to be up to date quickly, you might have to deal with chain reorganizations from time to time. A reorganization occurs when a client node discovers a new well-formed chain that is the longest by total difficulty and that excludes one or more blocks the client previously thought were part of the longest chain. These excluded blocks become orphans, and therefore the data contained in them needs to be purged from the index or at least flagged in it.
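Handling reorganizations in the index can be sketched as follows. This is a simplified in-memory model, not a production design: the class `BlockIndex`, its return values and the short made-up hashes are assumptions. Blocks are dictionaries with `number`, `hash` and `parentHash`, as in JSON-RPC responses (with the number already converted to an integer).

```python
from typing import Dict, List


class BlockIndex:
    """Toy index that flags orphaned blocks instead of deleting them."""

    def __init__(self) -> None:
        self.canonical: Dict[int, dict] = {}
        self.orphaned: List[dict] = []

    def add_block(self, block: dict) -> str:
        number = block["number"]
        existing = self.canonical.get(number)
        if existing is not None and existing["hash"] == block["hash"]:
            return "known"

        parent = self.canonical.get(number - 1)
        if parent is not None and parent["hash"] != block["parentHash"]:
            # The stored parent is not on the new chain: the caller should
            # walk back and re-fetch earlier blocks first.
            return "parent_mismatch"

        # Everything stored at this height or above belongs to the old fork.
        for height in sorted(h for h in self.canonical if h >= number):
            self.orphaned.append(self.canonical.pop(height))
        self.canonical[number] = block
        return "added"


def make_block(number: int, block_hash: str, parent_hash: str) -> dict:
    return {"number": number, "hash": block_hash, "parentHash": parent_hash}


index = BlockIndex()
results = [
    index.add_block(b)
    for b in [
        make_block(1, "a1", "a0"),
        make_block(2, "a2", "a1"),
        make_block(3, "a3", "a2"),
        make_block(2, "b2", "a1"),  # competing block at height 2
        make_block(3, "b3", "b2"),
    ]
]
```

After this run, blocks `a2` and `a3` are flagged as orphaned and the index follows the `b` fork. A common alternative is to index only blocks that already have a number of confirmations, trading freshness for fewer corrections.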
Transform data into human readable format
In the transformation step you probably want to make the data human readable. Examples might be labeling Ethereum accounts of known origin (e.g. exchange wallets, smart contract names etc.), or fetching smart contract Application Binary Interfaces (ABI) in order to spell out the names of the functions. You’ll use some form of mapping in order to match the raw data to the clear text labels. This could also mean getting historical price information for ERC-20 tokens for example and combining the timestamp with the appropriate block height. You’d need that to quantify value transfer transactions in fiat denominations, which would be an example of enriching the blockchain data from external sources.
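Two of these transformations can be sketched in a few lines: labeling known addresses, and matching the timestamp of an external data point (for example a price quote) to a block height. All names, labels, timestamps and block numbers below are made up for illustration. The lookup assumes the block timestamps are sorted in ascending order.

```python
from bisect import bisect_right
from typing import List, Optional

# Hypothetical mapping from raw addresses to clear text labels.
KNOWN_LABELS = {
    "0x" + "aa" * 20: "Example Exchange (hot wallet)",
}


def label_for(address: str) -> str:
    # Fall back to the raw address when no label is known.
    return KNOWN_LABELS.get(address.lower(), address)


def block_at_or_before(timestamp: int,
                       block_timestamps: List[int],
                       block_numbers: List[int]) -> Optional[int]:
    # Binary search for the last block mined at or before the timestamp.
    idx = bisect_right(block_timestamps, timestamp) - 1
    return block_numbers[idx] if idx >= 0 else None


# Made-up sample data: three consecutive blocks.
timestamps = [100, 115, 130]
numbers = [10, 11, 12]
```

With such a lookup, a price observed at timestamp 120 would be attached to block 11, and every value transfer in that block could then be expressed in a fiat denomination.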
Load into database index for faster querying
All of this data then needs to be loaded into a database and be indexed for best query performance. Depending on the chosen technology, this process might take a while and involve different steps in itself.
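As a small stand-in for the load step, the sketch below uses SQLite from Python's standard library instead of PostgreSQL or Elasticsearch. The table layout, the index names and the sample rows are assumptions. Values are stored as text because a 256-bit integer does not fit into SQLite's 64-bit integer type.

```python
import sqlite3
from typing import Iterable, List, Tuple

ALICE = "0x" + "aa" * 20
BOB = "0x" + "bb" * 20
CAROL = "0x" + "cc" * 20


def build_index(rows: Iterable[Tuple[int, str, str, str, str]]) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE transfers (
               block_number INTEGER NOT NULL,
               tx_hash      TEXT NOT NULL,
               sender       TEXT NOT NULL,
               recipient    TEXT NOT NULL,
               value_wei    TEXT NOT NULL
           )"""
    )
    conn.executemany("INSERT INTO transfers VALUES (?, ?, ?, ?, ?)", rows)
    # Indexes on both address columns keep account lookups fast.
    conn.execute("CREATE INDEX idx_sender ON transfers (sender, block_number)")
    conn.execute("CREATE INDEX idx_recipient ON transfers (recipient, block_number)")
    conn.commit()
    return conn


def history(conn: sqlite3.Connection, account: str) -> List[Tuple[int, str]]:
    # One query replaces a traversal of the chain via many RPC calls.
    return conn.execute(
        "SELECT block_number, tx_hash FROM transfers "
        "WHERE sender = ? OR recipient = ? ORDER BY block_number",
        (account, account),
    ).fetchall()


# Made-up, already transformed rows for illustration.
conn = build_index([
    (12, "0x03", BOB, ALICE, str(5 * 10**17)),
    (10, "0x01", ALICE, BOB, str(10**18)),
    (11, "0x02", BOB, CAROL, "1"),
])
```

The transaction history of an account, which required traversing the chain before, is now a single indexed query such as `history(conn, ALICE)`.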
But now it is done!
Basically you have transferred the serialized blockchain content from an OLTP (On-Line Transaction Processing) to an OLAP (On-Line Analytical Processing) environment. Therefore, you are now able to read from a database index much more quickly to start digging into the blockchain data.
Hopefully you have gained an understanding of the content stored on the Ethereum blockchain and of how to dissect the logs of smart contract events in order to see their interactions. As you have learned from this article, reading from the blockchain is hard and the process of making the data accessible faster is quite involved.
Tools and services to help you access blockchain data
Fortunately, there are plenty of tools and services available, so you don’t have to go through the whole process yourself. In the FAQ section of our website, you can find different examples and how they compare to our offering.
Eth.events provides a complete set of indexes of the Ethereum Mainnet as well as all test nets as Software-as-a-Service. As a smart contract developer, you can get your free API key here and start querying our Elasticsearch index right away. Or you might want to check out our documentation on how to get started — and feel free to contact us for any questions, support and feedback. We’d love to talk to you!
In any case: Happy BUIDLing and stay awesome!
The article was first posted on https://eth.events/news/