Forces of the Web3 Data Universe

In Web3 we spend a lot of time making data understandable, but making data useful takes more than human-readable data: it takes access points that match use cases.

One platform. One access point for both offline and online data use cases. That’s where we’re headed at Flipside.

Why? Let's start with this power user of our platform. They get it.

“Please just give me one platform to access everything, online data to power my apps, offline data to power analytics, preferably it’s all in SQL, I’m tired of gluing together all these systems”

— Flipside Crypto Power User

To get there we need a framework to guide how we think about data.

Blockchain networks exist in a state that is technically open, but practically closed. Data is optimized for consensus, storage, and network communication, trading off human readability.

For data to be useful it must be (a) understandable and (b) accessible to users based on the specific needs of their projects.

At Flipside we’ve spent years building the missing layer that translates raw network data into a curated version that is actually useful for analysts, data scientists, and developers. However, no matter how human-readable that data is, it’s useless on its own.

For data to be useful it must be accessible in a way that maps to use cases. This is often overlooked and at Flipside we obsess over it.

In this post, I’ll share the basic framework we use to design data access layers that transform the best blockchain data into useful data.

Let’s get our bearings.

When considering how to design access to data, start with two questions:

  1. How much throughput does your project require?

  2. What level of latency is acceptable?

Identify Your Requirements: Throughput vs. Latency

Let's define throughput as the number of requests a system can handle within a specific time frame. For instance, at Flipside, our blockchain nodes require a throughput of approximately 90,000 requests per minute, while user queries against Snowflake require an average throughput of around 70 queries per minute.

In Web3, latency refers to the time gap between when a block is published and when it appears in your data store for reading. A trading system is a classic example of a low-latency application, where the difference between a profitable and an unprofitable trade can be measured in microseconds. For even more "real-time" requirements, you might instead measure latency as the time between when a transaction enters the mempool and when it is indexed.

Use Cases: are you online or offline?

Let's take a high-level look at how some typical use cases compare in terms of throughput and latency.

Consider the second quadrant (dApps, block explorers, wallets, trading systems, etc.), the high-throughput, low-latency corner: scenarios where “some event occurs, the user needs to know about it right away, and a lot of users are likely to be asking the system about that event at the same time”.

To draw a comparison from Web2, imagine you are shopping on Amazon. You add a product to your cart and click checkout. If the item takes a long time to be added to your cart (high latency), and only a few hundred people worldwide can add items to their carts simultaneously (low throughput), we’d probably all know of Amazon as a river in South America.

Similarly, when a user submits a transaction in a dApp, the application must reflect its effects instantly. It also needs to scale to thousands of requests per minute from thousands of users.

However, there are instances where higher latency and lower throughput are acceptable. BI and Analytics use cases typically fall under this category. When performing historical analysis, real-time updates may not significantly impact the results. As with most things, this is a spectrum that depends on the specific use case.

In other words, is your use case online (high throughput, low latency) or offline (low throughput, mixed latency)?

Once we understand the latency and throughput requirements of our project, we can dig into the architectural decisions we’ll need to make to meet those requirements.

Choice #1: How will users access your data?

The accessibility of your data to end users depends greatly on its scale and the required throughput.

To optimize for greater throughput, a structured API front-loads the compute at write, resulting in low compute at read time. On the other hand, an open-ended, expressive SQL interface prioritizes compute at read over compute at write.

If you're dealing with a small amount of data, the difference in throughput between compute at write and compute at read may not be significant.

Unfortunately, blockchain data isn’t small, and it continues to expand at a significant clip.

Let's consider an example where we require a list of the top 10k most active addresses by transaction count on Solana.

Every day, Solana generates hundreds of gigabytes of data. Assuming we have a table containing every Solana transaction, getting the answer is straightforward:

SELECT address, count(distinct tx_id) as tx_count 
FROM solana_txs 
GROUP BY address
ORDER BY tx_count DESC
LIMIT 10000

If we design our system to offer an open-ended SQL interface, users could effortlessly run the query above. However, executing it over hundreds of terabytes of data could take tens of seconds or even minutes, depending on the available compute.

Conversely, if we save the query results to a database and expose them via an API endpoint, users can retrieve the results within a few milliseconds.
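As a rough sketch of the compute-at-write approach (the table name and schedule here are hypothetical), a scheduled job could materialize the heavy aggregation into a small summary table that the API simply reads:

-- Hypothetical scheduled job: pay the cost of the big scan once per refresh
CREATE OR REPLACE TABLE top_solana_addresses AS
SELECT address, count(distinct tx_id) AS tx_count
FROM solana_txs
GROUP BY address
ORDER BY tx_count DESC
LIMIT 10000;

-- The API endpoint then serves a cheap read of 10k precomputed rows
SELECT address, tx_count
FROM top_solana_addresses
ORDER BY tx_count DESC;

The expensive scan runs once per refresh, and every API request pays only for a small read, which is why this path can return in milliseconds.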

Both approaches have tradeoffs. An open-ended SQL interface empowers users to request any data they desire, while a structured API is useful if you know the exact universe of what the user will need in advance.

Choice #2: How will you architect your pipelines?

Let's consider a different scenario. Suppose we are constructing a system to detect smart contract hacks by monitoring user activity, where transaction count serves as an input to our model. In this case, real-time latency is crucial for detecting and responding to an attack promptly.

Now, let's examine how the scale of our data affects our pipeline architecture.

When building pipelines, we generally have two options: streaming or batch.

In a streaming architecture, data is processed in real-time as it arrives, and the output is continuously updated, allowing for swift and flexible decision-making. On the other hand, in a batch architecture, data is collected and processed periodically in larger chunks, which can be more efficient for processing vast amounts of data but may not offer immediate insights.

When the data is small, we may achieve streaming-like latency by running batches on very short intervals (🙅please don’t do this). As the data scales up, though, this approach becomes impractical: the batch processing time eventually exceeds the interval between jobs.

To revisit our alerting example, a batch approach caps data latency at the batch job frequency. A streaming approach, in contrast, can update a user’s transaction count each time a transaction is processed.
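To make that concrete, here is a minimal sketch of the batch version of the transaction-count feature, assuming a Snowflake-style warehouse and illustrative table and column names; its freshness is capped by however often the job runs:

-- Hypothetical batch job, scheduled e.g. every 15 minutes:
-- the alerting model never sees activity fresher than the last run
CREATE OR REPLACE TABLE address_tx_counts AS
SELECT address, count(distinct tx_id) AS tx_count
FROM solana_txs
WHERE block_timestamp >= dateadd('hour', -24, current_timestamp())
GROUP BY address;

A streaming pipeline would instead increment the affected address’s count as each transaction arrives, so the model’s input lags the chain by roughly the time it takes to process a single event.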

Okay cool. Streams seem magical. Let’s just stream all the things.

🤠 Not so fast, cowboy.

Streaming, like most things, has a time and place in your architecture. We stream raw data from nodes in real time; this forms a solid foundation for low-latency applications and a core we can build on.

From here we run a lot of batch jobs to model, or transform, our raw data into curated views. So why not continue with streaming? Well, that comes down to team composition, speed, and product requirements.

At Flipside we have a large team of analytics engineers: analysts who apply engineering best practices to curating data sets with SQL. They have deep expertise in Web3 ecosystems and SQL, and they can quickly whip up a curated view from raw transaction data using SQL in combination with DBT.
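For a flavor of that workflow, here is a minimal, hypothetical DBT-style model (the source and column names are assumptions): the analytics engineer writes plain SQL, and DBT handles materializing and refreshing it as a curated view.

-- models/solana_daily_activity.sql (hypothetical DBT model)
-- In practice this would likely be an incremental model; kept as a full table for brevity
{{ config(materialized='table') }}

SELECT
    date_trunc('day', block_timestamp) AS activity_date,
    address,
    count(distinct tx_id) AS tx_count
FROM {{ source('solana', 'transactions') }}
GROUP BY 1, 2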

On the other hand, developing every curated view using a stream-based approach typically requires different lower-level skills and takes more time. In this case, we intentionally trade off a little bit of latency for agility.

There are tools that aim to bring the simplicity and agility of the SQL-and-DBT analytics engineering workflow closer to streaming. I have my eye on Materialize and its DBT integration.
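As a rough illustration (source and column names assumed), the same kind of aggregation could be expressed in Materialize as a materialized view that is updated incrementally as new transactions stream in, rather than recomputed on a batch schedule:

-- Hypothetical Materialize-style streaming view:
-- results stay continuously up to date as events arrive
CREATE MATERIALIZED VIEW address_tx_counts AS
SELECT address, count(distinct tx_id) AS tx_count
FROM solana_txs
GROUP BY address;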

Latency and throughput are the forces that should guide how you deliver access to your data and how you architect your data pipelines. These choices determine whether you hit the mark in solving your user’s use case or whether your beautiful data just collects dust.

If you’re building data solutions for Web3 builders, keep this in mind:

Builders (data scientists, analysts, and developers) should never have to think about the contents of this article. They should be able to focus on their problem domain instead of stepping through the potholes that come with putting together a data platform. In other words, make it your mission to break down the barriers for Web3 builders.

In the coming month, Flipside will be releasing Compass, our new state-of-the-art query engine, alongside LiveQuery; together they provide a path toward this future. We’re here to reduce the barriers for our builders so they can stay focused on building real solutions for their users.

👇👇👇