No credit card required
Many years ago I was part of the team building the tools to help physicists access data for the Large Hadron Collider at CERN. I recall spending many hours sitting down with physicists, helping them navigate over a set of custom tools we had built so that they could track down which “physics events” they needed, download them locally and start their work. Helping them access data was, in essence, my role.
But as soon as we got through the various hurdles to get them the data they needed – which was hard enough, as there were petabytes of physics events, vast amounts of scripts, and home-grown tools – I then saw them struggling to process that data. Simple tasks like filtering events with some given set of physics properties was hard work. I recall contrasting what I saw – hand-rolled C++ code – with the world of relational databases and their powerful query optimizers. There were plenty of good reasons not to use any of these systems: data had a complex structure with plenty of hierarchies; scale was an issue; data ownership and ingestion required by traditional databases were all a no-go. But a lot was lost in the process.
What we were missing was a query engine that could query “all the data, anywhere”. By all the data I meant not just relational tables – the staple of today’s data analysis tools – but data with complex deep hierarchical structures for example. Whilst most data can be modelled relationally as tables, there’s a level of complexity where it’s simply not helpful to do so, as it just makes it harder to understand. And by “anywhere” I meant data from any source, without having to download it or ingest it (i.e. replicate it) into another system.
Our solution at CERN was to build home-grown tools and plenty of processes. Since those days, NoSQL, NewSQL, virtual data warehouses, data lakes and cloud computing have made tremendous progress. But for the average user, querying “all the data, anywhere” is still an elusive goal. Much of what I saw back then, I got to relive years later when querying data lakes for instance. This problem – of querying “all the data, anywhere” – is the problem we have set ourselves to solve at RAW Labs.
Querying “all data”
SQL is the de facto standard for querying data. The reasons are manifold but at this point it is the most well-known language with broad support. There are a fundamental reasons that make SQL a nearly ideal query language.
To start, SQL is a declarative language. By this we mean that as a user of SQL you express the logic of a computation and not how it is done. A SQL user declares that “I want to join datasets A and B discarding all the data with those properties”. A user does not specify how that join should be computed: which algorithm should be used, or how the computation should be distributed in your machine or cluster, or even when the data files should be open or closed, or how much should be read into memory at any point.
Another property of SQL is the abstraction level it provides over the data. As a user of a database system you are not concerned how the data is stored – which format it is on -, nor how best to access it. This is all handled for you. Data can be cached in-memory, replicated on disk, or indexed in a variety of formats and all of that is transparent to you. The format can even evolve over time, transparently, bringing you additional performance benefits.
The declarative nature of the query language and the abstraction provided by database engines gives ample room for optimization. These are some of the most important reasons why SQL won the “query language” wars.
When SQL isn’t enough
But, just as I saw at CERN, there are cases where SQL isn’t the right tool for the job. SQL in fact requires relational tables as its input data source. And yes, we can model all the data into tabular forms. There are even tools to transform complex nested structures into tables. But no matter how much we try, some data is just too clunky to fit into tables: deep hierarchies, matrices, complex document stores all have their reasons of existing for storing data.
There’s also SQL’s recent evolution. SQL is becoming too complex in ways that are, frankly, unhelpful to the users. For example, in recent years the SQL standard has had many extensions added. One example is the support for JSON, which is now part of the SQL:2016 standard. Under the new standard, to every operation on a JSON field must use JSON-specific operations, e.g. json_arrayagg or json_array_elements along with countless other specific functions. It is also a deviation of the original design principles of SQL because you now must write code that depends on the format of your data. If your data were to move from JSON to XML or Protobufs – even with the exact same schema – you would need completely different query code.
There are good reasons for how JSON support is implemented in SQL but in my opinion these have more to do with backward compatibility and legacy support than with providing users with the best possible query language. Consider the following example. Say you have the following JSON data with logs of transactions. Each transaction contains one or more entries:
Now suppose we want to obtain the total amount, which requires scanning all elements in the JSON data and for each of them scan all entries.
Here’s how you’d perform that aggregation in PostgreSQL:
Note the use of JSON-specific elements for accessing “array elements”, another syntax for accessing object fields, as well as the need of casting types; not to mention the use of CTEs (WITH statement) for a seemingly simple operation.
This is hardly user-friendly. The JSON-specific fields also mean that the storage abstraction we so appreciated is being broken: you now must know your data is stored in JSON and what the specific operations to query JSON are. Same for XML, and others.
This is now far too complicated for the average user.
Can we do better? Absolutely. At RAW Labs – and before that, in a database research group at EPFL, Switzerland, where RAW was born – we’ve been thinking of query and data manipulations languages for the future. We will see more about this language, called Snapi in future posts, but you can peek at some of Snapi examples here.
Querying data “anywhere”
To query “all the data”, we first have to get to it. Over the years there have been many attempts at querying data from “anywhere”. Data integration, virtual data warehouses, and data lakes have all been recent attempts at solving this problem with the goal of providing a “single source of truth”.
Data integration and virtual data warehouses are based on the idea that data should be kept at source. On top of that, a layer – ideally virtual – is built that provides a unified view over the underlying data. Queries are sent to this layer which then splits and otherwise forwards them to the underlying systems. Data lakes turn this idea around and instead provide a single place where all types of data can be stored. Data lakes provide in essence a large, distributed file system or object-store with a SQL-like query layer on top.
While these are very distinct implementations, they are actually two ways of solving the problem of bringing disparate sources of data together. Data lakes, however, can lead to many data duplicates, which bring along new concerns caused by data duplication such as additional cost, security issues and quality concerns. Loss of precision during the ETL/ELT process and increased latency are common concerns. Virtual data integration, on the other hand, has performance and scalability issues. Most importantly, maintaining consistency of the unified view over backend systems that can be changing over time is also very hard to manage. Data virtualisation is a powerful concept but requires a good answer to scalability and consistency issues. Data lakes can be distributed engineering marvels and provide immense flexibility and scalability.
At RAW Labs, we took inspiration from both “camps” so to speak. We use data lake technology in our query engine to provide large-scale caches and achieve the scalability and robustness requirements needed. However, data is always referenced from the source, as in data virtualisation. If you want to take a sneak peek on how easy it is to query data anywhere from RAW, look at some of our examples here.
A better future
We believe there are fundamental design principles to be followed as to be able to query all the data, anywhere:
- a declarative language;
- a powerful data model that underlies it;
- powerful optimizers;
- a managed platform for the best developer experience.
This is the vision we are building towards at RAW, which will enable everyone to query all the data, anywhere.
Stay around to learn more or join us on Discord to leave your thoughts.