Tech Blog

Querying nested data

RAW supports arbitrarily nested data structures. These are not supported in standard SQL, although multiple vendors provide limited support through custom SQL extensions. These extensions are usually implemented with custom functions and/or special syntax.

In RAW, however, handling nested data does not require any additional syntax or specific functions: the syntax is uniform and coherent in all situations. In addition, all query forms are supported without limitations.

As an example, assume a data source called “sales”, described here in JSON, with the following content:

[
    {"country": "CH",
     "products": [
         {"category": "Keyboard", "cost": 50},
         {"category": "Keyboard", "cost": 70},
         {"category": "Monitor", "cost": 450}]},
    {"country": "US",
     "products": [
        {"category": "Keyboard", "cost": 20},
        {"category": "Monitor", "cost": 200}]}
]

To find the products that cost more than a certain threshold:

SELECT p FROM sales s, s.products p WHERE p.cost > 60

The output result (in JSON) is:

[
    {"category": "Keyboard", "cost": 70},
    {"category": "Monitor", "cost": 450},
    {"category": "Monitor", "cost": 200}
]

The syntax “FROM sales s, s.products p” means that every sale is assigned to “s” and then every product of a sale is assigned to the “p”. (More formally, every element of the collection “sales” is assigned to the identifier “s” and every element of the inner collection “products” is assigned to the identifier “p”.) The input data is a nested data structure (unsupported in SQL) but the output is a collection of records of primitive types (supported in SQL). It is an example of how RAW can be used to input nested data and output tabular SQL-friendly data.

To find the average prices of keyboards per country:

SELECT s.country, AVG(SELECT cost FROM s.products WHERE category="Keyboard") AS avg_cost FROM sales s

The output result (in JSON) is:

[
    {"country": "CH", "avg_cost": 60},
    {"country": "US", "avg_cost": 20}
]

The query reads the source “sales”, which is a collection of records with fields “country” (type string) and “products” (type collection(record(category: String, cost: int))). It assigns every element of the collection to the identifier “s”. It then outputs the value given by s.country and computes the average cost of products per sale. The inner SELECT query reads from “s.products”, i.e. from the list of products in the sale “s” and filters those products whose category is “Keyboard”.

Similarly, to find the country names whose average sale prices of keyboards exceeds a certain threshold:

SELECT s.country AS country_name FROM sales s
WHERE AVG(SELECT cost FROM s.products WHERE category="Keyboard") > 40

The output result (in JSON) is:

[
    {"country_name": "CH"}
]

Master Plan

Let’s take one from Elon Musk’s book and let you know about our secret master plan for world, or rather, database world domination!

Why?

We founded RAW Labs in 2015, after building and maturing RAW for over 4 years in academia. Our motivation, as is often the case, was frustration. Frustration with scientific applications that cannot rely on database engines and build their homemade solutions at great cost. Frustration by the emergence of new, incredibly useful paradigms for data management – e.g. machine learning – and seeing how inadequate current technologies were in coping with those. Frustrated by countless hours spent writing scripts to load data to the database, or figuring out how to tune the query engine. Frustrated by ORMs layers. Frustrated by having database engines continue to expect that “all data belong to us”, when data grows so much faster than the database engine can ingest it. And frustrated because the idea of data warehouses as a single source of truth had failed, but not many seemed to do much about it.

The solution grew gradually in our heads, and with the time to experiment in academia, it became obvious that we were onto something significant. The solution was in a combination of ideas taken from multiple domains of computer science, including compilers, functional language, database research, as well as math.

How?

Let’s disentangle the issues:

  • It takes too long to load data. Solution: don’t load data. Instead, design the engine to query at source.
  • It’s hard to write scripts to load data: Solution: don’t write scripts to load data. Write queries instead, with features to do “script-y-stuff”.
  • It’s hard to tune the database engine. Plus, requirements change all the time, so even if tuned correctly, tomorrow’s queries are different than today’s. Solution: don’t tune the database. Let it tune itself based on usage.
  • Modern applications have data formats that are rich and complex; not just tables and not easily modeled as tables. Solution: support rich data formats. Bonus: ORM layers now have straightforwards mappings to modern programming languages.
  • Modern data transformations are more complex than SELECTs and JOINs. Solution: support operations other than classical database algebraic operators; but make sure to find the correct math abstractions so that the query remains “optimizable” and the query language declarative.

Conceptually, the solution is really not incredibly hard. What is hard is to build the correct design and theoretical framework.

It’s hard to build a new system that still looks-and-feels like SQL. But that’s what we accomplished with RAW with a great deal of integration between miscellaneous concepts and ideas.

What is RAW?

RAW is the embodiment of the ideas described above.

RAW a query engine that is fast, flexible and easy to use. It is embodiment of the ideas described above.  It supports CSV, JSON, XML out of the box. You can join all this data together, whether in a laptop or across a large cluster. You never create schemas or do “data management-y” tasks. It’s all done with queries: want to convert a string to date/time? Write a query that does it. If the resulting query – or “view”, which are equivalent in RAW – is used often, RAW will make sure to “cache” those query result (or parts of it).

RAW can do what traditional query engines can do. But it can also do much more. For instance, querying machine logs, IoT data, is a breeze with RAW. It is fully supported and very easy given that there is no preparation work involved.

The competition?

The old players can’t do it. The new ones are missing the point.

Here’s why: if the query engine is based on relational theory, then there is a whole set of operations it won’t be able to model. That’s the “old players”.

The new players focus on “data integration”. That’s missing the point: no one wants multiple engines plus an engine on top to connect all those engines. Instead, use a more suited engine in the first place, and skip all the layers.

 

Our Secret Plan

With that in mind, we come to our secret plan.

  1. We are extremely conscious that it’s hard to bring a new product to market, particularly one that feels different. The proof’s in the pudding, so we let you try it for free, on the Cloud. This is a full-featured service, only limited in the number of resources we let you use.
  2. We build a set of paid services using RAW that are uniquely suited to be done in RAW. That’s the initial “go-to-market”. Examples are analysis over machine log data, which is particularly painful to do today, or sensor data in general.
  3. With that revenue, we continue to add support for more data sources and data types and expand our cloud offering with paid services.
  4. And with that revenue, we build RAW on-premises and then go large-scale!

Just don’t tell anyone!!

 

Why does RAW have its own query language?

For the majority of today’s use cases, users of RAW write good old SQL.

The truth, however, is a bit more complex than that. So let’s get down to the details!

First off, SQL is arguably the most successful language in existence. Therefore, when developing RAW we took extreme care to maintain SQL compatibility to the extent possible.

With that said, RAW allows you to perform operations that go well beyond SQL’s design point. One such case is handling complex hierarchical data. It was possible to design RAW to handle some classes of hierarchical queries while trying to maintain closer compatibility to SQL. In fact many vendors do so nowadays. But the net result is that users are forced to use tools other than the database system to perform certain classes of “data transformations”. The support for hierarchical data or array data is always a second-class citizen. Users need to learn new syntax and the behavior is not always coherent. From our perspective, this was deemed a poor choice, as the query engine is ideally suited to perform all data transformations and optimizations.

Moreover, we felt that SQL was in some domains showing its age. For instance, lack of type inference, or lack of polymorphic functions means that the developer is tasked with boilerplate that the database system should take care of. Also, lack of nested data means that users are forced to “normalize data” and then perform expensive OUTER JOINs when the database engine could know better, if it had not lost the hierarchical information in the first place.

For these (and other!) reasons, RAW was born.

RAW combines classical database optimizations – e.g. algebraic query plans that get reordered based on historical execution plans or collected information – with optimizations typically found in compilers for functional programming languages. We find that the combination of two worlds – databases and functional programming – results in an ideal query language, better suited for today’s problems and building upon the research work done in recent decades.

With that in mind, our (admittedly ambitious) goal is to have RAW one day replace SQL as the standard query language.

Welcome to RAW

Welcome!

This is the first of a series of posts describing RAW, an adaptive, near real-time query engine that works directly over raw data. It’s a system we’ve been working on for several years now, first in academia as database researchers and more recently, as part of RAW Labs.

RAW is unique in multiple ways; it is a clean slate design for a database engine, accomplish much that is new and unique and occasionally breaking a few “traditional approaches” for a query engine.

To start, RAW queries data directly in its original location and format. For instance, you can query CSV, JSON or XML files directly without having to define schemas, building indexes, load data or flatten hierarchies. All these tasks are done autonomously by RAW, based on how you are actually querying your data. This is our first major shift from tradition. Normally, you’d think long and hard about how you’ll be using your data, plan accordingly, perhaps choose the most adequate query engine, load data to it and then tune it. All that work to very often be proven wrong, because the way you’ll actually use your data is not always how you planned! So we do it the other way around: you query your data, RAW will figure out how to “build itself” based on the queries it’s receiving. As a side effect, RAW is as “real time” as it can get: new data arrives and you can immediately query it!

The other big change is on data formats. Relational databases really only support tabular data as a first-class citizen. But there’s a lot more other data out there: e.g. JSONs, XMLs, array data. There is no reason not to support those formats natively and with full query capabilities (and by that, I don’t mean custom UDFs and “JSON field types”). But of course such a change needs a fully new theoretical framework, no longer based on bag/set theory. And that’s RAW: it supports a rich, extended query language, similar to SQL for most practical purposes, but which goes well beyond it. In fact, we find that RAW’s language is so rich that it eliminates the needs for “scripts” that were typically used to clean or flatten data before loading it to the database.

It’s a powerful combination of ideas: querying data directly from source and in real-time; optimizing and building the database based on usage; supporting complex data formats. We’ll explore them further in the next posts, so stay tuned!