Welcome to RAW


This is the first in a series of posts describing RAW, an adaptive, near real-time query engine that works directly over raw data. It’s a system we’ve been working on for several years now, first in academia as database researchers and, more recently, as part of RAW Labs.

RAW is unique in multiple ways: it is a clean-slate design for a database engine that accomplishes much that is new and occasionally breaks with a few “traditional approaches” to building a query engine.

To start, RAW queries data directly in its original location and format. For instance, you can query CSV, JSON or XML files directly without having to define schemas, build indexes, load data or flatten hierarchies. All these tasks are done autonomously by RAW, based on how you are actually querying your data. This is our first major shift from tradition. Normally, you’d think long and hard about how you’ll be using your data, plan accordingly, perhaps choose the most adequate query engine, load data into it and then tune it. All that work, only to be proven wrong very often, because the way you’ll actually use your data is not always how you planned! So we do it the other way around: you query your data, and RAW figures out how to “build itself” based on the queries it’s receiving. As a side effect, RAW is as “real time” as it can get: new data arrives and you can immediately query it!
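To make the idea of "no schema up front" a bit more concrete, here is a minimal sketch in plain Python. It is not RAW's implementation or its query language; the helper names `infer_type` and `query_csv` and the sample data are made up for illustration. It only shows the general principle of working out a column's type at query time, and only for the column the query actually touches.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type (int, then float, else str) that fits every value."""
    for caster in (int, float):
        try:
            for v in values:
                caster(v)
            return caster
        except ValueError:
            continue
    return str


def query_csv(text, column, predicate):
    """Filter rows on one column, inferring that column's type on demand."""
    rows = list(csv.DictReader(io.StringIO(text)))
    caster = infer_type([r[column] for r in rows])
    return [r for r in rows if predicate(caster(r[column]))]


# Hypothetical raw file contents: no schema was declared anywhere.
RAW_CSV = "name,age\nalice,34\nbob,27\ncarol,41\n"

older = query_csv(RAW_CSV, "age", lambda age: age > 30)
names = [r["name"] for r in older]
```

Only the `age` column is ever typed here, because it is the only one the query filters on; `name` stays as raw text until something asks for more.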

The other big change is on data formats. Relational databases really only support tabular data as a first-class citizen. But there’s a lot of other data out there, e.g. JSON, XML and array data. There is no reason not to support those formats natively and with full query capabilities (and by that, I don’t mean custom UDFs and “JSON field types”). But of course such a change needs an entirely new theoretical framework, no longer based on bag/set theory. And that’s RAW: it supports a rich, extended query language, similar to SQL for most practical purposes, but which goes well beyond it. In fact, we find that RAW’s language is so rich that it eliminates the need for the “scripts” that were typically used to clean or flatten data before loading it into the database.
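As a rough analogy for querying hierarchical data without flattening it first, the sketch below uses a nested Python comprehension over a small JSON document. This is not RAW's language, and the data and field names (`customer`, `orders`, `price`) are invented for the example; it only illustrates what it means for a query to reach into nested collections directly.

```python
import json

# Hypothetical nested document: each customer carries a list of orders.
RAW_JSON = """
[
  {"customer": "alice", "orders": [{"item": "disk", "price": 80},
                                   {"item": "cable", "price": 5}]},
  {"customer": "bob", "orders": [{"item": "ram", "price": 45}]}
]
"""

customers = json.loads(RAW_JSON)

# For each customer, sum the orders priced above 10, iterating the nested
# "orders" list in place instead of flattening it into a separate table.
totals = {
    c["customer"]: sum(o["price"] for o in c["orders"] if o["price"] > 10)
    for c in customers
}
```

The inner collection is filtered and aggregated where it lives, which is the kind of step that would otherwise need a separate flattening script before loading.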

It’s a powerful combination of ideas: querying data directly from the source and in real time; optimizing and building the database based on usage; supporting complex data formats. We’ll explore them further in the next posts, so stay tuned!
