We founded RAW Labs in 2015, after building and maturing RAW for over four years in academia. Our motivation, as is often the case, was frustration. Frustration with scientific applications that cannot rely on database engines and instead build their own homemade solutions at great cost. Frustration at watching new, incredibly useful paradigms for data management emerge (machine learning, for example) and seeing how inadequate current technologies were at coping with them. Frustration with the countless hours spent writing scripts to load data into the database, or figuring out how to tune the query engine. Frustration with ORM layers. Frustration with database engines that continue to expect that “all data belong to us”, when data grows so much faster than the database engine can ingest it. And frustration because the idea of the data warehouse as a single source of truth had failed, yet few seemed to be doing much about it.
The solution grew gradually in our heads, and with the time to experiment that academia gave us, it became obvious that we were onto something significant. It combines ideas taken from multiple domains of computer science, including compilers, functional languages and database research, as well as mathematics.
Let’s disentangle the issues:
It takes too long to load data. Solution: don’t load data. Instead, design the engine to query at source.
It’s hard to write scripts to load data. Solution: don’t write scripts to load data. Write queries instead, with features to do “script-y stuff”.
It’s hard to tune the database engine. Plus, requirements change all the time, so even if it is tuned correctly, tomorrow’s queries will be different from today’s. Solution: don’t tune the database. Let it tune itself based on usage.
Modern applications have data formats that are rich and complex; not just tables and not easily modeled as tables. Solution: support rich data formats. Bonus: ORM layers now have straightforward mappings to modern programming languages.
Modern data transformations are more complex than SELECTs and JOINs. Solution: support operations other than classical database algebraic operators, but make sure to find the correct math abstractions so that the query remains “optimizable” and the query language stays declarative.
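To make “let it tune itself based on usage” a little more concrete, here is a minimal Python sketch of one possible interpretation: a query’s result is only cached once the query has proven popular. This is not RAW’s implementation. The class `UsageTunedCache`, its `threshold` parameter and the `expensive_query` function are all hypothetical names invented for this illustration.

```python
from collections import Counter
from typing import Any, Callable, Dict, Hashable


class UsageTunedCache:
    """Hypothetical helper: cache a query's result only once it is used often."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold          # executions needed before caching
        self.hits: Counter = Counter()      # how often each query was executed
        self.results: Dict[Hashable, Any] = {}

    def run(self, key: Hashable, query: Callable[[], Any]) -> Any:
        # Popular queries are answered from the cache without re-execution.
        if key in self.results:
            return self.results[key]
        self.hits[key] += 1
        result = query()
        # Once usage reaches the threshold, keep the result for next time.
        if self.hits[key] >= self.threshold:
            self.results[key] = result
        return result


calls = []


def expensive_query() -> int:
    # Stand-in for a query that would be costly to re-run.
    calls.append(1)
    return sum(range(1000))


cache = UsageTunedCache(threshold=2)
for _ in range(5):
    cache.run("total", expensive_query)

# Number of times the query actually executed across the five requests.
print(len(calls))
```

Nobody configures anything up front here: the decision to cache follows from observed usage alone. A real engine would also have to invalidate cached results when the source data changes, which this sketch leaves out.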
Conceptually, the solution is really not incredibly hard. What is hard is to build the correct design and theoretical framework.
It’s hard to build a new system that still looks and feels like SQL. But that’s what we accomplished with RAW, through a great deal of integration between miscellaneous concepts and ideas.
What is RAW?
RAW is the embodiment of the ideas described above.
RAW is a query engine that is fast, flexible and easy to use. It supports CSV, JSON and XML out of the box. You can join all this data together, whether on a laptop or across a large cluster. You never create schemas or do “data management-y” tasks. It’s all done with queries: want to convert a string to a date/time? Write a query that does it. If the resulting query (or “view”; the two are equivalent in RAW) is used often, RAW will make sure to “cache” that query’s result (or parts of it).
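As a rough illustration of the idea, and not of RAW’s actual query language or API, the following Python sketch joins a CSV source with a JSON source in place and converts a string to a date/time as part of the “query” itself, with no schema creation or loading step. The sample data, the field names and the function `query_readings` are assumptions made up for this example; the in-memory strings stand in for files that would normally sit on disk or in object storage.

```python
import csv
import io
import json
from datetime import datetime

# Hypothetical sample sources, kept in their original formats.
SENSORS_CSV = "sensor_id,location\ns1,basement\ns2,roof\n"
READINGS_JSON = (
    '[{"sensor": "s1", "at": "2015-03-01T10:00:00", "temp": 18.5},'
    ' {"sensor": "s2", "at": "2015-03-01T10:05:00", "temp": 4.0}]'
)


def query_readings():
    """Join JSON readings with CSV sensor metadata, reading both at source."""
    # Build a lookup from the CSV source; no table or schema is declared.
    sensors = {
        row["sensor_id"]: row["location"]
        for row in csv.DictReader(io.StringIO(SENSORS_CSV))
    }
    for reading in json.loads(READINGS_JSON):
        yield {
            "location": sensors[reading["sensor"]],
            # The string-to-date/time conversion is part of the query.
            "at": datetime.fromisoformat(reading["at"]),
            "temp": reading["temp"],
        }


rows = list(query_readings())
for row in rows:
    print(row["location"], row["at"].isoformat(), row["temp"])
```

In an engine like the one described, the same intent would be written declaratively as a single query, and the engine would decide how to read, join and cache the sources.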
RAW can do what traditional query engines can do. But it can also do much more. For instance, querying machine logs or IoT data is a breeze with RAW: it is fully supported and very easy, given that there is no preparation work involved.
The old players can’t do it. The new ones are missing the point.
Here’s why: if the query engine is based on relational theory, then there is a whole set of operations it won’t be able to model. That’s the “old players”.
The new players focus on “data integration”. That’s missing the point: no one wants multiple engines plus an engine on top to connect all those engines. Instead, use a better-suited engine in the first place, and skip all the layers.