From Data Lakes to Data Reservoirs


This post was originally published by Scott Haines at Towards Data Science

The Emergence of Standards

Good ideas take hold and quickly spread like wildfire. Recently the data community has standardized on at least one core data format that is good enough to get behind. That is the file storage format Parquet and we are going to learn a little more about why this is such an excellent choice for our data at rest. Data at rest just means it isn’t currently in active memory.

Thanks Google for the gift of Parquet.

What makes Parquet Parquet? Credit: Apache Parquet docs

Parquet, while looking like some kind of voodoo is actually just a really well executed columnar data storage primitive.

When you break down the file format you will see you have rows and columns and types.

This allows for large collections of records encapsulated by common format (schema) to be efficiently compressed or compacted into a much smaller footprint.

For example, If I have 40 million records stored as JSON with an overhead of between 1–2kb per record, I could reasonably have around 40gb of files.

Now if you were to take that JSON and convert it to parquet you can expect around a 50–80% savings in terms of file overhead cost. But in addition to the cost savings comes the ability to easily partition you data and compress things further using snappy, gzip, lz4, or any binary splittable compression format.

I can say from experience that around 40–80 million records of medium complexity (eg. more than 20 fields) can be broken down across 128 partitions a day, and stored between 12-24gb! This includes the parquet headers and footers as snappy compressed partitions and enables distributed reading of the dataset which fits nicely back into many of our use cases which all center around a Spark centric data platform.

Cloudera Benchmarking (parquet/avro/csv). Photo Credit: Cloudera: Parquet wins

However reduction in size on disk isn’t the only big win. Once your data is stored as parquet you have the ability to do row or column level efficient skipping within the data — so retrieval of 1–2 records within say 40 million may only taker around 5 seconds. Parquet is able to read just the footers to check if some content lives inside a block of parquet prior to reading the actual content. It uses columnar statistics to help speed up this process of intuitive row/column level extraction and filtering. You can see this benchmark from Cloudera if you want to see how it compares to say Avro.

Lastly, parquet has been adopted as the defacto standard for many of the most widely used Data Engineering and Data Analytics frameworks and platforms in operation today like Apache Spark (SparkSQL natively interoperates with Parquet out of the box), Presto, DeltaLake and Apache Hudi, and many many more.

Apache Spark is the data Catalyst

The emergence of Apache Spark can be thanked for holding the baton and leading the march in terms of organizing and unifying an entire community of data practitioners. Databricks, the company behind the Spark community has gone to a lot of trouble to keep the community engaged and to continue to produce amazing new technology that live and run within the Spark ecosystem. One of the newer additions to the core SparkSQL engine that adds Atomic operations, Upserts (Insert or Update), Merge, Delete and additional conditional SQL logic directly into the Spark DSL is a framework named DeltaLake. This subtle enhancement written to work with existing Data Lake architectures, brings best practices including file system layout, operational observability as well as nice audit trail support for operations done to the records existing within a Delta Table.

The Delta Lake Ecosystem: Photo Credit: Databricks/Delta

Remember before when I brought up the problem of poisoning the Data Lake and how things needed some kind of order and governance in order for them to evolve into something greater. I called it the Data Reservoir and I believe that DeltaLake has standardized exactly how a Data Lake should work with the correct safe guards in place to ensure that the Data Lake remains clean and beautiful over time. Let’s dive into how this is possible.

Delta Schemas and File Layout

From the first write into a delta table, Spark is using the underlying StructType associated with the DataFrame it is writing into your Data Lake to ensure that the schema will be preserved and that new writes adhere to that strict schema or do a “mergeSchema=true” on write operation. Below is an example of writing a DataFrame to Delta.

Writing a DataFrame in batch mode to a Delta.

This operation will be stored with an Atomic Commit log.

Example of the _delta_log from our operation above.

The rest of the DeltaLake Table. Partitioned by year,month,day.

This schema and file location metadata is stored in what is called the _delta_log and this exists in the root of your Delta Table. This little directory exists along side the root of the file system hierarchy which encapsulates your “Table”.

Example of the Delta Commit Log. Credit: Self

Now any additional operations that occur to the Delta Table will have information about the files mutated, the version of the current table, the schema associated with the table. This gives you eyes and ears into the operations that have been applied to the data stored in this Data Lake (DeltaLake).

Reading from DeltaLake is a Breeze

Reading and Transforming the data for Analytics is simple thanks to Delta. Credit: Self

As you can see from the above code snippet. Reading from delta is simple. As a neat fun fact, in the query above we don’t reference a specific path in our load command to the data and we can use what is known as partition push down in order to skip directories and only scan the contents of the month of june (col(“month”).equalTo(6) in our where clause above. This takes advantage of the file system so you don’t have to load all table entities into place. Saves on time and $$$.

Batch Write and Full Continuous Streaming Write are supported

Delta handles Batch Read / Write and Streaming Read / Write. Sweet. Credit: Self

Above is an example of creating a readStream off of the Delta table we just created with the batch process before. The novel thing that Delta brought to the table was the mixed use case support of batch read/write, streaming readStream/writeStream. Think about Airflow which solves some difficult use cases — airflow is basically just triggering child jobs when a parent job completes, however DeltaLake allows you to generate sophisticated streaming DAGS (Directed Acyclic Graphs) of spark jobs which should make you drool just a bit if you love Spark. Spark to the rescue to solve all your Spark related needs.

Spread the word

This post was originally published by Scott Haines at Towards Data Science

Related posts