This is an essential primer tailored for executives seeking a high-level understanding of Data Warehouses and Data Lakes, the cornerstones of your company's data infrastructure. Drawing from extensive experience in building and consulting on hundreds of cost-effective Data Lakes and Data Warehouses, this piece guarantees to equip you with the key concepts needed to make informed decisions about implementing a data lake or data warehouse for your organization.
Concept #1: Getting your data under one roof
If your company is anything like most of the companies in existence today, you are relying on many different systems to support your business. Most companies have a "primary system", sometimes called an ERP, that helps to run the primary functions of the company. But you also have systems for payments, sales, marketing, lead generation, and accounting just to name a few. And for most companies the data for each of these systems is in a silo. How do you get business insights from all of these systems together? The answer is to regularly extract the data from each system and put it all under one roof. This is the basic idea behind data warehouses and data lakes.
Concept #2: How Data Warehouses work
Suppose now that you have figured out how to extract all of the data you need from these systems. How are you going to organize it? This is the primary question at the heart of Data Warehouses vs Data Lakes.
Data Warehouses were the first solution to the problem of getting all of your corporate data under one roof. In a Data Warehouse, you extract the data from all of the various sources (the "extraction" phase). Then you transform the data into a structure that is ideal for querying (the "transform" phase). Finally you load the data in the data warehouse (the "load" phase). This process is often known by the acronym "ETL" for Extract, Transform, Load.
But at the very beginning, how can you know how the data should be structured? To know this you would need to understand how you are going to use the data once the process is done. And unless you have psychic abilities, you don't know. This is one of the reasons that data warehouse projects are known to go drastically over budget. After the first ETL is complete and the business users start using it, it becomes clear that organization of the data doesn't support all the questions we want to ask it.
Concept #3: How Data Lakes work
But what if you bake the idea that you will always and forever ask questions of the data that you couldn't have known ahead of time? Business needs change. The data we capture changes too. New insights require flexibility. This is where the Data Lake really shines.
Data Lakes don't follow the ETL process. Instead you "transform" the data at the last possible moment. Here's how it works.
You start just like the Data Warehouse, extracting the data from source systems (the "extract" phase). But, now you are going to do something novel... load the data into the Data Lake unchanged (the "load" phase). Now if you are a technical person, you may ask how this is possible since you can't really load data into a database unless you have a structure already defined. And that's true.
But a Data Lake isn't a database. It's a file system that can be queried. This has huge advantages, not least of which is the ability to store different types of data as files in your Data Lake:
- Relational Data (think Excel worksheets with rows and columns)
- Non-relational Structured Data like JSON (ex.
- Non-Structured Data (ex. log files)
Wait! You're saying that different kinds of data can be queried together?
Yes that is what I'm saying and with SQL no less. There are many tools that do this, but the most common one is Presto which is available in AWS as Athena and in Azure as Azure Data Explorer (ADX).
Whenever you're ready, we can help you with your Data Lake or Data Warehouse. No matter where you are in the data journey we can help you get to the next level. Just click here to schedule a free meeting.