Ishan Deshpande

What is a Lakehouse?



Large-scale data analytics solutions have traditionally been built around a data warehouse, where data is stored in a structured format. With the growth of big data came the need for high-performance, scalable, and low-cost storage, which is when people started using data lakes. In a data lake, data is stored as files without imposing a fixed schema.

The term “Lakehouse” combines two other terms: Data Lake and Data Warehouse. In a Lakehouse we can store structured data (as in a data warehouse) and unstructured data (as in a data lake) in a single location. It is scalable and can handle large amounts of data.


How to create a Lakehouse?


A Lakehouse is an item in the Data Engineering workload of Microsoft Fabric. To create one, go to the Microsoft Fabric portal and navigate to the Data Engineering home page.

Once you are there, you’ll see an option to create a Lakehouse.



Give it a name and create the Lakehouse; it takes just a few seconds and then opens the new Lakehouse.



How to get data into a Lakehouse?


To do this we have several options:

  • Data Pipeline

  • Dataflow Gen 2

  • Notebook

  • Shortcut

  • Event Stream

  • Upload Files




Let’s get data into our Lakehouse by uploading a CSV file.



Once the file is uploaded, you can use notebooks or Dataflow Gen2 to process the data further. If we want to save this file as a delta table, we can do it very easily by right-clicking on the file and selecting Load to Tables.


Give the table a name, select the option to use the header for column names, specify the separator, and click Load; within a few seconds it creates a delta table for you.


Note – With this approach you don’t have control over the column data types; they are inferred automatically from the values in each column. If you want specific data types, I suggest using a Spark notebook to create the table so that you can define the schema as per your need, as sketched below.
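Here is a minimal sketch of that notebook approach, assuming the uploaded file is Files/sales.csv and using hypothetical column names (adjust both to your data):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

# Hypothetical schema for the uploaded CSV -- adjust names and types to your file
schema = StructType([
    StructField("OrderID", IntegerType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("OrderDate", DateType()),
])

# Read the uploaded CSV from the Lakehouse Files section with the explicit schema
# (assumes this Lakehouse is attached to the notebook as the default Lakehouse)
df = (
    spark.read.format("csv")
    .option("header", "true")
    .schema(schema)
    .load("Files/sales.csv")
)

# Save it as a managed delta table under the Tables section
df.write.format("delta").mode("overwrite").saveAsTable("sales")
```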



Once it’s created, you will see it under the Tables folder.



Delta tables use the Apache Parquet format, a columnar storage format that is highly optimized for analytical queries. They provide ACID transaction support and schema evolution, and they also support time travel, which allows users to query the data as it existed at a specific point in time.
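As a small illustrative sketch (assuming the sales table created above), time travel can be used from a notebook like this:

```python
# Show the delta table's version history (one row per commit)
spark.sql("DESCRIBE HISTORY sales").select("version", "timestamp", "operation").show()

# Query the table as it existed at an earlier version (time travel)
old_df = spark.sql("SELECT * FROM sales VERSION AS OF 0")
old_df.show()
```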


SQL Endpoint for Lakehouse

 

Using the SQL endpoint we can analyze the data and create views with SQL queries, but note that we cannot perform insert, update, or delete operations.



To run SQL queries, select the “New SQL query” option.





For example, a query that groups the sales data by year will return the total items sold per year.
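As a rough sketch (assuming the sales table from earlier, with hypothetical Quantity and OrderDate columns), the same kind of query can also be run against the Lakehouse’s SQL connection string from outside the portal, for example with pyodbc; the server and database names below are placeholders you would copy from the SQL endpoint settings:

```python
import pyodbc

# Placeholder connection details -- copy the actual SQL connection string
# from the Lakehouse's SQL endpoint settings in the Fabric portal
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-sql-endpoint.datawarehouse.fabric.microsoft.com;"
    "Database=YourLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"
)

# Total items sold by year (read-only query; the endpoint does not allow writes)
query = """
    SELECT YEAR(OrderDate) AS OrderYear,
           SUM(Quantity)   AS TotalItemsSold
    FROM   sales
    GROUP  BY YEAR(OrderDate)
    ORDER  BY OrderYear;
"""

cursor = conn.cursor()
for order_year, total_items in cursor.execute(query):
    print(order_year, total_items)
```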

 

You can also use the visual query editor, which is based on Power Query, to analyze the data.



Then, in the model view, we can define relationships and create measures using DAX, just like in Power BI Desktop.



Once this is all set up, we can create reports using this Lakehouse. The biggest advantage is that it supports a live connection, so the data in the report updates within seconds as the data in the Lakehouse changes.

 

Conclusion


A Lakehouse combines ideas from the data warehouse and the data lake. We can get data into a Lakehouse in several ways, such as data pipelines, Dataflow Gen2, and notebooks, among others. Data is stored in delta tables, which use a columnar storage format (Parquet) that is highly optimized for analytical queries and supports ACID transactions, schema evolution, and time travel. Using the SQL endpoint of a Lakehouse you can analyze the data with SQL queries and build data models on top of it. A Lakehouse also supports a live connection to Power BI, which is very useful when you have streaming data.


Overall, it’s a great option for big data workloads: highly scalable, flexible, efficient, and packed with features that make development very easy. That’s all for this blog, see you in the next one!
