
Getting Started with Dataflow Gen2


Dataflow Gen2 is a powerful low-code ETL tool within Microsoft Fabric. It is built on Power Query, the same engine used in other Microsoft products such as Power BI and Excel. It offers a wide variety of connectors for getting data, 100+ transformations you can apply through a visual interface, and the ability to save the results to a Lakehouse, Warehouse, Kusto DB, or Azure SQL Database.

Let’s create a new dataflow and explore it in more detail:


To create a new Dataflow Gen2, go to the Data Factory experience and click on “Dataflow Gen2”.




Click on “Get data” to get started. You will see a wide range of connectors; almost all popular storage systems are covered.



Here, I have added 3 tables from a Warehouse and performed some operations so that we can explore the options further.




1) Power Query ribbon – Here you will find options to get data, perform transformations, and add a destination. There are numerous data transformations possible, such as the following (a short M sketch follows the list):

  • Filter and Sort rows

  • Pivot and Unpivot

  • Merge and Append queries

  • Split and Conditional split

  • Replace values and Remove duplicates

  • Add, Rename, Reorder, or Delete columns

  • Rank and Percentage calculations

  • Top N and Bottom N
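
To give a feel for how these ribbon options end up as M code, here is a minimal, self-contained sketch using a made-up in-memory table (table and column names are just for illustration, not from the use case below):

let
    // A small in-memory table so the sketch runs on its own
    Source = #table(type table [City = text, Fare = number],
                    {{"Pune", 120}, {"Mumbai", 450}, {"Pune", 120}}),
    // Filter rows: keep fares above 100
    #"Filtered Rows" = Table.SelectRows(Source, each [Fare] > 100),
    // Sort rows: descending by Fare
    #"Sorted Rows" = Table.Sort(#"Filtered Rows", {{"Fare", Order.Descending}}),
    // Remove duplicate rows
    #"Removed Duplicates" = Table.Distinct(#"Sorted Rows"),
    // Rename a column
    #"Renamed Columns" = Table.RenameColumns(#"Removed Duplicates",
        {{"Fare", "TripFare"}})
in
    #"Renamed Columns"

Every option in the visual interface simply appends another step like these.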


2) Queries pane – This shows all the data sources we have added; each one is called a query. Options available here include renaming a query, duplicating a query, and enabling staging.


3) Diagram view – The Diagram view lets you visually see how the data sources are connected and which transformations are applied to them.

 

4) Data preview pane – This shows you a subset of the data, which helps you decide which transformations to apply. You can sort, filter, and set column data types from here.

 

5) Query settings pane – It primarily shows the applied steps. Each transformation you perform is represented as a step, and you can view the M code behind all applied steps in the Advanced editor.
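
As a rough illustration, a hypothetical three-step query would appear in the Advanced editor like this; each applied step is one named expression that builds on the previous one (names and data invented):

let
    // "Source" step: a stand-in table; in a real dataflow this would be
    // the connector call for your data source
    Source = #table(type table [Name = text, Age = number],
                    {{"Asha", 31}, {"Ravi", 28}}),
    // "Filtered Rows" step, built on top of Source
    #"Filtered Rows" = Table.SelectRows(Source, each [Age] > 30),
    // "Changed Type" step, built on top of the previous step
    #"Changed Type" = Table.TransformColumnTypes(#"Filtered Rows",
        {{"Age", Int64.Type}})
in
    #"Changed Type"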



Now let’s take a use case. Suppose I have some tables in a data warehouse, as shown in the screenshot below:



We have to denormalize the Trip table by bringing in all the required columns (date, pickup time, drop-off time, pickup address, and drop-off address), filter the data to the last 6 months, and then load it to a Lakehouse. Let’s get started.


I have added the Trip table from our data warehouse. Notice that you can see records of the related tables, which we can expand to pull the related columns from the Dim tables. We also get some column statistics, which are very helpful for validating and cleaning the data.



Let’s first expand the Time table.



After expanding this, we get the drop-off time, because there is an active relationship between DropoffTimeID in the Trip table and TimeID in the Time table. To get the pickup time we’ll need to do something else, which we will see in a while.

I’ll also rename the column to “Dropoff Time” by simply double-clicking the column name.
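
In M, this expand-and-rename pair looks roughly like the sketch below. This is a self-contained illustration, not the exact code the dataflow generates: the record column and field names (Time, TimeBKey) are assumptions based on the screenshots, and a stand-in table replaces the real warehouse source.

let
    // Stand-in for the Trip query: the related Time table surfaces as a
    // record column named "Time" (names assumed)
    Source = #table(type table [TripID = Int64.Type, Time = record],
                    {{1, [TimeBKey = #time(18, 45, 0)]}}),
    // Expand the TimeBKey field out of the Time record
    #"Expanded Time" = Table.ExpandRecordColumn(Source, "Time",
        {"TimeBKey"}, {"TimeBKey"}),
    // Double-clicking the column header generates a rename step like this
    #"Renamed Columns" = Table.RenameColumns(#"Expanded Time",
        {{"TimeBKey", "Dropoff Time"}})
in
    #"Renamed Columns"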

Similarly, let’s expand the Geography table.



I have expanded the City, State, and Zipcode columns, which are the details of the drop-off location. Now I want to merge these 3 columns into one named Dropoff Address. For this we will use the Merge columns option at the top.




Now I’ll remove the City, State, and ZipCode columns, as they contain redundant information.
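
The merge-then-remove sequence corresponds to M along these lines; a minimal sketch with invented sample rows, assuming the column names from the screenshots:

let
    // Stand-in rows so the sketch runs on its own
    Source = #table(type table [City = text, State = text, ZipCode = Int64.Type],
                    {{"Seattle", "WA", 98101}}),
    // Build "Dropoff Address" by joining the three parts with commas;
    // Text.From guards against ZipCode being numeric
    #"Added Dropoff Address" = Table.AddColumn(Source, "Dropoff Address",
        each Text.Combine({[City], [State], Text.From([ZipCode])}, ", "),
        type text),
    // Drop the now-redundant source columns
    #"Removed Columns" = Table.RemoveColumns(#"Added Dropoff Address",
        {"City", "State", "ZipCode"})
in
    #"Removed Columns"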

Now, to get the pickup time and pickup address details, we need to bring in the Time and Geography tables again and merge them with the Trip table on PickupTimeID and PickupGeographyID.

First let’s get the pickup time.




We did a left outer join because we need all rows from the Trip table and only the matching rows from the Time table. After this I expanded the TimeBKey column from Time and renamed it to Pickup Time.
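
A sketch of that merge in M, with tiny stand-in tables in place of the real Trip and Time queries (key column names follow the screenshots):

let
    // Stand-ins for the Trip and Time queries
    Trip = #table(type table [TripID = Int64.Type, PickupTimeID = Int64.Type],
                  {{1, 101}}),
    Time = #table(type table [TimeID = Int64.Type, TimeBKey = time],
                  {{101, #time(9, 15, 0)}}),
    // Left outer join: keep every Trip row, attach matching Time rows
    #"Merged Queries" = Table.NestedJoin(Trip, {"PickupTimeID"},
        Time, {"TimeID"}, "Time", JoinKind.LeftOuter),
    // Pull just TimeBKey out of the joined table, renaming it in one go
    #"Expanded Time" = Table.ExpandTableColumn(#"Merged Queries", "Time",
        {"TimeBKey"}, {"Pickup Time"})
in
    #"Expanded Time"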

Following the same steps, I’ll create the Pickup Address.

 

Now we need to get the trip date. We could easily do this by expanding the Date table record.

But I want to show you a different way to achieve this.

Notice the DateID column: the first 4 digits represent the year, the next 2 the month, and the last 2 the day. We can use the “Column from examples” feature to create a date out of it.



Select the DateID column and click on “Column from examples”.



In the first row, I’ll type how I want the output to look and hit Enter. It will identify what you are trying to achieve and create the transformation for you (you might need to enter values for 1-5 rows before it identifies the correct pattern).
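
Under the hood, “Column from examples” just generates an ordinary Table.AddColumn step. Here is a self-contained sketch of roughly what it might produce for this yyyymmdd pattern (sample values invented; the real generated formula may differ):

let
    // Sample DateID values in yyyymmdd form
    Source = #table(type table [DateID = Int64.Type],
                    {{20240315}, {20240704}}),
    // Slice the 8 digits into year / month / day and build a date
    #"Added Trip Date" = Table.AddColumn(Source, "Trip Date", each
        Date.FromText(
            Text.Start(Text.From([DateID]), 4) & "-" &
            Text.Middle(Text.From([DateID]), 4, 2) & "-" &
            Text.End(Text.From([DateID]), 2))),
    // Set the new column's type to date
    #"Changed Type" = Table.TransformColumnTypes(#"Added Trip Date",
        {{"Trip Date", type date}})
in
    #"Changed Type"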




After the column is created, we’ll change its type to date.

Now, the last part of our requirement is to keep only the last 6 months of data. Click on the down-arrow icon of the Trip Date column and check the date filters.



As you can see, there are a lot of filtering options here; we’ll use “After” and specify the date.
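
In M, the “After” filter becomes a Table.SelectRows step with a hard-coded cut-off date. The sketch below uses invented rows and, instead of a fixed date, computes the cut-off relative to today so the query always keeps the trailing 6 months:

let
    // Stand-in rows so the sketch runs on its own
    Source = #table(type table [#"Trip Date" = date],
                    {{#date(2024, 3, 15)}, {#date(2022, 1, 10)}}),
    // The UI generates something like: each [Trip Date] > #date(2024, 1, 1);
    // this dynamic variant keeps rows from the last 6 months
    #"Filtered Rows" = Table.SelectRows(Source, each
        [Trip Date] > Date.AddMonths(Date.From(DateTime.LocalNow()), -6))
in
    #"Filtered Rows"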



Finally, we will add a destination to our Trip table, which in our case is a Lakehouse. We will create a new table, keeping the update method set to Replace.





Click on “Save settings” and then publish the dataflow. Once the dataflow is published and refreshed, you will see our new table created in the specified Lakehouse.

 

Conclusion


Dataflow Gen2 is a low-code ETL tool in Microsoft Fabric. You can get data from more than 100 sources, and 300+ data and AI-based transformations let you reshape data easily and flexibly. It also gives you column statistics, which are helpful for cleaning and validating your data. And finally, you can load the results to a Lakehouse, Warehouse, Kusto DB, or Azure SQL Database.


That’s all for this blog, see you in the next one!
