A brief history of Power Query and its best practices, and why Dataflow Gen2 is a big deal for separating the E, T and L, adding compute power and flexibility to your complex data preparation process.
Introduction
Power Query, which debuted as an Excel add-in back in 2013 under the name “Data Explorer”, is a business-user-friendly tool for data-related duties in the ETL (extract, transform, load) realm. Tasks that were otherwise executed manually or with the assistance of some VBA macro became available just a few mouse clicks away, in a consistent and reproducible way never seen before. I mean, if you forgot something in an early step, just correct it and hit refresh. Or if you want the most recent data, just hit refresh! With it, it’s never been easier for the regular non-IT business user to get up-to-date, clean data into a spreadsheet.
Since the incorporation of Power Query into Analysis Services and Power BI, the data sources and transformation features have gotten even sharper. All this with an easy and intuitive UI (now being mimicked by Data Wrangler to write Python code) that writes a unique and not-so-easy-to-master language, M. Kudos to Ben Gribaudo, who goes deeper than anyone I know into this realm.
It’s not all sunshine and rainbows
But it has a problem. It is well suited for query-foldable sources and small non-foldable ones because it relies on concepts like streaming, which works quite well with query folding by default. Streaming is an optimization, but if your source is big, slow or not foldable, or your transformation is complex, you may start to run into performance issues.
That’s because its mashup engine processes data in containers that are limited in both quantity and size, making it well suited for small data or for transformations that can be pushed down to the source (aka query folding). It has no built-in high-speed staging area to isolate data sources, meaning that the same operation can end up being executed more than once, depending on the complexity of your transformation. Due to the engine’s limited resources, the loaded queries are evaluated independently from one another, from output back to input, without much orchestration or optimization, so a common source or intermediate query will most probably be evaluated more than once.
Dataflow Gen 1 as an upgrade
That’s where Power Query Online kicks in. Available in Power Platform and the Power BI Service (now Microsoft Fabric) under the name Dataflow Gen1, it has roughly the same evaluation engine as Power BI Desktop during edit view, but on a Premium capacity the engine gets its own storage location when you enable load on intermediate tables to improve performance (that’s when the lightning icon appears). That means your slow data source will be hit only enough to materialize the table, saving it to a persistent landing/staging area. Your merges and complex logic on intermediate tables will then be evaluated once, allowing your dataflow to refresh faster.
By using Dataflow Gen1, you’re tied to an artifact that has both storage and compute built in. The bad part is that the Gen1 implementation shares its output data only via workspace-level permissions, which leads to problems, as I previously discussed here. It also brings other internal improvements through the barely documented enhanced compute engine, which delivers a great performance boost in most cases if you make good use of the staging area.
Note that the experience of maintaining a complex dataflow doesn’t currently benefit from the staging materialization performance improvements: refresh might be fast, but in the Power Query Online editor you’ll face a longer-than-expected experience when troubleshooting or changing your code.
Meet Dataflow Gen2
As part of Microsoft Fabric and its guideline of separating storage and compute, as we previously discussed here, Dataflow Gen2 internally leverages other Fabric items, such as a Lakehouse and a Warehouse, to stage and compute intermediate queries. This way you implicitly leverage a really powerful engine capable of properly handling larger amounts of data. Read more about the internal implementation here. Note that those items are currently visible but will soon be hidden from the user, since they are intended for internal use by the Dataflow Gen2 implementation only.
Gen2 lets you set the destination of each table separately
It also comes with the powerful feature of selecting the output destination to other storage items like a Lakehouse or Warehouse. This is key to treating Dataflow Gen2 as a separate E or separate T engine that loads data to other storage-specific Fabric items. As we previously discussed here, you can leverage a team’s multidisciplinary skills and preferences by using Data Pipelines to orchestrate a job that extracts data using notebooks and transforms it using Dataflow Gen2, or vice versa. This flexibility certainly reduces the time to value, which in my view is the main feature that makes Fabric irresistible.
Even if you plan to do your ETL only with Dataflows, you can leverage Data Pipelines orchestration on top of the output destination of each query to build smaller, more specific Dataflows, getting around some of the maintenance torture we talked about earlier. You might be thinking this will lead to even more items in your Fabric workspace, and the answer is YES. But we already have an improved workspace item filter experience, and let’s hope they deliver workspace folders in the near future!
New options to update data give you flexibility
Along with the ability to select a custom destination, you also have the option to append data to it. That means you can read the data already at your destination, use it to query your external source for more data, and then decide which data to append (or not) to your destination. You can see a sort of incremental refresh pattern here.
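To make the idea concrete, here is a minimal M sketch of that pattern, assuming two hypothetical queries: DestinationTable, pointing at the data already written to your output destination, and SourceTable, pointing at the external source. The column names are illustrative only.

let
    // Latest date already present at the destination
    LastLoaded = List.Max(DestinationTable[OrderDate]),
    // Pull only rows newer than what was already loaded
    NewRows = Table.SelectRows(SourceTable, each [OrderDate] > LastLoaded)
in
    // With the destination update method set to append,
    // only these new rows are added on each refresh
    NewRows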
Okay, but what are the other differences between Dataflow Gen1 and Gen2?
You can see a Dataflow Gen1 vs Gen2 comparison table here. Keep in mind that Dataflow Gen2 is currently in public preview and things are changing (for the better) at a really fast pace. The link also includes other new features, such as saving a draft, enhanced refresh history and monitoring capabilities, that solve some limitations of Gen1.
Gen2 is currently the only place where you can import PQT files
Yeah, you can just Ctrl+C / Ctrl+V queries from/to whatever Power Query version you have, but if you have a really big and complex set of queries, that won’t work. Your best alternative is to export a PQT (Power Query Template) file, with the bonus that you can archive it in case you need it in the future.
Different meanings for enable load (or enable staging) and setting a new destination
In a local Power Query instance or a non-Premium Dataflow Gen1, it’s advisable to load only the queries you’ll need later: either to avoid loading unnecessary temporary tables into your data model, or to expose only the final output tables when sharing the Dataflow as a data source.
To enhance performance you can place some Binary.Buffer, Table.Buffer or List.Buffer calls, or the more modern Table.StopFolding, at a specific spot. I haven’t found a universal guideline for buffering, as it’s challenging to design a pattern that always works. It’s more of a trial-and-error thing, and if you place the buffer in the wrong place, it might even worsen your performance.
Some unverified rules I have in mind are:
• Never buffer tables as a last step of a query
• Buffer both tables that you include in a merge (preferably after reducing the rows and columns to just those strictly necessary prior to the merge); see the sketch after this list
• Buffer when you need some complex loop logic by using List.Generate or List.Accumulate.
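As an illustration of the second rule, here is a minimal M sketch, assuming two hypothetical source queries named SalesSource and CustomersSource; the column names are made up for the example.

let
    // Keep only the rows and columns strictly needed for the merge
    SalesSlim = Table.SelectColumns(SalesSource, {"CustomerID", "Amount"}),
    CustomersSlim = Table.SelectColumns(CustomersSource, {"CustomerID", "Region"}),
    // Buffer both sides so each is evaluated once and held in memory
    SalesBuffered = Table.Buffer(SalesSlim),
    CustomersBuffered = Table.Buffer(CustomersSlim),
    // The join now runs against in-memory tables instead of re-hitting the sources
    Merged = Table.NestedJoin(SalesBuffered, {"CustomerID"}, CustomersBuffered, {"CustomerID"}, "Customer", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Merged, "Customer", {"Region"})
in
    Expanded

Keep in mind that buffering pulls the table into memory and stops query folding past that point, so it only makes sense when the source isn’t folding anyway.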
Be careful with the sort order of your staged queries
Fabric engines by default apply a sort order optimization called VORDER (aka VertiPaq Order) every time data is written, either to the staging area or to the default destination. This might be a topic for another blog post, but be aware that the Power Query Online experience in Dataflow Gen2 uses pretty much the same mashup engine as the other instances and doesn’t leverage the staged data. When the actual refresh takes place, it uses the staged data, which might have a different sort order.
If your transformations rely on referencing a previously staged table and adding an Index column to later use as an anchor, you might get unexpected results. I mean, what you see in Power Query Online edit mode might not be what is actually stored after the refresh.
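One hedge against that is to sort explicitly before adding the index, so the numbering doesn’t depend on the physical order the staged data happened to be written in. A minimal sketch, assuming a hypothetical staged query named StagedSales with OrderDate and OrderID columns:

let
    // An explicit sort makes the row order deterministic,
    // regardless of how the staged copy was physically written
    Sorted = Table.Sort(StagedSales, {{"OrderDate", Order.Ascending}, {"OrderID", Order.Ascending}}),
    // The index is now stable between edit mode and refresh
    Indexed = Table.AddIndexColumn(Sorted, "RowIndex", 1, 1, Int64.Type)
in
    Indexed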
Best practices for working with Dataflows have changed since Gen1 Premium
When you’re in the Dataflow Gen1 with Premium features realm, you might want to enable staging for pretty much every internal query: either because your source is slow or not query foldable and you don’t want streaming to hit it several times, or because you need to materialize an intermediate table with complex logic behind it to boost performance, or even because you need to merge data from two data sources with different privacy levels, which might activate the Formula Firewall protection against unintended data leaks, as you can deep dive into here.
Dataflow Gen2’s enable staging changes that mindset
Now, with Dataflow Gen2 and the ability to set a destination for each table separately, enable load changed its meaning. In fact, it got renamed from “enable load” to “enable staging”. Why? Because with Gen2 we have the new output destination option, which changes how things are stored and processed.
When you enable staging, you’re allowing the engine to write the output to the staging Lakehouse and/or Warehouse before this data gets copied to its output destination (if configured). Those staging items are shared by every Dataflow Gen2 within the same workspace and will eventually be hidden, since they are part of the internal implementation.
The output destination can be the Dataflow Gen2 itself (for that, just enable staging), but now you’ll probably want to set a separate destination. That’s because Dataflow sharing is still at the workspace level, while writing straight to a Lakehouse or Warehouse gives you more granular permission control, as I previously discussed here.
If you enable staging and also have an output destination, refresh time will be longer because data is written to two places instead of one: the staging items and the output destination. So please, don’t enable staging and set an output destination at the same time, because you’re just doing the job twice unnecessarily.
Never enable staging unless you have a reason for it
So, as a rule of thumb, when dealing with a Dataflow Gen2 you should not enable staging unless you need to benefit from it. You can have a dataflow without any query being staged, as long as you set an output destination for at least one query.
Valid reasons to enable staging on your extraction or intermediate queries, instead of loading data directly to a destination, are:
• When your data source has a large data volume, is not compatible with query folding (i.e., the source engine cannot do the transformation work for you), and you need to apply some transformations to it
• If you are performing compute/memory-intensive transformations, such as joining or aggregating large data volumes
• If you need to join data from different data sources, especially if they have different privacy levels
Be careful if your data source requires the use of a Gateway
A data gateway acts as a bridge between your on-premises data sources and the Power BI service (Fabric). The data transformations and processing occur at the gateway level, on the machine where it’s installed.
Whenever you’re using a Gateway to access a data source, you’re telling the engine:
“Use this gateway to run the whole dataflow to access, transport, and transform the data”
That means that if you enable staging and have an output destination set, the engine actually uses the gateway to write to the staging items; then the data from the staging items is sent back to the gateway, so the gateway can write it to the output destination.
So, especially if your data source requires a gateway, avoid enabling staging at all costs. If you actually need this data to be merged with other sources, I’d recommend having separate dataflows: one to extract data via the gateway to a specific Lakehouse or Warehouse, and another to merge that data and apply the complex logic you need. This orchestration can be managed with Data Pipelines.
Long story short
In conclusion, Dataflow Gen2 offers a powerful system that allows for separation of storage and compute, enhanced refresh options, and improved output destination settings that enable more granular permission control. However, the transition to Dataflow Gen2 also introduces a new set of best practices and considerations, including managing enable staging and dealing with gateway connections, both of which hinge on the correct use of staging and destination settings. Users must recognize these nuances to truly harness the power of this tool. Overall, the journey from Power Query to Dataflow Gen2 emphasizes Microsoft’s commitment to continuous innovation and provides users with more robust options for managing complex data preparation and transformation processes.
I’d like to thank Jeroen Luitwieler for reviewing this post!