In this blog post, we’ll delve into how Dataflow Gen2 can enhance performance, especially if you’re currently facing performance challenges with Dataflow Gen1.
Looking Back at Power Query
I’ve touched upon the evolution of Power Query in a previous post, highlighting its journey from Excel and Power BI Desktop to Dataflows Gen1 and Dataflows Gen2. The discussion elucidated the differences in implementation and best practices. Read more about it here.
Why We Used Dataflow Gen1
Previously, Dataflow Gen1 was used to perform transformations in the Power BI Service and to store the results for sharing. This made it easy to reuse data without going back to the original source, a good step toward having a single source of truth.
This approach worked well, but there was a problem: the data was saved as CSV files. You can learn more about this here. CSV storage isn’t well suited to handling immense data volumes.
The Enhanced Compute Engine
To address the problem of large data volumes, the enhanced compute engine was introduced for Premium capacities; note that this is a Premium-only feature. You can find more details here. After the data is saved in the CDM folder format, the enhanced compute engine loads it into an Azure SQL DB-based cache and uses query folding against that cache. This produces significant performance improvements for the downstream refresh of computed entities.
In short, the new engine made the dataflow a query-foldable source, which sped up most workloads. But it still wasn’t ideal for very large data volumes.
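To make the idea of query folding concrete, here is a minimal Python sketch of the concept. All names and data here are illustrative, not a real Dataflow or Power Query API; in practice the engine translates your transformation steps into SQL that runs against the cache.

```python
# Minimal sketch of query folding: instead of pulling every row and
# filtering locally, the filter is pushed down so the source evaluates
# it and only matching rows cross the wire.
# All names here are hypothetical, for illustration only.

ROWS = [{"id": i, "region": "EU" if i % 2 else "US"} for i in range(10)]

def fetch_all():
    """Non-folding path: the client receives every row, then filters."""
    return list(ROWS)

def fetch_folded(predicate):
    """Folding path: the predicate runs at the source; fewer rows transfer."""
    return [r for r in ROWS if predicate(r)]

# Without folding: all 10 rows are transferred, then filtered down to 5.
local = [r for r in fetch_all() if r["region"] == "EU"]

# With folding: only the 5 matching rows are ever transferred.
folded = fetch_folded(lambda r: r["region"] == "EU")

assert local == folded
```

The results are identical; the difference is how much data moves between the source and the engine, which is exactly where the refresh-time savings come from.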
A New Way: Dataflow Gen2
Dataflow Gen2 takes a different approach: it comes with its own storage. You can learn more here. This improves on the earlier CDM folder and Azure SQL DB-based approaches.
The biggest advantage of Dataflow Gen2 is that it keeps storage and compute separate. You can save your results directly to more appropriate destinations, such as a Fabric Lakehouse, a Fabric Warehouse, and others. This flexibility allows for optimized storage formats, such as delta tables, which store data in the parquet file format. It also promotes collaborative data transformation by leveraging a diverse team’s capabilities: using data Pipelines for ingestion and orchestration, and employing Dataflow Gen2 or Fabric Notebooks (Python, R, SQL, Scala).
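Part of why delta/parquet beats CSV at scale is that parquet is column-oriented. As a rough stdlib-only sketch of the idea (parquet itself adds compression, encoding, and metadata on top; this toy layout is mine, not the parquet format):

```python
# Toy illustration of row-oriented (CSV-like) vs column-oriented
# (parquet-like) layout. A columnar layout lets a query read only the
# columns it needs instead of every field of every row.

rows = [
    {"id": 1, "amount": 10.0, "region": "EU"},
    {"id": 2, "amount": 20.0, "region": "US"},
    {"id": 3, "amount": 30.0, "region": "EU"},
]

# Row-oriented: each record stored contiguously (like one CSV line).
row_store = [(r["id"], r["amount"], r["region"]) for r in rows]

# Column-oriented: each column stored contiguously.
col_store = {
    "id": [r["id"] for r in rows],
    "amount": [r["amount"] for r in rows],
    "region": [r["region"] for r in rows],
}

# Summing one column touches only that column in the columnar layout,
# rather than scanning all three fields of every record.
total = sum(col_store["amount"])
```

On wide tables with hundreds of millions of rows, skipping unneeded columns (plus compression per column) is a large part of the performance gap described below.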
Best Practices with Dataflow Gen2
It’s crucial to understand that if you opt for an external destination using Dataflow Gen2, you’ll most likely want to disable staging for the majority, if not all, of your queries. This action transforms it exclusively into an extraction/transformation tool, moving away from its earlier storage container functionality. The primary benefit of using an external storage system over the dataflows’ inbuilt one is the ability to share data at a more granular level, instead of the workspace-level sharing used by both Dataflows Gen1 and Gen2. More on this can be explored here.
For those wanting to harness the innovations of Dataflow Gen2 while maintaining the dataflow as a storage container, it’s recommended to enable staging on all shareable queries, since only the staged queries appear as tables when you use the dataflow as a storage container.
A Comparative Analysis
To put things into perspective, consider my experience with a Dataflow Gen1 containing 140 million rows. I needed to apply a filter, reducing the data to approximately 12 million rows. Because the source is a Gen1 dataflow, this query uses the Dataflow Gen1 enhanced compute engine.
This straightforward filter operation took between 2.5 and 3.5 hours. However, moving the original unfiltered 140 million rows to a Fabric Lakehouse and connecting a Dataflow Gen2 reduced this time to a mere 2.5 to 3.5 minutes! Opting for a Dataflow Gen2 storage location (with staging enabled) further decreased the duration to 2.5 to 3 minutes.
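The improvement is easy to quantify: going from roughly three hours to roughly three minutes is about a 60x reduction in refresh time for the same filter. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope speedup from the refresh times reported above.
gen1_seconds = 3 * 3600   # ~3 hours with Dataflow Gen1 enhanced compute
gen2_seconds = 3 * 60     # ~3 minutes with a Fabric Lakehouse source

speedup = gen1_seconds / gen2_seconds
print(f"~{speedup:.0f}x faster")  # prints "~60x faster"
```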
The evolution of data management and sharing tools has always been toward optimization and efficiency. As demonstrated, the transition from Dataflow Gen1 to Gen2 has brought remarkable improvements in processing times for large data volumes. Keep in mind, though, that the best practices for Dataflow Gen2 are different. The main decision you have to make is whether to use the Dataflow as a data-sharing container. Since you can still only share Dataflows at the workspace level, you should consider transitioning to using Dataflows purely as extraction and transformation items.