In the world of data analytics, copying and moving data across multiple tools is a persistent challenge. Within the Power BI Service, the furthest upstream landing place for external data is a Dataflow, which copies data from the source; that data is then copied again into a dataset (or into other dataflows) before it’s ready for consumption. Before that, the data might have reached the Power BI user through a data warehouse that reads it from a data lake… you get the idea. Multiple copies of the same data exist everywhere.
This process not only adds layers of complexity but also raises concerns about data integrity, consistency, and performance. Roche’s maxim, widely known in the field, states:
“Data should be transformed as far upstream as possible, and as far downstream as necessary.”
My interpretation of this is: “If you find yourself processing the same data more than once, you shouldn’t be.”
Here is a modified version of Paul Turley’s excellent post illustrating the multiple data copies and the many places where businesses can process data and apply business logic.
Microsoft Fabric: A Revolutionary Shift
The arrival of Microsoft Fabric marks a significant shift in the data handling landscape, particularly through its foundational layer: OneLake. In public preview at the time of writing, Fabric is set to revolutionize the way we handle data copies. Designed to decouple storage and compute across various analytical engines, Fabric lets different workloads, including analytics, data science, and real-time operations, use the same version of the data. This restructuring required a complete redesign of Microsoft’s engines, with services such as Analysis Services and SQL now reading and writing Delta tables stored as Parquet files, informally known as “delta parquet”.
This evolution leads to a significant reduction in data silos, making Fabric a truly integrated solution for analytics. Microsoft’s proprietary compression technology, V-Order (I think it is named after VertiPaq Order), originated in the Analysis Services Tabular engine behind Power Pivot, Power BI, and AS models. It enhances the whole system by combining state-of-the-art performance with extreme compression for operations over these column-store Parquet files. The great news is that the file format is open source, so the files can be read by any tool of your preference, eliminating vendor lock-in over the actual data.
OneLake: The Solution to Multiple Data Copies
OneLake, the true foundation of the Fabric solution, addresses the issue of multiple data copies through its innovative approach of creating ‘shortcuts’. These shortcuts enable the creation of virtualized data products, eliminating the need for multiple data copies and data movement. The concept is simple: you create a shortcut to the data you want to access, and it appears within your Lakehouse (an item in Fabric that lets you organize data for a specific purpose) immediately.
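To make the idea concrete, here is a minimal conceptual sketch in plain Python. It is not the OneLake API: the `Shortcut` and `Lakehouse` classes, the `source_lake` dictionary, and the `sales/orders` path are all hypothetical names invented for illustration. The only point it tries to show is that a shortcut behaves like a named pointer to data that stays where it is, rather than a second copy.

```python
from dataclasses import dataclass, field


@dataclass
class Shortcut:
    """Hypothetical stand-in for a shortcut: a named pointer, not a copy."""
    target_store: dict  # the storage location that owns the data
    target_path: str    # path of the table inside that location


@dataclass
class Lakehouse:
    """Toy Lakehouse holding its own tables plus shortcuts to external ones."""
    tables: dict = field(default_factory=dict)
    shortcuts: dict = field(default_factory=dict)

    def add_shortcut(self, name, store, path):
        # Only the reference is stored; no rows are copied.
        self.shortcuts[name] = Shortcut(store, path)

    def read(self, name):
        # Shortcuts are resolved at read time against the owning store.
        if name in self.shortcuts:
            sc = self.shortcuts[name]
            return sc.target_store[sc.target_path]
        return self.tables[name]


# Data owned by some other location (hypothetical example rows).
source_lake = {"sales/orders": [{"id": 1, "amount": 120.0}]}

lakehouse = Lakehouse()
lakehouse.add_shortcut("orders", source_lake, "sales/orders")

# A row written at the source after the shortcut was created...
source_lake["sales/orders"].append({"id": 2, "amount": 80.0})

# ...is visible through the shortcut, because there is only one copy.
orders = lakehouse.read("orders")
```

Because `read` resolves the pointer on every call, there is nothing to refresh or re-sync, which is the property that removes the extra copies described earlier.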
So, every engine reads and writes the same file format in the same place. It’s the true democratization of how you can do things. For example, you’re no longer limited to using Power Query’s M language to make transformations. You can just as well use SQL or Python to achieve the same task. This is HUGE! And as a SaaS (Software as a Service) offering, it strategically caches your data to achieve blazing-fast performance, while the actual data remains untouched at its source.
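As a small illustration of "same task, different language", the sketch below runs one transformation (drop refunds, then total sales per region) once in SQL and once in Python over the same rows. It uses Python's built-in `sqlite3` module as a stand-in for a SQL engine, so it runs anywhere; it does not connect to Fabric, and the `sales` table and its rows are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount).
rows = [
    ("North", 120.0),
    ("South", 80.0),
    ("North", 30.0),
    ("South", -5.0),  # a refund we want to exclude
]

# SQL version: filter and aggregate inside the engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_result = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM sales "
        "WHERE amount > 0 GROUP BY region"
    ).fetchall()
)
conn.close()

# Python version: the same filter and aggregation, written by hand.
totals = defaultdict(float)
for region, amount in rows:
    if amount > 0:
        totals[region] += amount
python_result = dict(totals)

print(sql_result == python_result)
```

In Fabric the two versions would presumably run on different engines against the same underlying table, which is what makes the choice of language a matter of preference rather than of where the data happens to live.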
The Future of Data Management: OneSecurity
To me, the killer feature is yet to be released: OneSecurity. This feature is set to take OneLake’s value proposition to the next level. Currently, we need to set up security and permissions at the storage level, the engine level, or both. OneSecurity aims to bring enhanced security down to the storage layer. By “enhanced”, I mean not only Workspace Security but also fine-grained security like OLS (object-level security: select tables within a lake), RLS (row-level security: hide rows within a table), CLS (column-level security: hide columns within a table), and more.
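The three levels are easier to tell apart with a toy example. The sketch below is plain Python and says nothing about how OneSecurity is or will be implemented; the `lake` contents, the `policy` dictionary, and the `read_table` helper are hypothetical. It only shows what each level filters when the rule is applied once, at the point where the data is read.

```python
# Hypothetical lake with two tables.
lake = {
    "sales": [
        {"region": "North", "amount": 120.0, "customer_email": "a@example.com"},
        {"region": "South", "amount": 80.0, "customer_email": "b@example.com"},
    ],
    "salaries": [{"employee": "Ana", "salary": 5000}],
}

# Hypothetical policy for one user.
policy = {
    "allowed_tables": {"sales"},                         # OLS: which tables
    "row_filter": lambda row: row["region"] == "North",  # RLS: which rows
    "hidden_columns": {"customer_email"},                # CLS: which columns
}


def read_table(lake, policy, table):
    """Return only what the policy lets this user see."""
    # OLS: the table itself may be off limits.
    if table not in policy["allowed_tables"]:
        raise PermissionError(f"no access to table '{table}'")
    visible = []
    for row in lake[table]:
        # RLS: skip rows the user may not see.
        if not policy["row_filter"](row):
            continue
        # CLS: drop columns the user may not see.
        visible.append(
            {k: v for k, v in row.items() if k not in policy["hidden_columns"]}
        )
    return visible


north_sales = read_table(lake, policy, "sales")
```

Here `north_sales` holds a single row with only `region` and `amount`, while asking for `salaries` raises `PermissionError`. Enforcing the rules in one shared read path, rather than once per engine, is the idea behind pushing security down to the storage layer.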
This paradigm shift has the potential to substantially mitigate a multitude of data governance issues. Once fully implemented, OneSecurity will pave the way for truly governed democratization of data analytical workloads. This brings us closer than ever to achieving a single, trustworthy source of truth in our data. In the meantime, when more comprehensive security is needed, we continue to depend on the security provisions enforced by each individual engine.
In conclusion, the advent of Microsoft Fabric and OneLake has kickstarted a revolution in how we manage data in Power BI, offering potential solutions to the longstanding problem of multiple data copies. As I mentioned in my last article, it is a substantial improvement for Power BI users who require more alternatives to achieve their data transformation goals at their own pace.