Make Dim Tables with Power Query
A man staring into the multi-dimensional abyss of data modeling
Star schema is something that once you can understand it, it makes Power BI development easier. Understanding a dimension table versus a fact table and what should be stored where is an art form. This post assumes that you have some familiarity with this concepts. If you are unfamiliar with these concepts, here is a great explainer by Guy in a Cube:
Over time, I have done a lot of work on flat files, some being big, wide, and long tables (many columns, many rows). I often break the table up with Power Query and make a pseudo star schema that works better with DAX. Obviously, if I had access to the source system I would just write SQL to create views and make the schema there, but many times I don’t have that luxury.
Over time, the repetitive nature of what I was doing irked me. So I decided to try to write some M code to help me out with this process.
Thought Process
When I get this big wide tables (BWT), I often try to take text columns that I could use a filters and break them out into standalone tables and then map back an index which serves as a key for that table. All the merging, creating ids, hiding columns and other transformations became tedious.
I could fire up something like DuckDB with Python and preprocess everything ahead of time, but what if you have to hand the file off to someone that does not know Power BI well, let alone Python?
So, I thought, what if you could dynamically select all the text columns, deduplicate them individually, add an index column, and then nest all of these tables together for easy viewing. Additionally, I wanted to pass in a list of column names to optionally exclude.
That was complex, but I ended up doing it.
Note: I used DimProduct in AdventureWorksDW2017 for demonstration purposes just to show functionality.
How It Works
Two functions end up making this all work, fnCreateDimTables and fnCreateDimID.
fnCreateDimTables
This function is the more complicated one that dynamically created the nested table with the dimensions. To use it after adding it as a query, invoke it on a copy (duplicated of) the target table:
You can also reference the table and add a step and call the function that way as well.
Once you do that, you are provided with a nested table of dimension tables for every text column identified. Normally, you probably don’t want so many dimension tables and would want to put like categories in the same dimension table (customer name, address, etc). This function (for now) only provides single column dimension tables based on text data types, but I hope to continue working on this to solve more complex scenarios.
If too many columns were pulled in and you want to exclude some, you can go back to the source step and pass in the columns you want to exclude in a list:
You can then reference this table and drill down to the nested table that represents the target dimension that you need:
This gives you a fast way to get dimension tables that you may need for your analysis.
But how do I get that id in my main table? That’s the next function.
fnCreateDimID
This function does a lot of the same processing, but at the source table. It creates the id for the text column you target specifically (not bulk processing, one-by-one) and then removes the text column. This sets your fact table up for modeling later on.
To use it, go to your fact table, add a step, then call the function with the previous step name as the table and the column name that you are targeting as text:
This hid the Color column in the fact table and gave me a column named “Color_id” that I can create a relationship with later:
Takeaways and Issues
It was a little complicated to figure out, but not insurmountable really. There are issues that you should watch out for:
- When starting out, create the nested dimension table based off of a duplicated copy of your fact table. Reason being is when you apply the second function on that fact table, it will hide that dimension and affect the output of the first query.
- If you have a ton of text columns, consider grouping like columns of data together into one dimension table.
- If there are a lot of unique values and the dataset is massive, performance may be an issue.
- Consider whether “the juice is worth the squeeze.” Just because something can be done does not mean it should be done.
Thanks for reading!
Comments are closed.