Get Pearson Correlation Coefficient with Power Query
I can’t tell if that pear in the water is a buoy or if it is a child that the p(e)arent lost track of
“Are these variables related?” An enduring question that has many ways of being answered. One way is to look at the Pearson Correlation Coefficient. It’s pretty simple to do in DAX and Benoit Fedit has a great article about it on his site.
The Pearson test returns a value between -1.0 and 1.0. If the number is closer to -1, then there is a strong negative correlation (as one variable increases, the other decreases). If the number is closer to one, there is a strong positive correlation (as one variable increases, the other does too). If it is close to 0, then there is no correlation or a weaker one.
There are limitations to a Pearson test. A major one is if there are outliers present. Also, a Pearson test will tell you if there is a correlation between the two variables, but it will not tell you which one is the cause (independent variable) and which is the effect (dependent variable).
Again, because I love M and because I want to learn the math, I made a function to replicate this function.
The Code
Here is the function (and the gist):
(x as list, y as list)=>
let
sum_x = List.Sum(x),
sum_y = List.Sum(y),
sum_x_sq = List.Accumulate(x, 0, (state, current)=> state + Number.Power(current, 2)),
sum_y_sq = List.Accumulate(y, 0, (state, current)=> state + Number.Power(current,2)),
sum_xy =
let
l = List.Zip({x, y}),
tx = List.Transform(l, each _{0} * _{1}),
sum = List.Sum(tx)
in sum,
n = List.Count(x),
r = ((n * sum_xy) - (sum_x * sum_y)) / Number.Sqrt((n * sum_x_sq - Number.Power(sum_x,2)) * (n* sum_y_sq -Number.Power(sum_y,2)))
in
r
The mechanics of this one was interesting to me. Passing in two lists, I zipped them together to make a nested list of x,y pairs. After that, it was as simple as operating on the list to get the sum of squares, etc. List.Accumulate was the hero the provided the loop.
How to Use It
Here is an example of calling it with a blank query:
All you have to do is refer to the table and column for each one you want to look at. I copied the example data Benoit Fedit had on his post and was able to replicate the result (0.47).
Is This Useful?
There are so many ways to do this in so many applications. I have done this in SPSS, Power BI, Excel and Alteryx. I think if you are building a dataset and doing EDA (Exploratory Data Analysis), this could be useful as a way to ask questions.
Comments are closed.