My Data Analytics Journey | Correlation

Linear Equation

Ahhhh yes...the fabled Y=MX+C that most of us are all too familiar with...This equation is important as it allows us to predict a value based on existing data, kinda like what we do for Machine Learning (wink wink). In fact, linear regression is often used as a baseline model to compare against more complex machine learning models due to its simplicity and surprising accuracy in certain scenarios, but more on those later, so stay tuned!

A real-world example of using this equation is predicting sales based on advertising spend, with the dependent variable Y being sales and the predictor X being advertising spend. For instance, if the X variable has a coefficient of 72, then for every unit increase in advertising spend, sales are predicted to increase by 72 units.
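To make the coefficient idea concrete, here is a minimal sketch in plain Python. The numbers are made up for illustration (they are constructed to sit exactly on Y = 72X + 150, so the intercept of 150 is purely an assumption), and `fit_line` is a hypothetical helper name that estimates M and C using ordinary least squares.

```python
# Hypothetical data: advertising spend (X) and sales (Y), made up for illustration.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [222.0, 294.0, 366.0, 438.0, 510.0]  # constructed as 72 * X + 150


def fit_line(xs, ys):
    """Return (m, c) of the least-squares line Y = m*X + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of X and Y divided by the variation of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through (mean_x, mean_y).
    c = mean_y - m * mean_x
    return m, c


m, c = fit_line(ad_spend, sales)
predicted_sales = m * 6.0 + c  # predicted sales for a spend of 6 units
print(f"slope m = {m:.1f}, intercept c = {c:.1f}, prediction = {predicted_sales:.1f}")
```

Since the slope comes out as 72, moving the spend from 5 to 6 units raises the predicted sales by 72 units, which is exactly what the coefficient says.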


This linear equation also comes in handy when we are trying to draw a best fit line through a set of data points, where we draw the line that minimises the mean squared error. However, how do we know if x and y are actually correlated? This is where the Pearson Correlation Coefficient (r) comes in, as it tells us both the strength and the direction of the correlation.
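For the curious, r can be computed by hand in a few lines. This is a rough sketch with made-up data, and `pearson_r` is just a helper name chosen here, not something from a particular library. The sign of r gives the direction and its size (between 0 and 1 in absolute terms) gives the strength.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, made up for illustration.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]    # tends to go up as x goes up
falling = [10, 8, 6, 4, 2]  # goes down in a perfect straight line

r_rising = pearson_r(x, rising)    # positive: y tends to rise with x
r_falling = pearson_r(x, falling)  # negative: y falls as x rises
print(f"r_rising = {r_rising:.2f}, r_falling = {r_falling:.2f}")
```

The first pair gives a positive r that is fairly close to 1 (a strong upward trend with some scatter), while the second gives r = -1, a perfect downward straight line.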



Summary of r


However, it is important to note that just looking at r values can be misleading! 

Even though the value of r is greater than 0.7 for each of the 4 graphs above, we can clearly see that the datasets do not all have a very strong linear relationship between the two variables. In fact, as the graph below shows, even when r=0, the dataset can still have a strong relationship (such as a quadratic one), albeit not a linear one.
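The r=0 case is easy to reproduce. In this small sketch (made-up data, with `pearson_r` as a hypothetical helper name), y is completely determined by x through y = x squared, yet r comes out as zero because the relationship is not a straight line.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# x is symmetric around 0 and y depends on x perfectly, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r_quadratic = pearson_r(x, y)
print(f"r = {r_quadratic:.2f}")
```

The left half of the curve slopes down and the right half slopes up by the same amount, so the two halves cancel out in the calculation of r.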

As such, prior to calculating the r value, it is good practice to plot the two variables on a scatterplot as a visual sense check that the relationship between them is indeed linear in nature.



As it is often prohibitively expensive to gather data on everyone in the target population, researchers often carry out representative sampling and extrapolate the findings from their sample back to the target population. 

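Because a sample is only a slice of the population, a correlation seen in the sample could have come about by chance. So do remember to check the p-value before using any variable X as a predictor of variable Y: a small p-value (commonly below 0.05) means a correlation this strong would be unlikely to show up in a sample if the two variables were unrelated in the population.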
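In practice a statistics library will report the p-value for you (SciPy's `scipy.stats.pearsonr`, for example, returns it alongside r). To show the idea without extra dependencies, here is a sketch that estimates a p-value with a permutation test instead, which is a different technique from the formula-based one libraries usually use. The data, the helper names and the choice of 2000 shuffles are all assumptions made for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for r by shuffling ys to break any link with xs."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_extreme += 1
    # Add 1 to both parts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical sample, made up for illustration: y is roughly 2 * x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2]

p_value = permutation_p_value(x, y)
print(f"r = {pearson_r(x, y):.3f}, estimated p-value = {p_value:.4f}")
```

Shuffling y destroys any real link with x, so the shuffled r values show what chance alone tends to produce. If almost none of them are as strong as the r we actually observed, the p-value is small.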

Another thing to note is that correlation does NOT equate to causation. In simpler terms, just because two events occur in tandem does NOT mean one causes the other. Take for instance an increase in ice-cream sales coinciding with an increase in the number of drownings. Common sense tells us that neither event causes the other, and that it is most likely a third variable, such as hot weather, that brings people out to swim more as well as to purchase more ice-cream.
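We can mimic this with a toy simulation. Everything below is invented for illustration (the temperature range, the multipliers and the noise levels are assumptions, not real figures). The key point is that ice-cream sales and drownings are each generated from temperature only, and neither one appears in the other's formula.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so the run is repeatable

# The third variable: daily temperature in degrees Celsius (made up).
temperature = [rng.uniform(20, 35) for _ in range(500)]

# Both outcomes depend on temperature plus their own independent noise.
ice_cream_sales = [10 * t + rng.gauss(0, 10) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 1) for t in temperature]

r_ice_cream_vs_drownings = pearson_r(ice_cream_sales, drownings)
print(f"r between ice-cream sales and drownings = {r_ice_cream_vs_drownings:.2f}")
```

The two series end up strongly correlated even though, by construction, neither causes the other. The hot weather is doing all the work behind the scenes.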


Now, do you feel that this post has caused you to understand these concepts better, or is it merely correlated with it? ;)



