The Correlation Coefficient, r
As indicated earlier, the strength of the linear association between two variables is mathematically measured with the so-called correlation coefficient. At times, it is also called the Pearson product-moment correlation coefficient, after Karl Pearson, who is credited with formulating r.
Mathematically, the correlation coefficient, \(r\), is calculated with the following equation:
\[\begin{equation} r = \frac{cov(X, Y)}{S_x S_y} \end{equation}\]
Basically, the correlation coefficient, \(r\), is the covariance divided by the product of the standard deviation of the data in X and the standard deviation of the data in Y.
If you think about this equation, the covariance is built from the products of the deviations of X and Y from their means, while the standard deviations measure, independently, the spread of X and the spread of Y around their means. Dividing one by the other cancels the units, so, in practical terms, the correlation coefficient is a standardized metric. It will never be smaller than -1 or larger than 1.
That is why the correlation coefficient is such a nice metric to assess the tendency between two variables. If it is close to -1, then you know the data follow a strong negative linear trend. If it is close to 1, then the data follow a strong positive linear trend. If it is close to zero, then the data are all over the place (there is no linear correlation). Towards the end of the chapter, we will learn more about how to interpret the correlation coefficient.
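As a quick check of the two claims above (that r does not depend on units and that it stays between -1 and 1), here is a minimal sketch. The vectors `hours` and `score` are made-up illustration values, not the StudyingTimes data.

```r
# Hypothetical data, invented for illustration only
hours <- c(1, 2, 3, 4, 5)
score <- c(52, 60, 71, 75, 88)

# Changing the units of X (hours -> minutes) rescales the covariance...
cov_hours   <- cov(hours, score)
cov_minutes <- cov(hours * 60, score)

# ...but leaves the correlation coefficient unchanged
r_hours   <- cor(hours, score)
r_minutes <- cor(hours * 60, score)

# Points that fall exactly on a straight line give the extreme values of r
r_up   <- cor(hours, 2 * hours + 10)   # perfectly increasing line
r_down <- cor(hours, -3 * hours + 10)  # perfectly decreasing line

print(c(cov_hours = cov_hours, cov_minutes = cov_minutes))
print(c(r_hours = r_hours, r_minutes = r_minutes, r_up = r_up, r_down = r_down))
```

The covariance grows by a factor of 60 when hours become minutes, which makes it hard to compare across data sets; r is the same in both cases.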
Let's calculate it:
COVXY <- cov(StudyingTimes$Score, StudyingTimes$Hours_Studying, use = "everything", method = "pearson") # Covariance of X and Y
SDX <- sd(StudyingTimes$Hours_Studying) # Standard deviation of X
SDY <- sd(StudyingTimes$Score) # Standard deviation of Y
r <- COVXY / (SDX * SDY) # Correlation coefficient
r
## [1] 0.9817004
So the coefficient of correlation between the time that you study and the score in my class is 0.98. That means the more time you study, the higher your grade... nice, ah?
In R, the coefficient of correlation can be calculated directly with the function cor:
## [1] 0.9817004
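With the StudyingTimes columns, this would presumably be a call such as `cor(StudyingTimes$Hours_Studying, StudyingTimes$Score)`. As a minimal sketch of how `cor` compares with the manual calculation, the example below uses a small made-up data frame (`toy` and its values are hypothetical, standing in for StudyingTimes):

```r
# Hypothetical stand-in for StudyingTimes; values invented for illustration
toy <- data.frame(
  Hours_Studying = c(1, 2, 3, 4, 5),
  Score          = c(52, 60, 71, 75, 88)
)

# Manual route: covariance divided by the product of the standard deviations
r_manual <- cov(toy$Hours_Studying, toy$Score) /
  (sd(toy$Hours_Studying) * sd(toy$Score))

# Direct route: cor() uses method = "pearson" by default
r_direct <- cor(toy$Hours_Studying, toy$Score)

# Compare the two routes, allowing for floating-point rounding
all.equal(r_manual, r_direct)
```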
Alternative formulation
While looking into the correlation coefficient, you will likely see alternative formulations of it that yield the same value (or very nearly the same, once rounding is involved).
For instance, you may find it formulated like this:
\[\begin{equation} r = \frac{1}{n-1} \sum_{} \left(\frac{x-\bar{x}}{S_x}\right) \left(\frac{y-\bar{y}}{S_y}\right) \end{equation}\]
The equation above is the same one we used earlier, with the parts reorganized: each deviation is divided by its standard deviation before the products are summed.
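A minimal sketch of this formulation in R, assuming made-up vectors `x` and `y` (hypothetical values, not the StudyingTimes data):

```r
# Hypothetical data, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(52, 60, 71, 75, 88)
n <- length(x)

# Standardize each variable: subtract the mean, divide by the standard deviation
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Sum the products of the standardized values and divide by n - 1
r_z <- sum(zx * zy) / (n - 1)

# Compare with R's built-in Pearson correlation
all.equal(r_z, cor(x, y))
```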
At times, you will also see it defined as:
\[\begin{equation} r = \frac{n * \sum_{} xy- (\sum_{} x)*(\sum_{} y)}{\sqrt {n * \sum_{} x^2- (\sum_{} x)^2 } * \sqrt {n * \sum_{} y^2- (\sum_{} y)^2 }} \end{equation}\]
This form is algebraically equivalent to the equation we used earlier, so it should yield the same value apart from any rounding of the intermediate sums.
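To see how the two forms relate, a short derivation helps. It assumes the sample definitions with \(n-1\) in the denominator, which is what cov and sd compute in R. Written with sums, the pieces of the first equation are:

\[\begin{equation} cov(X, Y) = \frac{\sum_{} xy - n\bar{x}\bar{y}}{n-1}, \quad S_x = \sqrt{\frac{\sum_{} x^2 - n\bar{x}^2}{n-1}}, \quad S_y = \sqrt{\frac{\sum_{} y^2 - n\bar{y}^2}{n-1}} \end{equation}\]

The \(n-1\) terms cancel when the covariance is divided by \(S_x S_y\):

\[\begin{equation} r = \frac{\sum_{} xy - n\bar{x}\bar{y}}{\sqrt{\sum_{} x^2 - n\bar{x}^2} \sqrt{\sum_{} y^2 - n\bar{y}^2}} \end{equation}\]

Multiplying the numerator and the denominator by \(n\), and using \(n\bar{x} = \sum_{} x\) and \(n\bar{y} = \sum_{} y\), gives the sums-only form shown above.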
For the equation above, all we have to compute is \(\sum_{}x\), \(\sum_{}y\), \(\sum_{}x^2\), \(\sum_{}y^2\), and \(\sum_{}x*y\). Let's try it, for the sake of being sure and for you to use some tools from R.
Y <- StudyingTimes$Score
X <- StudyingTimes$Hours_Studying
SumX <- sum(X) # sum of all values of x
SumY <- sum(Y) # sum of all values of y
SumX2 <- sum(X^2) # sum of all values of x^2
SumY2 <- sum(Y^2) # sum of all values of y^2
SumXY <- sum(X * Y) # sum of all values of x * y
n <- length(Y) # the number of observations, i.e. the number of rows in the data frame
The results are:
\(\sum_{}x\) = 13
\(\sum_{}y\) = 371
\(\sum_{}x^2\) = 43.94
\(\sum_{}y^2\) = 28495
\(\sum_{}x*y\) = 1061.8
n = 5
Now we plug those values into the coefficient of correlation, r, equation:
\[\begin{equation} r = \frac{n * \sum_{} xy- (\sum_{} x)*(\sum_{} y)}{\sqrt {n * \sum_{} x^2- (\sum_{} x)^2 } * \sqrt {n * \sum_{} y^2- (\sum_{} y)^2 }} \end{equation}\]
\[\begin{equation} r = \frac{5 * 1061.8- (13)*(371)}{\sqrt {5 * 43.94- (13)^2 } * \sqrt {5 * 28495- (371)^2 }} \end{equation}\]
In R, it is basically:
## [1] 0.9817004
\[\begin{equation} r = 0.98 \end{equation}\]
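A minimal sketch of that calculation in R, plugging in the sums exactly as they are listed in the text (the variable names `numerator` and `denominator` are introduced here for readability only):

```r
# Sums as listed in the text for the StudyingTimes data
n     <- 5
SumX  <- 13
SumY  <- 371
SumX2 <- 43.94
SumY2 <- 28495
SumXY <- 1061.8

# Numerator and denominator of the sums-only formula for r
numerator   <- n * SumXY - SumX * SumY
denominator <- sqrt(n * SumX2 - SumX^2) * sqrt(n * SumY2 - SumY^2)

r <- numerator / denominator
r
```

Working from the listed sums keeps every intermediate value visible, which makes it easier to spot where any rounding enters the calculation.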
Hmm, what do you think is causing the difference from the original calculation?