Correlation and Regression:
Chapter 9: Correlation and Regression: Solutions
9.1 Correlation
In this section, we aim to answer the question: Is there a relationship between A and B?
Is there a relationship between the number of employee training hours and the number of on-the-job accidents? Is there a relationship between the number of hours a person sleeps and their reaction time? Is there a relationship between the number of hours a student spends studying for a calculus test and the student’s score on that calculus test?
Definition: a correlation is a relationship between two variables.
Typically, we take x to be the independent variable and y to be the dependent variable. Data is represented by a collection of ordered pairs (x, y).
Mathematically, the strength and direction of a linear relationship between two variables is represented by the correlation coefficient. Suppose that there are n ordered pairs (x, y) that make up a sample from a population. The correlation coefficient r is given by:
$$
r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n \sum x^2 - \left(\sum x\right)^2}\,\sqrt{n \sum y^2 - \left(\sum y\right)^2}}
$$
This will always be a number between -1 and 1 (inclusive).
• If r is close to 1, we say that the variables are positively correlated. This means there is likely a strong linear relationship between the two variables, with a positive slope.
• If r is close to -1, we say that the variables are negatively correlated. This means there is likely a strong linear relationship between the two variables, with a negative slope.
• If r is close to 0, we say that the variables are not correlated. This means that there is likely no linear relationship between the two variables; however, the variables may still be related in some other way.
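The formula for r can be sketched in Python as follows. This is a minimal illustration that assumes the data arrive as two equal-length lists of numbers; the function name `correlation_coefficient` and the sample points are hypothetical and not part of the original notes.

```python
import math

def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r, built from the five sums in the formula."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    # The denominator is zero when all x-values (or all y-values) are equal;
    # r is undefined in that case and this sketch does not handle it.
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Hypothetical points lying exactly on lines with positive and negative slope.
r_up = correlation_coefficient([1, 2, 3, 4], [3, 5, 7, 9])
r_down = correlation_coefficient([1, 2, 3, 4], [9, 7, 5, 3])
print(r_up, r_down)
```

Points that fall exactly on a line give r equal to 1 or -1 (up to floating-point rounding), matching the two extreme cases in the list above.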

The coefficient r is computed from a sample. The corresponding population correlation coefficient is usually unknown.
Example 1: The time x in years that an employee spent at a company and the employee’s hourly pay, y, for 5 employees are listed in the table below. Calculate and interpret the correlation coefficient r. Include a plot of the data in your discussion.
| x | y | x² | y² | xy |
|---|---|---|---|---|
| 5 | 25 | 25 | 625 | 125 |
| 3 | 20 | 9 | 400 | 60 |
| 4 | 21 | 16 | 441 | 84 |
| 10 | 35 | 100 | 1225 | 350 |
| 15 | 38 | 225 | 1444 | 570 |
| Σ x = 37 | Σ y = 139 | Σ x² = 375 | Σ y² = 4135 | Σ xy = 1189 |
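Hint: Calculate the numerator:
$$
n \sum xy - \left(\sum x\right)\left(\sum y\right) = 5 \cdot 1189 - 37 \cdot 139 = 802
$$
Then calculate the denominator:
$$
\sqrt{n \sum x^2 - \left(\sum x\right)^2}\,\sqrt{n \sum y^2 - \left(\sum y\right)^2} = \sqrt{5 \cdot 375 - (37)^2}\,\sqrt{5 \cdot 4135 - (139)^2} = \sqrt{506}\,\sqrt{1354} \approx 827.72
$$
Now, divide to get
$$
r \approx \frac{802}{827.72} \approx 0.97.
$$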
Interpret this result: There is a strong positive correlation between the number of years an employee has worked and the employee’s hourly pay, since r is very close to 1.
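The table-based calculation in Example 1 can be reproduced with a short Python sketch. It assumes the five (x, y) pairs from the table; the variable names are illustrative only.

```python
import math

# Data from Example 1: years at the company (x) and hourly pay (y).
x = [5, 3, 4, 10, 15]
y = [25, 20, 21, 35, 38]
n = len(x)

# The three computed columns of the table.
x_sq = [a * a for a in x]
y_sq = [b * b for b in y]
xy = [a * b for a, b in zip(x, y)]

# Print the table row by row, then the column sums.
for row in zip(x, y, x_sq, y_sq, xy):
    print(row)
sums = {"x": sum(x), "y": sum(y), "x^2": sum(x_sq), "y^2": sum(y_sq), "xy": sum(xy)}
print(sums)

numerator = n * sums["xy"] - sums["x"] * sums["y"]
denominator = (math.sqrt(n * sums["x^2"] - sums["x"] ** 2)
               * math.sqrt(n * sums["y^2"] - sums["y"] ** 2))
r = numerator / denominator
print(numerator, round(denominator, 2), round(r, 2))
```

The last line prints 802, 827.72, and 0.97, which agrees with the hand calculation.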
Example 2: The table below shows the number of absences, x, in a Calculus course and the final exam grade, y, for 7 students. Find the correlation coefficient and interpret your result.
| x | 1 | 0 | 2 | 6 | 4 | 3 | 3 |
|---|---|---|---|---|---|---|---|
| y | 95 | 90 | 90 | 55 | 70 | 80 | 85 |
You may use the facts that (double check this for practice)
Σ x = 19, Σ y = 565, Σ x² = 75, Σ y² = 46,775, Σ xy = 1,380.
Calculate the numerator:
$$
n \sum xy - \left(\sum x\right)\left(\sum y\right) = 7 \cdot 1380 - 19 \cdot 565 = -1075
$$
Then calculate the denominator:
$$
\sqrt{n \sum x^2 - \left(\sum x\right)^2}\,\sqrt{n \sum y^2 - \left(\sum y\right)^2} = \sqrt{7 \cdot 75 - (19)^2}\,\sqrt{7 \cdot 46775 - (565)^2} = \sqrt{164}\,\sqrt{8200} \approx 1159.66
$$
Now, divide to get
$$
r \approx \frac{-1075}{1159.66} \approx -0.93.
$$
Interpret this result: There is a strong negative correlation between the number of absences and the final exam grade, since r is very close to −1. Thus, as the number of absences increases, the final exam grade tends to decrease.
Example 3: The table below shows the height, x, and the pulse rate, y, for 9 people. Find the correlation coefficient and interpret your result.
| x | 68 | 72 | 65 | 70 | 62 | 75 | 78 | 64 | 68 |
|---|---|---|---|---|---|---|---|---|---|
| y | 90 | 85 | 88 | 100 | 105 | 98 | 70 | 65 | 72 |
You may use the facts that (double check this for practice)
Σ x = 622, Σ y = 773, Σ x² = 43,206, Σ y² = 68,007, Σ xy = 53,336.
Calculate the numerator:
$$
n \sum xy - \left(\sum x\right)\left(\sum y\right) = 9 \cdot 53336 - 622 \cdot 773 = -782
$$
Then calculate the denominator:
$$
\sqrt{n \sum x^2 - \left(\sum x\right)^2}\,\sqrt{n \sum y^2 - \left(\sum y\right)^2} = \sqrt{9 \cdot 43206 - (622)^2}\,\sqrt{9 \cdot 68007 - (773)^2} = \sqrt{1970}\,\sqrt{14534} \approx 5350.89
$$
Now, divide to get
$$
r \approx \frac{-782}{5350.89} \approx -0.15.
$$
Interpret this result: There appears to be an extremely weak, if any, correlation between height and pulse rate, since r is close to 0.
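The sums quoted for the height and pulse rate data are offered with the note "double check this for practice". One possible way to do that check is the Python sketch below, which assumes the nine pairs from the table; the variable names are illustrative.

```python
import math

# Height (x) and pulse rate (y) for the 9 people in the table.
x = [68, 72, 65, 70, 62, 75, 78, 64, 68]
y = [90, 85, 88, 100, 105, 98, 70, 65, 72]
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
sum_xy = sum(a * b for a, b in zip(x, y))
print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator
print(numerator, round(r, 2))
```

The printed sums are 622, 773, 43206, 68007, and 53336, and the final line gives a numerator of -782 and r of about -0.15, so the quoted values hold.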
Example 4: The table below shows the number of absences, x, in a Calculus course and the final exam grade, y, for 7 students. Find the correlation coefficient and interpret your result.
| x | 1 | 0 | 2 | 6 | 4 | 3 | 3 |
|---|---|---|---|---|---|---|---|
| y | 85 | 80 | 70 | 55 | 90 | 90 | 95 |
There are 7 ordered pairs (x, y), so n = 7. Calculate the needed sums:
| x | y | x² | y² | xy |
|---|---|---|---|---|
| 1 | 85 | 1 | 7225 | 85 |
| 0 | 80 | 0 | 6400 | 0 |
| 2 | 70 | 4 | 4900 | 140 |
| 6 | 55 | 36 | 3025 | 330 |
| 4 | 90 | 16 | 8100 | 360 |
| 3 | 90 | 9 | 8100 | 270 |
| 3 | 95 | 9 | 9025 | 285 |
| Σ x = 19 | Σ y = 565 | Σ x² = 75 | Σ y² = 46775 | Σ xy = 1470 |
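Calculate the numerator:
$$
n \sum xy - \left(\sum x\right)\left(\sum y\right) = 7 \cdot 1470 - 19 \cdot 565 = -445
$$
Then calculate the denominator:
$$
\sqrt{n \sum x^2 - \left(\sum x\right)^2}\,\sqrt{n \sum y^2 - \left(\sum y\right)^2} = \sqrt{7 \cdot 75 - (19)^2}\,\sqrt{7 \cdot 46775 - (565)^2} = \sqrt{164}\,\sqrt{8200} \approx 1159.66
$$
Now, divide to get
$$
r \approx \frac{-445}{1159.66} \approx -0.38.
$$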
Interpret this result: There is a weak negative correlation between the number of absences and the final exam grade, since r is closer to 0 than it is to −1. (Compare this problem with Example 2.)
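The comparison with Example 2 can be made concrete in Python. The two examples use the same seven x-values and the same seven y-values, only paired differently, so Σ x, Σ y, Σ x², and Σ y² agree and only Σ xy changes. The sketch below assumes the data from both tables; the helper name `correlation` is hypothetical.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r from the sums used throughout this section."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

absences = [1, 0, 2, 6, 4, 3, 3]
grades_ex2 = [95, 90, 90, 55, 70, 80, 85]
grades_ex4 = [85, 80, 70, 55, 90, 90, 95]

# Same grades in a different order, so every sum except the sum of xy matches.
same_values = sorted(grades_ex2) == sorted(grades_ex4)
sum_xy_ex2 = sum(a * g for a, g in zip(absences, grades_ex2))
sum_xy_ex4 = sum(a * g for a, g in zip(absences, grades_ex4))
r_ex2 = correlation(absences, grades_ex2)
r_ex4 = correlation(absences, grades_ex4)

print(same_values, sum_xy_ex2, sum_xy_ex4)
print(round(r_ex2, 2), round(r_ex4, 2))
```

The change in Σ xy from 1380 to 1470 alone moves r from about -0.93 to about -0.38, which shows that the pairing of the values, not the values themselves, determines the strength of the correlation.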
Interpreting the Correlation Between Two Variables:
Suppose that you find a strong positive or negative correlation between two variables. Is there a cause-and-effect relationship between these variables?
• There could be a direct cause-and-effect relationship: that is, x causes y.
• There could be a reverse cause-and-effect relationship: that is, y causes x.
• There could be a third (or fourth? or more?) variable that leads to the relationship between x and y.
• The “relationship” between x and y may just be a coincidence.
9.2 Linear Regression
If there is a “significant” linear correlation between two variables, the next step is to find the equation of a line that “best” fits the data. Such an equation can be used for prediction: given a new x-value, this equation can predict the y-value that is consistent with the information known about the data. This predicted y-value will be denoted by ŷ. The line represented by such an equation is called the linear regression line.
The equation for a line is
ŷ = mx + b,
where m is the slope of the line and b is the y-intercept (the y-value for which x is 0).
In general, the regression line will not pass through each data point. For each data point, there is an error: the difference between the y-value from the data and the y-value on the line, ŷ. By definition, the linear regression line is the line for which the sum of the squares of the errors is the least possible. It turns out that, given a set of data, there is only one such line. The slope m and y-intercept b are given by
$$
m = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{n \sum x^2 - \left(\sum x\right)^2},
\qquad
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n}
$$
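The slope and intercept formulas translate directly into code. The following Python sketch assumes two equal-length lists of numbers; the function name `regression_line` and the sample points are hypothetical and not part of the original notes.

```python
def regression_line(xs, ys):
    """Return (m, b) for the least-squares line y-hat = m*x + b."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # The denominator is zero when all x-values are equal; the slope is
    # undefined in that case and this sketch does not handle it.
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n
    return m, b

# Hypothetical points lying exactly on y = 2x + 1.
m, b = regression_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)
```

For points that already lie on a line, the function recovers that line (slope 2 and intercept 1 here), since every error is then zero.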
Examples: Find the equation of the regression line for each of the two examples and two practice problems in section 9.1.
Example 1:
First, find the slope m. Start by determining the numerator:
$$
n \sum xy - \left(\sum x\right)\left(\sum y\right) = 5 \cdot 1189 - 37 \cdot 139 = 802
$$
Next, find the denominator:
$$
n \sum x^2 - \left(\sum x\right)^2 = 5 \cdot 375 - (37)^2 = 506
$$
Divide to obtain
$$
m = \frac{802}{506} \approx 1.58
$$
Now, find the y-intercept:
$$
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \approx \frac{139}{5} - 1.58 \cdot \frac{37}{5} \approx 16.11
$$
Therefore, the equation of the regression line is ŷ = 1.58x + 16.11.
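The least-squares property can be checked numerically for Example 1. The sketch below, which assumes the five data pairs from section 9.1, computes the sum of squared errors for the fitted line and for a few nearby lines. The perturbation sizes are arbitrary choices made for illustration.

```python
# Data from Example 1: years at the company (x) and hourly pay (y).
x = [5, 3, 4, 10, 15]
y = [25, 20, 21, 35, 38]
n = len(x)

def sum_squared_errors(m, b):
    """Sum of squared differences between each y and the line's prediction."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(a * a for a in x)
sum_xy = sum(a * c for a, c in zip(x, y))
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n

best = sum_squared_errors(m, b)
# Nudge the slope or the intercept and confirm the error only grows.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.5), (0.0, -0.5)]
worse = [sum_squared_errors(m + dm, b + db) > best for dm, db in nudges]
print(round(m, 2), round(b, 2), worse)
```

Every nudged line has a larger sum of squared errors than the fitted one. With the unrounded slope the intercept comes out near 16.07 rather than 16.11; the small difference arises because the hand calculation uses the slope already rounded to 1.58.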
Additional Question (Example 1): Use the equation to predict the hourly pay rate of an employee who has worked for 20 years.
For an employee who has worked for 20 years, x = 20. Plug this into the equation for the regression line:
ŷ = 1.58 · 20 + 16.11 = 47.71 is the predicted hourly pay, based on the data.
Example 2:
First, find the slope m. Start by determining the numerator:
$$
n \sum xy - \left(\sum x\right)\left(\sum y\right) = 7 \cdot 1380 - 19 \cdot 565 = -1075
$$
Next, find the denominator:
$$
n \sum x^2 - \left(\sum x\right)^2 = 7 \cdot 75 - (19)^2 = 164
$$
Divide to obtain
$$
m = \frac{-1075}{164} \approx -6.55
$$
Now, find the y-intercept:
$$
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \approx \frac{565}{7} - (-6.55) \cdot \frac{19}{7} \approx 98.49
$$
Therefore, the equation of the regression line is ŷ = −6.55x + 98.49.
Additional Question (Example 2): Use the equation to predict the test score for a student with 5 absences.
For a student with 5 absences, x = 5. Plug this into the equation for the regression line:
ŷ = −6.55 · 5 + 98.49 = 65.74 is the predicted score, based on the data.
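The prediction step is a single evaluation of the line. The Python sketch below assumes the Example 2 data from section 9.1 and compares the prediction made with the rounded coefficients used in the hand calculation against one made with unrounded coefficients. The function name `predict` is hypothetical.

```python
# Data from Example 2: absences (x) and final exam grade (y).
x = [1, 0, 2, 6, 4, 3, 3]
y = [95, 90, 90, 55, 70, 80, 85]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(a * a for a in x)
sum_xy = sum(a * c for a, c in zip(x, y))
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n

def predict(slope, intercept, x_new):
    """Evaluate y-hat = slope * x_new + intercept."""
    return slope * x_new + intercept

rounded = predict(-6.55, 98.49, 5)   # coefficients as rounded in the text
unrounded = predict(m, b, 5)         # full-precision coefficients
print(round(rounded, 2), round(unrounded, 2))
```

This prints about 65.74 and 65.73. The two agree to within a hundredth of a point, so rounding the coefficients to two decimals has little effect on this prediction.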
Example 3:
First, find the slope m. Start by determining the numerator:
$$
n \sum xy - \left(\sum x\right)\left(\sum y\right) = 9 \cdot 53336 - 622 \cdot 773 = -782
$$
Next, find the denominator:
$$
n \sum x^2 - \left(\sum x\right)^2 = 9 \cdot 43206 - (622)^2 = 1970
$$
Divide to obtain
$$
m = \frac{-782}{1970} \approx -0.40
$$
Now, find the y-intercept:
$$
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \approx \frac{773}{9} - (-0.40) \cdot \frac{622}{9} \approx 113.53
$$
Therefore, the equation of the regression line is ŷ = −0.40x + 113.53. Even though we found an equation, recall that the correlation between x and y in this example was weak. Thus, this regression line may not work very well for the data.
Example 4:
First, find the slope m. Start by determining the numerator:
$$
n \sum xy - \left(\sum x\right)\left(\sum y\right) = 7 \cdot 1470 - 19 \cdot 565 = -445
$$
Next, find the denominator:
$$
n \sum x^2 - \left(\sum x\right)^2 = 7 \cdot 75 - (19)^2 = 164
$$
Divide to obtain
$$
m = \frac{-445}{164} \approx -2.71
$$
Now, find the y-intercept:
$$
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \approx \frac{565}{7} - (-2.71) \cdot \frac{19}{7} \approx 88.07
$$
Therefore, the equation of the regression line is ŷ = −2.71x + 88.07. Even though we found an equation, recall that the correlation between x and y in this example was weak. Thus, this regression line may not work very well for the data. For example, for a student with x = 0 absences, plugging in, we find that the grade predicted by the regression line is about 88, whereas the student in the data with 0 absences scored 80.
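To see how poorly the line fits the weakly correlated data of Example 4, one can list the error for every data point. The Python sketch below assumes the seven pairs from section 9.1 and the rounded coefficients −2.71 and 88.07 from the hand calculation; the variable names are illustrative.

```python
# Data from Example 4: absences (x) and final exam grade (y).
x = [1, 0, 2, 6, 4, 3, 3]
y = [85, 80, 70, 55, 90, 90, 95]
m, b = -2.71, 88.07  # rounded slope and intercept from the text

predicted = [m * xi + b for xi in x]
errors = [yi - pi for yi, pi in zip(y, predicted)]

# One line per student: actual grade, predicted grade, and the error.
for xi, yi, pi, ei in zip(x, y, predicted, errors):
    print(f"x={xi}  actual={yi}  predicted={pi:.2f}  error={ei:+.2f}")

largest_error = max(abs(e) for e in errors)
print(round(largest_error, 2))
```

Several predictions miss by more than 10 points, and the largest miss is about 16.81 points (the student with 6 absences), which is consistent with the weak correlation found for this data set.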