Correlation and Regression:

 

Chapter 9: Correlation and Regression: Solutions

 

9.1    Correlation

 

In this section, we aim to answer the question: Is there a relationship between A and B?

Is there a relationship between the number of employee training hours and the number of on-the-job accidents? Is there a relationship between the number of hours a person sleeps and their reaction time? Is there a relationship between the number of hours a student spends studying for a calculus test and the student’s score on that calculus test?

Definition: a correlation is a relationship between two variables.

Typically, we take x to be the independent variable. We take y to be the dependent variable. Data is represented by a collection of ordered pairs (x, y).

Mathematically, the strength and direction of a linear relationship between two variables is represented by the correlation coefficient. Suppose that there are n ordered pairs (x, y) that make up a sample from a population. The correlation coefficient r is given by:

 

\[
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}
\]

This will always be a number between -1 and 1 (inclusive).

 

 
If r is close to 1, we say that the variables are positively correlated. This means there is likely a strong linear relationship between the two variables, with a positive slope.

 
If r is close to -1, we say that the variables are negatively correlated. This means there is likely a strong linear relationship between the two variables, with a negative slope.

 
If r is close to 0, we say that the variables are not correlated. This means that there is likely no linear relationship between the two variables; however, the variables may still be related in some other way.
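The formula for r can be sketched in a few lines of Python. This is only an illustration: the function name `correlation` and the three small data sets are made up for this sketch and do not come from the examples in these notes.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r, computed from the five sums in the formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Made-up data sets, chosen only to illustrate the three cases.
xs = [1, 2, 3, 4]
print(correlation(xs, [3, 5, 7, 9]))   # points on a line with positive slope
print(correlation(xs, [9, 7, 5, 3]))   # points on a line with negative slope
print(correlation(xs, [2, 5, 5, 2]))   # rises then falls: no linear trend
```

In the third data set the y-values clearly depend on x, yet the numerator of r is 0. This is the situation described above, where the variables are related in some way that is not linear. The sketch assumes the x-values (and the y-values) are not all equal; otherwise the denominator is 0.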

 

 



The population correlation coefficient is denoted by ρ and is usually unknown.

Example 1: The time x in years that an employee spent at a company and the employee’s hourly pay, y, for 5 employees are listed in the table below. Calculate and interpret the correlation coefficient r. Include a plot of the data in your discussion.

 

 

x       y       x²      y²      xy
5       25      25      625     125
3       20      9       400     60
4       21      16      441     84
10      35      100     1225    350
15      38      225     1444    570

Σx = 37,   Σy = 139,   Σx² = 375,   Σy² = 4135,   Σxy = 1189

 

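Hint: Calculate the numerator:

\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 5 \cdot 1189 - 37 \cdot 139 = 802
\]

Then calculate the denominator:

\[
\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2} = \sqrt{5 \cdot 375 - (37)^2}\,\sqrt{5 \cdot 4135 - (139)^2} = \sqrt{506}\,\sqrt{1354} \approx 827.72
\]

Now, divide to get

\[
r \approx \frac{802}{827.72} \approx 0.97.
\]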

Interpret this result: There is a strong positive correlation between the number of years an employee has worked and the employee’s hourly pay, since r is very close to 1.
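As a check on Example 1, the table columns and their sums can be rebuilt with a short Python sketch. The data values are taken from the example; the variable names are illustrative choices.

```python
import math

# Data from Example 1: years at the company (x) and hourly pay (y).
xs = [5, 3, 4, 10, 15]
ys = [25, 20, 21, 35, 38]

# Build the rows of the table: x, y, x^2, y^2, xy.
rows = [(x, y, x * x, y * y, x * y) for x, y in zip(xs, ys)]
for row in rows:
    print(row)

# Column totals: (sum x, sum y, sum x^2, sum y^2, sum xy).
totals = tuple(sum(column) for column in zip(*rows))
print("sums:", totals)

n = len(xs)
sum_x, sum_y, sum_x2, sum_y2, sum_xy = totals
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator
print("r =", round(r, 2))
```

The printed sums should agree with the bottom row of the table, and r should round to 0.97.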


Example 2: The table below shows the number of absences, x, in a Calculus course and the final exam grade, y, for 7 students. Find the correlation coefficient and interpret your result.

 

 

x     1     0     2     6     4     3     3
y     95    90    90    55    70    80    85

 

You may use the facts that (double check this for practice)

 

\[
\sum x = 19, \quad \sum y = 565, \quad \sum x^2 = 75, \quad \sum y^2 = 46775, \quad \sum xy = 1380.
\]

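Calculate the numerator:

\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 7 \cdot 1380 - 19 \cdot 565 = -1075
\]

Then calculate the denominator:

\[
\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2} = \sqrt{7 \cdot 75 - (19)^2}\,\sqrt{7 \cdot 46775 - (565)^2} = \sqrt{164}\,\sqrt{8200} \approx 1159.66
\]

Now, divide to get

\[
r \approx \frac{-1075}{1159.66} \approx -0.93.
\]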


 
Interpret this result: There is a strong negative correlation between the number of absences and the final exam grade, since r is very close to -1. Thus, as the number of absences increases, the final exam grade tends to decrease.


Example 3: The table below shows the height, x, and the pulse rate, y, for 9 people. Find the correlation coefficient and interpret your result.

 

x     68    72    65    70    62    75    78    64    68
y     90    85    88    100   105   98    70    65    72

 

You may use the facts that (double check this for practice)

 

\[
\sum x = 622, \quad \sum y = 773, \quad \sum x^2 = 43206, \quad \sum y^2 = 68007, \quad \sum xy = 53336.
\]
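One way to double check the stated sums is sketched below. The helper name `five_sums` is a hypothetical choice for this illustration; the data are the absences and grades from Example 2 and the heights and pulse rates from Example 3.

```python
def five_sums(xs, ys):
    """Return (sum x, sum y, sum x^2, sum y^2, sum xy) for paired data."""
    return (
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(y * y for y in ys),
        sum(x * y for x, y in zip(xs, ys)),
    )

# Example 2: absences (x) and final exam grade (y).
absences = [1, 0, 2, 6, 4, 3, 3]
grades = [95, 90, 90, 55, 70, 80, 85]

# Example 3: height (x) and pulse rate (y).
heights = [68, 72, 65, 70, 62, 75, 78, 64, 68]
pulses = [90, 85, 88, 100, 105, 98, 70, 65, 72]

# Each comparison prints True when the stated sums match the data.
print(five_sums(absences, grades) == (19, 565, 75, 46775, 1380))
print(five_sums(heights, pulses) == (622, 773, 43206, 68007, 53336))
```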

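Calculate the numerator:

\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 9 \cdot 53336 - 622 \cdot 773 = -782
\]

Then calculate the denominator:

\[
\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2} = \sqrt{9 \cdot 43206 - (622)^2}\,\sqrt{9 \cdot 68007 - (773)^2} = \sqrt{1970}\,\sqrt{14534} \approx 5350.89
\]

Now, divide to get

\[
r \approx \frac{-782}{5350.89} \approx -0.15.
\]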

Interpret this result: There appears to be an extremely weak, if any, correlation between height and pulse rate, since r is close to 0.

Example 4: The table below shows the number of absences, x, in a Calculus course and the final exam grade, y, for 7 students. Find the correlation coefficient and interpret your result.

 

x     1     0     2     6     4     3     3
y     85    80    70    55    90    90    95

 

There are 7 ordered pairs (x, y), so n = 7. Calculate the needed sums:

 

x       y       x²      y²      xy
1       85      1       7225    85
0       80      0       6400    0
2       70      4       4900    140
6       55      36      3025    330
4       90      16      8100    360
3       90      9       8100    270
3       95      9       9025    285

Σx = 19,   Σy = 565,   Σx² = 75,   Σy² = 46775,   Σxy = 1470


 

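Calculate the numerator:

\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 7 \cdot 1470 - 19 \cdot 565 = -445
\]

Then calculate the denominator:

\[
\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2} = \sqrt{7 \cdot 75 - (19)^2}\,\sqrt{7 \cdot 46775 - (565)^2} = \sqrt{164}\,\sqrt{8200} \approx 1159.66
\]

Now, divide to get

\[
r \approx \frac{-445}{1159.66} \approx -0.38.
\]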

Interpret this result: There is a weak negative correlation between the number of absences and the final exam grade, since r is closer to 0 than it is to -1. (Compare this problem with Example 2.)
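The comparison with Example 2 can be made concrete with a short sketch. Both examples use the same x-values and the same collection of grades, only paired differently, so every sum except Σxy is shared. The variable and function names below are illustrative.

```python
import math

absences = [1, 0, 2, 6, 4, 3, 3]
grades_ex2 = [95, 90, 90, 55, 70, 80, 85]   # Example 2
grades_ex4 = [85, 80, 70, 55, 90, 90, 95]   # Example 4: same grades, paired differently

def sum_xy(xs, ys):
    """Sum of the products x*y over the ordered pairs."""
    return sum(x * y for x, y in zip(xs, ys))

n = len(absences)
sum_x = sum(absences)
sum_x2 = sum(x * x for x in absences)
# The grades are the same values in both examples, so these sums are shared.
sum_y = sum(grades_ex2)
sum_y2 = sum(y * y for y in grades_ex2)
denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)

for label, grades in (("Example 2", grades_ex2), ("Example 4", grades_ex4)):
    numerator = n * sum_xy(absences, grades) - sum_x * sum_y
    print(label, "sum xy =", sum_xy(absences, grades), "r =", round(numerator / denominator, 2))
```

Only the pairing of x with y changes between the two examples, and that alone moves r from about -0.93 to about -0.38.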

Interpreting the Correlation Between Two Variables:

Suppose that you find a strong positive or negative correlation between two variables. Is there a cause-and-effect relationship between these variables?

 

    There could be a direct cause-and-effect relationship: that is, x causes y.

    There could be a reverse cause-and-effect relationship: that is, y causes x.

    There could be a third (or fourth? or more?) variable that leads to the relationship between x and y.

    The “relationship” between x and y may just be a coincidence.


9.2   Linear Regression

 

If there is a “significant” linear correlation between two variables, the next step is to find the equation of a line that “best” fits the data. Such an equation can be used for prediction: given a new x-value, this equation can predict the y-value that is consistent with the information known about the data. This predicted y-value will be denoted by ŷ. The line represented by such an equation is called the linear regression line.

The equation for a line is

 

\[
\hat{y} = mx + b,
\]

where m is the slope of the line and b is the y-intercept (the y-value for which x is 0).

In general, the regression line will not pass through each data point. For each data point, there is an error: the difference between the y-value from the data and the y-value on the line, ŷ. By definition, the linear regression line is the line for which the sum of the squares of the errors is the least possible. It turns out that, given a set of data, there is only one such line. The slope m and y-intercept b are given by

 


\[
m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}, \qquad b = \frac{\sum y}{n} - m\,\frac{\sum x}{n}
\]
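The slope and intercept formulas can be sketched in Python as follows. The function names and the small data set are assumptions made for illustration and are not taken from the examples in these notes. The loop at the end spot-checks the least-squares property: moving the slope or intercept away from (m, b) makes the sum of squared errors larger.

```python
def regression_line(xs, ys):
    """Least-squares slope m and intercept b from the formulas for m and b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n
    return m, b

def sum_squared_errors(xs, ys, m, b):
    """Sum of squared differences between each y and the line's prediction."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Made-up data, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

m, b = regression_line(xs, ys)
best = sum_squared_errors(xs, ys, m, b)
# Nudging the slope or intercept away from (m, b) should only increase the error.
for dm, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.5), (0.0, -0.5)]:
    assert sum_squared_errors(xs, ys, m + dm, b + db) > best
print(round(m, 3), round(b, 3))
```

The four nudges are only a spot check, not a proof that the line is optimal. The sketch also assumes the x-values are not all equal, since the denominator of m would then be 0.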

 

Examples: Find the equation of the regression line for each of the four examples in Section 9.1.

Example 1:

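First, find the slope m. Start by determining the numerator:

\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 5 \cdot 1189 - 37 \cdot 139 = 802
\]

Next, find the denominator:

\[
n\sum x^2 - \left(\sum x\right)^2 = 5 \cdot 375 - (37)^2 = 506
\]

Divide to obtain

\[
m = \frac{802}{506} \approx 1.58
\]

Now, find the y-intercept:

\[
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \approx \frac{139}{5} - 1.58 \cdot \frac{37}{5} \approx 16.11
\]

Therefore, the equation of the regression line is ŷ = 1.58x + 16.11.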

Additional Questions: Use the equations to (Ex 1) predict the hourly pay rate of an employee who has worked for 20 years, and (Ex 2) predict the test score for a student with 5 absences.


For an employee who has worked for 20 years, x = 20. Plug this into the equation for the regression line:

ŷ = 1.58 · 20 + 16.11 = 47.71 is the predicted hourly pay, based on the data.

Example 2:


First, find the slope m. Start by determining the numerator:

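\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 7 \cdot 1380 - 19 \cdot 565 = -1075
\]

Next, find the denominator:

\[
n\sum x^2 - \left(\sum x\right)^2 = 7 \cdot 75 - (19)^2 = 164
\]

Divide to obtain

\[
m = \frac{-1075}{164} \approx -6.55
\]

Now, find the y-intercept:

\[
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \approx \frac{565}{7} - (-6.55) \cdot \frac{19}{7} \approx 98.49
\]

Therefore, the equation of the regression line is ŷ = -6.55x + 98.49.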


For a student with 5 absences, x = 5. Plug this into the equation for the regression line:

ŷ = -6.55 · 5 + 98.49 = 65.74 is the predicted score, based on the data.
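Both predictions from the Additional Questions can be reproduced with the sketch below. It is an illustration only: the helper name `rounded_line` is hypothetical, and it deliberately rounds m to two decimals before computing b, because that is what the worked solutions do.

```python
def rounded_line(n, sum_x, sum_y, sum_x2, sum_xy):
    """Slope and intercept, each rounded to two decimals as in the worked solutions."""
    m = round((n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2), 2)
    b = round(sum_y / n - m * sum_x / n, 2)
    return m, b

# Sums taken from Examples 1 and 2 in Section 9.1.
m1, b1 = rounded_line(5, 37, 139, 375, 1189)
m2, b2 = rounded_line(7, 19, 565, 75, 1380)

print("Ex 1:", m1, b1, "-> pay at 20 years:", round(m1 * 20 + b1, 2))
print("Ex 2:", m2, b2, "-> score with 5 absences:", round(m2 * 5 + b2, 2))
```

If m is not rounded before b is computed, the intercepts come out slightly different (about 16.07 and 98.51), so small discrepancies with software output are expected.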

Example 3:

First, find the slope m. Start by determining the numerator:

\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 9 \cdot 53336 - 622 \cdot 773 = -782
\]

Next, find the denominator:

\[
n\sum x^2 - \left(\sum x\right)^2 = 9 \cdot 43206 - (622)^2 = 1970
\]

Divide to obtain

\[
m = \frac{-782}{1970} \approx -0.40
\]

Now, find the y-intercept:

\[
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \approx \frac{773}{9} - (-0.40) \cdot \frac{622}{9} \approx 113.53
\]

Therefore, the equation of the regression line is ŷ = -0.40x + 113.53. Even though we found an equation, recall that the correlation between x and y in this example was weak. Thus, this regression line may not work very well for the data.


Example 4:

First, find the slope m. Start by determining the numerator:

\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 7 \cdot 1470 - 19 \cdot 565 = -445
\]

Next, find the denominator:

\[
n\sum x^2 - \left(\sum x\right)^2 = 7 \cdot 75 - (19)^2 = 164
\]

Divide to obtain

\[
m = \frac{-445}{164} \approx -2.71
\]

Now, find the y-intercept:

\[
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \approx \frac{565}{7} - (-2.71) \cdot \frac{19}{7} \approx 88.07
\]

Therefore, the equation of the regression line is ŷ = -2.71x + 88.07. Even though we found an equation, recall that the correlation between x and y in this example was weak. Thus, this regression line may not work very well for the data. For example, for a student with x = 0 absences, plugging in, we find that the grade predicted by the regression line is 88, while the student in the data with 0 absences scored 80.
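To see how poorly the line fits in Example 4, the sketch below lists the error at each data point, taking the slope as -2.71 and the intercept as 88.07 (the rounded values from the solution). The variable names are illustrative.

```python
# Data from Example 4 and the rounded regression coefficients from the solution.
absences = [1, 0, 2, 6, 4, 3, 3]
grades = [85, 80, 70, 55, 90, 90, 95]
m, b = -2.71, 88.07

for x, y in zip(absences, grades):
    predicted = m * x + b
    # The error is the observed grade minus the predicted grade.
    print(f"x = {x}: actual {y}, predicted {predicted:.2f}, error {y - predicted:+.2f}")

largest = max(abs(y - (m * x + b)) for x, y in zip(absences, grades))
print(f"largest absolute error: {largest:.2f}")
```

Several of the errors exceed 10 points, which is consistent with the weak correlation found for this data.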
