Linear Regression
What is linear regression?
- If strong linear correlation exists on a scatter diagram then the data can be modelled by a linear model
- Drawing lines of best fit by eye is not the best method as it can be difficult to judge the best position for the line
- The least squares regression line is the line of best fit that minimises the sum of the squares of the gaps between the line and the data values
- This is usually called the regression line of y on x
- It can be calculated by looking at the vertical distances between the line and the data values
- The regression line of y on x is written in the form y = ax + b
- a is the gradient of the line
- It represents the change in y for each individual unit change in x
- If a is positive this means y increases by a for a unit increase in x
- If a is negative this means y decreases by |a| for a unit increase in x
- b is the y-intercept
- It shows the value of y when x is zero
- You are expected to use your GDC to find the equation of the regression line
- Enter the bivariate data and choose the model “ax + b”
- Remember the mean point will lie on the regression line
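The calculation behind the regression line of y on x can be sketched in Python. This is a minimal illustration under stated assumptions, not the routine a GDC actually runs: the function name `least_squares_line` and the five data points are made up for the example. It uses the standard least squares formulas a = Sxy / Sxx and b = (mean of y) - a × (mean of x), then checks that the mean point lies on the line.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares regression line y = ax + b of y on x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sxy measures how x and y vary together; Sxx measures how x varies on its own
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    a = s_xy / s_xx            # gradient
    b = y_mean - a * x_mean    # y-intercept, chosen so the mean point lies on the line
    return a, b


# Illustrative data only (not taken from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
on_line = abs((a * x_mean + b) - y_mean) < 1e-9  # mean point check
print(f"y = {a:.3g}x + {b:.3g}, mean point on line: {on_line}")
```

For this made-up data the gradient is 0.6 and the y-intercept is 2.2, and the mean point (3, 4) satisfies the equation.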
How do I use a regression line?
- The equation of the regression line can be used to decide what type of correlation there is if there is no scatter diagram
- If a is positive then the data set has positive correlation
- If a is negative then the data set has negative correlation
- The equation of the regression line can also be used to predict the value of a dependent variable (y) from an independent variable (x)
- The equation should only be used to make predictions for y
- Using a y on x line to predict x is not always reliable
- Making a prediction within the range of the given data is called interpolation
- This is usually reliable
- The stronger the correlation the more reliable the prediction
- Making a prediction outside of the range of the given data is called extrapolation
- This is much less reliable
- The prediction will be more reliable if the number of data values in the original sample set is bigger
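The points above can be sketched as a small Python helper. The function names `predict` and `correlation_type`, and the line y = 2.5x + 10 with x-values between 0 and 20, are hypothetical and chosen only for illustration. The helper predicts y from x and labels the prediction as interpolation or extrapolation.

```python
def predict(a, b, x, x_min, x_max):
    """Predict y from x using y = ax + b and label the prediction type."""
    y_hat = a * x + b
    # Inside the range of the original x-values is interpolation, outside is extrapolation
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind


def correlation_type(a):
    """Read the direction of the correlation from the sign of the gradient."""
    if a > 0:
        return "positive"
    if a < 0:
        return "negative"
    return "none"  # a zero gradient means no linear correlation


# Hypothetical regression line y = 2.5x + 10 fitted to x-values between 0 and 20
a, b = 2.5, 10.0
inside = predict(a, b, 8, x_min=0, x_max=20)
outside = predict(a, b, 30, x_min=0, x_max=20)
print(correlation_type(a), inside, outside)
```

The label does not make a prediction right or wrong. It only flags when a value of x falls outside the data, where the prediction should be treated with more caution.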
Exam Tip
- Once you calculate the values of a and b, store them in your GDC
- This means you can use the full display values rather than the rounded values when using the linear regression equation to predict values
- This avoids rounding errors
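The effect of rounding too early can be seen in a short Python sketch. The coefficients below are hypothetical, standing in for the full display values a GDC might store, and `round_sig` is a helper written for this example.

```python
from math import floor, log10


def round_sig(value, sig=3):
    """Round a non-zero value to `sig` significant figures."""
    return round(value, sig - 1 - floor(log10(abs(value))))


# Hypothetical full-display coefficients, as a GDC might store them
a_full, b_full = 1.23456789, 9.87654321
a_round, b_round = round_sig(a_full), round_sig(b_full)  # 1.23 and 9.88

x = 50
y_full = a_full * x + b_full        # prediction from the stored values
y_rounded = a_round * x + b_round   # prediction from the 3 s.f. values
print(round_sig(y_full), round_sig(y_rounded))
```

With these numbers the two predictions are 71.6 and 71.4 to 3 significant figures, so rounding a and b first changes the final answer.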
Worked Example
Barry is a music teacher. For 7 students, he records the time they spend practising per week (x hours) and their score in a test (y %).
Time (x hours) | 2 | 5 | 6 | 7 | 10 | 11 | 12 |
Score (y %) | 11 | 49 | 55 | 75 | 63 | 68 | 82 |
a)
Write down the equation of the regression line of y on x, giving your answer in the form y = ax + b where a and b are constants to be found.
b)
Give an interpretation of the value of a.
c)
Another of Barry’s students practises for 15 hours a week. Estimate their score and comment on the validity of this prediction.