Fitting Linear Models to Data
Learning Objectives
By the end of this lesson, you will be able to:
- Draw and interpret scatter plots.
- Find the line of best fit using the Desmos calculator.
- Distinguish between linear and nonlinear relations.
- Use a linear model to make predictions.
Example: Using a Scatter Plot to Investigate Cricket Chirps
The table below shows the number of cricket chirps in 15 seconds, for several different air temperatures, in degrees Fahrenheit.[footnote]Selected data from http://classic.globe.gov/fsl/scientistsblog/2007/10/. Retrieved Aug 3, 2010[/footnote] Plot this data, and determine whether the data appears to be linearly related.
Chirps | 44 | 35 | 20.4 | 33 | 31 | 35 | 18.5 | 37 | 26 |
Temperature | 80.5 | 70.5 | 57 | 66 | 68 | 72 | 52 | 73.5 | 53 |
Answer: Plotting this data suggests that there may be a trend. We can see from the trend in the data that the number of chirps increases as the temperature increases. The trend appears to be roughly linear, though certainly not perfectly so.
Find the line of best fit
One way to approximate our linear function is to sketch the line that seems to best fit the data. Then we can extend the line until we can verify the y-intercept. We can approximate the slope of the line by extending it until we can estimate the [latex]\frac{\text{rise}}{\text{run}}[/latex].
Example: Finding a Line of Best Fit
Find a linear function that fits the data in the table below by "eyeballing" a line that seems to fit.
Chirps | 44 | 35 | 20.4 | 33 | 31 | 35 | 18.5 | 37 | 26 |
Temperature | 80.5 | 70.5 | 57 | 66 | 68 | 72 | 52 | 73.5 | 53 |
Answer: On a graph, we could try sketching a line. Using the starting and ending points of our hand-drawn line, (0, 30) and (50, 90), the line has a slope of [latex]m=\frac{60}{50}=1.2[/latex] and a y-intercept at 30. This gives an equation of [latex]T\left(c\right)=1.2c+30[/latex] where c is the number of chirps in 15 seconds, and T(c) is the temperature in degrees Fahrenheit. The resulting equation is represented in the graph below.
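The rise-over-run arithmetic above can be checked with a short Python sketch. The function name `line_through` is our own label for illustration; the two points are the endpoints of the hand-drawn line given in the answer.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points with different x-values."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - slope * x1     # solve y1 = slope * x1 + b for b
    return slope, intercept

# Endpoints of the hand-drawn line, written as (chirps, temperature)
m, b = line_through((0, 30), (50, 90))
print(f"T(c) = {m}c + {b}")
```

Any two points read off the sketched line would work here; points that are far apart tend to make the slope estimate less sensitive to small reading errors.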
Analysis of the Solution
This linear equation can then be used to approximate answers to various questions we might ask about the trend.
Recognizing Interpolation or Extrapolation
While the data for most examples does not fall perfectly on the line, the equation is our best guess as to how the relationship will behave outside of the values for which we have data. We use a process known as interpolation when we predict a value inside the domain and range of the data. The process of extrapolation is used when we predict a value outside the domain and range of the data.
The graph below compares the two processes for the cricket-chirp data addressed in Example: Finding a Line of Best Fit. We can see that interpolation would occur if we used our model to predict temperature when the values for chirps are between 18.5 and 44. Extrapolation would occur if we used our model to predict temperature when the values for chirps are less than 18.5 or greater than 44.
There is a difference between making predictions inside the domain and range of values for which we have data and outside that domain and range. Predicting a value outside of the domain and range has its limitations. When our model no longer applies after a certain point, it is sometimes called model breakdown. For example, predicting a cost function for a period of two years may involve examining the data where the input is the time in years and the output is the cost. But if we try to extrapolate a cost when [latex]x=50[/latex], that is, in 50 years, the model would not apply because we could not account for factors fifty years in the future.
A General Note: Interpolation and Extrapolation
Different methods of making predictions are used to analyze data.
- The method of interpolation involves predicting a value inside the domain and/or range of the data.
- The method of extrapolation involves predicting a value outside the domain and/or range of the data.
- Model breakdown occurs at the point when the model no longer applies.
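The distinction in the note above can be sketched as a simple range check in Python. The function name `classify` is hypothetical, and treating the smallest and largest observed values as "inside" the data is an assumption made for this sketch.

```python
# Observed chirp counts from the cricket table (the inputs of the model)
chirps = [44, 35, 20.4, 33, 31, 35, 18.5, 37, 26]

def classify(value, observed):
    """Label a prediction input as interpolation or extrapolation."""
    # Assumption: the endpoints of the observed range count as inside the data
    if min(observed) <= value <= max(observed):
        return "interpolation"
    return "extrapolation"

for c in (30, 10, 50):
    print(c, classify(c, chirps))
```

The check says nothing about where model breakdown occurs; it only flags inputs for which the data gives no direct support.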
Example: Understanding Interpolation and Extrapolation
Chirps | 44 | 35 | 20.4 | 33 | 31 | 35 | 18.5 | 37 | 26 |
Temperature | 80.5 | 70.5 | 57 | 66 | 68 | 72 | 52 | 73.5 | 53 |
- Would predicting the temperature when crickets are chirping 30 times in 15 seconds be interpolation or extrapolation? Make the prediction, and discuss whether it is reasonable.
- Would predicting the number of chirps crickets will make at 40 degrees be interpolation or extrapolation? Make the prediction, and discuss whether it is reasonable.
Answer:
- The number of chirps in the data provided varied from 18.5 to 44. A prediction at 30 chirps per 15 seconds is inside the domain of our data, so would be interpolation. Using our model: [latex-display]\begin{array}{l}T\left(30\right)=30+1.2\left(30\right)\hfill \\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,=66\text{ degrees}\hfill \end{array}[/latex-display] Based on the data we have, this value seems reasonable.
- The temperature values varied from 52 to 80.5. Predicting the number of chirps at 40 degrees is extrapolation because 40 is outside the range of our data. Using our model: [latex]\begin{array}{l}40=30+1.2c\hfill \\ 10=1.2c\hfill \\ c\approx 8.33\hfill \end{array}[/latex]
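Both calculations in this answer use the eyeballed model, once in the forward direction and once solved for c. A minimal Python sketch, with function names chosen here for illustration:

```python
def temperature(c):
    """Eyeballed model: temperature in degrees Fahrenheit from chirps per 15 seconds."""
    return 30 + 1.2 * c

def chirps_at(temp):
    """Invert the model: solve temp = 30 + 1.2c for c."""
    return (temp - 30) / 1.2

t30 = temperature(30)   # forward prediction (interpolation)
c40 = chirps_at(40)     # inverse prediction (extrapolation)
print(round(t30, 2), round(c40, 2))
```

Inverting the model by hand works only because it is linear with a nonzero slope, so each temperature corresponds to exactly one chirp count.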
Analysis of the Solution
Our model predicts the crickets would chirp 8.33 times in 15 seconds. While this might be possible, we have no reason to believe our model is valid outside the domain and range. In fact, generally crickets stop chirping altogether below around 50 degrees.
Try It
According to the data from the table in Example 3, predict the temperature if we counted 20 chirps in 15 seconds.
Answer: [latex-display]54^\circ \text{F}[/latex-display]
Finding the Line of Best Fit Using a Graphing Utility
While eyeballing a line works reasonably well, there are statistical techniques for fitting a line to data that minimize the differences between the line and data values.[footnote]Technically, the method minimizes the sum of the squared differences in the vertical direction between the line and the data values.[/footnote] One such technique is called least squares regression and can be computed by many graphing calculators, spreadsheets, and statistical software packages. Least squares regression is also called linear regression, and we can use Desmos to perform linear regressions.
Example: Finding a Least Squares Regression Line
Find the least squares regression line using the cricket-chirp data in the table below.
Chirps | 44 | 35 | 20.4 | 33 | 31 | 35 | 18.5 | 37 | 26 |
Temperature | 80.5 | 70.5 | 57 | 66 | 68 | 72 | 52 | 73.5 | 53 |
Answer:
- Click the plus button (add item) in the upper left corner, and select table.
- Enter chirps data in the x1 column.
- Enter temperature data in the y1 column.
x1 | 44 | 35 | 20.4 | 33 | 31 | 35 | 18.5 | 37 | 26 |
y1 | 80.5 | 70.5 | 57 | 66 | 68 | 72 | 52 | 73.5 | 53 |
- If you can't see the points on the grid, use the plus and minus buttons in the upper right-hand corner to zoom in or out on the grid, or click on the wrench and change the upper bound to 60 for x1 and to 100 for y1.
- In the empty cell below the table you created, enter the expression y1~mx1+b
- You can add labels to your graph by clicking on the wrench in the upper right hand corner, and typing them into the cells that say "add a label"
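For readers who want to see the calculation behind the regression, the following Python sketch applies the standard closed-form least squares formulas to the same table. It is an illustration of the technique, not a description of how Desmos computes its result, and the function name `least_squares` is our own.

```python
chirps = [44, 35, 20.4, 33, 31, 35, 18.5, 37, 26]
temps = [80.5, 70.5, 57, 66, 68, 72, 52, 73.5, 53]

def least_squares(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared vertical differences."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean   # the fitted line passes through the mean point
    return slope, intercept

m, b = least_squares(chirps, temps)
print(f"m = {m:.3f}, b = {b:.3f}")
```

The values of m and b play the same roles as the parameters in the Desmos expression y1~mx1+b.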
Analysis of the Solution
Desmos reports the parameters m ≈ 1.143 and b ≈ 30.281, which gives the regression line [latex]T\left(c\right)=30.281+1.143c[/latex]. Notice that this line is quite similar to the equation we "eyeballed" but should fit the data better. Notice also that using this equation would change our prediction for the temperature when hearing 30 chirps in 15 seconds from 66 degrees to:
[latex]\begin{array}{l}T\left(30\right)=30.281+1.143\left(30\right)\hfill \\ \text{ }=64.571\hfill \\ \text{ }\approx 64.6\text{ degrees}\hfill \end{array}[/latex]
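One way to see what "fits the data better" means is to compare the sum of squared vertical differences for the two lines, the quantity that least squares regression minimizes. The sketch below uses the rounded coefficients quoted in this lesson; the function name `sum_squared_error` is hypothetical.

```python
chirps = [44, 35, 20.4, 33, 31, 35, 18.5, 37, 26]
temps = [80.5, 70.5, 57, 66, 68, 72, 52, 73.5, 53]

def sum_squared_error(slope, intercept):
    """Sum of squared vertical differences between the line and the data."""
    return sum((t - (slope * c + intercept)) ** 2 for c, t in zip(chirps, temps))

sse_eyeballed = sum_squared_error(1.2, 30)          # T(c) = 1.2c + 30
sse_regression = sum_squared_error(1.143, 30.281)   # T(c) = 30.281 + 1.143c

print(f"eyeballed:  {sse_eyeballed:.2f}")
print(f"regression: {sse_regression:.2f}")
```

A smaller total indicates that, on the whole, the line passes closer to the data points in the vertical direction.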
Q & A
Will there ever be a case where two different lines will serve as the best fit for the data? No. There is only one best-fit line.
Distinguish Between Linear and Nonlinear Relations
As we saw in Example: Finding a Line of Best Fit with the cricket-chirp model, some data exhibit strong linear trends, but other data, like the final exam scores plotted by age, are clearly nonlinear. Most calculators and computer software can also provide us with the correlation coefficient, which is a measure of how closely the line fits the data. Many graphing calculators require the user to turn on a "diagnostic on" setting to find the correlation coefficient, which mathematicians label as r. The correlation coefficient provides an easy way to get an idea of how close to a line the data falls.
We should compute the correlation coefficient only for data that follows a linear pattern or to determine the degree to which a data set is linear. If the data exhibits a nonlinear pattern, the correlation coefficient for a linear regression is meaningless. To get a sense for the relationship between the value of r and the graph of the data, the image below shows some large data sets with their correlation coefficients. Remember, for all plots, the horizontal axis shows the input and the vertical axis shows the output.
A General Note: Correlation Coefficient
The correlation coefficient is a value, r, between –1 and 1.
- r > 0 suggests a positive (increasing) relationship
- r < 0 suggests a negative (decreasing) relationship
- The closer the value is to 0, the more scattered the data.
- The closer the value is to 1 or –1, the less scattered the data is.
Example: Finding a Correlation Coefficient
Calculate the correlation coefficient for cricket-chirp data in the table below.
Chirps | 44 | 35 | 20.4 | 33 | 31 | 35 | 18.5 | 37 | 26 |
Temperature | 80.5 | 70.5 | 57 | 66 | 68 | 72 | 52 | 73.5 | 53 |
Answer: Desmos provides you with the correlation coefficient when you use it to calculate a linear regression. The correlation coefficient is labeled as r = 0.951 for this dataset. This value is very close to 1, which suggests a strong increasing linear relationship. https://www.desmos.com/calculator/ruvzg6iy3o
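The value of r can also be computed directly from the standard formula for the correlation coefficient. The Python sketch below does this for the cricket data and then, as a contrast, for a small made-up data set (not from this lesson) that follows a parabola. The second case illustrates why r is not informative for nonlinear patterns: the points are perfectly related, yet the linear correlation is zero.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation coefficient r of paired data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

chirps = [44, 35, 20.4, 33, 31, 35, 18.5, 37, 26]
temps = [80.5, 70.5, 57, 66, 68, 72, 52, 73.5, 53]
r_crickets = correlation(chirps, temps)

# Hypothetical nonlinear data for contrast: y = x^2 on a symmetric interval
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_parabola = correlation(xs, ys)

print(f"crickets: r = {r_crickets:.3f}")
print(f"parabola: r = {r_parabola:.3f}")
```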
Use a Linear Model to Make Predictions
Once we determine that a set of data is linear using the correlation coefficient, we can use the regression line to make predictions. As we learned previously, a regression line is a line that is closest to the data in the scatter plot, which means that only one such line is a best fit for the data.
Example: Using a Regression Line to Make Predictions
Gasoline consumption in the United States has been steadily increasing. Consumption data from 1994 to 2004 is shown in the table below.[footnote]http://www.bts.gov/publications/national_transportation_statistics/2005/html/table_04_10.html[/footnote] Determine whether the trend is linear, and if so, find a model for the data. Use the model to predict the consumption in 2008.
Is this an interpolation or an extrapolation?
Year | '94 | '95 | '96 | '97 | '98 | '99 | '00 | '01 | '02 | '03 | '04 |
Consumption (billions of gallons) | 113 | 116 | 118 | 119 | 123 | 125 | 126 | 128 | 131 | 133 | 136 |
Answer: We can introduce a new input variable, t, representing years since 1994, which makes entering the data into Desmos easier. Read the value for b, and the value for the slope, m, from Desmos to create the equation for the regression line:
[latex]C\left(t\right)=113.318+2.209t[/latex]
The correlation coefficient was calculated to be 0.997, suggesting a very strong increasing linear trend. Using this to predict consumption in 2008, which is 14 years after 1994, [latex]\left(t=14\right)[/latex],[latex]\begin{array}{l}C\left(14\right)=113.318+2.209\left(14\right)\hfill \\ =144.244\hfill \end{array}[/latex]
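The re-indexing of years and the prediction for t = 14 can be sketched in Python as follows. The helper name `fit_line` is our own, and because the code keeps unrounded coefficients, its output may differ from the hand calculation in the last decimal places.

```python
years = list(range(1994, 2005))                       # 1994 through 2004
consumption = [113, 116, 118, 119, 123, 125, 126, 128, 131, 133, 136]

# New input variable: years since 1994
t = [year - 1994 for year in years]

def fit_line(xs, ys):
    """Least squares (slope, intercept) using sums of the raw data."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return slope, intercept

m, b = fit_line(t, consumption)
prediction_2008 = b + m * (2008 - 1994)

# A year outside the span of the data calls for extrapolation
is_extrapolation = not (min(years) <= 2008 <= max(years))

print(f"C(t) = {b:.3f} + {m:.3f}t")
print(f"C(14) = {prediction_2008:.3f}, extrapolation: {is_extrapolation}")
```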
The model predicts 144.244 billion gallons of gasoline consumption in 2008. This is an extrapolation because 2008 lies outside the years covered by the data, 1994 to 2004. The scatter plot of the data, including the least squares regression line, is shown below in Desmos. Note how we changed the viewing window for the y-axis to 100 < y < 150. https://www.desmos.com/calculator/lv27pmtdbh
Try It
Use Desmos to find a linear regression for the following data, which represents the amount of time a SCUBA diver can spend underwater as a function of the depth of the water.
Depth (feet) | Time (minutes) |
50 | 80 |
60 | 55 |
70 | 45 |
80 | 35 |
90 | 25 |
100 | 22 |
Answer: Here is a sample Desmos graph for this dataset. https://www.desmos.com/calculator/zyrpta1uls
1) The equation for the regression line is [latex]y=-1.1143x+127.24[/latex]
2) A diver can spend [latex]y=-1.1143(110)+127.24\approx 4.67[/latex] minutes at a depth of 110 feet.
3) A diver can spend [latex]y=-1.1143(120)+127.24\approx -6.48[/latex] minutes at a depth of 120 feet. This doesn't make sense because a negative value for time doesn't have any meaning.
4) To find at what depth the dive time would be zero, we need to set the regression equation equal to zero. [latex-display]\begin{array}{l}0=-1.1143x+127.24\\-127.24=-1.1143x\\114.19 = x\end{array}[/latex-display] A diver at a depth of 114.19 feet would have a dive time of 0 minutes.
Try It
Here are more data sets that you can plot in Desmos. Find a linear regression for each, then use the correlation coefficient to judge whether the data are linear.
Depth of the Columbia River | Water Velocity |
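The substitutions in this answer, and the depth at which the model reaches zero minutes, can be checked with a short Python sketch that uses the rounded regression equation quoted in part 1. The names `dive_time` and `zero_depth` are chosen here for illustration.

```python
# Rounded regression coefficients from part 1 of the answer
SLOPE = -1.1143
INTERCEPT = 127.24

def dive_time(depth_ft):
    """Predicted dive time in minutes at a given depth in feet."""
    return SLOPE * depth_ft + INTERCEPT

# Solve 0 = SLOPE * x + INTERCEPT for x
zero_depth = -INTERCEPT / SLOPE

for depth in (110, 120):
    print(depth, round(dive_time(depth), 2))
print("zero at", round(zero_depth, 2), "feet")
```

A negative predicted time is a sign of model breakdown: the line continues past the depth where the quantity it describes stops making sense.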
0.66 | 1.55 |
1.98 | 1.11 |
2.64 | 1.42 |
3.3 | 1.39 |
4.62 | 1.39 |
5.94 | 1.14 |
7.26 | 0.91 |
8.58 | 0.59 |
9.9 | 0.59 |
10.56 | 0.41 |
11.22 | 0.22 |
% of Mississippi River in Crops (By Basin) | Nitrate Concentration (mg/ L) |
2.4 | 0.647 |
1.3 | 1.062 |
14.3 | 1.432 |
0.5 | 0.579 |
45.6 | 3.561 |
46.6 | 3.938 |
1.5 | 0.927 |
53.6 | 2.549 |
4.1 | 0.357 |
3.1 | 0.245 |
Dimensions of the Lava Dome in Mt. St. Helens, t = 0 on 18 October 1980 (eruption was 18 May).
Time (days) | Volume (millions of cubic meters) |
0 | 2.9 |
70 | 13 |
109 | 28 |
173 | 40 |
242 | 56 |
322 | 64 |
376 | 75 |
547 | 88 |
603 | 100 |
699 | 115 |
872 | 152 |
922 | 154 |
1087 | 173 |
1343 | 178 |
1692 | 212 |
1858 | 243 |
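As an alternative to Desmos, the correlation coefficient for each of the three data sets above can be computed with the short Python sketch below. The dictionary keys and the function name `correlation` are labels chosen for this sketch; each pair lists the first table column followed by the second.

```python
from math import sqrt

def correlation(pairs):
    """Correlation coefficient r for a list of (x, y) pairs."""
    n = len(pairs)
    x_mean = sum(x for x, _ in pairs) / n
    y_mean = sum(y for _, y in pairs) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    sxx = sum((x - x_mean) ** 2 for x, _ in pairs)
    syy = sum((y - y_mean) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)

datasets = {
    "Columbia River": [(0.66, 1.55), (1.98, 1.11), (2.64, 1.42), (3.3, 1.39),
                       (4.62, 1.39), (5.94, 1.14), (7.26, 0.91), (8.58, 0.59),
                       (9.9, 0.59), (10.56, 0.41), (11.22, 0.22)],
    "Mississippi River": [(2.4, 0.647), (1.3, 1.062), (14.3, 1.432), (0.5, 0.579),
                          (45.6, 3.561), (46.6, 3.938), (1.5, 0.927), (53.6, 2.549),
                          (4.1, 0.357), (3.1, 0.245)],
    "Lava dome": [(0, 2.9), (70, 13), (109, 28), (173, 40), (242, 56), (322, 64),
                  (376, 75), (547, 88), (603, 100), (699, 115), (872, 152),
                  (922, 154), (1087, 173), (1343, 178), (1692, 212), (1858, 243)],
}

results = {name: correlation(pairs) for name, pairs in datasets.items()}
for name, r in results.items():
    print(f"{name}: r = {r:.3f}")
```

A value of r near 1 or –1 supports a linear model, but it is still worth looking at the scatter plot, since a curved pattern can also produce a fairly large r.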
FYI
Divers who want or need to descend to depths greater than 100 feet employ different techniques and equipment to help them safely navigate the depth. For example, different gas mixtures or rebreather equipment may be used. Mixtures of gases such as oxygen, helium, and nitrogen can help to mitigate the narcotic effects of breathing gas at great depths.[footnote]https://en.wikipedia.org/wiki/Trimix_(breathing_gas)[/footnote]
Licenses & Attributions
CC licensed content, Original
- Revision and Adaptation. Provided by: Lumen Learning License: CC BY: Attribution.
- Temperature as a Function of the Number of Cricket Chirps in a 15 Second Period Interactive. Authored by: Lumen Learning. Located at: https://www.desmos.com/calculator/ruvzg6iy3o. License: Public Domain: No Known Copyright.
- Consumption of Gas as a Function of Year Interactive. Authored by: Lumen Learning. Located at: https://www.desmos.com/calculator/lv27pmtdbh. License: Public Domain: No Known Copyright.
- Dive Time as a Function of Depth Interactive. Authored by: Lumen Learning. Located at: https://www.desmos.com/calculator/zyrpta1uls. License: Public Domain: No Known Copyright.
CC licensed content, Shared previously
- Precalculus. Provided by: OpenStax Authored by: Jay Abramson, et al.. Located at: https://openstax.org/books/precalculus/pages/1-introduction-to-functions. License: CC BY: Attribution. License terms: Download For Free at : http://cnx.org/contents/[email protected]..
- Scuba diver using rebreather with open circuit bailout cylinders returning from a 600-foot (180 m) dive. Authored by: Trevor Jackson. Located at: https://commons.wikimedia.org/w/index.php?curid=25988843. License: CC BY-SA: Attribution-ShareAlike.
- College Algebra. Provided by: OpenStax Authored by: Abramson, Jay et al.. Located at: https://openstax.org/books/college-algebra/pages/1-introduction-to-prerequisites. License: CC BY: Attribution. License terms: Download for free at http://cnx.org/contents/[email protected].