Lecture 6: Curve fitting using Excel

A note on being computing scientists and homework

So far, the Excel assignments have been with a fair bit of guidance, even with step by step recipes. However, you now have enough tools under your belt to tackle problems with out the step by step instructions. As the semester carries on, you will be doing more and more of the work with out such detailed instructions. I will try to provide some more tutorials, but I am expecting that you are reading the assignments and asking questions in class when I do the demos!

Today, we will experiment with curve fitting using Excel. First, we'll explore Excel's built-in "Trend-line" option, where we can add trend lines to data sets after a chart has been generated. Later, we will utilize an Excel tool called Solver, whereby we can fit a data set to more general functional forms, i.e., more freedom than a straight line, exponential, logarithmic, etc.

Curve fitting using the Add Trendline option when right clicking a data series point in a chart

All you have to do is to right-click a data point within a chart then select Add TrendLine. There are many options:

Let's quickly discuss this with the following spreadsheet: (download this from the web to your disk space):

L05_climate.xls

Try the linear fit. Do you think it is getting warmer? How hot will the average be in 2050? (hint: show the equation by clicking the options tab and selecting Display equation on chart. Try the ten year moving average--what do you think now about your linear projection?

If you are to be your own computing scientist, you should look into what Excel is assuming. The equations behind these options can be found from Excel's Help page:

We will talk about least squares later in the lecture.

Curve fitting using Excel's SOLVER function

If we did not want to use an equation of a line to fit to data, or any of Excel's other options -- no problem, we can use Solver to do this. To simply fit a line to some data, the Trendline function is easiest. If we wanted to fit data to some specific function not contained in Excel's Trendline options, then we'd try Solver. Solver basically will help us find solutions that best satisfy inputted constraints. Let's look at a simple example. Download the following file from the web to your disk space:

L06_Solver1.xls

This example is a simple budgeting problem that I seem to have all the time.

First, you will need to add the Solver function to your tool belt if you haven't yet: do this with the Tools>Add Ins menu. Check the Solver box. Once Solver is added, go to the Data>Solver... option to open Solver.

Here is another example:

http://garnero.asu.edu/glg410/L05_mileage.xls

While this is perhaps a silly example, it makes a couple points. For one, you can see our functional form is not of the type

y = mx + b.
Thus, we can't use the Add Trendline function to optimize a fit to our equation. First, you will need to add the Solver function to your tool belt if you haven't yet: do this with the Tools>Add Ins menu. Check the Solver box. Once Solver is added, go to the Data>Solver... option to open Solver if you have not done so already. You will see several options already chosen. We want to minimize the target cell E4 (the total cost), so make sure the Min radio button is selected. Then click Solve.

Curve fitting and Least Squares

Instead of finding a minimum of a functional form, we will find the minimum of the RMS which stands for root mean square. RMS is similar to the standard deviation:

where p is the number of observations, H_i is the observed value at x_i, and H_model(x_i) is the model (or calculated value) at x_i. Thus, if we can minimize this measure, the root of the square of the differences between model (prediction) and observation, then we will be getting a best fit according to our description of "goodness" of fit.

Today, we will use Solver to obtain a best fit to a data set of ocean floor sedimentation rates. Let's work with these data of age of sediment versus depth below the seafloor:

Depth (cm)	Sedimentation Age (years)
407	10510
545	11160
825	11730
1158	12410
1454	12585
2060	13445
2263	14685

Can we figure out the sedimentation rate from these data? Yes, if we assume a functional form (e.g., linear, quadratic, etc). For sedimentation, let's assume the sedimentation rate stays constant, thus the equation should be linear, i.e., of the form y=mx + b. To analyze these data, we will solve this problem two ways:

(1) Use Add trendline on a chart made from these sedimentation data
(2) Use Solver to do the same thing.

This will show you another approach for determining best fit models to data sets. Open a new Excel workbook, and we will put each approach on a different sheet

Automatic way: Add trendline

Copy the above data into Excel. Make a chart of the data (XY scatter plot). Fit the data with a straight line, using Excel's built-in Add Trendline function. Choose the Linear option. Also, click the "Options" tab and choose to display the equation on the chart. Here is a picture of what that might look like (you'll notice that I changed the Y-axis range so that the data span is more appropriately represented in this dimension - always de-junk your chart!):

There, that was easy. If you're just fitting a line (or one of the other automated functional forms), this is really the way to do it. Done.

Less automated, but more flexibility: Solver

To illustrate the concept of Solver, we'll solve the same problem by finding the m and b that minimizes the RMS (root mean square - a measure of the misfit of the model to the data). To do so, follow these instructions:

Somewhere on a clean worksheet (e.g., below your freshly copied data table), define two cells, one for slope "m" and another for the Y-intercept "b". Go ahead and stick some number in each (they'll get updated later) and define names for those cells: I recommend naming them m and b. Also, put your estimates in the sheet next to each other (view my spreadsheet in the next image in these notes).
Set up a column of model age estimations that uses our guess of m ands b, call this Y_calc which equals m*x + b, where the x values are the depths of the observations, and m and b are our model parameters that describe our best fitting line (see Column C in image below).
Determine the RMS between the observed and modeled values: make a column that calculates the difference between the observed ages (Y_obs) and those you just estimated from your guessed m and b (Y_calc) [my column D in image below]. Note that these differences are called the residuals. In a column to the right of this (Column E), square that difference.
Calculate the sum of those squares (thus, simply sum the values in column E), then take the square root of this number, then finally divide by the number of observations. NOTE: count the number of observations using Excel's function count(cell range). If your data set had 100's or 1000's of data, you'd surely do it this way. This is the RMS, and what we will minimize using Solver. Make sure you understand what we're doing: we are computing the misfit of each data point to your line, and then squaring this. Then... we sum these up, square root, divide by the number of observations. Thus, this is essentially a weighted average of our misfits of our model (i.e., the RMS). Hence, minimizing the RMS will minimize our model's misfit. Understand this.

Your sheet might look like this:

Finally, open Solver under the Tools menu, and set it up to make the target cell the one containing your estimation of the RMS (F9 in my sheet) a minimum (click the Min button) by varying the cells that contain m and b (the cell range that is dashed below). It might look like:

Click Solve, Then you will find that after the solver finishes searching for the best fit, it will ask you if you want to keep the result, which of course you do. It will also offer a report (usually on a separate worksheet), select the Answer sheet option.
Finally: compare your answers for the best m and b values (i.e., the ones Solver obtained) to those calculated by the Add trendline function.
Bonus: Use your final fitted equation to extrapolate to determine what age the sediments should be at 5000 cm, assuming the constant sedimentation rate.

Assignment

In the file DEWIJS.txt are a number of measurements of the percentage by volume of Zinc in rock samples from a mine area. The data come from 118 assays in 2 meter increments along a sphalerite-quartz vein. The data are ordered along the vein.
Import the data into your spreadsheet.
Make a plot of the Zn content along the vein
Study the data and compute the following: Min, Max, Mean, Mode, Median, Standard Deviation, Relative Standard Deviation.
Plot the data in a Frequency histogram. Using the drawing tools in your spreadsheet, label the Mean and Standard deviation.
Use those evaluations to describe the Zinc deposit and its potential in a few sentences narrative to a prospective buyer of the deposit.
After some mapping of the area, you determine that about 500 cubic meters of the vein can be economically mined. Using your mean and the standard deviation, determine the approximate range of likely cubic meters of Zn that will be recovered.
Upon further discussion with the financial people in your company, the estimation of how much can be extracted gets trickier:
Zn is valued at about $5,000 per cubic meter
The cost to mine it is $1,000,000 for the first 500 cubic meters.
Then, the cost goes up by each additional cubic meter exponentially. So the equation for the cost is 1,000,000+(amount mined > 500 cubic meters)².
In your spreadsheet, make a column of volume mined (go from 0 to about 7000 m³ in steps of 500 m³). To the right, make a column of Cost to mine. In the third column, compute the value of the Zinc mined. Finally, in the fourth column, compute the difference between the cost and the value.
Plot both the cost and the value on the same chart. Where do the lines cross? That is the break even point.
Using the equations you set up for the cost to mine, the value, and the difference, use the solver to figure out exactly what the break even about is. Hint: Set the difference between the cost to mine and the value to 0 by changing the volume mined. Note that sometimes Excel will not get exactly to zero, but will be at something like 10^-7 which is close enough.
If your company wants a decent profit (like $5,000,000), what should the amount mined be?

This assignment is due by 5 pm, October 1 as a link on your assignments web page.

Data Visualization Lecture

Prof. Ed Garnero

Prof. Ramón Arrowsmith

Prof. Steve Semken

GLG410/598 Computers in Geology

Last modified: September 27, 2007

GLG410/598--Computers in Geology