Announcements | Syllabus | Schedule | Weekly lecture notes | Links |
Let's take an example of the masses of pebbles on a beach. If you pick up a pebble on the beach, it is a specimen. It might not be representative of all the pebbles on the beach (population). But we cannot examine all of the pebbles on the beach, so we select a subset (10 or 100 or 1000, etc.) and that is a sample (dataset) of the population which we hope is representative. Here we will play with a small data set of pebbles. Open this web page in a browser:
Below I show a plot of the masses of the individual pebbles in our sample: (a default Excel plot). This chart was made with the Chart Wizard, choosing columns and adding labels, and gives a visual impression of the data, but can we do more? Of course!
The simplest way to look at a dataset is just to sort it. In Excel, look at Data->Sort. You can sort on multiple columns (size, then shape, etc. if they are in your spreadsheet). One really important point is that you should be aware that sorting with only one column will scramble your data. So, you should probably select all of the columns and then in the sort dialogue, choose the column(s) on which you want to sort.
In today's lecture we are continuing our introduction to Microsoft's Excel. We will cover some basic statistics in Excel: histograms and frequency. If your data set is either huge or complicated looking, what is the best approach for analyzing it? Here are some options:
Obviously, we should pursue options 2 and 3. We can apply some statistical tools to Earth and Space science problems. We should be careful however, that a statistical description of our data does not bias our view of the data. So... What is a statistic? --> An (estimated) parameter that characterizes a dataset. Wikipedia.org states:
Statistics is a mathematical science pertaining to the collection, analysis, interpretation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities; it is also used for making informed decisions in all areas of business and government.
Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics. In addition, patterns in the data may be modeled in a way that accounts for randomness and uncertainty in the observations, to draw inferences about the process or population being studied; this is called inferential statistics. Both descriptive and inferential statistics can be considered part of applied statistics. There is also a discipline of mathematical statistics, which is concerned with the theoretical basis of the subject.
The word statistics is also the plural of statistic (singular), which refers to the result of applying a statistical algorithm to a set of data. Thus we speak of employment statistics, accident statistics, etc.
When we work with with a large number of observations, it is desirable to be able to describe some of the basic characteristics of the data with single numbers that describe some essential feature of the data, specifically the distribution of numbers. Some important examples are:
Here are some basic representations of the mass of the distribution that we'll quickly play with:
Mean = (sum of all values) divided by the number of valuesHow can we improve on the pebbles chart and possibly display the distribution of pebble masses graphically? The simplest method uses the frequency histogram which allows the general properties of the distribution to be visualized. If you haven't already done it, install the Analysis Tool Pack [choose Tools>Add Ins.., then check both Analysis Tools Pack boxes], then go to the Tools/Data Anaysis... dialog box, scroll down and choose Histogram.
To construct such a plot, we need to decide on specified ranges of values (bins) within which we want to count the number of pebble masses (Frequency distribution). For example, you might bin masses from 0-100, 100-200, etc. Go ahead and make a column with values 0, 100, then drag the lower right box corner down, which will fill lower cells with that pattern; stop when an appropriate range has been created. Then when we plot the frequency distribution using a bar chart, we have a frequency histogram. Note we might also look at the cumulative frequency distribution in which we add the cumulative percentage of the total specimens that are in each bin. The curvature of the cumulative distribution plot tells us something about the dispersion of the dataset.
Here I will walk you through making this first histogram with a frequency curve, for the pebble data set. You very well may do this dozens of times this semester, so become familiar with these steps. For this example, I'm assuming you already have your pebbles data set correctly entered into a spreadsheet.
Note that the mean is not near the center of the data. These data have significant positive skewness. Positive skewness means that the tail of the distribution is longer to the right. Negative skewness is the opposite. In general, a bigger number means more asymmetry. (be careful that you know what the program is actually computing). Standard skewness is in units of the standard deviation, so if it is <1, then we are close to being symmetric. The skewness is based on the deviations cubed:
Tasks | Hints and help |
---|---|
Copy the data with the headings into Excel | Be careful that you keep the number of events smaller by trial and error. Make the start date more recent or the area smaller. On a slow machine, it can take a long time to copy from the browser and paste. Once you have pasted it into Excel, then parse using the Data->Text to columns command with Fixed Width. |
Produce a histogram of depth | See above in the lecture for how to do it. Remember you need to define your input bins. They should go from the minumum depth to the maximum depth (compute these) in a regular way (10 or 50 or 100 km bins). |
Compute mean depth, standard deviation of depth, and skewness of depth | See above lecture |
Sort by increasing depth. Make an xy map of the dots, but with the dots colored by depth (don't worry if there are none at certain depths): 0 to 35 km orange 35 to 70 km yellow 70 to 150 km green 150 to 300 km blue 300 to 500 km purple 500 to 800 km red Use a different set of ranges for the colors if your events don't go so deep. | Here is a graphic for how the sorting might look: sort.jpg. This graphic shows importantly how to work with the source data in the Chart: Click on the Chart, and the go to Chart menu->Source data: plotbydepth.gif. I have labeled the key features. Make sure you choose the right series (if there is more than one) on the lower left. Then switch the X and Y values by manually editing to which columns they point. Also, give it a name while you are here so the explanation (legend) will be correct. This graphic shows how to color the dots: colordots.jpg. Remember to just double click on one of the dots of the series you want to change. This graphic shows how to add another series: addevents.jpg. Select the group of latitude longitude pairs from the spreadsheet and copy them. Then go to the chart, click on it once and then File->Paste Special. Make sure you have selected New Series and Categories (X Values) in First Column (see addevents.jpg). My final map looks like: finalcolormap.jpg |
What do the depth statistics and your plot of the earthquake depths and positions tell you about what is happening in terms of seismic deformation in your study area? | Think about what you know about the tectonics of the study area. Recall also that the temperature of transition from brittle to ductile behavior is about 300 C. |
Plot magnitude versus day. Do you see any long term trends? Are earthquakes on the increase? | Just plot the date versus the magnitude column. One issue here is getting the axes labels right. Remember to double click on the axis (X axis) to change it from the defaults. While the dates are nicely formatted on the chart, they are just numbered in days since 1900 in the chart editor. So, figure out how many days are the earliest and latest events in your study and use them to define the axes. Also, make sure the magnitudes (Y axis) range is sufficiently narrow so it spans the data only. This graphic shows a bit about the axis editing: timelabels.jpg. This graphic shows the final result in my example: finaltime.jpg. |
Produce a histogram of magnitude with 0.1 unit bins. What is the minimum and maximum? What is the most common magnitude? Why do you think the minimum is where it is? What about the maximum? | Remember that the earthquake magnitude bins should only really go from your minimum (or slightly less) to your maximum (or slightly more). Or, you can go from magnitude 0 to magnitude 9, but no earthquakes will be beyond that range. Follow the instructions about histogram making from above in the lecture. |
It is unlikely that your earthquakes will have any interesting depth pattern north-south or east-west. If we rotate the data by enough so that you can plot a cross section through it (rotated latitude versus depth), you can then use one of the axes directions as your X axis (distance in degrees along the cross-section) versus depth. Choose an interesting direction based on what you know about the tectonics and geology of the area. It is obvious what the direction would be for a subduction zone: down the subduction zone. The result will be a cross section through the data projected to a vertical plane. What does the cross section show in terms of seismicity distribution and tectonic processes? | Use the equations we introduced in the Rotation and translation exercise. Recall that the rotation equations rotate about the 0,0 axis. So, first we need to translate the data to get it close to 0,0. Then we rotate it. Set up some new spreadsheet columns to the right of your data:
This graphic shows how I set up my spreadsheet: rotandtransspreadsheet.jpg This graphic shows how my final charts look: finaldepthxsec.jpg If either of those graphics look blurry in your browser, click on them or make then full size. They ar bigger images. |
Some of the pebbles example comes from short course notes, Statistical inference for Geology and Planetary Geology, L. S. Glaze and S. M. Baloga, 1997; and also: Waltham, D., 1994, Mathematics: a simple tool for geologists, New York: Chapman and Hall, 189 p.
Web page originally by Prof. Ramón Arrowsmith with major modifications and additions from Prof. Ed Garnero and Prof. Steve Semken