Chapter 2 Organizing/Displaying Data
2.1 Bar, Circle and Time-Series Graphs
Exploratory Data Analysis (EDA) is a method of studying data that uses tools such as stem/leaf plots and histograms. It allows for exploration, pattern finding, and observation of extreme values.
EDA is used when you have general data but are not sure where it might lead or you have few prior assumptions. This is opposed to an experiment where specific data is collected (perhaps with controls) and the observer has particular questions in mind.
Segmented bar chart
Vertical or horizontal
Quantitative or Qualitative Data
Bars of uniform width with uniform spacing between them
Lengths represent values of variables, frequency of occurrence or % of occurrence
Labeled and titled, with scales
Sometimes the scales on the sides are general, but you will also see a label on top of a bar that gives more specific information
You can change the scale by putting in a “break” on the vertical axis.
The area occupied by a part of a graph should correspond to the magnitude of value it represents. Otherwise, the picture can be misleading even though it is labeled correctly.
Figure: Average amount spent on holiday gifts per child
Each wedge displays a proportional part of the total population (that is, the percentage that gives a particular answer or shares a characteristic)
OK for qualitative data.
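As a rough illustration of how wedge sizes follow from the data, each category's share of the total can be turned into a percentage and into a wedge angle out of 360 degrees. The survey counts below are invented for this sketch and do not come from the notes.

```python
def wedge_angles(counts):
    """Return {category: (percent, degrees)} for a circle graph."""
    total = sum(counts.values())
    return {
        category: (100 * n / total, 360 * n / total)
        for category, n in counts.items()
    }

# Hypothetical survey counts, made up for illustration only.
counts = {"Red": 10, "Blue": 25, "Silver": 15}

for category, (percent, degrees) in wedge_angles(counts).items():
    print(f"{category}: {percent:.0f}% of responses, {degrees:.0f} degree wedge")
```

The percentages add to 100 and the angles add to 360, which is a quick check that every response landed in exactly one wedge.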
Data plotted in sequential order
Data sequencing is at regular intervals
Time series data must be collected for the same variable for the same subject at regular intervals over a period of time. NASDAQ, NYSE, rainfall, etc. are examples of time series data.
Make sure your graph is actually saying something…
Some examples of poor graphs
Histograms, Frequency Tables and Contingency Tables
Sometimes there is a lot of data. One way to evaluate data is to list the counts, or how many times a particular answer is given. Imagine that the senior and junior classes are asked to choose their favorite car color out of three choices. One way to show that data is a contingency table.
A contingency table is a way to display and analyze the relationship between 2 (or more) sets of categorical data.
If the data is given as percentages, it may be called a two-way frequency table.
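A minimal sketch of how such a table could be tallied from raw answers is shown here. The responses and the three color choices are hypothetical, made up only to show the mechanics; the function name `contingency_table` is also just an illustrative choice.

```python
from collections import Counter

def contingency_table(pairs):
    """Count how often each (row category, column category) pair occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Hypothetical survey responses: (class, favorite car color).
responses = (
    [("Junior", "Black")] * 4 + [("Junior", "Red")] * 6 + [("Junior", "White")] * 5
    + [("Senior", "Black")] * 7 + [("Senior", "Red")] * 3 + [("Senior", "White")] * 5
)

rows, cols, table = contingency_table(responses)
col_totals = [sum(col) for col in zip(*table)]
grand_total = len(responses)

# Print the counts with row and column totals.
print("Class".ljust(8) + "".join(c.rjust(7) for c in cols) + "Total".rjust(7))
for r, row in zip(rows, table):
    print(r.ljust(8) + "".join(str(n).rjust(7) for n in row) + str(sum(row)).rjust(7))
print("Total".ljust(8) + "".join(str(n).rjust(7) for n in col_totals)
      + str(grand_total).rjust(7))

# The same table expressed as percentages of all responses.
percent_table = [[100 * n / grand_total for n in row] for row in table]
```

Dividing by the grand total is only one option; dividing each row by its own row total instead would show the color preferences within each class.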
Great way to evaluate large quantities of data
Width of bars represents a quantitative value (the class)
Height indicates frequency (how many individuals give a response in each particular class)
Some books call the bars bins
A frequency table is used to organize data for drawing the histogram.
Class: category or interval
Class Width: width of a particular interval
Class Frequency: # of tally marks for a particular class
Lower/Upper Class Limit (LCL/UCL): lowest/highest data value that can fit in a particular class
Lower Class Limit + Class Width = smallest value for the next class
Class Boundaries: Upper Class Boundary = UCL + ½; Lower Class Boundary = LCL – ½
Class Midpoint = (LCL + UCL) / 2
Class Width = (Largest Data Value – Smallest Data Value) / Desired # of Classes
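The definitions above can be tried out on a small data set. The sketch below assumes whole-number data and assumes that the class width quotient is always bumped up to the next whole number so the last class reaches the largest value; that rounding rule is an assumption of this example, and the choice of four classes is arbitrary.

```python
def frequency_table(data, num_classes):
    """Build classes for whole-number data and tally each class."""
    smallest, largest = min(data), max(data)
    # Assumption: increase the quotient to the next whole number so that
    # the last class always includes the largest data value.
    width = (largest - smallest) // num_classes + 1
    table = []
    for i in range(num_classes):
        lcl = smallest + i * width   # lower class limit
        ucl = lcl + width - 1        # upper class limit
        table.append({
            "limits": (lcl, ucl),
            "boundaries": (lcl - 0.5, ucl + 0.5),
            "midpoint": (lcl + ucl) / 2,
            "frequency": sum(lcl <= x <= ucl for x in data),
        })
    return width, table

data = [25, 26, 30, 31, 32, 33, 34, 35, 35, 40, 41, 44]
width, table = frequency_table(data, num_classes=4)

print("class width:", width)
for row in table:
    print(row["limits"], row["boundaries"], row["midpoint"], row["frequency"])
```

With these twelve values the range is 44 – 25 = 19, so four classes give 19 / 4 = 4.75 and a class width of 5. The classes run 25–29, 30–34, 35–39 and 40–44.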
On the TI-83, enter the data by hand into L1 – press “STAT”, “EDIT” and start typing into List 1.
Hit “2nd” “StatPlot”; turn “ON” the stat plot on Plot1 and select the histogram picture. The TI-83 should automatically select the correct list. If it doesn’t, change it by pressing “2nd” and then the list name you want (the list names are printed above the number keys).
Hit Graph – you will see a histogram.
Go to “WINDOW” and change Xscl to the class width; that forces the histogram to match your choice of classes.
“Trace” then allows you to see class information.
In Excel, enter all the data by hand.
Select “Tools”, “Data Analysis”, “Histogram” (Excel might need to install the data analysis package first – let it).
Input range is your range of data values
Bin range is the list that you create somewhere else in your table that lists the maximum value for each class. This will force it to make the # of classes you want.
Then click OK. It will put it on another worksheet in your file.
An ogive is a line graph that shows the accumulation (the cumulative frequency) at each level.
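To make the idea of accumulation concrete, here is a small sketch that turns class frequencies into the running totals an ogive would plot. The four classes and their frequencies are illustrative values chosen for this example, and `ogive_points` is a hypothetical helper name.

```python
from itertools import accumulate

def ogive_points(lowest_boundary, upper_boundaries, frequencies):
    """Pair each class boundary with the count of values below it."""
    cumulative = list(accumulate(frequencies))
    # The graph starts at zero on the lowest class boundary.
    return [(lowest_boundary, 0)] + list(zip(upper_boundaries, cumulative))

# Illustrative classes 25-29, 30-34, 35-39, 40-44 with made-up frequencies.
upper_boundaries = [29.5, 34.5, 39.5, 44.5]
frequencies = [2, 5, 2, 3]

for boundary, total in ogive_points(24.5, upper_boundaries, frequencies):
    print(f"{total:2d} values fall below {boundary}")
```

Each point is plotted above an upper class boundary, and the last running total always equals the total number of data values.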
Another way to display data in a histogram-like way without losing the actual individual data values is a stem/leaf display. It is arranged so that the stem is the left digit(s) and the leaves are the right digit(s).
Data such as 25, 26, 30, 31, 32, 33, 34, 35, 35, 40, 41, 44 would display like this:
2 | 5 6
3 | 0 1 2 3 4 5 5
4 | 0 1 4
Turn that data table sideways and it looks like a histogram – the class with more entries (higher frequency) extends further right.
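A stem/leaf display is also easy to build with a short script. The sketch below assumes non-negative whole numbers, uses the ones digit as the leaf and everything to its left as the stem, and lists stems with no leaves so gaps in the data stay visible. The function name `stem_and_leaf` is just a label chosen for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines: stem = leading digit(s), leaf = ones digit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Walk every stem from lowest to highest, including empty ones.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    return lines

data = [25, 26, 30, 31, 32, 33, 34, 35, 35, 40, 41, 44]
for line in stem_and_leaf(data):
    print(line)
```

The row for stem 3 is the longest because the 30s hold the most values, which is the same information a histogram bar would carry.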
Let's turn our day 1 pulse exercise into a stem/leaf display.
A dot plot is a little more basic. One axis is the individuals (or perhaps a basic count value), the other is the quantitative data values, and a dot represents each data value.
These are the Kentucky Derby winning times from 1875 through 2004.
Any idea why there are two clusters? (Hint: something happened in 1896, and it has nothing to do with steroids)
I found this guy’s blog that was really interesting. He was musing over his iPod playlist and wondered how many times some of his songs had played.
“I exported the Library and wrote some python scripts to extract data …It turns out I have 208 unplayed songs in my library, and additionally lots of low single digit playcount songs. Here’s an (ugly excel generated) histogram”
“While I was delving around, I figured I would see if there’s any correlation between the length of time a song has been in my library, and the number of times it’s been played. The dot plot turned out interesting.”
Bock, D., Velleman, P., De Veaux, R., Stats: Modeling the World, 2nd Edition, Boston: Pearson Addison Wesley, p. 49