Health & Environmental Research Online (HERO)

Book/Book Chapter 
Comparing data distributions 
Chambers, JM; Cleveland, WS; Kleiner, B; Tukey, JA 
Wadsworth International Group, Duxbury Press 
Belmont, California; Boston, Massachusetts 
Graphical methods for data analysis 
In many applications we have two or several groups of observations rather than a single set, and the goal of the analysis is to compare the distributions of the groups. For instance, we can again consider the gross national products of all countries in the United Nations in 1980, but separated into northern hemisphere and southern hemisphere countries. Probably the simplest comparison is to determine whether the "typical" value for one group is above or below the "typical" value for the other; however, much more detailed comparisons are possible and often needed. Virtually any of the distributional questions posed for one group in Chapter 2 can be asked of two or more groups in comparison to each other.

Graphical methods can be used for making such distributional comparisons. We begin below by describing the empirical quantile-quantile plot. Then we discuss how the displays of Chapter 2 for each data set can be combined to allow effective visual comparisons. Finally, we show how certain kinds of derived plots based on differences and ratios can enhance our ability to perceive structure in the data.
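
As a rough illustration of the empirical quantile-quantile idea, the sketch below pairs the quantiles of two data sets so they could be plotted against each other. It is not taken from the chapter: the function names `quantile` and `empirical_qq`, the plotting positions (i - 0.5)/n, and the linear interpolation used when the two sets differ in size are assumptions reflecting one common convention, and the sample values are invented purely for demonstration.

```python
def quantile(sorted_vals, p):
    """Interpolated quantile of an already sorted list.

    Assumes the i-th sorted value (1-based) sits at fraction (i - 0.5) / n
    and interpolates linearly between neighbouring values.
    """
    n = len(sorted_vals)
    pos = p * n - 0.5              # fractional 0-based index for fraction p
    if pos <= 0:
        return sorted_vals[0]
    if pos >= n - 1:
        return sorted_vals[-1]
    lo = int(pos)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[lo + 1] - sorted_vals[lo])


def empirical_qq(x, y):
    """Return (x quantile, y quantile) pairs for an empirical Q-Q plot."""
    xs, ys = sorted(x), sorted(y)
    if len(xs) == len(ys):
        # Equal sizes: simply pair the sorted values.
        return list(zip(xs, ys))
    if len(xs) < len(ys):
        # Use the smaller set's values; interpolate matching quantiles of y.
        n = len(xs)
        return [(xs[i], quantile(ys, (i + 0.5) / n)) for i in range(n)]
    # Otherwise y is the smaller set; interpolate matching quantiles of x.
    n = len(ys)
    return [(quantile(xs, (i + 0.5) / n), ys[i]) for i in range(n)]


# Invented values, for illustration only (not data from the book).
group_a = [2.0, 1.0]
group_b = [40.0, 10.0, 30.0, 20.0]
pairs = empirical_qq(group_a, group_b)

# Derived quantities of the kind used in difference and ratio plots.
differences = [qb - qa for qa, qb in pairs]
ratios = [qb / qa for qa, qb in pairs]
```

Plotting the second element of each pair against the first gives the quantile-quantile display; points lying along the line y = x would suggest similar distributions, while `differences` and `ratios` are the raw material for derived plots.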

One example that will be used to illustrate the methodology of this chapter is a set of data from a cloud-seeding experiment described by Simpson, Olsen, and Eden (1975). Rainfall was measured from 52 clouds, of which 26 were chosen randomly to be seeded with silver iodide. The data are the amounts of rainfall in acre-feet from the 52 clouds, and the objective is to describe the effect that seeding has on rainfall. The data for this and the two examples described below are given in the Appendix.

A second example is the average monthly temperatures in degrees Fahrenheit from January 1964 to December 1973 in Newark, New Jersey, and in Lincoln, Nebraska. Both the Newark and Lincoln data sets have 120 observations.

A third example is the maximum daily atmospheric ozone concentrations in Stamford, Connecticut, described in Chapter 2, together with a similar set of ozone measurements from Yonkers, New York, for the same time period. Although there are 136 Stamford values, the Yonkers data set has 148 observations, since Yonkers has fewer missing values. 
Wadsworth statistics/probability series