
Chapter 1

Background

Statistics is the art of summarizing data, depicting data, and extracting information from it. Statistics and the theory of probability are distinct subjects, although statistics depends on probability to quantify the strength of its inferences. The probability used in this course will be developed in Chapter 3 and throughout the text as needed. We begin by introducing some basic ideas and terminology.

1.1 Populations, Samples and Variables

A population is a set of individual elements whose collective properties are the subject of investigation. Usually, populations are large collections whose individual members cannot all be examined in detail. In statistical inference a manageable subset of the population is selected according to certain sampling procedures and properties of the subset are generalized to the entire population. These generalizations are accompanied by statements quantifying their accuracy and reliability. The selected subset is called a sample from the population.

Examples:

(a) the population of registered voters in a congressional district, (b) the population of U.S. adult males, (c) the population of currently enrolled students at a certain large urban university, (d) the population of all transactions in the U.S. stock market for the past month, (e) the population of all peak temperatures at points on the Earth’s surface over a given time interval.

Some samples from these populations might be: (a) the voters contacted in a pre-election telephone poll, (b) adult males interviewed by a TV reporter, (c) the dean’s list, (d) transactions recorded on the books of Smith Barney, (e) peak temperatures recorded at several weather stations.

Clearly, for these particular samples, some generalizations from sample to population would be highly questionable.


A population variable is an attribute that has a value for each individual in the population. In other words, it is a function from the population to some set of possible values. It may be helpful to imagine a population as a spreadsheet with one row or record for each individual member. Along the ith row, the values of a number of attributes of the ith individual are recorded in different columns. The column headings of the spreadsheet can be thought of as the population variables. For example, if the population is the set of currently enrolled students at the urban university, some of the variables are academic classification, number of hours currently enrolled, total hours taken, grade point average, gender, ethnic classification, major, and so on. Variables, such as these, that are defined for the same population are said to be jointly observed or jointly distributed.

1.2 Types of Variables

Variables are classified according to the kinds of values they have. The three basic types are numeric variables, factor variables, and ordered factor variables. Numeric variables are those for which arithmetic operations such as addition and subtraction make sense. Numeric variables are often related to a scale of measurement and expressed in units, such as meters, seconds, or dollars. Factor variables are those whose values are mere names, to which arithmetic operations do not apply. Factors usually have a small number of possible values. These values might be designated by numbers. If they are, the numbers that represent distinct values are chosen merely for convenience. The values of factors might also be letters, words, or pictorial symbols. Factor variables are sometimes called nominal variables or categorical variables. Ordered factor variables are factors whose values are ordered in some natural and important way. Ordered factors are also called ordinal variables. Some textbooks have a more elaborate classification of variables, with various subtypes. The three types above are enough for our purposes.

Examples: Consider the population of students currently enrolled at a large university. Each student has a residency status, either resident or nonresident. Residency status is an unordered factor variable. Academic classification is an ordered factor with values “freshman”, “sophomore”, “junior”, “senior”, “post-baccalaureate” and “graduate student”. The number of hours enrolled is a numeric variable with integer values. The distance a student travels from home to campus is a numeric variable expressed in miles or kilometers. Home area code is an unordered factor variable whose values are designated by numbers.
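The three variable types can be sketched in R, the package used in this text. The data frame below is a tiny made-up "population spreadsheet"; the column names and values are assumptions chosen only for illustration.

```r
# Made-up records: one row per student, one column per population variable.
students <- data.frame(
  residency = factor(c("resident", "nonresident", "resident", "resident")),
  classification = factor(c("junior", "freshman", "senior", "junior"),
                          levels = c("freshman", "sophomore", "junior", "senior",
                                     "post-baccalaureate", "graduate student"),
                          ordered = TRUE),
  hours = c(15L, 12L, 9L, 16L),
  area_code = factor(c(713, 281, 713, 832))   # numbers used as mere names
)

str(students)                         # shows the type of each column
students$classification < "senior"    # order comparisons make sense for ordered factors
mean(students$hours)                  # arithmetic makes sense for numeric variables
```

Storing the area code as a factor keeps R from treating it as a quantity, so a meaningless operation such as averaging area codes is not silently carried out.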

1.3 Random Experiments and Sample Spaces

An experiment can be something as simple as flipping a coin or as complex as conducting a public opinion poll. A random experiment is one with the following two characteristics:

(1) The experiment can be replicated an indefinite number of times under essentially the same experimental conditions.

(2) There is a degree of uncertainty in the outcome of the experiment. The outcome may vary from replication to replication even though experimental conditions are the same.


When we say that an experiment can be replicated under the same conditions, we mean that controllable or observable conditions that we think might affect the outcome are the same. There may be hidden conditions that affect the outcome, but we cannot account for them. Implicit in (1) is the idea that replications of a random experiment are independent, that is, the outcomes of some replications do not affect the outcomes of others. Obviously, a random experiment is an idealization of a real experiment. Some simple experiments, such as tossing a coin, approach this ideal closely while more complicated experiments may not.

The sample space of a random experiment is the set of all its possible outcomes. We use the Greek capital letter Ω (omega) to denote the sample space. There is some degree of arbitrariness in the description of Ω. It depends on how the outcomes of the experiment are represented symbolically.

Examples:

(a) Toss a coin. Ω = {H,T}, where “H” denotes a head and “T” a tail. Another way of representing the outcome is to let the number 1 denote a head and 0 a tail (or vice versa). If we do this, then Ω = {0, 1}. In the latter representation the outcome of the experiment is just the number of heads.

(b) Toss a coin 5 times, i.e., replicate the experiment in (a) 5 times. An outcome of this experiment is a 5-term sequence of heads and tails. A typical outcome might be indicated by (H,T,T,H,H), or by (1,0,0,1,1). Even for this little experiment it is cumbersome to list all the outcomes, so we use a shorter notation:

$$\Omega = \{(x_1, x_2, x_3, x_4, x_5) \mid x_i = 0 \text{ or } x_i = 1 \text{ for each } i\}.$$

(c) Select a student randomly from the population of all currently enrolled students. The sample space is the same as the population. The word “randomly” is vague. We will define it later.

(d) Repeat the Michelson-Morley experiment to measure the speed of the Earth relative to the ether (which doesn’t exist, as we now know). The outcome of the experiment could conceivably be any nonnegative number, so we take Ω = [0,∞) = {x | x is a real number and x ≥ 0}. Uncertainty arises from the fact that this is a very delicate experiment with several sources of unpredictable error.
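The sample spaces in examples (a) and (b) are small enough to build explicitly. The following is a minimal sketch in R, assuming the 0/1 coding of tails and heads; the object names are hypothetical. The last line simulates one replication with `sample()`, which by default gives each outcome the same chance. That is only one possible reading of "randomly", which the text defines later.

```r
# Example (a): one toss, coded 1 = head, 0 = tail.
omega_a <- c(0, 1)

# Example (b): every 5-term sequence of 0s and 1s, one outcome per row.
omega_b <- expand.grid(rep(list(omega_a), 5))
names(omega_b) <- paste0("x", 1:5)

nrow(omega_b)      # number of outcomes in the sample space
head(omega_b, 3)   # the first few outcomes

# One simulated replication of experiment (b).
sample(omega_a, size = 5, replace = TRUE)
```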

1.4 Computing in Statistics

Even moderately large data sets cannot be managed effectively without a computer and computer software. Furthermore, much of applied statistics is exploratory in nature and cannot be carried out by hand, even with a calculator. Spreadsheet programs, such as Microsoft Excel, are designed to manipulate data in tabular form and have functions for performing the common tasks of statistics. In addition, many add-ins are available, some of them free, for enhancing the graphical and statistical capabilities of spreadsheet programs. Some of the exercises and examples in this text make use of Excel with its built-in data analysis package. Because it is so common in the business world, it is important for students to have some experience with Excel or a similar program.

The disadvantages of spreadsheet programs are their dependence on the spreadsheet data format with cell ranges as input for statistical functions, their lack of flexibility, and their relatively poor graphics. Many highly sophisticated packages for statistics and data analysis are available. Some of the best known commercial packages are Minitab, SAS, SPSS, Splus, Stata, and Systat. The package used in this text is called R. It is an open source implementation of the same language used in Splus and may be downloaded free at

http://www.r-project.org .

Rstudio, an integrated development environment for R, is available at

http://www.rstudio.com .

Rstudio makes importing data into R much easier and makes it easier to integrate R output with other programs. Detailed instructions on using R and Rstudio for the exercises will be provided.

Data files used in this course are from four sources. Some are local in origin and come from student or course data at the University of Houston. Others are simulated but made to look as realistic as possible. These and others are available at

http://www.math.uh.edu/~charles/data .

Many data sets are included with R in the datasets library and other contributed packages. We will refer to them frequently. The main external sources of data are the data archives maintained by the Journal of Statistics Education,

www.amstat.org/publications/jse ,

and the Statistical Science Web,

http://www.stasci.org/datasets.html .

1.5 Exercises

1. Go to http://www.math.uh.edu/~charles/data. Examine the data set “Air Pollution Filter Noise”. Identify the variables and give their types.

2. Highlight the data in Air Pollution Filter Noise. Include the column headings but not the language preceding the column headings. Copy and paste the data into a plain text file, for example with Notepad in Windows. Import the text file into Excel or another spreadsheet program. Create a new folder or directory named “math3339” and save both files there.

3. Start R by double clicking on the big blue R icon on your desktop. Click on the file menu at the top of the R GUI window. Select “change dir...”. In the window that opens next, find the name of the directory where you saved the text file and double click on the name of that directory. Suppose that you named your file “apfilternoise”. (Name it anything you like.) Import the file into R with the command


and display it with the command

> apfilternoise

Click on the file menu at the top again and select “Exit”. At the prompt to save your workspace, click “Yes”. If you open the folder where your work was saved you will see another big blue R icon. If you double click on it, R will start again and your previously saved workspace will be restored.

If you use Rstudio for this exercise you can import apfilternoise into R by clicking on the “Import Dataset” tab. This will open a window on your file system and allow you to select the file you saved in Exercise 2. The dialog box allows you to rename the data and make other minor changes before importing the data as a data frame in R.
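One common way to carry out the import step of Exercise 3 at the command line is `read.table`. The sketch below is an assumption about how the file was saved, not the author's exact command: it supposes a whitespace-delimited text file named apfilternoise.txt whose first line holds the column headings. To stay self-contained it first writes a small stand-in file with made-up columns and values; with the real file you would skip that step and give its name directly.

```r
# Stand-in for the text file saved in Exercise 2 (hypothetical columns and values).
path <- file.path(tempdir(), "apfilternoise.txt")
writeLines(c("noise size type",
             "810 1 1",
             "820 1 2",
             "840 2 1"), path)

# header = TRUE tells read.table that the first line contains column names.
apfilternoise <- read.table(path, header = TRUE)

apfilternoise        # display the data frame
str(apfilternoise)   # one line per variable, with its storage type
```

If the working directory has already been changed to the folder containing the file, the file name alone, e.g. `read.table("apfilternoise.txt", header = TRUE)`, is enough.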

4. If you are using Rstudio, click on the “Packages” tab and then the word “datasets”. Find the data set “airquality” and click on it. Read about it. If you are using R alone, type

> help(airquality)

at the command prompt > in the Console window.

Then type

> airquality

to view the data. Could “Month” and “Day” be considered ordered factors rather than numeric variables?

5. A random experiment consists of throwing a standard 6-sided die and noting the number of spots on the upper face. Describe the sample space of this experiment.

6. An experiment consists of replicating the experiment in exercise 5 four times. Describe the sample space of this experiment. How many possible outcomes does this experiment have?


Chapter 2

Descriptive and Graphical Statistics

A large part of a statistician’s job consists of summarizing and presenting important features of data. Simply looking at a spreadsheet with 1000 rows and 50 columns conveys very little information. Most likely, the user of the data would rather see numerical and graphical summaries of how the values of different variables are distributed and how the variables are related to each other. This chapter concerns some of the most important ways of summarizing data.

2.1 Location Measures

2.1.1 The Mean

Suppose that x is the name of a numeric variable whose values are recorded either for the entire population or for a sample from that population. Let the n recorded values of x be denoted by x1, x2, . . . , xn. These are not necessarily distinct numbers. The mean or average of these values is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

When the values of x for the entire population are included, it is customary to denote this quantity by µ(x) and call it the population mean. The mean is called a location measure partly because it is taken as a representative or central value of x. More importantly, it behaves in a certain way if we change the scale of measurement for values of x. Imagine that x is temperature recorded in degrees Celsius and we decide to change the unit of measurement to degrees Fahrenheit. If yi denotes the Fahrenheit temperature of the ith individual, then yi = 1.8xi + 32. In effect, we have defined a new variable y by the equation y = 1.8x + 32. The means of the new and old variables have the same relationship as the individual measurements have:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n} (1.8x_i + 32) = 1.8\bar{x} + 32.$$

In general, if a and b > 0 are constants and y = a + bx, then ȳ = a + bx̄. Other location measures introduced below behave in the same way.
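The behavior of location measures under a change of units can be checked numerically. This is a minimal sketch in R with made-up Celsius temperatures; the variable names are hypothetical.

```r
celsius <- c(18.5, 21.0, 23.4, 19.8, 25.1)   # made-up temperatures
fahrenheit <- 1.8 * celsius + 32             # y = a + b x with a = 32, b = 1.8

# Transforming the data and then averaging ...
mean(fahrenheit)
# ... agrees with averaging first and then transforming.
1.8 * mean(celsius) + 32

# The median, another location measure, behaves the same way.
median(fahrenheit)
1.8 * median(celsius) + 32
```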


When there are repeated values of x, there is an equivalent formula for the mean. Let the m distinct values of x be denoted by v1, . . . , vm. Let ni be the number of times vi is repeated and let fi = ni/n. Note that $\sum_{i=1}^{m} n_i = n$ and $\sum_{i=1}^{m} f_i = 1$. Then the average is given by

$$\bar{x} = \sum_{i=1}^{m} f_i v_i.$$

The number ni is the frequency of the value vi and fi is its relative frequency.
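The frequency form of the mean can be reproduced in R with `table`, which counts how often each distinct value occurs. The data are made up for illustration.

```r
x <- c(2, 3, 3, 5, 5, 5, 8)   # made-up values with repetitions

counts <- table(x)                   # frequencies n_i of the distinct values
v <- as.numeric(names(counts))       # distinct values v_i
f <- as.vector(counts) / length(x)   # relative frequencies f_i

sum(f)       # the relative frequencies add up to 1
sum(f * v)   # mean computed from relative frequencies
mean(x)      # mean computed from the raw values
```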

2.1.2 The Median and Other Quantiles

Let x be a numeric variable with values x1, x2, . . . , xn. Arrange the values in increasing order x(1) ≤ x(2) ≤ . . . ≤ x(n). The median of x is a number median(x) such that at least half the values of x are ≤ median(x) and at least half the values of x are ≥ median(x). This conveys the essential idea but unfortunately it may define an interval of numbers rather than a single number. The ambiguity is usually resolved by taking the median to be the midpoint of that interval. Thus, if n is odd, n = 2k+1, where k is a positive integer,

$$\mathrm{median}(x) = x_{(k+1)},$$

while if n is even, n = 2k,

$$\mathrm{median}(x) = \frac{x_{(k)} + x_{(k+1)}}{2}.$$

Let p ∈ (0, 1) be a number between 0 and 1. The pth quantile of x is more commonly known as the 100pth percentile; e.g., the 0.8 quantile is the same as the 80th percentile. We define it as a number q(x, p) such that the fraction of values of x that are ≤ q(x, p) is at least p and the fraction of values of x that are ≥ q(x, p) is at least 1−p. For example, at least 80 percent of the values of x are ≤ the 80th percentile of x and at least 20 percent of the values of x are ≥ its 80th percentile. Again, this may not define a unique number q(x, p). Software packages have rules for resolving the ambiguity, but the details are usually not important.

The median is the 50th percentile, i.e., the 0.5 quantile. The 25th and 75th percentiles are called the first and third quartiles. The 10th, 20th, 30th, etc. percentiles are called the deciles. The median is a location measure as defined in the preceding section.
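In R the functions `median` and `quantile` compute these quantities. The sketch below uses made-up data with an even number of values. The `type` argument of `quantile` selects among the nine rules R offers for resolving the ambiguity; types 7 (R's default) and 2 are shown only as two examples, and their answers need not agree.

```r
x <- c(7, 1, 4, 9, 3, 8)   # made-up data, n = 6 (even)

sort(x)
median(x)          # midpoint of the 3rd and 4th ordered values
quantile(x, 0.5)   # the 0.5 quantile is the median

# First and third quartiles under two different rules.
q7 <- quantile(x, c(0.25, 0.75), type = 7)   # R's default rule
q2 <- quantile(x, c(0.25, 0.75), type = 2)   # an alternative rule
q7
q2

# Check the defining property for the type 2 first quartile:
# at least 25% of values are <= it and at least 75% are >= it.
c(mean(x <= q2[1]), mean(x >= q2[1]))
```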

2.1.3 Trimmed Means

Trimmed means of a variable x are obtained by finding the mean of the values of x excluding a given percentage of the largest and smallest values. For example, the 5% trimmed mean is the mean of the values of x excluding the largest 5% of the values and the smallest 5% of the values. In other words, it is the mean of all the values between the 5th and 95th percentiles of x. A trimmed mean is a location measure.
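R's `mean` function has a `trim` argument giving the fraction of values to drop from each end. The sketch below, on made-up data, compares it with a manual calculation that sorts the values and discards the extremes.

```r
x <- c(1:19, 1000)     # made-up data, n = 20, with one extreme value
n <- length(x)
k <- floor(0.05 * n)   # number of values dropped from each end (here 1)

manual <- mean(sort(x)[(k + 1):(n - k)])   # drop the k smallest and k largest
builtin <- mean(x, trim = 0.05)            # 'trim' is the fraction cut from EACH end

c(manual = manual, builtin = builtin, untrimmed = mean(x))
```

The untrimmed mean is pulled far upward by the single value 1000, while the trimmed mean is not.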


2.1.4 Grouped Data

Sometimes large data sets are summarized by grouping values. Let x be a numeric variable with values x1, x2, . . . , xn. Let c0 < c1 < . . . < cm be numbers such that all the values of x are between c0 and cm. For each i, let ni be the number of values of x (including repetitions) that are in the interval (ci−1, ci], i.e., the number of indices j such that ci−1 < xj ≤ ci. A frequency table of x is a table showing the class intervals (ci−1, ci] along with frequencies ni with which the data values fall into each interval. Sometimes additional columns are included showing the relative frequencies fi = ni/n, the cumulative relative frequencies $F_i = \sum_{j \le i} f_j$, and the midpoints of the intervals.

Example 2.1. The data below are 50 measured reaction times in response to a sensory stimulus, arranged in increasing order. A frequency table is shown below the data.

0.12 0.30 0.35 0.37 0.44 0.57 0.61 0.62 0.71 0.80
0.88 1.02 1.08 1.12 1.13 1.17 1.21 1.23 1.35 1.41
1.42 1.42 1.46 1.50 1.52 1.54 1.60 1.61 1.68 1.72
1.86 1.90 1.91 2.07 2.09 2.16 2.17 2.20 2.29 2.32
2.39 2.47 2.60 2.86 3.43 3.43 3.77 3.97 4.54 4.73

Interval  Midpoint  ni   fi    Fi
(0,1]     0.5       11   0.22  0.22
(1,2]     1.5       22   0.44  0.66
(2,3]     2.5       11   0.22  0.88
(3,4]     3.5        4   0.08  0.96
(4,5]     4.5        2   0.04  1.00
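A frequency table like this one can be produced in R with `cut`, which assigns each value to a class interval, and `table`, which counts them. This is a sketch; the vector name `times` is a hypothetical stand-in for the reaction time data of Example 2.1.

```r
times <- c(0.12, 0.30, 0.35, 0.37, 0.44, 0.57, 0.61, 0.62, 0.71, 0.80,
           0.88, 1.02, 1.08, 1.12, 1.13, 1.17, 1.21, 1.23, 1.35, 1.41,
           1.42, 1.42, 1.46, 1.50, 1.52, 1.54, 1.60, 1.61, 1.68, 1.72,
           1.86, 1.90, 1.91, 2.07, 2.09, 2.16, 2.17, 2.20, 2.29, 2.32,
           2.39, 2.47, 2.60, 2.86, 3.43, 3.43, 3.77, 3.97, 4.54, 4.73)

# By default cut() uses right-closed intervals (0,1], (1,2], ..., (4,5].
bins <- cut(times, breaks = 0:5)

n_i <- as.vector(table(bins))   # absolute frequencies
f_i <- n_i / length(times)      # relative frequencies

data.frame(interval = levels(bins),
           midpoint = 0:4 + 0.5,
           n_i = n_i,
           f_i = f_i,
           F_i = cumsum(f_i))   # cumulative relative frequencies
```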

If only a frequency table like the one above is given, the mean and median cannot be calculated exactly. However, they can be estimated. If we take the midpoint of an interval as a stand-in for all the values in that interval, then we can use the formula in the preceding section for calculating a mean with repeated values. Thus, in the example above, we would estimate the mean as

0.22(0.5) + 0.44(1.5) + 0.22(2.5) + 0.08(3.5) + 0.04(4.5) = 1.78

Estimating the median is a bit more difficult. By examining the cumulative frequencies Fi, we see that 22% of the data is less than or equal to 1 and 66% of the data is less than or equal to 2. Therefore, the median lies between 1 and 2. That is, it is 1 + a certain fraction of the distance from 1 to 2. A reasonable guess at that fraction is given by linear interpolation between the cumulative frequencies at 1 and 2. In other words, we estimate the median as

$$1 + \frac{0.50 - 0.22}{0.66 - 0.22}\,(2 - 1) = 1.636.$$

A cruder estimate of the median is just the midpoint of the interval that contains the median, in this case 1.5. We leave it as an exercise to calculate the mean and median from the data of Example 2.1 and to compare them to these estimates.
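The two grouped-data estimates can be written as a short R calculation that uses only the class boundaries and frequencies from the table in Example 2.1. The variable names are hypothetical, and the interpolation step is a direct transcription of the formula above.

```r
breaks <- 0:5
n_i <- c(11, 22, 11, 4, 2)   # frequencies from the table in Example 2.1
f_i <- n_i / sum(n_i)        # relative frequencies
F_i <- cumsum(f_i)           # cumulative relative frequencies
mid <- head(breaks, -1) + diff(breaks) / 2   # interval midpoints

# Mean: treat every value in an interval as if it sat at the midpoint.
est_mean <- sum(f_i * mid)

# Median: find the first interval whose cumulative frequency reaches 0.5,
# then interpolate linearly between its end points.
k <- which(F_i >= 0.5)[1]
F_left <- if (k == 1) 0 else F_i[k - 1]
est_median <- breaks[k] +
  (0.5 - F_left) / (F_i[k] - F_left) * (breaks[k + 1] - breaks[k])

c(est_mean = est_mean, est_median = est_median)
```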


2.1.5 Histograms

The figure below is a histogram of the reaction times.

> hist(reacttimes$Times,breaks=0:5,xlab="Reaction Times",main=" ")

[Figure: absolute frequency histogram of the reaction times. Horizontal axis: Reaction Times, 0 to 5. Vertical axis: Frequency, 0 to 20.]

The histogram is a graphical depiction of the grouped data. The end points ci of the class intervals are shown on the horizontal axis. This is an absolute frequency histogram because the heights of the vertical bars above the class intervals are the absolute frequencies ni. A relative frequency histogram would show the relative frequencies fi. A density histogram has bars whose heights are the relative frequencies divided by the lengths of the corresponding class intervals. Thus, in a density histogram the area of the bar is equal to the relative frequency. If all class intervals have the same length, these types of histograms all have the same shape and convey the same visual information.
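The three kinds of bar heights can be compared numerically with `hist(..., plot = FALSE)`, which returns the counts and densities without drawing anything. The data below are made up, and the last class interval is deliberately twice as long as the others so that the density heights differ in shape from the frequencies.

```r
x <- c(0.4, 0.9, 1.2, 1.3, 1.7, 2.4, 2.8, 3.5, 4.6, 4.9)   # made-up data
h <- hist(x, breaks = c(0, 1, 2, 3, 5), plot = FALSE)      # last interval has length 2

width <- diff(h$breaks)                # lengths of the class intervals
rel_freq <- h$counts / sum(h$counts)   # relative frequencies

h$counts                 # heights in an absolute frequency histogram
rel_freq                 # heights in a relative frequency histogram
h$density                # heights in a density histogram
sum(h$density * width)   # the bar areas of a density histogram add up to 1
```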


2.1.6 Robustness

A robust measure of location is one that is not affected by a few extremely large or extremely small values. Values of a numeric variable that lie a great distance from most of the other values are called outliers. Outliers might be the result of mistakes in measuring or recording data, perhaps from misplacing a decimal point. The mean is not a robust location measure. It can be affected significantly by a single extreme outlier if that outlying value is extreme enough. Thus, if there is any doubt about the quality of the data, the median or a trimmed mean might be preferred to the mean as a reliable location measure. The median is very insensitive to outliers. A 5% trimmed mean is insensitive to outliers that make up no more than 5% of the data values.
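The effect of a single recording error on the three location measures can be seen in a small experiment. The values below are made up; the "error" multiplies the largest value by 100, as a misplaced decimal point would.

```r
x <- c(1.2, 1.5, 1.7, 1.9, 2.0, 2.1, 2.3, 2.4, 2.6, 2.9,
       3.0, 3.1, 3.3, 3.4, 3.6, 3.8, 4.0, 4.1, 4.3, 4.5)   # made-up data, n = 20
x_bad <- x
x_bad[20] <- 450   # 4.5 recorded with the decimal point misplaced

rbind(clean   = c(mean = mean(x), median = median(x),
                  trimmed = mean(x, trim = 0.05)),
      outlier = c(mean(x_bad), median(x_bad),
                  mean(x_bad, trim = 0.05)))
```

With 20 values, the 5% trimmed mean discards one value from each end, so the single outlier is removed before averaging.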

2.1.7 The Five Number Summary

The five number summary is a convenient way of summarizing numeric data. The five numbers are the minimum value, the first quartile (25th percentile), the median, the third quartile (75th percentile), and the maximum value. Sometimes the mean is also included, which makes it a six number summary.

Example 2.2. The natural logarithms y of the data values x in Example 2.1 are, to two places:

-2.12 -1.20 -1.05 -0.99 -0.82 -0.56 -0.49 -0.48 -0.34 -0.22
-0.13  0.02  0.08  0.11  0.12  0.16  0.19  0.21  0.30  0.34
 0.35  0.35  0.38  0.40  0.42  0.43  0.47  0.48  0.52  0.54
 0.62  0.64  0.65  0.73  0.74  0.77  0.78  0.79  0.83  0.84
 0.87  0.90  0.96  1.05  1.23  1.23  1.33  1.38  1.51  1.55

It is sometimes advantageous to transform data in some way, i.e., to define a new variable y as a function of the old variable x. In this case, we have transformed the reaction times x with the natural logarithm transformation. We might want to do this so that we can more easily apply certain statistical inference procedures you will learn about later. The six number summary of the transformed data y is:

> summary(log(reacttimes$Times))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-2.12000  0.08605  0.42520  0.33710  0.78500  1.55400
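Besides `summary`, R provides `fivenum`, which returns exactly five numbers. The sketch below applies both to made-up values. Note that `fivenum` uses "hinges" for the quartiles, one of the several acceptable quartile rules, so its second and fourth numbers can differ slightly from those reported by `summary` or `quantile`.

```r
y <- c(-2.1, -0.8, -0.3, 0.1, 0.2, 0.4, 0.5, 0.7, 0.9, 1.5)   # made-up values

fn <- fivenum(y)   # minimum, lower hinge, median, upper hinge, maximum
fn

quantile(y, c(0, 0.25, 0.5, 0.75, 1))   # the same idea with quantile()'s default rule

s <- summary(y)    # adds the mean: the "six number summary"
s
```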

2.1.8 The Mode

The mode of a variable is its most frequently occurring value. With numeric variables the mode is less important than the mean and median for descriptive purposes or for statistical inference. For factor variables the mode is the most natural way of choosing a “most representative” value. We hear this frequently in the media, in statements such as “Financial problems are the most common cause of marital strife”. For grouped numeric data the modal class interval is the class interval having the highest absolute or relative frequency. In Example 2.1, the modal class interval is the interval (1,2].
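R has no built-in function for the statistical mode (its `mode` function reports how an object is stored), but the mode is easy to read off a frequency table. The factor below is made up for illustration; the second part uses the frequencies from the table in Example 2.1.

```r
major <- factor(c("math", "biology", "math", "physics", "biology", "math"))   # made-up
counts <- table(major)
counts

# All most frequent values; more than one is returned if there is a tie.
modal_value <- names(counts)[as.vector(counts == max(counts))]
modal_value

# Modal class interval from the frequency table of Example 2.1.
n_i <- c("(0,1]" = 11, "(1,2]" = 22, "(2,3]" = 11, "(3,4]" = 4, "(4,5]" = 2)
modal_class <- names(which.max(n_i))
modal_class
```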

2.1.9 Exercises

1. Find the mean and median of the reaction time data in Example 2.1.


2. Find the quartiles of the reaction time data. There is more than one acceptable answer.

3. The 40th value x40 of the reaction time data has a value of 2.32. Replace that with 232.0. Recalculate the mean and median. Comment.

4. Construct a frequency table like the one in Example 2.1 for the log-transformed reaction times of Example 2.2. Use 5 class intervals of equal length beginning at -3 and ending at 2. Draw an absolute frequency histogram.

5. Estimate the mean and median of the grouped log-transformed reaction times by using the techniques discussed in Example 2.1. Compare your answers to the summary in Example 2.2.

6. Repeat exercises 1, 2, and the histogram of exercise 4 by using R.

7. Let x be a numeric variable with values x1, . . . , xn−1, xn. Let x̄n be the average of all n values and let x̄n−1 be the average of x1, . . . , xn−1. Show that

$$\bar{x}_n = \left(1 - \frac{1}{n}\right)\bar{x}_{n-1} + \frac{1}{n}x_n.$$

What happens if xn → ∞ while all the other values of x are fixed?

2.2 Measures of Variability or Scale

2.2.1 The Variance and Standard Deviation

Let x be a population variable with values x1, x2, . . . , xn. Some of the values might be repeated. The variance of x is

$$\mathrm{var}(x) = \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu(x))^2.$$

The standard deviation of x is $\mathrm{sd}(x) = \sigma = \sqrt{\mathrm{var}(x)}$.

When x1, x2, . . . , xn are values of x from a sample rather than the entire population, we modify the definition of the variance slightly, use a different notation, and call these objects the sample variance and standard deviation.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}.$$

The reason for modifying the definition for the sample variance has to do with its properties as an estimate of the population variance.
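The difference between the two definitions is easy to see in R, whose built-in `var` and `sd` functions use the sample (n − 1) version. The values are made up and the variable names are hypothetical.

```r
x <- c(4, 7, 7, 9, 13)   # made-up values
n <- length(x)

pop_var  <- mean((x - mean(x))^2)            # population form: divide by n
samp_var <- sum((x - mean(x))^2) / (n - 1)   # sample form: divide by n - 1

c(pop_var, sqrt(pop_var))     # sigma^2 and sigma
c(samp_var, sqrt(samp_var))   # s^2 and s
c(var(x), sd(x))              # R's built-ins match the sample version

# The two versions differ only by the factor (n - 1) / n.
var(x) * (n - 1) / n
```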


Alternate algebraically equivalent formulas for the variance and sample variance are

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu(x)^2,$$

$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right).$$

These are sometimes easier to use for hand computation.
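A quick numerical check of the alternate formulas, on made-up values, is sketched below. One caution for computer work: in floating point arithmetic the alternate forms subtract two large, nearly equal numbers when the mean is large compared with the spread, and can then lose accuracy, so the definitions are usually the safer choice in code.

```r
x <- c(4, 7, 7, 9, 13)   # made-up values
n <- length(x)

pop_short  <- sum(x^2) / n - mean(x)^2              # alternate population form
samp_short <- (sum(x^2) - n * mean(x)^2) / (n - 1)  # alternate sample form

c(pop_short, mean((x - mean(x))^2))   # agrees with the definition
c(samp_short, var(x))                 # agrees with R's sample variance
```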

The standard deviation σ is called a measure of scale because of the way it behaves under linear transformations of the data. If a new variable y is defined by y = a+ bx, where a and b are constants, sd(y) = |b|sd(x). For example, the standard deviation of Fahrenheit temperatures is 1.8 times the standard deviation of Celsius temperatures. The transformation y = a + bx can be thought of as a rescaling operation, or a choice of a different system of measurement units, and the standard deviation takes account of it in a natural way.
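The scaling rule sd(y) = |b|sd(x) can be verified in R. The temperatures are made up; the second comparison uses a negative b to show why the absolute value is needed.

```r
celsius <- c(18.5, 21.0, 23.4, 19.8, 25.1)   # made-up temperatures

# Celsius to Fahrenheit: a = 32, b = 1.8. The shift a has no effect.
sd(1.8 * celsius + 32)
1.8 * sd(celsius)

# A transformation with negative slope, b = -2: the scale factor is |b| = 2.
sd(10 - 2 * celsius)
2 * sd(celsius)
```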

2.2.2 The Coefficient of Variation

For a variable that has only positive values, it may be more important to measure the relative variability than the absolute variability. That is, the amount of variation should be compared to the mean value of the variable. The coefficient of variation for a population variable is defined as

$$\mathrm{cv}(x) = \frac{\mathrm{sd}(x)}{\mu(x)}.$$

For a sample of values of x we substitute the sample standard deviation s and the sample average x̄.
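Because both the standard deviation and the mean are multiplied by b when y = bx with b > 0, the coefficient of variation does not depend on the unit of measurement. A short sketch with made-up weights, using the sample versions:

```r
weights_kg <- c(61, 70, 74, 82, 95)   # made-up positive values
weights_lb <- 2.20462 * weights_kg    # the same weights in pounds

cv_kg <- sd(weights_kg) / mean(weights_kg)
cv_lb <- sd(weights_lb) / mean(weights_lb)

c(cv_kg, cv_lb)   # the two coefficients of variation agree
```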

2.2.3 The Mean and Median Absolute Deviation

Suppose that you must choose a single number c to represent all the values of a variable x as accurately as possible. One measure of the overall error with which c represents the values of x is

$$g(c) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - c)^2}.$$

In the exercises, you are asked to show that this expression is minimized when c = x̄. In other words, the single number which most accurately represents all the values is, by this criterion, the mean of the variable. Furthermore, the minimum possible overall error, by this criterion, is the standard deviation. However, this is not the only reasonable criterion. Another is

$$h(c) = \frac{1}{n}\sum_{i=1}^{n} |x_i - c|.$$

It can be shown that this criterion is minimized when c = median(x). The minimum value of h(c) is called the mean absolute deviation from the median. It is a scale measure which is somewhat more robust (less affected by outliers) than the standard deviation, but still not very robust. A related very robust measure of scale is the median absolute deviation from the median, or mad:

$$\mathrm{mad}(x) = \mathrm{median}(|x - \mathrm{median}(x)|).$$
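The two error criteria of this section can be explored numerically. The sketch below evaluates g and h on a fine grid of candidate values c and reports where each is smallest; the data are made up and the grid search is only an illustration, not a proof. It also computes the median absolute deviation from the median directly. Note that R's built-in `mad` function multiplies by the constant 1.4826 by default, so `constant = 1` is needed to get the unscaled median of the absolute deviations.

```r
x <- c(2, 3, 5, 8, 40)   # made-up values with one large outlier

g <- function(center) sqrt(mean((x - center)^2))   # root mean squared error
h <- function(center) mean(abs(x - center))        # mean absolute error

grid <- seq(min(x), max(x), by = 0.01)
c_g <- grid[which.min(sapply(grid, g))]   # grid point minimizing g
c_h <- grid[which.min(sapply(grid, h))]   # grid point minimizing h

c(c_g, mean(x))     # g is smallest at (about) the mean
c(c_h, median(x))   # h is smallest at (about) the median

mean_abs_dev <- h(median(x))              # mean absolute deviation from the median
mad_raw <- median(abs(x - median(x)))     # median absolute deviation from the median
c(mean_abs_dev, mad_raw, mad(x, constant = 1))
```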

2.2.4 The Interquartile Range

The interquartile range of a variable x is the difference between its 75th and 25th percentiles.

IQR(x) = q(x, .75)− q(x, .25).

It is a robust measure of scale which is important in the construction and interpretation of boxplots, discussed below.

All of these measures of scale are valid for comparison of the “spread” or variability of numeric variables about a central value. In general, the greater their values, the more spread out the values of the variable are. Of course, the standard deviation, median absolute deviation, and interquartile range of a variable will be different numbers and one must be careful to compare like measures.
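The scale measures can be computed side by side in R. The sketch below uses made-up values, once with and once without a large outlier, to show that the measures give different numbers and respond differently to the outlier. `IQR` uses `quantile`'s default rule for the quartiles, and `mad` is called with `constant = 1` to turn off R's default scaling factor.

```r
x_out <- c(2, 3, 5, 8, 40)   # made-up values with one large outlier
x_in  <- c(2, 3, 5, 8, 9)    # the same values with the outlier replaced

scale_measures <- function(v) {
  c(sd = sd(v), IQR = IQR(v), mad = mad(v, constant = 1))
}

rbind(with_outlier = scale_measures(x_out),
      without      = scale_measures(x_in))

# IQR() is the difference between the 75th and 25th percentiles.
unname(diff(quantile(x_out, c(0.25, 0.75))))
```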

2.2.5 Boxplots

Boxplots are also called box and whisker diagrams. Essentially, a boxplot is a graphical representation of the five number summary. The boxplot below depicts the sensory response data of the preceding section without the log transformation.

> boxplot(reacttimes$Times,horizontal=TRUE,xlab="Reaction Times")
> summary(reacttimes)
     Times
 Min.   :0.120
 1st Qu.:1.090
 Median :1.530
 Mean   :1.742
 3rd Qu.:2.192
 Max.   :4.730

[Figure: horizontal boxplot of the reaction times. Horizontal axis: Reaction Times, 0 to 4.]

The central box in the diagram encloses the middle 50% of the numeric data. Its left and right boundaries mark the first and third quartiles. The boldface middle line in the box marks the median of the data. Thus, the interquartile range is the distance between the left and right boundaries of the central box. For construction of a boxplot, an outlier is defined as a data value whose distance from the nearest quartile is more than 1.5 times the interquartile range. Outliers are indicated by isolated points (tiny circles in this boxplot). The dashed lines extending outward from the quartiles are called the whiskers. They extend from the quartiles to the most extreme values in either direction that are not outliers.
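The numbers behind a boxplot can be inspected with `boxplot.stats`, the function R's `boxplot` relies on. The sketch below applies it to the reaction times of Example 2.1; the vector name `times` is a hypothetical stand-in for the data. R uses hinges for the box boundaries, which are one acceptable version of the quartiles, so they can differ slightly from the quartiles printed by `summary`.

```r
times <- c(0.12, 0.30, 0.35, 0.37, 0.44, 0.57, 0.61, 0.62, 0.71, 0.80,
           0.88, 1.02, 1.08, 1.12, 1.13, 1.17, 1.21, 1.23, 1.35, 1.41,
           1.42, 1.42, 1.46, 1.50, 1.52, 1.54, 1.60, 1.61, 1.68, 1.72,
           1.86, 1.90, 1.91, 2.07, 2.09, 2.16, 2.17, 2.20, 2.29, 2.32,
           2.39, 2.47, 2.60, 2.86, 3.43, 3.43, 3.77, 3.97, 4.54, 4.73)

s <- boxplot.stats(times)   # coef = 1.5 is the default outlier multiplier

s$stats   # lower whisker end, lower hinge, median, upper hinge, upper whisker end
s$out     # data values flagged as outliers

# The outlier rule written out by hand.
box_width <- s$stats[4] - s$stats[2]
times[times > s$stats[4] + 1.5 * box_width | times < s$stats[2] - 1.5 * box_width]
```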

This boxplot shows a number of interesting things about the response time data.

(a) The median is about 1.5. The interquartile range is slightly more than 1.

(b) The three largest values are outliers. They lie a long way from most of the data. They might call for special investigation or explanation.


(c) The distribution of values is not symmetric about the median. The values in the lower half of the data are more crowded together than those in the upper half. This is shown by comparing the distances from the median to the two quartiles, by the lengths of the whiskers and by the presence of outliers at the upper end.

The asymmetry of the distribution of values is also evident in the histogram of the preceding section.

2.2.6 Exercises

1. Find the variance and standard deviation of the response time data. Treat it as a sample from a larger population.

2. Find the interquartile range and the median absolute deviation for the response time data.

3. In the response time data, replace the value x40 = 2.32 by 232.0. Recalculate the standard deviation, the interquartile range and the median absolute deviation and compare with the answers from problems 1 and 2.

4. Make a boxplot of the log-transformed reaction time data. Is the transformed data more symmetrically distributed than the original data?

5. Show that the function g(c) in section 2.2.3 is minimized when c = µ(x). Hint: Minimize $g(c)^2$.

6. Find the variance, standard deviation, IQR, mean absolute deviation and median absolute deviation of the variable “Ozone” in the data set “airquality”. Use R or Rstudio. You can address the variable Ozone directly if you attach the airquality data frame to the search path as follows:

> attach(airquality)

The R functions you will need are “sd” for standard deviation, “var” for variance, “IQR” for the interquartile range, and “mad” for the median absolute deviation. Because Ozone has missing values (NA), give each of these functions the argument na.rm=TRUE. There is no built-in function in R for the mean absolute deviation, but it is easy to obtain it.

> mean(abs(Ozone-median(Ozone,na.rm=TRUE)),na.rm=TRUE)

2.3 Jointly Distributed Variables

When two or more variables are jointly distributed, or jointly observed, it is important to understand how they are related and how closely they are related. We will first consider the case where one variable is numeric and the other is a factor.


2.3.1 Side by Side Boxplots

Boxplots are particularly useful in quickly comparing the values of two or more sets of numeric data with a common scale of measurement and in investigating the relationship between a factor variable and a numeric variable. The figure below compares placement test scores for each of the letter grades in a sample of 179 students who took a particular math course in the same semester under the same instructor. The two jointly observed population variables are the placement test score and the letter grade received. The figure separates test scores according to the letter grade and shows a boxplot for each group of students. One would expect to see a decrease in the median test score as the letter grade decreases and that is confirmed by the picture. However, the decrease in median test scores from a letter grade of B to a grade of F is not very dramatic, especially compared to the size of the IQRs. This suggests that the placement test is not especially good at predicting a student’s final grade in the course. Notice the two outliers. The outlier for the “W” group is clearly a mistake in recording data because the scale of scores only went to 100.


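[Figure: side-by-side boxplots of placement test score (vertical axis "Test", scale from 40 to 120) for each letter grade A, B, C, D, F, and W.]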

2.3.2 Scatterplots

Suppose x and y are two jointly distributed numeric variables. Whether we consider the entire population or a sample from the population, we have the same number n of observed values for each variable. If we plot the n points (x1, y1), (x2, y2), . . . , (xn, yn) in a Cartesian plane, we obtain a scatterplot or a scatter diagram of the two variables. Below are the first 6 rows of the ”Payroll” data set. The column labeled ”payroll” is the total monthly payroll in thousands of dollars for each company listed. The column ”employees” is the number of employees in each company and ”industry” indicates which of two related industries the company is in. A scatterplot of all 50 values of the two variables ”payroll” and ”employees” is also shown.

> Payroll[1:6,]

  payroll employees industry
1  190.67        85        A
2  233.58       109        A
3  244.04       130        B
4  351.41       166        A
5  298.60       154        B
6  241.43       124        B

> attach(Payroll)

> plot(payroll~employees,col=industry)

[Figure: scatterplot of payroll (vertical axis, 150 to 350) against employees (horizontal axis, 50 to 150), with points colored by industry.]

The scatterplot shows that in general the more employees a company has, the higher its monthly payroll. Of course this is expected. It also shows that the relationship between the number of employees and the payroll is quite strong. For any given number of employees, the variation in payrolls for that number is small compared to the overall variation in payrolls for all employment levels. In this plot, the data from industry A is in black and that from industry B is red. The plot shows that for employees ≥ 100, payrolls for industry A are generally greater than those for industry B at the same level of employment.


2.3.3 Covariance and Correlation

If x and y are jointly distributed numeric variables, we define their covariance as

$$\operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu(x))(y_i - \mu(y)).$$

If x and y come from samples of size n rather than the whole population, replace the denominator n by n − 1 and the population means µ(x), µ(y) by the sample means x̄, ȳ to obtain the sample covariance. The sign of the covariance reveals something about the relationship between x and y. If the covariance is negative, values of x greater than µ(x) tend to be accompanied by values of y less than µ(y). Values of x less than µ(x) tend to go with values of y greater than µ(y), so x and y tend to deviate from their means in opposite directions. If cov(x, y) > 0, they tend to deviate in the same direction. The strength of these tendencies is not expressed by the covariance because its magnitude depends on the variability of each of the variables about its mean. To correct this, we divide each deviation in the sum by the standard deviation of the variable. The resulting quantity is called the correlation between x and y:

$$\operatorname{cor}(x, y) = \frac{\operatorname{cov}(x, y)}{\operatorname{sd}(x)\,\operatorname{sd}(y)}.$$

The correlation between payroll and employees in the example above is 0.9782 (97.82 %).
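The definitions above can be checked numerically. The following is a minimal sketch that assumes the built-in R data set `cars` (columns `speed` and `dist`) as stand-in data, in case the Payroll data is not at hand. The helper names `pop_cov` and `pop_sd` are illustrative and are not part of R. R's own `cov`, `sd` and `cor` use the sample denominator n − 1; that factor cancels in the correlation, so both conventions give the same value.

```r
# Stand-in data (an assumption): the built-in 'cars' data set
x <- cars$speed
y <- cars$dist
n <- length(x)

# Population covariance and standard deviation (denominator n)
pop_cov <- function(x, y) mean((x - mean(x)) * (y - mean(y)))
pop_sd  <- function(x) sqrt(pop_cov(x, x))

# Sample covariance (denominator n - 1), which is what cov() computes
samp_cov <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation built from the population formulas
r_pop <- pop_cov(x, y) / (pop_sd(x) * pop_sd(y))

# Compare with the built-in functions
print(c(by_hand = samp_cov, built_in = cov(x, y)))
print(c(by_hand = r_pop, built_in = cor(x, y)))
```

The two printed pairs should agree, which illustrates that the choice between n and n − 1 matters for the covariance but not for the correlation.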

Theorem 2.1. The correlation between x and y satisfies −1 ≤ cor(x, y) ≤ 1. cor(x, y) = 1 if and only if there are constants a and b > 0 such that y = a+ bx. cor(x, y) = −1 if and only if y = a+ bx with b < 0.

A correlation close to 1 indicates a strong positive relationship (tending to vary in the same direction from their means) between x and y while a correlation close to −1 indicates a strong negative relationship. A correlation close to 0 indicates that there is little or no linear relationship between x and y. In this case, x and y are said to be (nearly) uncorrelated. There might still be a relationship between x and y, but it would be nonlinear. The picture below shows a scatterplot of two variables that are clearly related but very nearly uncorrelated.

> xs=runif(500,0,3*pi)

> ys=sin(xs)+rnorm(500,0,.15)

> cor(xs,ys)

[1] 0.004200081

> plot(xs,ys)


[Figure: scatterplot of ys (vertical axis, −1.0 to 1.5) against xs (horizontal axis, 0 to 8).]

Some sample scatterplots of variables with different population correlations are shown below.


[Figure: four sample scatterplots in panels titled cor(x,y)=0, cor(x,y)=0.3, cor(x,y)=−0.5, and cor(x,y)=0.9.]

2.3.4 Exercises

1. With the Air Pollution Filter Noise data, construct side by side boxplots of the variable NOISE for the different levels of the factor SIZE. Comment. Do the same for NOISE and TYPE.

2. With the Payroll data, construct side by side boxplots of "employees" versus "industry" and "payroll" versus "industry". Are these boxplots as informative as the color coded scatterplot in Section 2.3.2?

3. If you are using Rstudio click on the "Packages" tab, then the checkbox next to the library MASS. Click on the word MASS and then the data set "mammals" and read about it. If you are using R alone, in the Console window at the prompt > type

> data(mammals,package="MASS")

View the data with


> mammals

Make a scatterplot with the following commands and comment on the result.

> attach(mammals)

> plot(body,brain)

Also make a scatterplot of the log transformed body and brain weights.

> plot(log(body),log(brain))

A recently discovered hominid species, Homo floresiensis, had an estimated average body weight of 25 kg. Based on the scatterplots, what would you guess its brain weight to be?

4. Let x and y be jointly distributed numeric variables and let z = a + by, where a and b are constants. Show that cov(x, z) = b ∗ cov(x, y). Show that if b > 0, cor(x, z) = cor(x, y). What happens if b < 0?


Chapter 3

Probability

3.1 Basic Definitions. Equally Likely Outcomes

Let a random experiment with sample space Ω be given. Recall from Chapter 1 that Ω is the set of all possible outcomes of the experiment. An event is a subset of Ω. A probability measure is a function which assigns numbers between 0 and 1 to events. If the sample space Ω, the collection of events, and the probability measure are all specified, they constitute a probability model of the random experiment.

The simplest probability models have a finite sample space Ω. The collection of events is the collection of all subsets of Ω and the probability of an event is simply the proportion of all possible outcomes that correspond to that event. In such models, we say that the experiment has equally likely outcomes. If the sample space has N elements, then each elementary event {ω} consisting of a single outcome has probability 1/N. If E is a subset of Ω, then

$$\Pr(E) = \frac{\#(E)}{N}.$$

Here we introduce some notation that will be used throughout this text. The probability measure for a random experiment is most often denoted by the abbreviation Pr, sometimes with subscripts. Events will be denoted by upper case Latin letters near the beginning of the alphabet. The expression #(E) denotes the number of elements of the subset E.

Example 3.1. The Payroll data consists of 50 observations of 3 variables, ”payroll”, ”employees” and ”industry”. Suppose that a random experiment is to choose one record from the Payroll data and suppose that the experiment has equally likely outcomes. Then, as the summary below shows, the probability that industry A is selected is

$$\Pr(\text{industry} = A) = \frac{27}{50} = 0.54.$$

> summary(Payroll)

    payroll        employees      industry
 Min.   :129.1   Min.   : 26.00   A:27
 1st Qu.:167.8   1st Qu.: 71.25   B:23
 Median :216.1   Median :108.50
 Mean   :228.2   Mean   :106.42
 3rd Qu.:287.8   3rd Qu.:143.25
 Max.   :354.8   Max.   :172.00

In this example we use another common and convenient notational convention. The event whose probability we want is described in quasi-natural language as "industry=A" rather than with the formal but too cumbersome {ω ∈ Payroll|industry(ω) = A}. The description "industry=A" refers to the set of all possible outcomes of the experiment for which the variable "industry" has the value "A". This sort of informal description of an event will be used again and again.
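The same kind of calculation can be tried on data that ships with R. The sketch below assumes the built-in `airquality` data frame as the sample space, with one record chosen at random and all records equally likely; the event "Month = 5" and the variable names are illustrative choices.

```r
# Sample space: the rows of the built-in 'airquality' data frame
N <- nrow(airquality)

# Event E = "Month = 5": count the outcomes in E, then divide by N
count_E <- sum(airquality$Month == 5)
pr_E <- count_E / N

# The mean of a logical vector gives the same proportion directly
pr_E_alt <- mean(airquality$Month == 5)

print(c(N = N, count = count_E, probability = pr_E))
```

Using `mean()` on a logical condition is a compact way to write #(E)/N whenever the outcomes are the rows of a data frame.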

The assumption of equally likely outcomes is an assumption about the selection procedure for obtaining one record from the data. It is conceivable that a selection method is employed for which this assumption is not valid. If so, we should be able to discover that it is invalid by replicating the experiment sufficiently many times. This is a basic principle of classical statistical inference. It relies on a famous result of mathematical probability theory called the law of large numbers. One version of it is loosely stated as follows:

Law of Large Numbers: Let E be an event associated with a random experiment and let Pr be the probability measure of a true probability model of the experiment. Suppose the experiment is replicated n times and let P̂r(E) = (1/n) × #(replications in which E occurs). Then P̂r(E) → Pr(E) as n → ∞.

P̂r(E) is called the empirical probability of E.
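A small simulation can make the law of large numbers concrete. The sketch below assumes a basic experiment of rolling two fair dice and the event E = "the sum is 7", whose probability under equally likely outcomes is 6/36. The seed and the numbers of replications are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

# Replicate the basic experiment n times
n <- 100000
red   <- sample(1:6, n, replace = TRUE)
green <- sample(1:6, n, replace = TRUE)
occurs <- (red + green == 7)   # TRUE when E occurs on a replication

# Empirical probability after the first k replications, for several k
k <- c(10, 100, 1000, 10000, 100000)
p_hat <- cumsum(occurs)[k] / k
print(data.frame(replications = k, empirical = p_hat, true = 1/6))
```

The empirical probabilities for small k can be far from 1/6, while those for large k should be close to it.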

3.2 Combinations of Events

Events are related to other events by familiar set operations. Let E1, E2, . . . be a finite or infinite sequence of events. The union of E1 and E2 is the event

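$$E_1 \cup E_2 = \{\omega \in \Omega \mid \omega \in E_1 \text{ or } \omega \in E_2\}.$$

More generally,

$$\bigcup_i E_i = E_1 \cup E_2 \cup \cdots = \{\omega \in \Omega \mid \omega \in E_i \text{ for some } i\}.$$

The intersection of E1 and E2 is the event

$$E_1 \cap E_2 = \{\omega \in \Omega \mid \omega \in E_1 \text{ and } \omega \in E_2\},$$

and, in general,

$$\bigcap_i E_i = E_1 \cap E_2 \cap \cdots = \{\omega \in \Omega \mid \omega \in E_i \text{ for all } i\}.$$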


Sometimes we omit the intersection symbol ∩ and simply conjoin the symbols for the events in an intersection. In other words,

E1E2 . . . En = E1 ∩ E2 ∩ . . . ∩ En.

The complement of the event E is the event

$$\sim\!E = \{\omega \in \Omega \mid \omega \notin E\}.$$

∼E occurs if and only if E does not occur. The event E1∼E2 occurs if and only if E1 occurs and E2 does not occur.

Finally, the entire sample space Ω is an event with complement φ, the empty event. The empty event never occurs. We need the empty event because it is possible to formulate a perfectly sensible description of an event which happens never to be satisfied. For example, if Ω = Payroll the event ”employees < 25” is never satisfied, so it is the empty event.

We also have the subset relation between events. E1 ⊆ E2 means that if E1 occurs, then E2 occurs, or in more familiar language, E1 is a subset of E2. For any event E, it is true that φ ⊆ E ⊆ Ω. E2 ⊇ E1 means the same as E1 ⊆ E2.
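For a finite sample space, these operations correspond directly to R's set functions `union`, `intersect` and `setdiff`. The sketch below assumes a single roll of a six-sided die as the experiment; the events E1 and E2 are illustrative choices.

```r
# Sample space for one roll of a six-sided die (an illustrative choice)
Omega <- 1:6
E1 <- c(2, 4, 6)   # "the roll is even"
E2 <- c(4, 5, 6)   # "the roll is at least 4"

union(E1, E2)        # union of E1 and E2:        2 4 6 5
intersect(E1, E2)    # intersection E1E2:         4 6
setdiff(Omega, E1)   # complement of E1:          1 3 5
setdiff(E1, E2)      # E1 occurs, E2 does not:    2

# Subset relation: every outcome of E1E2 is also an outcome of E1
all(intersect(E1, E2) %in% E1)
```

The order of elements in the results is not meaningful; only membership matters.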

3.2.1 Exercises

1. A random experiment consists of throwing a pair of dice, say a red die and a green die, simultaneously. They are standard 6-sided dice with one to six dots on different faces. Describe the sample space.

2. For the same experiment, let E be the event that the sum of the numbers of spots on the two dice is an odd number. Write E as a subset of the sample space, i.e., list the outcomes in E.

3. List the outcomes in the event F = ”the sum of the spots is a multiple of 3”.

4. Find ∼F , E ∪ F , EF = E ∩ F , and E∼F .

5. Assume that the outcomes of this experiment are equally likely. Find the probability of each of the events in Exercise 4.

6. Show that for any events E1 and E2, if E1 ⊆ E2 then ∼E2 ⊆ ∼E1.

7. Load the "mammals" data set into your R workspace. In Rstudio you can click on the "Packages" tab and then on the checkbox next to MASS. Without Rstudio, type

> data(mammals,package="MASS")

Attach the mammals data frame to your R search path with

> attach(mammals)


A random experiment is to choose one of the species listed in this data set. All outcomes are equally likely. You can obtain a list of the species in the event ”body > 200” with the command

> subset(mammals,body>200)

What is the probability of this event, i.e., what is the probability that you randomly select a species with a body weight greater than 200 kg?

8. What are the species in the event that the ratio of brain weight to body weight is greater than 0.02? Remember that brain weight is recorded in grams and body weight in kilograms, so body weight must be multiplied by 1000 to make the two weights comparable. What is the probability of that event?

3.3 Rules for Probability Measures

The assumption of equally likely outcomes is the starting point for the construction of many probability models. There are many random experiments for which this assumption is wrong. No matter what other considerations are involved in choosing a probability measure for a model of a random experiment, there are certain rules that it must satisfy. They are:

1. 0 ≤ Pr(E) ≤ 1 for each event E.

2. Pr(Ω) = 1.

3. If E1, E2, . . . is a finite or infinite sequence of events such that EiEj = φ for i ≠ j, then $\Pr\left(\bigcup_i E_i\right) = \sum_i \Pr(E_i)$. If EiEj = φ for all i ≠ j we say that the events E1, E2, . . . are pairwise disjoint.

These are the basic rules. There are other properties that may be derived from them as theorems.

4. Pr(E∼F ) = Pr(E)− Pr(EF ) for all events E and F . In particular, Pr(∼E) = 1− Pr(E)

5. Pr(φ) = 0.

6. Pr(E ∪ F ) = Pr(E) + Pr(F )− Pr(EF ) for all events E and F .

7. If E ⊆ F , then Pr(E) ≤ Pr(F ).

8. If E1 ⊆ E2 ⊆ . . . is an infinite sequence, then $\Pr\left(\bigcup_i E_i\right) = \lim_{i\to\infty} \Pr(E_i)$.

9. If E1 ⊇ E2 ⊇ . . . is an infinite sequence, then $\Pr\left(\bigcap_i E_i\right) = \lim_{i\to\infty} \Pr(E_i)$.
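Rules 4 and 6 can be checked numerically in any equally likely model. The sketch below assumes the 36 outcomes of throwing a red and a green die; the events, the helper `Pr` and the variable names are illustrative, and a numerical check of this kind is not a proof.

```r
# All 36 equally likely outcomes for a red and a green die
Omega <- expand.grid(red = 1:6, green = 1:6)

# Probability of an event given as a logical vector over the rows of Omega
Pr <- function(event) mean(event)

evE <- Omega$red %% 2 == 0   # E = "red die is even"
evF <- Omega$green <= 2      # F = "green die shows 1 or 2"

# Rule 4: Pr(E and not F) = Pr(E) - Pr(EF)
print(c(lhs = Pr(evE & !evF), rhs = Pr(evE) - Pr(evE & evF)))

# Rule 6: Pr(E or F) = Pr(E) + Pr(F) - Pr(EF)
print(c(lhs = Pr(evE | evF), rhs = Pr(evE) + Pr(evF) - Pr(evE & evF)))
```

Representing events as logical vectors lets `&`, `|` and `!` play the roles of intersection, union and complement.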


3.4 Counting Outcomes. Sampling with and without Replacement

Suppose a random experiment with sample space Ω is replicated n times. The result is a sequence (ω1, ω2, . . . , ωn), where ωi ∈ Ω is the outcome of the ith replication. This sequence is the outcome of a so-called compound experiment: the sequential replications of the basic experiment. The sample space of this compound experiment is the n-fold Cartesian product $\Omega^n = \Omega \times \Omega \times \cdots \times \Omega$.

Now suppose that the basic experiment is to choose one member of a finite population with N elements. We may identify the sample space Ω with the population. Consider an outcome (ω1, ω2, . . . , ωn) of the replicated experiment. There are N possibilities for ω1 and for each of those there are N possibilities for ω2 and for each pair ω1, ω2 there are N possibilities for ω3, and so on. In all, there are $N \times N \times \cdots \times N = N^n$ possibilities for the entire sequence (ω1, ω2, · · · , ωn). If all outcomes of the compound experiment are equally likely, then each has probability $1/N^n$. Moreover, it can be shown that the compound experiment has equally likely outcomes if and only if the basic experiment has equally likely outcomes, each with probability 1/N.

Definition: An ordered random sample of size n with replacement from a population of size N is a randomly chosen sequence of length n of elements of the population, where repetitions are possible and each outcome (ω1, ω2, · · · , ωn) has probability $1/N^n$.

Now suppose that we sample one element ω1 from the population, with all N outcomes equally likely. Next, we sample one element ω2 from the population excluding the one already chosen. That is, we randomly select one element from Ω ∼ {ω1} with all the remaining N − 1 elements being equally likely. Next, we randomly select one element ω3 from the N − 2 elements of Ω ∼ {ω1, ω2}, and so on until at last we select ωn from the remaining N − (n − 1) elements of the population. The result is a nonrepeating sequence (ω1, ω2, · · · , ωn) of length n from the population. A nonrepeating sequence of length n is also called a permutation of length n from the N objects of the population. The total number of such permutations is $N \times (N-1) \times \cdots \times (N-n+1) = \frac{N!}{(N-n)!}$. Obviously, we must have n ≤ N for this to make sense. The number of permutations of length N from a set of N objects is N!. It can be shown that, with the sampling scheme described above, all permutations of length n are equally likely to result. Each has probability $\frac{(N-n)!}{N!}$ of occurring.

Definition: An ordered random sample of size n without replacement from a population of size N is a randomly chosen nonrepeating sequence of length n from the population where each outcome (ω1, ω2, · · · , ωn) has probability $\frac{(N-n)!}{N!}$.

Most of the time when sampling without replacement from a finite population, we do not care about the order of appearance of the elements of the sample. Two nonrepeating sequences with the same elements in different order will be regarded as equivalent. In other words, we are concerned only with the resulting subset of the population. Let us count the number of subsets of size n from a set of N objects. Temporarily, let C denote that number. Each subset of size n can be ordered in n! different ways to give a nonrepeating sequence. Thus, the number of nonrepeating sequences of length n is C times n!. So, $\frac{N!}{(N-n)!} = C \times n!$, i.e., $C = \frac{N!}{n!(N-n)!} = \binom{N}{n}$. This is the same binomial coefficient $\binom{N}{n}$ that appears in the binomial theorem:

$$(a+b)^N = \sum_{n=0}^{N} \binom{N}{n} a^n b^{N-n}.$$


Definition: A simple random sample of size n from a population of size N is a randomly chosen subset of size n from the population, where each subset has the same probability of being chosen, namely $1/\binom{N}{n}$.

A simple random sample may be obtained by choosing objects from the population sequentially, in the manner described above, and then ignoring the order of their selection.
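The three counts in this section, and the corresponding sampling schemes, can be tried out in R. The sketch below assumes a small population of size N = 5 and samples of size n = 3; these values and the seed are arbitrary illustrative choices.

```r
N <- 5   # population size (illustrative)
n <- 3   # sample size (illustrative)

N^n                               # ordered samples with replacement: 125
prod(N:(N - n + 1))               # ordered samples without replacement: 60
factorial(N) / factorial(N - n)   # the same count written as N!/(N-n)!
choose(N, n)                      # subsets of size n: 10

# Drawing the samples themselves
set.seed(2)                       # arbitrary seed
sample(1:N, n, replace = TRUE)    # ordered sample with replacement
sample(1:N, n, replace = FALSE)   # ordered sample without replacement
```

For large N the product form `prod(N:(N - n + 1))` is preferable to the ratio of factorials, because `factorial(N)` overflows long before the ratio itself becomes large.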

Example: The Birthday Problem

There are N = 365 days in a year. (Ignore leap years.) Suppose n = 23 people are chosen randomly and their birthdays recorded. What is the probability that at least two of them have the same birthday?

Solution: Arbitrarily numbering the people involved from 1 to n, their birthdays form an ordered sample, with replacement, from the set of 365 birthdays. Therefore, each sequence has probability $1/N^n$ of occurring. No two people have the same birthday if and only if the sequence is actually nonrepeating. The number of nonrepeating sequences of birthdays is N(N − 1) · · · (N − n + 1). Therefore, the event "No two people have the same birthday" has probability

$$\frac{N(N-1)\cdots(N-n+1)}{N^n} = \frac{N(N-1)\cdots(N-n+1)}{N \times N \times \cdots \times N} = \left(1-\frac{1}{N}\right)\left(1-\frac{2}{N}\right)\cdots\left(1-\frac{n-1}{N}\right).$$

With n = 23 and N = 365 we can find this in R as follows:

> prod(1-(1:22)/365)

[1] 0.4927028

So, there is about a 49% probability that no two people in a random selection of 23 have the same birthday. In other words, the probability that at least two share a birthday is about 51%.

An important, intuitively obvious principle in statistics is that if the sample size n is very small in comparison to the population size N , a sample taken without replacement may be regarded as one taken with replacement, if it is mathematically convenient to do so. A sample of size 100 taken with replacement from a population of 100,000 has very little chance of repeating itself. The probability of a repetition is about 5%.
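The figure of about 5% can be checked with the same product formula used in the birthday problem, now with n = 100 and N = 100,000. The helper name `no_repeat` below is an illustrative choice.

```r
# Probability of no repetition in an ordered sample of size n drawn
# with replacement from N equally likely values
no_repeat <- function(n, N) prod(1 - seq_len(n - 1) / N)

# Probability of at least one repetition for n = 100, N = 100000
p_repeat <- 1 - no_repeat(100, 100000)
print(p_repeat)   # a little under 0.05
```

With n = 23 and N = 365 the same function reproduces the birthday value 0.4927028 computed above.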

3.4.1 Exercises

1. A red 6-sided die and a green 6-sided die are thrown simultaneously. The outcomes of this experiment are equally likely. What is the probability that at least one of the dice lands with a 6 on its upper face?

2. A hand of 5-card draw poker is a simple random sample from the standard deck of 52 cards. What is the probability that a 5-card draw hand contains the ace of hearts?


3. How many 5-card draw poker hands are there? In 5-card stud poker, the cards are dealt sequentially and the order of appearance is important. How many 5-card stud poker hands are there?

4. Everybody in Ourtown is a fool or a knave or possibly both. 70% of the citizens are fools and 85% are knaves. One citizen is randomly selected to be mayor. What is the probability that the mayor is both a fool and a knave?

5. A Martian year has 669 days. An R program for calculating the probability of no repetitions in a sample with replacement of n birthdays from a year of N days is given below.

> birthdays=function(n,N) prod(1-seq_len(n-1)/N)

To invoke this function with, for example, n=12 and N=400 simply type

> birthdays(12,400)

Check that the program gives the right answer for N=365 and n=23. Then use it to find the number n of Martians that must be sampled in order for the probability of a repetition to be at least 0.5.

6. A standard deck of 52 cards has four queens. Two cards are randomly drawn in succession, without replacement, from a standard deck. What is the probability that the first card is a queen? What is the probability that the second card is a queen? If three cards are drawn, what is the probability that the third is a queen? Make a general conjecture. Prove it if you can. (Hint: Does the probability change if ”queen” is replaced by ”king” or ”seven”?)

3.5 Conditional Probability

Definition: Let A and B be events with Pr(B) > 0. The conditional probability of A, given B is:

$$\Pr(A \mid B) = \frac{\Pr(AB)}{\Pr(B)}. \tag{3.1}$$

Pr(A) itself is called the unconditional probability of A.

Example 3.2. R includes a tabulation by various factors of the 2201 passengers and crew on the Titanic. Read about it by typing

> help(Titanic)

We are going to look at these factors two at a time, starting with the steerage class of the passengers and whether they survived or not.

> apply(Titanic,c(1,4),sum)

      Survived
Class   No Yes
  1st  122 203
  2nd  167 118
  3rd  528 178
  Crew 673 212

Suppose that a passenger or crew member is selected randomly. The unconditional probability that that person survived is 711/2201 = 0.323.

> apply(Titanic,4,sum)

  No  Yes
1490  711

> apply(Titanic,1,sum)

 1st  2nd  3rd Crew
 325  285  706  885

Let us calculate the conditional probability of survival, given that the person selected was in a first class cabin. If A = ”survived” and B = ”first class”, then

$$\Pr(AB) = \frac{203}{2201} = 0.0922$$

and

$$\Pr(B) = \frac{325}{2201} = 0.1477.$$

Thus,

$$\Pr(A \mid B) = \frac{0.0922}{0.1477} = 0.625.$$

First class passengers had about a 62% chance of survival. For random sampling from a finite population such as this, we can use the counts of occurrences of the events rather than their probabilities because the denominators in Pr(AB) and Pr(B) cancel.

$$\Pr(A \mid B) = \frac{\#(AB)}{\#(B)} = \frac{203}{325} = 0.625$$

For comparison, look at the conditional probabilities of survival for the other classes.

$$\Pr(\text{survived} \mid \text{second class}) = \frac{118}{285} = 0.414$$

$$\Pr(\text{survived} \mid \text{third class}) = \frac{178}{706} = 0.252$$

$$\Pr(\text{survived} \mid \text{crew}) = \frac{212}{885} = 0.240$$
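All four conditional probabilities can be obtained in one step from the table of counts. The sketch below uses the built-in `Titanic` table; the variable names are illustrative.

```r
# Class-by-survival counts, as in the table above
counts <- apply(Titanic, c(1, 4), sum)

# Conditional probability of survival given class: #(AB) / #(B)
pr_survived_given_class <- counts[, "Yes"] / rowSums(counts)
print(round(pr_survived_given_class, 3))

# Unconditional probability of survival, for comparison
pr_survived <- sum(counts[, "Yes"]) / sum(counts)
print(round(pr_survived, 3))
```

Dividing each row of counts by its row total is the count form of the definition (3.1), with the class playing the role of the conditioning event B.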


3.5.1 Relating Conditional and Unconditional Probabilities

The defining equation (3.1) for conditional probability can be written as

Pr(AB) = Pr(A|B)Pr(B), (3.2)

which is often more useful, especially when Pr(A|B) is easily determined from the description of the experiment. There is an even more useful result sometimes called the law of total probability. Let B1, B2, · · · , Bk be pairwise disjoint events such that each Pr(Bi) > 0 and Ω = B1 ∪ B2 ∪ · · · ∪ Bk. Let A be another event. Then,

$$\Pr(A) = \sum_{i=1}^{k} \Pr(A \mid B_i)\Pr(B_i). \tag{3.3}$$

This is quite easy to show since A = (AB1) ∪ · · · ∪ (ABk) is a union of pairwise disjoint events and Pr(ABi) = Pr(A|Bi)Pr(Bi).

Example 3.3. Diagnostic Tests: Let D denote the presence of a disease in a randomly selected member of a given population. Suppose that there is a diagnostic test for the disease and let T denote the event that a random subject tests positive, that is, that the test indicates the disease. The conditional probability Pr(T |D) is called the sensitivity of the test. The conditional probability Pr(∼T |∼D) is called the specificity of the test. The unconditional probability Pr(D) is called the prevalence of the disease in the population. A good test will have both a high sensitivity and a high specificity, although there is usually a trade-off between the two. The unconditional probability that a randomly chosen subject tests positive for the disease is

$$\Pr(T) = \Pr(T \mid D)\Pr(D) + \Pr(T \mid \sim\!D)\Pr(\sim\!D).$$

Suppose that the disease is rare, Pr(D) = 0.02, and that the sensitivity of the test is Pr(T |D) = 0.95 with specificity Pr(∼T |∼D) = 0.85. The false positive rate for the test is Pr(T |∼D) = 1 − Pr(∼T |∼D) = 0.15. The unconditional probability of a positive test result is

$$\Pr(T) = 0.95 \times 0.02 + 0.15 \times 0.98 = 0.166.$$

16.6% of the population will test positive for the disease, even though only 2% have it.
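The same calculation can be read as a count of people. The sketch below assumes a hypothetical population of 10,000 (any size gives the same proportion) and splits it according to the prevalence, sensitivity and specificity given above; the variable names are illustrative.

```r
# Hypothetical population size; any value gives the same proportions
pop <- 10000

prevalence  <- 0.02
sensitivity <- 0.95
specificity <- 0.85

diseased  <- pop * prevalence              # people with the disease (200)
healthy   <- pop - diseased                # people without it (9800)
true_pos  <- diseased * sensitivity        # diseased who test positive (190)
false_pos <- healthy * (1 - specificity)   # healthy who test positive (1470)

# Law of total probability, expressed with counts
pr_T <- (true_pos + false_pos) / pop
print(pr_T)   # approximately 0.166
```

Most of the positive results come from the large healthy group, which is why the positive rate is so much higher than the prevalence.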

3.5.2 Bayes’ Rule

Bayes' rule is named for Thomas Bayes, an eighteenth century clergyman and part-time mathematician. As given below, it is merely a relationship between conditional probabilities but it is associated with Bayesian inference, a distinct philosophy and methodology of statistical practice. Bayes' rule is often described as a rule for calculating conditional "posterior" probabilities from unconditional "prior" probabilities.

Bayes’ Rule: Let A and B1, B2, · · · , Bk be given as in the law of total probability (3.3) and assume Pr(A) > 0. Then for each i,

$$\Pr(B_i \mid A) = \frac{\Pr(A \mid B_i)\Pr(B_i)}{\Pr(A)}, \tag{3.4}$$

where Pr(A) is calculated as in (3.3).


Example 3.4. Urn 1 contains 3 red balls and 5 white balls. Urn 2 contains 6 red balls and 3 white balls. A fair coin is tossed (meaning that heads and tails are equally likely). If a head turns up, a ball is randomly selected from Urn 1. If a tail comes up, a ball is randomly selected from Urn 2. Given that a white ball was selected, what is the probability that it came from Urn 1?

Solution: From the law of total probability,

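$$\Pr(\text{White}) = \Pr(\text{White} \mid \text{Urn1})\Pr(\text{Urn1}) + \Pr(\text{White} \mid \text{Urn2})\Pr(\text{Urn2}) = \frac{5}{8} \times \frac{1}{2} + \frac{3}{9} \times \frac{1}{2} = \frac{23}{48}.$$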

From Bayes’ rule,

$$\Pr(\text{Urn1} \mid \text{White}) = \frac{\Pr(\text{White} \mid \text{Urn1})\Pr(\text{Urn1})}{\Pr(\text{White})} = \frac{(5/8) \times (1/2)}{23/48} = \frac{15}{23}.$$

Example 3.5. Diagnostic Tests (Continued): A patient receiving a test result indicating a disease should be more interested in the conditional probability of having the disease (the probability posterior to receiving the diagnosis) than in the unconditional probability (the probability prior to receiving the diagnosis). That is, he or she wants to know Pr(D|T ). This is easily obtained from Bayes’ rule.

$$\Pr(D \mid T) = \frac{\Pr(T \mid D)\Pr(D)}{\Pr(T)}.$$

Let us assume the same prevalence, sensitivity and specificity as in the previous example. Then

$$\Pr(D \mid T) = \frac{0.95 \times 0.02}{0.166} = 0.1145.$$

Thus, if a disease is rare a positive test result may not strongly indicate the presence of the disease.
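Bayes' rule (3.4) is short enough to write as a one-line computation. The sketch below defines an illustrative helper, here called `bayes` (not a built-in R function), that takes the prior probabilities Pr(Bi) and the conditional probabilities Pr(A|Bi) and returns the posterior probabilities Pr(Bi|A). It is then applied to the numbers of Examples 3.4 and 3.5.

```r
# Posterior probabilities Pr(B_i | A) from priors Pr(B_i) and
# conditional probabilities Pr(A | B_i); 'bayes' is an illustrative name
bayes <- function(prior, likelihood) {
  joint <- likelihood * prior   # Pr(A B_i) = Pr(A | B_i) Pr(B_i)
  joint / sum(joint)            # divide by Pr(A), from the law of total probability
}

# Example 3.4: urn chosen by a fair coin, A = "white ball"
urns <- bayes(prior = c(Urn1 = 1/2, Urn2 = 1/2), likelihood = c(5/8, 3/9))
print(urns)

# Example 3.5: A = "positive test", partition into D and not-D
test <- bayes(prior = c(D = 0.02, notD = 0.98), likelihood = c(0.95, 0.15))
print(round(test, 4))
```

The first component of each result is the quantity computed in the text: 15/23 for the urn example and 0.1145 for the diagnostic test.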

3.6 Independent Events

Two events A and B are independent if Pr(AB) = Pr(A)Pr(B). If Pr(B) > 0, this is equivalent to Pr(A|B) = Pr(A). In other words, the probability of A is not affected by the occurrence or non-occurrence of B. This conforms to our intuitive understanding of independence. More generally, events in a collection C are independent if Pr(A1A2 · · ·An) = Pr(A1)Pr(A2) · · ·Pr(An) for each finite subcollection {A1, A2, · · · , An} of events in C. Events that are not independent are dependent.

Example 3.6. Draw 2 cards in succession without replacement from a standard deck. Let A be the event that the first card is a face card and let B be the event that the second card is a seven. The conditional probability of B, given A, is 4/51. The unconditional probability of B is 1/13. Therefore, A and B are dependent events. Let C be the event that the second card drawn is a heart. The unconditional probability of C is 1/4. It is an exercise to show that the conditional probability of C, given A, is also 1/4. Therefore, A and C are independent.
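The numbers in this example can be confirmed by listing all equally likely ordered pairs of cards. The sketch below assumes a particular numbering of the deck (cards 1 to 52, ranks 1 to 13 within each suit, ranks 11 to 13 as the face cards, suit 1 as hearts); the numbering and the helper names are illustrative choices, and the enumeration is a check rather than a proof.

```r
# Cards are numbered 1..52; rank 1..13 within each of 4 suits
card_rank <- function(card) (card - 1) %% 13 + 1
card_suit <- function(card) (card - 1) %/% 13 + 1

# All 52 * 51 equally likely ordered pairs drawn without replacement
draws <- expand.grid(first = 1:52, second = 1:52)
draws <- draws[draws$first != draws$second, ]

evA <- card_rank(draws$first) >= 11    # A: first card is a face card
evB <- card_rank(draws$second) == 7    # B: second card is a seven
evC <- card_suit(draws$second) == 1    # C: second card is a heart

# Independence means Pr(AB) = Pr(A) Pr(B)
print(c(PrAB = mean(evA & evB), product = mean(evA) * mean(evB)))   # differ
print(c(PrAC = mean(evA & evC), product = mean(evA) * mean(evC)))   # agree
```

The conditional probability of B given A is available as `mean(evB[evA])`, the proportion of outcomes in A that also lie in B.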


3.6.1 Exercises

1. A department store tabulated the relative frequencies of the amounts of purchases and the method of payment. The results are shown below.

             Cash   Credit   Debit
<\$20         .09    .03      .04
\$20-\$100    .05    .21      .18
>\$100        .03    .23      .14

(a) What proportion of purchases are paid for in cash?

(b) Given that a purchase is for more than \$100, what is the probability that it is paid for by credit?

(c) Are payment by credit and amount > \$100 independent events?

2. Refer to Examples 3.3 and 3.5 above. What is Pr(∼D|∼T )?

3. Generalize equation (3.2) to show that for any events A1, A2, · · · , An,

$$\Pr(A_1A_2\cdots A_n) = \Pr(A_n \mid A_1A_2\cdots A_{n-1})\Pr(A_{n-1} \mid A_1\cdots A_{n-2})\cdots\Pr(A_2 \mid A_1)\Pr(A_1),$$

provided Pr(A1A2 · · ·An−1) > 0. Hint: Use an inductive argument.

4. The Montana text file is adapted from the Montana outlook poll conducted by the University of Montana in 1992. Use Rstudio to load it into your R workspace or, in plain R, use the "read.table" function. Then attach the data and tabulate it as shown below.

> attach(Montana)

> table(AREA,INC)

    INC
AREA <20K >35K 20-35K
  NE   13   21     22
  SE   17   21     31
  W    17   18     30

> table(AREA,POL)

    POL
AREA Dem Ind Rep
  NE  15  12  30
  SE  30  16  31
  W   39  12  17

Are the events INC > 35K and AREA == W independent or dependent? What about the events AREA == W and POL == Rep?


5. Two cards are drawn in succession without replacement from a standard deck. Show that the events A=”face card on first draw” and B=”heart on second draw” are independent. Hint: Write A = A1 ∪A2, where A1=”face card and a heart on first draw” and A2=”face card and not a heart on first draw”.

3.7 Replications of a Random Experiment

In Chapter 1 we mentioned that replications of a random experiment are independent, without making that statement precise. We can now elaborate on that idea. Let Ω be the sample space of a basic random experiment. Replicating the experiment n times results in a compound random experiment whose sample space is the n-fold Cartesian product $\Omega^n = \Omega \times \Omega \times \cdots \times \Omega$. Let A1, A2, · · · , An be any subsets of Ω, that is, any events belonging to the basic experiment. Thus Pr(Ai) is a well defined probability for each i. The Cartesian product A1 × A2 × · · · × An is an event in the compound experiment, a subset of $\Omega^n$. For a replicated experiment it must be true that

Pr(A1 ×A2 × · · · ×An) = Pr(A1)× Pr(A2)× · · · × Pr(An)

for all choices of A1, A2, · · · , An.

The notation in the last equation is slightly imprecise. The symbol "Pr" on the left stands for the probability measure on $\Omega^n$, whereas on the right it stands for the probability measure on Ω.
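The product rule for replications can be illustrated with two replications of a simple experiment. The sketch below assumes one roll of a fair die as the basic experiment and equally likely outcomes on the product space; the events A1 and A2 and the variable names are illustrative.

```r
# Basic experiment: one roll of a fair die, replicated twice
Omega  <- 1:6
Omega2 <- expand.grid(w1 = Omega, w2 = Omega)   # product space, 36 outcomes

A1 <- c(2, 4, 6)   # event for the first replication
A2 <- c(5, 6)      # event for the second replication

# Left side: probability of A1 x A2 on the product space
lhs <- mean(Omega2$w1 %in% A1 & Omega2$w2 %in% A2)

# Right side: product of the probabilities in the basic experiment
rhs <- mean(Omega %in% A1) * mean(Omega %in% A2)

print(c(lhs = lhs, rhs = rhs))
```

Here the left side is computed with the probability measure on the product space and the right side with the measure on Ω, which mirrors the remark about the two meanings of "Pr".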


Chapter 4

Discrete Distributions

4.1 Random Variables

A random variable is a function whose domain is the sample space Ω of a random experiment. The set of values (the range) of this function might be a finite set of letters, words, or other symbols. Such a random variable is called a nominal variable, a categorical variable, or a factor. Other random variables, called numeric variables, have real number values whose order and arithmetic relationships are important. This chapter is mostly about numeric random variables.

Examples:

1. Select one person randomly from a population of M women and N men. Let X = 1 if the person selected is a woman and let X = 0 if a man. In other words, X is the number of women that occur in a single random selection. A random variable that has only the two values 0 and 1 is called a Bernoulli random variable.

2. Replicate the experiment in (1) n times, i.e., choose an ordered random sample of size n with replacement from the population. Let W be the number of women in the sample. W has possible values {0, 1, · · · , n}. We may express W as W = X1 +X2 + · · ·+Xn, where Xi is 1 if a woman was selected on the ith replication and 0 if a man was selected.

3. Choose a random sample of size n without replacement from a population of M women and N men. Let W be the number of women in the sample.

4. Choose a random sample of size n from the population of prospective voters in a national election. Let X1 be the number in the sample who self-identify as Democrats, X2 the number of Republicans, X3 the number of Libertarians, X4 the number of Greens, X5 the number of Other Party respondents, and X6 the number affiliated with no party. Individually, each of these variables is of the type described in examples (2) or (3), depending on whether sampling from the population is done with or without replacement. Since they are simultaneously observed on each outcome of the sampling experiment, they are said to be jointly observed or jointly distributed.

5. Roll a fair 6-sided die twice. Let X1 be the sum of the two rolls and let X2 be the larger of the two rolls. Then X1 and X2 are jointly distributed random variables.

4.2 Discrete Random Variables

A random variable whose set of possible values (i.e., its range) is a finite or countably infinite set is called a discrete random variable. All of the random variables in the examples above are discrete. Let X denote such a variable. Its values can be arranged in a finite or infinite sequence x1, x2, · · · , xn, · · · . The probabilities with which X assumes these values are of fundamental importance. The set of experimental outcomes {ω ∈ Ω|X(ω) = xi} is an event and will be denoted by (X = xi) for short. The probability Pr(X = xi) is called the frequency of xi or the probability mass at xi, and the function f defined on the set {x1, x2, · · · } by

f(xi) = Pr(X = xi)

is called the frequency function or probability mass function of X. For numeric variables it is convenient to allow f to be defined for all real numbers x by defining

f(x) = Pr(X = x),

with the understanding that Pr(X = x) is 0 if x is not one of the xi. As a consequence of the rules of probability we have f(x) ≥ 0 for each real x and $\sum_x f(x) = 1$, where the sum is taken over all real numbers x. In reality, the sum reduces to $\sum_i f(x_i) = 1$.

Other probabilities can be expressed in terms of the frequency function. For example, if I is any kind of interval of real numbers, the set of outcomes (X ∈ I) = {ω ∈ Ω|X(ω) ∈ I} is an event and its probability may be calculated as

\[ \Pr(X \in I) = \sum_{x \in I} f(x). \]

Example 4.1. Roll a 6-sided die twice. Assume that all 36 outcomes are equally likely. Let T denote the total number of spots on the two rolls. A table of values of T and their probabilities is given below.

t      2     3     4     5     6     7     8     9     10    11    12
f(t)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Suppose that we are interested in Pr(T ≤ 4). We can calculate it as

\[ \Pr(T \le 4) = \sum_{t \le 4} f(t) = \sum_{t=2}^{4} f(t) = \frac{1}{36} + \frac{2}{36} + \frac{3}{36} = \frac{6}{36}. \]
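The table in Example 4.1 can also be built by brute-force enumeration. The following is a minimal sketch in R, assuming nothing beyond base R; the variable names (rolls, total, p_le_4) are illustrative choices and not part of the text.

```r
# Enumerate all 36 equally likely outcomes of two rolls of a 6-sided die.
rolls <- expand.grid(first = 1:6, second = 1:6)
total <- rolls$first + rolls$second

# Frequency function f(t) = Pr(T = t): the proportion of outcomes with total t.
f <- table(total) / nrow(rolls)
print(f)

# Pr(T <= 4) is the sum of f(t) over the possible values t = 2, 3, 4.
t_values <- as.integer(names(f))
p_le_4 <- sum(f[t_values <= 4])
print(p_le_4)
```

Enumeration like this is only practical for small sample spaces, but it is a useful check on a hand-built table.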

4.3 Expected Values

Definition 4.1. Let X be a discrete random variable with frequency function f(x) = Pr(X = x). The expected value or mean value of the distribution of X is

\[ E(X) = \mu = \sum_x x f(x) = \sum_x x \Pr(X = x). \]

In case the set of possible values of X is countably infinite, we require that the sum in the definition be absolutely convergent. If X has only a finite set of possible values this is not a concern.

Theorem 4.1. Let X be a discrete random variable with frequency function f(x) and let h(x) be a function defined on the range of X. The expected value of the random variable Y = h(X) is equal to

\[ E(Y) = E(h(X)) = \sum_x h(x) f(x). \]

Proof: For a given value y of Y ,

\[ \Pr(Y = y) = \sum_{x : h(x) = y} \Pr(X = x). \]

Hence,

\[ E(Y) = \sum_y y \sum_{x : h(x) = y} \Pr(X = x) = \sum_x h(x) f(x). \]

Example 4.2. For the random variable T of the preceding example, find $E(T)$ and $E(T^2)$.

Solution: Extending the table of Example 4.1,

t          2     3      4      5       6       7       8       9       10      11      12
f(t)       1/36  2/36   3/36   4/36    5/36    6/36    5/36    4/36    3/36    2/36    1/36
t f(t)     2/36  6/36   12/36  20/36   30/36   42/36   40/36   36/36   30/36   22/36   12/36
t^2 f(t)   4/36  18/36  48/36  100/36  180/36  294/36  320/36  324/36  300/36  242/36  144/36

Adding the entries in the last two rows gives $E(T) = 252/36 = 7$ and $E(T^2) = 1974/36 = 54.833$.

Definition 4.2. The variance of the distribution of a random variable X is

\[ \operatorname{var}(X) = \sigma^2 = E((X - \mu)^2), \]

where µ is the mean of X. The standard deviation of X is the square root of its variance:

\[ \operatorname{sd}(X) = \sigma = \sqrt{\operatorname{var}(X)}. \]

The mean, variance and standard deviation of a distribution are analogous to but not the same as the sample mean, variance and standard deviation that we discussed previously for numeric data sets. Like their sample counterparts, the mean and standard deviation of a distribution are measures of its location and spread.

Theorem 4.2. If Y = a+ bX, where a and b are constants, then

E(Y ) = a+ bE(X)

and sd(Y ) = |b|sd(X).

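This leads to an alternate formula for the variance that is sometimes easier for calculation. Let µ = E(X). Then

\[ (X - \mu)^2 = X^2 - 2\mu X + \mu^2. \]

Hence, taking expected values term by term and using E(X) = µ,

\[ \operatorname{var}(X) = E((X - \mu)^2) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - 2\mu^2 + \mu^2 = E(X^2) - E(X)^2. \]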

For the random variable T of the preceding example, we calculated $E(T^2) = 54.833$ and $E(T) = 7$. Thus, var(T) = 54.833 − 49 = 5.833 and $\operatorname{sd}(T) = \sqrt{5.833} = 2.415$.
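The expected values, variance and standard deviation of T can be checked numerically. This is a small sketch in base R that works directly from the frequency table of T; the object names are illustrative only. It computes the variance both from the definition and from the alternate formula.

```r
# Possible values of T and their frequencies (two rolls of a fair die).
t_values <- 2:12
f <- c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1) / 36

mean_T  <- sum(t_values * f)        # E(T)   = sum of t f(t)
mean_T2 <- sum(t_values^2 * f)      # E(T^2) = sum of t^2 f(t)

# Variance from the definition E((T - mu)^2) ...
var_T_def <- sum((t_values - mean_T)^2 * f)
# ... and from the alternate formula E(T^2) - E(T)^2.
var_T_alt <- mean_T2 - mean_T^2

sd_T <- sqrt(var_T_alt)
print(c(mean = mean_T, mean_sq = mean_T2, var = var_T_alt, sd = sd_T))
```

The two variance calculations agree up to floating-point rounding, which is what the algebra above predicts.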

The next theorem, Chebyshev’s inequality, places a universal restriction on the probabilities of deviations of random variables from their means.

Theorem 4.3. If X is a random variable with mean µ and standard deviation σ and if k is a positive constant, then

\[ \Pr(|X - \mu| > k\sigma) \le 1/k^2. \]
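To see what the inequality says in a concrete case, here is a hedged sketch in base R for a single roll of a fair die. The helper chebyshev_check is a hypothetical function written for this illustration, not an R built-in, and the values of k are arbitrary choices.

```r
# Compare Pr(|X - mu| > k*sigma) with the Chebyshev bound 1/k^2 for a
# discrete distribution given by its values x and frequencies f.
# chebyshev_check is a hypothetical helper written for this illustration.
chebyshev_check <- function(x, f, k) {
  mu <- sum(x * f)
  sigma <- sqrt(sum((x - mu)^2 * f))
  tail_prob <- sum(f[abs(x - mu) > k * sigma])
  c(k = k, tail_prob = tail_prob, bound = 1 / k^2)
}

# One roll of a fair 6-sided die: values 1..6, each with frequency 1/6.
x <- 1:6
f <- rep(1 / 6, 6)

res <- rbind(chebyshev_check(x, f, 1.2),
             chebyshev_check(x, f, 1.5))
print(res)
```

In both rows the tail probability is well below the bound, which is typical: Chebyshev's inequality holds for every distribution with a finite variance, so it is rarely sharp for any particular one.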

4.3.1 Exercises

1. A fair coin is tossed until either a head occurs or 6 tails in a row have occurred. Let X denote the number of tosses. Find the frequency function, mean, and variance of X.

2. Verify Chebyshev’s inequality for k = 2 and k = 3 when X is the total number of spots on two rolls of a fair 6-sided die.

3. Prove Theorem 4.2.

4. The function f(n) = 1/(n(n+ 1)), n = 1, 2, 3, · · · is a legitimate frequency function. Show that its mean value does not exist.

4.4 Bernoulli Random Variables

A Bernoulli random variable has only two possible values, usually designated as 1 and 0. Often these are numeric codes for verbal descriptions like "success" and "failure". For example, roll a pair of dice and call it a success if the total number of spots is 7 or 11. Otherwise, call the experiment a failure. Instead of defining a random variable X with possible values {success, failure} we typically let the values of X be {1, 0}, where 1 means success and 0 means failure. One advantage in doing this is that X is then a numeric variable and can be interpreted as the number of successes in one performance of the experiment.

For a given Bernoulli variable X let p denote Pr(X = 1). p is the so-called success probability. In the example just given, p = Pr(T = 7 or T = 11) = 8/36, but in general p could be any number between 0 and 1. The frequency function for X is

\[
f(x) =
\begin{cases}
p, & \text{if } x = 1, \\
1 - p, & \text{if } x = 0, \\
0, & \text{if } x \ne 0, 1.
\end{cases}
\tag{4.1}
\]

A compact way of writing this is $f(x) = p^x (1 - p)^{1 - x}$ for x = 0, 1.

Example 4.3. Randomly select one person from a population of M men and N women. Let W be the number (either 0 or 1) of women selected. W is a Bernoulli variable with success probability p = N/(M +N).

4.4.1 The Mean and Variance of a Bernoulli Variable

Let X be a Bernoulli variable with success probability p = Pr(X = 1). The expected value of X is

E(X) = 0× (1− p) + 1× p = p.

Furthermore, since $X^2 = X$, $E(X^2) = p$ also. Therefore,

\[ \operatorname{var}(X) = E(X^2) - E(X)^2 = p - p^2 = p(1 - p). \]

4.5 Binomial Random Variables

Let X be a Bernoulli random variable with success probability p arising from a given random experiment. Replicate the experiment n times and let X1, X2, · · · , Xn be the values of X from the replications. The random variables X1, · · · , Xn are independent, which is a most important property, not just for Bernoulli variables but for any jointly distributed random variables whenever it holds.

Definition 4.3. Jointly distributed random variables X1, X2, · · · , Xn are independent if for all intervals I1, I2, · · · , In,

Pr(X1 ∈ I1;X2 ∈ I2; · · · ;Xn ∈ In) = Pr(X1 ∈ I1)× Pr(X2 ∈ I2)× · · · × Pr(Xn ∈ In).

The expression (X1 ∈ I1;X2 ∈ I2; · · · ;Xn ∈ In) means the same thing as the intersection (X1 ∈ I1) ∩ (X2 ∈ I2) ∩ · · · ∩ (Xn ∈ In).

For independent replications of a Bernoulli experiment, let Y = X1 + X2 + · · · + Xn. Y is the total number of successes in the n replications. Clearly, the possible values of Y are 0, 1, · · · , n. We will derive Pr(Y = y) for any y in this range. Let x1, x2, · · · , xn be any particular sequence of y 1’s and n− y 0’s. Since Pr(Xi = 1) = p, Pr(Xi = 0) = 1− p, and the Xi are independent,

\[ \Pr(X_1 = x_1; X_2 = x_2; \cdots; X_n = x_n) = p^y (1 - p)^{n - y}. \]

This is only one particular sequence of values of the Xi that leads to the event Y = y. In all, there are $\binom{n}{y}$ sequences of y 1’s and n − y 0’s. Thus,

\[ \Pr(Y = y) = \binom{n}{y} p^y (1 - p)^{n - y}. \]
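The counting argument can be reproduced by listing every sequence of 0's and 1's for a small n. The sketch below uses base R only; the choices n = 4 and p = 0.3 are arbitrary illustration values, and "dbinom" is R's binomial frequency function.

```r
n <- 4
p <- 0.3

# All 2^n sequences (x1, ..., xn) of 0's and 1's, one sequence per row.
seqs <- as.matrix(expand.grid(rep(list(0:1), n)))
y <- rowSums(seqs)                      # number of 1's in each sequence

# By independence, a particular sequence with y ones has probability
# p^y (1 - p)^(n - y).
seq_prob <- p^y * (1 - p)^(n - y)

# Pr(Y = y): add up the probabilities of all sequences with y ones.
by_enumeration <- as.vector(tapply(seq_prob, y, sum))

# The same probabilities from the closed-form expression.
by_formula <- choose(n, 0:n) * p^(0:n) * (1 - p)^(n - 0:n)

print(rbind(by_enumeration, by_formula, dbinom = dbinom(0:n, size = n, prob = p)))
```

All three rows agree, and the number of sequences with y ones, table(y), matches choose(n, 0:n).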

Definition 4.4. A random variable Y has a binomial distribution based on n trials and success probability p ∈ (0, 1) if the frequency function of Y is

\[
f_Y(y) =
\begin{cases}
\binom{n}{y} p^y (1 - p)^{n - y}, & \text{if } y \in \{0, 1, \cdots, n\}; \\
0, & \text{otherwise.}
\end{cases}
\]

Note that a Bernoulli random variable is a binomial random variable with n = 1. The family of all binomial distributions is a parametric family because specification of the values of the two parameters n and p singles out a specific member of that family. To indicate that Y has a binomial distribution with given parameter values n and p, we write Y ∼ Binom(n, p).

Any numeric random variable X has a cumulative distribution function defined as

FX(x) = Pr(X ≤ x)

for all real numbers x. For discrete random variables the relationship between the frequency function and the cumulative distribution function is

\[ F_X(x) = \sum_{x_i \le x} f_X(x_i), \]

where x1, x2, · · · are the values of X. In particular, for a binomial random variable Y ∼ Binom(n, p),

\[
F_Y(y) =
\begin{cases}
0, & \text{if } y < 0, \\
\sum_{0 \le i \le y} \binom{n}{i} p^i (1 - p)^{n - i}, & \text{if } 0 \le y < n, \\
1, & \text{if } y \ge n.
\end{cases}
\]

Any cumulative distribution function F has the following properties:

1. F is a nondecreasing function defined on the set of all real numbers.

2. F is right-continuous. That is, for each a, $F(a) = F(a+) = \lim_{x \to a+} F(x)$.

3. $\lim_{x \to -\infty} F(x) = 0$; $\lim_{x \to +\infty} F(x) = 1$.

4. $\Pr(a < X \le b) = F_X(b) - F_X(a)$ for all real a and b, a < b.

5. $\Pr(X > a) = 1 - F_X(a)$.

6. $\Pr(X < b) = F_X(b-) = \lim_{x \to b-} F_X(x)$.

7. $\Pr(a < X < b) = F_X(b-) - F_X(a)$.

8. $\Pr(X = b) = F_X(b) - F_X(b-)$.

Here is the graph of the cumulative distribution function of X ∼ Binom(20, 0.4). Notice that it is constant between successive possible values.

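[Figure: graph of the cumulative distribution function F(x) of X ∼ Binom(20, 0.4), a step function plotted for x from 0 to 20 with F(x) rising from 0 to 1.]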

The values of the frequency function of X are plotted below as vertical line segments.

[Figure: frequency function f(x) of X ∼ Binom(20, 0.4), drawn as vertical line segments against x, with the f(x) axis running from 0 to 0.15.]

R has a suite of functions related to binomial distributions. You can read about them by calling

> help(Binomial)

For now, the most important are the functions "dbinom" for calculating the frequency function and "pbinom" for calculating the cumulative distribution function. For example, if Y ∼ Binom(20, 0.3), the frequency function f(10) = Pr(Y = 10) and the cumulative distribution function F(10) = Pr(Y ≤ 10) are found by

> dbinom(10,size=20,prob=0.3)

[1] 0.03081708

> pbinom(10,size=20,prob=0.3)

[1] 0.9828552
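The relationship between the frequency function and the cumulative distribution function, and the interval rules listed earlier, can be tried out with these two functions. The following is a sketch for Y ∼ Binom(20, 0.3); the object names and the particular interval are illustrative choices.

```r
n <- 20
p <- 0.3

# Frequency function at 0, 1, ..., n and its running sum, the cdf.
f   <- dbinom(0:n, size = n, prob = p)
cdf <- cumsum(f)

# R vectors are 1-indexed, so cdf[11] is F(10).
print(c(cdf[11], pbinom(10, size = n, prob = p)))

# Pr(5 < Y <= 10) = F(10) - F(5), which is also f(6) + ... + f(10).
p_interval <- pbinom(10, size = n, prob = p) - pbinom(5, size = n, prob = p)
print(c(p_interval, sum(dbinom(6:10, size = n, prob = p))))

# Pr(Y > 10) = 1 - F(10); pbinom can return the upper tail directly.
p_upper <- 1 - pbinom(10, size = n, prob = p)
print(c(p_upper, pbinom(10, size = n, prob = p, lower.tail = FALSE)))
```

For very small tail probabilities the lower.tail = FALSE form is usually preferable to subtracting from 1, because it avoids loss of precision.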

4.5.1 The Mean and Variance of a Binomial Distribution

To derive the mean and variance of a binomial distribution we will rely on the following results, which will be discussed in more detail later.

Theorem 4.4. If jointly distributed random variables X1, X2, · · · , Xn have expected values and Y = X1 + X2 + · · · + Xn, then E(Y ) = E(X1) + E(X2) + · · · + E(Xn). If X1, X2, · · · , Xn are independent, then var(Y ) = var(X1) + var(X2) + · · · + var(Xn).

A binomial random variable Y ∼ Binom(n, p) has the same distribution as X1 + · · · + Xn, where X1, X2, · · · , Xn are independent Bernoulli variables Xi ∼ Binom(1, p). Therefore,

E(Y ) = E(X1) + · · ·+ E(Xn) = np,

var(Y ) = var(X1) + · · ·+ var(Xn) = np(1− p),

and $\operatorname{sd}(Y) = \sqrt{np(1 - p)}$.
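These formulas can be checked against the definitions of mean and variance by summing over the binomial frequency function. A minimal sketch in base R follows, with n = 20 and p = 0.3 chosen arbitrarily for illustration.

```r
n <- 20
p <- 0.3

y <- 0:n
f <- dbinom(y, size = n, prob = p)   # binomial frequency function

mean_Y <- sum(y * f)                 # E(Y) from the definition
var_Y  <- sum(y^2 * f) - mean_Y^2    # var(Y) = E(Y^2) - E(Y)^2

# Compare with the closed forms np and np(1 - p).
print(c(mean_Y, n * p))
print(c(var_Y, n * p * (1 - p)))
print(c(sqrt(var_Y), sqrt(n * p * (1 - p))))
```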

4.5.2 Exercises

1. A 6-sided die is thrown twice. All outcomes are equally likely. Let M denote the maximum of the two numbers on the upper surfaces. Find the frequency function and the cumulative distribution function of M . Graph the cumulative distribution function.

2. Six people are randomly selected in succession, with replacement, from a class containing 25 men and 20 women. What is the probability of obtaining the sequence 1, 0, 0, 0, 1, 1, where 1 indicates a man was chosen and 0 indicates a woman was chosen?

3. Write down all the other outcomes of this sequential sampling experiment that lead to 3 men and 3 women being chosen. What are their probabilities?

4. What is the probability that 3 men are chosen in the sampling experiment?

5. What is the probability that 2 or more women are chosen?

6. Suppose that 6 people are randomly chosen without replacement from a population consisting of 2500 men and 2000 women. Find the approximate probability that there are 4 men in the sample. Justify your answer.

7. Use both your calculator and R to find the following probabilities.

(a) Pr(Y = 5), Y ∼ Binom(12, 0.3).

(b) Pr(Y > 8), Y ∼ Binom(12, 0.3).

(c) Pr(|Y − 10| ≤ 4), Y ∼ Binom(20, 0.5).

8. Show that the sum of binomial frequencies is 1, i.e., that

\[ \sum_{x=0}^{n} \binom{n}{x} p^x (1 - p)^{n - x} = 1. \]

Hint: Expand $1 = 1^n = [p + (1 - p)]^n$ by the binomial theorem from calculus.

9. Sketch the cumulative distribution function of the Bernoulli distribution Binom(n = 1, p = .7).

10. Use R’s "pbinom" function to verify Chebyshev’s inequality for k = 2 and k = 3 when X ∼ Binom(50, 0.4).

4.6 Hypergeometric Distributions

Suppose that a random sample of size k is selected without replacement from an urn containing m white balls and n black balls. Let X denote the number of white balls in the sample. The distribution of X is called a hypergeometric distribution with parameters m, n, and k. X is an integer-valued random variable that lies between max{0, k − n} and min{k,m}.

Let x be an integer between max{0, k − n} and min{k,m}. In order for the event (X = x) to occur, a set of x white balls must be chosen. This can occur in $\binom{m}{x}$ ways, and for each such outcome, there are $\binom{n}{k-x}$ ways of choosing k − x black balls. Therefore, the number of outcomes of the sampling experiment in the event (X = x) is

\[ \binom{m}{x} \binom{n}{k-x}. \]

The $\binom{m+n}{k}$ outcomes of the sampling experiment are all equally likely. Thus,

\[ f_X(x) = \Pr(X = x) = \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}}. \tag{4.2} \]

Definition 4.5. An integer valued random variable has a hypergeometric distribution with parameters m, n, and k (all positive integers, k ≤ m + n) if its frequency function is given by (4.2) for all integers x, max{0, k − n} ≤ x ≤ min{k,m}.

We have mentioned several times that if the sample size is small compared to the population size, we can regard a sample taken without replacement as one taken with replacement (or vice-versa) if it is mathematically convenient to do so. This is reflected in the following theorem.

Theorem 4.5. Let m, n, and k be positive integers and suppose that m, n → ∞ in such a way that m/(m + n) → p ∈ (0, 1). Then for any integer x between 0 and k,

\[ \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}} \to \binom{k}{x} p^x (1 - p)^{k - x}. \]

This theorem justifies approximating a hypergeometric distribution with a binomial distribution in certain circumstances. In general it is easier to work with a binomial distribution.
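The convergence can be watched numerically. The sketch below holds the sample size k and the proportion m/(m + n) = 0.4 fixed while the population grows; the particular population sizes are arbitrary illustration values. It uses R's "dhyper" and "dbinom" functions.

```r
k <- 5        # sample size
x <- 2        # number of white balls in the sample
p <- 0.4      # fixed proportion of white balls, m / (m + n)

# Growing populations with the same proportion of white balls.
m_values <- c(20, 200, 2000, 20000)      # white balls
n_values <- c(30, 300, 3000, 30000)      # black balls

hyper_probs <- dhyper(x, m = m_values, n = n_values, k = k)
binom_limit <- dbinom(x, size = k, prob = p)

print(data.frame(population = m_values + n_values,
                 hypergeometric = hyper_probs,
                 abs_diff = abs(hyper_probs - binom_limit)))
print(binom_limit)
```

The difference shrinks as the population grows relative to the sample, which is the practical content of the theorem.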

To indicate that X has a hypergeometric distribution with parameters m, n and k, we write X ∼ Hyper(m,n, k). The R functions for the hypergeometric frequency function and the cumulative distribution function are "dhyper" and "phyper", respectively. Details on their use are in the R help file

> help(Hyper)

Example 4.4. A class consists of 25 men and 20 women. Six people are randomly selected from the class without replacement. What is the probability that 3 men are chosen?

Solution: The number X of men in the sample has a hypergeometric distribution with parameter values m = 25, n = 20, and k = 6. Hence,

\[ \Pr(X = 3) = \frac{\binom{25}{3}\binom{20}{3}}{\binom{45}{6}}. \]

This can be calculated in R with the "dhyper" function or with the "choose" function for evaluating binomial coefficients.

> dhyper(x=3,m=25,n=20,k=6)

[1] 0.3219129

> choose(25,3)*choose(20,3)/choose(45,6)

[1] 0.3219129

For finding the value of the cumulative distribution function, calculating binomial coefficients quickly becomes tiresome. The R function is "phyper". For example, Pr(X ≤ 3) is

> phyper(3,25,20,6)

[1] 0.5527105

If the sampling had been with replacement, the distribution of X would have been binomial and the answers would have been

> dbinom(x=3,size=6,prob=25/45)

[1] 0.3010682

> pbinom(3,6,25/45)

[1] 0.5472216

4.6.1 The Mean and Variance of a Hypergeometric Distribution

A random sample of size k chosen without replacement from a population of m successes and n failures can be selected sequentially. After each selection all the remaining members of the population must be equally likely on the next selection. All subsets of size k are equally likely to result from this method. On the ith selection, let Xi = 1 if the choice is a success and Xi = 0 if it is not. X1, X2, · · · , Xk are Bernoulli variables, all with the same success probability p = m/(m+n), but they are not independent. Since Y = X1 + · · · + Xk is the number of successes and has the hypergeometric distribution Y ∼ Hyper(m,n, k), the first part of Theorem 4.4 applies to give

E(Y ) = kp = km/(m+ n).

However, the second part of Theorem 4.4 does not apply because the Xi are not independent. The variance of the hypergeometric distribution differs from the variance of the binomial distribution by a factor sometimes called the correction for sampling from a finite population:

\[ \operatorname{var}(Y) = kp(1 - p)\left(1 - \frac{k - 1}{m + n - 1}\right). \]

Notice that the correction factor

\[ 1 - \frac{k - 1}{m + n - 1} \]

is almost 1 if k ≪ m + n. This is another clue that sampling with and without replacement are almost the same under these circumstances.
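Both formulas can be confirmed by summing over the hypergeometric frequency function. The sketch below uses the class of 25 men and 20 women with a sample of 6 as illustration values; the object names are arbitrary.

```r
m <- 25   # successes in the population
n <- 20   # failures in the population
k <- 6    # sample size

x <- max(0, k - n):min(k, m)        # possible values of Y
f <- dhyper(x, m = m, n = n, k = k)

mean_Y <- sum(x * f)
var_Y  <- sum(x^2 * f) - mean_Y^2

p <- m / (m + n)
correction <- 1 - (k - 1) / (m + n - 1)

print(c(mean_Y, k * p))
print(c(var_Y, k * p * (1 - p) * correction))
print(correction)   # finite population correction factor
```

For this small population the correction factor is noticeably below 1, so the hypergeometric variance is smaller than the binomial variance kp(1 − p).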

4.7 Poisson Distributions

Poisson distributions are important in modeling random phenomena such as subatomic decay events, meteor impacts and genetic mutations that occur sporadically in time or space. We shall first consider occurrences in time. For a given time interval I, let X(I) denote the number of occurrences of the phenomenon in question during that interval.

Definition 4.6. A Poisson process is a collection of non-negative integer valued random variables X(I) associated with time intervals I = (t, t+ ∆t) which satisfies the following conditions.

1. If no two of the time intervals I1, I2, · · · , Im overlap, the random variables X(I1), X(I2), · · · , X(Im) are independent.

2. If two time intervals I1 and I2 have the same length, the random variables X(I1) and X(I2) have the same distribution.

3. There is a constant λ > 0 such that for a time interval I of length ∆t, Pr(X(I) > 0) = λ∆t + ε, where ε/∆t → 0 as ∆t → 0. For small time intervals, the probability of an occurrence during that interval is approximately proportional to its length, with negligible error. The proportionality constant λ is called the rate of the process.

4. Pr(X(I) > 1)/∆t → 0 as ∆t → 0. The probability of more than one occurrence during a small time interval is negligible.

A spatial Poisson process satisfies the same conditions except that the random variables X(I) are associated with two or three-dimensional regions I of space and ∆t is the area or volume of the region I rather than time duration.

Theorem 4.6. For a Poisson process with rate parameter λ > 0, the random variable X(I) has nonnegative integer values and has frequency function

\[ \Pr(X(I) = x) = f(x) = \frac{e^{-\lambda \Delta t} (\lambda \Delta t)^x}{x!} \]

for x = 0, 1, 2, · · · .

Definition 4.7. A random variable X with nonnegative integer values has a Poisson distribution if its frequency function is

\[ f(x) = \Pr(X = x) = e^{-\mu} \frac{\mu^x}{x!}, \tag{4.3} \]

for x = 0, 1, 2, · · · , where µ > 0 is a constant.

Thus, for a Poisson process the random variable X(I) has a Poisson distribution with parameter µ = λ∆t. As with any frequency function, the sum of Poisson frequencies must equal 1. This is easy to show from the Maclaurin series for the exponential function.

\[ \sum_{x=0}^{\infty} e^{-\mu} \frac{\mu^x}{x!} = e^{-\mu} \sum_{x=0}^{\infty} \frac{\mu^x}{x!} = e^{-\mu} e^{\mu} = 1. \]

Example 4.5. Suppose that random mutations occur in a certain section of the human genome according to a Poisson process with a rate of 1 per 10,000 years. What is the probability that more than one mutation will occur in a period of 5000 years?

Solution: To model a process occurring in time as a Poisson process, it is necessary to specify the unit of time. In this problem it is convenient to take the time unit to be 10,000 years and the rate parameter λ to be 1. One could choose one year as the time unit and adjust the rate to be λ = 0.0001. The resulting answer would be correct as long as we keep the units in mind. The point is, the parameter λ must be expressed in units of 1/time. We will take λ = 1 for simplicity. Thus, an interval I of 5000 years has length ∆t = 0.5 and the number of mutations X = X(I) has a Poisson distribution with parameter µ = λ∆t = 0.5.

\[ \Pr(X > 1) = 1 - \Pr(X \le 1), \]

\[ \Pr(X \le 1) = \Pr(X = 0) + \Pr(X = 1) = e^{-0.5} + 0.5\, e^{-0.5} = 0.6065 + 0.3033 = 0.9098, \]

\[ \Pr(X > 1) = 1 - 0.9098 = 0.0902. \]

If X has a Poisson distribution with parameter µ, we write X ∼ Pois(µ). The R functions for evaluating the frequency function and the cumulative distribution function are "dpois" and "ppois". The parameter µ must be specified as an argument. Unfortunately, it is called "lambda" in R. Don’t confuse it with the rate parameter λ in the discussion above.

> ppois(1,lambda=0.5)

[1] 0.909796

> dpois(0,0.5); dpois(1,0.5)

[1] 0.6065307

[1] 0.3032653

There is a close relationship between binomial and Poisson distributions. Let {pn} be a sequence of positive numbers such that pn → 0 and npn → µ > 0. Rearrange the expression

\[ \binom{n}{x} p_n^x (1 - p_n)^{n - x} \]

as

\[ \frac{n(n-1)\cdots(n-x+1)}{n^x} \cdot \frac{1}{(1 - p_n)^x} \cdot \frac{(np_n)^x}{x!} \left(1 - \frac{np_n}{n}\right)^n. \]

Then

\[ \frac{(np_n)^x}{x!} \left(1 - \frac{np_n}{n}\right)^n \to e^{-\mu} \frac{\mu^x}{x!} \]

while the rest of it goes to 1. Thus the binomial distribution Binom(n, pn) approaches the Poisson distribution Pois(µ). In certain circumstances, the Poisson distribution Pois(µ = np) is a good approximation to the binomial distribution Binom(n, p). It has been proved (see footnote 1) that the error of approximation is at most $np^2$.
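The limit can be illustrated numerically by holding np fixed at µ while n grows. In the sketch below the values µ = 2 and x = 3 and the sequence of n are arbitrary illustration choices; only R's "dbinom" and "dpois" functions are used.

```r
mu <- 2      # limiting value of n * p_n
x  <- 3      # point at which the frequency functions are compared

n_values <- c(10, 100, 1000, 10000)

# Binomial frequencies with p_n = mu / n, so that n * p_n = mu for every n.
binom_probs <- dbinom(x, size = n_values, prob = mu / n_values)
pois_prob   <- dpois(x, lambda = mu)

print(data.frame(n = n_values,
                 binomial = binom_probs,
                 abs_diff = abs(binom_probs - pois_prob)))
print(pois_prob)
```

The absolute difference falls steadily as n increases and pn decreases, in line with the limit argument.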

Example 4.6. The incidence of Hantavirus infection in New Mexico during an outbreak of the disease was 4.4 cases per million residents. What is the probability that in a sample of 10000 residents, there will be more than one case of Hantavirus?

Solution: Let X denote the number of cases. Considering this as a binomial experiment, X ∼ Binom(n = 10000, p = $4.4 \times 10^{-6}$). Using the Poisson approximation, X ∼ Pois(µ = $4.4 \times 10^{-2}$). According to R, the true probability is

> 1-pbinom(1,size=10000,prob=4.4e-06)

[1] 0.0009399798

and the Poisson approximation is

> 1-ppois(1,lambda=4.4e-02)

[1] 0.0009400684

The actual error of approximation is $8.86 \times 10^{-8}$. The upper bound $np^2$ for the error is $1.94 \times 10^{-7}$.

Footnote 1: Hodges, J.L. and LeCam, L. (1960), "The Poisson Approximation to the Binomial Distribution", Annals of Mathematical Statistics 31, 737-740.

4.7.1 The Mean and Variance of a Poisson Distribution

Let X ∼ Pois(µ) with frequency function

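\[ f(x) = e^{-\mu} \frac{\mu^x}{x!} \]

for x = 0, 1, 2, · · · . We calculate the mean of X from the definition.

\[
\begin{aligned}
E(X) &= \sum_{x=0}^{\infty} x e^{-\mu} \frac{\mu^x}{x!} \\
&= \mu e^{-\mu} \sum_{x=1}^{\infty} x \frac{\mu^{x-1}}{x!} \\
&= \mu e^{-\mu} \sum_{x=1}^{\infty} \frac{\mu^{x-1}}{(x-1)!} \\
&= \mu e^{-\mu} \sum_{y=0}^{\infty} \frac{\mu^y}{y!} \\
&= \mu e^{-\mu} e^{\mu} = \mu.
\end{aligned}
\]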

We leave it as an exercise to show by a similar argument that $E(X(X-1)) = \mu^2$. Thus, $E(X^2) = \mu^2 + \mu$ and

\[ \operatorname{var}(X) = E(X^2) - E(X)^2 = \mu^2 + \mu - \mu^2 = \mu. \]

The mean and variance of a Poisson distribution are both equal to the parameter µ.
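A numerical check is possible by truncating the infinite sums. The sketch below assumes µ = 2.5 (an arbitrary illustration value) and cuts the sums off at x = 100, beyond which the Poisson frequencies are negligibly small for this µ.

```r
mu <- 2.5

# Truncate the infinite sums at x = 100; for mu = 2.5 the omitted
# frequencies are far below floating-point precision.
x <- 0:100
f <- dpois(x, lambda = mu)

mean_X      <- sum(x * f)               # E(X)
fact_moment <- sum(x * (x - 1) * f)     # E(X(X - 1))
var_X       <- fact_moment + mean_X - mean_X^2

print(c(mean = mean_X, factorial_moment = fact_moment, variance = var_X))
```

The printed mean and variance both agree with µ, and the factorial moment agrees with µ squared. This is a check of the result, not a substitute for the proof.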

4.7.2 Exercises

1. A club has 50 members, 10 belonging to the ruling clique and 40 second-class members. Six members are randomly selected for free movie tickets. What is the probability that 3 or more belong to the ruling clique?

2. Answer the same question if the club has 50,000 members, 10,000 in the ruling clique and 40,000 second-class members.

3. Biologists tagged 50 animals of a species and then released them back into the wild. After a certain "mixing" time they captured a random sample of 50 animals and discovered that 6 of them had been tagged. Let N denote the size of the population and let X denote the number of tagged animals in a random sample of size 50 from the population. For N = 200, 500, 1000 calculate the probability that X ≤ 6.

4. Huck and Jim are waiting for a raft. The number of rafts floating by over intervals of time is a Poisson process with a rate of λ = 0.4 rafts per day. They agree in advance to let the first raft go and take the second one that comes along. What is the probability that they will have to wait more than a week? Hint: If they have to wait more than a week, what does that say about the number of rafts in a period of 7 days?

5. In each of the following cases, use R’s "pbinom" or "dbinom" function to find the true probability of the event. Then give the Poisson approximation and the value of the upper bound for the error. How does the actual error compare to the upper bound?

(a) Pr(X ≤ 6), X ∼ Binom(n = 36,000, p = $1.67 \times 10^{-4}$).

(b) Pr(2 ≤ X ≤ 3), X ∼ Binom($10^5$, $4.4 \times 10^{-6}$).

(c) Pr(X = 6), X ∼ Binom(n = $10^4$, p = $5 \times 10^{-4}$).

6. Fire ant colonies occur according to a spatial Poisson process with a rate of 1.5 colonies per acre. What is the probability that a 10 acre plot of land will have 10 or fewer fire ant colonies?

7. Complete the proof that for X ∼ Pois(µ), var(X) = µ.

4.8 Jointly Distributed Variables

Jointly distributed random variables require more than just their individual distributions to completely characterize them. For now, we concentrate on jointly distributed discrete variables and begin with the case of just two variables.

Definition 4.8. Let X1 and X2 be two jointly distributed discrete random variables. The joint frequency function of X1 and X2 is the function of two variables defined as

f(x1, x2) = Pr(X1 = x1 and X2 = x2).

The marginal frequency functions of X1 and X2 are simply their individual frequency functions as previously defined.

f1(x1) = Pr(X1 = x1),

f2(x2) = Pr(X2 = x2).

Suppose that x1 is given and that f1(x1) = Pr(X1 = x1) > 0. The conditional frequency function of X2, given that X1 = x1, is the function of x2 defined by

f2|1(x2|x1) = Pr(X2 = x2|X1 = x1).

The conditional frequency function of X1, given that X2 = x2 is

f1|2(x1|x2) = Pr(X1 = x1|X2 = x2).

Theorem 4.7. Let X1 and X2 have joint frequency function f(x1, x2). Then

(1) $f_2(x_2) = \sum_{x_1} f(x_1, x_2)$ for all x2.

(2) $f_1(x_1) = \sum_{x_2} f(x_1, x_2)$ for all x1.

(3) $f_{2|1}(x_2|x_1) = f(x_1, x_2)/f_1(x_1)$ if f1(x1) > 0.

(4) X1 and X2 are independent if and only if f(x1, x2) = f1(x1)f2(x2) for all x1, x2.

(5) $\sum_{x_1} \sum_{x_2} f(x_1, x_2) = 1$.

For m ≥ 2 jointly distributed discrete random variables X1, X2, · · · , Xm, the joint frequency function is the function of m arguments defined as

f(x1, x2, · · · , xm) = Pr(X1 = x1;X2 = x2; · · · ;Xm = xm).

Statements analogous to (1) through (5) of the preceding theorem also hold for more than two variables. In particular, X1, X2, · · · , Xm are independent if and only if

\[ f(x_1, x_2, \cdots, x_m) = f_1(x_1) f_2(x_2) \cdots f_m(x_m) \]

for all x1, x2, · · · , xm, where fi(xi) is the marginal frequency function of Xi.

Example 4.7. Roll a standard pair of dice. All outcomes are equally likely. Let X1 be the maximum of the numbers of spots on the two dice and let X2 be the minimum of the two numbers. The joint frequency function of X1 and X2 can be displayed in tabular form as follows.

                          X1: max
X2: min     1      2      3      4      5      6      f2(x2)
   1       1/36   2/36   2/36   2/36   2/36   2/36    11/36
   2       0      1/36   2/36   2/36   2/36   2/36    9/36
   3       0      0      1/36   2/36   2/36   2/36    7/36
   4       0      0      0      1/36   2/36   2/36    5/36
   5       0      0      0      0      1/36   2/36    3/36
   6       0      0      0      0      0      1/36    1/36
f1(x1)     1/36   3/36   5/36   7/36   9/36   11/36   1

The numbers in the rightmost column are the marginal frequencies of X2. The numbers in the bottom row are the marginal frequencies of X1. The conditional frequency function f1|2(x1|3), given that X2 = 3, is obtained by dividing each of the elements of the row corresponding to x2 = 3 by their sum 7/36. Thus,

x1            1     2     3     4     5     6
f1|2(x1|3)    0     0     1/7   2/7   2/7   2/7

Clearly, X1 and X2 are not independent since the marginal frequency function of X1 is not equal to the conditional frequency function given that X2 = 3. It is clear also because the joint frequency function obviously does not factor into the product of the marginal frequencies.
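The joint table, its margins, and the conditional frequencies can be generated by enumerating the 36 outcomes. This is a sketch in base R; the object names (joint, f1, f2, cond_given_3) are illustrative choices.

```r
# All 36 equally likely outcomes of rolling a pair of dice.
rolls <- expand.grid(first = 1:6, second = 1:6)
x1 <- pmax(rolls$first, rolls$second)   # X1: maximum of the two numbers
x2 <- pmin(rolls$first, rolls$second)   # X2: minimum of the two numbers

# Joint frequency function: rows are values of X2, columns values of X1.
joint <- table(min = x2, max = x1) / 36
print(round(joint, 4))

f2 <- rowSums(joint)    # marginal frequencies of X2 (rightmost column)
f1 <- colSums(joint)    # marginal frequencies of X1 (bottom row)

# Conditional frequency function of X1 given X2 = 3: row "3" over its sum.
cond_given_3 <- joint["3", ] / f2["3"]
print(cond_given_3)

# Independence would require joint = (product of the marginals) everywhere.
independent <- isTRUE(all.equal(as.vector(joint), as.vector(outer(f2, f1))))
print(independent)
```

The last line prints FALSE, confirming that the joint frequency function does not factor into the product of the marginals.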

4.8.1 Covariance and Correlation

Definition 4.9. Let X and Y be jointly distributed random variables with respective means µx and µy and standard deviations σx, σy. The covariance of X and Y is

\[ \operatorname{cov}(X, Y) = E((X - \mu_x)(Y - \mu_y)). \]

The correlation of X and Y is

\[ \operatorname{cor}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y}. \]

An alternate formula for the covariance is

cov(X,Y ) = E(XY )− E(X)E(Y ).

The covariance has the following properties. Most of them follow easily from the definition.

1. cov(X,Y ) = cov(Y,X)

2. cov(X,X) = var(X)

3. If X,Y , and Z are jointly distributed and a and b are constants

cov(X, aY + bZ) = a ∗ cov(X,Y ) + b ∗ cov(X,Z).

4. If X and Y are independent, cov(X,Y ) = 0.

It can be shown that −1 ≤ cor(X,Y ) ≤ 1. cor(X,Y ) = 1 if and only if there are constants a, b such that Y = a+bX and b > 0. In other words, there is an exact linear relationship between X and Y with positive slope. cor(X,Y ) = −1 if and only if there is an exact linear relationship with negative slope. In all other cases, the correlation is strictly between -1 and 1. In general, the correlation measures the strength of linear association between X and Y .

Example 4.8. Let us find the covariance and correlation between X1 and X2 in Example 4.7. To begin, we have

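\[
\begin{aligned}
E(X_1 X_2) &= 1 \cdot 1 \cdot 1/36 + 1 \cdot 2 \cdot 2/36 + \cdots + 6 \cdot 5 \cdot 0/36 + 6 \cdot 6 \cdot 1/36 = 12.250, \\
E(X_1) &= 1 \cdot 1/36 + 2 \cdot 3/36 + 3 \cdot 5/36 + 4 \cdot 7/36 + 5 \cdot 9/36 + 6 \cdot 11/36 = 161/36 = 4.472, \\
E(X_2) &= 1 \cdot 11/36 + 2 \cdot 9/36 + 3 \cdot 7/36 + 4 \cdot 5/36 + 5 \cdot 3/36 + 6 \cdot 1/36 = 91/36 = 2.528.
\end{aligned}
\]

Thus,

\[ \operatorname{cov}(X_1, X_2) = E(X_1 X_2) - E(X_1)E(X_2) = 12.25 - (161/36)(91/36) = 0.9452. \]

To get the correlation, we must divide this by the product of the standard deviations. You can show that sd(X1) = 1.404 = sd(X2). Hence,

\[ \operatorname{cor}(X_1, X_2) = \frac{0.9452}{1.971} = 0.479. \]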
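The covariance and correlation of the maximum and minimum can be checked by enumeration. In this sketch, expected values are plain averages over the 36 equally likely outcomes. R's built-in "cov" and "sd" functions are avoided for the population quantities because they divide by one less than the number of values, as sample statistics do; "cor" is unaffected by that choice and is used only as a cross-check.

```r
rolls <- expand.grid(first = 1:6, second = 1:6)
x1 <- pmax(rolls$first, rolls$second)   # maximum
x2 <- pmin(rolls$first, rolls$second)   # minimum

# With equally likely outcomes, E(.) is the ordinary average over outcomes.
cov12 <- mean(x1 * x2) - mean(x1) * mean(x2)
sd1 <- sqrt(mean(x1^2) - mean(x1)^2)
sd2 <- sqrt(mean(x2^2) - mean(x2)^2)
cor12 <- cov12 / (sd1 * sd2)

print(c(E_x1x2 = mean(x1 * x2), E_x1 = mean(x1), E_x2 = mean(x2)))
print(c(cov = cov12, sd1 = sd1, sd2 = sd2, cor = cor12))

# Cross-check: the correlation does not depend on the divisor convention.
print(cor(x1, x2))
```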

Random variables X1 and X2 whose covariance is 0 are said to be uncorrelated. If X1 and X2 are independent then they are uncorrelated. The converse is not true. Being uncorrelated has an important implication for variances.

Theorem 4.8. If jointly distributed random variables X1, X2, · · · , Xn are pairwise uncorrelated, then

var(X1 +X2 + · · ·+Xn) = var(X1) + var(X2) + · · ·+ var(Xn).

Proof: We will assume that n = 2. The general proposition can then be proved easily by induction.

var(X1 +X2) = cov(X1 +X2, X1 +X2)

= cov(X1, X1 +X2) + cov(X2, X1 +X2)

= cov(X1, X1) + cov(X1, X2) + cov(X2, X1) + cov(X2, X2)

= var(X1) + 2cov(X1, X2) + var(X2)

Since by hypothesis cov(X1, X2) = 0,

var(X1 +X2) = var(X1) + var(X2).

4.9 Multinomial Distributions

Suppose X is a random variable which is a factor (nominal variable or categorical variable) with m possible values or levels L1, L2, · · · , Lm. For example, if we randomly choose one member of the population of eligible voters, that person will be classified in one and only one way as "Republican", "Democrat", "Libertarian", "Green", "Other", or "Independent". The random variable X is party affiliation and these six names are its levels. In general, let p1 = Pr(X = L1), p2 = Pr(X = L2), · · · , pm = Pr(X = Lm). Each pi ∈ (0, 1) and p1 + p2 + · · · + pm = 1. Because of this last condition, we can express one of the pi in terms of the others, e.g., $p_m = 1 - \sum_{i=1}^{m-1} p_i$. This leaves only m − 1 of the pi as free parameters which must satisfy $\sum_{i=1}^{m-1} p_i < 1$.

Let the experiment giving rise to X be replicated n times independently. Let Y1 be the number of replications for which X = L1, Y2 the number of replications for which X = L2, and so on. Finally let Ym be the number of replications for which X = Lm. Y1, Y2, etc. are jointly distributed random variables whose joint distribution is called a multinomial distribution. The replications are called multinomial trials.

Let y1, y2, · · · , ym be nonnegative integers such that $\sum_{i=1}^{m} y_i = n$. Consider any particular n-term sequence of levels in which L1 occurs y1 times, L2 occurs y2 times, and so on until finally Lm occurs ym times. By independence, the probability of this sequence resulting from the experiment is

$$p_1^{y_1} p_2^{y_2} \cdots p_m^{y_m}.$$

However, this is only one way that the event (Y1 = y1;Y2 = y2; · · · ;Ym = ym) could occur. The number of ways that it can occur is the total number of n-term sequences that have y1 terms equal to L1, y2 terms equal to L2, and so on. That number is given by the multinomial coefficient

$$\frac{n!}{y_1!\, y_2! \cdots y_m!}.$$

Thus, the joint frequency function for Y1, Y2, · · · , Ym is

$$f(y_1, y_2, \ldots, y_m) = \Pr(Y_1 = y_1; Y_2 = y_2; \cdots; Y_m = y_m) = \frac{n!}{y_1!\, y_2! \cdots y_m!}\, p_1^{y_1} p_2^{y_2} \cdots p_m^{y_m}, \tag{4.4}$$

where y1, y2, · · · , ym are nonnegative integers that sum to n and p1, p2, · · · , pm are positive numbers that sum to 1.

The R function for the multinomial frequency function is "dmultinom". You can read about it by calling

> help(Multinomial)

The required arguments for "dmultinom" are "x", which is the same as the vector (y1, y2, · · · , ym) in the discussion above, and "prob", which is the vector (p1, p2, · · · , pm) of probabilities of the levels. For example, suppose that n = 25, m = 4, and (p1, p2, p3, p4) = (0.2, 0.4, 0.2, 0.2). Say we want to find Pr(Y1 = 5; Y2 = 10; Y3 = 5; Y4 = 5). The answer is

> dmultinom(x=c(5,10,5,5),prob=c(0.2,0.4,0.2,0.2))

[1] 0.00849941
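As a cross-check, the same probability can be computed directly from formula 4.4, the multinomial coefficient times the product of powers. This is a small sketch; the variable names are illustrative only.

```r
# Evaluate formula 4.4 by hand and compare with dmultinom.
y <- c(5, 10, 5, 5)            # counts (y1, ..., ym)
p <- c(0.2, 0.4, 0.2, 0.2)     # level probabilities (p1, ..., pm)
n <- sum(y)                    # number of trials

coef     <- factorial(n) / prod(factorial(y))  # multinomial coefficient
by_hand  <- coef * prod(p^y)                   # formula 4.4
built_in <- dmultinom(x = y, prob = p)
print(c(by_hand = by_hand, built_in = built_in))
```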

Example 4.9: Hardy-Weinberg genetic equilibrium

A gene occurs in two forms, or alleles, a dominant form "A" and a recessive form "a". Each individual organism in the population carries two copies of the gene, one from each parent. The organism has genotype "AA" if both copies are of form A, "Aa" if one is of form A and the other of form a, or "aa" if both are allele a. Let θ denote the proportion of all the copies of the gene in the population which are of form A. Then 1 − θ is the proportion of form a. The Hardy-Weinberg model for genetic equilibrium assumes that in matings, the alleles contributed by the parents are independently selected with probabilities equal to their frequencies in the population. Thus, the probability that an offspring will have genotype AA is θ², the probability of aa is (1 − θ)², and the probability of Aa is 2θ(1 − θ). This is a fair assumption if the population is large, thoroughly mixed, and none of the genotypes has a reproductive advantage over the others.

Suppose that a proportion θ = 0.65 of genes are of form A. The Hardy-Weinberg genotype probabilities are pAA = 0.65² = 0.4225, pAa = 2 × 0.65 × 0.35 = 0.455, and paa = 0.35² = 0.1225. Suppose that a sample of size 100 is randomly selected from the population and each organism in the sample is typed. Let YAA, YAa, and Yaa denote the numbers of the three genotypes in the sample. The outcome (YAA, YAa, Yaa) = (42, 46, 12) is the most probable outcome. However, its probability is small.


> dmultinom(c(42,46,12),prob=c(.4225,.455,.1225))

[1] 0.01028722
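The claim that (42, 46, 12) is the most probable outcome can be examined by brute force, since there are only 5151 possible outcomes for n = 100 and three genotypes. The sketch below enumerates all of them. The data frame name grid and its column names are assumptions made for the illustration.

```r
# Enumerate every outcome (AA, Aa, aa) with AA + Aa + aa = 100 and
# evaluate its multinomial probability under the Hardy-Weinberg model.
n <- 100
p <- c(AA = 0.4225, Aa = 0.455, aa = 0.1225)

grid <- expand.grid(AA = 0:n, Aa = 0:n)
grid <- grid[grid$AA + grid$Aa <= n, ]       # keep only feasible counts
grid$aa <- n - grid$AA - grid$Aa

grid$prob <- apply(grid[, c("AA", "Aa", "aa")], 1,
                   function(y) dmultinom(y, prob = p))

best <- grid[which.max(grid$prob), ]         # outcome with the largest probability
print(best)
print(sum(grid$prob))                        # the probabilities should sum to 1
```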

Marginal and conditional distributions of a multinomial distribution are also multinomial. In particular, each component Yi of Y = (Y1, Y2, · · · , Ym) has a binomial distribution Yi ∼ Binom(n, pi). This is easy to see without any calculation. Simply call a trial a success if level Li occurs, otherwise call it a failure. The conditional distributions are well illustrated by the case m = 3. Let (Y1, Y2, Y3) have a multinomial distribution based on n trials with probabilities (p1, p2, p3). The conditional distribution of Y1, given that Y2 = y2, is binomial with n − y2 trials and with success probability p1/(1 − p2) = p1/(p1 + p3).

$$(Y_1 \mid Y_2 = y_2) \sim \mathrm{Binom}\left(n - y_2,\ \frac{p_1}{1 - p_2}\right). \tag{4.5}$$
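Equation 4.5 can be checked numerically by dividing the joint frequency function by the marginal frequency function of Y2. The following is a sketch with arbitrarily chosen values of n, (p1, p2, p3), and y2. It is a numerical check, not a proof.

```r
# Conditional distribution of Y1 given Y2 = y2, computed two ways.
n  <- 12
p  <- c(0.2, 0.5, 0.3)
y2 <- 4
y1 <- 0:(n - y2)                       # possible values of Y1 given Y2 = y2

# Joint probability Pr(Y1 = k, Y2 = y2, Y3 = n - y2 - k) for each k
joint <- sapply(y1, function(k) dmultinom(c(k, y2, n - y2 - k), prob = p))

# Divide by the marginal Pr(Y2 = y2), which is Binom(n, p2)
conditional <- joint / dbinom(y2, size = n, prob = p[2])

# Right-hand side of equation 4.5
claimed <- dbinom(y1, size = n - y2, prob = p[1] / (1 - p[2]))

print(max(abs(conditional - claimed)))  # largest discrepancy between the two
```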

When Y = (Y1, · · · , Ym) has a multinomial distribution, the components Yi of Y are correlated, therefore dependent. When m ≥ 3 the covariance between Yi and Yj, i ≠ j, is

$$\mathrm{cov}(Y_i, Y_j) = -n p_i p_j \tag{4.6}$$

and the correlation is

$$\mathrm{cor}(Y_i, Y_j) = -\sqrt{\frac{p_i}{1 - p_i} \cdot \frac{p_j}{1 - p_j}}. \tag{4.7}$$
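For a small number of trials, formulas 4.6 and 4.7 can be compared with values computed exactly from the joint frequency function. The sketch below uses n = 6 and (p1, p2, p3) = (0.2, 0.3, 0.5), chosen only for illustration. The helper e is a hypothetical name for the expectation under the joint distribution.

```r
# Exact covariance and correlation of Y1 and Y2 by enumeration.
n <- 6
p <- c(0.2, 0.3, 0.5)

outcomes <- expand.grid(y1 = 0:n, y2 = 0:n)
outcomes <- outcomes[outcomes$y1 + outcomes$y2 <= n, ]
outcomes$y3 <- n - outcomes$y1 - outcomes$y2

# Probability of each outcome from the multinomial frequency function
w <- apply(outcomes, 1, function(y) dmultinom(y, prob = p))
e <- function(v) sum(v * w)             # expectation under the joint distribution

cov12 <- e(outcomes$y1 * outcomes$y2) - e(outcomes$y1) * e(outcomes$y2)
sd1   <- sqrt(e(outcomes$y1^2) - e(outcomes$y1)^2)
sd2   <- sqrt(e(outcomes$y2^2) - e(outcomes$y2)^2)
cor12 <- cov12 / (sd1 * sd2)

cov_formula <- -n * p[1] * p[2]                                 # equation 4.6
cor_formula <- -sqrt(p[1] / (1 - p[1]) * p[2] / (1 - p[2]))     # equation 4.7
print(c(cov12 = cov12, cov_formula = cov_formula))
print(c(cor12 = cor12, cor_formula = cor_formula))
```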

4.9.1 Exercises

1. Roll a pair of standard dice. All outcomes are equally likely. Let X1 be the minimum of the numbers on the dice and let X2 be their sum. Construct a joint frequency table like the one in Example 6. Include the marginal frequency functions by summing the rows and columns of the table.

2. Find the conditional frequency function of X2, given that X1 = 2. Are X1 and X2 independent or dependent?

3. Let X1 and X2 be independent discrete random variables with frequency functions f1 and f2, respectively. Let Y = X1 +X2. The frequency function for Y is given by the convolution formula:

$$g(y) = \sum_{x_1} f_2(y - x_1)\, f_1(x_1).$$

Verify the convolution formula for the case where X1 and X2 are independent rolls of a fair die.

4. The proportion of the dominant allele of a certain gene in a population is 0.75. The recessive proportion is 0.25. A sample of 20 members of the population is taken and their genotypes determined. What is the probability that the sample had 12 pure dominant, 2 pure recessive, and 6 mixed genotypes?

5. From a set of n objects, y1 are to be chosen and labelled "L1", y2 are to be labelled "L2", y3 are to be labelled "L3", and so on until finally, the last ym are labelled "Lm". The number of ways this can be done is

$$\binom{n}{y_1} \binom{n - y_1}{y_2} \binom{n - y_1 - y_2}{y_3} \cdots \binom{y_m}{y_m}.$$

Simplify this expression.

6. Let (Y1, Y2, Y3) have a multinomial distribution with n = 30 and (p1, p2, p3) = (0.25, 0.40, 0.35). What is the conditional distribution of Y2, given that Y1 = 10?

7. Prove equation 4.5.

8. In problem 4 above, what are the covariance and correlation between the number of pure dominant and the number of mixed types in the sample of 20 organisms? What is the conditional distribution of the number of mixed types, given that the number of pure dominant types is 12?

9. Derive equation 4.7 from equation 4.6. Show that the absolute value of the expression on the right hand side of 4.7 is less than 1.


Chapter 5

Continuous Distributions

5.1 Density Functions

All measurements are made with limited precision. Therefore, one could say that all observable numeric random variables are discrete. However, some numeric observations are made with very great precision and can be replicated many times. In such situations it may be hopeless to try to describe a discrete distribution for the observations. Instead, the distribution may be approximated closely by a continuous distribution which is more amenable to mathematical treatment.

Example 5.1. A number between 0 and 1 is randomly selected in the following way. A fair coin is tossed a large number of times (e.g., m = 100) to generate a string of 1’s and 0’s. These are the bits in a binary representation of the number. Thus, the number is

$$U_m = \sum_{i=1}^{m} \frac{X_i}{2^i},$$

where X1, · · · , Xm are independent Bernoulli random variables with success probability p = 0.5.

There are 2^m possible values of Um, so it would be a chore to write them all down and tabulate the frequency function. In fact, the values of Um are all equally likely. (Can you see why?) Since all m-term binary expansions of numbers in (0, 1) are equally likely, it seems that the values of Um should be evenly distributed over the interval (0, 1). In other words, two subintervals of (0, 1) of the same length should have the same proportion of values of Um. That is nearly true, and it is exactly true in the limit as m → ∞. If we allow infinite binary expansions and define

$$U = \sum_{i=1}^{\infty} \frac{X_i}{2^i}$$

then the range of U is the set of all real numbers between 0 and 1. For any subinterval (u1, u2) of (0, 1)

Pr(u1 < U ≤ u2) = FU (u2)− FU (u1) = u2 − u1.


We say that the random variable U is uniformly distributed over the interval (0, 1). Note that U is not observable but Um is. For reasonably large values of m, the discrete distribution of Um is closely approximated by the continuous distribution of U .
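The approximation of Um by U can be seen exactly for a moderate m, because the 2^m values of Um are the equally spaced numbers k/2^m for k = 0, 1, ..., 2^m − 1. The sketch below uses m = 10 and an arbitrarily chosen interval (u1, u2]. Both choices are assumptions for the illustration.

```r
# The proportion of values of U_m in (u1, u2] compared with the
# interval length u2 - u1, which is the probability under Unif(0, 1).
m <- 10
values <- (0:(2^m - 1)) / 2^m    # the 2^m equally likely values of U_m

u1 <- 0.3
u2 <- 0.55
prop <- mean(values > u1 & values <= u2)   # exact Pr(u1 < U_m <= u2)
print(c(proportion = prop, interval_length = u2 - u1))
```

The two numbers can differ by at most about 1/2^m, which shrinks rapidly as m grows.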

The cumulative distribution function of U is

$$F(u) = \begin{cases} 0 & \text{if } u < 0, \\ u & \text{if } 0 \le u \le 1, \\ 1 & \text{if } u > 1. \end{cases} \tag{5.1}$$

[Figure: the cumulative distribution function F(u) of U, plotted against u for −1 ≤ u ≤ 2.]

The cumulative distribution function 5.1 is the integral of another function called the density function of U .

$$f(u) = \begin{cases} 0 & \text{if } u < 0 \text{ or } u > 1, \\ 1 & \text{if } 0 \le u \le 1. \end{cases} \tag{5.2}$$

[Figure: the density function f(u) of U, plotted against u for −0.5 ≤ u ≤ 1.5.]

By this we mean that

$$F(u) = \int_{-\infty}^{u} f(x)\,dx$$

for all real numbers u. It follows that

$$\Pr(u_1 < U \le u_2) = \int_{u_1}^{u_2} f(u)\,du$$

for any interval (u1, u2). This distribution is called the uniform distribution over the interval (0, 1). In the notation used by R, we write U ∼ Unif(0, 1) to indicate that the random variable U has this distribution.

Definition 5.1. A density function is a nonnegative function f defined on the set of real numbers R such that

$$\int_{-\infty}^{\infty} f(x)\,dx = 1.$$


Theorem 5.1. If f is a density function, then its integral $F(x) = \int_{-\infty}^{x} f(u)\,du$ is a continuous cumulative distribution function. That is, F is nondecreasing, $\lim_{x \to -\infty} F(x) = 0$, and $\lim_{x \to \infty} F(x) = 1$. If X is a random variable with this density function, then for any two real numbers x1, x2 with x1 < x2,

$$\Pr(x_1 < X \le x_2) = \int_{x_1}^{x_2} f(u)\,du.$$

Conversely, if F is a continuous cumulative distribution function which is continuously differentiable except perhaps at a finite set of points, its derivative f(x) = F′(x) is a density function and $F(x) = \int_{-\infty}^{x} f(u)\,du$.
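The relationship between a density and its cumulative distribution can be explored numerically with R's "integrate" function. The sketch below uses the density f(x) = 3x² on (0, 1), whose cumulative distribution is F(x) = x³ there. This density is not from the text and is chosen only because the answers are easy to verify by hand. The names F_at, prob, and slope are illustrative.

```r
# A density supported on (0, 1): f(x) = 3 x^2, zero elsewhere.
f <- function(x) ifelse(x >= 0 & x <= 1, 3 * x^2, 0)

# Total area under f. Since f vanishes outside (0, 1), integrate there.
total <- integrate(f, lower = 0, upper = 1)$value

# Cumulative distribution at x, valid for 0 <= x <= 1
F_at <- function(x) integrate(f, lower = 0, upper = x)$value

# Pr(0.5 < X <= 0.8) as a difference of cumulative probabilities
prob <- F_at(0.8) - F_at(0.5)

# A central difference quotient of F at 0.6 approximates the density f(0.6)
h <- 1e-5
slope <- (F_at(0.6 + h) - F_at(0.6 - h)) / (2 * h)

print(c(total = total, prob = prob, slope = slope, density = f(0.6)))
```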

Except for some slight caveats it is accurate to say that the cumulative distribution is the integral of the density and that the density is the derivative of the cumulative distribution. The density function is analogous to the frequency function for a discrete random variable. The difference is that sums are replaced by integrals. In the discrete case

$$\Pr(a < X \le b) = \sum_{a < x \le b} f(x),$$

while in the continuous case

$$\Pr(a < X \le b) = \int_{a}^{b} f(x)\,dx.$$

For a random variable X with a continuous distribution, Pr(X = x) = 0 for each fixed x. Thus, the events (a < X ≤ b), (a < X < b), (a ≤ X < b) and (a ≤ X ≤ b) all have the same probability. End points of intervals may be included or not included according to convenience.

Example 5.2. Let X be a Poisson process occurring in time, so that X(I) is the number of "arrivals" in the time interval I. Let λ > 0 be the rate of the process. Instead of focusing on the number of arrivals in a given interval of time, let us consider the times between successive arrivals. Let the random variable T be the time from the beginning (t = 0) until the first arrival and let t > 0 be a given positive number. The event (T > t) happens if and only if the number X(0, t) of occurrences in the interval (0, t) is zero. X(0, t) has a Poisson distribution with parameter µ = λt. Hence,

$$\Pr(T > t) = \Pr(X(0, t) = 0) = e^{-\lambda t}.$$

The cumulative distribution of T is

$$F(t) = \Pr(T \le t) = 1 - e^{-\lambda t}$$

for t ≥ 0 and F(t) = 0 for t < 0. The density of T is

$$f(t) = F'(t) = \lambda e^{-\lambda t}$$

for t > 0. For t ≤ 0, f(t) = 0. The cumulative distribution and the density of T are plotted below for λ = 1. This distribution is called an exponential distribution and the parameter λ is its rate parameter. We write T ∼ Exp(rate = λ) to indicate that T has this distribution.
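The link between the Poisson count and the waiting time can be confirmed with R's built-in functions "dpois" and "pexp". The following sketch uses an arbitrary rate λ = 2 and a few arbitrary time points. The variable names are illustrative.

```r
# Pr(T > t) computed three ways: as a Poisson probability of zero arrivals,
# as an exponential survival probability, and from the formula exp(-lambda t).
lambda <- 2
t <- c(0.1, 0.5, 1, 2)

no_arrivals <- dpois(0, lambda = lambda * t)   # Pr(X(0, t) = 0)
survival    <- 1 - pexp(t, rate = lambda)      # 1 - F(t)
by_formula  <- exp(-lambda * t)

print(cbind(t, no_arrivals, survival, by_formula))
```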


> par(mfrow=c(2,1))

> curve(pexp(x),from=-1,to=4,ylab="F(x)")

> abline(h=0,lty=2)

> abline(v=0,lty=2)

> curve(dexp(x),from=0,to=4,ylab="f(x)",xlim=c(-1,4))

> lines(c(-1,0),c(0,0))

> abline(v=0,lty=2)

> abline(h=0,lty=2)

> par(mfrow=c(1,1))

[Figure: two panels for the Exp(rate = 1) distribution, plotted for −1 ≤ x ≤ 4. Top: the cumulative distribution F(x). Bottom: the density f(x).]

5.2 Expected Values and Quantiles for Continuous Distributions

5.2.1 Expected Values

When a random variable X has a continuous distribution, we evaluate probabilities Pr(X ∈ I) as integrals $\int_I f(x)\,dx$ rather than sums $\sum_{x \in I} f(x)$. Similarly, we evaluate expected values as integrals.

If X has density function f(x), then

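$$E(X) = \mu = \int_{-\infty}^{\infty} x f(x)\,dx.$$

More generally, if h is a function defined on the range of X,

$$E(h(X)) = \int_{-\infty}^{\infty} h(x) f(x)\,dx.$$

In particular,

$$\mathrm{var}(X) = \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2.$$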

As before, we require that these expressions be absolutely convergent; otherwise the expected values do not exist.

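Example 5.3. Let X have the exponential distribution with rate parameter λ > 0. For x ≥ 0 the density of X is

$$f(x) = \lambda e^{-\lambda x},$$

and f(x) = 0 for x < 0. The expected value of X is

$$\begin{aligned} \mu = E(X) &= \int_{-\infty}^{\infty} x f(x)\,dx \\ &= \int_{0}^{\infty} x \lambda e^{-\lambda x}\,dx \\ &= \frac{1}{\lambda} \int_{0}^{\infty} u e^{-u}\,du \\ &= \frac{1}{\lambda}. \end{aligned}$$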


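Similarly, the expected value of X² is

$$\begin{aligned} E(X^2) &= \int_{-\infty}^{\infty} x^2 f(x)\,dx \\ &= \int_{0}^{\infty} x^2 \lambda e^{-\lambda x}\,dx \\ &= \frac{1}{\lambda^2} \int_{0}^{\infty} u^2 e^{-u}\,du \\ &= \frac{2}{\lambda^2}. \end{aligned}$$

It follows that the variance of X is

$$\mathrm{var}(X) = \frac{2}{\lambda^2} - \left(\frac{1}{\lambda}\right)^2 = \frac{1}{\lambda^2}$$

and the standard deviation is

$$\mathrm{sd}(X) = \frac{1}{\lambda}.$$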

The mean and standard deviation of an exponential distribution are the same.
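These integrals can be approximated with R's "integrate" function as a sanity check on the algebra. The sketch below uses an arbitrary rate λ = 2.5. The names m1, m2, and v are illustrative.

```r
# Numerical mean and variance of Exp(rate = lambda) compared with
# the closed forms 1/lambda and 1/lambda^2.
lambda <- 2.5
f <- function(x) dexp(x, rate = lambda)

m1 <- integrate(function(x) x * f(x),   lower = 0, upper = Inf)$value  # E(X)
m2 <- integrate(function(x) x^2 * f(x), lower = 0, upper = Inf)$value  # E(X^2)
v  <- m2 - m1^2                                                        # var(X)

print(c(mean = m1, mean_formula = 1 / lambda))
print(c(variance = v, variance_formula = 1 / lambda^2))
```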

The exponential distribution is often identified by its mean parameter µ rather than the rate parameter λ. When it is, the density and cumulative distribution are

$$f(x) = \frac{1}{\mu} e^{-x/\mu}$$

$$F(x) = 1 - e^{-x/\mu}$$

for x ≥ 0.

5.2.2 Quantiles

Definition 5.2. Let F be a given cumulative distribution and let p ∈ (0, 1). The pth quantile of F, also called the 100pth percentile of F, is defined as

$$F^{-1}(p) = \min\{x \mid F(x) \ge p\}.$$

When F is identified with a particular random variable X, we also write q(X, p) for F⁻¹(p). F⁻¹(.25), F⁻¹(.5), and F⁻¹(.75) are the first quartile, the median, and the third quartile of F, respectively. The function F⁻¹ as defined above is called the quantile function.

For continuous distributions, F⁻¹(p) is the smallest number x such that F(x) = p. Often F has a true inverse function and finding the quantile reduces to finding the unique solution of the equation F(x) = p.

Example 5.4. Let F be the exponential distribution with mean µ and cumulative distribution F(x) = 1 − e^(−x/µ) for x ≥ 0. Setting F(x) = p and solving for x gives

$$x = F^{-1}(p) = -\mu \ln(1 - p).$$

Thus, the median of F is −µ ln(.5) = µ ln 2.
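The closed-form quantile can be compared with R's "qexp", which is parameterized by the rate 1/µ rather than the mean. The sketch below takes µ = 3 as an arbitrary choice.

```r
# Quantiles of the exponential distribution with mean mu:
# closed form -mu * log(1 - p) versus qexp with rate = 1/mu.
mu <- 3
p  <- c(0.25, 0.5, 0.75)

by_formula <- -mu * log(1 - p)
built_in   <- qexp(p, rate = 1 / mu)

print(rbind(p, by_formula, built_in))
```

The middle column is the median, which equals µ ln 2.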


5.2.3 Exercises

1. For 0 ≤ x ≤ 1 let f(x) = kx(1 − x), where k is a constant. Find the value of k such that f is a density function.

2. Find the mean and variance of the distribution in the preceding exercise.

3. For x ≥ 0, let f(x) = 2x e^(−x²). Show that f is a density function.

4. Find the cumulative distribution for the density in the preceding exercise.

5. Find the pth quantile of this distribution.

6. For a real number x, let F(x) = e^x/(1 + e^x). Find the density function for this cumulative distribution function. Find the quantile function F⁻¹(p).

5.3 Uniform Distributions

Let U ∼ Unif(0, 1) and let a and b > a be constants. Define a new random variable X by linearly transforming U as X = a + (b − a)U. Since U lies between 0 and 1, X lies between a and b. We will calculate the cumulative distribution of X by a method that is useful for many kinds of transformations, both linear and nonlinear.

Let x be an arbitrary number in (a, b).

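$$\begin{aligned} F_X(x) = \Pr(X \le x) &= \Pr(a + (b - a)U \le x) \\ &= \Pr\left(U \le \frac{x - a}{b - a}\right) \\ &= F_U\left(\frac{x - a}{b - a}\right) \\ &= \frac{x - a}{b - a} \end{aligned}$$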

since FU (u) = u for u ∈ (0, 1). Thus the cumulative distribution of X is

$$F(x) = \begin{cases} 0 & \text{if } x \le a, \\ (x - a)/(b - a) & \text{if } a < x < b, \\ 1 & \text{if } x \ge b. \end{cases} \tag{5.3}$$

By differentiating we get the uniform density on (a, b).

$$f(x) = F'(x) = \begin{cases} 1/(b - a) & \text{if } a < x < b, \\ 0 & \text{otherwise.} \end{cases} \tag{5.4}$$

Definition 5.3. A random variable X is uniformly distributed on the interval (a, b) if its cumulative distribution function is given by 5.3 and its density function is 5.4.

We indicate that X is uniformly distributed on (a, b) by writing X ∼ Unif(a, b).

Example 5.5. X ∼ Unif(−1, 3). Find the following probabilities.


(a) Pr(X ≤ 2)

(b) Pr(−2 ≤ X < 1)

(c) Pr(0 < X < 3)

(d) Pr(X > 0)

Solution: We will use both elementary calculations and the R function "punif" for the cumulative distribution.

(a) F(2) = (2 − (−1))/(3 − (−1)) = 0.75

> punif(2,min=-1,max=3)

[1] 0.75

(b) F(1) − F(−2) = 0.5 − 0 = 0.5

> punif(1,-1,3)-punif(-2,-1,3)

[1] 0.5

(c) F(3) − F(0) = 1 − 1/4 = 0.75

> punif(3,-1,3)-punif(0,-1,3)

[1] 0.75

(d) 1 − F(0) = 1 − 1/4 = 0.75

> 1-punif(0,-1,3)

[1] 0.75

The Mean, Variance and Quantile Function of a Uniform Distribution

The formulas for the mean, variance and quantile function of a uniform distribution are simple calculations which we leave as an exercise. Let X ∼ Unif(a, b).

$$E(X) = \frac{a + b}{2} \tag{5.5}$$

$$\mathrm{var}(X) = \frac{(b - a)^2}{12} \tag{5.6}$$

$$q(X, p) = a + (b - a)\,p \tag{5.7}$$
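Formulas 5.5 to 5.7 can be checked numerically for a particular interval before working the exercise by hand. The sketch below uses a = −1 and b = 3, the interval from Example 5.5, with R's "dunif", "qunif", and "integrate". The variable names are illustrative.

```r
# Numerical check of the mean, variance, and quantile formulas for Unif(a, b).
a <- -1
b <- 3
f <- function(x) dunif(x, min = a, max = b)

mean_num <- integrate(function(x) x * f(x), lower = a, upper = b)$value
var_num  <- integrate(function(x) (x - mean_num)^2 * f(x),
                      lower = a, upper = b)$value

p <- c(0.1, 0.5, 0.9)
q_builtin <- qunif(p, min = a, max = b)
q_formula <- a + (b - a) * p             # equation 5.7

print(c(mean_num = mean_num, mean_formula = (a + b) / 2))
print(c(var_num = var_num, var_formula = (b - a)^2 / 12))
print(rbind(q_builtin, q_formula))
```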

5.4 Exponential Distributions and Their Relatives

5.4.1 Exponential Distributions

Definition 5.4. A random variable X has an exponential distribution with rate parameter λ > 0, if its cumulative distribution is

$$F(x) = \begin{cases} 1 - e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0, \end{cases} \tag{5.8}$$

with density function

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases} \tag{5.9}$$

When X has the exponential distribution with rate parameter λ, we write X ∼ Exp(λ).

In a previous section, we showed that the mean and standard deviation of the exponential distribution with rate λ are both equal to µ = 1/λ and that the quantile function is given by F⁻¹(p) = −µ ln(1 − p).

As previously mentioned, the exponential distributions have an intimate connection with Poisson processes. If we observe a Poisson process evolving in time and T1 is the time from the beginning until the first arrival, T2 is the time between the first and second arrivals, T3 the time between the second and third arrivals, and so on, then T1, T2, T3, etc. are independent random variables all with the same exponential distribution Exp(λ), where λ is the rate parameter for the Poisson process.

Exponential distributions are a starting point for the study of lifetime distributions. Let T denote the length of time a randomly chosen member of a population survives in a particular state. The population is not necessarily a biological population; it could be the population of atoms in a lump of radioactive matter and the lifetime could be the amount of time an atom survives in an excited state before decaying into a lower energy state. Our main interest is in the survival function S(t) = 1 − F(t) = Pr(T > t) for t > 0, interpreted as the probability of survival past time t. If T has an exponential distribution with rate parameter λ, then

$$S(t) = e^{-\lambda t}.$$

This describes systems at the atomic level quite well. The exponential distribution has a peculiar property called the "memoryless" property which seems to be true for such things as atoms in an excited energy state. Informally, this property states that the probability that an object survives an additional ∆t units of time is independent of the amount of time t that the object has already survived. In symbols, Pr(T > t + ∆t | T > t) does not depend on t. For an exponential distribution

$$\Pr(T > t + \Delta t \mid T > t) = \frac{\Pr(T > t + \Delta t)}{\Pr(T > t)} = \frac{e^{-\lambda(t + \Delta t)}}{e^{-\lambda t}} = e^{-\lambda \Delta t}.$$

Thus an object whose lifetime distribution is exponential does not age. The exponential distributions are the only continuous distributions with the memoryless property. Lifetime distributions for complex systems (e.g., organisms) are not exponential because clearly they do age. The probability of survival for an additional 5 years is not the same for a 70 year old man as for a 20 year old man.
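The memoryless property is easy to observe numerically. The sketch below evaluates the conditional survival probability at several ages t for a fixed ∆t, using the "lower.tail = FALSE" option of "pexp" to get survival probabilities directly. The rate 0.5, the value ∆t = 2, and the helper name surv are arbitrary choices for the illustration.

```r
# Conditional probability of surviving an additional dt time units,
# given survival to time t, for an exponential lifetime.
lambda <- 0.5
dt <- 2
t  <- c(0, 1, 5, 20)

surv <- function(t) pexp(t, rate = lambda, lower.tail = FALSE)  # S(t) = Pr(T > t)
cond <- surv(t + dt) / surv(t)       # Pr(T > t + dt | T > t)

print(cbind(t, cond))
print(exp(-lambda * dt))             # the common value predicted by the formula
```

Every entry of the cond column matches exp(−λ∆t), regardless of t.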

The cumulative distribution and quantile function for the exponential distributions are calculated in R with the functions "pexp" and "qexp". For "pexp" the required argument is x, the point at which the cumulative distribution is to be evaluated. The rate argument for the rate parameter is required only if it is different from 1.

> pexp(2)

[1] 0.8646647

> pexp(2,rate=2)

[1] 0.9816844


For the quantile function qexp, the argument p is required and the rate argument is optional.

> qexp(.75)

[1] 1.386294

> qexp(.75,rate=2)

[1] 0.6931472

5.4.2 Gamma Distributions

The gamma function is an extension to the set of all positive real numbers of the familiar factorial function defined for nonnegative integers. It is defined for α > 0 as

$$\Gamma(\alpha) = \int_{0}^{\infty} x^{\alpha - 1} e^{-x}\,dx.$$

This integral converges for any α > 0. Using integration by parts, it can be shown that

$$\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha)$$

and it is easily seen that Γ(1) = 1. Hence, Γ(2) = 1 ∗ Γ(1) = 1, Γ(3) = 2 ∗ Γ(2) = 2, Γ(4) = 3 ∗ Γ(3) = 3!, and by induction Γ(n) = (n − 1)! for positive integers n. Now modify the integrand in the definition slightly as follows:

$$\int_{0}^{\infty} x^{\alpha - 1} e^{-\lambda x}\,dx,$$

where λ > 0 is a constant. By making the change of variables u = λx, the integral becomes

$$\frac{1}{\lambda^{\alpha}} \int_{0}^{\infty} u^{\alpha - 1} e^{-u}\,du = \frac{\Gamma(\alpha)}{\lambda^{\alpha}}.$$

Thus,

$$f(x; \alpha, \lambda) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\lambda x}$$

for x > 0 is a density function. It is called the gamma density with shape parameter α and rate parameter λ. Sometimes the formula is written in terms of the scale parameter β = 1/λ instead of the rate parameter λ. When X has a gamma distribution we write X ∼ Gamma(α, rate = λ) or X ∼ Gamma(α, scale = β).

The exponential distributions are special cases of the gamma distributions corresponding to α = 1.

Like exponential distributions, the gamma distributions have a connection to Poisson processes. The time between the kth and (k + m)th arrivals in a Poisson process with rate parameter λ has the gamma distribution Gamma(shape = m, rate = λ).
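Several of the facts above can be spot-checked in R with the built-in functions "gamma", "factorial", "dgamma", "dexp", "pgamma", and "ppois". The sketch below is a set of numerical checks under arbitrarily chosen parameter values, not a derivation. The Poisson check uses the fact that the waiting time for m arrivals exceeds t exactly when at most m − 1 arrivals occur in an interval of length t.

```r
# 1. Gamma(n) = (n - 1)! for positive integers n
n <- 1:6
gamma_vals <- gamma(n)
fact_vals  <- factorial(n - 1)

# 2. The recursion Gamma(alpha + 1) = alpha * Gamma(alpha) for a non-integer alpha
alpha <- 2.7
recursion_gap <- gamma(alpha + 1) - alpha * gamma(alpha)

# 3. The gamma density integrates to 1 (shape 2.5 and rate 2 chosen arbitrarily)
area <- integrate(function(x) dgamma(x, shape = 2.5, rate = 2),
                  lower = 0, upper = Inf)$value

# 4. Shape alpha = 1 gives the exponential density
x <- c(0.5, 1, 2)
gamma_dens <- dgamma(x, shape = 1, rate = 2)
exp_dens   <- dexp(x, rate = 2)

# 5. Poisson connection: Pr(waiting time for m arrivals > t)
#    equals Pr(at most m - 1 arrivals in time t)
m <- 3; lambda <- 1.5; t <- 2
wait_tail  <- pgamma(t, shape = m, rate = lambda, lower.tail = FALSE)
count_prob <- ppois(m - 1, lambda = lambda * t)

print(rbind(gamma_vals, fact_vals))
print(c(recursion_gap = recursion_gap, area = area))
print(c(wait_tail = wait_tail, count_prob = count_prob))
```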

Several gamma density functions with different shape parameters are plotted below.

> par(mfrow=c(2,2))

> curve(dgamma(x,shape=.5),from=0,to=4,xlab="x",ylab="f(x)",

+ main="alpha = 0.5")

> curve(dgamma(x,shape=1),from=0,to=4,xlab="x",ylab="f(x)",

+ main="alpha = 1 (exponential)")

> curve(dgamma(x,shape=2),from=0,to=8,xlab="x",ylab="f(x)",

+ main="alpha = 2")

> curve(dgamma(x,shape=3),from=0,to=8,xlab="x",ylab="f(x)",

+ main="alpha = 3")

> par(mfrow=c(1,1))

[Figure: gamma density functions f(x) in four panels, titled alpha = 0.5, alpha = 1 (exponential), alpha = 2, and alpha = 3.]

Aside from their relationship to Poisson processes, some gamma distributions are especially important in the theory of statistical inference. We will cover this aspect of them in a later chapter.


The Mean, Variance and Quantiles of a Gamma Distribution

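If X ∼ Gamma(shape = α, scale = β),

$$\begin{aligned} E(X) &= \frac{1}{\beta^{\alpha}\Gamma(\alpha)} \int_{0}^{\infty} x^{\alpha} e^{-x/\beta}\,dx \\ &= \frac{\beta^{\alpha+1}\Gamma(\alpha+1)}{\beta^{\alpha}\Gamma(\alpha)} \\ &= \frac{\beta^{\alpha+1}\alpha\,\Gamma(\alpha)}{\beta^{\alpha}\Gamma(\alpha)} \\ &= \alpha\beta \end{aligned} \tag{5.10}$$

By a similar argument E(X²) = α(α + 1)β². Thus,

$$\mathrm{var}(X) = \alpha\beta^2 \tag{5.11}$$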
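Equations 5.10 and 5.11 can be compared with numerically integrated moments. The sketch below uses α = 3 and β = 2, chosen only for illustration. The names m1 and m2 are hypothetical.

```r
# Numerical first and second moments of Gamma(shape = alpha, scale = beta).
alpha <- 3
beta  <- 2
f <- function(x) dgamma(x, shape = alpha, scale = beta)

m1 <- integrate(function(x) x * f(x),   lower = 0, upper = Inf)$value  # E(X)
m2 <- integrate(function(x) x^2 * f(x), lower = 0, upper = Inf)$value  # E(X^2)

print(c(mean = m1, mean_formula = alpha * beta))                      # equation 5.10
print(c(second_moment = m2, formula = alpha * (alpha + 1) * beta^2))
print(c(variance = m2 - m1^2, var_formula = alpha * beta^2))          # equation 5.11
```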

To find the pth quantile of a gamma distribution we would have to solve the equation

$$p = \frac{1}{\beta^{\alpha}\Gamma(\alpha)} \int_{0}^{x} t^{\alpha - 1} e^{-t/\beta}\,dt$$

for x. This cannot be done in terms of elementary functions of x; however, the equation can be solved numerically. R's quantile function for the gamma distributions, "qgamma", does this. For example, if α = 3 and β = 2, the third quartile of the corresponding gamma distribution is

> qgamma(.75,shape=3,scale=2)

[1] 7.840804

The function "pgamma" gives the value of the cumulative distribution.

> pgamma(7.840804,shape=3,scale=2)

[1] 0.75

5.4.3 Weibull Distributions

Engineers and scientists have found that a power transformation of an exponential random variable sometimes results in a more realistic representation of the lifetime distribution of a complex system. Specifically, suppose that X ∼ Exp(λ = 1) and let T = βX^(1/α), where α and β are positive constants. We write the exponent as 1/α rather than α simply to make the formulas come out nicer in the end. The survival function for T is

$$\begin{aligned} \Pr(T > t) &= \Pr(\beta X^{1/\alpha} > t) \\ &= \Pr\left(X > \left(\frac{t}{\beta}\right)^{\alpha}\right) \\ &= e^{-(t/\beta)^{\alpha}} \end{aligned} \tag{5.12}$$


From the survival function we easily derive the cumulative distribution and the density.

$$F(t) = 1 - e^{-(t/\beta)^{\alpha}}$$

$$f(t) = \frac{\alpha}{\beta} \left(\frac{t}{\beta}\right)^{\alpha - 1} e^{-(t/\beta)^{\alpha}} \tag{5.13}$$

The utility of the Weibull distribution in survival analysis is that it has the two adjustable parameters α and β which can be adapted to specific data on survival of the systems under study. In other words, they can be estimated from data. Estimation of parameters is one of the major goals of statistical inference. The parameter β is called the scale parameter because it adjusts for a change in the scale of time measurement. The parameter α is the shape parameter. It governs the shape of the density function and its rate of decay as t → ∞. For α = 1 the Weibull distribution is the exponential distribution. For α > 1 the system wears out with age in the sense that the conditional probability of survival another ∆t units of time decreases as t increases. For α < 1, the system improves with age: the conditional probability of survival another ∆t units increases as t increases. Examples of the survival function and density function for several values of α are plotted below.
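The formulas 5.12 and 5.13 and the aging behavior described above can be checked against R's "pweibull" and "dweibull". The sketch below uses arbitrary parameter values. The helper cond_surv is a hypothetical name for the conditional survival probability with scale β = 1.

```r
# Compare equations 5.12 and 5.13 with R's built-in Weibull functions.
alpha <- 3
beta  <- 2
t <- c(0.5, 1, 2, 3)

surv_formula <- exp(-(t / beta)^alpha)                                  # (5.12)
surv_builtin <- pweibull(t, shape = alpha, scale = beta, lower.tail = FALSE)

dens_formula <- (alpha / beta) * (t / beta)^(alpha - 1) *
                exp(-(t / beta)^alpha)                                  # (5.13)
dens_builtin <- dweibull(t, shape = alpha, scale = beta)

# Conditional probability of surviving dt more units given survival to t
cond_surv <- function(t, dt, shape) {
  pweibull(t + dt, shape = shape, lower.tail = FALSE) /
    pweibull(t, shape = shape, lower.tail = FALSE)
}

ages <- c(0.5, 1, 2)
wears_out <- cond_surv(ages, dt = 0.5, shape = 3)    # alpha > 1
improves  <- cond_surv(ages, dt = 0.5, shape = 0.5)  # alpha < 1

print(rbind(surv_formula, surv_builtin))
print(rbind(dens_formula, dens_builtin))
print(rbind(ages, wears_out, improves))
```

With α = 3 the conditional survival probabilities fall as the age increases, while with α = 0.5 they rise.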

> par(mfrow=c(2,2))

> curve(1-pweibull(x,shape=0.5),from=0,to=4,ylab="S(t)",xlab="t",

+ main="alpha=0.5")

> curve(1-pweibull(x,shape=1),from=0,to=4,ylab="S(t)",xlab="t",

+ main="alpha=1 (exponential)")

> curve(1-pweibull(x,shape=3),from=0,to=4,ylab="S(t)",xlab="t",

+ main="alpha=3")

> curve(1-pweibull(x,shape=4),from=0,to=4,ylab="S(t)",xlab="t",

+ main="alpha=4")

> par(mfrow=c(1,1))

[Figure: Weibull survival functions S(t) for 0 ≤ t ≤ 4 in four panels, titled alpha=0.5, alpha=1 (exponential), alpha=3, and alpha=4.]

> par(mfrow=c(2,2))

> curve(dweibull(x,shape=0.5),from=0,to=4,ylab="f(t)",xlab="t",

+ main="alpha=0.5")

> curve(dweibull(x,shape=1),from=0,to=4,ylab="f(t)",xlab="t",

+ main="alpha=1 (exponential)")

> curve(dweibull(x,shape=3),from=0,to=4,ylab="f(t)",xlab="t",

+ main="alpha=3")

> curve(dweibull(x,shape=4),from=0,to=4,ylab="f(t)",xlab="t",

+ main="alpha=4")

> par(mfrow=c(1,1))

[Figure: Weibull density functions f(t) for 0 ≤ t ≤ 4 in four panels, titled alpha=0.5, alpha=1 (exponential), alpha=3, and alpha=4.]

The Mean, Variance and Quantile Function of a Weibull Distribution

The mean, variance, and quantiles of a Weibull distribution are easily found from its relationship to the standard exponential distribution Exp(λ = 1) with density f(x) = e^(−x), x ≥ 0. If X has this distribution, then Y = βX^(1/α) has the Weibull distribution Weib(shape = α, scale = β). Therefore,

$$E(Y) = \beta \int_{0}^{\infty} x^{1/\alpha} e^{-x}\,dx = \beta\,\Gamma\left(\frac{1}{\alpha} + 1\right) \tag{5.14}$$

Likewise,

$$E(Y^2) = \beta^2\,\Gamma\left(\frac{2}{\alpha} + 1\right)$$

so that

$$\mathrm{var}(Y) = \beta^2 \left\{ \Gamma\left(\frac{2}{\alpha} + 1\right) - \Gamma\left(\frac{1}{\alpha} + 1\right)^2 \right\}. \tag{5.15}$$


The transformation y = βx^(1/α) is strictly increasing. Therefore, it maps quantiles of X to corresponding quantiles of Y. The pth quantile of X is −ln(1 − p). Thus, the pth quantile of Y is

$$q(Y, p) = \beta\,[-\ln(1 - p)]^{1/\alpha} \tag{5.16}$$
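Equations 5.14 to 5.16 can be checked against numerical integration and R's "qweibull". The sketch below uses α = 2 and β = 3, chosen arbitrarily for illustration. The variable names are hypothetical.

```r
# Mean, variance, and quantiles of Weib(shape = alpha, scale = beta).
alpha <- 2
beta  <- 3
f <- function(y) dweibull(y, shape = alpha, scale = beta)

mean_formula <- beta * gamma(1 / alpha + 1)                                # (5.14)
var_formula  <- beta^2 * (gamma(2 / alpha + 1) - gamma(1 / alpha + 1)^2)   # (5.15)

mean_num <- integrate(function(y) y * f(y), lower = 0, upper = Inf)$value
var_num  <- integrate(function(y) (y - mean_formula)^2 * f(y),
                      lower = 0, upper = Inf)$value

p <- c(0.25, 0.5, 0.75)
q_formula <- beta * (-log(1 - p))^(1 / alpha)                              # (5.16)
q_builtin <- qweibull(p, shape = alpha, scale = beta)

print(c(mean_formula = mean_formula, mean_num = mean_num))
print(c(var_formula = var_formula, var_num = var_num))
print(rbind(q_formula, q_builtin))
```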

5.4.4 Exercises

1. Suppose X ∼ Unif(−1, 3). Find the probabilities of the following events, both by hand calculation and with R’s punif function. (a) (X ≤ 2) (b) (X ≥ 1) (c) (−0.5 < X < 1.5) (d) (X = 0)

2. Find the median of Unif(a, b).

3. Suppose U ∼ Unif(0, 1). Show that 1− U ∼ Unif(0, 1).

4. Suppose U ∼ Unif(0, 1) and that λ > 0 is a given constant. Let

$$X = -\frac{1}{\lambda} \log(1 - U).$$

Find the cumulative distribution of X.

5. Find the first and third quartiles of the exponential distribution with mean µ. Compare the interquartile range F⁻¹(.75) − F⁻¹(.25) to the standard deviation.

6. Suppose X ∼ Exp(λ = 2). Find the probabilities of the events (a) – (d) in exercise 1 above.

7. The lifetimes in years of air conditioning systems have a Weibull distribution with a shape parameter α = 4 and scale parameter β = 8. What is the probability that your new a.c. system will last more than 10 years? What is the third quartile of a.c. lifetimes?

8. Refer to problem 7 above. Given that your a.c. system has already lasted more than 10 years, what is the probability that it will last at least one more year? Given that it has already lasted 5 years, what is the probability that it will last another year?

9. Suppose X ∼ Gamma(α, scale = β) and that Y = kX with k > 0 a constant. Show that Y ∼ Gamma(α, scale = kβ).

10. Suppose X ∼ Gamma(α = 3, scale = 2). Use R's "pgamma" function to find:

(a) Pr(X ≤ 4) (b) Pr(1 ≤ X < 3)


5.5 Normal Distributions

Definition 5.5. The random variable Z has the standard normal distribution if its density function is

$$\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \tag{5.17}$$

for all z, −∞ < z < ∞. This is indicated by the expression Z ∼ Norm(0, 1).

By considering the rapid rate at which φ(z) → 0 as |z| → ∞, it is easy to see that $\int_{-\infty}^{\infty} \varphi(z)\,dz$ is finite.

To show that the integral is 1, so that φ is actually a density function, is trickier. The standard trick is to let the value of the integral be denoted by I and then to write

$$I^2 = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-x^2/2}\,dx \int_{-\infty}^{\infty} e^{-y^2/2}\,dy.$$

Next, write this as a double integral.

$$I^2 = \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)/2}\,dx\,dy.$$

Now change to polar coordinates x = r cos θ, y = r sin θ with r ≥ 0 and 0 ≤ θ < 2π.

$$I^2 = \frac{1}{2\pi} \int_{\theta=0}^{2\pi} \int_{r=0}^{\infty} e^{-r^2/2}\, r\,dr\,d\theta.$$

This expression is equal to 1. We leave the remaining details to the reader.
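Before filling in the details by hand, the value of the integral can be confirmed numerically. The sketch below integrates "dnorm" over the whole real line and also evaluates the inner polar integral over r. The variable names are illustrative, and the results are numerical approximations rather than a proof.

```r
# Total area under the standard normal density
total <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Inner polar integral: integral of r * exp(-r^2 / 2) over r from 0 to infinity
inner <- integrate(function(r) r * exp(-r^2 / 2), lower = 0, upper = Inf)$value

# The integrand does not depend on theta, so the theta integral contributes 2 * pi
I_squared <- (1 / (2 * pi)) * (2 * pi) * inner

print(c(total = total, inner = inner, I_squared = I_squared))
```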

The standard normal cumulative distribution is

$$\Phi(z) = \int_{-\infty}^{z} \varphi(u)\,du = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-u^2/2}\,du. \tag{5.18}$$

This integral cannot be expressed in terms of elementary functions, but it can be numerically evaluated as accurately as desired. The R function for evaluating Φ is "pnorm" and the function for Φ⁻¹ is "qnorm". For example,

> pnorm(1.645)

[1] 0.9500151

> pnorm(1.98)

[1] 0.9761482

> pnorm(-0.68)

[1] 0.2482522

> qnorm(.975)

[1] 1.959964

> qnorm(.25)

[1] -0.6744898


Although it is seldom needed, the density function may be evaluated with "dnorm".

> dnorm(0)

[1] 0.3989423

> dnorm(1.645)

[1] 0.1031108

The density and cumulative distribution of the standard normal are plotted below.

[Figure: two panels plotted for −3 ≤ x ≤ 3. Top: the standard normal density. Bottom: the standard normal cumulative distribution.]

5.5.1 Tables of the Standard Normal Distribution

Tables of values of the standard normal cumulative distribution are widely available. The one below was produced with R. The first two significant digits of z are arranged along the left hand margin and the third is read from the top row. The table entries are Φ(z) for nonnegative values of z. For negative z use the relation Φ(−z) = 1 − Φ(z), which holds for all z because of the symmetry of the density function φ.


z 0 1 2 3 4 5 6 7 8 9

0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359

0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753

0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141

0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517

0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879

0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224

0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549

0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852

0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133

0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389

1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621

1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830

1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015

1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177

1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319

1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441

1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545

1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633

1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706

1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767

2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817

2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857

2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890

2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916

2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936

2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952

2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964

2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974

2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981

2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
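
A table with this layout can be regenerated in a few lines of R. The following is only a sketch of one way to do it, not necessarily how the table above was produced; the names `rows`, `cols` and `tab` are illustrative.

```r
# Row labels: units and tenths digits of z; column labels: hundredths digit.
rows <- seq(0, 2.9, by = 0.1)
cols <- seq(0, 0.09, by = 0.01)

# outer() forms every sum rows[i] + cols[j]; pnorm() is applied elementwise.
tab <- round(pnorm(outer(rows, cols, "+")), 4)
dimnames(tab) <- list(format(rows, nsmall = 1), as.character(0:9))

tab["1.9", "6"]   # table entry for z = 1.96
print(tab)
```

Indexing by the row and column labels mirrors the way the printed table is read by hand.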

5.5.2 Other Normal Distributions

Suppose that Z ∼ Norm(0, 1) and that σ > 0 and µ are constants. Let the random variable X be related to Z by X = σZ+µ. We calculate the cumulative distribution of X by the same transformation methods we have used before.

$$F_X(x) = \Pr(X \le x) = \Pr(\sigma Z + \mu \le x) = \Pr\left(Z \le \frac{x-\mu}{\sigma}\right) = \Phi\left(\frac{x-\mu}{\sigma}\right) \tag{5.19}$$

Differentiating FX gives the density function of the random variable X.

$$f_X(x) = \frac{1}{\sigma}\Phi'\left(\frac{x-\mu}{\sigma}\right) = \frac{1}{\sigma}\varphi\left(\frac{x-\mu}{\sigma}\right) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{5.20}$$

The distribution of X is called the normal distribution with parameters µ and σ. This is indicated by the expression X ∼ Norm(µ, σ).
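
Equations (5.19) and (5.20) can be checked numerically against R's two-parameter normal functions. The sketch below uses arbitrary illustrative values for µ, σ and the evaluation points; the variable names are not from the text.

```r
mu <- 2; sigma <- 3          # illustrative parameter values
x <- c(-1, 0.5, 2, 5.5)      # arbitrary evaluation points

# Right-hand sides of (5.19) and (5.20), built from the standard normal
cdf_519 <- pnorm((x - mu) / sigma)
pdf_520 <- dnorm((x - mu) / sigma) / sigma

# R's own two-parameter versions for comparison; both lines should print TRUE
print(all.equal(cdf_519, pnorm(x, mean = mu, sd = sigma)))
print(all.equal(pdf_520, dnorm(x, mean = mu, sd = sigma)))
```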

Theorem 5.2. X ∼ Norm(µ, σ) if and only if Z = (X − µ)/σ ∼ Norm(0, 1).

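If Z ∼ Norm(0, 1), then

$$E(Z) = \int_{-\infty}^{\infty} z\varphi(z)\,dz = 0$$

because the integrand zφ(z) is an odd function. To find the variance

$$\mathrm{var}(Z) = E(Z^2) = \int_{-\infty}^{\infty} z^2\varphi(z)\,dz,$$

integrate by parts, letting u = z and dv = zφ(z)dz. Since φ′(z) = −zφ(z), we may take v = −φ(z); the boundary term vanishes and the remaining integral is the total area under φ. The result is

$$\mathrm{var}(Z) = 1.$$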
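
As a numerical sanity check of these two integrals, one can hand them to R's general-purpose `integrate` function. This is only a sketch; `mean_z` and `var_z` are illustrative names.

```r
phi <- dnorm   # standard normal density

# E(Z) and E(Z^2) by numerical integration over the whole real line
mean_z <- integrate(function(z) z * phi(z), -Inf, Inf)$value
var_z  <- integrate(function(z) z^2 * phi(z), -Inf, Inf)$value

print(c(mean = mean_z, variance = var_z))   # should be close to 0 and 1
```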

If X ∼ Norm(µ, σ), then X = µ + σZ with Z ∼ Norm(0, 1). Hence,

E(X) = µ + σE(Z) = µ,

and var(X) = σ²var(Z) = σ².

Thus the parameter µ is the mean and the parameter σ is the standard deviation of a normal distribution.

The practical implication of Theorem 5.2 is that we can use the standard normal table for any normal distribution.

Example: Let X ∼ Norm(µ = 2, σ = 3). Find Pr(X ≤ 5.5). Find the 95th percentile of X.

Solution: According to equation 5.19,

Pr(X ≤ 5.5) = FX(5.5) = Φ((5.5− 2)/3) = Φ(1.1667)

Interpolating the table values, we get 0.8784.

Since X = µ + σZ, $F_X^{-1}(p) = \mu + \sigma\Phi^{-1}(p)$. Thus, $F_X^{-1}(0.95) = 2 + 3\Phi^{-1}(0.95)$. Reading the table backward and interpolating again, we have $\Phi^{-1}(0.95) = 1.645$. Hence, $F_X^{-1}(0.95) = 6.935$.

The R functions for computing the normal cumulative distribution $F_X$ and its inverse $F_X^{-1}$ are "pnorm" and "qnorm", illustrated below.

> pnorm(5.5,mean=2,sd=3)

[1] 0.8783275

> qnorm(.95,mean=2,sd=3)

[1] 6.934561

5.5.3 The Normal Approximation to the Binomial Distribution

We shall see in the next chapter that normal distributions occupy a central place in probability and statistics. This is because of a famous theorem called the central limit theorem. In just one of many applications of the central limit theorem, the binomial distributions may be approximated by normal distributions. We state the result as a theorem and postpone its justification until the next chapter.

Theorem 5.3. Let Y ∼ Binom(n, p), where p is a fixed constant. For all real z,

$$\Pr\left(\frac{Y - np}{\sqrt{np(1-p)}} \le z\right) \to \Phi(z)$$

as n → ∞.

Example 5.6. Suppose that Y is the number of heads in 30 tosses of a fair coin. Let us approximate Pr(Y > 18).

Solution: When approximating the binomial with the normal distribution, a better approximation is obtained by applying the continuity correction. This means adjusting the inequalities that describe events so that their endpoints avoid the points of discontinuity of the binomial distribution, i.e., the possible values of the discrete random variable Y. Since Y is an integer, the events (Y > 18) and (Y > 18.5) are actually the same. The mean of Y is 15 and its standard deviation is $\sqrt{7.5} = 2.739$.

$$\Pr(Y > 18.5) = \Pr\left(\frac{Y - 15}{2.739} > \frac{18.5 - 15}{2.739}\right) \approx 1 - \Phi(1.278)$$

> 1-pnorm(1.278)

[1] 0.1006247

Compare this to the exact binomial probability

> 1-pbinom(18,30,0.5)

[1] 0.1002442
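
To see how much the continuity correction matters in this example, the sketch below computes the normal approximation with and without it and sets both beside the exact binomial probability. The variable names are illustrative.

```r
n <- 30; p <- 0.5
mu <- n * p
sigma <- sqrt(n * p * (1 - p))

exact      <- 1 - pbinom(18, n, p)         # Pr(Y > 18) = Pr(Y >= 19)
with_cc    <- 1 - pnorm(18.5, mu, sigma)   # continuity-corrected approximation
without_cc <- 1 - pnorm(18, mu, sigma)     # approximation with no correction

print(round(c(exact = exact, with_cc = with_cc, without_cc = without_cc), 4))
```

The corrected value should land much closer to the exact probability than the uncorrected one.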


5.5.4 Exercises

1. Let Z ∼ Norm(0, 1). Use the normal table and also R’s "pnorm" function to find

(a) Pr(Z ≤ 1.45) (b) Pr(Z > −1.28) (c) Pr(−0.674 ≤ Z < 1.036) (d) Pr(Z > 0.836)

2. Use the normal table and also R’s "pnorm" function to find

(a) Pr(X ≤ 6.13), X ∼ Norm(1, 4) (b) Pr(X > −2.35), X ∼ Norm(−1, 2) (c) Pr(−0.872 < X ≤ 7.682), X ∼ Norm(2.5, 5) (d) Pr(X > 0.698), X ∼ Norm(−2, 4)

3. Use the normal table and also R’s "qnorm" function to find

(a) The 90th percentile of Norm(0, 5). (b) The 15th percentile of Norm(1, 3). (c) The interquartile range, i.e., the distance from the first to third quartiles of Norm(µ, σ).

4. A student makes a score of 700 on an achievement test with normally distributed scores having a mean of 600 and a standard deviation of 75. What is the student’s percentile score?

5. X is the number of heads when a fair coin is tossed 30 times. Find the exact binomial probability Pr(X = 15) and its normal approximation.


Chapter 6

Joint Distributions and Sampling Distributions

6.1 Introduction

We have already discussed jointly distributed discrete random variables, their joint and marginal distributions, and their conditional distributions. If X and Y are jointly distributed discrete variables, their joint frequency function is related to joint probabilities by

$$\Pr(X \in I_1; Y \in I_2) = \sum_{x \in I_1}\sum_{y \in I_2} f(x, y).$$

The marginal (individual) frequency functions of the variables are related to the joint frequency function by, e.g.,

$$f_X(x) = \sum_{y} f(x, y),$$

and the conditional frequency function of one variable, given the value of the other, is

$$f_{X|Y}(x|y) = \Pr(X = x | Y = y) = \frac{f(x, y)}{f_Y(y)}.$$

6.2 Jointly Distributed Continuous Variables

The formal relationships for jointly distributed continuous variables are similar, except that sums must be replaced by integrals. If X and Y are jointly distributed continuous variables, their joint density function is a function f(x, y) ≥ 0 of two arguments such that for all intervals I1 and I2,

$$\Pr(X \in I_1; Y \in I_2) = \int_{I_2}\int_{I_1} f(x, y)\,dx\,dy. \tag{6.1}$$

More generally, if A is any region in the x, y cartesian plane that has an area,

$$\Pr((X, Y) \in A) = \iint_A f(x, y)\,dx\,dy.$$

Here (X, Y) can be thought of as a random point in the x, y plane. In (6.1) if we let I2 = (−∞,∞) we find that the marginal density function of X is

$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy. \tag{6.2}$$

Similarly, the marginal density of Y is

$$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx.$$

Let x be a fixed number such that fX(x) > 0. The conditional density function of Y , given that X = x, is the function of y:

$$f_{Y|X}(y|x) = \frac{f(x, y)}{f_X(x)}. \tag{6.3}$$

If we integrate this function with respect to y over an interval I, we obtain the conditional probability that Y ∈ I, given that X = x.

$$\Pr(Y \in I | X = x) = \int_I f_{Y|X}(y|x)\,dy. \tag{6.4}$$

In this situation, this is the definition of Pr(Y ∈ I|X = x). The elementary definition of conditional probability does not work because the event (X = x) has zero probability. Conditional and unconditional probabilities for Y are related by

$$\Pr(Y \in I) = \int_{-\infty}^{\infty} \Pr(Y \in I | X = x) f_X(x)\,dx. \tag{6.5}$$

Example 6.1. The Uniform Distribution over a Region in the Plane. Consider the shaded triangular region T with vertices (0, 0), (1, 0), and (1, 1) shown below. When we say that (X, Y) is uniformly distributed over T we mean that f(x, y) has a constant value on T and is zero outside of T. Since $\iint_T f(x, y)\,dx\,dy = 1$, this implies that for (x, y) ∈ T,

f(x, y) = 1/(area of T).

For any region A in the plane, Pr((X,Y ) ∈ A) is the proportion of the area of T occupied by A, i.e., the area of A ∩ T divided by the area of T . In this example the area of T is 1/2, so the joint density of X and Y is f(x, y) = 2 if (x, y) ∈ T and f(x, y) = 0 otherwise. We will find the marginal and conditional density functions associated with f .

[Figure: the shaded triangular region T with vertices (0, 0), (1, 0) and (1, 1); horizontal axis x and vertical axis y, each from 0 to 1.]

The basic formula for the marginal density of X is given by (6.2). For x < 0 or x > 1, f(x, y) = 0 for all y and therefore fX(x) = 0. For fixed x between 0 and 1, as pictured, f(x, y) = 0 if y < 0 or y > x and f(x, y) = 2 if 0 ≤ y ≤ x. Thus, (6.2) reduces to

$$f_X(x) = \int_{y=0}^{y=x} 2\,dy = 2x.$$

Similarly, for 0 ≤ y ≤ 1,

$$f_Y(y) = \int_{x=y}^{x=1} 2\,dx = 2(1 - y).$$

To find the conditional density of Y , given that X = x we must remember that the definition requires that fX(x) > 0, i.e., that 0 < x ≤ 1. Since f(x, y) = 0 for y outside the interval (0, x),

$$f_{Y|X}(y|x) = \frac{2}{2x} = \frac{1}{x}$$

for 0 ≤ y ≤ x. In other words, given X = x, Y is uniformly distributed on the interval (0, x). This should be almost obvious from the picture. Likewise, given Y = y ∈ (0, 1), X is uniformly distributed on the interval (y, 1).
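
These conclusions can be checked by simulation. The sketch below is not part of the original example: it draws points uniformly from the unit square, keeps those that fall in T, and looks at a few summaries. The seed, sample size and variable names are arbitrary choices.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 100000

# Draw points uniformly from the unit square and keep those inside T (y <= x).
u <- runif(n)
v <- runif(n)
inside <- v <= u
x <- u[inside]
y <- v[inside]

print(mean(inside))   # fraction kept; should be near the area of T, 1/2
print(mean(x))        # should be near the mean of the density 2x on (0, 1), i.e. 2/3
print(mean(y / x))    # given X = x, Y/x is Unif(0, 1), so this should be near 1/2
```

A histogram of `x` with `freq = FALSE` should likewise resemble the line 2x.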

The conditional distribution of Y , given that X = x may have a mean

$$E(Y|X = x) = \int_{-\infty}^{\infty} y f_{Y|X}(y|x)\,dy$$

and a variance

$$\mathrm{var}(Y|X = x) = \int_{-\infty}^{\infty} y^2 f_{Y|X}(y|x)\,dy - E(Y|X = x)^2.$$

If so, these will generally be functions of the number x. The relationship between the conditional and unconditional expected values of Y is readily obtained from the definition of the conditional density.

$$\int_{-\infty}^{\infty} E(Y|X = x) f_X(x)\,dx = \int_{-\infty}^{\infty} y f_Y(y)\,dy = E(Y). \tag{6.6}$$

More generally, if g(y) is a function defined on the range of the random variable Y ,

$$E(g(Y)) = \int_{-\infty}^{\infty} E(g(Y)|X = x) f_X(x)\,dx.$$

6.2.1 Mixed Joint Distributions

It is quite common to have two or more jointly distributed random variables, some continuous and others discrete. In most such cases their conditional distributions precede a description of their joint distribution. Suppose X and Y are jointly distributed, X is discrete with frequency function fX(x), and Y is continuous with conditional density function fY |X(y|x). The joint distribution of X and Y is characterized through their joint hybrid frequency-density function

f(x, y) = fX(x)fY |X(y|x)

and the marginal density function of Y is

$$f_Y(y) = \sum_{x} f_{Y|X}(y|x) f_X(x).$$

Then the conditional frequency function of X, given that Y = y is

$$f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x) f_X(x)}{f_Y(y)}. \tag{6.7}$$

Example 6.2. The site of an archaeological excavation was at two different times occupied by two genetically distinct groups of people. The earlier group is thought to have comprised about 25% of their combined numbers. One distinguishing anatomical characteristic is the logarithm Y of the ratio of skull height to skull width. For the earlier group, Y is normally distributed with mean 0.233 and standard deviation 0.04. For the later group, the log of the skull ratio is normally distributed with mean 0.300 and standard deviation 0.04. A skull is excavated that has a value of Y = 0.240. What is the probability that it came from the earlier group?

Solution: Let X = 1 if a skull comes from the earlier group and X = 0 if it comes from the later group. X is a Bernoulli variable with success probability p = 0.25. We want to find $f_{X|Y}(1|0.240)$ = Pr(X = 1|Y = 0.240). We will perform the calculations of equation (6.7) with R and its "dnorm()" function.


> fxy=0.25*dnorm(0.240,mean=0.233,sd=0.04)

> fy=fxy+0.75*dnorm(0.240,mean=0.300,sd=0.04)

> fxy/fy

[1] 0.5027688

6.2.2 Covariance and Correlation

If X and Y are jointly distributed with joint density function f(x, y) and g(x, y) is a real valued function, then g(X,Y ) is a random variable. If its expected value exists, it can be found by

$$E(g(X, Y)) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f(x, y)\,dx\,dy. \tag{6.8}$$

Especially important is the covariance of X and Y .

cov(X,Y ) = E((X − µx)(Y − µy)) = E(XY )− E(X)E(Y ),

where µx = E(X) and µy = E(Y ). The correlation between X and Y is

$$\mathrm{cor}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_x\sigma_y},$$

a number between -1 and 1. Here σx and σy are the standard deviations of X and Y , respectively. The interpretations of the covariance and correlation for continuous variables are the same as for discrete variables. The greater the absolute value of cor(X,Y ), the closer the variables X and Y come to satisfying a linear relationship of the form aX + bY = c for some constants a, b and c.

Example 6.3. We will calculate the covariance and correlation between X and Y in Example 6.1.

$$\begin{aligned} E(XY) &= \int_0^1\left(\int_0^x 2xy\,dy\right)dx \\ &= \int_0^1 2x\left(\int_0^x y\,dy\right)dx \\ &= \int_0^1 x^3\,dx \\ &= 1/4 \end{aligned}$$

We leave it as an exercise to show that E(X) = 2/3, E(Y) = 1/3, E(X²) = 1/2, and E(Y²) = 1/6. It follows that var(X) = 1/2 − (2/3)² = 1/18 and var(Y) = 1/6 − (1/3)² = 1/18. Thus,

cov(X, Y) = 1/4 − (2/3)(1/3) = 1/36

and

$$\mathrm{cor}(X, Y) = \frac{1/36}{1/18} = \frac{1}{2}.$$

Definition 6.1. Two random variables X and Y are uncorrelated if cov(X,Y ) = 0, equivalently, if E(XY ) = E(X)E(Y ).

Example 6.4. Let f(x, y) be the uniform density over the unit disc in the x, y plane. We leave it to the reader to show that E(X) = E(Y) = 0. Thus, the covariance of X and Y is just E(XY). For all four quadrants Q, the integral

$$\iint_Q xy f(x, y)\,dx\,dy$$

is the same except for sign. It is positive in the first and third quadrants and negative in the second and fourth quadrants. Thus, cov(X, Y) = 0. Students should work out the details by actually doing the integrations.

[Figure: the unit disc in the x, y plane; both axes run from −1.5 to 1.5.]
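
A simulation makes the distinction between "uncorrelated" and "independent" concrete for the uniform distribution on the disc. The sketch below uses rejection sampling from the enclosing square; the seed, sample size and names are arbitrary. If X and Y were independent, then X² and Y² would be uncorrelated as well, so a clearly nonzero correlation between the squares signals dependence.

```r
set.seed(2)   # arbitrary seed
n <- 200000

# Rejection sampling: uniform points in the square [-1, 1]^2, keep those in the disc.
u <- runif(n, -1, 1)
v <- runif(n, -1, 1)
keep <- u^2 + v^2 <= 1
x <- u[keep]
y <- v[keep]

print(cor(x, y))       # near 0: X and Y are uncorrelated
print(cor(x^2, y^2))   # clearly negative: X and Y are not independent
```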

6.2.3 Bivariate Normal Distributions

A bivariate normal distribution depends on five parameters, µx, µy, σx > 0, σy > 0, and ρ ∈ (−1, 1). Let X and Y be jointly distributed variables and let X ∼ Norm(µx, σx). Also, let the conditional distribution of Y, given that X = x, be normal with conditional mean

$$E(Y|X = x) = \mu_y + \rho\frac{\sigma_y}{\sigma_x}(x - \mu_x), \tag{6.9}$$

conditional variance

$$\mathrm{var}(Y|X = x) = \sigma_y^2(1 - \rho^2), \tag{6.10}$$

and standard deviation $\sigma_y\sqrt{1 - \rho^2}$. Then the joint density of X and Y is the product of the two normal densities

f(x, y) = fY |X(y|x)fX(x). (6.11)

It is tedious, but only algebra, to show that the joint density of X and Y works out to be

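$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right) + \left(\frac{y-\mu_y}{\sigma_y}\right)^2\right]\right\} \tag{6.12}$$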

This expression remains the same if all the x's and y's are interchanged, so it follows that the marginal distribution of Y is Norm(µy, σy) and that the conditional distribution of X, given that Y = y, is normal with conditional mean $\mu_x + \rho\frac{\sigma_x}{\sigma_y}(y - \mu_y)$ and variance $\sigma_x^2(1 - \rho^2)$. The parameter ρ is the correlation between X and Y. Notice that if ρ = 0, the joint density factors into the product of the marginal densities.

f(x, y) = fX(x)fY (y).

This means that X and Y are independent. We will say more about this in the next section.
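
The two-step description of the bivariate normal (a normal X, then a normal Y whose mean depends linearly on x) translates directly into a simulation. The sketch below uses illustrative parameter values and an arbitrary seed; it checks that the simulated Y has roughly the marginal mean and standard deviation claimed above and that the sample correlation is near ρ.

```r
set.seed(3)   # arbitrary seed
n <- 100000
mux <- 1; muy <- -2; sx <- 2; sy <- 0.5; rho <- 0.6   # illustrative values

x <- rnorm(n, mean = mux, sd = sx)
# Conditional mean (6.9) and the conditional standard deviation from (6.10)
y <- rnorm(n,
           mean = muy + rho * (sy / sx) * (x - mux),
           sd   = sy * sqrt(1 - rho^2))

print(c(mean_y = mean(y), sd_y = sd(y), cor_xy = cor(x, y)))
# should be close to muy, sy and rho respectively
```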

The level curves f(x, y) = constant of the bivariate normal density function are ellipses whose inclination to the coordinate axes and eccentricities depend on the correlation ρ. The figure below shows level curves for the bivariate normal density function with zero means, unit variances and ρ = −0.7.

[Figure: elliptical level curves of the bivariate normal density, contour levels 0.02 to 0.22; both axes run from −3 to 3.]

To show that X and Y have a joint bivariate normal distribution, it is enough to show that X is normally distributed and that the conditional distribution of Y, given that X = x, is normal with mean β0 + β1x, a linear function of x, and constant variance σ² that does not depend on x. The parameters µy, σy and ρ are then uniquely determined by β0, β1, σ and the mean and standard deviation of X.

6.3 Independent Random Variables

We have already defined what it means for jointly distributed random variables X1, · · · , Xn to be independent. To repeat that definition, it means that for all intervals I1, · · · , In of real numbers,

Pr(X1 ∈ I1;X2 ∈ I2; · · · ;Xn ∈ In) = Pr(X1 ∈ I1)Pr(X2 ∈ I2) · · ·Pr(Xn ∈ In).

If each Xi has a density function or frequency function fi(x), the Xi are independent if and only if the joint frequency-density function factors into the product of the marginal frequency or density functions:

$$f(x_1, x_2, \ldots, x_n) = f_1(x_1) f_2(x_2) \cdots f_n(x_n),$$

for all x1, · · · , xn.

Theorem 6.1. Let X and Y be jointly distributed independent variables with finite variances. Then X and Y are uncorrelated.

Proof: We will assume that X and Y are continuous variables. The argument for other cases would be almost the same. Since X and Y are independent,

f(x, y) = fX(x)fY (y)

and

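$$\begin{aligned} E(XY) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy f_X(x) f_Y(y)\,dx\,dy \\ &= \int_{-\infty}^{\infty} x f_X(x)\,dx \int_{-\infty}^{\infty} y f_Y(y)\,dy \\ &= E(X)E(Y) \end{aligned}$$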

Thus, cov(X,Y ) = E(XY )− E(X)E(Y ) = 0.

The converse is not true in general. In Example 6.4, X and Y are uncorrelated but not independent. However, if X and Y have a bivariate normal distribution with correlation ρ = 0, then they are independent.

6.3.1 Exercises

1. In Example 6.1, are X and Y independent or dependent? Give reasons.

2. In Example 6.1 find Pr(X > 1/2) in two different ways, one by integrating the marginal density function and the other by considering ratios of areas.

3. 70% of students in calculus 1 are STEM majors. The semester average in calculus 1 for STEM majors is normally distributed with mean 81 and standard deviation 8. The semester average for non-STEM majors is normally distributed with mean 75 and standard deviation 10. Given that a student has a semester average of 90, what is the probability that he or she is a STEM major?

4. In problem 3, what is the probability that a student is a STEM major, given that his or her semester average is greater than or equal to 90?

5. In Example 6.4 show by integration that E(X) = E(Y) = 0 and E(XY) = 0.

6. Suppose that (X,Y ) has a bivariate normal distribution with parameters µx = 0; µy = 0; σx = 1; σy = 1; ρ = −0.7. Plot E(Y |X = x) as a function of x. On the same axes plot E(X|Y = y) as a function of y.

7. Suppose that U and V are independent Unif(0, 1) random variables. Let W = U + V. Find the cumulative distribution of W, Pr(W ≤ w), for 0 ≤ w ≤ 2. (Hint: Use (6.5) and consider 0 ≤ w ≤ 1 and 1 ≤ w ≤ 2 separately.)

8. Find the density function of the random variable W in exercise 7 and sketch its graph.

6.4 Sums of Random Variables

Let X1 and X2 be jointly distributed random variables. If their expected values exist, so does E(X1 + X2) and

E(X1 +X2) = E(X1) + E(X2) (6.13)

This can be extended to the sum of any number of jointly distributed variables. We calculate the variance of X1 +X2 as follows:

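$$\begin{aligned} \mathrm{var}(X_1 + X_2) &= \mathrm{cov}(X_1 + X_2, X_1 + X_2) \\ &= \mathrm{cov}(X_1, X_1 + X_2) + \mathrm{cov}(X_2, X_1 + X_2) \\ &= \mathrm{cov}(X_1, X_1) + \mathrm{cov}(X_1, X_2) + \mathrm{cov}(X_2, X_1) + \mathrm{cov}(X_2, X_2) \\ &= \mathrm{var}(X_1) + 2\,\mathrm{cov}(X_1, X_2) + \mathrm{var}(X_2) \end{aligned}$$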

We used three general properties of the covariance function above. One is that cov(X1, X2) = cov(X2, X1). Another is that var(X) = cov(X,X). Finally, we used the fact that the covariance function is linear in each of its arguments. These are all easy to establish from the definition of the covariance.
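
Because the sample variance and sample covariance computed by R's `var` and `cov` obey the same symmetry and bilinearity, the identity for var(X1 + X2) also holds exactly for data, which gives a quick check. The sketch below uses made-up data and illustrative names.

```r
set.seed(4)   # arbitrary seed
x1 <- rnorm(50)
x2 <- 0.5 * x1 + runif(50)   # deliberately correlated with x1

lhs <- var(x1 + x2)
rhs <- var(x1) + 2 * cov(x1, x2) + var(x2)

print(all.equal(lhs, rhs))   # TRUE up to floating-point rounding
```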

This result can be extended to more than two jointly distributed random variables by induction.

$$\mathrm{var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n}\mathrm{var}(X_i) + 2\sum_{i=2}^{n}\sum_{j=1}^{i-1}\mathrm{cov}(X_i, X_j). \tag{6.14}$$

If the random variables Xi are uncorrelated, then we have

Theorem 6.2. If X1, X2, · · · , Xn are uncorrelated (in particular, if they are independent),

$$\mathrm{var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n}\mathrm{var}(X_i).$$

Next, we consider the sum of independent normal random variables. If X1 ∼ Norm(µ1, σ1) and X2 ∼ Norm(µ2, σ2) are independent, then we know from (6.13) and the preceding theorem that the mean of X1 + X2 is µ1 + µ2 and the standard deviation of X1 + X2 is $\sqrt{\sigma_1^2 + \sigma_2^2}$. We will show that X1 + X2 is normally distributed. We have

X1 = µ1 + σ1Z1,

X2 = µ2 + σ2Z2,

where Z1 and Z2 are independent standard normal random variables. The joint density function of (Z1, Z2) is bivariate normal with correlation ρ = 0. Its formula is

$$f(z_1, z_2) = \frac{1}{2\pi}\, e^{-\frac{1}{2}(z_1^2 + z_2^2)},$$

and its level curves are circles centered at the origin.

[Figure: circular level curves of the joint density of (Z1, Z2), contour levels 0.02 to 0.14; horizontal axis z1 and vertical axis z2, each from −2 to 2.]

The joint density function depends only on the squared distance $z_1^2 + z_2^2$ of the point (z1, z2) from the origin and not on the orientation of the axes. If the axes were rotated about the origin, the formula for the density function in the new coordinate system would be the same. A rotation of the axes through an angle θ corresponds to a transformation of (Z1, Z2) of the form

$$\begin{aligned} Z_1' &= Z_1\cos\theta + Z_2\sin\theta \\ Z_2' &= -Z_1\sin\theta + Z_2\cos\theta. \end{aligned} \tag{6.15}$$

So, the distribution of $(Z_1', Z_2')$ is the same as that of (Z1, Z2) and in particular, $Z_1'$ is standard normal, like Z1. Let us choose θ so that

$$\cos\theta = \frac{\sigma_1}{\sqrt{\sigma_1^2 + \sigma_2^2}}, \qquad \sin\theta = \frac{\sigma_2}{\sqrt{\sigma_1^2 + \sigma_2^2}}.$$

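We now have that

$$\frac{\sigma_1}{\sqrt{\sigma_1^2 + \sigma_2^2}}\, Z_1 + \frac{\sigma_2}{\sqrt{\sigma_1^2 + \sigma_2^2}}\, Z_2$$

is standard normal. Multiplying by $\sqrt{\sigma_1^2 + \sigma_2^2}$,

$$\sigma_1 Z_1 + \sigma_2 Z_2 \sim \mathrm{Norm}\left(0, \sqrt{\sigma_1^2 + \sigma_2^2}\right),$$

and

$$\mu_1 + \mu_2 + \sigma_1 Z_1 + \sigma_2 Z_2 \sim \mathrm{Norm}\left(\mu_1 + \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2}\right).$$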

Thus, X1 +X2 is normally distributed. By induction, we can easily extend this result to sums of any number of independent normally distributed random variables.

Theorem 6.3. Let Xi, i = 1, · · · , n be independent normally distributed random variables Xi ∼ Norm(µi, σi). Then X1 + X2 + · · · + Xn is normally distributed with mean µ1 + µ2 + · · · + µn and variance $\sigma_1^2 + \cdots + \sigma_n^2$.

We shall not prove it, but it is true that all linear combinations of two random variables with a bivariate normal distribution are normally distributed.

Theorem 6.4. Let X1 and X2 have a bivariate normal distribution with parameters µ1, µ2, σ1, σ2 and ρ. Let a, b1, and b2 be constants and let Y = a + b1X1 + b2X2. Then Y is normally distributed with mean a + b1µ1 + b2µ2 and variance $b_1^2\sigma_1^2 + b_2^2\sigma_2^2 + 2b_1b_2\rho\sigma_1\sigma_2$.
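
The mean and variance in Theorem 6.4 can be checked by simulation. In the sketch below a bivariate normal pair is built from two independent standard normals, which is one standard construction consistent with (6.9) and (6.10); all parameter values, coefficients and names are illustrative assumptions.

```r
set.seed(5)   # arbitrary seed
n <- 200000
mu1 <- 1; mu2 <- 3; s1 <- 2; s2 <- 1; rho <- -0.5   # illustrative parameters
a <- 4; b1 <- 2; b2 <- -1                            # illustrative coefficients

# Bivariate normal pair built from two independent standard normals
z1 <- rnorm(n)
z2 <- rnorm(n)
x1 <- mu1 + s1 * z1
x2 <- mu2 + s2 * (rho * z1 + sqrt(1 - rho^2) * z2)

y <- a + b1 * x1 + b2 * x2

thm_mean <- a + b1 * mu1 + b2 * mu2
thm_var  <- b1^2 * s1^2 + b2^2 * s2^2 + 2 * b1 * b2 * rho * s1 * s2

print(c(sim_mean = mean(y), thm_mean = thm_mean))
print(c(sim_var = var(y), thm_var = thm_var))
```

A normal quantile plot of `y` (for example `qqnorm(y); qqline(y)`) gives a visual check of the normality claim.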

6.4.1 Simulating Random Samples

Definition 6.2. A random sample of size n from a distribution (generically denoted by F ) is a sequence X1, X2, · · · , Xn of independent random variables whose marginal distributions are all F .

Random samples come from replicating an experiment with an associated random variable X having distribution F and letting Xi be the value of X on the i-th replication.

Computer simulation of random experiments is a very important tool in probability and statistics. Simulation of an experiment involves generating a sequence of numbers that behave like values of a random sample, even though the mechanism for generating them is deterministic. Very good random number generators are available even on hand-held calculators. Below we describe the most basic method of simulation.

Recall that if X is a random variable with cumulative distribution F , the pth quantile of F , or of X, is

$$F^{-1}(p) = q(X, p) = \min\{x \mid F(x) \ge p\}$$

for 0 < p < 1. If we can calculate the quantile function and can simulate an observation from the uniform distribution Unif(0, 1), then we can simulate an observation of X with distribution F .

Theorem 6.5. Let F be a given cumulative distribution and let U ∼ Unif(0, 1). Then the random variable $X = F^{-1}(U)$ has distribution F.

Proof: Let x be an arbitrary real number. It follows from the definition of the quantile function that $F^{-1}(p) \le x$ if and only if $F(x) \ge p$. Therefore,

$$\Pr(X \le x) = \Pr(F^{-1}(U) \le x) = \Pr(F(x) \ge U).$$

Since U ∼ Unif(0, 1) and F (x) ∈ [0, 1],

Pr(U ≤ F (x)) = F (x).

Thus Pr(X ≤ x) = F (x) for all x and F is the cumulative distribution of X.

To simulate a random sample of size n from F, simulate a random sample U1, U2, · · · , Un from Unif(0, 1) and let $X_i = F^{-1}(U_i)$, i = 1, · · · , n. Almost any calculator will simulate random samples from Unif(0, 1). Simply press the random number key n times. In R, the "runif" function will generate any number of uniform samples. Then to transform these into random samples from the exponential distribution with mean 1, we recall from the preceding chapter that the quantile function for Exp(λ = 1) is − log(1 − p).

> us=runif(10)

> xs=-log(1-us)

> data.frame(us,xs)

us xs

1 0.98573318 4.24981876

2 0.09089919 0.09529929

3 0.52470614 0.74382202

4 0.78792302 1.55080598

5 0.16579396 0.18127486

6 0.91731973 2.49277432

7 0.83268964 1.78790473

8 0.18240977 0.20139401

9 0.44970750 0.59730533

10 0.54134677 0.77946084
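
As a cross-check on the hand-written quantile function, the same uniform values can be passed to R's built-in exponential quantile function `qexp`. The sketch below uses an arbitrary seed and a larger sample; the names `xs_manual` and `xs_qexp` are illustrative.

```r
set.seed(6)   # arbitrary seed
us <- runif(1000)

xs_manual <- -log(1 - us)   # quantile function of Exp(1), written out
xs_qexp   <- qexp(us)       # R's built-in exponential quantile function (rate = 1)

print(all.equal(xs_manual, xs_qexp))   # the two agree up to rounding
print(mean(xs_manual))                 # should be near the distribution mean, 1
```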

Although in theory this procedure is universally applicable, it is not necessarily the most efficient for generating random samples from given distributions. Each of the common families of distributions has a random sample simulator in R. For example, for the uniform distributions, it is "runif()", for exponential distributions "rexp()", for gamma distributions "rgamma()", for normal "rnorm()". The sample size n is a required argument and there are other required or optional arguments for selecting one particular member of the given family. We will illustrate this by generating a sample of size 100 from the gamma distribution with shape parameter α = 2. We will plot a density histogram of the samples, and then superimpose the ideal density function.

> xs=rgamma(100,shape=2)

> hist(xs,freq=F)
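
The two commands above draw the histogram only. One possible way to superimpose the ideal density, offered as a sketch rather than as the commands used for the figure, is R's `curve` function with `add = TRUE`. The seed and the colour are arbitrary choices.

```r
set.seed(7)   # arbitrary seed
xs <- rgamma(100, shape = 2)

h <- hist(xs, freq = FALSE)                            # density histogram
curve(dgamma(x, shape = 2), add = TRUE, col = "red")   # Gamma(shape = 2, rate = 1) density
```

With `freq = FALSE` the bar areas sum to 1, so the histogram and the density curve are on the same vertical scale.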

[Figure: "Histogram of xs", a density histogram of the simulated gamma sample; horizontal axis xs, vertical axis Density.]

6.5 Sample Sums and the Central Limit Theorem

Let X1, X2, · · · , Xn be a random sample from a distribution F. We are interested in the distributions of the sample sum $T_n = X_1 + \cdots + X_n$ and the sample average $\bar{X}_n = T_n/n$.

We assume that F has mean µ and standard deviation σ. From (6.13) and Theorem 6.2 we know that $E(T_n) = n\mu$ and $\mathrm{var}(T_n) = n\sigma^2$. Hence, $\mathrm{sd}(T_n) = \sqrt{n}\,\sigma$. It then follows that $E(\bar{X}_n) = \mu$ and $\mathrm{sd}(\bar{X}_n) = \sigma/\sqrt{n}$. The standardized values of $T_n$ and $\bar{X}_n$ are equal by simple algebra.

$$Z_n = \frac{T_n - n\mu}{\sigma\sqrt{n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \tag{6.16}$$

If the Xi are normally distributed, then by Theorem 6.3, Zn is standard normal.

Example 6.5. The time for each of the four legs of a 400 meter relay race is normally distributed with mean 10 seconds and standard deviation 1.5 seconds. What is the probability that the race is run in less than 36 seconds?

Solution:

$$\Pr(T_4 < 36) = \Pr\left(\frac{T_4 - 4(10)}{1.5\sqrt{4}} < \frac{36 - 4(10)}{1.5\sqrt{4}}\right) = \Pr(Z_4 < -1.333) = \Phi(-1.333) = 0.0913$$
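
The same probability can be computed in R, either directly from the distribution of the sum or through the standardized value. This is a sketch with illustrative variable names; like the solution above, it treats the four leg times as independent.

```r
legs <- 4; mu <- 10; sigma <- 1.5

# T4 ~ Norm(legs * mu, sigma * sqrt(legs)) by Theorem 6.3
p_direct <- pnorm(36, mean = legs * mu, sd = sigma * sqrt(legs))

# Same probability via the standardized value Z4 of (6.16)
z <- (36 - legs * mu) / (sigma * sqrt(legs))
p_standardized <- pnorm(z)

print(c(direct = p_direct, standardized = p_standardized))
```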

Theorem 6.6. The Central Limit Theorem

Let X1, X2, · · · , Xn be a random sample from a distribution with mean µ and standard deviation σ. Let $T_n$ and $\bar{X}_n$ be the sample sum and sample average and let $Z_n$ be their standardized value as defined in (6.16). Then

$$\lim_{n\to\infty} \Pr(Z_n \le z) = \Phi(z)$$

where Φ is the standard normal cumulative distribution.

The remarkable thing about the central limit theorem is that its conclusion holds for any distribution whatsoever, as long as it has a positive, finite variance. An informal way of stating it is that the sample sum is approximately (or asymptotically) normal with mean nµ and standard deviation $\sqrt{n}\,\sigma$. The sample average is approximately normal with mean µ and standard deviation $\sigma/\sqrt{n}$. A natural question is how large the sample size must be to have a good approximation. The folklore answer is that n ≥ 30 is usually sufficient, but the answer really depends on how nearly normal the underlying distribution is. We will explore that question by simulating random samples from some common distributions and checking to see if their averages appear to be normally distributed.
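
Before turning to plots, a purely numerical check of the theorem is possible: simulate many values of Zn and compare the proportion falling at or below a few points z with Φ(z). The sketch below does this for samples of size 30 from the exponential distribution with rate 1, whose mean and standard deviation are both 1. The distribution, seed, number of replications and names are illustrative choices.

```r
set.seed(8)      # arbitrary seed
n <- 30          # sample size
reps <- 20000    # number of simulated samples
mu <- 1; sigma <- 1   # mean and standard deviation of Exp(rate = 1)

# One standardized sample average per replication, as in (6.16)
zn <- replicate(reps, sqrt(n) * (mean(rexp(n)) - mu) / sigma)

z <- c(-1, 0, 1, 2)
sim <- sapply(z, function(q) mean(zn <= q))
print(rbind(z = z, simulated = sim, normal = pnorm(z)))
```

The simulated proportions should be close to, though not identical with, the normal values; the remaining gap reflects the skewness of the exponential distribution at this sample size.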

Instead of looking at a histogram of the averages, we will use a better graphical indication of normality called a normal quantile plot. A normal quantile plot is a plot of theoretical quantiles of the standard normal distribution on the horizontal axis and the sample quantiles on the vertical axis. If the data comes from a normal distribution, this plot should be close to a straight line. Any patterned departure from straightness is an indication of non-normality. The following is a normal quantile plot of a sample of size 30 from a normal distribution.

> xs=rnorm(30,mean=10,sd=3)

> qqnorm(xs)

> qqline(xs)

[Figure: "Normal Q-Q Plot" of xs; horizontal axis Theoretical Quantiles, vertical axis Sample Quantiles.]

The two left-hand figures below are a normal quantile plot and a histogram of a sample of size 50 from the gamma distribution Gamma(shape = 1/2, rate = 1). As the figures show, this is a very non-normal distribution. The two right hand figures show the normal quantile plot and histogram obtained from 50 replications of the experiment of sampling 30 from the gamma distribution and calculating the 50 sample means. The figures appear to confirm that a sample of size 30 is borderline sufficient for the approximate normality of the sample mean, even from this very skewed distribution.

> par(mfrow=c(2,2))

> xs=rgamma(50,shape=0.5)

> qqnorm(xs); qqline(xs)

> xmeans=replicate(50,mean(rgamma(30,shape=0.5)))

> qqnorm(xmeans); qqline(xmeans)

> hist(xs,freq=F)

> hist(xmeans,freq=F)

[Figure: four panels. Top row: normal Q-Q plots (Theoretical Quantiles versus Sample Quantiles) of xs on the left and of xmeans on the right. Bottom row: "Histogram of xs" on the left and "Histogram of xmeans" on the right, each with vertical axis Density.]

Next, we investigate how well the central limit theorem applies to samples of size 30 from the Bernoulli distribution Binom(n = 1, p = 0.25). Since X = 0 or X = 1, a normal quantile plot and a histogram of the data would have no meaning, but we can look at them for sample averages.

> par(mfrow=c(2,1))

> xmeans=replicate(50,mean(rbinom(30,1,0.25)))

> qqnorm(xmeans); qqline(xmeans)

> hist(xmeans,freq=F)

[Figure: two panels. Top: normal Q-Q plot of xmeans (Theoretical Quantiles versus Sample Quantiles). Bottom: "Histogram of xmeans" with vertical axis Density.]

Again, a sample size of 30 seems marginally acceptable. However, this is affected by the success probability. A value of p nearer 1/2 would give a different picture.

6.5.1 Exercises

1. John’s projected annual income after graduation is normally distributed with mean \$80,000 and standard deviation \$10,000. Sally’s is normally distributed with mean \$85,000 and standard deviation \$12,000. If their incomes are independent, what is the probability that their combined income exceeds \$180,000?

2. What is the standard deviation of their combined income if their incomes have a bivariate normal distribution and the correlation between them is 0.4? What is it if the correlation is -0.4?

3. What is the probability that their combined income exceeds \$180,000 when the correlation is 0.4? -0.4?

4. Suppose that John’s and Sally’s incomes have a bivariate normal distribution with a correlation of 0.4. Given that Sally’s income is \$95,000, what is the probability that John’s income is greater than \$100,000? What is the unconditional probability?

5. Back when people still wrote "checks" and entered the amounts in a "check register", it was common to round off the amounts entered to the nearest dollar. Assume that the roundoff errors are uniformly distributed between -\$0.50 and \$0.50 and that they are independent. If 30 checks are written, what is the probability that the sum of the roundoff errors is less than \$5.00 in absolute value?

6. Do 50 replications of the experiment of adding the roundoff errors for 30 checks. Use normal quantile plots and histograms to investigate the normality of the sum of 30 roundoff errors.

7. The logistic distribution has cumulative distribution function

$$F(x) = \frac{1}{1 + e^{-x}}, \qquad -\infty < x < \infty.$$

Find the quantile function of this distribution. Simulate a random sample of size 100 from the logistic distribution. Make a histogram and a normal quantile plot of the simulated data.

6.6 Other Distributions Associated with Normal Sampling

6.6.1 Chi Square Distributions

Definition 6.3. Let ν be a positive integer. The gamma distribution with shape parameter α = ν/2 and scale parameter β = 2 is called the chi square distribution with ν degrees of freedom. We write X ∼ Chisq(df = ν) to say that X has such a distribution.

The density function of Chisq(df = ν) is

$$f(x) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{(\nu/2)-1} e^{-x/2} \tag{6.17}$$

A key fact about gamma distributions is

Theorem 6.7. If X1 ∼ Gamma(shape = α1, scale = β) and X2 ∼ Gamma(α2, β) are independent, then X1 +X2 ∼ Gamma(shape = α1 + α2, scale = β).

This theorem can be established with the convolution formula for the density of the sum of independent random variables. We will omit the proof. Notice that the theorem assumes that X1 and X2 have the same scale parameter. Since all chi square distributions have scale 2, it follows that the sum of independent chi square random variables has a chi square distribution.

Corollary 6.1. If X1, X2, · · · , Xn are independent and each Xi ∼ Chisq(df = νi), then X1 + X2 + · · ·+Xn ∼ Chisq(df = ν1 + ν2 + · · ·+ νn).


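Let $W = Z^2$, where Z ∼ Norm(0, 1). The density of W can be calculated as follows.

$$
\begin{aligned}
\Pr(W \le w) &= \Pr(-\sqrt{w} \le Z \le \sqrt{w}) \\
&= 2 \times \Pr(0 \le Z \le \sqrt{w}) \\
&= \frac{2}{\sqrt{2\pi}} \int_0^{\sqrt{w}} e^{-z^2/2}\,dz
\end{aligned}
$$

$$
\begin{aligned}
f(w) &= \frac{d}{dw}\,\frac{2}{\sqrt{2\pi}} \int_0^{\sqrt{w}} e^{-z^2/2}\,dz \\
&= \frac{1}{2^{1/2}\sqrt{\pi}}\, w^{-1/2} e^{-w/2}
\end{aligned}
$$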

This is the density of Gamma(shape = 1/2, scale = 2) = Chisq(df = 1).

Corollary 6.2. If Z1, Z2, · · · , Zn are independent Norm(0, 1) variables, then

$$Z_1^2 + Z_2^2 + \cdots + Z_n^2 \sim \mathrm{Chisq}(df = n).$$
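Corollary 6.2 can be checked informally by simulation. The sketch below compares simulated quantiles of a sum of squared standard normals with the corresponding `qchisq()` values; the seed, the choice n = 5, and the number of replications are illustrative assumptions.

```r
# Simulate W = Z1^2 + ... + Zn^2 and compare with Chisq(df = n).
set.seed(2)                          # arbitrary seed
n <- 5                               # number of squared normals (illustrative)
nrep <- 10000                        # number of simulated sums
w <- replicate(nrep, sum(rnorm(n)^2))

probs <- c(0.1, 0.5, 0.9)
rbind(simulated   = quantile(w, probs),
      theoretical = qchisq(probs, df = n))
```

The two rows should agree to within simulation error, which shrinks as `nrep` grows.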

Tables of Chi Square Distributions

The table below is a table of lower percentage points of chi square distributions with degrees of freedom given by the row headings. The column headings are the probabilities Pr(W ≤ w), where W ∼ Chisq(df = row index) and w is the table entry. For example, the .025 quantile of Chisq(df = 10) is 3.246973.

df 0.01 0.025 0.05 0.1

1 0.000157 0.000982 0.003932 0.015791

2 0.020101 0.050636 0.102587 0.210721

3 0.114832 0.215795 0.351846 0.584374

4 0.297109 0.484419 0.710723 1.063623

5 0.554298 0.831212 1.145476 1.610308

6 0.872090 1.237344 1.635383 2.204131

7 1.239042 1.689869 2.167350 2.833107

8 1.646497 2.179731 2.732637 3.489539

9 2.087901 2.700389 3.325113 4.168159

10 2.558212 3.246973 3.940299 4.865182

11 3.053484 3.815748 4.574813 5.577785

12 3.570569 4.403789 5.226029 6.303796

13 4.106915 5.008751 5.891864 7.041505

14 4.660425 5.628726 6.570631 7.789534

15 5.229349 6.262138 7.260944 8.546756

16 5.812212 6.907664 7.961646 9.312236

17 6.407760 7.564186 8.671760 10.085186

18 7.014911 8.230746 9.390455 10.864936

19 7.632730 8.906516 10.117013 11.650910

20 8.260398 9.590777 10.850811 12.442609

21 8.897198 10.282898 11.591305 13.239598


22 9.542492 10.982321 12.338015 14.041493

23 10.195716 11.688552 13.090514 14.847956

24 10.856361 12.401150 13.848425 15.658684

25 11.523975 13.119720 14.611408 16.473408

26 12.198147 13.843905 15.379157 17.291885

27 12.878504 14.573383 16.151396 18.113896

28 13.564710 15.307861 16.927875 18.939242

29 14.256455 16.047072 17.708366 19.767744

30 14.953457 16.790772 18.492661 20.599235

31 15.655456 17.538739 19.280569 21.433565

32 16.362216 18.290765 20.071913 22.270594

33 17.073514 19.046662 20.866534 23.110197

34 17.789147 19.806253 21.664281 23.952253

35 18.508926 20.569377 22.465015 24.796655

36 19.232676 21.335882 23.268609 25.643300

37 19.960232 22.105627 24.074943 26.492094

38 20.691442 22.878482 24.883904 27.342950

39 21.426163 23.654325 25.695390 28.195785

40 22.164261 24.433039 26.509303 29.050523

The next table is a table of upper percentage points of the chi square distributions with degrees of freedom given by the row headings. The column headings are the probabilities Pr(W > w), where W ∼ Chisq(df = row index) and w is the table entry. For example, the .975 quantile of Chisq(df = 10) is 20.483. Quantiles of the chi square distributions are given in R by the "qchisq()" function.

> qchisq(.975,df=30)

[1] 46.97924

df 0.1 0.05 0.025 0.01

1 2.705543 3.841459 5.023886 6.634897

2 4.605170 5.991465 7.377759 9.210340

3 6.251389 7.814728 9.348404 11.344867

4 7.779440 9.487729 11.143287 13.276704

5 9.236357 11.070498 12.832502 15.086272

6 10.644641 12.591587 14.449375 16.811894

7 12.017037 14.067140 16.012764 18.475307

8 13.361566 15.507313 17.534546 20.090235

9 14.683657 16.918978 19.022768 21.665994

10 15.987179 18.307038 20.483177 23.209251

11 17.275009 19.675138 21.920049 24.724970

12 18.549348 21.026070 23.336664 26.216967

13 19.811929 22.362032 24.735605 27.688250

14 21.064144 23.684791 26.118948 29.141238

15 22.307130 24.995790 27.488393 30.577914

16 23.541829 26.296228 28.845351 31.999927

17 24.769035 27.587112 30.191009 33.408664


18 25.989423 28.869299 31.526378 34.805306

19 27.203571 30.143527 32.852327 36.190869

20 28.411981 31.410433 34.169607 37.566235

21 29.615089 32.670573 35.478876 38.932173

22 30.813282 33.924438 36.780712 40.289360

23 32.006900 35.172462 38.075627 41.638398

24 33.196244 36.415029 39.364077 42.979820

25 34.381587 37.652484 40.646469 44.314105

26 35.563171 38.885139 41.923170 45.641683

27 36.741217 40.113272 43.194511 46.962942

28 37.915923 41.337138 44.460792 48.278236

29 39.087470 42.556968 45.722286 49.587884

30 40.256024 43.772972 46.979242 50.892181

31 41.421736 44.985343 48.231890 52.191395

32 42.584745 46.194260 49.480438 53.485772

33 43.745180 47.399884 50.725080 54.775540

34 44.903158 48.602367 51.965995 56.060909

35 46.058788 49.801850 53.203349 57.342073

36 47.212174 50.998460 54.437294 58.619215

37 48.363408 52.192320 55.667973 59.892500

38 49.512580 53.383541 56.895521 61.162087

39 50.659770 54.572228 58.120060 62.428121

40 51.805057 55.758479 59.341707 63.690740
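Rows of both tables can be regenerated with `qchisq()`. The sketch below rebuilds three rows of each; the selection of rows and the object names `lower` and `upper` are illustrative assumptions.

```r
# Rebuild a few rows of the lower and upper percentage point tables.
df <- c(1, 10, 30)                       # rows to reproduce
p  <- c(0.01, 0.025, 0.05, 0.1)          # column headings of the lower table

lower <- sapply(p, function(a) qchisq(a, df = df))
upper <- sapply(rev(p), function(a) qchisq(a, df = df, lower.tail = FALSE))
dimnames(lower) <- list(df, p)           # rows are df, columns are Pr(W <= w)
dimnames(upper) <- list(df, rev(p))      # rows are df, columns are Pr(W > w)

round(lower, 6)
round(upper, 6)
```

An upper percentage point with right tail probability a is the same number as the 1 - a quantile, so `qchisq(0.025, df = 30, lower.tail = FALSE)` agrees with `qchisq(.975, df = 30)`.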

6.6.2 Student t Distributions

Definition 6.4. Let Z ∼ Norm(0, 1) and W ∼ Chisq(df = ν) be independent. The distribution of

$$T = \frac{Z}{\sqrt{W/\nu}}$$

is called the student-t distribution with ν degrees of freedom. We indicate that T has this distribution by writing T ∼ t(df = ν).

The graph of the student-t density function is bell-shaped and symmetric about 0 like the normal distribution, but heavier in the tails. As ν →∞ it converges to the standard normal density function, so for large values of ν there is very little difference between student-t and standard normal.

Tables of Student t Distributions

Tables of upper percentage points of the student-t distribution are available in most statistics textbooks. The one below was produced in R. The row headings 1 through 40 are the numbers of degrees of freedom. The column headings are the right tail probabilities Pr(T > t) and the table entries are the values of t for those probabilities. For example, the 99th percentile of the student-t distribution with 30 degrees of freedom is 2.457. Quantiles for the student-t distributions are given in R by the "qt()" function.

> qt(.99,df=30)

[1] 2.457262

df 0.1 0.05 0.025 0.01

1 3.077684 6.313752 12.706205 31.820516

2 1.885618 2.919986 4.302653 6.964557

3 1.637744 2.353363 3.182446 4.540703

4 1.533206 2.131847 2.776445 3.746947

5 1.475884 2.015048 2.570582 3.364930

6 1.439756 1.943180 2.446912 3.142668

7 1.414924 1.894579 2.364624 2.997952

8 1.396815 1.859548 2.306004 2.896459

9 1.383029 1.833113 2.262157 2.821438

10 1.372184 1.812461 2.228139 2.763769

11 1.363430 1.795885 2.200985 2.718079

12 1.356217 1.782288 2.178813 2.680998

13 1.350171 1.770933 2.160369 2.650309

14 1.345030 1.761310 2.144787 2.624494

15 1.340606 1.753050 2.131450 2.602480

16 1.336757 1.745884 2.119905 2.583487

17 1.333379 1.739607 2.109816 2.566934

18 1.330391 1.734064 2.100922 2.552380

19 1.327728 1.729133 2.093024 2.539483

20 1.325341 1.724718 2.085963 2.527977

21 1.323188 1.720743 2.079614 2.517648

22 1.321237 1.717144 2.073873 2.508325

23 1.319460 1.713872 2.068658 2.499867

24 1.317836 1.710882 2.063899 2.492159

25 1.316345 1.708141 2.059539 2.485107

26 1.314972 1.705618 2.055529 2.478630

27 1.313703 1.703288 2.051831 2.472660

28 1.312527 1.701131 2.048407 2.467140

29 1.311434 1.699127 2.045230 2.462021

30 1.310415 1.697261 2.042272 2.457262

31 1.309464 1.695519 2.039513 2.452824

32 1.308573 1.693889 2.036933 2.448678

33 1.307737 1.692360 2.034515 2.444794

34 1.306952 1.690924 2.032245 2.441150

35 1.306212 1.689572 2.030108 2.437723

36 1.305514 1.688298 2.028094 2.434494

37 1.304854 1.687094 2.026192 2.431447

38 1.304230 1.685954 2.024394 2.428568

39 1.303639 1.684875 2.022691 2.425841

40 1.303077 1.683851 2.021075 2.423257
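The convergence of the student-t distribution to the standard normal can be seen directly in the quantiles. This short sketch compares the upper 2.5% points of several student-t distributions with the standard normal value; the list of degrees of freedom is an arbitrary illustrative choice.

```r
# Upper 2.5% points of t(df) compared with the standard normal value.
df <- c(5, 10, 30, 100, 1000)
t_pts <- qt(0.975, df = df)     # 0.975 quantiles of the student-t distributions
z_pt  <- qnorm(0.975)           # 0.975 quantile of Norm(0, 1)

data.frame(df = df, t = round(t_pts, 4), excess = round(t_pts - z_pt, 4))
```

The `excess` column is positive, reflecting the heavier tails, and decreases toward zero as the degrees of freedom grow.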


6.6.3 The Joint Distribution of the Sample Mean and Variance

Theorem 6.8. Let X1, X2, · · · , Xn be a random sample from the normal distribution Norm(µ, σ). Let $\bar{X}$ be the sample average and $S^2$ the sample variance. Then

1) $\bar{X} \sim \mathrm{Norm}(\mu, \sigma/\sqrt{n})$;

2) $(n-1)S^2/\sigma^2 \sim \mathrm{Chisq}(df = n-1)$;

3) $\bar{X}$ and $S^2$ are independent random variables;

4) $T = \sqrt{n}(\bar{X}-\mu)/S$ has a student-t distribution with n − 1 degrees of freedom.

Proof: Conclusion (1) follows from results of the previous section. Let us accept (2) and (3) for the time being. To prove (4), let $Z = \sqrt{n}(\bar{X}-\mu)/\sigma$ and let $W = (n-1)S^2/\sigma^2$. Then Z and W satisfy the conditions of the definition of a student-t random variable with ν = n − 1 and

$$T = \frac{Z}{\sqrt{W/\nu}} = \frac{\sqrt{n}(\bar{X}-\mu)}{S}.$$

We will prove (2) and (3) only for the case n = 2. The proof can be generalized to any n > 2, but it requires advanced linear algebra. We have

$$X_1 = \mu + \sigma Z_1, \qquad X_2 = \mu + \sigma Z_2,$$

with Z1 and Z2 independent Norm(0, 1) variables. Then

$$\bar{X} = \mu + \sigma\bar{Z},$$

and

$$\bar{Z} = \frac{Z_1 + Z_2}{2}.$$

The sample variance $S_X^2$ of {X1, X2} is $\sigma^2$ times the sample variance $S_Z^2$ of {Z1, Z2}. Furthermore, it is easy to show that

$$S_Z^2 = \frac{(Z_1 - Z_2)^2}{2}.$$

Now apply the transformations (6.15) with θ = π/4 to (Z1, Z2). The transformed variables

$$Z_1' = \frac{Z_1 + Z_2}{\sqrt{2}}$$

and

$$Z_2' = \frac{Z_2 - Z_1}{\sqrt{2}}$$

are independent and standard normal. Thus,

$$\frac{(n-1)S_X^2}{\sigma^2} = S_Z^2 = (Z_2')^2 \sim \mathrm{Chisq}(df = 1 = n-1)$$

and

$$\frac{\sqrt{2}(\bar{X}-\mu)}{\sigma} = Z_1' \sim \mathrm{Norm}(0, 1)$$

independently.
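Although the proof above covers only n = 2, conclusions (2) and (3) of Theorem 6.8 can be checked by simulation for other sample sizes. The sketch below uses n = 8 with illustrative values of µ and σ; the seed and all object names are assumptions made for this example.

```r
# Simulation check of Theorem 6.8 for normal samples of size n.
set.seed(3)                               # arbitrary seed
n <- 8; mu <- 10; sigma <- 2              # illustrative values
samp  <- matrix(rnorm(5000 * n, mu, sigma), ncol = n)   # one sample per row
xbars <- rowMeans(samp)                   # 5000 sample means
vars  <- apply(samp, 1, var)              # 5000 sample variances

w <- (n - 1) * vars / sigma^2             # should behave like Chisq(df = n - 1)
quantile(w, c(0.05, 0.5, 0.95))
qchisq(c(0.05, 0.5, 0.95), df = n - 1)

cor(xbars, vars)                          # near 0, consistent with independence
```

A correlation near zero is consistent with independence but does not prove it; a scatterplot of `xbars` against `vars` gives a fuller picture.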

6.6.4 Exercises

1. Use R’s "qchisq()" function and also the table provided above to find (a) the 95th percentile of Chisq(df = 24), (b) the 90th percentile of Chisq(df = 10).

2. Use R’s "qt()" function and also the table provided above to find (a) the 99th percentile of student-t with df = 15, (b) the 95th percentile of student-t with df = 30.

3. Let W ∼ Chisq(df = ν). From known facts about gamma distributions, find E(W) and var(W).

4. Let W ∼ Chisq(df = ν). Using the central limit theorem and Corollary 6.2, show that

$$\frac{W - \nu}{\sqrt{2\nu}}$$

approaches standard normal as ν → ∞. Confirm this by using R’s "rchisq()" function with a large number of degrees of freedom to generate a random sample of size 100 from the chi square distribution. Then make a normal quantile plot and a histogram of the simulated sample values.


Chapter 7

Statistical Inference for a Single Population

7.1 Introduction

In this chapter we begin the study of statistical inference. This is the science of inferring characteristics of an entire population from the information contained in a sample from that population. Statistical inferences are not just bald assertions about population characteristics. They must be accompanied by statements quantifying the probable accuracy of those assertions. Thus, probability is an essential ingredient of statistical inference. The probability statements accompanying inferences are derived from prior knowledge or experience, knowledge particular to the subject at hand, and theoretical assumptions concerning the processes that produce the values of population variables. Of course the probability statements must be internally consistent and satisfy the mathematical properties of probability as outlined in Chapter 3.

Statistical inference is divided into two broad categories, estimation and hypothesis testing. They are not mutually exclusive. We turn to estimation first.

7.2 Estimation of Parameters

A random variable X, whether it comes from sampling from a finite population or from an idealized random experiment, has a distribution which depends on one or more unknown parameters. Often we assume that the distribution is from a family whose members are completely determined by the values of a few parameters, for example, the family of normal distributions, whose members are determined by their means and standard deviations. In other cases, we make only minimal assumptions about the distributions, such as the existence of a variance.

7.2.1 Estimators

Let X1, X2, · · · , Xn be a random sample from a distribution that depends in part on an unknown parameter θ. An estimator of θ is a function θ̂(X1, X2, · · · , Xn) of the sample values whose value is taken as an estimate of the value of θ. θ̂ must be computable from the data X1, · · · , Xn alone, so it cannot involve θ or other unknown parameters in its formula or in any way except through the data values. Also, observe that θ̂(X1, · · · , Xn) = θ̂ is a random variable, so it has a distribution derived from the distribution of X. Therefore, the distribution of θ̂ also depends on the unknown value of θ. Its distribution has a median, presumably a mean and a variance, and other general features of distributions.

Examples:

1. The sample mean $\bar{X}$ is an estimator of the population mean µ.

2. The sample variance $S^2$ is an estimator of the population variance $\sigma^2$.

3. The sample proportion of successes p̂ = Y/n, where Y ∼ Binom(n, p), is an estimator of the population proportion of successes p.

4. Sample quantiles are estimators of quantiles of a distribution.

7.2.2 Desirable Properties of Estimators

An unbiased estimator is one which, on average, gives the correct value of the unknown parameter. More precisely,

Definition 7.1. An estimator θ̂ of a parameter θ is unbiased if

Eθ(θ̂) = θ

for all values of θ. The bias of θ̂ is the mean estimation error:

bias(θ̂, θ) = Eθ(θ̂)− θ.

θ̂ is asymptotically unbiased if its bias approaches zero as the sample size n→∞.

The subscripts in the equations above are to remind you that the distribution (and therefore the expected value) of θ̂ depends on the unknown θ and that the equations must be true for any value it may have.

All other things being equal, it is nice for an estimator to be unbiased. However, there are many natural and useful estimators that are biased. For example, the sample standard deviation is a biased estimator of the population standard deviation. It is more important that an estimator be asymptotically unbiased. If it is not, there is a persistent systematic error which cannot be reduced by increasing the sample size. The sample standard deviation is asymptotically unbiased.
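The bias of the sample standard deviation, and its disappearance as n grows, can be estimated by simulation. The sketch below assumes standard normal data (so the true standard deviation is 1); the helper name `sd_bias`, the seed, and the sample sizes are hypothetical choices for illustration.

```r
# Estimate the bias of the sample standard deviation S for Norm(0, 1) samples.
set.seed(4)                                   # arbitrary seed
sd_bias <- function(n, nrep = 20000) {        # hypothetical helper
  s <- replicate(nrep, sd(rnorm(n)))          # nrep sample standard deviations
  mean(s) - 1                                 # estimated bias; the true sigma is 1
}
sizes  <- c(5, 20, 100)
biases <- sapply(sizes, sd_bias)
data.frame(n = sizes, estimated_bias = round(biases, 4))
```

The estimated bias is negative for small n and moves toward zero as n increases, which is what asymptotic unbiasedness requires.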

Another consideration is the accuracy of an estimator, as measured by the spread of its distribution about the true parameter value. When comparing two unbiased estimators of the same parameter, the one with smaller variance is better than the one with larger variance. Furthermore, we want the variance of an estimator to approach zero as n → ∞. If this is true for an asymptotically unbiased estimator then for large sample sizes, the probability is high that the estimated value of θ will be very close to the true value. All the common estimators that we will study are asymptotically unbiased. For many of them it can be shown that their variances approach zero at least as fast as the variance of any other estimator of the same parameter. Therefore, they are in a sense nearly the best possible estimators.

7.3 Estimating a Population Mean

Let X be a variable with unknown mean µ and standard deviation σ and let X1, X2, · · · , Xn be a random sample from the distribution of X. We will assume first that σ is known. Realistically, it probably is not known but we will postpone that issue for now. The sample mean $\bar{X}$ is an unbiased estimator of µ:

$$E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n} \mu = \mu.$$

Since the samples are independent, its variance goes to zero as n→∞.

$$\mathrm{var}(\bar{X}) = \mathrm{var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{var}(X_i) = \frac{1}{n^2}\sum_{i=1}^{n} \sigma^2 = \frac{\sigma^2}{n}.$$

$$\mathrm{sd}(\bar{X}) = \frac{\sigma}{\sqrt{n}}.$$

Therefore, for large sample sizes n, $\bar{X}$ will be close to µ with high probability. Let us examine this statement more closely in the case where $\bar{X}$ is normally distributed and its standardized value

$$Z = \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}$$

has a standard normal distribution.

Let ε > 0 be the accuracy we would like to achieve in estimating µ, i.e., the maximum tolerable error of estimation or margin of error. Let 1 − α, where α ∈ (0, 1), be the probability with which we would like to achieve it. Then

$$
\begin{aligned}
\Pr\left(|\bar{X}-\mu| \le \epsilon\right) &= \Pr\left(\frac{\sqrt{n}|\bar{X}-\mu|}{\sigma} \le \frac{\epsilon\sqrt{n}}{\sigma}\right) = \Pr\left(|Z| \le \frac{\epsilon\sqrt{n}}{\sigma}\right) \\
&= \Phi\left(\frac{\epsilon\sqrt{n}}{\sigma}\right) - \Phi\left(-\frac{\epsilon\sqrt{n}}{\sigma}\right) \ge 1-\alpha
\end{aligned}
$$

provided

$$\frac{\sqrt{n}\,\epsilon}{\sigma} \ge \Phi^{-1}\left(1-\frac{\alpha}{2}\right) = z_{\alpha/2}.$$

Therefore, if

$$n \ge \frac{z_{\alpha/2}^2\,\sigma^2}{\epsilon^2} \tag{7.1}$$

we achieve accuracy to within ε units with probability at least 1 − α.

This is a rule for determining the sample size required to achieve desired experimental accuracy. It depends on the normality of $\bar{X}$ and knowledge of σ. If the samples are not from a normal distribution, it can still be used as long as n is also large enough for the central limit theorem to apply to the distribution of $\bar{X}$. Conventionally n ≥ 30 is considered large enough in most situations. The rule is useful also if the population standard deviation σ is known only approximately.

Example 7.1. Engineers would like to estimate the mean sulfur dioxide concentration in emissions from a particular industrial process. They would like their estimate to be accurate to 0.1 ppm with a probability of at least 95%. The standard deviation of measurements is at most 2 ppm. How many independent measurements of SO2 concentration should they make?

Solution: Since 1 − α = 0.95, $z_{\alpha/2} = z_{.025} = 1.96$. We are given that ε = 0.1 and σ ≤ 2. Therefore, rounding up to the nearest integer, from (7.1) we need

$$n \ge \frac{1.96^2 \times 2^2}{0.1^2} = 1537$$

samples. This is large enough that normality of $\bar{X}$ should be of no concern.
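The sample size rule (7.1) is easy to wrap in a small R function. The sketch below is one possible way to do it; the function name `sample_size_mean` and its argument names are hypothetical, not something defined in the text.

```r
# Sample size from (7.1): smallest integer n with n >= (z * sigma / eps)^2.
sample_size_mean <- function(eps, sigma, conf = 0.95) {   # hypothetical helper
  z <- qnorm(1 - (1 - conf) / 2)      # z_{alpha/2}
  ceiling((z * sigma / eps)^2)        # round up to the next integer
}
sample_size_mean(eps = 0.1, sigma = 2)   # the setting of Example 7.1
```

Because σ enters only through the ratio σ/ε, using an upper bound for σ, as in Example 7.1, gives a conservative sample size.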

7.3.1 Confidence Intervals

From the definition of $z_{\alpha/2} = \Phi^{-1}(1-\alpha/2)$ we have

$$\Pr(|Z| < z_{\alpha/2}) = 1-\alpha.$$

Assuming that $\bar{X}$ is normally distributed,

$$1-\alpha = \Pr\left(-z_{\alpha/2} < \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma} < z_{\alpha/2}\right).$$

We can rearrange these inequalities as follows.

$$1-\alpha = \Pr\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right).$$

The interval

$$\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \tag{7.2}$$

with random endpoints includes the unknown parameter µ with probability 1 − α. It is called a 100(1 − α)% confidence interval for µ. A confidence interval is more informative than a single point estimate of a parameter because it gives a range of possible values for the parameter along with a statement of the probability that such an interval would include the parameter in repeated sampling.
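The phrase "in repeated sampling" can be made concrete with a simulation. The sketch below draws many normal samples with a known σ, forms the interval (7.2) for each, and reports the proportion of intervals that contain µ; the values of µ, σ, n, the confidence level, and the seed are illustrative assumptions.

```r
# Coverage of the interval (7.2) in repeated sampling, sigma known.
set.seed(5)                                  # arbitrary seed
mu <- 50; sigma <- 4; n <- 25                # illustrative values
alpha <- 0.10                                # 90% confidence
z <- qnorm(1 - alpha / 2)                    # z_{alpha/2}

xbars <- replicate(2000, mean(rnorm(n, mu, sigma)))
lcl <- xbars - z * sigma / sqrt(n)           # lower confidence limits
ucl <- xbars + z * sigma / sqrt(n)           # upper confidence limits
mean(lcl < mu & mu < ucl)                    # proportion of intervals containing mu
```

The reported proportion should be close to 1 − α = 0.90, up to simulation error.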

All of the above is based on the assumption that we have prior knowledge of the population standard deviation σ, at least to a good approximation. When this is not the case, and the sample size is large, there is a modification of the central limit theorem that provides a solution.

Theorem 7.1. Let $\bar{X}$ and S be the sample mean and standard deviation from a sample of size n from a distribution with mean µ and standard deviation σ. Then the distribution of

$$T = \frac{\sqrt{n}(\bar{X}-\mu)}{S}$$

approaches the standard normal distribution as n→∞.

When the population standard deviation is unknown and the sample size is large, an approximate 100(1− α)% confidence interval for µ is

$$\bar{X} \pm z_{\alpha/2}\frac{S}{\sqrt{n}}. \tag{7.3}$$

When σ is unknown, n must be somewhat larger for a good approximation than when σ is known. Below is a normal quantile plot of 100 replications of T for a sample of size n = 100 from the exponential distribution with mean 1. This is a very non-normal, skewed distribution. The plot indicates that n = 100 should be a large enough sample size in most applications. If the normal approximation is valid, we would expect that in 95% of the 100 replications, the lower confidence limit (lcl) would be less than the true mean of 1 and the upper confidence limit (ucl) would be greater than 1. The actual number is shown in the output.

> expsamp=matrix(rexp(10000),nrow=100)   # each row is one sample of size 100

> xbars=apply(expsamp,1,mean)

> sds=apply(expsamp,1,sd)

> zs=sqrt(100)*(xbars-1)/sds   # T = sqrt(n)*(xbar - mu)/S with n = 100

> lcl=xbars-1.96*sds/sqrt(100)

> ucl=xbars+1.96*sds/sqrt(100)

> sum(lcl < 1 & 1 < ucl)

[1] 93

> qqnorm(zs,main="Standardized Exponential Averages")

> qqline(zs)

[Figure: normal Q-Q plot titled "Standardized Exponential Averages" (Sample Quantiles against Theoretical Quantiles).]

7.3.2 Small Sample Confidence Intervals for a Normal Mean

The preceding results apply only when the sample size is large enough to assume that $\bar{X}$ has a normal distribution. If the sample size is not large, but it is known that the samples are from a nearly normal distribution, there is a confidence interval for µ based on the student-t distribution.

From Theorem 6.8, if X1, · · · , Xn is a sample from Norm(µ, σ), the random variable

$$T = \frac{\sqrt{n}(\bar{X}-\mu)}{S}$$

has a student-t distribution with n − 1 degrees of freedom. If $t_{\alpha/2}(n-1)$ denotes the 1 − α/2 quantile of this distribution,

$$\Pr\left(-t_{\alpha/2}(n-1) < T < t_{\alpha/2}(n-1)\right) = 1-\alpha.$$

Rearranging in the same way as before,

$$1-\alpha = \Pr\left(\bar{X} - t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}}\right).$$

Therefore, a 100(1− α)% confidence interval for µ is

$$\bar{X} \pm t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}}. \tag{7.4}$$

For large samples there is almost no difference between the student-t and normal confidence intervals. The number tα/2(n− 1) is almost the same as zα/2.

Example 7.2. A sample of 10 middle school teachers from the Houston area with exactly 5 years teaching experience was obtained. Their salaries are shown below. Find a 90% confidence interval for the mean salary after 5 years in the whole Houston area. (This is made-up data. The mean of the normal distribution it was generated from is 50,000.)

50333 43683 50290 40389 49324 46840 50849 40397 53249 53325

Solution: Normal quantile plots for only 10 observations can be misleading. This one does not show any pronounced non-normality, so we shall assume that the distribution is normal.

[Figure: normal Q-Q plot of the 10 salaries (Sample Quantiles against Theoretical Quantiles).]

The mean of the observations is 47867.90 and the standard deviation is 4853.46. For 90% confidence, tα/2(n − 1) = t.05(9) = 1.833. This comes from the student-t table in Chapter 6 or from R with the command

> qt(.95,9)

[1] 1.833113

The 90% confidence interval is

$$\bar{X} \pm t_{\alpha/2}\frac{S}{\sqrt{n}} = 47867.90 \pm 1.833 \times \frac{4853.46}{\sqrt{10}} = 47867.90 \pm 2813.29$$

or (45054.61, 50681.19).

R has a function that will find student-t confidence intervals from raw data for any given confidence level. It is "t.test", and it returns the confidence interval along with the results of a hypothesis test about the population mean. We will discuss hypothesis testing later in this chapter.

> salaries

[1] 50333 43683 50290 40389 49324 46840 50849 40397 53249 53325

> t.test(salaries,conf.level=.90)

One Sample t-test

data: salaries

t = 31.188, df = 9, p-value = 1.756e-10

alternative hypothesis: true mean is not equal to 0

90 percent confidence interval:

45054.44 50681.36

sample estimates:

mean of x

47867.9

The "t.test" function can be used in the large sample situation even with non-normal data. The confidence interval it returns is almost the same as the normal confidence interval because the student-t distribution converges to the standard normal distribution as the number of degrees of freedom increases without bound. Both the student-t procedure (7.4) and the normal procedure (7.3) are very robust against non-normality as long as the underlying distribution is symmetric and unimodal.
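To connect formula (7.4) with the output of "t.test", the interval for the salary data of Example 7.2 can be computed directly. This is a sketch for illustration; the names `tq`, `half`, and `by_hand` are arbitrary.

```r
# Student-t interval (7.4) for the salaries of Example 7.2, computed directly.
salaries <- c(50333, 43683, 50290, 40389, 49324,
              46840, 50849, 40397, 53249, 53325)
n <- length(salaries)
conf <- 0.90
tq <- qt(1 - (1 - conf) / 2, df = n - 1)          # t_{alpha/2}(n - 1)
half <- tq * sd(salaries) / sqrt(n)               # margin of error
by_hand <- mean(salaries) + c(-1, 1) * half

by_hand
t.test(salaries, conf.level = conf)$conf.int      # should agree with by_hand
```

The hand computation in the text uses the rounded value 1.833 for the quantile, which accounts for the small difference between (45054.61, 50681.19) and the interval reported by "t.test".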

7.3.3 Exercises

1. A sample of size 36 from a normally distributed population variable with population standard deviation 20 had a sample mean of 88. Find a 90% confidence interval for the population mean.

2. A sample of size 90 from a population variable had a sample mean of 4.74 and a sample standard deviation of 0.71. Find a 95% confidence interval for the population mean.

3. Import the "reacttimes" data set and consider the 50 observations of the variable "Times" to be a sample from a larger population. Find a 99% confidence interval for the population mean. Construct a normal quantile plot and comment on the appropriateness of the procedure.

4. The Cauchy distribution with density function

$$f(x) = \frac{1}{\pi(1+x^2)}$$

does not have a mean. Show that the mean does not exist. The distribution is bell-shaped and symmetric about its median 0. Repeat the computations for the plot "Standardized Exponential Averages" above, but generate samples from the Cauchy distribution using the "rcauchy" command instead of "rexp". Use a sample size of 50. Count the number of the 100 intervals generated that contain the number 0. Display the confidence intervals with the command

> cbind(lcl,ucl)


Comment on the number of intervals that contain 0 and the lengths of the intervals. Repeat the exercise with a sample size of 200. What do you observe about the lengths of the intervals?

5. We wish to estimate the population mean of a variable that has standard deviation 70.5. We want to estimate it with an error no greater than 5 units with probability 0.99. How big a sample should we take from the population? What happens if the standard deviation and the margin of error are both doubled?

6. The data frame "Loblolly" is included with the datasets library of R. Bring it into your R workspace with

> data(Loblolly,package="datasets")

or just click on it under the "Packages" tab in RStudio. This data set has measurements of height and age of 84 Loblolly pine trees. Select a sample of 10 of the 84 tree records with the command

> mytrees=Loblolly[sample(84,10), ]

> attach(mytrees)

Assume that the ratio of height to age is normally distributed. Find a 90% confidence interval for the population mean of this ratio, using your sample of 10 trees. You can address the sample values of this variable simply by the R expression

> height/age

for example.

7. Assess the normality of ”height/age”, considering the 84 trees in Loblolly as a sample from a much larger population. Using this sample of 84 trees, find a 90% confidence interval for the mean ratio height/age in the larger population.

8. Henry Cavendish (see footnote 1) made 29 measurements of the specific gravity of the earth. They are provided in the file "Cavendish.txt" at

www.math.uh.edu/ charles/data

Import this data into R and find a 95% confidence interval for the specific gravity of the earth. Assume Cavendish’s measurements are a random sample from a population whose mean is the true specific gravity.

According to NASA’s Earth Fact Sheet

http://nssdc.gsfc.nasa.gov/planetary/factsheet/earthfact.html

the specific gravity of Earth is 5.541.

1. Henry Cavendish, 1731-1810: British chemist and physicist. An important figure in the history of science.

7.4 Estimating a Population Proportion

Let Y be a random variable with a binomial distribution Binom(n, p), where n is the number of trials and p is the success probability. A random variable like Y most often arises from sampling n subjects from a population with a proportion p of "successes". Sampling is with replacement, although if the population is much larger than the sample size it matters very little whether the samples are obtained with or without replacement. The random variable

p̂ = Y/n

is the sample proportion of successes. It is an unbiased estimator of p.

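$$E(\hat{p}) = \frac{1}{n}E(Y) = \frac{1}{n}(np) = p.$$

Its variance and standard deviation are

$$\mathrm{var}(\hat{p}) = \frac{1}{n^2}\mathrm{var}(Y) = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}$$

$$\mathrm{sd}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$$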

In fact, estimating a population proportion is just a special case of estimating a population mean. p̂ is the sample average of independent samples from the Bernoulli distribution with success probability p. The main complication is that the population variance p(1−p) cannot be known without also knowing the mean p. Therefore, the assumption of a known population variance does not apply, except in one application described below.

Since we are dealing with a sample average, the central limit theorem applies and the distribution of the standardized sample success proportion

$$Z = \frac{\sqrt{n}(\hat{p}-p)}{\sqrt{p(1-p)}} \tag{7.5}$$

approaches standard normal as n → ∞. We will assume throughout this discussion that n is large enough for the normal approximation to be accurate. In fact, we assume that n is large enough for the enhanced version of the central limit theorem, Theorem 7.1, to apply. This implies that

$$Z' = \frac{\sqrt{n}(\hat{p}-p)}{\sqrt{\hat{p}(1-\hat{p})}} \tag{7.6}$$

is approximately standard normal also (with a somewhat larger value of n).


7.4.1 Choosing the Sample Size

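Let ε be the maximum tolerable error in estimating p with p̂ and let 1− α be the desired probability of achieving that degree of accuracy. From (7.5),

$$
\begin{aligned}
\Pr(|\hat{p}-p| \le \epsilon) &= \Pr\left(\frac{\sqrt{n}|\hat{p}-p|}{\sqrt{p(1-p)}} \le \frac{\sqrt{n}\,\epsilon}{\sqrt{p(1-p)}}\right) \\
&= \Phi\left(\frac{\sqrt{n}\,\epsilon}{\sqrt{p(1-p)}}\right) - \Phi\left(-\frac{\sqrt{n}\,\epsilon}{\sqrt{p(1-p)}}\right) \ge \Phi(z_{\alpha/2}) - \Phi(-z_{\alpha/2}) = 1-\alpha
\end{aligned}
$$

provided

$$\frac{\sqrt{n}\,\epsilon}{\sqrt{p(1-p)}} \ge z_{\alpha/2}.$$

Solving for n,

$$n \ge \frac{z_{\alpha/2}^2\, p(1-p)}{\epsilon^2}. \tag{7.7}$$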

This is unsatisfactory because one would have to know p to use it. However, it might be that a good guess at p is already known and the purpose of the sampling experiment is simply to refine that guess. If p∗ is a prior estimate of p, it can be substituted into the right hand side of (7.7), yielding

$$n \ge \frac{z_{\alpha/2}^2\, p^*(1-p^*)}{\epsilon^2}. \tag{7.8}$$

This procedure is common in public opinion polling, especially during campaigns, when a candidate’s approval rating p is updated every few days.

Another approach is to replace the right hand side of (7.7) by something larger. The function p(1−p) has a maximum value of 1/4 when p = 1/2. Substituting p = 1/2 into the right hand side of (7.7),

$$n \ge \frac{z_{\alpha/2}^2}{4\epsilon^2} \tag{7.9}$$

Example 7.3. Candidates A and B are competing in their party primary. Candidate A announced weeks ago and has been conducting polls frequently to determine his favorability rating (the percent- age of all prospective voters who view him favorably). His favorability rating in the last poll was 0.65. Candidate B just announced that she is running and has no polling history. Each wants to determine his or her favorability rating to within 3 percentage points. How many prospective voters should each candidate sample to achieve 3 percentage point accuracy with probability 0.95?

Solution: A has a prior estimate p∗ = 0.65 for his success probability, so he can use (7.8). With zα/2 = z0.025 = 1.96 and ε = 0.03, this results in

\[ n \ge \frac{1.96^2 \times 0.65 \times 0.35}{0.03^2} \approx 971. \]

Candidate B has no prior estimate of p, so she uses (7.9):

\[ n \ge \frac{1.96^2}{4 \times 0.03^2} \approx 1067. \]
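The sample size rules (7.8) and (7.9) are easy to wrap in a small R function. The sketch below is only illustrative: the name sample_size and its arguments are not from the text, and it uses qnorm for zα/2 instead of the rounded value 1.96. Because n must be a whole number that satisfies the inequality, the sketch rounds up, which gives 972 and 1068 for the two candidates, where rounding to the nearest integer gives 971 and 1067.

```r
# Smallest whole number n with n >= z^2 * p(1 - p) / eps^2.
# p_star is a prior guess at p; the default 0.5 is the worst case used in (7.9).
# (sample_size is a hypothetical helper name, not part of base R.)
sample_size <- function(eps, conf.level = 0.95, p_star = 0.5) {
  z <- qnorm(1 - (1 - conf.level) / 2)   # z_{alpha/2}
  ceiling(z^2 * p_star * (1 - p_star) / eps^2)
}

n_A <- sample_size(0.03, p_star = 0.65)  # candidate A, formula (7.8)
n_B <- sample_size(0.03)                 # candidate B, formula (7.9)
c(A = n_A, B = n_B)
```

Using the worst case p = 1/2 costs candidate B roughly a hundred extra interviews compared with candidate A's prior estimate.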


7.4.2 Confidence Intervals for p

Assuming that the sample size n is large, we can apply (7.5) and say

\[ \Pr\left( \hat{p} - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \right) = 1 - \alpha. \]

This is unusable as a confidence interval because the unknown p is part of the expression for the end points. Below are 3 possible ways of modifying it.

Method 1 – substitute 1/2 for p

This gives the confidence interval

\[ \hat{p} \pm \frac{z_{\alpha/2}}{2\sqrt{n}} \tag{7.10} \]

This interval is at least as wide as the other intervals in this list. Therefore, it sacrifices a bit of accuracy but achieves a confidence level a bit greater than the nominal level of 100(1 − α)%.

Method 2 – substitute p̂ for p

From (7.6) the distribution of the random variable Z ′ approaches standard normal as n→∞. Thus,

\[ \Pr\left( \hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right) = 1 - \alpha \]

approximately and

\[ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \tag{7.11} \]

is an approximate 100(1 − α)% confidence interval for p. This is probably the most often used confidence interval for p, but it has been discovered recently² that the rate of convergence of the distribution of Z′ to standard normal is not uniform, even when extreme values of p near 0 or 1 are avoided. For certain values of the true proportion p the actual confidence level is significantly less than the nominal confidence level unless n is quite large.
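The shortfall in confidence level can be checked exactly for any particular n and p, because the number of successes is binomial: add up the binomial probabilities of those outcomes whose interval (7.11) contains p. The following R sketch does this; the function name wald_coverage and the choice n = 30, p = 0.1 are illustrative assumptions, not taken from the text.

```r
# Exact coverage probability of the Method 2 interval (7.11) for given n and p.
# For each possible count y, form the interval and check whether it contains p;
# then sum the Binomial(n, p) probabilities of the counts that do.
wald_coverage <- function(n, p, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  y <- 0:n
  p_hat <- y / n
  half <- z * sqrt(p_hat * (1 - p_hat) / n)
  covers <- (p_hat - half < p) & (p < p_hat + half)
  sum(dbinom(y, n, p)[covers])
}

cov_small_p <- wald_coverage(30, 0.1)   # nominal level is 0.95
cov_half    <- wald_coverage(30, 0.5)
c(p_0.1 = cov_small_p, p_0.5 = cov_half)
```

Evaluating wald_coverage over a grid of p values for fixed n shows how irregularly the actual confidence level behaves as p varies.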

² L.D. Brown, T. Cai, and A. DasGupta, "Interval Estimation for a Binomial Proportion", Statistical Science 16 (2), 101-133.

Method 3 – solve a quadratic inequality for p

We know from (7.5) that 1 − α is the probability that

\[ -z_{\alpha/2} \le \frac{\sqrt{n}\,(\hat{p} - p)}{\sqrt{p(1-p)}} \le z_{\alpha/2}. \]

Another way of writing this pair of inequalities is

\[ \frac{n(\hat{p} - p)^2}{p(1-p)} \le z_{\alpha/2}^2, \]

or

\[ n(p - \hat{p})^2 \le z_{\alpha/2}^2\, p(1-p). \]

This is a quadratic inequality in p which can be solved by elementary algebra. After a great deal of algebraic manipulation the solution set is the set of all p in the confidence interval M ± H. M is the midpoint of the interval and is given by

\[ M = \frac{\hat{p} + \dfrac{z_{\alpha/2}^2}{2n}}{1 + \dfrac{z_{\alpha/2}^2}{n}}. \tag{7.12} \]

H is the half-width of the interval:

\[ H = \frac{z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z_{\alpha/2}^2}{4n^2}}}{1 + \dfrac{z_{\alpha/2}^2}{n}}. \tag{7.13} \]

Unlike the other intervals this confidence interval is not centered at p̂. Rather, it is centered at a point between p̂ and 1/2. For large values of n the center is very close to p̂. The expression inside the radical in the half-width H is a weighted sum of the estimated variance p̂(1 − p̂)/n and the value 1/(4n) that we substitute for the unknown variance in Method 1. It can be shown that the endpoints M ± H of this interval always lie between 0 and 1. This is not true of the other intervals. Despite its complexity, this confidence interval is considered superior to the others.

Method 3 is implemented in R through the "prop.test" function. For example, suppose that 12 successes were observed in n = 30 samples of a Bernoulli random variable. You should calculate M ± H by hand to verify that the interval reported in the output below is the Method 3 confidence interval.

> prop.test(x=12,n=30,correct=F,conf.level=.90)

1-sample proportions test without continuity correction

data: 12 out of 30, null probability 0.5

X-squared = 1.2, df = 1, p-value = 0.2733

alternative hypothesis: true p is not equal to 0.5

90 percent confidence interval:

0.2671262 0.5494187

sample estimates:

p

0.4

"prop.test" is designed for testing hypotheses about p as well as for calculating confidence intervals. Unless you specify otherwise, as we did here, "prop.test" will introduce a continuity correction that changes the answers slightly.
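The hand calculation of M ± H can itself be scripted. The sketch below computes all three intervals from (7.10), (7.11), (7.12) and (7.13) for the 12-out-of-30 example and compares the third with the interval reported by prop.test. The function name prop_ci and the row labels are illustrative choices, not from the text.

```r
# The three large-sample confidence intervals for p, given y successes in n trials.
# (prop_ci is a hypothetical helper; only qnorm and prop.test come from base R.)
prop_ci <- function(y, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p_hat <- y / n
  method1 <- p_hat + c(-1, 1) * z / (2 * sqrt(n))                   # (7.10)
  method2 <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)   # (7.11)
  M <- (p_hat + z^2 / (2 * n)) / (1 + z^2 / n)                      # (7.12)
  H <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) /
       (1 + z^2 / n)                                                # (7.13)
  method3 <- M + c(-1, 1) * H
  rbind(method1, method2, method3)
}

ci <- prop_ci(12, 30, conf.level = 0.90)
ci

# Interval reported by R, without the continuity correction
r_ci <- prop.test(x = 12, n = 30, correct = FALSE, conf.level = 0.90)$conf.int
all.equal(ci["method3", ], as.numeric(r_ci))
```

The rows of ci make it easy to see that Method 1 gives the widest interval and that only Method 3 is not centered at p̂ = 0.4.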


7.4.3 Exercises

1. The Food and Drug Administration monitors the production line of a breakfast cereal company to determine what proportion of its boxes of cereal contain insect parts. The FDA would like to know that proportion to within 5 percentage points with 95% confidence. How many boxes of cereal should they sample?

2. Suppose the company has been inspected before and that previously the proportion of cereal boxes with insect parts was 0.15. The FDA wants to be as unobtrusive as possible. How many boxes of cereal should they sample?

3. The FDA sampled 60 boxes of cereal and found 12 with insect parts. Find a 95% confidence interval for the true proportion using all three methods described above.

4. Use R to answer question 3 with Method 3.

7.5 Estimating Quantiles

Let X1, X2, · · · , Xn be a random sample of a variable X with a cumulative distribution F . We shall assume that F is continuous and that n is large enough for the central limit theorem to apply. We are interested in estimating the pth quantile of F ,

\[ \theta = q(X, p) = F^{-1}(p) = \min\{x \mid F(x) \ge p\}. \]

For a given real number y let Y be the number of the samples that satisfy Xi ≤ y, and let

\[ \hat{F}(y) = Y/n. \]

F̂(y) is simply the sample proportion of "successes" Xi ≤ y, but as a function of y it is a bona fide cumulative distribution function. It is called the empirical distribution function and is a sample estimate of the cumulative distribution function F of the variable X. Its pth quantile

\[ \hat{F}^{-1}(p) \]

is the pth quantile of the samples and an estimator of θ = F^{-1}(p). Since F̂(θ) is a sample success proportion with mean p,

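\[
1 - \alpha = \Pr\left( p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \le \hat{F}(\theta) < p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \right)
= \Pr\left( \hat{F}^{-1}\left( p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \right) \le \theta < \hat{F}^{-1}\left( p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \right) \right).
\]

Theorem 7.2. For large samples, a 100(1 − α)% confidence interval for the pth quantile θ = F^{-1}(p) of a continuous distribution F is

\[ \hat{F}^{-1}\left( p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \right) \le \theta \le \hat{F}^{-1}\left( p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \right) \tag{7.14} \]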


In particular, a 100(1 − α)% confidence interval for the median θ = F^{-1}(.5) is

\[ \hat{F}^{-1}\left( 0.5 - \frac{z_{\alpha/2}}{2\sqrt{n}} \right) \le \theta \le \hat{F}^{-1}\left( 0.5 + \frac{z_{\alpha/2}}{2\sqrt{n}} \right) \tag{7.15} \]

F̂^{-1}(y) is the sample yth quantile, which is returned by R’s "quantile" function. The intervals calculated may depend on the exact rules used by software to find the sample quantiles. In the example below, we have included the "type=1" argument to the quantile function to ensure that R calculates F̂^{-1} as defined above.

Example 7.4. The data frame "test.vs.grade" has placement test scores and semester grades for 179 students. Twenty of the test scores are missing. We will treat the known 159 values of the variable "Test" as a sample from a larger population and find a 95% confidence interval for the median of the population test scores.

> summary(Test)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  24.00   70.00   80.00   76.55   88.00  100.00      20

> quantile(Test,0.5-1.96/(2*sqrt(159)),na.rm=T,type=1)

42.22809%

76

> quantile(Test,0.5+1.96/(2*sqrt(159)),na.rm=T,type=1)

57.77191%

84

So, the 95% confidence interval is from 76 to 84.
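The two quantile calls can be combined into one helper that implements (7.14) for any p. This is a sketch under stated assumptions: the name quantile_ci is made up here, and since the "test.vs.grade" data are not reproduced in the text, the demonstration uses simulated scores that are purely hypothetical.

```r
# Large-sample confidence interval (7.14) for the p-th quantile.
# type = 1 makes quantile() return the inverse of the empirical distribution function.
# (quantile_ci is a hypothetical helper name.)
quantile_ci <- function(x, p = 0.5, conf.level = 0.95) {
  x <- x[!is.na(x)]                        # drop missing values, as na.rm=T does
  n <- length(x)
  z <- qnorm(1 - (1 - conf.level) / 2)
  half <- z * sqrt(p * (1 - p) / n)
  probs <- pmin(pmax(c(p - half, p + half), 0), 1)   # keep probabilities in [0, 1]
  unname(quantile(x, probs = probs, type = 1))
}

set.seed(10)
scores <- round(rnorm(159, mean = 75, sd = 12))  # simulated stand-in for Test
median_ci <- quantile_ci(scores)                 # 95% interval for the median
median_ci
```

With p = 0.5 the half-width z·sqrt(p(1 − p)/n) reduces to zα/2/(2√n), so the helper reproduces (7.15) as a special case.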

7.5.1 Exercises

1. Find 90% confidence intervals for the first and third quartiles of test scores.

2. Generate a sample of size 50 from the exponential distribution with mean 1. Use the "rexp" function in R. Find a 95% confidence interval for the median of the distribution. Is the true median in the interval? Repeat this several times. Repeat several more times with n = 100.

3. Repeat exercise 2 with a confidence level of 90% and a sample of size 50 from the Cauchy distribution with median 0.

4. Use the Loblolly data on 84 pine trees to find 90% confidence intervals for the quartiles and median of the population variable height/age.

5. Use Cavendish’s data to find a 95% confidence interval for the specific gravity of the earth, assuming that his data is a random sample from a population whose median is the true specific gravity. This is a small data set, so the large sample confidence interval may not be reliable.


7.6 Estimating the Variance and Standard Deviation

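The sample variance

\[ S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 \]

from a sample X1, X2, · · · , Xn of values of a numeric variable X is an unbiased estimator of σ² = var(X). To see this, note that

\[ (n-1)S^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n}X_i^2 - n\bar{X}^2. \]

Thus,

\[ (n-1)E(S^2) = \sum_{i=1}^{n}E(X_i^2) - nE(\bar{X}^2). \]

Also,

\[ E(X_i^2) = \mathrm{var}(X_i) + E(X_i)^2 = \sigma^2 + \mu^2 \]

and

\[ E(\bar{X}^2) = \mathrm{var}(\bar{X}) + E(\bar{X})^2 = \frac{\sigma^2}{n} + \mu^2. \]

Putting all this together, we have

\[ (n-1)E(S^2) = n(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right) = (n-1)\sigma^2, \]

and E(S²) = σ².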

The sample standard deviation S is not an unbiased estimator of the population standard deviation σ, but it is asymptotically unbiased and usually is the estimator of choice for σ.
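A quick simulation illustrates both claims. The sketch below draws many small normal samples and averages S² and S over them; the sample size n = 5, the number of replications, and the seed are arbitrary choices made for illustration.

```r
# Simulation check: S^2 is unbiased for sigma^2, while S underestimates sigma.
set.seed(1)
n <- 5
reps <- 100000
sigma <- 1

x <- matrix(rnorm(n * reps, mean = 0, sd = sigma), nrow = reps)  # one sample per row
s2 <- rowSums((x - rowMeans(x))^2) / (n - 1)                     # sample variances

mean_s2 <- mean(s2)        # should be close to sigma^2 = 1
mean_s  <- mean(sqrt(s2))  # should fall below sigma = 1
c(mean_S2 = mean_s2, mean_S = mean_s)
```

Increasing n in the simulation shrinks the gap between the average of S and σ, which is the sense in which S is asymptotically unbiased.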

For samples from a normal distribution, (n − 1)S²/σ² has a chi square distribution with n − 1 degrees of freedom. This enables us to find confidence intervals for σ² and σ. Suppose we want a 100(1 − α)% confidence interval for σ². Let q(α/2) and q(1 − α/2) denote the α/2 and 1 − α/2 quantiles of the chi square distribution with n − 1 degrees of freedom. Then,

\[
1 - \alpha = \Pr\left( q(\alpha/2) < \frac{(n-1)S^2}{\sigma^2} < q(1-\alpha/2) \right)
= \Pr\left( \frac{1}{q(1-\alpha/2)} < \frac{\sigma^2}{(n-1)S^2} < \frac{1}{q(\alpha/2)} \right)
= \Pr\left( \frac{(n-1)S^2}{q(1-\alpha/2)} < \sigma^2 < \frac{(n-1)S^2}{q(\alpha/2)} \right).
\]

To get a confidence interval for σ, simply take the square roots of the end points of the confidence interval for σ². Unfortunately, these confidence intervals are rather sensitive to departures from normality, so use them with caution.


Example 7.5. The R data set "airquality" has measurements of Ozone levels, wind speed, solar radiation and temperature for 153 days in New York. Wind speed measurements seem to be approximately normal, so we will apply the method above to find a 90% confidence interval for the variance and standard deviation of wind speeds in New York.

> attach(airquality)

> lcl=152*var(Wind)/qchisq(.95,152)

> ucl=152*var(Wind)/qchisq(.05,152)

> c(lcl,ucl)

[1] 10.37878 15.15278

> sqrt(.Last.value)

[1] 3.221612 3.892657
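The same calculation can be written as a reusable function that avoids attach and hard-coded degrees of freedom. This is a sketch; the name var_ci and its return format are illustrative choices, and the method is still the normal-theory chi square interval derived above.

```r
# Chi square confidence interval for the variance and standard deviation
# of a normal population. (var_ci is a hypothetical helper name.)
var_ci <- function(x, conf.level = 0.90) {
  x <- x[!is.na(x)]
  n <- length(x)
  alpha <- 1 - conf.level
  # Dividing by the upper quantile gives the lower limit, and vice versa
  ci_var <- (n - 1) * var(x) / qchisq(c(1 - alpha / 2, alpha / 2), df = n - 1)
  list(variance = ci_var, sd = sqrt(ci_var))
}

wind_ci <- var_ci(airquality$Wind, conf.level = 0.90)
wind_ci
```

For the Wind variable this should reproduce the limits shown in Example 7.5.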

7.7 Hypothesis Testing

A statistical hypothesis is simply a statement about one or more distributions or random variables. Statistical hypotheses usually specify or restrict the values of parameters of distributions. Researchers conduct controlled experiments with the goal of confirming a scientific hypothesis, which may be expressed in terms of the parameters of the distributions of experimental data. This is called the research hypothesis, and it asserts that there is a real experimental effect due to the pre-established conditions of the experiment. Opposed to the research hypothesis is the null hypothesis, which asserts that there is no real experimental effect and any apparent signal in the data is merely random variation. In other words, it is the hypothesis of a null experimental effect. The burden of proof is on the research hypothesis because it is the one that makes a definite, positive assertion about the reality of a phenomenon, and that is the logic of experimental science. The research hypothesis is also called the alternative hypothesis.

7.7.1 Test Statistics, Type 1 and Type 2 Errors

Let H0 denote the null hypothesis and H1 the alternative or research hypothesis. These are assertions about the distribution of a population variable X, from which a sample X1, X2, · · · , Xn is obtained. Based on the data, a decision is made either to accept H1, thereby rejecting H0, or not to accept H1 (not reject H0). We do not usually say that we accept H0; we either reject it or do not reject it.

Type 1 error – to reject H0 when it is true, in other words, to accept H1 when it is not true.

Type 2 error – to not reject H0 when it is false, i.e., to not accept H1 when it is true.

The data enters into the decision to reject or not reject H0 through a test statistic, a function τ = τ(X1, X2, · · · , Xn) of the data which is supposed to "point toward" H1 and away from H0 in the sense that the greater the value of τ the greater the degree of support for H1. The test statistic is a random variable and its distribution when H0 is true must be known. The hypothesis H1 is accepted if τ is sufficiently large, larger than a critical value τα.

Decision rule: Reject H0 (accept H1) if τ(X1, · · · , Xn) > τα.


The probability of type 1 error then is

\[ \alpha = \Pr(\tau > \tau_\alpha \mid H_0). \tag{7.16} \]

The probability of type 2 error Pr(τ ≤ τα|H1) usually cannot be calculated without more information because H1 does not completely specify the distribution of τ . The notation Pr(·|H0) is convenient in this context but it does not really refer to a conditional probability. H0 is not an event.

The classical Neyman-Pearson³ paradigm for hypothesis testing is to hold the probability of type 1 error fixed at some small value (conventionally, α = .01, .05, or .10), adjust τα according to (7.16), and make a definite decision as to whether or not to reject H0. If H0 is rejected the value of τ is said to be significant at level α. We insist that α be small because we do not want to accept H1 unless the evidence against H0 is strong.
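The meaning of (7.16) can be seen in a simulation: if H0 is true and the critical value is chosen correctly, about a fraction α of repeated experiments end in a (wrong) rejection. The sketch below uses the standardized sample mean of normal data with known σ as the test statistic τ, which is the statistic treated in the next section; the sample size, seed, and number of replications are arbitrary illustrative choices.

```r
# Simulated type 1 error rate of a level-alpha test when H0 is true.
set.seed(2)
alpha <- 0.05
n <- 25
mu0 <- 0
sigma <- 1
reps <- 20000

x <- matrix(rnorm(n * reps, mean = mu0, sd = sigma), nrow = reps)  # data generated under H0
tau <- sqrt(n) * (rowMeans(x) - mu0) / sigma    # test statistic, standard normal under H0
tau_alpha <- qnorm(1 - alpha)                   # critical value solving (7.16)

type1_rate <- mean(tau > tau_alpha)             # fraction of false rejections
type1_rate
```

Changing alpha changes tau_alpha and the observed rejection rate together, which is exactly the adjustment the Neyman-Pearson paradigm prescribes.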

7.8 Hypotheses About a Population Mean

Suppose that X1, X2, · · · , Xn is a random sample from a distribution with mean µ and known standard deviation σ. If there is no true experimental effect, the null hypothesis is

H0 : µ = µ0,

where µ0 is some given number. The alternative is

H1 : µ > µ0.

The sample average X̄ is a test statistic. Everyone would agree that larger values of X̄ offer stronger support for H1 : µ > µ0. We will assume that the distribution of X̄ is normal, either because the samples are from a normal distribution or because the sample size is large enough for the central limit theorem to apply. Then if H0 is true we know that X̄ ∼ Norm(µ0, σ/√n). Instead of X̄, we will take its standardized value

\[ Z = \frac{\sqrt{n}\,(\bar{X} - \mu_0)}{\sigma} \]

as our test statistic τ. If α is the desired probability of type 1 error, we choose the critical value of Z so that

\[ \Pr(Z > z_\alpha \mid H_0) = \alpha, \]

i.e., zα = Φ^{-1}(1 − α). We reject H0 if

\[ Z > z_\alpha, \]

in other words, if

\[ \bar{X} > \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}. \]

The alternative H1 : µ > µ0 is called a one-sided alternative. The alternative hypothesis H1 : µ < µ0 is also one-sided. To force it into our template, we could choose the test statistic τ = −Z and τα = zα. Then we reject H0 if −Z > zα, equivalently, if Z < −zα or

\[ \bar{X} < \mu_0 - z_\alpha \frac{\sigma}{\sqrt{n}}. \]

The alternative hypothesis H1 : µ ≠ µ0 is two-sided. To test it against H0 : µ = µ0 we use the test statistic τ = |Z| with critical value τα = zα/2 and reject H0 if

\[ |\bar{X} - \mu_0| > z_{\alpha/2} \frac{\sigma}{\sqrt{n}}. \]

³ Jerzy Neyman 1894-1981 and Egon Pearson 1895-1980: eminent Polish and British mathematical statisticians

Example 7.6. Test automobiles of a given type burning fuel of a given type are known to have a mean fuel efficiency of 23 mpg with a standard deviation of 3 mpg. A new gasoline additive is being tested which supposedly improves efficiency. A sample of 36 test cars was fueled with the gasoline containing the additive and their fuel efficiencies were measured. Their average was 24.5 mpg. At a significance level of α = 0.05 can we conclude that the additive does improve efficiency?

Solution: We will assume that sample averages X̄ from samples of size n = 36 are normally distributed. We are testing the research hypothesis H1 : µ > 23 against the null hypothesis H0 : µ = 23. We also assume that the variance of fuel efficiency with the additive is the same as the variance without it. We accept H1 if

\[ \bar{X} > \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}} = 23 + 1.645 \times \frac{3}{\sqrt{36}} = 23.8225. \]

Since it is indeed true that X̄ = 24.5 > 23.8225 we accept H1 and conclude that the additive improves efficiency.
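The arithmetic of this example can be checked in R. The helper below is a sketch for the one-sided test with known σ; the function name z_test_greater and the names of the list components are illustrative, not from the text.

```r
# One-sided z test of H0: mu = mu0 against H1: mu > mu0 with known sigma.
# (z_test_greater is a hypothetical helper name.)
z_test_greater <- function(xbar, mu0, sigma, n, alpha = 0.05) {
  z <- sqrt(n) * (xbar - mu0) / sigma      # standardized test statistic
  z_alpha <- qnorm(1 - alpha)              # critical value
  list(z = z,
       critical_z = z_alpha,
       cutoff_xbar = mu0 + z_alpha * sigma / sqrt(n),  # reject if xbar exceeds this
       reject = z > z_alpha)
}

fuel <- z_test_greater(xbar = 24.5, mu0 = 23, sigma = 3, n = 36)
fuel
```

Comparing Z with zα and comparing X̄ with the cutoff are the same decision rule written on two different scales.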

7.8.1 Tests for the mean when the variance is unknown

When testing hypotheses about the mean of a variable X the variance will not usually be known. However, for large samples, the random variable

\[ T = \frac{\sqrt{n}\,(\bar{X} - \mu_0)}{S}, \]

where S is the sample standard deviation, has an approximate standard normal distribution under the hypothesis H0 : µ = µ0. Therefore, T or |T| can be used as a test statistic for this null hypothesis. For the one-sided alternative H1 : µ > µ0 we reject H0 when

\[ T > z_\alpha, \]

that is, when

\[ \bar{X} > \mu_0 + z_\alpha \frac{S}{\sqrt{n}}. \tag{7.17} \]

For the two-sided alternative H1 : µ ≠ µ0, we reject H0 when

\[ |T| > z_{\alpha/2}, \]

i.e., when

\[ |\bar{X} - \mu_0| > z_{\alpha/2} \frac{S}{\sqrt{n}}. \tag{7.18} \]

Example 7.7. The verbal IQs of 200 school children of similar age and circumstances were measured. The average was 11.30 (not on a 100 point scale), and the sample standard deviation was 2.26. The purpose of the study was to determine whether the population mean IQ is different from 11, which was the mean for children of the same characteristics 10 years previously. At a significance level of α = 0.05, can we conclude that the mean is different? What if α = 0.10?

Solution: The observed value of T is

\[ T = \frac{\sqrt{n}\,(\bar{X} - \mu_0)}{S} = \frac{\sqrt{200}\,(11.30 - 11)}{2.26} = 1.877. \]

We are not assuming that scores are normally distributed, but n = 200 should be plenty large enough for T to be nearly standard normal. Since the alternative H1 : µ ≠ 11 is two-sided, the test statistic is |T| and we reject H0 : µ = 11 if

\[ |T| > z_{\alpha/2} = z_{.025} = 1.96. \]

However, 1.877 is not greater than 1.96, so we do not accept H1. If α = 0.10, then zα/2 = z.05 = 1.645 and we would accept H1.

Student t Tests for Small Samples

If n is not large, but the samples come from a normal distribution and H0 is true, T has a student-t distribution with n − 1 degrees of freedom and the critical value is tα(df = n − 1), the 100(1 − α)th percentile of student-t. For H1 : µ > µ0 we reject H0 when

\[ \bar{X} > \mu_0 + t_\alpha(n-1) \frac{S}{\sqrt{n}}, \tag{7.19} \]

and for H1 : µ ≠ µ0 reject H0 when

\[ |\bar{X} - \mu_0| > t_{\alpha/2}(n-1) \frac{S}{\sqrt{n}}. \tag{7.20} \]

Example 7.8. W.S. Gosset, 1876-1937, was an English statistician who worked for the Guinness brewing company. He discovered the distributions we know as the student-t distributions and published his work under the pseudonym "Student". In one of the first applications, he compared the yield of barley seeds dried by two different methods. Seeds dried by each method were planted in adjacent small plots (split plots) that had no difference in soil properties or rainfall.⁴ There were 11 split plots and the yields for each drying method on each split plot are shown below. We assume that the difference in yield for a split plot is normally distributed. We do not know the mean or the variance of the distribution of yield differences. We are interested in the research hypothesis that the mean difference is not equal to 0 against the null hypothesis that it is 0.

⁴ W.S. Gosset, "The Probable Error of a Mean", Biometrika 6 (1908), 1-25


Plot         1     2     3     4     5     6     7     8     9    10    11
Regular   1903  1935  1910  2496  2108  1961  2060  1444  1612  1316  1511
Kiln      2009  1915  2011  2463  2180  1925  2122  1482  1542  1443  1535
Diff       106   -20   101   -33    72   -36    62    38   -70   127    24

The variable X = Diff = Kiln − Regular is assumed to have a normal distribution. The sample mean is X̄ = 33.73 and the sample standard deviation is S = 66.17. Thus, the observed value of T is

\[ T = \frac{\sqrt{11} \times 33.73}{66.17} = 1.691. \]

If α = 0.05, the critical value of |T| is tα/2(df = n − 1) = t.025(10) = 2.228. Therefore, we do not conclude that there is a difference in yield for the two drying methods.
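The calculation can be reproduced in R from the table of yields. The sketch below computes T and the critical value by hand and then calls R's built-in t.test on the differences, which reports the same statistic together with a p-value. The variable names are illustrative.

```r
# Gosset's barley yields, entered from the table above
regular <- c(1903, 1935, 1910, 2496, 2108, 1961, 2060, 1444, 1612, 1316, 1511)
kiln    <- c(2009, 1915, 2011, 2463, 2180, 1925, 2122, 1482, 1542, 1443, 1535)
d <- kiln - regular                        # the Diff row

n <- length(d)
t_obs  <- sqrt(n) * (mean(d) - 0) / sd(d)  # observed T for H0: mean difference = 0
t_crit <- qt(1 - 0.05 / 2, df = n - 1)     # two-sided critical value at alpha = 0.05
c(t_obs = t_obs, t_crit = t_crit, reject = abs(t_obs) > t_crit)

# Built-in one-sample t test on the differences
gosset_test <- t.test(d, mu = 0)
gosset_test
```

Calling t.test(kiln, regular, paired = TRUE) is an equivalent way to run the same test without forming the differences first.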

7.9 p-values

Instead of comparing the test statistic to a critical value for a pre-established significance level, many statisticians prefer to simply report the p-value of the statistic. Recall that the larger the value of the test statistic τ , the greater is its degree of confirmation of H1 and disconfirmation of H0. One way to quantify this idea is to compare the observed value of τ to the distribution of values it would have in future replications of the experiment, assuming H0 to be true. Let τobs be the observed value of τ , a fixed number once the experiment has been done. We define the p-value of τobs to be Pr(τ > τobs|H0). The smaller this probability is, the larger τobs is, comparatively speaking. A very small p-value is strong evidence that H1 is true and H0 is not. There is a temptation to think of the p-value as the probability of H0, but that is a misinterpretation. H0 is not an event.

The figure below is a pictorial representation of p-values.

[Figure: null density of the test statistic τ, with the observed value τobs marked on the horizontal axis; the p-value is the area under the density to the right of τobs.]

For the one-sided alternative H1 : µ > µ0 the p-value is Pr(T > Tobs | H0). For the two-sided alternative H1 : µ ≠ µ0, it is Pr(|T| > |Tobs| | H0).

To test at a fixed significance level α, simply compare the p-value to α. If the p-value is less than or equal to α, reject H0. We will calculate the p-value in Example 7.7. The observed value of T was 1.877. Since we were assuming that T is normally distributed and the alternative was two-sided, the p-value is

Pr(|T | > 1.877) = 2Pr(T > 1.877) = 2(1− Φ(1.877)) = 0.0605

Since 0.05 < p-value < 0.10, we would not reject H0 at significance level α = 0.05, but we would reject H0 at level α = 0.10.
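The p-value calculation takes one line in R with pnorm. The sketch below wraps the three cases in a small helper and applies it to Example 7.7; it assumes, as the example does, that T is approximately standard normal. The function name p_value and its alternative argument are illustrative choices.

```r
# Normal-approximation p-values for an observed statistic t_obs.
# (p_value is a hypothetical helper name.)
p_value <- function(t_obs, alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         two.sided = 2 * (1 - pnorm(abs(t_obs))),   # Pr(|T| > |t_obs|)
         greater   = 1 - pnorm(t_obs),              # Pr(T > t_obs)
         less      = pnorm(t_obs))                  # Pr(T < t_obs)
}

# Example 7.7: n = 200, sample mean 11.30, S = 2.26, mu0 = 11, two-sided alternative
t_obs <- sqrt(200) * (11.30 - 11) / 2.26
p <- p_value(t_obs, "two.sided")
p
p <= 0.05   # reject at the 5% level?
p <= 0.10   # reject at the 10% level?
```

For small samples from a normal population, pnorm would be replaced by pt with n − 1 degrees of freedom.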

It is a serious misapplication of p-values to use them to shop for alternative hypotheses. In Example 7.7, the p-value of Tobs = 1.877 for the one sided alternative H1 : µ > 11 is 0.0303, whereas for the alternative H1 : µ ≠ 11 it is 0.0605. If we take α = 0.05 as the required significance level, and have not specified H1 in advance, then we have the ridiculous situation that we are willing to believe µ > 11 but not willing to believe µ ≠ 11. Hypothesis testing makes sense only when the hypotheses are formulated before the collection of data.

7.9.1 Exercises

1. For each of the following scenarios state whether H0 should be rejected or not. State any assumptions that you make beyond the information that is given.

(a) H0 : µ = 4, H1 : µ ≠ 4, n = 15, X̄ = 3.4, S = 1.5, α = .05.

(b) H0 : µ = 21, H1 : µ < 21, n = 75, X̄ = 20.12, S = 2.1, α = .10.

(c) H0 : µ = 10, H1 : µ ≠ 10, n = 36, p-value = 0.061.

2. Use the "test.vs.grade" data and test the null hypothesis that the mean test score for the population is 70 against the alternative that it is greater than 70. Find a p-value and state your conclusion if α = 0.05. Repeat for the null hypothesis µ = 75.

3. Use your sample of 10 trees to test the null hypothesis that the mean value of height/age is 2 against the alternative that it is greater than 2. Give a p-value and state your conclusion if α = 0.10.

4. Use the Cavendish data to test the research hypothesis that the specific gravity of the earth is greater than 5.4. Give a p-value.

7.10 Hypotheses About a Population Proportion

Let p denote the proportion of successes in a population and let Y be the number of successes in a sample with replacement of size n from the population. The sample proportion of successes p̂ = Y/n is an unbiased estimator of p. Consider the null hypothesis

H0 : p = p0

and the one sided alternative H1 : p > p0.

If n is large and H0 is true,

\[ Z = \frac{\sqrt{n}\,(\hat{p} - p_0)}{\sqrt{p_0(1-p_0)}} = \frac{Y - np_0}{\sqrt{np_0(1-p_0)}} \]

is approximately standard normal. Therefore, Pr(Z > zα | H0) = α, approximately.

A test of significance level α for H0 : p = p0 against the alternative H1 : p > p0 is to reject H0 when Z > zα, equivalently when

\[ \hat{p} > p_0 + z_\alpha \sqrt{\frac{p_0(1-p_0)}{n}}, \]

or when

\[ Y > np_0 + z_\alpha \sqrt{np_0(1-p_0)}. \]


The level α test for H0 against the two sided alternative H1 : p ≠ p0 rejects H0 when

\[ |Z| > z_{\alpha/2}, \]

that is, when

\[ |Y - np_0| > z_{\alpha/2} \sqrt{np_0(1-p_0)}. \]

The p-values for the one sided and two sided alternatives are, respectively,

\[ \Pr(Z > Z_{obs} \mid H_0) = 1 - \Phi(z_{obs}) \]

and

\[ \Pr(|Z| > |Z_{obs}| \mid H_0) = 2(1 - \Phi(|z_{obs}|)), \]

where

\[ Z_{obs} = \frac{Y_{obs} - np_0}{\sqrt{np_0(1-p_0)}}. \]

Example 7.9. A gene occurs in a dominant form (allele) with probability 2/3 and in recessive form with probability 1/3. An organism in the population shows the dominant physical characteristic if its two copies of the gene are either two dominant copies or a dominant and a recessive copy. Organisms with two recessive copies of the gene show the recessive physical characteristic. If the population is in genetic equilibrium, the frequency of the dominant characteristic in the population is 89%. A biologist suspects that the population is not in genetic equilibrium and to test his suspicion collects 100 specimens. Eighty of them had the dominant characteristic. Is the biologist’s claim supported?

Solution: Let p denote the proportion of dominant physical types in the population. The research hypothesis is H1 : p ≠ 0.89 and the null hypothesis is H0 : p = 0.89. The observed value of Y, the number in the sample of n = 100 with the dominant characteristic, is

\[ Y_{obs} = 80, \]

and the observed value of Z is

\[ Z_{obs} = \frac{80 - 100(.89)}{\sqrt{100(.89)(.11)}} = -2.876. \]

Since the alternative is two sided, the p-value is

\[ \Pr(|Z| > 2.876 \mid H_0) = 2(1 - \Phi(2.876)) = 0.004. \]

Since the p-value is so small, we can say that there is strong evidence that the population is not in genetic equilibrium.

The R function "prop.test" that we used to find confidence intervals for a success probability p is also used to test hypotheses about p. We will use it to answer the question in the preceding example.

> prop.test(x=80,n=100,p=0.89,correct=F)

1-sample proportions test without continuity correction


data: 80 out of 100, null probability 0.89

X-squared = 8.2737, df = 1, p-value = 0.004022

alternative hypothesis: true p is not equal to 0.89

95 percent confidence interval:

0.7111708 0.8666331

sample estimates:

p

0.8

We omitted the Yates continuity correction so that R’s answer could be compared to the one above. The Yates correction alters the answers slightly. The X-squared statistic given in the R output is the square of the test statistic Z, so there is a 1-1 correspondence between its values and the values of |Z|. If the alternative is one sided, the argument alternative="g" or alternative="l" can be included in the function call. For example, if H1 : p < 0.89 is the alternative hypothesis,

> prop.test(x=80,n=100,p=0.89,correct=F,alternative="l")

1-sample proportions test without continuity correction

data: 80 out of 100, null probability 0.89

X-squared = 8.2737, df = 1, p-value = 0.002011

alternative hypothesis: true p is less than 0.89

95 percent confidence interval:

0.0000000 0.8574982

sample estimates:

p

0.8
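The relationship between the hand calculation and the prop.test output can be verified directly. The sketch below recomputes Zobs and the two p-values for the genetics example and compares them with what prop.test reports; the variable names are illustrative.

```r
# Hand calculation for Example 7.9 compared with prop.test
y <- 80
n <- 100
p0 <- 0.89

z_obs  <- (y - n * p0) / sqrt(n * p0 * (1 - p0))   # observed Z
p_two  <- 2 * (1 - pnorm(abs(z_obs)))              # two-sided p-value
p_less <- pnorm(z_obs)                             # p-value for H1: p < p0

r_two  <- prop.test(x = y, n = n, p = p0, correct = FALSE)
r_less <- prop.test(x = y, n = n, p = p0, correct = FALSE, alternative = "less")

c(z_obs = z_obs, z_squared = z_obs^2, X_squared = unname(r_two$statistic))
c(two_sided_by_hand = p_two, two_sided_R = r_two$p.value)
c(less_by_hand = p_less, less_R = r_less$p.value)
```

The one-sided p-value is half of the two-sided one here because the observed proportion lies on the side of p0 named by the alternative.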

7.10.1 Exercises

1. Let p denote the proportion of all Math 3339 students who are women. On some random class day, count the number of students attending your class and the number of them who are women. At a significance level of α = 0.05, test the null hypothesis H0 : p = 1/2 against the alternative H1 : p < 1/2. Assume that the students attending your class are a random sample of Math 3339 students.

2. Let X be a random variable with a continuous distribution and suppose its median m is unique. Consider the null hypothesis H0 : m = m0 and the alternative H1 : m > m0. Suppose that a large sample of n values of X is obtained. Let Y be the number of sample values Xi ≤ m0. Y has a binomial distribution Y ∼ Binom(n, p) with success probability p. If H0 is true, what is the value of p? What does H1 imply about p? Show how to test H0 against H1 with Y. This is called the sign test for the median of X. Apply the sign test to the variable "Times" in the data set "react.times" and test H0 : m = 1.4 against H1 : m > 1.4. Give a p-value. You can count the number of observations of Times less than or equal to 1.4 with

> sum(Times <= 1.4)


after attaching react.times to your workspace. Use "prop.test" to perform the test.

3. Modify the sign test to test the hypothesis that the first quartile of Times is equal to 1.2 against the alternative that it is less than 1.2.


Chapter 8

Regression and Correlation

8.1 Examples of Linear Regression Problems

The R data set "mammals" is included in the library "MASS". It lists the mean body weights in kilograms and the mean brain weights in grams of 62 mammal species. It is part of a larger data set in a study of sleep in mammals.¹ If for each of the 62 species we plot a point in a rectangular coordinate system, with body weight on the horizontal x axis and brain weight on the vertical y axis, we obtain a scatterplot of the data.

¹ Allison, T. and Cicchetti, D.V., Sleep in mammals: ecological and constitutional correlates. Science 194 (1976).

[Figure: scatterplot of the mammals data, with body weight ("body") on the horizontal axis and brain weight ("brain") on the vertical axis.]

The purpose of such an exercise might be to discover a relationship between body weight and brain weight that is characteristic of mammalian development. The scatterplot does not suggest much of a relationship. One reason is that the diagram is dominated by a few large mammals and the others are crowded together near the lower left corner. This can be alleviated by plotting the logarithms of the variables.

> data(mammals,package="MASS")

> attach(mammals)

> plot(log(body),log(brain),xlab="log body wt",ylab="log brain wt")

[Figure: scatterplot of log brain wt (vertical axis) against log body wt (horizontal axis).]

Since this data is a result of sampling, it is fair to consider Y=log(brain wt) and X=log(body wt) as jointly distributed random variables.

For mammals with log body weights X near a given value x, there is a distribution of log brain weights Y . It appears that the centers of these distributions increase almost linearly as x increases. Furthermore, the vertical dispersion of the distribution of Y does not seem to vary much as x varies. We will hypothesize that the conditional mean of Y , given that X = x, is a linear function of x:

\[ E(Y \mid X = x) = \beta_0 + \beta_1 x \tag{8.1} \]

where β0 and β1 are unknown intercept and slope parameters, respectively, and that the variance of the conditional distribution of Y , given that X = x, is constant, independent of x:

\[ \mathrm{var}(Y \mid X = x) = \sigma^2. \tag{8.2} \]

The constant variance σ² is also unknown. These are the two basic assumptions of simple linear regression. The term "regression" was first applied in this context by Francis Galton (1822-1911), who studied the relationship between the heights of fathers and their full-grown sons. He observed that the sons of unusually tall fathers tended to be tall, but not as tall as their fathers. Hence, their heights "regressed to the mean".

The values of X in this example were not predetermined as part of the design of the experiment. Rather, they were simply observed along with the corresponding values of Y. When this is the case, the experiment is said to be an observational study. A designed experiment is one in which the values of X are controlled. This type of experiment is more common in engineering, some of the hard sciences and pharmaceutical research than in social sciences and business. The scatterplot below shows data on lifetimes of 1.5 volt batteries.²

[Figure: scatterplot of Time (vertical axis) against Voltage (horizontal axis).]

The X variable is the voltage of the battery, which decreases slowly from 1.5 to a smaller value when the battery is under a load. The Y variable is the measured time for the voltage of a battery to de- crease to the experimenter-determined level of x = 1.3, 1.2, 1.1, etc. The constant variance assumption (11.2) appears to be at least approximately true, but the assumption of linearity (8.1) seems to be violated. There is a noticeable curvature in the pattern of points in the scatterplot. The methods of linear regression analysis developed in this chapter should be used with caution in such a case.

² Peter K. Dunn, Comparing the lifetimes of two brands of batteries, Journal of Statistics Education 21, 1 (2013)

Regardless of whether the experiment is designed or an observational study, we will call X the design variable and Y the response variable. Even if the data comes from an observational study, we write the x-values in lower case since we are concerned only with the conditional distribution of Y , given the value of X. Therefore, the values of X are treated as non-random inputs.

8.2 Least Squares Estimates

The data in a simple linear regression problem consists of n pairs (x1, Y1), (x2, Y2), · · · , (xn, Yn). Both variables are numeric in character. We can represent the data as:

\[
\begin{aligned}
Y_1 &= \beta_0 + \beta_1 x_1 + \epsilon_1\\
Y_2 &= \beta_0 + \beta_1 x_2 + \epsilon_2\\
&\;\;\vdots\\
Y_n &= \beta_0 + \beta_1 x_n + \epsilon_n
\end{aligned}
\tag{8.3}
\]

The εi in the equations above are the random deviations of the Yi from their expected values. They are unobservable since we don't know the values of β0 and β1.

There is an alternate parametrization of (8.3) that is convenient. Let x̄ be the average of x1, · · · , xn and write

Yi = α + β1(xi − x̄) + εi,

where β0 = α − β1x̄, that is, α = β0 + β1x̄. We will use this parametrization in the derivation below of the least squares estimates. If we have estimators α̂ of α and β̂1 of β1, we can immediately get the estimator β̂0 = α̂ − β̂1x̄ of β0.

Given estimators α̂ and β̂1, the estimated expected value or predicted value of Yi is

Ŷi = α̂+ β̂1(xi − x̄),

and the deviation of the observation Yi from its predicted value is

ei = Yi − Ŷi.

ei is also called the ith residual. Think of it as an estimate of the unobservable true random error εi in (8.3).

The method of least squares selects estimators α̂ and β̂1 that minimize the residual sum of squares:

\[
\mathrm{SS(resid)} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \bigl(Y_i - \hat\alpha - \hat\beta_1 (x_i - \bar{x})\bigr)^2 .
\]

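With a little algebra, this can be written

\[
\begin{aligned}
\mathrm{SS(resid)} &= \sum_{i=1}^{n} (Y_i - \hat\alpha)^2 - 2\hat\beta_1 \sum_{i=1}^{n} (Y_i - \hat\alpha)(x_i - \bar{x}) + \hat\beta_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= \sum_{i=1}^{n} (Y_i - \hat\alpha)^2 - 2\hat\beta_1 S_{xy} + \hat\beta_1^2 S_{xx},
\end{aligned}
\tag{8.4}
\]

where

\[
S_{xy} = \sum_{i=1}^{n} Y_i (x_i - \bar{x})
\]

and

\[
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 .
\]

The second line uses the fact that the deviations xi − x̄ sum to zero, so the α̂ in the cross-product term drops out.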

Now, the first term on the right of (8.4) is a quadratic function of α̂ alone and does not involve β̂1. The sum of the other two terms on the right is a quadratic function of β̂1 alone. Therefore, SS(resid) will be minimized when we choose α̂ to minimize the first term and β̂1 to minimize the sum of the other two terms. The solutions are:

\[
\hat\alpha = \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i \tag{8.5}
\]

\[
\hat\beta_1 = \frac{S_{xy}}{S_{xx}} \tag{8.6}
\]
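The minimization step can be made explicit by completing the square in each group of terms of (8.4). The following is a sketch of that step, using only quantities already defined and assuming Sxx > 0 (the xi are not all equal):

```latex
\begin{aligned}
\sum_{i=1}^{n} (Y_i - \hat\alpha)^2
  &= \sum_{i=1}^{n} (Y_i - \bar{Y})^2 + n(\bar{Y} - \hat\alpha)^2, \\
-2\hat\beta_1 S_{xy} + \hat\beta_1^2 S_{xx}
  &= S_{xx}\left(\hat\beta_1 - \frac{S_{xy}}{S_{xx}}\right)^2 - \frac{S_{xy}^2}{S_{xx}} .
\end{aligned}
```

The first expression is smallest when α̂ = Ȳ and the second when β̂1 = Sxy/Sxx, which are (8.5) and (8.6). The minimized value of SS(resid) is then the sum of squared deviations of the Yi about Ȳ, minus Sxy²/Sxx.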

The least squares estimator of the intercept parameter β0 in the original formulation is

\[
\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{x}. \tag{8.7}
\]

The line with equation y = β̂0 + β̂1x is the fitted line. For a small data set, a reasonably good calculator will calculate the least squares estimates with the punch of a button, once the x’s and Y ’s are entered. Even without using this feature, the least squares estimates are easy to find. Note that β̂1 is the sample covariance between the x’s and Y ’s divided by the sample variance of the x’s.

Example 8.1. The data frame "Mileage" shown below has 11 predetermined fuel mixture ratios and the mileages per gallon of 11 test cars with those fuel ratios. The table leaves space for you to fill in the cross-product terms Yi(xi − x̄) and the terms (xi − x̄)2. The bottom row gives the averages of the first two columns and the sums of the last two. This is a good way to organize the calculations if you must do them without technology.

  i               xi: fuel   Yi: mileage   Yi(xi − x̄)   (xi − x̄)2
  1               0.40       17.42
  2               0.43       17.56
  3               0.46       17.62
  4               0.49       17.69
  5               0.52       18.08
  6               0.55       18.01
  7               0.58       18.01
  8               0.61       18.34
  9               0.64       18.26
  10              0.67       18.50
  11              0.70       18.41
  average or sum  0.55       17.99         0.356         0.099

From the bottom line of the table, we have α̂ = Ȳ = 17.99, Sxy = 0.356, and Sxx = 0.099. Thus β̂1 = Sxy/Sxx = 3.594. (This value comes from the unrounded sums; dividing the rounded values 0.356/0.099 gives 3.596.) The equation of the fitted line is

y = 17.99 + 3.594(x − 0.55) = 16.01 + 3.594x
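The table calculation can also be checked directly in R. The sketch below types in the eleven data pairs from the table (rather than assuming the data frame "Mileage" is already loaded) and applies (8.5), (8.6), and (8.7); the variable names are chosen here for illustration.

```r
# Data copied from the table in Example 8.1
fuel    <- c(0.40, 0.43, 0.46, 0.49, 0.52, 0.55, 0.58, 0.61, 0.64, 0.67, 0.70)
mileage <- c(17.42, 17.56, 17.62, 17.69, 18.08, 18.01, 18.01, 18.34, 18.26, 18.50, 18.41)

xbar <- mean(fuel)
Sxy  <- sum(mileage * (fuel - xbar))   # total of the cross-product column
Sxx  <- sum((fuel - xbar)^2)           # total of the squared-deviation column

alpha_hat <- mean(mileage)                   # equation (8.5)
beta1_hat <- Sxy / Sxx                       # equation (8.6)
beta0_hat <- alpha_hat - beta1_hat * xbar    # equation (8.7)

c(Sxy = Sxy, Sxx = Sxx, beta0 = beta0_hat, beta1 = beta1_hat)
```

The last line should reproduce, up to rounding, the intercept and slope reported in the example.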

There are several ways of calculating the least squares estimates in R. If all you want is the estimated values of β0 and β1, the simplest way to get them is

> attach(Mileage)

> coef(lsfit(fuel,mileage))

Intercept X

16.014242 3.593939

Here "Mileage" is the name of the data frame that contains the variables "fuel" and "mileage". The x-variable "fuel" must be entered first as an argument to "lsfit", which does the computational work. "coef" extracts the coefficients from the object returned by "lsfit". The figure below shows the scatterplot and the fitted line. The function "abline" adds a line to an existing plot.

> plot(Mileage)

> attach(Mileage)

> abline(coef(lsfit(fuel,mileage)))

[Figure: scatterplot of mileage (vertical axis) against fuel (horizontal axis) with the fitted least squares line.]

8.2.1 The ”lm” Function in R

The ”lsfit” function calculates the least squares estimates but does not provide all the information needed in linear regression analysis. The R function ”lm” (which stands for linear model) does. We will illustrate its use by re-analyzing the data from the preceding example.

> attach(Mileage)

> mileage.lm=lm(mileage ~ fuel, data=Mileage)

> summary(mileage.lm)

Call:

lm(formula = mileage ~ fuel, data = Mileage)

Residuals:

Min 1Q Median 3Q Max

-0.12000 -0.06982 -0.03182 0.04845 0.19691

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 16.0142 0.1858 86.18 1.93e-14 ***

fuel 3.5939 0.3329 10.79 1.89e-06 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

Residual standard error: 0.1048 on 9 degrees of freedom

Multiple R-squared: 0.9283, Adjusted R-squared: 0.9203

F-statistic: 116.5 on 1 and 9 DF, p-value: 1.887e-06

Much of the output above relates to topics discussed later. The part labeled "Coefficients:" contains the least squares estimates of the intercept β0 and the coefficient β1 of fuel, the x-variable. It also contains information needed for inferences about the true values of those parameters.

"lm" creates an R object called a linear model object, which has been given the name "mileage.lm". The first argument to lm is the model formula, "mileage ~ fuel", which tells R to model mileage as a linear function of fuel. The tilde "~" separates the y-variable from the x-variable. The "data" argument tells R that the variables mileage and fuel are in the data frame "Mileage".

Once the linear model object is created by calling lm, the information is displayed with the summary function. The model object actually contains a great deal of information and the summary only displays the most important part of it. Note that it gives a 5-number summary of the residuals. The residuals are stored as part of the model object. If you want to see all of them, type

> residuals(mileage.lm)
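As a small illustration of what the model object holds, the sketch below rebuilds the data frame from the table in Example 8.1 (so it does not depend on a file being loaded) and pulls out the coefficients, fitted values, and residuals with the standard extractor functions. The check that the residuals sum to zero is a general property of least squares fits that include an intercept.

```r
# Rebuild the Mileage data frame from the table in Example 8.1
Mileage <- data.frame(
  fuel    = c(0.40, 0.43, 0.46, 0.49, 0.52, 0.55, 0.58, 0.61, 0.64, 0.67, 0.70),
  mileage = c(17.42, 17.56, 17.62, 17.69, 18.08, 18.01, 18.01, 18.34, 18.26, 18.50, 18.41)
)
mileage.lm <- lm(mileage ~ fuel, data = Mileage)

coef(mileage.lm)               # least squares estimates of beta0 and beta1
yhat <- fitted(mileage.lm)     # predicted values
e    <- residuals(mileage.lm)  # residuals e_i = Y_i - Yhat_i

# Residuals are observed minus fitted values
all.equal(unname(e), Mileage$mileage - unname(yhat))

sum(e)        # zero, up to floating point rounding
quantile(e)   # minimum, quartiles, and maximum of the residuals
```

The values returned by quantile(e) can be compared with the "Residuals:" line of the summary output above.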

8.2.2 Exercises

1. Fill in the blank cells of the table in Example 8.1 and verify that the numbers given for Sxy and Sxx are correct.

2. Listed below are the log body weights and log brain weights of the primate species in the data set "mammals". Find the equation of the least squares line with y = log brain weight and x = log body weight. Do it by hand, by constructing a table like the one in Example 8.1. Then do it with your calculator as efficiently as possible. Finally, use the lm function in R to do it by creating a linear model object "primates.lm". The model formula is "log(brain) ~ log(body)". You can select the primates and put them in a new data frame by first listing the primate species names:

> primatenames=c("Owl monkey", "Patas monkey", "Gorilla", etc.)

and then

> primates=mammals[primatenames, ]

Your "data" argument in calling lm would be "data=primates", as in

> primates.lm=lm(log(brain)~log(body),data=primates)

log body log brain

Owl monkey -0.7339692 2.740840

Patas monkey 2.3025851 4.744932

Gorilla 5.3327188 6.006353

Human 4.1271344 7.185387

Rhesus monkey 1.9169226 5.187386

Chimpanzee 3.9543159 6.086775

Baboon 2.3561259 5.190175

Verbet 1.4327007 4.060443

Galago -1.6094379 1.609438

Slow loris 0.3364722 2.525729

3. A recently discovered hominid species Homo floresiensis, nicknamed the hobbit, had a body weight of about 25 kilograms. Use the fitted line from the preceding problem to predict its brain weight. Read about H. floresiensis in Wikipedia or some other source. Some of the scientific arguments were very contentious and involved the creature’s brain weight.

4. Repeat problem 2 for the rodent species in ”mammals”. The data are

log body log brain

Mountain beaver 0.30010459 2.0918641

Guinea pig 0.03922071 1.7047481

Chinchilla -0.85566611 1.8562980

Ground squirrel -2.29263476 1.3862944

Arctic ground squirrel -0.08338161 1.7404662

African giant pouched rat 0.00000000 1.8870696

Yellow-bellied marmot 1.39871688 2.8332133

Golden hamster -2.12026354 0.0000000

Mouse -3.77226106 -0.9162907

Rabbit 0.91629073 2.4932055

Rat -1.27296568 0.6418539

Mole rat -2.10373423 1.0986123

5. Solve the equation log(y) = β0 + β1 log(x) for y and simplify. y will be expressed as a power function of x. Find the estimated power functions for primates and rodents.

8.3 Distributions of the Least Squares Estimators

Henceforth, α̂, β̂0, and β̂1 will refer only to the least squares estimators. So far, we have not assumed much about the distribution of the Yi, except for their means and variances. Now we assume in addition that they are independent and normally distributed. Equivalently, we assume that the errors εi are independent and normally distributed with mean 0 and variance σ2. Thus, Y1, Y2, · · · , Yn are independent and Yi ∼ Norm(β0 + β1xi, σ). These assumptions have profound consequences.

Theorem 8.1. If the errors εi in (8.3) are independent and normally distributed with mean 0 and variance σ2, then

(1) \( \hat\alpha \sim \mathrm{Norm}\left(\alpha,\ \sigma/\sqrt{n}\right) \);

(2) \( \hat\beta_1 \sim \mathrm{Norm}\left(\beta_1,\ \sigma/\sqrt{S_{xx}}\right) \);

(3) \( \hat\beta_0 \sim \mathrm{Norm}\left(\beta_0,\ \sigma\sqrt{1/n + \bar{x}^2/S_{xx}}\right) \);

(4) SS(resid)/σ2 ∼ Chisq(df = n − 2);

(5) α̂, β̂1, and SS(resid) are independent random variables.

(6) Let x be a given value of the design variable X and let µ(x) = E(Y |X = x) = α + β1(x − x̄) be the expected value of the response Y when X = x. Let µ̂(x) = α̂ + β̂1(x − x̄) be its estimated value. Then

\[
\hat\mu(x) \sim \mathrm{Norm}\left(\mu(x),\ \sigma\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}\right).
\]

Proof:

We shall prove (1), (2), and (6) only. (3) is the special case of (6) when x = 0. α̂, β̂1, and µ̂(x) are all linear combinations of Y1, Y2, · · · , Yn, which are normally distributed and independent. Therefore, these parameter estimates are normally distributed and we need only calculate their means and variances. First, α̂:

\[
E(\hat\alpha) = E(\bar{Y}) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\sum_{i=1}^{n} \bigl(\alpha + \beta_1 (x_i - \bar{x})\bigr) = \alpha
\]

since \( \sum_{i=1}^{n} (x_i - \bar{x}) = 0 \). Also, by independence and since var(Yi) = σ2,

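\[
\mathrm{var}(\hat\alpha) = \mathrm{var}\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{var}(Y_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.
\]

Next, β̂1:

\[
\begin{aligned}
E(\hat\beta_1) = E\left(\frac{S_{xy}}{S_{xx}}\right)
 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})\,E(Y_i)}{S_{xx}} \\
 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})\bigl(\alpha + \beta_1 (x_i - \bar{x})\bigr)}{S_{xx}} \\
 &= \frac{\beta_1 \sum_{i=1}^{n} (x_i - \bar{x})^2}{S_{xx}}
  = \frac{\beta_1 S_{xx}}{S_{xx}} = \beta_1,
\end{aligned}
\]

again because \( \sum_{i=1}^{n} (x_i - \bar{x}) = 0 \). Since the Yi are independent and have common variance σ2,

\[
\mathrm{var}(\hat\beta_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2\,\mathrm{var}(Y_i)}{S_{xx}^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2\,\sigma^2}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}}.
\]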

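Finally,

E(µ̂(x)) = E(α̂) + E(β̂1)(x − x̄) = α + β1(x − x̄) = µ(x).

Assuming the independence assertion (5),

\[
\mathrm{var}(\hat\mu(x)) = \mathrm{var}(\hat\alpha) + (x - \bar{x})^2\,\mathrm{var}(\hat\beta_1) = \frac{\sigma^2}{n} + \frac{(x - \bar{x})^2 \sigma^2}{S_{xx}} = \sigma^2\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}\right).
\]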
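Part (2) of the theorem can be checked informally by simulation. The sketch below uses made-up design points and parameter values (x, beta0, beta1, and sigma are all hypothetical choices, not taken from the text), generates many data sets from the model, and compares the mean and standard deviation of the simulated slope estimates with β1 and σ/√Sxx.

```r
set.seed(1)                    # arbitrary seed, only for reproducibility
x     <- 1:10                  # hypothetical design points
beta0 <- 2                     # hypothetical true intercept
beta1 <- 0.5                   # hypothetical true slope
sigma <- 1.5                   # hypothetical error standard deviation
n     <- length(x)
Sxx   <- sum((x - mean(x))^2)

# One simulated data set from the model, returning the least squares slope
one_slope <- function() {
  Y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)
  sum(Y * (x - mean(x))) / Sxx
}
slopes <- replicate(5000, one_slope())

c(mean_simulated = mean(slopes), mean_theory = beta1)
c(sd_simulated = sd(slopes), sd_theory = sigma / sqrt(Sxx))
```

The simulated and theoretical values should agree to within simulation error; a histogram of slopes (for example hist(slopes)) should also look approximately normal.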

8.3.1 Exercises

1. The numbers below are the values of the design variable X in a linear regression problem whose true parameter values are β0 = 1 and β1 = −1. The error variance σ2 is 2.55. Find the 5th and 95th percentiles of the distribution of β̂1.

x: -10 -8 -6 -4 -2 0 2 4 6 8 10

2. With the same data, find the 5th and 95th percentiles of α̂.

3. With the same data, find the 5th and 95th percentiles of µ̂(7.5).

4. With the same data, find the 95th percentile of the distribution of SS(resid).

5. Find Pr(|β̂1| > 1.38).

8.4 Inference for the Regression Parameters

The goal in this section is to develop confidence intervals and hypothesis tests for the unknown parameters α, β0, β1, and σ2, the constant variance of the errors εi in (8.3). Let us reconsider equation (8.4) when α̂ and β̂1 are the least squares estimators. Since β̂1 = Sxy/Sxx,

\[
\begin{aligned}
\mathrm{SS(resid)} &= \sum_{i=1}^{n} (Y_i - \hat\alpha)^2 - 2\hat\beta_1 \sum_{i=1}^{n} (Y_i - \hat\alpha)(x_i - \bar{x}) + \hat\beta_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= \sum_{i=1}^{n} (Y_i - \bar{Y})^2 - \hat\beta_1^2 S_{xx}.
\end{aligned}
\]

Write this as

\[
\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \mathrm{SS(resid)} + \hat\beta_1^2 S_{xx},
\]

or

SS(tot) = SS(resid) + SS(regr) (8.8)

The total sum of squares SS(tot) is the total squared variation of the Yi about their average Ȳ. If β1 = 0, then the Yi are just a sample from a single distribution and SS(tot)/(n − 1) is just the usual sample variance. The regression sum of squares \( \hat\beta_1^2 S_{xx} \) can also be written

\[
\mathrm{SS(regr)} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 .
\]

It is the total squared deviation of the fitted values Ŷi from their average, which is also Ȳ. The residual sum of squares and the regression sum of squares are independent random variables. Divide both sides of (8.8) by SS(tot):

\[
1 = \frac{\mathrm{SS(resid)}}{\mathrm{SS(tot)}} + \frac{\mathrm{SS(regr)}}{\mathrm{SS(tot)}}
\]

Definition 8.1. The coefficient of determination R2 is defined as

\[
R^2 = \frac{\mathrm{SS(regr)}}{\mathrm{SS(tot)}} = 1 - \frac{\mathrm{SS(resid)}}{\mathrm{SS(tot)}} .
\]
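The decomposition (8.8) and the definition of R2 are easy to verify numerically. The sketch below uses the data from the table in Example 8.1, typed in directly, and computes the three sums of squares from their definitions; the variable names are chosen here for illustration.

```r
# Data copied from the table in Example 8.1
fuel    <- c(0.40, 0.43, 0.46, 0.49, 0.52, 0.55, 0.58, 0.61, 0.64, 0.67, 0.70)
mileage <- c(17.42, 17.56, 17.62, 17.69, 18.08, 18.01, 18.01, 18.34, 18.26, 18.50, 18.41)
fit <- lm(mileage ~ fuel)

SS_tot   <- sum((mileage - mean(mileage))^2)       # variation of Y about its average
SS_resid <- sum(residuals(fit)^2)                  # variation about the fitted line
SS_regr  <- sum((fitted(fit) - mean(mileage))^2)   # variation of the fitted values

all.equal(SS_tot, SS_resid + SS_regr)              # equation (8.8)

R2 <- SS_regr / SS_tot
c(by_hand = R2, from_summary = summary(fit)$r.squared)
```

Both values of R2 should match the "Multiple R-squared" entry in the summary output of Section 8.2.1.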

Here is the correct interpretation of R2. The total squared variation SS(tot) of the observations Yi about their average can be decomposed into two independent parts. One is the variation

\[
\mathrm{SS(regr)} = \sum_{i=1}^{n} \bigl(\hat\beta_1 (x_i - \bar{x})\bigr)^2
\]

accounted for by the presumed linear relationship and the variation of the inputs xi. The other is the variation SS(resid) that comes from the random deviation from the linear relationship. If SS(regr) is a high percentage of SS(tot), then most of the variation in the Yi comes from the variation in the xi and little of it comes from random error. Thus, R2 is interpreted as a measure of the strength of the association between X and Y.

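Definition 8.2. The mean square residual is

\[
S^2 = \frac{\mathrm{SS(resid)}}{n - 2} .
\]

The mean square residual is also denoted by MS(resid). Its square root S is called the residual standard error.

From (4) of Theorem 8.1, SS(resid)/σ2 has a chi-square distribution with n − 2 degrees of freedom, whose mean is n − 2. Hence

\[
E\left(\frac{\mathrm{SS(resid)}}{\sigma^2}\right) = n - 2 .
\]

It follows that

\[
E\left(S^2\right) = \sigma^2
\]

and S2 is an unbiased estimator of σ2. S is the preferred estimator of σ, though it is not unbiased.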

Recall the definition of the student-t distributions from Chapter 6. If Z ∼ Norm(0, 1) and W ∼ Chisq(df = ν) are independent, then

\[
T = \frac{Z}{\sqrt{W/\nu}}
\]

has the student-t distribution with ν degrees of freedom. In the following theorem, we apply this definition with Z equal to the standardized values of α̂, β̂1, and µ̂(x) (see Theorem 8.1), and with W = SS(resid)/σ2 and ν = n − 2. We then have

\[
\sqrt{W/\nu} = S/\sigma .
\]

Theorem 8.2. Under the assumptions of Theorem 8.1, the following random variables all have student-t distributions with n − 2 degrees of freedom:

\[
\frac{\sqrt{n}\,(\hat\alpha - \alpha)}{S}, \qquad
\frac{\sqrt{S_{xx}}\,(\hat\beta_1 - \beta_1)}{S}, \qquad
\frac{\hat\mu(x) - \mu(x)}{S\sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{S_{xx}}}} .
\]

These are sometimes called the studentized values of the estimators.

8.4.1 Confidence Intervals for the Parameters

From the preceding theorem, we can immediately get confidence intervals for the regression parameters.

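Corollary 8.1. 100(1 − γ)% confidence intervals for α, β1, and µ(x) are:

\[
\hat\alpha \pm t_{\gamma/2}(\nu = n - 2)\,\frac{S}{\sqrt{n}}, \tag{8.9}
\]

\[
\hat\beta_1 \pm t_{\gamma/2}(n - 2)\,\frac{S}{\sqrt{S_{xx}}}, \tag{8.10}
\]

\[
\hat\mu(x) \pm t_{\gamma/2}(n - 2)\,S\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}} . \tag{8.11}
\]

The estimated standard deviations of the estimators, obtained by substituting the estimator S for the unknown parameter σ, are

\[
\mathrm{se}(\hat\alpha) = \frac{S}{\sqrt{n}}, \qquad
\mathrm{se}(\hat\beta_1) = \frac{S}{\sqrt{S_{xx}}}, \qquad
\mathrm{se}(\hat\mu(x)) = S\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}
\]

and are called their standard errors. So, the confidence intervals above may be expressed as:

\[
\hat\alpha \pm t_{\gamma/2}(n - 2)\,\mathrm{se}(\hat\alpha), \qquad
\hat\beta_1 \pm t_{\gamma/2}(n - 2)\,\mathrm{se}(\hat\beta_1), \qquad
\hat\mu(x) \pm t_{\gamma/2}(n - 2)\,\mathrm{se}(\hat\mu(x)).
\]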
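To see formula (8.10) at work, the sketch below computes a 95% confidence interval for the slope by hand for the data of Example 8.1 and compares it with the interval returned by R's confint function. Here tγ/2(n − 2) is taken to be the upper γ/2 critical value, computed as qt(1 - gamma/2, n - 2); the variable names are illustrative.

```r
# Data copied from the table in Example 8.1
fuel    <- c(0.40, 0.43, 0.46, 0.49, 0.52, 0.55, 0.58, 0.61, 0.64, 0.67, 0.70)
mileage <- c(17.42, 17.56, 17.62, 17.69, 18.08, 18.01, 18.01, 18.34, 18.26, 18.50, 18.41)
fit <- lm(mileage ~ fuel)

n     <- length(mileage)
Sxx   <- sum((fuel - mean(fuel))^2)
S     <- sqrt(sum(residuals(fit)^2) / (n - 2))   # residual standard error
b1    <- unname(coef(fit)["fuel"])               # least squares slope
se_b1 <- S / sqrt(Sxx)                           # standard error of the slope

gamma <- 0.05                                    # for a 95% interval
tcrit <- qt(1 - gamma / 2, df = n - 2)
ci_by_hand <- b1 + c(-1, 1) * tcrit * se_b1      # equation (8.10)

ci_by_hand
confint(fit)["fuel", ]                           # R's built-in interval
```

The values of S and se_b1 can be compared with the "Residual standard error" and "Std. Error" entries in the summary output of Section 8.2.1.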

8.4.2 Hypothesis Tests for the Parameters

Tests of hypotheses about the regression parameters are based on the student-t distribution of the studentized parameter estimates. By far the most important hypothesis to be tested is the null hypothesis that the slope parameter β1 is equal to zero. If β1 = 0 and the other assumptions of the linear regression model hold, then the distribution of the response Y does not depend on the value of the design variable X. We will consider the more general null hypothesis

H0 : β1 = β10

and the two-sided alternative

H1 : β1 ≠ β10.

The p-value is

Pr(|T| > |Tobs|),

where T ∼ TDist(df = n − 2) and

\[
T_{obs} = \frac{\sqrt{S_{xx}}\,\bigl(\hat\beta_1(\mathrm{obs}) - \beta_{10}\bigr)}{S}
\]

is the observed value of the studentized β̂1 in Theorem 8.2. For one-sided alternatives, eliminate the absolute value signs and adjust the direction of the inequality appropriately.

Example 8.2. The table below shows vacancy rates (percentage of apartments that are vacant for over 1 month) and rental rates per 10 square feet in 30 American cities.

Vacancy 3.0 11.00 17.00 2.0 7.00 18.00 9.00 12.0 13.00 13.00

Rent 21.7 16.42 14.84 24.9 16.62 11.75 19.33 14.6 12.01 19.88

Vacancy 8.00 16.00 10.00 3.00 12.00 17.00 3.00 11.00 16.00 20.00

Rent 19.83 17.78 17.79 17.08 19.39 15.81 21.15 12.33 15.58 10.83

Vacancy 20.00 14.00 8.00 5.00 2.00 20.00 15.00 19.00 2.00 14.00

Rent 17.38 15.09 23.19 19.27 14.88 16.27 18.53 14.25 17.68 19.74

Find a 90 % confidence interval for the expected rental rate when the vacancy rate is 15 %. Test the null hypothesis that the expected increase in Rent for a unit increase in Vacancy is 0.

Solution: What is required is a 90 % confidence interval for µ(15) where the response Y = Rent and the expected response is a linear function of X = Vacancy. The expected increase in Rent for a unit increase in Vacancy is the slope parameter β1. The data is in the text file ”rents”. We will import it into R and analyze it with R’s full capabilities later. For now we will show all the steps in obtaining the confidence interval. You are encouraged to follow the steps with your calculator.

You can verify that Ȳ = 17.197, x̄ = 11.333, Sxx = 1028.667, and Sxy = −312.507. These should all be easy to obtain with the mean, variance, and covariance buttons on your calculator. The least squares estimates of the parameters are:

\[
\hat\alpha = \bar{Y} = 17.197
\]

\[
\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = -\frac{312.507}{1028.667} = -0.304
\]

and the fitted line is

y = 17.197 − 0.304(x − 11.333).

Next, we need to find S, the residual standard error, and for that we need SS(resid). It isn't necessary to sum the squares of the residuals. Instead, we will use equation (8.8) in the form

SS(resid) = SS(tot) − SS(regr).

SS(tot) is n − 1 times the sample variance of the responses Yi. Its value is 326.087. SS(regr) has essentially been calculated:

\[
\mathrm{SS(regr)} = \hat\beta_1^2 S_{xx} = \frac{S_{xy}^2}{S_{xx}} = 94.939
\]

Thus,

\[
\mathrm{SS(resid)} = 326.087 - 94.939 = 231.148
\]

\[
\mathrm{MS(resid)} = \frac{\mathrm{SS(resid)}}{n - 2} = \frac{231.148}{28} = 8.255
\]

\[
S = \sqrt{\mathrm{MS(resid)}} = 2.873
\]

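Now we can calculate the coefficient of determination R2 and the standard errors.

\[
R^2 = \frac{\mathrm{SS(regr)}}{\mathrm{SS(tot)}} = \frac{94.939}{326.087} = 0.2911 .
\]

\[
\mathrm{se}(\hat\mu(x)) = S\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}} = 2.873\sqrt{\frac{1}{30} + \frac{(15 - 11.333)^2}{1028.667}} = 0.619
\]

and

\[
\mathrm{se}(\hat\beta_1) = \frac{S}{\sqrt{S_{xx}}} = \frac{2.873}{32.073} = 0.090
\]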

The predicted Rent for Vacancy = 15 is

µ̂(15) = 17.197 − 0.304(15 − 11.333) = 16.082.

The 90% confidence interval for µ(15), the expected Rent when Vacancy = 15, is

µ̂(15) ± t.05(28) se(µ̂(15)) = 16.082 ± 1.053,

i.e., (15.029, 17.135).

The observed T statistic for testing the null hypothesis H0 : β1 = 0 is

\[
T_{obs} = \frac{\hat\beta_1 - 0}{\mathrm{se}(\hat\beta_1)} = \frac{-0.304}{0.090} = -3.378
\]

Since the alternative H1 : β1 ≠ 0 is 2-sided, the p-value is

Pr(|T| > 3.378) = 2 Pr(T > 3.378) = 0.0022,

which is a highly significant result. We can conclude that β1 ≠ 0. The scatterplot and fitted line are shown below.

[Figure: scatterplot of Rent (vertical axis) against Vacancy (horizontal axis) with the fitted least squares line.]
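The hand calculation in Example 8.2 can be retraced in R without calling lm. The sketch below types in the 30 pairs from the table in the example (paired column by column) and follows the same steps; the variable names are illustrative. Because R carries full precision, the test statistic comes out slightly different from the value −3.378 obtained above with rounded intermediate results.

```r
# Data copied from the table in Example 8.2, paired column by column
Vacancy <- c(3, 11, 17, 2, 7, 18, 9, 12, 13, 13,
             8, 16, 10, 3, 12, 17, 3, 11, 16, 20,
             20, 14, 8, 5, 2, 20, 15, 19, 2, 14)
Rent <- c(21.7, 16.42, 14.84, 24.9, 16.62, 11.75, 19.33, 14.6, 12.01, 19.88,
          19.83, 17.78, 17.79, 17.08, 19.39, 15.81, 21.15, 12.33, 15.58, 10.83,
          17.38, 15.09, 23.19, 19.27, 14.88, 16.27, 18.53, 14.25, 17.68, 19.74)

n   <- length(Rent)
Sxx <- sum((Vacancy - mean(Vacancy))^2)
Sxy <- sum(Rent * (Vacancy - mean(Vacancy)))
b1  <- Sxy / Sxx                                   # least squares slope

SS_tot   <- sum((Rent - mean(Rent))^2)
SS_regr  <- b1^2 * Sxx
SS_resid <- SS_tot - SS_regr                       # equation (8.8)
S        <- sqrt(SS_resid / (n - 2))               # residual standard error

T_obs   <- sqrt(Sxx) * (b1 - 0) / S                # studentized slope under H0: beta1 = 0
p_value <- 2 * pt(-abs(T_obs), df = n - 2)         # two-sided p-value

c(Sxx = Sxx, Sxy = Sxy, b1 = b1, S = S, T_obs = T_obs, p_value = p_value)
```

For a one-sided alternative such as H1 : β1 < 0, the p-value would instead be pt(T_obs, df = n - 2).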

The F Test for Significance of Regression

Definition 8.3. The F distribution with ν1 degrees of freedom in the numerator and ν2 degrees of freedom in the denominator is the distribution of a random variable

\[
F = \frac{U/\nu_1}{V/\nu_2},
\]

where U ∼ Chisq(df = ν1) and V ∼ Chisq(df = ν2) are independent. That F has this distribution is indicated by F ∼ FDist(ν1, ν2).

It follows immediately from the definition that if F ∼ FDist(ν1, ν2), then 1/F ∼ FDist(ν2, ν1). Tables of the F distributions are included in most textbooks, but because two parameters have to be specified, they tend to be rather coarse. We will use the R functions "qf" and "pf" for evaluating the quantile function and the cumulative distribution. For example, the 95th percentile of FDist(ν1 = 20, ν2 = 30) and Pr(F ≤ 1.4) for the same distribution are

> qf(.95,20,30)

[1] 1.931653

> pf(1.4,20,30)

[1] 0.8024855

Now, if β1 = 0, then U = SS(regr)/σ2 ∼ Chisq(df = ν1 = 1). Also, V = SS(resid)/σ2 ∼ Chisq(df = ν2 = n − 2), and U and V are independent. Furthermore, V/ν2 = MS(resid)/σ2. Although it may seem pointless, divide SS(regr) by its degrees of freedom ν1 = 1 and call it MS(regr). Then, when β1 = 0,

\[
F = \frac{\mathrm{MS(regr)}}{\mathrm{MS(resid)}}
\]

has the F distribution with 1 degree of freedom in the numerator and n − 2 degrees of freedom in the denominator. It is in fact the square of the student-t statistic for testing H0 : β1 = 0 against H1 : β1 ≠ 0. Therefore, we reject H0 and accept H1 if F is larger than a critical value, or if its p-value is smaller than a given significance level.
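The claim that F is the square of the student-t statistic can be checked on any fitted model. The sketch below does so for the data of Example 8.1, typed in directly; it reads the t value and the F statistic from the object returned by summary and also compares the two p-values.

```r
# Data copied from the table in Example 8.1
fuel    <- c(0.40, 0.43, 0.46, 0.49, 0.52, 0.55, 0.58, 0.61, 0.64, 0.67, 0.70)
mileage <- c(17.42, 17.56, 17.62, 17.69, 18.08, 18.01, 18.01, 18.34, 18.26, 18.50, 18.41)
fit <- lm(mileage ~ fuel)
n   <- length(mileage)

s       <- summary(fit)
t_slope <- coef(s)["fuel", "t value"]       # student-t statistic for H0: beta1 = 0
F_stat  <- unname(s$fstatistic["value"])    # F statistic for significance of regression

c(t_squared = t_slope^2, F = F_stat)

# The two tests also give the same p-value
p_t <- 2 * pt(-abs(t_slope), df = n - 2)
p_F <- pf(F_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)
c(p_from_t = p_t, p_from_F = p_F)
```

Note that the F test is inherently two-sided in β1; for a one-sided alternative the t statistic must be used.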

Example 8.2 in R

We will use R’s ”lm” function to answer the questions in the preceding example. First, we create the linear model object and give it a name, then call the summary function for that object.

> rents.lm=lm(Rent~Vacancy,data=rents)

> summary(rents.lm)

Call:

lm(formula = Rent ~ Vacancy, data = rents)

Residuals:

Min 1Q Median 3Q Max

-5.1521 -2.2374 0.1688 1.9937 4.9807

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 20.63971 1.14279 18.061 < 2e-16 ***

Vacancy -0.30380 0.08958 -3.391 0.00209 **

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.873 on 28 degrees of freedom

Multiple R-squared: 0.2911, Adjusted R-squared: 0.2658

F-statistic: 11.5 on 1 and 28 DF, p-value: 0.002089

The "Coefficients" section of the output shows the estimated intercept and slope parameters, their standard errors, the observed student-t statistics associated with them, and the p-values for the 2-sided alternative to the null hypothesis that the parameter is equal to 0. Compare the numbers derived in Example 8.2 to those in the line preceded by "Vacancy". Except for some slight roundoff error they are the same. Also notice that the residual standard error and the value of R2 ("Multiple R-squared") agree with those in the example. The F-statistic at the very bottom of the summary has the same p-value as the t-test statistic for H0 : β1 = 0, as it should. The adjusted R-squared value can be ignored as it is useful only in multiple regression problems with more than one design variable.

Since the parameter estimates and their standard errors are given in the output, it would be easy to calculate confidence intervals for the parameters, e.g.,

\[
\hat\beta_1 \pm t_{\gamma/2}(n - 2)\,\mathrm{se}(\hat\beta_1).
\]

However, this is unnecessary because R will do it for us. To get 95% confidence intervals for both β0 and β1, use the ”confint” function with the name of the model object as an argument.

> confint(rents.lm)

2.5 % 97.5 %

(Intercept) 18.2988060 22.980611

Vacancy -0.4873016 -0.120294

If you want confidence intervals with a level other than 95% you must enter the ”level” argument to the function. For example, for 90% confidence intervals,

> confint(rents.lm,level=.90)

5 % 95 %

(Intercept) 18.6956704 22.5837464

Vacancy -0.4561913 -0.1514043

R will also give you a confidence interval for µ(x). To repeat the results of the example,

> predict(rents.lm,newdata=data.frame(Vacancy=15),interval="c",level=.90)

fit lwr upr

1 16.08274 15.02987 17.13562

The first argument in ”predict” is the name of the fitted linear model object. The ”newdata” argument has to be given in the way shown above. The new value(s) of X must be given the same name as the X variable on the right of the model formula and it must be inside the data.frame function as indicated above. The ”interval” argument is either ”c” for a confidence interval, or ”p” for a prediction interval. The ”level” argument is to specify the confidence level. Its default is 95%.

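A new observation of the response Y at X = x is contained in the prediction interval

\[
\hat\mu(x) \pm t_{\gamma/2}(n - 2)\,S\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}
\]

with probability 1 − γ. The prediction interval is wider than the confidence interval because it accounts for the variance σ2 of the new observation itself, in addition to the uncertainty in the estimate µ̂(x).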
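The two kinds of interval can be compared side by side. The sketch below rebuilds the "rents" data frame from the table in Example 8.2 (instead of importing the text file), asks predict for both intervals at Vacancy = 15, and recomputes the prediction interval from the formula above. The names conf_int, pred_int, and half_width are illustrative.

```r
# Rebuild the rents data frame from the table in Example 8.2
rents <- data.frame(
  Vacancy = c(3, 11, 17, 2, 7, 18, 9, 12, 13, 13,
              8, 16, 10, 3, 12, 17, 3, 11, 16, 20,
              20, 14, 8, 5, 2, 20, 15, 19, 2, 14),
  Rent = c(21.7, 16.42, 14.84, 24.9, 16.62, 11.75, 19.33, 14.6, 12.01, 19.88,
           19.83, 17.78, 17.79, 17.08, 19.39, 15.81, 21.15, 12.33, 15.58, 10.83,
           17.38, 15.09, 23.19, 19.27, 14.88, 16.27, 18.53, 14.25, 17.68, 19.74)
)
rents.lm <- lm(Rent ~ Vacancy, data = rents)
new <- data.frame(Vacancy = 15)

conf_int <- predict(rents.lm, newdata = new, interval = "confidence", level = 0.90)
pred_int <- predict(rents.lm, newdata = new, interval = "prediction", level = 0.90)
rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])

# Prediction interval from the formula, with gamma = 0.10
n    <- nrow(rents)
S    <- summary(rents.lm)$sigma                       # residual standard error
xbar <- mean(rents$Vacancy)
Sxx  <- sum((rents$Vacancy - xbar)^2)
half_width <- qt(0.95, df = n - 2) * S * sqrt(1 + 1 / n + (15 - xbar)^2 / Sxx)
conf_int[1, "fit"] + c(-1, 1) * half_width
```

Both intervals are centered at the same fitted value; only their widths differ.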

8.4.3 Exercises

1. Show that

\[
F = (n - 2)\,\frac{R^2}{1 - R^2} .
\]

2. The data below gives the responses Y to a sample of values of a design variable X. Without using R, except to check your work, find the following: (a) the estimates α̂, β̂1, β̂0, (b) all three terms in SS(tot) = SS(resid) + SS(regr), (c) the residual standard error, (d) the standard errors of the estimates, and (e) R2.

3. With the same data, calculate the student-t test statistics for the hypotheses that the intercept and slope parameters are equal to 0. Calculate the F statistic for significance of regression. Calculate their p-values. (You can use R’s ”pt” and ”pf” functions for that.)

4. Without using R, find 95% confidence intervals for β0 and β1.

X Y

1 7.44 1.61

2 5.36 -0.22

3 6.37 0.21

4 5.46 0.22

5 5.73 -0.76

6 6.00 0.74

7 5.66 1.66

8 4.57 1.13

9 7.47 3.24

10 5.72 -0.47

5. With the ”primates” data find a 90% prediction interval for the brain weight of a primate whose body weight is 25 kg. Hint: First find the prediction interval for log brain weight. Use R.

6. The R datasets package has a data frame called "airquality", which lists ozone concentration, solar radiation, wind speed and temperature in New York for 153 days in 1973. Some of the data values are missing, but R will automatically omit those cases with missing data. Fit a linear model with Ozone as the response and Wind as the X variable. Find 90% confidence intervals for the expected ozone concentration when wind speed is 0 and for the expected increase in ozone concentration for a unit increase in wind speed.

7. With the airquality data, test the null hypothesis H0 : β1 = −5 against the one-sided alternative β1 > −5. The output of lm does not give you the answer directly, but it does give you the estimated value of β1 and its standard error. You know that the test statistic has the student-t distribution with n − 2 degrees of freedom. Give a p-value. Warning: Because of missing data, n = 116, not 153.

8.5 Correlation

In an observational study, with X and Y jointly distributed random variables, modeling Y as a function of X and predicting the response for new values of X might not be important. Instead, one might simply want a measure of the linear association between X and Y. One such measure is the correlation between X and Y, more formally called the Pearson product-moment correlation:

\[
\rho = \mathrm{cor}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_x \sigma_y}
\]
The correlation is a parameter, a characteristic of the joint distribution of X and Y . If X and Y have a bivariate normal distribution, ρ enters explicitly in the formula for the joint density function and in the expression for the conditional expectation of Y , given X = x,

\[
E(Y \mid X = x) = \mu_y + \rho\,\frac{\sigma_y}{\sigma_x}\,(x - \mu_x),
\]

from which we see that β1 = ρσy/σx in the regression equation. This means that the regression null hypothesis H0 : β1 = 0 is equivalent to H0 : ρ = 0.

The sample correlation is

\[
R = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}, \tag{8.12}
\]

where Sxy and Sxx are the same as before and Syy is our new name for SS(tot):

\[
S_{yy} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 .
\]

It is simple algebra to show that

\[
R = \hat\beta_1 \sqrt{\frac{S_{xx}}{S_{yy}}} .
\]

Recall the student-t statistic for testing H0 : β1 = 0:

\[
T = \frac{\hat\beta_1 \sqrt{S_{xx}}}{S} .
\]

It is just more simple algebra to write this as

\[
T = \frac{R\sqrt{n - 2}}{\sqrt{1 - R^2}} .
\]

Therefore, if X and Y have a bivariate normal distribution, a test of H0 : ρ = 0 against H1 : ρ ≠ 0 is to reject H0 when |T| > tγ/2(n − 2) or when the p-value based on the student-t distribution TDist(df = n − 2) is smaller than the chosen significance level. We have three equivalent tests in the case of a bivariate normal distribution: the T test for H0 : β1 = 0, the T test for H0 : ρ = 0, and the F test for significance of regression.
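As a quick numerical check of the last formula, the sketch below plugs in summary values reported for the rents data in Section 8.4: R2 = 0.2911 ("Multiple R-squared"), n = 30, and a negative slope, so that R is the negative square root of R2. Since R2 is rounded to four digits, the result only approximately matches the t value in that output.

```r
R2 <- 0.2911          # "Multiple R-squared" from summary(rents.lm)
n  <- 30              # number of cities (28 residual degrees of freedom)
slope_sign <- -1      # the fitted slope for Vacancy is negative

R <- slope_sign * sqrt(R2)                       # sample correlation
T_from_R <- R * sqrt(n - 2) / sqrt(1 - R2)       # student-t statistic written in terms of R
p_value  <- 2 * pt(-abs(T_from_R), df = n - 2)   # two-sided p-value for H0: rho = 0

c(R = R, T = T_from_R, T_squared = T_from_R^2, p_value = p_value)
```

The squared statistic can be compared with the F-statistic in the same summary output, in line with the equivalence of the three tests.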

8.5.1 Confidence intervals for ρ

The following theorem is proved by a complicated argument that starts with the central limit theorem. We will accept its conclusion without proof.

Theorem 8.3. As the sample size n → ∞, the distribution of

\[
Z = \sqrt{n - 3}\,\bigl(\hat\psi - \psi\bigr)
\]

approaches standard normal, where

\[
\psi = \frac{1}{2}\ln\frac{1 + \rho}{1 - \rho} \tag{8.13}
\]

\[
\hat\psi = \frac{1}{2}\ln\frac{1 + R}{1 - R} \tag{8.14}
\]

Thus a large sample 100(1 − γ)% confidence interval for ψ is

\[
\hat\psi \pm \frac{z_{\gamma/2}}{\sqrt{n - 3}} .
\]

The function ln((1 + ρ)/(1 − ρ)) is strictly increasing for −1 < ρ < 1 and the expression (8.13) can be inverted to give

\[
\rho = \frac{e^{\psi} - e^{-\psi}}{e^{\psi} + e^{-\psi}} = \tanh(\psi),
\]

the hyperbolic tangent function. So, a large sample 100(1 − γ)% confidence interval for ρ is

\[
\left( \tanh\left(\hat\psi - \frac{z_{\gamma/2}}{\sqrt{n - 3}}\right),\ \tanh\left(\hat\psi + \frac{z_{\gamma/2}}{\sqrt{n - 3}}\right) \right).
\]

The R function for testing hypotheses and obtaining confidence intervals for the correlation between two variables is ”cor.test”. We will illustrate it with the variables ”Vacancy” and ”Rent” in the data frame ”rents”.

> attach(rents)

> cor.test(Vacancy,Rent)

Pearson’s product-moment correlation

data: Vacancy and Rent

t = -3.3912, df = 28, p-value = 0.002089

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

-0.7533935 -0.2225779

sample estimates:

cor

-0.5395793

The confidence level is adjustable with a ”conf.level” argument, e.g., ”conf.level=.90”.
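The confidence interval printed by cor.test can be reproduced from the formulas of this section. The sketch below starts from the sample correlation and sample size reported in the output above and applies (8.14) and the tanh inversion; R's built-in atanh is the same function as (1/2) ln((1 + R)/(1 − R)).

```r
# Values reported by cor.test(Vacancy, Rent) above
R <- -0.5395793     # sample correlation
n <- 30             # sample size (df = 28 in the output, so n = 30)
gamma <- 0.05       # for a 95% interval

psi_hat <- atanh(R)                                # equation (8.14)
half    <- qnorm(1 - gamma / 2) / sqrt(n - 3)      # z_{gamma/2} / sqrt(n - 3)
ci_rho  <- tanh(psi_hat + c(-1, 1) * half)         # transform back to the rho scale

ci_rho
```

Changing gamma to 0.10 gives the interval that cor.test would report with conf.level=.90.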

8.5.2 Exercises

1. With the airquality data find a 99 % confidence interval for the correlation between Ozone and Wind.

2. With the observations of X and Y in Exercises 8.4.3, find the p-value for the test that the correlation between X and Y is zero against a two-sided alternative.

Chapter 9

Inference from Multiple Samples

9.1 Comparison of Two Population Means

Let X1, X2, · · · , Xm be a random sample from a distribution with mean µx and standard deviation σx. Let Y1, Y2, · · · , Yn be a random sample from another distribution with mean µy and standard deviation σy. The two samples are independent of each other, so everything in sight is independent of everything else. Our goal is to find confidence intervals for the difference µx−µy of the population means or to test the null hypothesis H0 : µx = µy that they are equal.

Let X̄, Ȳ, Sx, and Sy denote the sample means and standard deviations of the two samples. Naturally, our inferences will be based on the difference X̄ − Ȳ between the sample averages. The expected value and variance of this difference are:

\[
E(\bar{X} - \bar{Y}) = E(\bar{X}) - E(\bar{Y}) = \mu_x - \mu_y \tag{9.1}
\]

\[
\mathrm{var}(\bar{X} - \bar{Y}) = \mathrm{var}(\bar{X}) + \mathrm{var}(\bar{Y}) = \frac{\sigma_x^2}{m} + \frac{\sigma_y^2}{n} \tag{9.2}
\]

The standard deviation of X̄ − Ȳ is

\[
\mathrm{sd}(\bar{X} - \bar{Y}) = \sqrt{\frac{\sigma_x^2}{m} + \frac{\sigma_y^2}{n}}
\]

and its standard error, obtained by estimating σx and σy by Sx and Sy, respectively, is

\[
\mathrm{se}(\bar{X} - \bar{Y}) = \sqrt{\frac{S_x^2}{m} + \frac{S_y^2}{n}} \tag{9.3}
\]

9.1.1 Large Samples

The central limit theorem applies not just to each sample average separately, but also to their difference. In other words, for large sample sizes m and n, the distribution of

\[
Z = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\mathrm{sd}(\bar{X} - \bar{Y})} \tag{9.4}
\]

is approximately standard normal. More usefully, since the population variances are not usually known, the distribution of the studentized value

\[
T = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\mathrm{se}(\bar{X} - \bar{Y})} \tag{9.5}
\]

approaches standard normal as m, n → ∞. If zα/2 is the 100(1 − α/2) percentile of the standard normal distribution, then approximately

Pr(−zα/2 < T < zα/2) = 1 − α.

Theorem 9.1. For large sample sizes m and n, a 100(1 − α)% confidence interval for $\mu_x - \mu_y$ is

$$\bar{X} - \bar{Y} \pm z_{\alpha/2} \sqrt{\frac{S_x^2}{m} + \frac{S_y^2}{n}}.$$

Theorem 9.2. For large sample sizes m and n, a test of significance level α for the null hypothesis $H_0 : \mu_x = \mu_y$ against the two sided alternative $H_1 : \mu_x \neq \mu_y$ rejects $H_0$ when

$$|T| > z_{\alpha/2},$$

i.e., when

$$\left|\bar{X} - \bar{Y}\right| > z_{\alpha/2} \sqrt{\frac{S_x^2}{m} + \frac{S_y^2}{n}}.$$

For the one sided alternative, $H_1 : \mu_x > \mu_y$, reject $H_0$ when

$$\bar{X} - \bar{Y} > z_{\alpha} \sqrt{\frac{S_x^2}{m} + \frac{S_y^2}{n}}.$$

The p-value for the two-sided alternative is $2(1 - \Phi(|T_{obs}|))$, where Φ is the standard normal cumulative distribution.

Example 9.1. The data set ”lungcap” has measurements of forced expiratory volume (fev), a measure of lung capacity, for 85 male subjects between the ages of 16 and 18. Thirty five were smokers and 50 were not. The mean and variance for the smoking group were 3.624 and 0.084. For the nonsmokers they were 3.747 and 0.120. Find a 95% confidence interval for the difference between the mean fev of nonsmokers and smokers for this age and gender population.

Solution: The sample sizes m = 50 and n = 35 should be large enough to have confidence in the central limit theorem. We will apply Theorems 9.1 and 9.2. Later we will confirm our conclusions with other procedures.

The standard error of $\bar{X} - \bar{Y}$ is

$$\operatorname{se}(\bar{X} - \bar{Y}) = \sqrt{\frac{S_x^2}{m} + \frac{S_y^2}{n}} = \sqrt{\frac{0.120}{50} + \frac{0.084}{35}} = 0.069$$

So, the 95% confidence interval is

$$\bar{X} - \bar{Y} \pm z_{.025} \times \operatorname{se}(\bar{X} - \bar{Y}) = 3.747 - 3.624 \pm 1.96(0.069) = 0.123 \pm 0.135,$$

i.e., the interval (−0.012, 0.258).
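The arithmetic in Example 9.1 can be checked in R from the summary statistics alone. This is a minimal sketch; the variable names (xbar, s2x, and so on) are illustrative and are not part of the "lungcap" data set.

```r
# Large-sample confidence interval for mu_x - mu_y (Theorem 9.1),
# using only the summary statistics quoted in Example 9.1.
xbar <- 3.747; s2x <- 0.120; m <- 50   # nonsmokers: mean, variance, size
ybar <- 3.624; s2y <- 0.084; n <- 35   # smokers: mean, variance, size

se <- sqrt(s2x / m + s2y / n)          # standard error, formula (9.3)
z  <- qnorm(1 - 0.05 / 2)              # z_{.025}, about 1.96
ci <- (xbar - ybar) + c(-1, 1) * z * se

round(se, 3)
round(ci, 3)
```

Because the code carries more decimal places than the hand calculation, the end points may differ from the text in the third decimal place.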

9.1.2 Comparing Two Population Proportions

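If the $X_i$ and $Y_j$ are samples of Bernoulli random variables with success probabilities $p_x$ and $p_y$, then the averages $\bar{X}$ and $\bar{Y}$ are just the sample proportions $\hat{p}_x$ and $\hat{p}_y$. Their variances are, respectively,

$$\operatorname{var}(\hat{p}_x) = \frac{p_x(1 - p_x)}{m} \quad \text{and} \quad \operatorname{var}(\hat{p}_y) = \frac{p_y(1 - p_y)}{n}.$$

The standard deviation of $\hat{p}_x - \hat{p}_y = \bar{X} - \bar{Y}$ is

$$\operatorname{sd}(\hat{p}_x - \hat{p}_y) = \sqrt{\frac{p_x(1 - p_x)}{m} + \frac{p_y(1 - p_y)}{n}}$$

and its standard error is

$$\operatorname{se}(\hat{p}_x - \hat{p}_y) = \sqrt{\frac{\hat{p}_x(1 - \hat{p}_x)}{m} + \frac{\hat{p}_y(1 - \hat{p}_y)}{n}}.$$

Therefore, by Theorem 9.1, if m and n are large, a 100(1 − α)% confidence interval for the difference $p_x - p_y$ in the population proportions is

$$\hat{p}_x - \hat{p}_y \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_x(1 - \hat{p}_x)}{m} + \frac{\hat{p}_y(1 - \hat{p}_y)}{n}} \tag{9.6}$$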

Example 9.2. In a sample of m = 40 older houses in New Orleans, 12 had termite damage. In a sample of n = 60 older houses in Houston, 14 had termite damage. Find a 90 % confidence interval for the difference in the incidence of termite damage in New Orleans and Houston.

Solution: The sample incidences are $\hat{p}_x = 12/40 = 0.30$ for New Orleans and $\hat{p}_y = 14/60 = 0.233$ for Houston. Therefore, the 90% confidence interval is

$$0.30 - 0.233 \pm z_{.05} \sqrt{\frac{.30(.70)}{40} + \frac{.233(.767)}{60}} = 0.067 \pm 1.645(0.091) = 0.067 \pm 0.149$$
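The interval in Example 9.2 can be reproduced directly from formula (9.6). The sketch below assumes nothing beyond the counts given in the example; the names x, size, and phat are illustrative.

```r
# 90% large-sample interval for p_x - p_y, formula (9.6), Example 9.2.
x    <- c(12, 14)          # houses with termite damage (New Orleans, Houston)
size <- c(40, 60)          # sample sizes m and n
phat <- x / size           # sample proportions

se <- sqrt(sum(phat * (1 - phat) / size))   # unpooled standard error
ci <- (phat[1] - phat[2]) + c(-1, 1) * qnorm(0.95) * se
round(ci, 3)
```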

Testing Equality of Population Proportions

To test the null hypothesis that two population proportions are equal, $H_0 : p_x = p_y$, one could apply Theorem 9.2 adapted for Bernoulli samples. However, there is a test that is slightly more powerful, that is, less likely to make a type 2 error. If $H_0$ is true, and $p_x = p_y = p$, then a better estimator of the common value p than either $\hat{p}_x$ or $\hat{p}_y$ is their weighted average

$$\hat{p} = \frac{m\hat{p}_x + n\hat{p}_y}{m + n},$$

called the pooled estimator of p. If $H_0$ is true then the standard deviation and standard error of $\hat{p}_x - \hat{p}_y$ are

$$\operatorname{sd}(\hat{p}_x - \hat{p}_y) = \sqrt{p(1 - p)\left(\frac{1}{m} + \frac{1}{n}\right)}$$

$$\operatorname{se}(\hat{p}_x - \hat{p}_y) = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{m} + \frac{1}{n}\right)} \tag{9.7}$$

For large m and n and significance level α, reject $H_0$ for $H_1 : p_x \neq p_y$ if

$$|\hat{p}_x - \hat{p}_y| > z_{\alpha/2} \operatorname{se}(\hat{p}_x - \hat{p}_y) \tag{9.8}$$

For a one sided alternative, erase the absolute value signs, replace zα/2 with zα and reverse the inequality, if necessary.

Comparing Proportions with R

Previously we used the ”prop.test” function in R to get confidence intervals and test hypotheses for a single proportion. The same function works for comparing two population proportions. To illustrate, we will rework the preceding example.

> prop.test(c(12,14),c(40,60),conf.level=.90,correct=F)

2-sample test for equality of proportions without

continuity correction

data: c(12, 14) out of c(40, 60)

X-squared = 0.5544, df = 1, p-value = 0.4565

alternative hypothesis: two.sided

90 percent confidence interval:

-0.08256681 0.21590014

sample estimates:

prop 1 prop 2

0.3000000 0.2333333

Except for roundoff error, the confidence interval is the same as the one we derived by hand. The p-value of 46% means that there is no evidence that the two cities differ in infestation rates. The first argument to prop.test, c(12,14), is the vector of counts of successes in the two samples and the second argument, c(40,60), is the vector of the sample sizes, or numbers of trials in the two binomial experiments. The argument "correct = F" prevents R from applying the Yates continuity correction, so that the answers are the same as before. There is some dispute about whether the Yates correction is desirable, but it is the default option in R and it does cause slightly different answers to be returned. The number labelled "X-squared" is essentially the square of the standardized test statistic. Therefore, under H0, it has approximately a chi-square distribution with 1 degree of freedom.
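The relation between the pooled test statistic and the "X-squared" value can be checked by hand. The following sketch computes the standardized statistic from (9.7) for the termite data; the variable names are illustrative.

```r
# Pooled two-proportion z test for the termite data, formulas (9.7)-(9.8).
x    <- c(12, 14)
size <- c(40, 60)
phat <- x / size

pooled <- sum(x) / sum(size)                         # pooled estimator of p
se0    <- sqrt(pooled * (1 - pooled) * sum(1 / size)) # standard error under H0
z      <- (phat[1] - phat[2]) / se0                  # standardized statistic
pval   <- 2 * (1 - pnorm(abs(z)))                    # two-sided p-value

c(z = z, z_squared = z^2, p_value = pval)
```

The squared statistic and the p-value should agree with the X-squared and p-value lines of the prop.test output shown earlier.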

9.1.3 Samples from Normal Distributions

When m and n are not large enough for reliance on the central limit theorem, but the two distributions are nearly normal, there are procedures based on student-t distributions that are similar to the procedures for inferences about a single population mean.

The Welch Test and Confidence Interval

Theorem 9.3. If the samples $\{X_i\}$ and $\{Y_j\}$ are from normal distributions, then the statistic T in (9.5) has an approximate student-t distribution with degrees of freedom ν equal to

$$\nu = \frac{\left(\dfrac{S_x^2}{m} + \dfrac{S_y^2}{n}\right)^2}{\dfrac{S_x^4}{m^2(m-1)} + \dfrac{S_y^4}{n^2(n-1)}}. \tag{9.9}$$

Though the student-t distribution is only an approximation, the test for the null hypothesis H0 : µx = µy associated with this theorem performs very well in small sample situations. The test is known as Welch’s t test. If you are using a table of the student-t distributions, round ν given above to the nearest integer. Student-t distributions with fractional degrees of freedom are perfectly legitimate and most software is capable of handling them.

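Using this result, the Welch 100(1 − α)% confidence interval for $\mu_x - \mu_y$ is

$$\bar{X} - \bar{Y} \pm t_{\alpha/2}(df = \nu) \sqrt{\frac{S_x^2}{m} + \frac{S_y^2}{n}} \tag{9.10}$$

where ν is given by (9.9).

If α is the desired significance level, the Welch test for $H_0 : \mu_x = \mu_y$ against $H_1 : \mu_x \neq \mu_y$ rejects $H_0$ if

$$\left|\bar{X} - \bar{Y}\right| > t_{\alpha/2}(\nu) \sqrt{\frac{S_x^2}{m} + \frac{S_y^2}{n}} \tag{9.11}$$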

For one-sided alternatives, make obvious modifications to (9.11).
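As an illustration of (9.9) and (9.10), the Welch degrees of freedom and interval can be computed from the rounded summary statistics of Example 9.1. This is only a sketch with illustrative variable names; because the inputs are rounded, the result will be close to, but not exactly equal to, what R reports from the raw data.

```r
# Welch degrees of freedom (9.9) and confidence interval (9.10)
# from the rounded summary statistics of Example 9.1.
xbar <- 3.747; s2x <- 0.120; m <- 50   # nonsmokers
ybar <- 3.624; s2y <- 0.084; n <- 35   # smokers

vx <- s2x / m                          # estimated variance of the first mean
vy <- s2y / n                          # estimated variance of the second mean
nu <- (vx + vy)^2 / (vx^2 / (m - 1) + vy^2 / (n - 1))

tcrit <- qt(0.975, df = nu)            # fractional df are accepted by qt
ci <- (xbar - ybar) + c(-1, 1) * tcrit * sqrt(vx + vy)

nu
round(ci, 3)
```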

The t-test with Equal Variances

When the distributions are normal and have equal variances $\sigma_x^2 = \sigma_y^2 = \sigma^2$, there are exact student-t procedures for confidence intervals and tests. Both $S_x^2$ and $S_y^2$ are unbiased estimators of $\sigma^2$, but a better estimator is a weighted average of them called the pooled estimator of the common variance:

$$S_p^2 = \frac{(m - 1)S_x^2 + (n - 1)S_y^2}{m + n - 2}.$$

The scaled quantity $(m + n - 2)S_p^2/\sigma^2$ has a chi-square distribution with m + n − 2 degrees of freedom and is independent of both $\bar{X}$ and $\bar{Y}$. Since

$$\bar{X} - \bar{Y} \sim \operatorname{Norm}\left(\mu_x - \mu_y,\ \sigma\sqrt{\frac{1}{m} + \frac{1}{n}}\right)$$

it follows that

$$T_p = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{S_p\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$$

has a student-t distribution with m + n − 2 degrees of freedom. The 100(1 − α)% confidence interval is

$$\bar{X} - \bar{Y} \pm t_{\alpha/2}(m + n - 2)\, S_p \sqrt{\frac{1}{m} + \frac{1}{n}} \tag{9.12}$$
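The pooled procedure can be sketched in R from summary statistics. The numbers below are the rounded values from Example 9.1, and the variable names are illustrative; with rounded inputs the results only approximate those obtained from the raw data.

```r
# Pooled-variance t statistic and interval (9.12) from summary statistics.
xbar <- 3.747; s2x <- 0.120; m <- 50   # nonsmokers
ybar <- 3.624; s2y <- 0.084; n <- 35   # smokers

sp2  <- ((m - 1) * s2x + (n - 1) * s2y) / (m + n - 2)  # pooled variance
se_p <- sqrt(sp2) * sqrt(1 / m + 1 / n)                # pooled standard error
tp   <- (xbar - ybar) / se_p                           # T_p under H0
pval <- 2 * pt(-abs(tp), df = m + n - 2)               # two-sided p-value
ci   <- (xbar - ybar) + c(-1, 1) * qt(0.975, df = m + n - 2) * se_p

c(sp2 = sp2, t = tp, p_value = pval)
round(ci, 3)
```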

Example 9.3. The "lungcap" data set is adapted from the forced expiratory volume data set provided by the Journal of Statistics Education [1] in the JSE Data Archive. It has two variables, one named "fev" for forced expiratory volume and the other named "smoke" for smoking status. "smoke" is a factor with two levels, "no" and "yes".

It is always a good idea to look at side by side boxplots before applying a formal inference procedure.

[1] Kahn, M., An exhalent problem for teaching statistics, JSE vol. 13, 2, 2005.

[Figure: side-by-side boxplots of fev (vertical axis, roughly 3.0 to 4.5) by smoke (horizontal axis, levels "no" and "yes").]

The shapes of the boxplots are consistent with the normality assumption, but their spreads seem to cast doubt on the equal variance assumption. We will apply R’s ”t.test” function for both equal and unequal variances to get confidence intervals and to test the hypothesis that the population means are equal.

> attach(lungcap)

> t.test(fev~smoke)

Welch Two Sample t-test

data: fev by smoke

t = 1.7642, df = 80.225, p-value = 0.0815

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-0.01565884 0.26039598

sample estimates:

mean in group no mean in group yes

3.746740 3.624371

> t.test(fev~smoke,var.equal=T)

Two Sample t-test

data: fev by smoke

t = 1.7101, df = 83, p-value = 0.09098

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-0.01995121 0.26468836

sample estimates:

mean in group no mean in group yes

3.746740 3.624371

The t-tests and the normal test in Example 9.1 all give nearly the same answers. The confidence intervals are also nearly the same. At α = 10% we would reject H0 : µx = µy, but not at α = 5%.

The formula "fev ~ smoke" in these calls to t.test tells R to treat fev as a function of smoke, i.e., to separate the values of fev into two groups according to the value of smoke and to treat them as independent samples. There are other ways of calling t.test, including optional arguments for specifying one-sided alternatives and different confidence levels. Read about them by calling the help function.

> help(t.test)
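As a brief illustration of those optional arguments, the sketch below calls t.test with two numeric vectors, a one-sided alternative, and a 90% confidence level. The data are simulated and purely hypothetical; only the argument names "alternative" and "conf.level" come from R itself.

```r
# Hypothetical data, simulated for illustration only.
set.seed(1)
x <- rnorm(12, mean = 10, sd = 2)
y <- rnorm(15, mean = 9,  sd = 2)

# Welch test of H0: mu_x = mu_y against the one-sided H1: mu_x > mu_y,
# reporting a 90% one-sided confidence interval for mu_x - mu_y.
res <- t.test(x, y, alternative = "greater", conf.level = 0.90)

res$p.value
res$conf.int      # a one-sided interval: its upper limit is Inf
```

Adding var.equal = TRUE to the same call would switch from the Welch procedure to the pooled-variance procedure.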

9.1.4 Exercises

1. A sample of size 60 from one population of weights had a sample average of 10.4 lb. and a sample standard deviation of 2.7 lb. An independent sample of size 100 from another population of weights had a sample average of 9.7 lb. with a sample standard deviation of 1.9 lb. Find a 95% confidence interval for the difference between the population means.

2. Repeat problem 1, but assume the sample sizes are 6 and 10. State any assumptions you make.

3. The Payroll data set has data on the numbers of employees and the monthly payroll in thousands of dollars for 50 firms in two different industries. If you divide ”payroll” by ”employees” you get the average monthly salary for each firm. The populations of interest are the firms in industry A and those in industry B. The population variable of interest is the average monthly salary in each of the firms of these populations. At a significance level of α = 0.05 test the null hypothesis that the means of by-firm average monthly salaries in industries A and B are equal.

4. Construct side-by-side boxplots of average monthly salaries per firm in industries A and B. Critique your answer in problem 3.

5. Samples of sizes 100 and 80 of calculus students were acquired. The students in the first sample got into calculus by passing the pre-calculus course. Those in the second sample got in by getting a passing score on a placement test. In the first group, 65 succeeded in calculus. In the second group, 41 succeeded. Without using R find a 95% confidence interval for the difference in the success rates of the two populations.

6. For the same data, find the p-value for a test of the alternative hypothesis that the two success rates are not equal without using R.

7. Repeat problems 5 and 6 using R.

9.2 Paired Observations

An experimental setup that is superficially similar to two independent samples is the paired observations design. In this design, a pair of observations X and Y is made on each of n experimental subjects, or perhaps 2n subjects are matched in pairs according to similar characteristics and X is measured on one subject of each pair while Y is measured on the other. In the end, we have n pairs (Xi, Yi), i.e., a sample of size n from a bivariate distribution. This is not the same thing as a sample of n values of X from one population and n independent values of Y from another population. The goal is inference about the mean of D = X − Y and usually it suffices to apply the one-sample t-test or confidence interval to the n differences Di = Xi − Yi.

Recall Gosset's split plot experiment for comparing two methods of drying seeds before planting. The experimental units are small homogeneous plots of land. Half of each plot i is planted with seeds dried by the regular method, resulting in a yield Xi. The other half is planted with seeds that are kiln dried, resulting in yield Yi. We repeat the data and the analysis here.

Plot 1 2 3 4 5 6 7 8 9 10 11

REG 1903 1935 1910 2496 2108 1961 2060 1444 1612 1316 1511

KILN 2009 1915 2011 2463 2180 1925 2122 1482 1542 1443 1535

> attach(gosset)

> t.test(REG-KILN)

One Sample t-test

data: REG – KILN

t = -1.6905, df = 10, p-value = 0.1218

alternative hypothesis: true mean is not equal to 0

95 percent confidence interval:

-78.18164 10.72710

sample estimates:

mean of x

-33.72727

There is another way to do this in R, illustrated below.

> t.test(REG,KILN,paired=T)

Paired t-test

data: REG and KILN

t = -1.6905, df = 10, p-value = 0.1218

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-78.18164 10.72710

sample estimates:

mean of the differences

-33.72727
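Both calls reduce to the one-sample formulas applied to the differences. The sketch below enters the yields from the table above and computes the statistic, p-value, and interval by hand; the variable names d, tstat, and so on are illustrative.

```r
# Paired analysis of the seed-drying yields, computed from first principles.
REG  <- c(1903, 1935, 1910, 2496, 2108, 1961, 2060, 1444, 1612, 1316, 1511)
KILN <- c(2009, 1915, 2011, 2463, 2180, 1925, 2122, 1482, 1542, 1443, 1535)

d  <- REG - KILN                       # within-plot differences
n  <- length(d)
se <- sd(d) / sqrt(n)                  # standard error of the mean difference

tstat <- mean(d) / se                  # one-sample t statistic, df = n - 1
pval  <- 2 * pt(-abs(tstat), df = n - 1)
ci    <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * se

c(mean = mean(d), t = tstat, p_value = pval)
ci
```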

9.2.1 Crossover Studies

In a typical crossover study, a potentially effective treatment and a placebo are both applied to each of m + n human subjects and their responses (Xi, Yi) to the treatment and placebo are recorded. Because of concerns that the order of application might affect the responses, they are applied in random order. Thus, m of the subjects get the treatment first and n get the placebo first. The treatment effect is measured for each subject by first observation minus second observation. For those who receive the treatment first, it is

$$D_i = X_i - Y_i, \quad i = 1, \cdots, m$$

and for the subjects who receive the placebo first it is

$$D'_j = Y'_j - X'_j, \quad j = 1, \cdots, n.$$

If the treatment has no real effect, the distributions of D and D′ should have the same mean, $\mu_D = \mu_{D'}$. The order of application would not affect D and D′ differently. On the other hand, if the treatment effect is real, $\mu_D \neq \mu_{D'}$. So, even though we are looking at differences of paired observations, the problem is a two-sample problem with independent samples from the populations "treatment first" and "placebo first".

Example 9.4. Twenty mildly hypertensive men were recruited for a crossover study of the effectiveness of a drug to lower blood pressure. Each man was given the new drug (drug A) for a week and an older drug (drug B) for a week. Their average morning systolic blood pressures for each week were recorded. The order of administration of the drugs was randomized and unknown both to the subjects and to medical attendants. The data is in the data set "bpcrossover". The average blood pressure for each subject and each drug is shown. The "period" variable indicates whether the new drug was given first or second. The goal is to determine if there is a difference in the two drugs.

drug A drug B period

1 112 139 1

2 140 135 2

3 125 138 1

4 149 138 2

5 139 151 1

6 121 127 2

7 136 137 2

8 130 139 1

9 146 120 2

10 125 122 2

11 121 138 1

12 145 166 1

13 132 139 2

14 146 128 2

15 125 136 2

16 143 142 2

17 143 152 1

18 129 136 1

19 136 132 2

20 128 140 1

If this problem is approached as an ordinary paired observation problem, the student-t test is not significant.

> attach(bpcrossover)

> t.test(drugA-drugB)

One Sample t-test

data: drugA – drugB

t = -1.458, df = 19, p-value = 0.1612

alternative hypothesis: true mean is not equal to 0

95 percent confidence interval:

-10.229179 1.829179

sample estimates:

mean of x

-4.2

However, treating the problem correctly as a crossover experiment, there is a significant difference in the drugs.

> diff1=drugA[period==1]-drugB[period==1]

> diff2=drugB[period==2]-drugA[period==2]

> t.test(diff1,diff2)

Welch Two Sample t-test

data: diff1 and diff2

t = -2.5781, df = 16.544, p-value = 0.01984

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-18.568571 -1.835469

sample estimates:

mean of x mean of y

-14.111111 -3.909091

Estimating the Size of the Effect

In the example just above, the student-t procedure returns a confidence interval for the difference in means

$$E(X - Y) - E(Y' - X') = E(X - Y) + E(X' - Y'),$$

where X is the response to drug A when it is given first, X′ is the drug A response when it is given second, Y is the drug B response when it is given second and Y′ is the drug B response when it is given first. Each of the terms on the right is a measure of the difference in the efficacy of drug A compared to drug B, under different experimental conditions. Therefore, a reasonable overall measure of A's efficacy compared to B's is their average,

$$\frac{1}{2}\left[E(X - Y) + E(X' - Y')\right].$$

If we divide the end points of the returned confidence interval by 2 we obtain an interval estimate of the difference in mean efficacy. In this particular example, the interval is (−9.284, −0.918).
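The crossover analysis and the halved interval can be reproduced from the table in Example 9.4. The sketch below types the data in as plain vectors rather than attaching the "bpcrossover" data set, so the vector construction is an assumption about how that data set is laid out.

```r
# Crossover analysis of Example 9.4, with the data entered from the table.
drugA  <- c(112, 140, 125, 149, 139, 121, 136, 130, 146, 125,
            121, 145, 132, 146, 125, 143, 143, 129, 136, 128)
drugB  <- c(139, 135, 138, 138, 151, 127, 137, 139, 120, 122,
            138, 166, 139, 128, 136, 142, 152, 136, 132, 140)
period <- c(1, 2, 1, 2, 1, 2, 2, 1, 2, 2,
            1, 1, 2, 2, 2, 2, 1, 1, 2, 1)

# First observation minus second observation within each subject.
diff1 <- drugA[period == 1] - drugB[period == 1]   # drug A given first
diff2 <- drugB[period == 2] - drugA[period == 2]   # drug B given first

res <- t.test(diff1, diff2)        # Welch two-sample test on the differences
res$conf.int / 2                   # interval for the average efficacy difference
```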

9.2.2 Exercises

1. Concentrations of particulate matter were measured at 25 locations in a lake after rains caused heavy runoff from surrounding areas. After a period of dry weather, they were measured again at the same locations. The data are in the file ”runoff”. Find a 90% confidence interval for the effect of runoff on particulate concentrations.

2. Repeat problem 1 treating the rainy and dry measurements as independent samples. Compare the results. Which is the better procedure?

3. With the bpcrossover data, make side-by-side boxplots of drugA – drugB for both values of period. Does this indicate anything about whether the order of administration of the drugs makes a difference? If it exists, this is called a period effect.

4. Go to StatSci.org http://www.statsci.org/data/general/vitaminc.html and download the data set "Effect of Vitamin C on Muscular Endurance" [2]. Perform the appropriate analysis to determine if vitamin C has an effect and if so, the size of the effect.

[2] Keith, R. E., and Merrill, E. (1983). The effects of vitamin C on maximum grip strength and muscular endurance. Journal of Sports Medicine and Physical Fitness, 23, 253-256.

9.3 More than Two Independent Samples: Single Factor Analysis of Variance

Let X be a variable defined for several subgroups of a main population. Let

$$\begin{aligned}
&X_{11}, X_{12}, \cdots, X_{1n_1} \text{ be a random sample of size } n_1 \text{ from } \operatorname{Norm}(\mu_1, \sigma), \\
&X_{21}, X_{22}, \cdots, X_{2n_2} \text{ be a random sample of size } n_2 \text{ from } \operatorname{Norm}(\mu_2, \sigma), \\
&\qquad \vdots \\
&X_{M1}, X_{M2}, \cdots, X_{Mn_M} \text{ be a random sample of size } n_M \text{ from } \operatorname{Norm}(\mu_M, \sigma).
\end{aligned}$$

We assume that the samples are independent of each other. Everything is independent of everything else. Also note that we are assuming that the variances of the variable X in the M groups are all the same: var(X) = σ2. Typically, the M groups are defined by the M levels of a discrete factor variable A that covaries with X in the larger population. If you like, you can think of Xij as a sample from the conditional distribution of X, given A = i.

The objective is to decide if there is any real difference in the group means µ1, µ2, · · · , µM , that is, to test the null hypothesis H0 : µ1 = µ2 = · · · = µM against the many sided alternative H1 : µj ≠ µk for at least one pair j, k. If we reject this null hypothesis, we conclude that factor A has a real effect on the expected response.

Let

$$\bar{X}_{i\cdot} = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij}$$

denote the average of the observations in the ith group. Let

$$N = \sum_{i=1}^{M} n_i$$

be the total number of observations in all the groups, and let

$$\bar{X}_{\cdot\cdot} = \frac{1}{N} \sum_{i=1}^{M} \sum_{j=1}^{n_i} X_{ij}$$

be the average of all the observations (the grand average). The grand average is also a weighted average of the M group averages:

$$\bar{X}_{\cdot\cdot} = \frac{1}{N} \sum_{i=1}^{M} n_i \bar{X}_{i\cdot}.$$

The total sum of squares is

$$\mathrm{SS(tot)} = \sum_{i=1}^{M} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{\cdot\cdot})^2. \tag{9.13}$$

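SS(tot) can be expanded by adding and subtracting $\bar{X}_{i\cdot}$ inside the square:

$$\begin{aligned}
\mathrm{SS(tot)} &= \sum_{i=1}^{M} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\cdot} + \bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 \\
&= \sum_{i=1}^{M} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\cdot})^2 + 2 \sum_{i=1}^{M} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\cdot})(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot}) + \sum_{i=1}^{M} \sum_{j=1}^{n_i} (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2
\end{aligned}$$

The middle term above is 0 because, within each group i, the deviations $X_{ij} - \bar{X}_{i\cdot}$ sum to zero over j. After simplifying the third term,

$$\mathrm{SS(tot)} = \sum_{i=1}^{M} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\cdot})^2 + \sum_{i=1}^{M} n_i (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 = \mathrm{SS(resid)} + \mathrm{SS(betw)}. \tag{9.14}$$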

The residual sum of squares, also sometimes called the within group or error sum of squares, is

$$\mathrm{SS(resid)} = \sum_{i=1}^{M} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\cdot})^2.$$

The between group sum of squares, or treatment sum of squares, or factor A sum of squares is

$$\mathrm{SS(betw)} = \sum_{i=1}^{M} n_i (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2.$$

It is pretty clear that SS(betw) is a measure of how widely dispersed the group averages $\bar{X}_{i\cdot}$ are about their center $\bar{X}_{\cdot\cdot}$. Thus, it is the basis for a test statistic for accepting the alternative hypothesis that the population means are not all the same. To be useful, it has to be compared to a measure of the inherent variability of the data. SS(resid) is just such a measure. The following theorem tells us how to relate the two.
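The decomposition SS(tot) = SS(resid) + SS(betw) is an algebraic identity, so it can be checked numerically on any grouped data. The sketch below uses simulated, purely hypothetical data with unequal group sizes; the group labels and parameter values are arbitrary choices.

```r
# Numerical check of the sum of squares decomposition on hypothetical data.
set.seed(42)
g <- factor(rep(c("a", "b", "c"), times = c(4, 6, 5)))   # group labels
x <- rnorm(length(g), mean = c(10, 12, 11)[as.integer(g)], sd = 2)

grand  <- mean(x)                 # grand average
gmeans <- tapply(x, g, mean)      # group averages
ni     <- tapply(x, g, length)    # group sample sizes

ss_tot   <- sum((x - grand)^2)
ss_resid <- sum((x - ave(x, g))^2)        # ave() repeats each group's mean
ss_betw  <- sum(ni * (gmeans - grand)^2)

all.equal(ss_tot, ss_resid + ss_betw)
```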

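Theorem 9.4. SS(resid) and SS(betw) are independent random variables.

$$\frac{\mathrm{SS(resid)}}{\sigma^2} \sim \operatorname{Chisq}(df = N - M).$$

If $H_0 : \mu_1 = \mu_2 = \cdots = \mu_M$ is true,

$$\frac{\mathrm{SS(betw)}}{\sigma^2} \sim \operatorname{Chisq}(df = M - 1)$$

and

$$F = \frac{(N - M)}{(M - 1)} \cdot \frac{\mathrm{SS(betw)}}{\mathrm{SS(resid)}} \sim \operatorname{FDist}(M - 1, N - M),$$

that is, F has the F distribution with M − 1 degrees of freedom in the numerator and N − M degrees of freedom in the denominator.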

We define the mean square residual to be

$$\mathrm{MS(resid)} = \frac{\mathrm{SS(resid)}}{N - M},$$

and the factor A mean square to be

$$\mathrm{MS(betw)} = \frac{\mathrm{SS(betw)}}{M - 1},$$

so that F can be written

$$F = \frac{\mathrm{MS(betw)}}{\mathrm{MS(resid)}}.$$

For a given significance level α, let $f_\alpha(\nu_1, \nu_2)$ be the 100(1 − α)th percentile of $\operatorname{FDist}(\nu_1, \nu_2)$. For a test at significance level α of $H_0 : \mu_1 = \mu_2 = \cdots = \mu_M$ against $H_1 : \mu_i \neq \mu_j$ for at least one pair i, j, reject $H_0$ if

$$F = \frac{\mathrm{MS(betw)}}{\mathrm{MS(resid)}} > f_\alpha(M - 1, N - M). \tag{9.15}$$

In R, fα can be found with the function ”qf”, the quantile function of the F distribution, and the p-value Pr(F > Fobs) with the function ”pf”.

Example 9.5. The summary statistics for samples of observations of a variable X on four groups are shown below.

n mean var

group1 10 1.786 13.792

group2 10 3.268 5.649

group3 10 2.419 3.860

group4 10 4.374 1.631

There are M = 4 groups and each group sample size $n_i$ is 10. N = 40 is the total combined sample size. To find the grand average, we multiply each group average by the group sample size, add, and divide by N.

$$\bar{X}_{\cdot\cdot} = (10 \times 1.786 + 10 \times 3.268 + 10 \times 2.419 + 10 \times 4.374)/40 = 2.962.$$

Now, to get SS(resid) we multiply the sample variance for each group by $n_i - 1$ and add them together.

$$\mathrm{SS(resid)} = 9 \times 13.792 + 9 \times 5.649 + 9 \times 3.860 + 9 \times 1.631 = 224.388$$

The factor A sum of squares $\mathrm{SS(betw)} = \sum_{i=1}^{M} n_i (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2$ is

$$\begin{aligned}
\mathrm{SS(betw)} &= 10 \times (1.786 - 2.962)^2 + 10 \times (3.268 - 2.962)^2 \\
&\quad + 10 \times (2.419 - 2.962)^2 + 10 \times (4.374 - 2.962)^2 \\
&= 37.652,
\end{aligned}$$

and then, finally,

$$\mathrm{MS(betw)} = \frac{\mathrm{SS(betw)}}{M - 1} = \frac{37.652}{3} = 12.551$$

$$\mathrm{MS(resid)} = \frac{\mathrm{SS(resid)}}{N - M} = \frac{224.388}{36} = 6.233$$

$$F = \frac{\mathrm{MS(betw)}}{\mathrm{MS(resid)}} = \frac{12.551}{6.233} = 2.014.$$

We will use the R function ”pf” to find the p-value Pr(F > 2.014).

> 1-pf(2.014,3,36)

[1] 0.1293147

Thus the test is not significant at α = 0.10. We do not conclude that there is a difference in population means.
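The whole calculation in Example 9.5 can be carried out in R from the summary table. The following is a sketch with illustrative variable names; it also uses "qf" to obtain the critical value for α = 0.10 as in (9.15).

```r
# One-way anova from summary statistics (Example 9.5).
n    <- c(10, 10, 10, 10)                 # group sample sizes
xbar <- c(1.786, 3.268, 2.419, 4.374)     # group means
s2   <- c(13.792, 5.649, 3.860, 1.631)    # group sample variances

M <- length(n)
N <- sum(n)

grand    <- sum(n * xbar) / N             # grand average
ss_resid <- sum((n - 1) * s2)             # residual sum of squares
ss_betw  <- sum(n * (xbar - grand)^2)     # between group sum of squares

Fobs  <- (ss_betw / (M - 1)) / (ss_resid / (N - M))
pval  <- pf(Fobs, M - 1, N - M, lower.tail = FALSE)   # Pr(F > Fobs)
fcrit <- qf(0.90, M - 1, N - M)           # critical value for alpha = 0.10

c(SS_betw = ss_betw, SS_resid = ss_resid, F = Fobs, p_value = pval)
Fobs > fcrit                              # TRUE would mean reject H0
```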

This kind of problem and the kind of analysis shown above is called one-way analysis of variance or single factor analysis of variance because the observations of X are categorized by the levels of a single factor variable. Analysis of variance is abbreviated anova.

9.3.1 Example Using R

The data set ”apfilternoise.txt” shows data presented to a Senate subcommittee on a comparison of a new type of automobile pollution filter to an older type. One of the variables of concern was the noise created inside the vehicle by the filter. A factor that might influence the noise is the size of the vehicle. The data below is a subset ”filternoise1” of the full data set consisting only of measurements for the old filter type.

NOISE SIZE

1 810 small

2 820 small

3 820 small

4 840 midsize

5 840 midsize

6 845 midsize

7 785 large

8 790 large

9 785 large

10 835 small

11 835 small

12 835 small

13 845 midsize

14 855 midsize

15 850 midsize

16 760 large

17 760 large

18 770 large

R has a function "aov" for performing analyses of variance for data such as this, but the "lm" function that we used for regression actually gives more information. In fact, linear regression models and analysis of variance models are both examples of linear models, and "lm" is designed for all types of linear models.

> filternoise.lm=lm(NOISE~SIZE,data=filternoise1)

> summary(filternoise.lm)

Call:

lm(formula = NOISE ~ SIZE, data = filternoise1)

Residuals:

Min 1Q Median 3Q Max

-15.8333 -5.8333 -0.8333 9.1667 15.0000

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 825.833 4.271 193.362 < 2e-16 ***

SIZEmidsize 20.000 6.040 3.311 0.00475 **

SIZElarge -50.833 6.040 -8.416 4.59e-07 ***

Residual standard error: 10.46 on 15 degrees of freedom

Multiple R-squared: 0.907, Adjusted R-squared: 0.8946

F-statistic: 73.11 on 2 and 15 DF, p-value: 1.841e-08

> anova(filternoise.lm)

Analysis of Variance Table

Response: NOISE

Df Sum Sq Mean Sq F value Pr(>F)

SIZE 2 16002.8 8001.4 73.109 1.841e-08 ***

Residuals 15 1641.7 109.4

The estimated coefficients in the summary of the linear model object "filternoise.lm" need some explanation. The variable SIZE is a factor, i.e., a discrete variable with nominal values or levels, "small", "midsize" and "large". These are stored internally in that order, but this is merely coincidental. SIZE is not an ordered factor, which is a separate class in R.

The Intercept coefficient in the summary is the estimated mean of NOISE for SIZE="small". In other words, it is the quantity $\bar{X}_{1\cdot}$ in the discussion above. The second coefficient, labelled SIZEmidsize, is the difference in estimated means for SIZE="midsize" and SIZE="small". That is, it is the quantity $\bar{X}_{2\cdot} - \bar{X}_{1\cdot}$. Finally the coefficient SIZElarge is the estimated difference in the mean of NOISE when SIZE="large" and its mean when SIZE="small", i.e., $\bar{X}_{3\cdot} - \bar{X}_{1\cdot}$. The first level of the factor is treated as a base level and the others are compared to it.

For each coefficient, the summary gives a student-t statistic for testing the hypothesis that the true coefficient is zero and a p-value for the statistic. Since the p-values here are so small, we are justified in concluding that $\mu_2 \neq \mu_1$ and $\mu_3 \neq \mu_1$. The summary gives no information to justify the conclusion that $\mu_3 \neq \mu_2$.

After creating the linear model object, the call

> anova(filternoise.lm)

produces an analysis of variance table. The first line, headed ”SIZE”, shows the between group degrees of freedom M − 1, the between group sum of squares SS(betw), the between group mean square MS(betw) and the F statistic F = MS(betw)/MS(resid). The last entry in the first row is the p-value for the observed value of F . The second row, headed ”Residuals” gives the degrees of freedom N −M , the residual sum of squares SS(resid), and the mean square residual MS(resid). An anova table is the traditional way of presenting the calculations in an analysis of variance. Some of them look slightly different from this, but they all convey about the same information. Notice that the F statistic and its p-value are the same in the linear model summary and the anova table.
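The interpretation of the coefficients can be verified by entering the 18 observations from the table and comparing the fitted coefficients with the group means. The data frame construction below is an assumption about how "filternoise1" is laid out; in particular, the factor levels are set explicitly so that "small" is the base level, as in the summary shown above.

```r
# Rebuild the old-filter noise data and compare lm coefficients to group means.
filternoise1 <- data.frame(
  NOISE = c(810, 820, 820, 840, 840, 845, 785, 790, 785,
            835, 835, 835, 845, 855, 850, 760, 760, 770),
  SIZE  = factor(rep(rep(c("small", "midsize", "large"), each = 3), times = 2),
                 levels = c("small", "midsize", "large"))
)

fit   <- lm(NOISE ~ SIZE, data = filternoise1)
means <- tapply(filternoise1$NOISE, filternoise1$SIZE, mean)

coef(fit)                                   # intercept and two differences
c(means["small"],
  means["midsize"] - means["small"],
  means["large"]   - means["small"])        # the same three numbers

tab <- anova(fit)                           # anova table as a data frame
tab
```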

9.3.2 Multiple Comparisons

In the example just above, we concluded that µ2 ≠ µ1 and µ3 ≠ µ1 because the corresponding p-values were so small. We were unable to draw any conclusion about µ3 − µ2. At best, the anova table allows us to say that some of the group means are different, without saying which ones. We could perform separate two-sample t-tests at level α on each pair of groups and reject some of the hypotheses µi = µj while not rejecting others. The problem is, when there are a lot of pairwise comparisons to be made, the probability that one or more of these pairwise null hypotheses is rejected is substantially greater than α, even if they are all true. This is the problem of multiple comparisons. A solution is to reduce the significance level of the pairwise tests enough so that the probability of one or more type 1 errors in the whole set of comparisons is less than α. The Bonferroni method of adjustment reduces the significance level for the pairwise tests to α/k, where k is the number of comparisons. The Holm method of adjustment is considerably more complicated. It is the default method used in R, although the Bonferroni method can be selected as an option.

The R function is ”pairwise.t.test”. It actually adjusts p-values instead of pre-established values of α. We will illustrate it with the data of the preceding example.

> attach(filternoise1)

> pairwise.t.test(NOISE,SIZE)

Pairwise comparisons using t tests with pooled SD

data: NOISE and SIZE

large midsize

midsize 1.8e-08 –

small 9.2e-07 0.0047

> pairwise.t.test(NOISE,SIZE,p.adjust.method="bonferroni")

Pairwise comparisons using t tests with pooled SD

data: NOISE and SIZE

large midsize

midsize 1.8e-08 –

small 1.4e-06 0.014

> pairwise.t.test(NOISE,SIZE,p.adjust.method="none")

Pairwise comparisons using t tests with pooled SD

data: NOISE and SIZE

large midsize

midsize 5.9e-09 –

small 4.6e-07 0.0047

The method of adjustment did not change any of the conclusions here because the p-values were so small and there were only three comparisons. The default Holm method is recommended, in general.
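The adjusted p-values in these tables can be recovered from the unadjusted ones with R's "p.adjust" function. The sketch below starts from the rounded unadjusted values printed above, so the element names are illustrative and the results match the tables only to the printed precision. Holm's method multiplies the smallest p-value by k, the next smallest by k − 1, and so on, and then makes the results nondecreasing.

```r
# Unadjusted pairwise p-values, as printed by pairwise.t.test(..., "none").
raw <- c(midsize_vs_large = 5.9e-09,
         small_vs_large   = 4.6e-07,
         small_vs_midsize = 0.0047)

bonf <- p.adjust(raw, method = "bonferroni")   # each p-value times k = 3
holm <- p.adjust(raw, method = "holm")         # step-down multipliers 3, 2, 1

rbind(raw, bonf, holm)
```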

9.3.3 Exercises

1. Using the summary data in Example 9.5, construct an anova table.

2. The table below shows summary statistics for normally distributed measurements on 5 groups. The population variances are all equal. Construct an anova table and determine if there is a difference in the population means by calculating a p-value.

n mean var

grp1 10 52.40 243.38

grp2 21 55.00 142.00

grp3 16 36.25 246.73

grp4 20 53.65 173.82

grp5 18 47.50 267.91

3. The data set ”airquality” in the R datasets library has data on ozone concentration, wind speed, temperature, and solar radiation by month and day for May through September in New York. Attach airquality to your workspace and then construct side-by-side boxplots of Wind by Month. Month is a numeric variable in the airquality data frame. You can treat it as a factor by using the ”as.factor” function, e.g.,

> plot(Wind ~ as.factor(Month))

Next, do an analysis of variance to determine if wind speed varies significantly by month. Finally, use the ”pairwise.t.test” function to pick out which pairs of months are significantly different. Are the answers what you would expect from looking at the boxplots?

4. From the course data folder, import the reading comprehension data. "Group" is a factor whose levels are abbreviations of three methods of reading instruction. Create a linear model object in R with "Post3" as a function of "Group". Apply the "summary" function to the linear model object and interpret the coefficients. Note the associated p-values. Apply the "anova" function to the linear model object to create an anova table and interpret its output. Apply the function "pairwise.t.test" in the following manner.

> pairwise.t.test(Post3,Group,”bonferroni”)

Compare these p-values to the p-values in the linear model summary.

9.4 Two-Way Analysis of Variance

Agricultural researchers are interested in comparing the yield of 4 varieties of corn. To control for the effects of rainfall, climate, and soil fertility, they select 5 small plots of land and divide each into fourths. The four varieties are randomly assigned to the subplots. At harvest, the yield of the jth variety on the ith plot is measured. Let Xij denote its value. There are two factors that affect the yield – the variation in growing conditions between the main plots and the variation between seed varieties. Hence, this is a two-factor experiment. It is a generalization of the paired observation experimental design discussed earlier. Researchers are primarily interested in the differences between seed varieties and not so much in the effect of the growing conditions. Note that there is only one observation for each combination of levels of the two factors.

Let A denote the plot factor. It has a = 5 levels. Let B denote the variety factor, with b = 4 levels. Let µij = E(Xij) be the expected response (yield) for level i of A and level j of B. It is assumed that the Xij are independent and normally distributed with common variance σ².

No inferences are possible without making explicit what kind of effects we are looking for and restricting the expected values µij in an appropriate way. The model we choose is called the additive model:

µij = µ+ αi + βj (9.16)

It is called the additive model because changing only the level of A simply adds some amount to µij and it is the same amount irrespective of j – the level of B. A similar statement holds for changing the level of B. The parameters satisfy the restrictions

$$\sum_{i=1}^{a} \alpha_i = \sum_{j=1}^{b} \beta_j = 0.$$

Because of these restrictions, only a − 1 of the αi and b − 1 of the βj vary freely. Including µ then, there are a− 1 + b− 1 + 1 = a+ b− 1 parameters that determine the a× b means µij .

If the means µij do satisfy this model, then µ, αi, and βj can be found as follows:

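$$\mu = \mu_{\cdot\cdot} = \frac{1}{ab}\sum_{i=1}^{a}\sum_{j=1}^{b}\mu_{ij}, \qquad \alpha_i = \mu_{i\cdot} - \mu_{\cdot\cdot}, \qquad \beta_j = \mu_{\cdot j} - \mu_{\cdot\cdot},$$

where

$$\mu_{i\cdot} = \frac{1}{b}\sum_{j=1}^{b}\mu_{ij}, \qquad \mu_{\cdot j} = \frac{1}{a}\sum_{i=1}^{a}\mu_{ij}.$$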

The model can also be expressed as

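$$\mu_{ij} = \mu_{i\cdot} + \mu_{\cdot j} - \mu_{\cdot\cdot}, \tag{9.17}$$

and the degree to which the model is untrue is

$$\sum_{i=1}^{a}\sum_{j=1}^{b}\left(\mu_{ij} - \mu_{i\cdot} - \mu_{\cdot j} + \mu_{\cdot\cdot}\right)^2. \tag{9.18}$$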
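The recovery of µ, αi, and βj from the means can be illustrated numerically. The following is only a sketch: the parameter values and the variable names are hypothetical, chosen so that the αi and βj each sum to zero.

```r
# Hypothetical parameter values (not from the text); alpha and beta each sum to zero.
mu    <- 50
alpha <- c(-3, 1, 2)   # a = 3 levels of factor A
beta  <- c(4, -4)      # b = 2 levels of factor B

# Build the a x b matrix of means mu_ij = mu + alpha_i + beta_j.
M <- mu + outer(alpha, beta, "+")

# Recover the parameters from the means, as in the formulas above.
mu_hat    <- mean(M)                # overall mean of the mu_ij
alpha_hat <- rowMeans(M) - mu_hat   # row mean minus overall mean
beta_hat  <- colMeans(M) - mu_hat   # column mean minus overall mean

# Departure from additivity, expression (9.18); it vanishes for an additive matrix.
nonadd <- sum((M - outer(rowMeans(M), colMeans(M), "+") + mu_hat)^2)
```

Changing a single entry of M destroys additivity, and nonadd then becomes positive.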

Now, factor A has no effect on the expected response if and only if all the αi are equal to zero, i.e., if and only if

$$\sum_{i=1}^{a}\alpha_i^2 = \sum_{i=1}^{a}\left(\mu_{i\cdot} - \mu_{\cdot\cdot}\right)^2 = 0. \tag{9.19}$$

Similarly, factor B has no effect if and only if

$$\sum_{j=1}^{b}\beta_j^2 = \sum_{j=1}^{b}\left(\mu_{\cdot j} - \mu_{\cdot\cdot}\right)^2 = 0. \tag{9.20}$$

Finally, assuming the model to be true, neither factor has any effect if and only if the µij are all the same, i.e., if and only if

$$\sum_{i=1}^{a}\sum_{j=1}^{b}\left(\mu_{ij} - \mu_{\cdot\cdot}\right)^2 = 0.$$

The following algebraic identity holds for all of these quantities:

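$$\sum_{i=1}^{a}\sum_{j=1}^{b}\left(\mu_{ij} - \mu_{\cdot\cdot}\right)^2 = b\sum_{i=1}^{a}\left(\mu_{i\cdot} - \mu_{\cdot\cdot}\right)^2 + a\sum_{j=1}^{b}\left(\mu_{\cdot j} - \mu_{\cdot\cdot}\right)^2 + \sum_{i=1}^{a}\sum_{j=1}^{b}\left(\mu_{ij} - \mu_{i\cdot} - \mu_{\cdot j} + \mu_{\cdot\cdot}\right)^2. \tag{9.21}$$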

Of course none of the parameters in this equation are known, but they all have estimated values from the observations Xij . Exactly the same identity holds for the estimated values.

SS(tot) = SS(A) + SS(B) + SS(resid), (9.22)

where

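$$\mathrm{SS(tot)} = \sum_{i=1}^{a}\sum_{j=1}^{b}\left(X_{ij} - \bar{X}_{\cdot\cdot}\right)^2,$$

$$\mathrm{SS}(A) = b\sum_{i=1}^{a}\left(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot}\right)^2,$$

$$\mathrm{SS}(B) = a\sum_{j=1}^{b}\left(\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot}\right)^2,$$

$$\mathrm{SS(resid)} = \sum_{i=1}^{a}\sum_{j=1}^{b}\left(X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X}_{\cdot\cdot}\right)^2.$$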

Theorem 9.5. The three terms on the right of (9.22) are independent random variables. Furthermore,

SS(resid)/σ² ∼ Chisq(df = (a − 1)(b − 1)).

If α1 = α2 = · · · = αa = 0 (factor A has no effect),

SS(A)/σ² ∼ Chisq(df = a − 1).

If β1 = β2 = · · · = βb = 0 (factor B has no effect),

SS(B)/σ² ∼ Chisq(df = b − 1).

Define mean square figures by dividing each of the above by its degrees of freedom:

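$$\mathrm{MS}(A) = \frac{\mathrm{SS}(A)}{a-1}, \qquad \mathrm{MS}(B) = \frac{\mathrm{SS}(B)}{b-1}, \qquad \mathrm{MS(resid)} = \frac{\mathrm{SS(resid)}}{(a-1)(b-1)},$$

and F statistics by

$$F_A = \frac{\mathrm{MS}(A)}{\mathrm{MS(resid)}}, \qquad F_B = \frac{\mathrm{MS}(B)}{\mathrm{MS(resid)}}.$$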

To test the null hypothesis HA : α1 = · · · = αa = 0 against the alternative that αi ≠ αk for some i and k, reject HA if

$$F_A > f_\delta\big(a-1,\ (a-1)(b-1)\big),$$

where δ is the desired type 1 error probability. To test HB : β1 = · · · = βb = 0, reject HB if

$$F_B > f_\delta\big(b-1,\ (a-1)(b-1)\big).$$

Example 9.6. We return to the auto pollution filter noise data. Below is a two-way table showing one observation of NOISE for each combination of the factors SIZE and TYPE.

          standard   Octel
small        825.8   822.5
midsize      845.8   821.7
large        775.0   770.0

In the exercises below, you are asked to perform the calculations above by hand and test the null hypothesis that the type of filter has no effect on noise. This is an absurdly small data set, so hand calculations are feasible. For any problem of significant size, the data is more likely to be given in a form similar to the data frame below.

> filternoise2

SIZE TYPE NOISE

1 small standard 825.8

2 midsize standard 845.8

3 large standard 775.0

4 small Octel 822.5

5 midsize Octel 821.7

6 large Octel 770.0

We will use R’s "anova" and "lm" functions to construct the analysis of variance table.

> anova(lm(NOISE~SIZE+TYPE,data=filternoise2))

Analysis of Variance Table

Response: NOISE

Df Sum Sq Mean Sq F value Pr(>F)

SIZE 2 4341.0 2170.48 32.5434 0.02981 *

TYPE 1 175.0 174.96 2.6233 0.24674

Residuals 2 133.4 66.69

Based on the p-values for the F statistics, we can conclude that vehicle size has an effect on noise, but we cannot conclude that the type of filter has an effect. This is probably not what the manufacturers were hoping for.
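The entries of this anova table can also be obtained directly from the sums of squares in (9.22). The sketch below uses R only as a calculator on the two-way table of Example 9.6, with SIZE playing the role of factor A (a = 3) and TYPE the role of factor B (b = 2). The variable names are our own, and the results should agree with the table above up to rounding.

```r
# NOISE values from the two-way table in Example 9.6 (rows = SIZE, columns = TYPE).
X <- matrix(c(825.8, 822.5,
              845.8, 821.7,
              775.0, 770.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(SIZE = c("small", "midsize", "large"),
                            TYPE = c("standard", "Octel")))
a <- nrow(X)
b <- ncol(X)
grand <- mean(X)

# Sums of squares as defined after (9.22).
ss_A     <- b * sum((rowMeans(X) - grand)^2)   # SIZE
ss_B     <- a * sum((colMeans(X) - grand)^2)   # TYPE
ss_tot   <- sum((X - grand)^2)
ss_resid <- sum((X - outer(rowMeans(X), colMeans(X), "+") + grand)^2)

# Mean squares, F statistics, and p-values.
df_resid <- (a - 1) * (b - 1)
F_A <- (ss_A / (a - 1)) / (ss_resid / df_resid)
F_B <- (ss_B / (b - 1)) / (ss_resid / df_resid)
p_A <- pf(F_A, a - 1, df_resid, lower.tail = FALSE)
p_B <- pf(F_B, b - 1, df_resid, lower.tail = FALSE)
round(c(ss_A, ss_B, ss_resid, F_A, F_B, p_A, p_B), 4)
```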

9.4.1 Interactions Between the Factors

With only one observation for each combination of factor levels, the additive model must be taken as given. If there are multiple observations for each combination then it becomes possible to test the hypothesis that all the interactions

$$\mu_{ij} - \mu_{i\cdot} - \mu_{\cdot j} + \mu_{\cdot\cdot}$$

are equal to zero, in other words, to test the hypothesis that the additive model is true. This is because with multiple observations there is an independent estimate of the error variance σ². Rather than go through the mathematics, we will illustrate with an example in R.

Example 9.7. The IRS is concerned about the time it takes taxpayers to fill out tax forms. This example is purely fictional. Managers arranged an experiment in which subjects from three income brackets are timed in their completion of four forms. Since the forms are different, the times to complete them are expected to be different. Managers are mainly interested in the effect of income group on completion time. Ten response times were recorded for each combination of income group and form. The data is available in two different formats, "taxforms" in a format more suitable for presentation and "Taxforms" in a format more suitable for R.

We will first do the analysis of variance without interactions, i.e., assuming an additive model.

> anova(lm(time~group+form,data=Taxforms))

Analysis of Variance Table

Response: time

Df Sum Sq Mean Sq F value Pr(>F)

group 2 6719 3359.4 4.1039 0.01901 *

form 3 6280 2093.3 2.5572 0.05868 .

Residuals 114 93319 818.6

Next, allowing interactions:

> anova(lm(time~group*form,data=Taxforms))

Analysis of Variance Table

Response: time

Df Sum Sq Mean Sq F value Pr(>F)

group 2 6719 3359.4 4.1127 0.01899 *

form 3 6280 2093.3 2.5627 0.05857 .

group:form 6 5102 850.3 1.0410 0.40297

Residuals 108 88217 816.8

The third line in the table gives the sum of squares, mean square and F statistic for interactions. The p-value of 40% does not indicate a significant interaction between the factors. The main additive effect for the group factor is significant and the additive effect for form is almost significant. Notice that the analysis without interactions gives almost the same answers for the main effects.

This experimental design is balanced in that each combination of factor levels has the same number of responses. Unbalanced designs can be analyzed but the interpretation of the answers becomes more complicated.
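The bookkeeping in the two tables can be checked on simulated data. The sketch below does not use the Taxforms data. It generates a hypothetical balanced design of the same shape (3 groups, 4 forms, 10 replicates) from an additive model with made-up effect sizes, then fits both models. In a balanced design the main-effect sums of squares are the same in both tables, and the residual sum of squares of the additive fit splits into the interaction and residual sums of squares of the second fit.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical balanced layout: 10 replicates for each of 3 x 4 factor combinations.
sim <- expand.grid(rep   = 1:10,
                   group = factor(c("low", "mid", "high")),
                   form  = factor(paste0("F", 1:4)))

# Simulated completion times from an additive model (made-up effect sizes).
sim$time <- 100 + 5 * as.integer(sim$group) + 3 * as.integer(sim$form) +
  rnorm(nrow(sim), sd = 25)

tab_add <- anova(lm(time ~ group + form, data = sim))  # additive model
tab_int <- anova(lm(time ~ group * form, data = sim))  # with interactions

tab_add$Df   # degrees of freedom: group, form, residuals
tab_int$Df   # degrees of freedom: group, form, group:form, residuals
```

The degrees of freedom follow the same pattern as in Example 9.7: 114 residual degrees of freedom in the additive fit become 6 for interactions plus 108 for residuals.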

9.4.2 Exercises

1. Do the calculations in Example 9.6 by hand, without using R except as a calculator.

2. Use the paired observations student-t test to determine if there is an effect due to TYPE in Example 9.6. Compare the p-value to the anova p-value.

3. With the apfilternoise data perform a two way analysis of variance with NOISE as the response and SIZE and TYPE as the factors. Ignore the variable SIDE. Do the analysis first without and then with interactions. What are your conclusions?

Chapter 10

Analysis of Categorical Data

10.1 Multinomial Distributions

Let X be a factor variable with levels L1, · · · , Lm. This means that X is discrete with a small number of distinct values, which may be expressed numerically for convenience, but generally signify categories such as "male", "female" or "low income", "middle income", "high income". Let pi = Pr(X = Li) be the probability of the ith category. These probabilities satisfy pi > 0 for each i and $\sum_{i=1}^{m} p_i = 1$. Because of this last constraint, only m − 1 of the pi can be chosen freely.

Suppose we have n independent observations of X. Let Yi be the number of observations in which level Li occurs, i = 1, · · · , m. These jointly distributed random variables Y1, · · · , Ym have nonnegative integer values and $\sum_{i=1}^{m} Y_i = n$. As we showed in Chapter 4, their joint frequency function is

$$\Pr(Y_1 = y_1; Y_2 = y_2; \cdots; Y_m = y_m) = \frac{n!}{y_1!\,y_2!\cdots y_m!}\; p_1^{y_1} p_2^{y_2}\cdots p_m^{y_m},$$

where the yi are nonnegative integers whose sum is n.

Example 10.1. Suppose that the students in a calculus class of 45 can be regarded as a random sample from the much larger population of all calculus students. In the population of all calculus students, 15% make A’s, 25% make B’s, 30% make C’s, 15% make D’s and 15% either fail or drop. What is the probability that in this particular class there are 4 A’s, 9 B’s, 10 C’s, 12 D’s, and 10 F’s or W’s?

Solution: According to the formula, the probability is

$$\frac{45!}{4!\,9!\,10!\,12!\,10!}\; 0.15^{4}\, 0.25^{9}\, 0.30^{10}\, 0.15^{12}\, 0.15^{10}.$$

It is highly recommended that you use the R function ”dmultinom” for such calculations.

> dmultinom(c(4,9,10,12,10),size=45,prob=c(.15,.25,.30,.15,.15))

[1] 1.85789e-05
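As a check on the formula, the same probability can be computed term by term and compared with "dmultinom". This is a sketch with variable names of our own choosing. The multinomial coefficient is evaluated on the log scale with "lfactorial" to avoid very large intermediate numbers.

```r
y <- c(4, 9, 10, 12, 10)               # counts from Example 10.1
p <- c(0.15, 0.25, 0.30, 0.15, 0.15)   # population proportions

# log of n!/(y1! ... ym!), then add the log of p1^y1 ... pm^ym.
log_coef   <- lfactorial(sum(y)) - sum(lfactorial(y))
by_formula <- exp(log_coef + sum(y * log(p)))

by_dmultinom <- dmultinom(y, size = 45, prob = p)
all.equal(by_formula, by_dmultinom)
```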

10.1.1 Estimators and Hypothesis Tests for the Parameters

For each level or category i, the marginal distribution of the number of occurrences Yi is binomial, Yi ∼ Binom(n, pi). For large n, the natural estimator p̂i = Yi/n is approximately normal with mean pi and standard deviation $\sqrt{p_i(1-p_i)/n}$. Any of the methods described in Chapter 6 may be used to construct confidence intervals and hypothesis tests for the individual pi. Right now we are more interested in a measure of the overall accuracy with which all m estimators p̂1, · · · , p̂m approximate their target values p1, · · · , pm. One such measure is the weighted average squared relative error

$$\sum_{i=1}^{m} p_i\left(\frac{\hat{p}_i - p_i}{p_i}\right)^2. \tag{10.1}$$

In terms of the observations Yi, this is

$$\frac{1}{n}\sum_{i=1}^{m}\frac{(Y_i - np_i)^2}{np_i}.$$

The factor of 1/n is irrelevant for our purposes, so we ignore it. We also abbreviate the expected value E(Yi) = npi by Ei and rewrite the expression as

$$Q = \sum_{i=1}^{m}\frac{(Y_i - E_i)^2}{E_i}. \tag{10.2}$$

The following theorems and the tests derived from them are primarily due to Karl Pearson 1

Theorem 10.1. As n→∞, the distribution of Q approaches the chi-square distribution with m− 1 degrees of freedom, i.e., for large n, Q ∼ Chisq(df = m− 1), approximately.

To see how this theorem might lead to a hypothesis test, suppose that a null hypothesis specifies the values of p1, · · · , pm, while respecting the constraint $\sum_{i=1}^{m} p_i = 1$. If the estimated values p̂i are far from the hypothesized values pi as measured by (10.1), then this is evidence against H0. For a formal test at significance level α of H0 : p1, p2, · · · , pm equal to given numbers, against the many-sided alternative that at least one of the pi is not equal to its given value, reject H0 when

$$Q > \chi^2_{\alpha}(m-1),$$

where $\chi^2_{\alpha}(m-1)$ is the 100(1 − α)th percentile of Chisq(df = m − 1).

Example 10.2. Returning to Example 10.1, let us test the null hypothesis that the distribution of grades in the particular class is the population distribution of calculus grades. We will assume that the numbers in each grade category are the ones given in Example 10.1. The Yi are 4, 9, 10, 12, and 10. The expected values, assuming that the population proportions are the true parameters for this particular class, are E1 = 45 × 0.15 = 6.75, E2 = 45 × 0.25 = 11.25, E3 = 45 × 0.30 = 13.5, E4 = E5 = 45× 0.15 = 6.75. The observed value of Q is

$$Q_{obs} = \frac{(4-6.75)^2}{6.75} + \frac{(9-11.25)^2}{11.25} + \frac{(10-13.5)^2}{13.5} + \frac{(12-6.75)^2}{6.75} + \frac{(10-6.75)^2}{6.75} = 8.13.$$

1Karl Pearson 1857-1936. One of the founders of modern mathematical statistics. A student of Francis Galton.

Since under the null hypothesis Q ∼ Chisq(df = 4), the p-value is

> 1-pchisq(8.13,4)

[1] 0.08693054

So, the grade distribution in this class is significantly different from the combined grade distribution in all classes at significance level α = 0.10, but not at α = 0.05.
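The calculation in Example 10.2 can be repeated in R, both directly from (10.2) and with "chisq.test", whose argument p supplies the hypothesized probabilities. The sketch below uses variable names of our own. Because it carries Q at full precision rather than rounded to 8.13, its p-value differs slightly in the fourth decimal place from the one shown above.

```r
y  <- c(A = 4, B = 9, C = 10, D = 12, FW = 10)  # observed counts
p0 <- c(0.15, 0.25, 0.30, 0.15, 0.15)           # hypothesized probabilities

E <- sum(y) * p0                 # expected counts E_i = n * p_i
Q <- sum((y - E)^2 / E)          # statistic (10.2)
p_value <- pchisq(Q, df = length(y) - 1, lower.tail = FALSE)

gof <- chisq.test(y, p = p0)     # same test done by R
c(Q = Q, p_value = p_value)
```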

10.1.2 Multinomial Probabilities That Are Functions of Other Parameters

In many applications, the null hypothesis H0 does not completely specify the values of p1, · · · , pm. Instead it restricts their values by expressing them as functions pi(θ1, · · · , θk) of more fundamental unknown parameters θ1, · · · , θk, where k < m − 1. These must be well-behaved, regular functions, but that is seldom a matter of practical concern. A good example is the problem of testing the Hardy-Weinberg model of genetic equilibrium, which we discussed in Chapter 4. A particular gene has two alleles, designated A and a. Each individual has two copies of this gene, and thus has one of the genotypes AA, Aa, or aa. Let pAA, pAa, and paa denote the proportions of these genotypes in the population. In a random sample of size n from the population, let YAA, YAa, and Yaa denote the counts of the three genotypes. These random variables have a joint multinomial distribution.

Let θ denote the proportion of all the copies of the gene in the population that are allele A. It is easy to see that

$$\theta = p_{AA} + \frac{1}{2}\,p_{Aa}. \tag{10.3}$$

If the population is thoroughly mixed and breeding indiscriminately, then the pairing of gene copies in reproduction is random and the frequencies of the genotypes do not change over time. In this case, pAA = θ², pAa = 2θ(1 − θ), and paa = (1 − θ)². This is the null hypothesis if one intends to test whether or not this particular gene is in equilibrium in the population.

H0 : There is a number θ ∈ (0, 1) such that pAA = θ² and pAa = 2θ(1 − θ).

H1 : There is no such number.

If we reject H0, then we conclude that the population is not in genetic equilibrium.

In the general setting, when pi = pi(θ1, · · · , θk), we have to find estimates of the underlying parameters θ1, · · · , θk. These should be maximum likelihood estimates θ̂i, which maximize the multinomial log-likelihood function

$$l(\hat\theta_1, \cdots, \hat\theta_k) = \sum_{i=1}^{m} Y_i \log p_i(\hat\theta_1, \cdots, \hat\theta_k). \tag{10.4}$$

Fortunately, the maximum likelihood estimates often turn out to be common sense estimators. That is the case in the most important applications below. In the Hardy-Weinberg example, the maximum likelihood estimator of θ is the sample analog of (10.3),

$$\hat\theta = \frac{Y_{AA}}{n} + \frac{Y_{Aa}}{2n}. \tag{10.5}$$

If θ̂1, · · · , θ̂k are maximum likelihood estimates, let

Êi = npi(θ̂1, · · · , θ̂k)

be the corresponding estimate of E(Yi) = npi.

Theorem 10.2. As n→∞, the distribution of

$$\hat{Q} = \sum_{i=1}^{m}\frac{(Y_i - \hat{E}_i)^2}{\hat{E}_i} \tag{10.6}$$

approaches the chi square distribution with m− 1− k degrees of freedom.

Example 10.3. Suppose we have a random sample of 60 organisms and are able to determine the genotype of each one. Suppose YAA = 24, YAa = 12 and Yaa = 24. Can we conclude that the population is not in equilibrium?

Solution: The maximum likelihood estimate of θ is

$$\hat\theta = \frac{24}{60} + \frac{12}{120} = 0.5$$

and the estimated expected counts of the genotypes are

$$\hat{E}_{AA} = 60(.5)^2 = 15, \qquad \hat{E}_{Aa} = 60(2)(.5)(.5) = 30, \qquad \hat{E}_{aa} = 60(.5)^2 = 15.$$

So, the observed value of Q̂ is

$$\hat{Q}_{obs} = \frac{(24-15)^2}{15} + \frac{(12-30)^2}{30} + \frac{(24-15)^2}{15} = 21.6.$$

Q̂ has a chi square distribution with df = 3− 1− 1 = 1. The p-value is tiny,

> 1-pchisq(21.6,1)

[1] 3.358518e-06

so we definitely conclude that the population is not in equilibrium.
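The steps of Example 10.3 translate directly into R. The sketch below uses variable names of our own. Note that calling "chisq.test" with the estimated probabilities would use m − 1 = 2 degrees of freedom, since it does not know that θ was estimated from the data, so here the p-value is computed with "pchisq" and df = m − 1 − k = 1.

```r
y <- c(AA = 24, Aa = 12, aa = 24)   # genotype counts from Example 10.3
n <- sum(y)

# Maximum likelihood estimate of theta, equation (10.5).
theta_hat <- (y[["AA"]] + y[["Aa"]] / 2) / n

# Estimated expected counts under Hardy-Weinberg equilibrium.
E_hat <- n * c(AA = theta_hat^2,
               Aa = 2 * theta_hat * (1 - theta_hat),
               aa = (1 - theta_hat)^2)

Q_hat   <- sum((y - E_hat)^2 / E_hat)               # statistic (10.6)
p_value <- pchisq(Q_hat, df = 1, lower.tail = FALSE) # df = 3 - 1 - 1
c(theta_hat = theta_hat, Q_hat = Q_hat, p_value = p_value)
```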

10.1.3 Exercises

1. A six-sided die is thrown 50 times. The numbers of occurrences of each face are shown below.

Face     1   2   3   4   5   6
Count   12   5   9  11   6   7

Can you conclude that the die is not fair?

2. Look at the variable "payroll" in the data set "Payroll". It has 50 values, all between 100 and 400. Count the number of values in each of the intervals 100 – 160, 160 – 220, 220 – 280, 280 – 340, 340 – 400. Can you conclude that this data does not come from the uniform distribution on the interval (100,400)? (Hint: If the distribution is uniform, what is the expected count in each interval?) The counting can be done with R’s histogram function.

> paycounts=hist(Payroll$payroll, breaks=seq(100,400,60))$counts

> paycounts

Do the calculations by hand, except for finding the p-value. Then check your work by using R’s "chisq.test" function.

> chisq.test(paycounts)

3. Explain how this same method could be adapted to test the hypothesis that the data comes from any given continuous cumulative distribution F such that F (100) = 0 and F (400) = 1.

4. A sample of 60 animals from a given population was obtained. The genotype counts were YAA = 20, YAa = 30, Yaa = 10. Is the population in equilibrium?

5. In the Hardy-Weinberg example, the log-likelihood function (10.4) is

$$l(\hat\theta) = Y_{AA}\log\hat\theta^{2} + Y_{Aa}\log\big(2\hat\theta(1-\hat\theta)\big) + Y_{aa}\log(1-\hat\theta)^{2}.$$

Show that this is maximized by (10.5). Make use of properties of the logarithm.

6. You can make a graph depicting genetic equilibrium as follows.

> theta=seq(.01,.99,.01)

> pAA=theta^2

> pAa=2*theta*(1-theta)

> plot(pAA,pAa,type="l")

> paa=(1-theta)^2

> plot(pAA,paa,type="l")

Points on the curve represent genetic equilibrium. Points off the curve represent disequilibrium. Pick a few points off the curve and calculate where on the curve they end up after applying equation (10.3).

10.2 Testing Equality of Multinomial Probabilities

Suppose that r independent multinomial experiments are performed and that all the experiments have the same set of categories L1, · · · , Lm. Let Yij denote the number of occurrences of Lj in the ith experiment and let pij denote the probability of Lj in the ith experiment. Let ni denote the number of trials in the ith experiment. This information can be arranged in tabular form as

Experiment   L1    L2    · · ·   Lm    ni
1            Y11   Y12   · · ·   Y1m   n1
2            Y21   Y22   · · ·   Y2m   n2
⋮            ⋮     ⋮             ⋮     ⋮
r            Yr1   Yr2   · · ·   Yrm   nr

The ith row of the table gives the data for the ith multinomial experiment. The rows are independent of one another.

We are interested in testing the null hypothesis that the probability of Lj is the same for all r experiments. More precisely,

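$$
\begin{aligned}
H_0:\quad & p_{11} = p_{21} = p_{31} = \cdots = p_{r1} = \theta_1; \\
& p_{12} = p_{22} = p_{32} = \cdots = p_{r2} = \theta_2; \\
& \qquad\vdots \\
& p_{1m} = p_{2m} = p_{3m} = \cdots = p_{rm} = \theta_m,
\end{aligned}
\tag{10.7}
$$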

where θ1, θ2, · · · , θm are unknown positive numbers whose sum is 1.

For each j = 1, . . . , m, let $Y_{\cdot j} = \sum_{i=1}^{r} Y_{ij}$. Let $N = \sum_{i=1}^{r} n_i$. The log-likelihood function for the r combined experiments is the sum of their individual log-likelihoods,

$$l(\hat\theta_1, \ldots, \hat\theta_m) = \sum_{i=1}^{r}\sum_{j=1}^{m} Y_{ij}\log\hat\theta_j = \sum_{j=1}^{m} Y_{\cdot j}\log\hat\theta_j. \tag{10.8}$$

Given the constraint $\sum_{j=1}^{m}\hat\theta_j = 1$, this is maximized when

$$\hat\theta_j = \frac{Y_{\cdot j}}{N}.$$

This is a perfectly sensible estimator. It is just the combined proportion of occurrences of category Lj in all r experiments. The corresponding estimated expected value of Yij is

$$\hat{E}_{ij} = n_i\hat\theta_j = \frac{n_i Y_{\cdot j}}{N}.$$

Now let

$$\hat{Q} = \sum_{i=1}^{r}\sum_{j=1}^{m}\frac{\big(Y_{ij} - \hat{E}_{ij}\big)^2}{\hat{E}_{ij}}. \tag{10.9}$$

By a slight extension of Theorem 10.2, as all ni → ∞, the distribution of Q̂ approaches chi square. The number of degrees of freedom is equal to the number of free parameters in the full model, without assuming the null hypothesis, minus the number of free parameters under the null hypothesis. The number of free parameters in the unrestricted model is r × (m − 1), while under the null hypothesis it is m − 1, a difference of (r − 1)(m − 1). Therefore, if all ni are large,

Q̂ ∼ Chisq(df = (r − 1)(m− 1)).

To test the null hypothesis (10.7) at significance level α, reject H0 if

$$\hat{Q} > \chi^2_{\alpha}\big((r-1)(m-1)\big).$$

Example 10.4. A certain course had 3 sections last semester. The observed counts of their grades are shown below. Can we conclude that the probabilities of the grade categories are actually different, perhaps because the 3 instructors have different standards, or could the apparent differences be due merely to chance?

            A    B    C   DFW
class 1     7   17   17    22
class 2    17   14   11    15
class 3    13   14   11    13

            A    B    C   DFW   Sum
class 1     7   17   17    22    63
class 2    17   14   11    15    57
class 3    13   14   11    13    51
Sum        37   45   39    50   171

The second table is the same as the first, except that the rows and columns have been summed. This gives us the values ni and $Y_{\cdot j}$ in the discussion above. The number N is the grand total in the lower right corner, N = 171. For given i and j, the estimated expected count

$$\hat{E}_{ij} = \frac{n_i Y_{\cdot j}}{N}$$

is the sum of row i times the sum of column j divided by the grand total N, e.g.,

$$\hat{E}_{11} = \frac{63\times 37}{171} = 13.63, \qquad \hat{E}_{12} = \frac{63\times 45}{171} = 16.58, \qquad \hat{E}_{34} = \frac{51\times 50}{171} = 14.91.$$

$$\hat{Q} = \frac{(7-13.63)^2}{13.63} + \frac{(17-16.58)^2}{16.58} + \cdots + \frac{(13-14.91)^2}{14.91} = 7.38$$

The p-value is

> 1-pchisq(7.38,6)

[1] 0.2871293

Thus, there is little evidence that the grade probabilities differ among the 3 instructors.

In R, the procedure is simple.

Pearson’s Chi-squared test

X-squared = 7.3753, df = 6, p-value = 0.2875
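The command that produced this output is not shown above. One way to obtain the same statistic, sketched here with a matrix name (grades) of our own choosing, is to enter the counts of Example 10.4 as a matrix and pass it to "chisq.test". The component "expected" of the result holds the estimated expected counts Êij.

```r
# Observed grade counts from Example 10.4 (rows = sections, columns = grades).
grades <- matrix(c( 7, 17, 17, 22,
                   17, 14, 11, 15,
                   13, 14, 11, 13),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("class 1", "class 2", "class 3"),
                                 c("A", "B", "C", "DFW")))

res <- chisq.test(grades)
round(res$expected, 2)   # row total x column total / grand total
res                      # statistic, degrees of freedom, p-value
```

The statistic here is computed from unrounded expected counts, which is why it differs slightly from the hand value 7.38.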

10.3 Independence of Attributes: Contingency Tables

Let N randomly sampled members of a large population be cross-classified according to two factors or attributes A and B. For example, A might be income level, low, medium or high and B might be political party affiliation, Whig, Free Soil, Know Nothing or Other. The question of interest is whether these factor variables are independent or not. In general, let us designate the levels of A as i = 1, 2, · · · , r and the levels of B as j = 1, 2, · · · ,m. No particular ordering of levels is implied.

Let Yij be the number of individuals in the sample that have attribute A at level i and attribute B at level j. Let

$$Y_{\cdot j} = \sum_{i=1}^{r} Y_{ij}, \qquad Y_{i\cdot} = \sum_{j=1}^{m} Y_{ij}.$$

Then

$$\sum_{i=1}^{r}\sum_{j=1}^{m} Y_{ij} = \sum_{i=1}^{r} Y_{i\cdot} = \sum_{j=1}^{m} Y_{\cdot j}.$$

The joint distribution of the r × m variables Yij is multinomial, with N trials and with outcome probabilities

pij = Pr(A = i;B = j).

Note that there are rm − 1 free parameters. Let $p_{i\cdot}$ = Pr(A = i) and $p_{\cdot j}$ = Pr(B = j). A and B are independent if and only if

$$H_0: p_{ij} = p_{i\cdot}\,p_{\cdot j} \text{ for all } i \text{ and } j$$

is true. In the restricted model, as specified by H0, there are r − 1 free choices for the $p_{i\cdot}$ parameters and m − 1 free choices for the $p_{\cdot j}$ parameters, for a total of r + m − 2. The difference is

(rm− 1)− (r +m− 2) = (r − 1)(m− 1).

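The maximum likelihood estimators of the parameters $p_{i\cdot}$ and $p_{\cdot j}$ are just the natural sample frequency estimators

$$\hat{p}_{i\cdot} = \frac{Y_{i\cdot}}{N}, \qquad \hat{p}_{\cdot j} = \frac{Y_{\cdot j}}{N},$$

and so the estimated expected counts at the combinations of factor levels are

$$\hat{E}_{ij} = N\hat{p}_{i\cdot}\hat{p}_{\cdot j} = \frac{Y_{i\cdot}Y_{\cdot j}}{N}.$$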

Notice that Êij is row total × column total divided by grand total, the same formula that was used in testing equality of several multinomial distributions. In fact, the procedure for testing for independence is exactly the same as for testing equality of multinomial parameters. For large N the distribution of Q̂ in (10.9) is approximately chi square with (r − 1)(m − 1) degrees of freedom. If the p-value for the observed value of Q̂ is too small we conclude that factors A and B are dependent.

A tabular layout such as

            A    B    C   DFW
class 1     7   17   17    22
class 2    17   14   11    15
class 3    13   14   11    13

is called a contingency table, and whether testing for equality of multinomial parameters or testing for independence of attributes, the chi-square test is called a contingency table analysis. The difference is merely one of emphasis.

Obviously, more than two factors could be in play. If so, the contingency table would have three or more dimensions. The extension of the procedure is straightforward.

Example 10.5. The data set "Titanic" included with R has a cross classification of 2201 passengers on the Titanic, classified according to sex, class of accommodations, adult or child, and survived or did not survive. Read about the data:

> help(Titanic)

We will test whether survival rates for males and females are equal or different. We could do this by testing for equality of proportions the way we did in Chapter 9, but instead we will create a 2 × 2 contingency table and apply the procedure above.

> margin.table(Titanic,c(2,4))

Survived

Sex No Yes

Male 1364 367

Female 126 344

> chisq.test(.Last.value,correct=F)

Pearson’s Chi-squared test

data: .Last.value

X-squared = 456.8742, df = 1, p-value < 2.2e-16

Obviously, sex and survival status are not independent. Factor 2 is "Sex" and factor 4 is "Survived". The argument "c(2,4)" to the function "margin.table" tells R to sum the entries in the 4 dimensional array along all the variables but these two. The argument "correct=F" to "chisq.test" prevents R from applying the Yates continuity correction so that the answer will agree with the answer derived from the procedure described above.
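To see that the uncorrected "chisq.test" result is exactly the statistic (10.9), the expected counts can be formed explicitly as row total × column total divided by grand total. This is a sketch with variable names of our own. It stores the marginal table in a variable instead of relying on ".Last.value".

```r
tab <- margin.table(Titanic, c(2, 4))   # Sex by Survived
N   <- sum(tab)

# Estimated expected counts under independence.
E <- outer(rowSums(tab), colSums(tab)) / N

Q  <- sum((tab - E)^2 / E)                    # statistic (10.9)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_value <- pchisq(Q, df, lower.tail = FALSE)

from_R <- chisq.test(tab, correct = FALSE)    # same test without Yates correction
c(by_hand = Q, chisq.test = unname(from_R$statistic))
```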

Sometimes the data is not already cross tabulated. Instead it may be presented in the form of a text file, a spreadsheet, or an R data frame that lists individual cases. If so, the "table" function in R will convert it to suitable input for "chisq.test". Data from the Montana outlook poll conducted by the University of Montana is included in the course data folder at

http://www.math.uh.edu/ charles/data/Montana.txt

For 209 randomly selected residents it lists age group, sex, income group, political affiliation, region of residence, personal financial outlook, and opinion about state outlook. All of these are categorical variables, although SEX is coded as 0 for male and 1 for female. Hence it will be imported into R as a numeric variable. If the imported data frame is named "Montana", this can be fixed by

> Montana$SEX=factor(Montana$SEX,labels=c("m","f"))

After making this change, the summary of all the variables looks like this:

> summary(Montana)

AGE SEX INC POL AREA FIN

<35 :72 f:102 <20K :47 Dem :84 NE:58 better:71

>=55 :70 m:107 >35K :60 Ind :40 SE:78 same :76

35-54:66 20-35K:83 Rep :78 W :73 worse :61

NA’s : 1 NA’s :19 NA’s: 7 NA’s : 1

STAT

better :118

no better: 63

NA’s : 28

We will tabulate political affiliation and area of residence and test the null hypothesis that these two attributes are independent,

> attach(Montana)

> table(POL,AREA)

AREA

POL NE SE W

Dem 15 30 39

Ind 12 16 12

Rep 30 31 17

> chisq.test(.Last.value)

Pearson’s Chi-squared test

data: .Last.value

X-squared = 13.849, df = 4, p-value = 0.007793

From the p-value we conclude that they are not independent. The west is blue, the northeast red, and the southeast purple.
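The regional pattern can be read off the Pearson residuals (observed minus expected, divided by the square root of expected), which "chisq.test" returns in its "residuals" component. The sketch below re-enters the counts from the table above as a matrix, with a name (polarea) of our own choosing, so that it does not depend on the Montana data file.

```r
# Counts copied from the table(POL, AREA) output above.
polarea <- matrix(c(15, 30, 39,
                    12, 16, 12,
                    30, 31, 17),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(POL  = c("Dem", "Ind", "Rep"),
                                  AREA = c("NE", "SE", "W")))

res <- chisq.test(polarea)
res$statistic             # should match X-squared in the output above
round(res$residuals, 2)   # positive: more respondents than independence predicts
```

Positive residuals for Democrats in the west and Republicans in the northeast, with small residuals in the southeast, correspond to the description above.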

10.3.1 Exercises

1. Carry out the calculations of the Titanic example by hand.

2. Use "margin.table" to get the Titanic marginal table for factors 1 (class of accommodations) and 4 (survival). Apply the chi square test to see if class and survival are dependent. Do it both by hand and with "chisq.test" in R.

3. With the Montana survey data, tabulate other pairs of variables and test for independence.

Chapter 11

Miscellaneous Topics

11.1 Multiple Linear Regression

In Chapter 8 we studied simple linear regression, in which the expected value of the response random variable Y depends linearly on a single predictor variable X:

E(Y |X = x) = β0 + β1x.

In multiple linear regression we allow multiple predictor variables X1, · · · , Xk and the expected response for given values of X1, · · · , Xk is

$$E(Y \mid X_1 = x_1, X_2 = x_2, \cdots, X_k = x_k) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k, \tag{11.1}$$

where β0, β1, · · · , βk are unknown real numbers, the k + 1 regression coefficients.

As in simple linear regression, we assume that the variance of the response Y is constant, independent of the values x1, · · · , xk:

var(Y |X1 = x1, · · · , Xk = xk) = σ2, (11.2)

where σ2 is an unknown positive constant.

The data in a multiple regression experiment arises from n independent observations of the response Y corresponding to n possibly different values of each of the predictor variables X1, · · · , Xk. In tabular form it would look something like this.

Y     X1    X2    · · ·   Xk
Y1    x11   x12   · · ·   x1k
⋮     ⋮     ⋮             ⋮
Yn    xn1   xn2   · · ·   xnk

Here xij is the ith value of the design variable Xj.

Given estimates β̂0, β̂1, · · · , β̂k of the regression coefficients, define the ith prediction error or residual as

$$e_i = Y_i - \hat\beta_0 - \sum_{j=1}^{k}\hat\beta_j x_{ij}.$$

As in simple linear regression, we estimate the regression coefficients by the method of least squares. That is, we choose β̂0, β̂1, · · · , β̂k to minimize the residual sum of squares

$$\mathrm{SS(resid)} = \sum_{i=1}^{n} e_i^2.$$

The least squares estimates satisfy a linear system of p = k + 1 equations. The system can have a unique solution only if n ≥ p, which we shall henceforth assume. Deriving the system of equations and their solution requires a background in linear algebra, so we will omit it. Complete derivations can be found in many textbooks 1.
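Although the derivation is omitted here, the linear system itself is easy to state in matrix form: if X is the n × p matrix whose first column is all ones and whose remaining columns hold the predictor values, the least squares estimates solve (X'X)β̂ = X'Y, the so-called normal equations. The sketch below solves this system for simulated data (all values and names are hypothetical) and compares the solution with the coefficients reported by "lm".

```r
set.seed(2)                       # arbitrary seed, for reproducibility only
n  <- 30
x1 <- rnorm(n)                    # hypothetical predictor values
x2 <- runif(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5)   # made-up true coefficients

X <- cbind(1, x1, x2)             # n x p design matrix, p = k + 1 = 3

# Solve the normal equations (X'X) b = X'y.
beta_hat <- solve(crossprod(X), crossprod(X, y))

fit <- lm(y ~ x1 + x2)
cbind(normal_eq = drop(beta_hat), lm = coef(fit))
```

In practice "lm" uses a numerically more stable method (a QR decomposition) rather than forming X'X, but the two approaches agree for well-conditioned data such as these.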

11.1.1 Inferences Based on Normality

For the rest of this chapter, we will assume that the conditional distribution of Y, given the values of X1, · · · , Xk, is normal, with mean given by (11.1) and constant variance (11.2).

Definition 11.1. The predicted or fitted value of Yi is

$$\hat{Y}_i = \hat\beta_0 + \sum_{j=1}^{k} x_{ij}\hat\beta_j.$$

The total sum of squares is

$$\mathrm{SS(tot)} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2.$$

The regression sum of squares is

$$\mathrm{SS(regr)} = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2.$$

The residual sum of squares is

$$\mathrm{SS(resid)} = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2.$$

The proof of the following theorem requires linear algebra for a complete understanding. See the book by Montgomery, Peck and Vining previously cited.

Theorem 11.1. 1. SS(tot) = SS(regr) + SS(resid).

Under the assumption of normality,

1E.g., Introduction to Linear Regression Analysis, 5th Ed. by Montgomery, Peck and Vining, Wiley 2012

2. Each of the estimated regression coefficients β̂j has a normal distribution with mean βj . The standard deviation of β̂j depends only on σ = √σ2 and a known function of all of the values {xil}.

3. SS(regr) and SS(resid) are independent random variables. Moreover,

$$\frac{\mathrm{SS(resid)}}{\sigma^2} \sim \mathrm{Chisq}(df = n - p).$$

If β1 = β2 = · · · = βk = 0, then

$$\frac{\mathrm{SS(regr)}}{\sigma^2} \sim \mathrm{Chisq}(df = k).$$

4. If MS(regr) = SS(regr)/k and MS(resid) = SS(resid)/(n − p) and β1 = · · · = βk = 0, then

$$F = \frac{\mathrm{MS(regr)}}{\mathrm{MS(resid)}} \tag{11.3}$$

has the F distribution with k degrees of freedom in the numerator and n − p degrees of freedom in the denominator: F ∼ FDist(k, n − p).

5. If in the expression for the standard deviation of β̂j we replace the unknown σ with S = √MS(resid), the resulting value is the standard error of β̂j , se(β̂j), and

$$\frac{\hat{\beta}_j - \beta_j}{\mathrm{se}(\hat{\beta}_j)} \sim t(df = n - p), \tag{11.4}$$

the student-t distribution with n − p degrees of freedom.
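The quantities in Definition 11.1 and Theorem 11.1 can be computed directly from a fitted model. The following is a minimal sketch, assuming R's built-in mtcars data as a stand-in (it is not one of the data sets used in this text); the variable names are chosen only for illustration.

```r
# Illustration on R's built-in mtcars data (an assumption; not the text's data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

y    <- mtcars$mpg
yhat <- fitted(fit)
n    <- length(y)
k    <- 2          # number of predictor variables
p    <- k + 1      # number of regression coefficients

ss_tot   <- sum((y - mean(y))^2)      # SS(tot)
ss_regr  <- sum((yhat - mean(y))^2)   # SS(regr)
ss_resid <- sum((y - yhat)^2)         # SS(resid)

# Part 1 of the theorem: SS(tot) = SS(regr) + SS(resid)
all.equal(ss_tot, ss_regr + ss_resid)

# Part 4: the F statistic (11.3) and its p-value under beta_1 = ... = beta_k = 0
ms_regr  <- ss_regr / k
ms_resid <- ss_resid / (n - p)
f_stat   <- ms_regr / ms_resid
p_value  <- pf(f_stat, df1 = k, df2 = n - p, lower.tail = FALSE)

# Compare the hand computation with the value stored by summary()
c(by_hand = f_stat, from_summary = unname(summary(fit)$fstatistic["value"]))
```

The hand-computed F statistic should agree with the "F-statistic" line that summary() prints for the same fit.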

11.1.2 Using R’s ”lm” Function for Multiple Regression

The computations involved in multiple regression problems are virtually impossible without computer help. The principal tool in R for multiple regression is the function ”lm”. We will illustrate its use by fitting a linear model to the data ”nlschools”. 2 The response variable Y is the score on a language test administered to 200 school children in the Netherlands. The predictor variables are verbal IQ (VerbIQ), class size, and a numeric measure of socioeconomic status (SocEconStatus). Below is R’s output.

> summary(nlschools.lm)

Call:

lm(formula = Language ~ VerbIQ + ClassSize + SocEconStatus,

data = nlschools)

Residuals:

Min 1Q Median 3Q Max

-21.8839 -4.7511 0.1737 5.6190 16.3403

2Snijders, T. A. B. and Bosker, R. J. (1999) Multilevel Analysis. An Introduction to Basic and Advanced Multilevel Modelling. London: Sage.

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.16999 3.83467 1.870 0.063 .

VerbIQ 2.60929 0.27116 9.623 <2e-16 ***

ClassSize -0.01034 0.09123 -0.113 0.910

SocEconStatus 0.01973 0.05978 0.330 0.742

Residual standard error: 8.078 on 196 degrees of freedom

Multiple R-squared: 0.3574, Adjusted R-squared: 0.3475

F-statistic: 36.33 on 3 and 196 DF, p-value: < 2.2e-16

In the "Coefficients" section of the output the estimated regression coefficients β̂j are listed, labelled with the names of the predictor variables they are associated with. The next column gives their standard errors se(β̂j). Next come the values of the student-t test statistics (11.4) for testing the individual null hypotheses H0 : βj = 0. Finally, the p-values of the test statistics are given. In this example, only the estimated coefficient of VerbIQ is significantly different from 0, and it is highly significant. There is no reason to conclude that ClassSize and SocEconStatus have any predictive power for the language test score when VerbIQ is included in the model.

”F-statistic” is the value of F in (11.3) and is the test statistic for the null hypothesis that all the regression coefficients are equal to 0. Clearly, we can reject that hypothesis in this example. ”Multiple R-squared” and ”Adjusted R-squared” are defined by the equations

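$$1 - R^2 = \frac{\mathrm{SS(resid)}}{\mathrm{SS(tot)}}, \qquad 1 - R^2_{adj} = \frac{\mathrm{MS(resid)}}{\mathrm{MS(tot)}},$$

where

$$\mathrm{MS(tot)} = \frac{\mathrm{SS(tot)}}{n - 1}.$$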

R2 is interpreted in the same way as in simple linear regression. It is the fraction of the total squared variation in the response Y accounted for by the linear relationship and the variation in the predictor variables. If additional predictor variables are added to the model equation, the value of R2 never decreases and almost always increases, indicating greater success at predicting the observed values Yi. However, at some point, the improvement may be so small that the cost of increasing the complexity of the model is not justified. Adjusted R2 does not always increase when additional terms are included. Typically, it begins to decrease when more terms are included than are needed. For this reason, it is a useful criterion for deciding when to stop adding terms. To illustrate, we will refit the model above with only VerbIQ as a predictor.

> summary(nlschools.lm2)

Call:

lm(formula = Language ~ VerbIQ, data = nlschools)

Residuals:

Min 1Q Median 3Q Max

-21.8126 -4.6522 0.3191 5.6919 16.2251

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.038 2.902 2.425 0.0162 *

VerbIQ 2.642 0.252 10.484 <2e-16 ***

Residual standard error: 8.04 on 198 degrees of freedom

Multiple R-squared: 0.357, Adjusted R-squared: 0.3537

F-statistic: 109.9 on 1 and 198 DF, p-value: < 2.2e-16

Notice that the smaller model has a smaller value of R2 but a larger value of adjusted R2.
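To see how the two measures are computed from the sums of squares, here is a short sketch. It assumes R's built-in mtcars data rather than the school data, and r2_by_hand is a hypothetical helper written only for this illustration.

```r
# Hypothetical helper: R-squared and adjusted R-squared from the sums of squares.
r2_by_hand <- function(fit) {
  y <- model.response(model.frame(fit))   # observed responses
  n <- length(y)
  p <- length(coef(fit))                  # number of regression coefficients
  ss_resid <- sum(residuals(fit)^2)
  ss_tot   <- sum((y - mean(y))^2)
  c(r2     = 1 - ss_resid / ss_tot,
    adj_r2 = 1 - (ss_resid / (n - p)) / (ss_tot / (n - 1)))
}

# Two nested models on the built-in mtcars data (illustration only)
big   <- lm(mpg ~ wt + hp + qsec, data = mtcars)
small <- lm(mpg ~ wt + hp, data = mtcars)

rbind(big = r2_by_hand(big), small = r2_by_hand(small))

# The same values as stored by summary()
c(summary(big)$r.squared, summary(big)$adj.r.squared)
```

Because the smaller model is nested in the larger one, its R2 cannot exceed that of the larger model; the adjusted values may be ordered either way.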

The ”confint” function works for multiple regression just as it does for simple linear regression to give confidence intervals for the regression coefficients.

> confint(nlschools.lm,level=.90)

5 % 95 %

(Intercept) 0.8325551 13.5074157

VerbIQ 2.1611647 3.0574217

ClassSize -0.1611106 0.1404393

SocEconStatus -0.0790690 0.1185373
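The intervals returned by "confint" follow from (11.4): each is the estimate plus or minus a student-t quantile times the standard error. A minimal sketch, again assuming the built-in mtcars data as a stand-in:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # built-in data, for illustration only

est <- coef(fit)                   # estimated coefficients
se  <- sqrt(diag(vcov(fit)))       # their standard errors
df  <- df.residual(fit)            # n - p
tq  <- qt(0.95, df)                # a 90% interval leaves 5% in each tail

by_hand <- cbind(lower = est - tq * se, upper = est + tq * se)
by_hand

# Should agree with R's own computation
confint(fit, level = 0.90)
```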

The ”predict” function works for multiple regression as well.

> predict(nlschools.lm,newdata=data.frame(VerbIQ=10,ClassSize=30,

+ SocEconStatus=12),interval="confidence")

fit lwr upr

1 33.18966 31.19966 35.17966

A visual assessment of how well the model fits the data can be obtained by plotting the fitted values Ŷi on the horizontal axis and the observed values Yi on the vertical axis.

> nlschools.lm=lm(Language~VerbIQ+ClassSize+SocEconStatus,data=nlschools)

> plot(fitted(nlschools.lm),nlschools$Language)

> abline(0,1)

[Figure: scatterplot of the observed values nlschools$Language (vertical axis) against the fitted values fitted(nlschools.lm) (horizontal axis), with the line of intercept 0 and slope 1 superimposed.]

11.1.3 Factor Variables as Predictors

Suppose X is a factor variable with only two levels L0 and L1. Code the values of X numerically as 0 for L0 and 1 for L1. Let Y be a numeric variable whose distribution depends on the value of X. Consider the simple linear regression equation

E(Y |X = x) = β0 + β1x.

Under the usual assumptions for linear regression, the values of Y are grouped as independent samples from two normal populations having the same variance, one corresponding to X = 0 with mean

E(Y |X = 0) = β0,

and the other corresponding to X = 1 with mean

E(Y |X = 1) = β0 + β1.

Therefore, the parameter β1 is the difference between the population means. If we want to test hypotheses or find confidence intervals for the difference in means, we can use the methods of simple linear regression to make inferences about β1. We already have a method for two-sample problems from Chapter 9, the two-sample t test with equal variances. In fact, the two sample t test with equal variances and the regression approach are mathematically equivalent. To illustrate we will revisit the ”lungcap” data set to see how the distributions of the variable ”fev” (forced expiratory volume) depend on whether the subject is a smoker or not. Using the two sample t-test, the results are:

> t.test(fev~smoke,var.equal=T,data=lungcap)

Two Sample t-test

data: fev by smoke

t = 1.7101, df = 83, p-value = 0.09098

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-0.01995121 0.26468836

sample estimates:

mean in group no mean in group yes

3.746740 3.624371

Using the regression approach,

> fev.lm=lm(fev~smoke,data=lungcap)

> summary(fev.lm)

Call:

lm(formula = fev ~ smoke, data = lungcap)

Residuals:

Min 1Q Median 3Q Max

-0.87474 -0.20437 0.01363 0.19526 0.76826

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.74674 0.04592 81.60 <2e-16 ***

smokeyes -0.12237 0.07155 -1.71 0.091 .

Residual standard error: 0.3247 on 83 degrees of freedom

Multiple R-squared: 0.03404, Adjusted R-squared: 0.0224

F-statistic: 2.925 on 1 and 83 DF, p-value: 0.09098

> confint(fev.lm)

2.5 % 97.5 %

(Intercept) 3.6554150 3.83806503

smokeyes -0.2646884 0.01995121

As shown here, it is not necessary to actually carry out the numeric 0-1 coding of the factor levels. R automatically does that internally. It chooses one level as the base level and the other is compared to it. The results from ”lm” are the same as those from ”t.test” except for the sign of the difference in means. For unordered factors X with more than two levels, we have already observed in Chapter 10 that ”lm” gives results equivalent to analysis of variance. For ordered factors, the interpretation of the estimated coefficients returned by ”lm” is quite different and we shall not discuss it here.
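The equivalence can be checked on any two-group data set. The sketch below assumes the built-in mtcars data, with transmission type (am) as the two-level factor; it is an illustration, not a reanalysis of the "lungcap" data.

```r
# Built-in mtcars data; am is recoded as a two-level factor (illustration only).
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

tt  <- t.test(mpg ~ am, data = cars, var.equal = TRUE)
fit <- lm(mpg ~ am, data = cars)

# Row of the coefficient table for the non-base level "manual"
slope <- summary(fit)$coefficients["ammanual", ]

# Same t statistic up to sign, and the same p-value
c(t_test = unname(tt$statistic), lm = unname(slope["t value"]))
c(t_test = tt$p.value,           lm = unname(slope["Pr(>|t|)"]))

# The confidence intervals are mirror images of each other
tt$conf.int
confint(fit)["ammanual", ]
```

The sign difference arises because "t.test" reports the first group minus the second, while "lm" reports the non-base level minus the base level.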

A more interesting kind of problem is one in which the expected response is modeled as a linear function of several predictors, some numeric and others factor variables.

Example 11.1. In the Netherlands school data, let us treat the response "Language" as a linear function of the numeric predictor "SocEconStatus" and also of the two-level factor variable "CombGrades", which is an indicator of whether the student was taught in a classroom with combined grades or not. "CombGrades" is already coded as 0 for no and 1 for yes.

> summary(nlschools.lm1)

Call:

lm(formula = Language ~ SocEconStatus + CombGrades, data = nlschools)

Residuals:

Min 1Q Median 3Q Max

-26.4304 -6.3586 0.8444 7.2351 22.1529

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 32.17093 1.84627 17.425 < 2e-16 ***

SocEconStatus 0.26762 0.06668 4.014 8.49e-05 ***

CombGrades -4.76910 1.36909 -3.483 0.00061 ***

Residual standard error: 9.49 on 197 degrees of freedom

Multiple R-squared: 0.1085, Adjusted R-squared: 0.09941

F-statistic: 11.98 on 2 and 197 DF, p-value: 1.227e-05

The estimated mean Language score for students not in combined classrooms exceeds the estimated mean score for students in combined classrooms by 4.76910 for all values of SocEconStatus. The fitted lines for the two groups of students are parallel, separated by a vertical distance of 4.76910.

> plot(Language~SocEconStatus,data=nlschools)

> abline(32.17093,0.26762,col=1)

> abline(32.17093-4.76910,0.26762,col=2)

> legend(x=35,y=15,legend=c("Not combined","Combined"),fill=c(1,2))

[Figure: Language (vertical axis) versus SocEconStatus (horizontal axis) with the two parallel fitted lines. Legend: Not combined, Combined.]

The fitted lines for the two groups in the preceding example are parallel because our model formula specified that there be no interaction between SocEconStatus and CombGrades. In other words, the expected effect of being in a combined classroom is the same for all values of socioeconomic status. It is certainly conceivable that being in a combined classroom could affect the rate at which increased status leads to increased language comprehension. In that case, there would be an interaction between SocEconStatus and CombGrades. Let us temporarily rename these variables X1 and X2 and the response Y . Consider the model regression equation

$$E(Y \mid X_1 = x_1; X_2 = x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \gamma x_1 x_2,$$

where γ is another unknown constant. The term γx1x2 is called an interaction term.

When X2 = 0 we have

$$E(Y \mid X_1 = x_1; X_2 = 0) = \beta_0 + \beta_1 x_1,$$

whereas when X2 = 1,

$$E(Y \mid X_1 = x_1; X_2 = 1) = (\beta_0 + \beta_2) + (\beta_1 + \gamma) x_1.$$

Therefore, the parameter γ is the difference between the expected rate of change of Y with respect to x1 when X2 = 1, i.e., when students are in a combined classroom, and the rate when X2 = 0. The output from R with the interaction model is

> summary(nlschools.lm)

Call:

lm(formula = Language ~ SocEconStatus * CombGrades, data = nlschools)

Residuals:

Min 1Q Median 3Q Max

-26.709 -6.876 1.274 6.504 20.603

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 34.82266 2.36785 14.706 < 2e-16 ***

SocEconStatus 0.15743 0.09087 1.732 0.08476 .

CombGrades -10.91000 3.72008 -2.933 0.00376 **

SocEconStatus:CombGrades 0.23578 0.13292 1.774 0.07764 .

Residual standard error: 9.439 on 196 degrees of freedom

Multiple R-squared: 0.1225, Adjusted R-squared: 0.1091

F-statistic: 9.124 on 3 and 196 DF, p-value: 1.109e-05

The coefficients whose estimates are given in this summary are, in order from the top, β0, β1, β2, and γ. With a p-value of 7.8%, the estimated value of γ is marginally significantly different from 0.
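The way the four coefficients combine into two separate lines can be sketched as follows. The example assumes the built-in mtcars data, with the 0-1 variable am playing the role of CombGrades and wt the role of the numeric predictor; the object names are illustrative.

```r
# Interaction model on built-in data; am is already coded 0/1 (illustration only).
fit <- lm(mpg ~ wt * am, data = mtcars)
b   <- coef(fit)          # names: (Intercept), wt, am, wt:am

# Line for am = 0: intercept beta0, slope beta1
line0 <- c(intercept = b[["(Intercept)"]],             slope = b[["wt"]])
# Line for am = 1: intercept beta0 + beta2, slope beta1 + gamma
line1 <- c(intercept = b[["(Intercept)"]] + b[["am"]], slope = b[["wt"]] + b[["wt:am"]])

# With a full interaction, each line coincides with a separate
# simple linear regression fitted within its own group.
sep0 <- coef(lm(mpg ~ wt, data = mtcars, subset = am == 0))
sep1 <- coef(lm(mpg ~ wt, data = mtcars, subset = am == 1))

rbind(line0, sep0, line1, sep1)
```

The point estimates match the group-by-group fits, but the interaction model pools both groups to estimate a single residual variance, so standard errors generally differ from those of the separate fits.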

> plot(Language~SocEconStatus,data=nlschools)

> abline(34.82266,0.15743,col=1)

> abline(34.82266-10.91000,0.15743+0.23578,col=2)

> legend(x=35,y=15,legend=c("Not combined","Combined"),fill=c(1,2))

[Figure: Language (vertical axis) versus SocEconStatus (horizontal axis) with the two fitted lines from the interaction model, which have different slopes. Legend: Not combined, Combined.]

11.1.4 Exercises

1. The airquality data has missing values in some of the variables. Eliminate those records from the data set with the command

> airquality2=airquality[complete.cases(airquality), ]

With the airquality2 data, fit a multiple linear regression model with Ozone as the response and Solar.R and Wind as predictor variables. Which variables contribute significantly to ozone levels?

2. Find a 95% confidence interval for the expected value of Ozone when Solar.R=300 and Wind=15. Find 95% confidence intervals for the regression coefficients.

3. Make a scatterplot of observed values of Ozone vs. fitted values, with fitted values on the horizontal axis and observed values on the vertical axis. Superimpose the line with intercept 0 and slope 1.

4. With the data ”nlschools” fit a multiple linear regression model with Language as the response and ClassSize, SocEconStatus and CombGrades as predictor variables. Allow interaction between SocEconStatus and CombGrades by using the model formula

The asterisk * separating two terms in a model formula means to include additive effects as well as interactions between the variables.

Interpret the results. How does the expected value of Language depend on ClassSize and SocEconStatus when CombGrades=0? When CombGrades=1?

11.2 Nonparametric Methods

The families of distributions studied in this course up to now are parametric families. That is, individual members of the family are singled out by giving the values of a few parameters. For example, the family of normal distributions is parametric because the value of the mean and standard deviation completely determine which normal distribution is meant. Other parametric families are the binomial distributions, the Poisson distributions, the gamma distributions, Weibull distributions, and so on. Except for large sample procedures for a population mean, the inference procedures we have studied so far are primarily inferences (hypothesis tests or confidence intervals) for parametric families. In this section we introduce some methods that make almost no assumptions about the underlying distributions, except that they are continuous.

11.2.1 The Signed Rank Test

The Wilcoxon 3 signed rank test and rank sum test utilize the ranks of sample values rather than the data values themselves. The signed rank test is designed to test the hypothesis that a continuous distribution is symmetric about a certain value and also to find a confidence interval for the center of symmetry.

Definition 11.2. The distribution of a random variable X is symmetric about 0 if −X and X have the same distribution. If the cumulative distribution F of X is a continuous function, this is equivalent to

F (−x) = 1− F (x) (11.5)

for all real numbers x. The distribution of X is symmetric about a number ∆ if X −∆ is symmetric about 0, that is, if

F (∆− x) = 1− F (∆ + x) (11.6)

for all x.

If a distribution is symmetric about ∆, then ∆ is a median. If the distribution has a mean, then ∆ is its mean. If it has a continuous density function f , then f(∆− x) = f(∆ + x) for all real x.

3Frank Wilcoxon, 1892-1965

Let x1, x2, · · · , xn be distinct numbers. The rank of xi is 1 if it is the smallest of these n numbers. rank(xi) = 2 if xi is the next smallest, rank(xi) = 3 if xi is the third smallest. Finally, rank(xi) = n if xi is the largest of the numbers. Below is a list of 10 numbers and beneath it is the list of their comparative ranks.

> round(rexp(10),digits=3)

 0.083 0.916 0.369 0.009 1.422 1.065 0.439 0.571 0.004 3.904

> rank(.Last.value)

 3 7 4 2 9 8 5 6 1 10

Let X1, X2, · · · , Xn be a random sample from a continuous distribution F . Find the comparative ranks of their absolute values and let rank(|Xi|) denote the rank of |Xi|. Let sgn(Xi) = 1 if Xi > 0 and sgn(Xi) = −1 if Xi < 0. The Wilcoxon signed rank statistic is

$$W_1 = \sum_{i=1}^{n} \mathrm{sgn}(X_i)\,\mathrm{rank}(|X_i|). \tag{11.7}$$

Theoretically, since the distribution F is continuous, no two of the absolute values will be equal and none of the observations will be exactly 0. Therefore, the signs and ranks are unambiguously determined. Of course roundoff error may intrude and cause some ties in the ranks. R has a procedure for handling ties, which we need not be concerned about now.

If the distribution of X is symmetric about 0, then there is no association between ranks and signs. Each possible rank is as likely to go with a positive sample value as with a negative sample value. In that case, the distribution of W1 is the same as the distribution of

$$V = \sum_{i=1}^{n} i\,S_i, \tag{11.8}$$

where S1, · · · , Sn are independent and each Si is either +1 or -1, each with probability 1/2. On the other hand, if the distribution of X is symmetric about a number ∆ > 0, then the higher ranks will tend to go with positive sample values and W1 will tend to be greater than 0. If ∆ < 0, then W1 tends to be less than 0. Therefore, W1 is a reasonable test statistic for the null hypothesis H0 : ∆ = 0 against either a one-sided or two-sided alternative. To make use of W1, we must be able to compute p-values Pr(V > v) or Pr(V < v), where v is the observed value of W1. Many textbooks 4 have tables of the distribution of V . We will use R’s built-in function "wilcox.test" for calculations.

To test for symmetry of the distribution of X about some point ∆0 ≠ 0, simply replace rank(|Xi|) with rank(|Xi −∆0|) and sgn(Xi) with sgn(Xi −∆0) in (11.7).

The sum of the ranks for positive observations is

$$W_{1+} = \sum_{X_i > 0} \mathrm{rank}(|X_i|) \tag{11.9}$$

and the sum of ranks for negative observations is

4E.g., Devore, J.L., Probability and Statistics for Engineering and the Sciences, 8th Ed., Brooks-Cole 2012

$$W_{1-} = \sum_{X_i < 0} \mathrm{rank}(|X_i|).$$

From the definitions, we have

$$W_{1+} - W_{1-} = W_1$$

and

$$W_{1+} + W_{1-} = \sum_{i=1}^{n} i = \frac{n(n+1)}{2}.$$

These two equations imply that

$$W_{1+} = \frac{1}{2} W_1 + \frac{n(n+1)}{4}. \tag{11.10}$$

Therefore, W1+ is just as good as W1 as a test statistic. In fact, it is the statistic used in R.
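The definitions and the identity (11.10) can be checked numerically. The sketch below uses a small made-up sample with no ties and no zeros; the numbers are hypothetical and chosen only for illustration.

```r
# Made-up sample (hypothetical values, no ties in |x| and no zeros)
x <- c(1.8, -0.4, 2.9, -1.1, 0.7, 3.5, -2.2, 0.3)
n <- length(x)

r        <- rank(abs(x))          # ranks of the absolute values
w1       <- sum(sign(x) * r)      # signed rank statistic (11.7)
w1_plus  <- sum(r[x > 0])         # sum of ranks of positive observations (11.9)
w1_minus <- sum(r[x < 0])         # sum of ranks of negative observations

# The two defining relations and the identity (11.10)
w1_plus - w1_minus == w1
w1_plus + w1_minus == n * (n + 1) / 2
w1_plus == w1 / 2 + n * (n + 1) / 4

# R reports W1+ under the name V
wilcox.test(x)$statistic
```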

The null distribution of W1+ is the same as the distribution of

$$V_+ = \sum_{S_i = 1} i. \tag{11.11}$$

Corresponding to (11.10), we have

$$V_+ = \frac{1}{2} V + \frac{n(n+1)}{4}. \tag{11.12}$$

In the exercises below you are asked to calculate the exact distribution of V+ for a very simple case.

The Mean and Variance of V and V+

In the definition of V , the expected value of Si is 0 and its variance is 1. Thus,

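$$E(V) = \sum_{i=1}^{n} i\,E(S_i) = 0 \tag{11.13}$$

and

$$\mathrm{var}(V) = \sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}. \tag{11.14}$$

From these and (11.12), we have

$$E(V_+) = \frac{n(n+1)}{4} \tag{11.15}$$

and

$$\mathrm{var}(V_+) = \frac{1}{4}\,\mathrm{var}(V) = \frac{n(n+1)(2n+1)}{24}. \tag{11.16}$$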
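For a small n these formulas can be verified by brute force, listing all 2^n equally likely sign patterns. The sketch below does this for n = 4 (an arbitrary choice) and compares the result with R's built-in signed rank distribution function "dsignrank".

```r
n <- 4   # small enough to enumerate every sign pattern

# All 2^n equally likely assignments of signs (-1 or +1) to the ranks 1..n
signs  <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))
# V+ for each pattern: the sum of the ranks that carry a positive sign
v_plus <- as.vector(((signs + 1) / 2) %*% (1:n))

# Exact null distribution of V+ on 0, 1, ..., n(n+1)/2
support   <- 0:(n * (n + 1) / 2)
exact_pmf <- as.vector(table(factor(v_plus, levels = support))) / 2^n
rbind(value = support, prob = exact_pmf)

# Agreement with R's built-in distribution of the signed rank statistic
all.equal(exact_pmf, dsignrank(support, n))

# Mean and variance compared with (11.15) and (11.16)
mean_v <- mean(v_plus)
var_v  <- mean((v_plus - mean_v)^2)
c(mean = mean_v, formula = n * (n + 1) / 4)
c(var  = var_v,  formula = n * (n + 1) * (2 * n + 1) / 24)
```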

Proofs of the following theorem as well as other theorems in this section can be found in the book by Randles and Wolfe. 5

5R.H. Randles and D.A. Wolfe, Introduction to the Theory of Nonparametric Statistics, Wiley 1979.

Theorem 11.2. As n → ∞, the distribution of

$$Z = \frac{V_+ - E(V_+)}{\mathrm{sd}(V_+)}$$

approaches standard normal. The same is true with V+ replaced by V .

For small values of n the exact distribution of V+ or V is tabulated or can be calculated without much trouble. For larger values of n, Theorem 11.2 allows us to use a normal approximation for finding p-values.
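As a rough illustration of how the two approaches compare, the sketch below computes a two-sided p-value for an observed value v = 15 with n = 11, first exactly with "psignrank" and then from the normal approximation. The continuity correction of 0.5 is an assumption added here; it is not part of Theorem 11.2.

```r
n <- 11    # sample size
v <- 15    # observed value of V+, in the lower tail

# Exact two-sided p-value from the null distribution of V+
exact_p <- 2 * psignrank(v, n)

# Normal approximation using (11.15) and (11.16)
mu    <- n * (n + 1) / 4
sigma <- sqrt(n * (n + 1) * (2 * n + 1) / 24)
# Continuity correction of 0.5 (an added assumption, see the lead-in)
approx_p <- 2 * pnorm((v + 0.5 - mu) / sigma)

c(exact = exact_p, normal_approx = approx_p)
```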

Example 11.2. The signed rank test is most often used with paired data to test the hypothesis that the difference in paired observations has median 0. We will use it to analyze Gosset’s split plot data, for which we used a student-t test in Chapter 9. The yields for the two drying methods on each of the split plots and their differences are repeated below. The R output shows two methods of using the function ”wilcox.test” for paired observations. The t test is repeated for comparison. Note the close agreement in p-values.

PLOT 1 2 3 4 5 6 7 8 9 10 11

REG 1903 1935 1910 2496 2108 1961 2060 1444 1612 1316 1511

KILN 2009 1915 2011 2463 2180 1925 2122 1482 1542 1443 1535

DIFF -106 20 -101 33 -72 36 -62 -38 70 -127 -24

> wilcox.test(DIFF)

Wilcoxon signed rank test

data: DIFF

V = 15, p-value = 0.123

alternative hypothesis: true location is not equal to 0

> wilcox.test(REG,KILN,paired=T)

Wilcoxon signed rank test

data: REG and KILN

V = 15, p-value = 0.123

alternative hypothesis: true location shift is not equal to 0

> t.test(DIFF)

One Sample t-test

data: DIFF

t = -1.6905, df = 10, p-value = 0.1218

alternative hypothesis: true mean is not equal to 0

95 percent confidence interval:

-78.18164 10.72710

sample estimates:

mean of x

-33.72727

Confidence Intervals for the Location Parameter ∆

Let X1 and X2 be independent random variables having the same continuous distribution F . The pseudomedian of F is the median of the random variable (X1 +X2)/2. If the distribution is symmetric about ∆, then both the median and the pseudomedian are equal to ∆. If X1, · · · , Xn is a random sample from F , the sample pseudomedian is the median of the n(n+ 1)/2 pairwise averages

$$A_{ij} = \frac{X_i + X_j}{2}$$

for all i and j, i ≤ j. The signed rank statistic W1+ is equal to the number of positive Aij 6. The sample pseudomedian is also called the Hodges-Lehmann estimator of ∆ 7. The Hodges-Lehmann estimator is in some respects a better estimator of ∆ than the sample median is if the distribution is symmetric.

A confidence interval for the parameter ∆ is obtained by taking sample quantiles of the Aij as end points. We will illustrate again with Gosset’s data.

> wilcox.test(REG,KILN,paired=T,conf.int=T)

Wilcoxon signed rank test

data: REG and KILN

V = 15, p-value = 0.123

alternative hypothesis: true location shift is not equal to 0

95 percent confidence interval:

-84 20

sample estimates:

(pseudo)median

-34.5

Compare this to the paired sample t test results.
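The pairwise averages can also be formed directly, which makes the connection between the Aij, the statistic V, and the Hodges-Lehmann estimate concrete. The sketch below re-enters Gosset's data from Example 11.2; the object names sums and walsh are illustrative.

```r
# Gosset's split plot data, as listed in Example 11.2
REG  <- c(1903, 1935, 1910, 2496, 2108, 1961, 2060, 1444, 1612, 1316, 1511)
KILN <- c(2009, 1915, 2011, 2463, 2180, 1925, 2122, 1482, 1542, 1443, 1535)
DIFF <- REG - KILN

# All pairwise averages A_ij with i <= j
sums  <- outer(DIFF, DIFF, "+")
walsh <- sums[upper.tri(sums, diag = TRUE)] / 2

length(walsh)     # n(n+1)/2 averages
sum(walsh > 0)    # number of positive averages, to compare with V
median(walsh)     # sample pseudomedian, to compare with the (pseudo)median
```

These values can be compared with V and the (pseudo)median in the "wilcox.test" output above.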

Exercises

1. Tabulate the exact distribution of V+ for n = 5. (Hint: The possible values of V+ are the integers 0 through 15. One way that V+ = 5 could occur is for ranks 1 and 4 to belong to positive observations and all the other ranks to belong to negative observations, but there are other ways it could occur.)

6Randles and Wolfe, op cit, p.57

7J.L. Hodges 1922-2000, E.L. Lehmann 1917-2009

2. Apply the signed rank test and confidence interval to the paired data in the data frame ”runoff”. Compare the results to the results of a student-t test and confidence interval.

3. From the course data folder

www.math.uh.edu/ charles/data/punts.txt

import the paired data on punt distance for footballs inflated with Helium and with air. Find the Hodges-Lehmann estimator and confidence interval for the difference between Helium punt distance and air punt distance. Since the data is recorded with low precision, there will be ties in the ranks. ”wilcox.test” can handle this, but will return some warnings that exact intervals cannot be obtained. To suppress the warnings, use the argument ”exact=F” in the call to wilcox.test. This forces a normal approximation.

4. If a distribution is symmetric, the median and pseudomedian are equal and also equal to the mean, if it exists. This is not necessarily true for non-symmetric distributions. Find the median and pseudomedian of the exponential distribution with mean 1. If X1 and X2 are independent and have the exponential distribution with mean 1, then X1 + X2 ∼ Gamma(shape = 2, scale = 1). You will have to either find a numerical approximation of the pseudomedian or estimate it by simulation.

11.2.2 The Wilcoxon Rank Sum Test

For testing equality of population means with independent samples from two distributions, we have so far been restricted to a large sample normal test, which assumes the distributions have variances, or the student-t test, which for small samples assumes near normality of the populations. A nonparametric alternative to these procedures is the Wilcoxon rank sum test. The development of the rank sum test closely parallels that of the Wilcoxon signed rank test. The outline below omits many of the details.

Let X1, X2, · · · , Xn and Y1, Y2, · · · , Ym be independent samples from two continuous cumulative distributions FX and FY . If these distributions differ only in location, there is a constant ∆ such that Y has the same distribution as X + ∆, or in other words,

$$F_Y(y) = F_X(y - \Delta)$$

for all real numbers y. ∆ is called a shift parameter. It is the difference between the quantiles of Y and the corresponding quantiles of X, and also the difference between their means if they exist. We are interested in testing the null hypothesis that ∆ = 0, in which case the distributions of X and Y are the same.

Rank all m + n data values (signed values, not absolute values) together, with the smallest getting rank 1 and the largest getting rank m+ n. One form of the Wilcoxon rank sum statistic is

$$W_Y = \sum_{i=1}^{m} \mathrm{rank}(Y_i).$$

If ∆ = 0 and X and Y have the same distribution, then there is nothing special about the ranks assigned to the Y ’s. WY will be just the sum of m distinct integers chosen at random, without replacement, from 1 through m + n. This defines the null distribution of WY . If ∆ > 0, the ranks of the Y ’s will tend to be among the larger ones and WY will tend to be large. If ∆ < 0, WY will tend to be small. Thus, WY is a reasonable test statistic for H0 : ∆ = 0 against either a one-sided or two-sided alternative.

Apart from the additive constant m(m + 1)/2, which is the smallest possible value of the rank sum, WY is also the number of pairs (i, j) such that Yj > Xi. The test using this expression for WY is called the Mann-Whitney test.8 It was not recognized at first that the Mann-Whitney test and the rank sum test are mathematically equivalent.

It is convenient to modify the definition of WY by subtracting the smallest possible value of the rank sum. Thus,

$$W_Y = \sum_{i=1}^{m} \mathrm{rank}(Y_i) - \frac{m(m+1)}{2}. \tag{11.17}$$
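The agreement between the modified rank sum (11.17) and the Mann-Whitney pair count can be checked on simulated data. In the sketch below the samples, sample sizes, and seed are arbitrary assumptions. Note that "wilcox.test" computes its statistic W for the first sample supplied, so the Y sample is passed first.

```r
set.seed(1)                  # arbitrary seed, for reproducibility only
x <- rnorm(8)                # X sample, n = 8
y <- rnorm(6, mean = 1)      # Y sample, m = 6, shifted to the right
m <- length(y)

# Rank all m + n values together and add up the ranks of the Y's
ranks      <- rank(c(x, y))
rank_sum_y <- sum(ranks[-seq_along(x)])

# Modified statistic (11.17): subtract the smallest possible rank sum
w_y <- rank_sum_y - m * (m + 1) / 2

# Mann-Whitney form: number of pairs (i, j) with Y_j > X_i
pairs_count <- sum(outer(y, x, ">"))

c(rank_form = w_y, pair_count = pairs_count)
wilcox.test(y, x)$statistic  # W for the first argument, here the Y sample
```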

Theorem 11.3. If ∆ = 0, the mean of the random variable WY defined in (11.17) is

$$E(W_Y) = \frac{mn}{2}$$

and its variance is

$$\mathrm{var}(W_Y) = \frac{mn(m+n+1)}{12}.$$

Moreover, if ∆ = 0, then for large m and n the distribution of

$$Z = \frac{W_Y - E(W_Y)}{\mathrm{sd}(W_Y)}$$

is approximately standard normal.
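For very small samples the null distribution can be enumerated completely, since under ∆ = 0 every set of m ranks for the Y's is equally likely. The following sketch does this for m = 3 and n = 4 (arbitrary choices) and compares the result with the formulas above and with R's built-in "dwilcox".

```r
m <- 3   # size of the Y sample
n <- 4   # size of the X sample

# Each column is one equally likely set of m ranks for the Y's
rank_sets <- combn(m + n, m)
w <- colSums(rank_sets) - m * (m + 1) / 2   # statistic (11.17) for each set

mean_w <- mean(w)
var_w  <- mean((w - mean_w)^2)
c(mean = mean_w, formula = m * n / 2)
c(var  = var_w,  formula = m * n * (m + n + 1) / 12)

# Exact null distribution on 0, ..., mn, compared with dwilcox
support <- 0:(m * n)
pmf     <- as.vector(table(factor(w, levels = support))) / length(w)
all.equal(pmf, dwilcox(support, m, n))
```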

Estimating the Shift Parameter

The shift parameter ∆ is estimated much like the center of symmetry is in the one-sample Wilcoxon signed rank procedure. The estimator ∆̂ is also called the Hodges-Lehmann estimator and it is defined as the median of the mn pairwise differences

dij = Yj −Xi.

End points of confidence intervals for ∆ are obtained from quantiles of the dij .
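The estimator is simple to compute directly. The sketch below uses simulated samples (sizes, shift, and seed are arbitrary assumptions) and compares the median of the pairwise differences with the estimate that "wilcox.test" returns when the Y sample is given first.

```r
set.seed(2)                  # arbitrary seed, for reproducibility only
x <- rnorm(8)                # X sample
y <- rnorm(6, mean = 1)      # Y sample, shifted by 1

# All mn pairwise differences d_ij = Y_j - X_i
d  <- as.vector(outer(y, x, "-"))
hl <- median(d)              # Hodges-Lehmann estimate of the shift

hl
wilcox.test(y, x, conf.int = TRUE)$estimate
```

For small samples without ties, as here, "wilcox.test" uses exact calculations; with ties or large samples it switches to a normal approximation and its estimate may differ slightly from the simple median.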

Example 11.3. In Chapter 9 we used the two sample student-t test to test for a difference in forced expiratory volume (fev) for smokers and nonsmokers. The data is in the ”lungcap” data set. We will repeat the t test, assuming equal variances, and compare the results to the results of the Wilcoxon test and confidence interval.

> t.test(fev~smoke,data=lungcap,var.equal=T)

Two Sample t-test

8Henry B. Mann 1905-2000, Donald R. Whitney 1915-2001

data: fev by smoke

t = 1.7101, df = 83, p-value = 0.09098

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-0.01995121 0.26468836

sample estimates:

mean in group no mean in group yes

3.746740 3.624371

> wilcox.test(fev~smoke,data=lungcap,conf.int=T)

Wilcoxon rank sum test with continuity correction

data: fev by smoke

W = 1072.5, p-value = 0.07855

alternative hypothesis: true location shift is not equal to 0

95 percent confidence interval:

-0.01502935 0.27396080

sample estimates:

difference in location

0.1389037

As you can see, the results are very similar. These data are nearly normal, so conditions are favorable for the t-test. Even so, the Wilcoxon test compares favorably with it.

11.2.3 Exercises

1. In Chapter 9, Example 4 we analyzed the data in ”bpcrossover” from a crossover experiment on blood pressure. Repeat this analysis, first using the student-t test and then the Wilcoxon rank sum test. Compare the results. In the call to ”wilcox.test” use the argument ”exact=F” so that you won’t get warnings about ties. However, note that the sample sizes are small, so the normal approximation is questionable.

2. The rank sum test is usually applied in the context of two distributions that are supposedly the same except for location. However, it applies to the more general situation that one of the distributions is always less than or equal to the other one, i.e., the null and alternative hypotheses

H0 :FX = FY

H1 :FX(x) ≤ FY (x) for all x with strict inequality for some x.

For example, the rank sum test detects the difference between two exponential distributions even though they are not related by a shift parameter. Generate a random sample of X values of size 10 from the exponential distribution with mean 2 and a random sample of Y values of size 10 from the exponential distribution with mean 1. Compare them with "wilcox.test" using the optional argument alternative = "greater". Play with the sample sizes and the mean of Y . Note the cases when the p-value reaches 10% or less.

11.3 Bootstrap Confidence Intervals

A random variable X with density function

$$f(x) = \frac{2}{(1 + x)^3}, \quad x \ge 0$$

has a mean but does not have a variance. The conditions for the large sample confidence interval for the mean that we used in Chapter 7 are not satisfied. The distribution of X is not symmetric, so the Hodges-Lehmann estimator for the center of symmetry also cannot be justified. Bootstrap confidence intervals are a way of proceeding when one wants to make as few theoretical assumptions as possible. The name comes from the fable about a character lifting himself by his own bootstraps.

The histogram below is of a random sample of size 30 from the distribution above.

[Figure: histogram of the sample. Horizontal axis: x, from 0 to 7. Vertical axis: Frequency, from 0 to 25.]

The idea behind the bootstrap is that repeated resampling from the original sample yields as much information about the precision of estimators as can be obtained from the data itself without making additional theoretical assumptions. One sample of the same size as the original sample, obtained by sampling the data with replacement, is called a bootstrap sample. The values in a bootstrap sample are among the values of the original sample data, but some will be repeated and others will be omitted. For the distribution above, which incidentally has a mean of 1, the original sample was given a name.

> thesample=runif(30)

> thesample=1/sqrt(thesample)-1

> thesample

 1.01865 0.69580 0.40683 3.02067 0.46590 0.04336 1.07206

1.33057 0.02912 1.12058 2.51662 0.05060 0.10014 0.30890

3.32139

 0.75437 0.35492 0.10888 0.92074 0.08729 0.17593 0.35980

0.66303 0.01714 0.88771 0.09114 1.01506 0.10497 0.27204

0.36616

> mean(thesample)

 0.7226787
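The two commands that generate the sample use the inverse of the cumulative distribution function. Integrating the density gives F(x) = 1 − (1 + x)^(−2) for x ≥ 0, and solving F(x) = u gives x = 1/√(1 − u) − 1. Since 1 − U is uniform on (0, 1) whenever U is, 1/√U − 1 has the required distribution. The sketch below checks these facts numerically; the function names dens, cdf, and cdf_inv are hypothetical.

```r
dens    <- function(x) 2 / (1 + x)^3          # density on x >= 0
cdf     <- function(x) 1 - (1 + x)^(-2)       # cumulative distribution function
cdf_inv <- function(u) 1 / sqrt(1 - u) - 1    # inverse of cdf on (0, 1)

# The density integrates to 1 and the distribution has mean 1
total_prob <- integrate(dens, 0, Inf)$value
the_mean   <- integrate(function(x) x * dens(x), 0, Inf)$value
c(total_prob = total_prob, mean = the_mean)

# cdf_inv really inverts cdf
u <- c(0.1, 0.5, 0.9)
cdf(cdf_inv(u))

# The integral for the second moment diverges, which is why there is no
# variance; it is not evaluated here because integrate() would stop with an error.
```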

Then a single bootstrap sample of size 30 is obtained and its mean (the bootstrap mean) is

> bootsamp=sample(thesample,30,replace=T)

> bootsamp

 0.01714 0.35492 1.07206 0.17593 1.12058 0.17593 0.09114

0.10014 1.01506 0.69580 0.40683 0.40683 3.32139 1.12058

0.92074

 3.02067 0.05060 1.01506 0.27204 0.75437 0.10014 0.10497

0.04336 3.32139 1.33057 0.75437 0.02912 3.02067 3.02067

1.07206

> mean(bootsamp)

 0.9635043

Notice that the bootstrap mean x̄∗ is different from the original sample mean x̄. If another bootstrap sample should be obtained and its bootstrap mean calculated, it would be different too. In other words, there is a distribution of bootstrap means. The key assumption in the bootstrap confidence interval is that the bootstrap distribution of x̄∗ − x̄ is similar to the sampling distribution of x̄ − µ, where µ is the population mean. At least they are similar enough that quantiles of the bootstrap distribution of x̄∗ − x̄ can substitute for quantiles of x̄ − µ.

If the quantiles q(x̄ − µ, α/2) and q(x̄ − µ, 1 − α/2) were known, we could form the 100(1 − α)% confidence interval

$$\bar{x} - q(\bar{x} - \mu, 1 - \alpha/2) < \mu < \bar{x} - q(\bar{x} - \mu, \alpha/2).$$

Since they are not known, we substitute the quantiles q∗(x̄∗ − x̄, 1 − α/2) = q∗(x̄∗, 1 − α/2) − x̄ and q∗(x̄∗ − x̄, α/2) = q∗(x̄∗, α/2) − x̄ of the bootstrap distribution of x̄∗ − x̄. After rearranging terms, this leads to the bootstrap confidence interval

$$2\bar{x} - q^*(\bar{x}^*, 1 - \alpha/2) < \mu < 2\bar{x} - q^*(\bar{x}^*, \alpha/2) \tag{11.18}$$

for µ.

To get the bootstrap quantiles of x∗, we generate many bootstrap samples of size 30 and calculate the bootstrap mean x∗ for each one. Then we take the sample quantiles of all the bootstrap means. The R procedure is

> bootmeans=replicate(1000,mean(sample(thesample,30,replace=T)))

> 2*mean(thesample)-quantile(bootmeans,.975)

97.5%

0.3890227

> 2*mean(thesample)-quantile(bootmeans,.025)

2.5%

0.986073

For comparison, the Student t confidence interval is

> t.test(thesample)

One Sample t-test

data: thesample

t = 4.6273, df = 29, p-value = 7.139e-05

alternative hypothesis: true mean is not equal to 0

95 percent confidence interval:

0.4032608 1.0420965

sample estimates:

mean of x

0.7226787

The t-test confidence interval happens to be better in this one instance, in the sense that it contains the target value of 1, whereas the bootstrap interval (0.389, 0.986) does not. It should be noted that no two bootstrap confidence intervals are exactly the same, even starting with the same primary sample data. This is because of the randomness in obtaining bootstrap samples.
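The run-to-run variation can be seen by wrapping the procedure for (11.18) in a function and calling it twice on the same data. This is a minimal sketch: the function name `boot_ci_mean`, its arguments, and the seed are assumptions made for illustration, and the primary sample is regenerated here so the block runs on its own.

```r
# Hypothetical helper: bootstrap confidence interval (11.18) for a population mean
boot_ci_mean <- function(x, B = 1000, conf = 0.95) {
  alpha <- 1 - conf
  # B bootstrap means, each from a resample of x drawn with replacement
  bootmeans <- replicate(B, mean(sample(x, length(x), replace = TRUE)))
  # Upper quantile first: it yields the lower confidence limit
  q <- quantile(bootmeans, c(1 - alpha / 2, alpha / 2), names = FALSE)
  ci <- 2 * mean(x) - q
  names(ci) <- c("lower", "upper")
  ci
}

set.seed(1)                        # arbitrary seed, for reproducibility only
x <- 1 / sqrt(runif(30)) - 1       # same recipe as the primary sample in the text
ci1 <- boot_ci_mean(x)
ci2 <- boot_ci_mean(x)             # same data, different bootstrap resamples
print(rbind(ci1, ci2))
```

The two intervals share the same primary sample and differ only through the resampling. Increasing `B` reduces this variation but does not remove it.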

As another example, we will find a bootstrap confidence interval for the standard deviation of an exponential distribution with mean 3, and therefore with standard deviation σ = 3. If s is the sample standard deviation from the primary sample and s∗ is the bootstrap sample standard deviation, we could equate bootstrap quantiles of s∗ − s with sampling quantiles of s − σ in the manner above. However, since σ is a positive scale parameter, it is preferable to equate bootstrap quantiles of s∗/s to sampling quantiles of s/σ. This leads to the bootstrap confidence interval

s²/q∗(s∗, 1 − α/2) < σ < s²/q∗(s∗, α/2) (11.19)

In R,

> thesample=rexp(30,rate=1/3)

> sd(thesample)

 2.180378

> bootsd=replicate(1000,sd(sample(thesample,30,replace=T)))

> var(thesample)/quantile(bootsd,.975)

97.5%

1.891767

> var(thesample)/quantile(bootsd,.025)

2.5%

2.874222
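The scale-parameter version (11.19) can be packaged the same way. The following is a sketch under stated assumptions: the function name `boot_ci_sd`, its arguments, and the seed are hypothetical and not part of the original example.

```r
# Hypothetical helper: bootstrap confidence interval (11.19) for a standard deviation,
# treating sigma as a scale parameter (quantiles of s*/s stand in for those of s/sigma)
boot_ci_sd <- function(x, B = 1000, conf = 0.95) {
  alpha <- 1 - conf
  # B bootstrap standard deviations from resamples drawn with replacement
  bootsd <- replicate(B, sd(sample(x, length(x), replace = TRUE)))
  # Upper quantile first: dividing by it yields the lower confidence limit
  q <- quantile(bootsd, c(1 - alpha / 2, alpha / 2), names = FALSE)
  ci <- var(x) / q                 # s^2 divided by the bootstrap quantiles of s*
  names(ci) <- c("lower", "upper")
  ci
}

set.seed(2)                        # arbitrary seed, for reproducibility only
y <- rexp(30, rate = 1 / 3)        # exponential with mean 3, so sigma = 3
ci_sd <- boot_ci_sd(y)
print(ci_sd)
```

Because the limits are ratios of positive quantities, this interval can never extend below zero, which is one reason to prefer the ratio form for a scale parameter.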

11.3.1 Exercises

1. The coefficient of variation of a distribution with mean µ and standard deviation σ is

γ = σ/µ.

Create a function in your R workspace for calculating the sample coefficient of variation.

> cv=function(x) sd(x)/mean(x)

Use the “rgamma” function in R to simulate a sample of size 30 from a gamma distribution with shape parameter α = 4. Mimic the example above and find a 95% bootstrap confidence interval for the coefficient of variation. Treat the coefficient of variation like a scale parameter. The call to “replicate” should be

> bootcv=replicate(1000,cv(sample(thesample,30,replace=T)))

Repeat this with other shape parameters. How many of the confidence intervals contain the true coefficient of variation?
