
Problem Set. Probability and Statistics, Part 2.

You can answer by filling in the blank spaces. If there is not enough space, attach additional sheets.

For Problems 1 to 5, you can use the web page "Statistical Calculations".

Problem 1. Baseball Stats.

a) Hank Aaron was an outfielder for the Braves from 1954 to 1974. Here is the number of home runs Aaron hit each year:

13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32, 44, 39, 29, 44, 38, 47, 34, 40, 20.

Mark McGwire played in the major leagues from 1986 to 2001 as a first baseman for the Oakland A's and the St. Louis Cardinals. Here is the number of home runs McGwire hit each year:

3, 49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70, 65, 32, 29.

Please give the mean, median, standard deviation and quartiles for each player (you can use the applet on the web page for these computations). Remember that the definition of quartiles was given in class; it is also in the lecture notes.
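As a rough cross-check on the applet, the same quantities can be computed with Python's standard `statistics` module. This is only a sketch under stated assumptions: the helper name `summarize` is made up for illustration, it uses the population standard deviation (`pstdev`), and it takes the quartiles to be the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd). The definition given in class, or the one used by the applet, may differ, so check it before comparing numbers.

```python
import statistics

def summarize(data):
    """Return mean, median, standard deviation and quartiles of a list.

    Assumptions (not taken from the course): population standard deviation,
    and quartiles defined as medians of the lower and upper halves.
    Needs at least two values.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]     # values before the middle position
    upper = ordered[-half:]    # values after the middle position
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "std": statistics.pstdev(ordered),
        "q1": statistics.median(lower),
        "q3": statistics.median(upper),
    }

# Home runs per year, copied from the lists above.
aaron = [13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24,
         32, 44, 39, 29, 44, 38, 47, 34, 40, 20]
mcgwire = [3, 49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70, 65, 32, 29]

aaron_stats = summarize(aaron)
mcgwire_stats = summarize(mcgwire)

for name, stats in (("Aaron", aaron_stats), ("McGwire", mcgwire_stats)):
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

If the class uses the sample standard deviation instead, swapping `pstdev` for `stdev` is the only change needed.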

Aaron:

• mean:
• median:
• standard deviation:
• first quartile:
• third quartile:

McGwire:

• mean:
• median:
• standard deviation:
• first quartile:
• third quartile:

b) What conclusion would you draw from a comparison of these data? Judging only from the information you calculated above, which of the two would you say was the more consistent player? Explain your answer. Is this borne out by the actual numbers?

Problem 2. Landslide victories in presidential elections.

Here are the percentages of the popular vote won by the successful presidential candidate in each of the presidential elections from 1948 to 2004:

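| Year | % of popular vote |
|------|-------------------|
| 1948 | 49.6 |
| 1952 | 55.1 |
| 1956 | 57.4 |
| 1960 | 49.7 |
| 1964 | 61.1 |
| 1968 | 43.4 |
| 1972 | 60.7 |
| 1976 | 50.1 |
| 1980 | 50.7 |
| 1984 | 58.8 |
| 1988 | 53.9 |
| 1992 | 43.3 |
| 1996 | 50.0 |
| 2000 | 47.9 |
| 2004 | 50.9 |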

a) Compute the:

• mean
• median
• standard deviation

b) What are the first and third quartiles?

Definition: An election is called a landslide if it is at or above the third quartile.

c) Which elections were landslides?
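The landslide definition can be applied mechanically once the third quartile is fixed. The sketch below is one possible way to do it, not the course's method: the names `winner_share` and `third_quartile` are hypothetical, and the third quartile is assumed to be the median of the upper half of the sorted data (middle value left out when the count is odd). If the class definition of quartiles is different, replace `third_quartile` accordingly.

```python
import statistics

# Winner's share of the popular vote (%), copied from the table above.
winner_share = {
    1948: 49.6, 1952: 55.1, 1956: 57.4, 1960: 49.7, 1964: 61.1,
    1968: 43.4, 1972: 60.7, 1976: 50.1, 1980: 50.7, 1984: 58.8,
    1988: 53.9, 1992: 43.3, 1996: 50.0, 2000: 47.9, 2004: 50.9,
}

def third_quartile(values):
    """Median of the upper half of the data (an assumed convention)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return statistics.median(ordered[-half:])

q3 = third_quartile(winner_share.values())

# An election is a landslide if the winner's share is at or above q3.
landslides = sorted(year for year, pct in winner_share.items() if pct >= q3)

print("third quartile:", q3)
print("landslide years:", landslides)
```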

Problem 3.

The following list gives the final grades (out of 100) of 30 students in a calculus course.

Grades (out of 100): 7, 10, 14, 15, 15, 17, 18, 21, 21, 24, 25, 26, 26, 29, 32, 34, 37, 63, 65, 66, 69, 69, 72, 75, 77, 77, 78, 80, 81, 93.

a) Compute the:

• mean
• median
• standard deviation
• first quartile:
• third quartile:

b) Do you think the {mean & standard deviation} gives a good summary of the information conveyed in the above data? Explain your answer.

c) How about the {median, quartiles and extremes}? Again, explain your answer.

d) Do you think the information conveyed by the mean alone and/or by the median alone is useful?

e) Calculate the mean and standard deviation of

• the grades below the mean.
• the grades above the mean.
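Splitting the list at the overall mean is easy to get wrong by hand, so here is a small sketch of the bookkeeping. It assumes the population standard deviation (`pstdev`); the variable names are illustrative only.

```python
import statistics

# Final grades, copied from the list above.
grades = [7, 10, 14, 15, 15, 17, 18, 21, 21, 24, 25, 26, 26, 29, 32,
          34, 37, 63, 65, 66, 69, 69, 72, 75, 77, 77, 78, 80, 81, 93]

overall_mean = statistics.mean(grades)

# No grade equals the mean exactly here, so strict comparisons
# put every grade in exactly one of the two groups.
below = [g for g in grades if g < overall_mean]
above = [g for g in grades if g > overall_mean]

for label, group in (("below the mean", below), ("above the mean", above)):
    print(label, "count:", len(group),
          "mean:", round(statistics.mean(group), 2),
          "std:", round(statistics.pstdev(group), 2))
```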

f) Do you think the two means you calculated in e) give a useful description of the data? Explain why.

g) In your opinion, which combination of the statistical quantities calculated from a) to f) gives the best summary of the information conveyed by the full list of class grades?

Problem 4. Non-normal distribution.

Make a list of 10 numbers for which the mean lies above the third quartile:

Problem 5. A study of jury awards.

A study of the size of jury awards in civil cases (such as injury, liability and medical malpractice) showed that the median award in Cook County, Illinois, was about \$8000. But the mean award was about \$89,000. Can you explain how this is possible? Create an example with actual numbers as part of your explanation.

In Problems 6 to 9, you will manipulate confidence intervals. Here is a reminder:

Suppose the probability of an event A (e.g. A = {people who like cats}) is p% (e.g. 59%), and that in a sample of size N (e.g. N = 1,000 persons) you find that p(N)% (e.g. 58%) agree with A (e.g. 580 of the N = 1,000 persons asked like cats).
Then you calculate the following:

Case 1
If you know the exact value p%, the 95% confidence interval for p(N)% is p% +/- 2*S%, where S = sqrt(p*(100-p)/N) (sqrt is the square root).
This means that there is a 95% chance that p(N)% is in the interval [p% - 2*S%, p% + 2*S%]. S is the standard deviation.

Case 2
If you don't know the exact value p%, the 95% confidence interval for p% is p(N)% +/- 2*S%, where S = sqrt(p(N)*(100-p(N))/N) (sqrt is the square root).
This means that there is a 95% chance that p% is in the interval [p(N)% - 2*S%, p(N)% + 2*S%]. S is the standard deviation.
In this case, you can use the Confidence Interval web page to compute the confidence interval if p(N)% is an integer.
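Both cases use the same arithmetic and differ only in which percentage sits at the center of the interval. The sketch below applies the formula from the reminder to the cat example (p = 59%, 580 of N = 1,000 in the sample). The function name `two_sigma_interval` is made up for illustration, and all quantities are in percent, as in the reminder.

```python
import math

def two_sigma_interval(center_percent, n):
    """Return (low, high) = center +/- 2*S, with S = sqrt(c*(100-c)/n).

    center_percent is a percentage between 0 and 100, n the sample size.
    """
    s = math.sqrt(center_percent * (100 - center_percent) / n)
    return center_percent - 2 * s, center_percent + 2 * s

# Case 1: the true percentage p = 59 is known; interval for p(N).
case1_low, case1_high = two_sigma_interval(59, 1000)

# Case 2: only the sample result is known (580 of 1,000); interval for p.
sample_percent = 100 * 580 / 1000
case2_low, case2_high = two_sigma_interval(sample_percent, 1000)

print(f"Case 1: [{case1_low:.1f}%, {case1_high:.1f}%]")
print(f"Case 2: [{case2_low:.1f}%, {case2_high:.1f}%]")
```

Because N appears under the square root, quadrupling the sample size only halves the width of the interval.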

Problem 6. Unlisted Numbers.

In a particular area code, about 35% of telephones have unlisted numbers. Imagine that you call 300 numbers, each chosen randomly. That is, you pick the different numbers randomly from the whole collection of possible numbers (not all seven-digit numbers are possible phone numbers - for instance, no phone number starts with 0). The number of unlisted numbers you reach this way is more or less normally distributed.

a) What is the mean of this distribution? (That is, what is the average number of unlisted numbers reached in 300 attempts, if a large group of people would try 300 attempts each, independently, and compare their results at the end?)

b) And what would the standard deviation be? (You can use the confidence interval applet to calculate this, or the formula recalled above.)

Problem 7. Visiting Yosemite National Park.

The Forest Service of Yosemite National Park is considering additional restrictions on the number of vehicles allowed to enter the Park. To assess public reaction, the Service asks a random sample of 200 visitors if they favor the proposal. Of these, 126 say "yes".

a) Give a 95% confidence interval for the proportion of all visitors to Yosemite who favor the restrictions.

b) Are you 95% confident that more than half are in favor? Explain your answer!

Problem 8. Poll results.

A news report says that a national opinion poll of 1200 randomly selected adults found that 38% thought that they would be worse off during the next year. The news report went on to say that the margin of error in the poll is + or - 3 percentage points with 95% confidence. The poll was carried out by calling random telephone numbers.

a) Using the formula seen before or the web page, compute the 95% confidence interval for yourself. It is of the form p% +/- 2*S%. What is the difference between the interval you find and the one they give?

b) Could you propose explanations for the difference between what you find and what they describe?

c) If we wanted a 90% confidence interval, and not a 95% confidence interval, would the width of the confidence interval be greater or smaller than the + or - 3 percentage points? Why?

Problem 9. Elections.

Suppose that you want to call the result of an election with 95% confidence. The ratio of people who prefer the leading candidate hovers around 70%. Be careful, for this problem you can't use the webpage!

a) How many people should you have in your sample to be 95% confident that more than half the population will vote for this candidate?

b) How many people should you question if the percentage of the population who prefer this leading candidate is around 60% rather than around 70%?

c) And if it were 51%?

In class, we saw an instance of Simpson's paradox.

In that case, we had two groups, A and B, and group A was claiming that they were unfairly treated in the graduate admissions process. Indeed, out of a pool of 1100 applicants in group A, only 190 were admitted, while out of a pool of 1100 applicants from group B a total of 910 were admitted.

However, a closer inspection of these (made-up) data showed that there were two different programs to which the candidates could apply. Program 1 had an admission rate of 90%, but program 2 had an admission rate of 10%. Then the numbers were explained by the fact that of the 1100 applicants from group A, 100 had applied to program 1, and 1000 to program 2; while the reverse happened for the applicants from group B. Both programs treated the two groups entirely fairly, but nevertheless the total end result looked skewed.
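The totals quoted above follow directly from the per-program numbers. The short sketch below redoes that arithmetic; the dictionary names are illustrative, and the data are the made-up figures from the text.

```python
# Number of applicants from each group to each program (from the text).
applicants = {
    "A": {1: 100, 2: 1000},
    "B": {1: 1000, 2: 100},
}

# Admission rate of each program in percent, identical for both groups.
rate_percent = {1: 90, 2: 10}

admitted = {}
overall_rate = {}
for group, by_program in applicants.items():
    # Integer arithmetic keeps the counts exact.
    admitted[group] = sum(count * rate_percent[program] // 100
                          for program, count in by_program.items())
    overall_rate[group] = 100 * admitted[group] / sum(by_program.values())

for group in applicants:
    print(group, "admitted:", admitted[group],
          "overall rate:", round(overall_rate[group], 1), "%")
```

Each program applies the same rate to both groups, yet the overall rates differ because the groups applied to the two programs in very different proportions.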

Here we shall see some more instances of Simpson's paradox.

Problem 10. A tale of two hospitals.

A community has two hospitals. Hospital A is a large medical center, while Hospital B is a more fashionable and much more expensive hospital where most patients are wealthy. An article in the local paper claims that a higher percentage of surgery patients die at Hospital A than at Hospital B, and deplores the fact that people who are less well off are disadvantaged. It also recommends that people in the community who can afford it choose to have their surgery in Hospital B rather than A.

A more detailed look at the number of surgery patients in the last few months at both Hospitals, taking into account also whether the incoming patients were in good or poor health, shows the following:

| | Hospital A, Good Health | Hospital A, Poor Health | Hospital B, Good Health | Hospital B, Poor Health |
|---|---|---|---|---|
| Died | 4 | 57 | 5 | 8 |
| Survived | 559 | 1422 | 585 | 196 |

a) What are the percentages of patients admitted for surgery who are in poor health prior to the operation?

• in A :
• in B :

b) What are the total percentages of patients who died?

• in A :
• in B :

c) What are the percentages of patients in previously good health who died?

• in A :
• in B :

d) What are the percentages of patients in previously poor health who died?

• in A :
• in B :

e) Try describing this paradox in your own words - Imagine that you have to write a short, one-or-two paragraph article about it in the local newspaper, or that you respond to the article that appeared in the local paper with a letter to the Editor. Don't just repeat the numbers: give a clear explanation of what is happening in these statistics.

Problem 11. Cancer Statistics.

• 75% of cancer patients eat cucumbers
• 80% of cancer patients drink orange juice
• 30% of cancer patients smoke.

He promptly took up smoking again and omitted cucumbers and orange juice from his diet, with a secure feeling that his chances of developing cancer were low.

Do you agree with his reasoning? Explain why. To do this, explain which probabilities or conditional probabilities the information above gives Mr. Johnson, and which probabilities or conditional probabilities he would need in order to reach his conclusion.

Check other examples on pages "Car Accidents Statistics", "Financial Aid Statistics" and "Statistics Quiz". Which example was your favorite? Do you know any other interesting example?

Challenge Questions

Problem C1. Conducting a poll.

Suppose you would like to conduct a poll among the adult population. As you are an expert in Statistics, you know that you need to choose your respondents at random. So you design the following procedure. You generate a random telephone number and dial it. If nobody comes to the phone, you cross it out. If somebody comes to the phone, you ask how many adult members of the household there are and pick a person to survey at random among them. In fact, your procedure would make your choice non-random. Explain this! For each step below, invent a real-life question you could be asking in your survey for which your method would give incorrect results.

a) You use the phone numbers as the base:

b) You discard the phone number when nobody comes to the phone:

c) You choose one of the members of the household:

d) "Gala" magazine conducted a stress survey. Out of 5 million readers, 1.3 percent returned the questionnaire. This was one of the biggest stress surveys in history. Explain why it nevertheless doesn't reflect the actual opinions of the population at large, or even of average "Gala" subscribers.

Problem C2. Delaying labor with statistics.

A study showed that women who give birth after 40 live longer than women who give birth before 35.

As a result of this study, delaying child bearing was proposed as a way of lengthening women's lives.

Do you agree with this proposition? Explain why.

Problem C3. Tax break.

Imagine a very simple tax system, where there are only two tax brackets, that is, everyone earning less than \$ A is taxed at rate a%, and everyone earning \$ A or more is taxed at rate b%.

A big tax break is carried out, in which the rate goes down for both brackets, that is, both a and b are replaced by smaller numbers a' and b'. Yet, when you look at the numbers afterwards, the total tax revenue for the country has increased.

How is this possible? Make up some actual numbers (values for the cut-off that defines who is in which bracket, for a, b, a', b' and for the numbers of people who earn certain amounts) that fit this scenario. (Hint: the cut-off A need not be the same before and after the new tax legislation passed.)