The joy of statistics: a treasury of elementary statistical tools and their applications steve selvi


THE JOY OF STATISTICS


A Treasury of Elementary Statistical Tools and their Applications


Great Clarendon Street, Oxford, OX2 6DP, United Kingdom

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries © Steve Selvin 2019

The moral rights of the author have been asserted

First Edition published in 2019

Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above

You must not circulate this work in any other form and you must impose this same condition on any acquirer

Published in the United States of America by Oxford University Press 198 Madison Avenue, New York, NY 10016, United States of America

British Library Cataloguing in Publication Data Data available

Library of Congress Control Number: 2018965382

ISBN 978–0–19–883344–4

DOI: 10.1093/oso/9780198833444.001.0001

Printed in Great Britain by Bell & Bain Ltd., Glasgow

Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.

For my grandsons Benjamin and Eli

Preface

Many introductory statistics textbooks exist for one of two purposes: as a text of statistical methods required by a variety of disciplines, or as a course text leading to more advanced statistical methods, with the goal of providing statistical tools for analysis of data. “The Joy of Statistics” is not one of these books. It is an extensive discussion of the many roles statistics plays in everyday life, with explanations and examples of how statistics works to explore important and sometimes unimportant questions generated from data.

A few examples of these questions:

Who is Monty Hall?

Why is Florence Nightingale in the statistics hall of fame?

What is the relationship between a father’s height, his son’s height, and Sports Illustrated magazine?

How do we know the number of gray whales living in the Pacific Ocean?

How accurate are home drug testing kits?

Is 0.11 a large number?

Does a dog owner likely own a cat?

What is the difference between the law of averages and the law of large numbers?

This book is about the beauty, utility, and often simplicity of using statistics to distill messages from data. The logic and magic of statistics, without extensive technical details, is applied to answer a wide variety of questions generated from collected data. A bit of algebra, 10th grade algebra, and some elementary statistical/mathematical notation provide clear and readily accessible descriptions of a large number of ways statistics provides a path to human decision making. Included are a few classic “statistical” jokes and puzzles. Also bits of statistical history and brief biographies of important statisticians are sprinkled among various topics. The presented material is not a progression from simple to less simple to challenging techniques. The more than 40 topics present an anthology of various statistical “short stories” intended to be the first step into the world of statistical logic and methods. Perhaps the title of the text should be an “Elementary Introduction to the book Elementary Statistical Analysis.”

Acknowledgments

I would especially like to thank my daughter Dr. Elizabeth Selvin, my son-in-law, David Long, and my wife of 53 years, artist Nancy Selvin, for their constant support and encouragement. I owe thanks to the many colleagues and friends at the University of California, Berkeley and Johns Hopkins School of Public Health who have taken an enthusiastic interest in my work over the years. I also would like to acknowledge Jenny Rosen for her technical assistance and the design skills of Henna artist Robyn Jean for inspiring the cover artwork.

28.

29.

30. Geometry of an approximate

31. Simpson’s paradox—two examples and a bit more

32. Smoothing—median values

33. Two by two table—a missing

34. Survey data—randomized

35. Viral incidence estimation—a

36. Two-way table—a graphical

37.

38. A binary variable—twin

39. Mr. Rich and Mr. Poor—a

40. Log-normal distribution—leukemia

1 Probabilities—rules and review

Statistics and probability, statistics and probability
Go together like data and predictability
This I tell you brother
You can’t have one without the other.
(With apologies to Frank Sinatra.)

Probability theory is certainly one of the most difficult areas of mathematics. However, with little effort, probabilities can be simply used to effectively summarize, explore, and analyze statistical issues generated from sampled data.

A probability is a numeric value always between zero and one used to quantify the likelihood of occurrence of a specific event or events. It measures the likelihood a specific event occurs among all possibilities. For example, roll a die and the probability the top face is a one is 1/6 or the probability it is less than 3 is 2/6 because these are two specific events generated from six equally likely possibilities. Typical notation for an event denoted A is probability = P(A). In general, a probability is defined as the count of occurrences of a specific event divided by the number of all possible equally likely events. Thus, a probability becomes a statistical tool that provides a rigorous and formal assessment of the role of chance in assessing collected data.
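The counting definition can be checked with a few lines of Python. This is only a sketch of the die example: the helper name `probability` and the use of exact fractions are illustrative choices, not part of the text.

```python
from fractions import Fraction

# Six equally likely faces of a fair die
faces = [1, 2, 3, 4, 5, 6]

def probability(event, outcomes):
    """Count of outcomes satisfying `event` divided by the count of all outcomes."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

p_one = probability(lambda face: face == 1, faces)         # top face is a one
p_less_than_3 = probability(lambda face: face < 3, faces)  # top face is 1 or 2

print(p_one, p_less_than_3)
```

Exact fractions keep 2/6 from being rounded; Python reduces it to the equivalent 1/3.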

For two events labeled A and B, a probability measures:

P(A) = probability event A occurs

P($\bar{A}$) = probability event A does not occur

P(B) = probability event B occurs

P($\bar{B}$) = probability event B does not occur.

Joint probabilities:

P(A and B) = probability both events A and B occur

P(A and $\bar{B}$) = probability event A occurs and event B does not occur

P($\bar{A}$ and B) = probability event A does not occur and event B occurs

P($\bar{A}$ and $\bar{B}$) = probability event A does not occur and event B does not occur

The Joy of Statistics: A Treasury of Elementary Statistical Tools and their Applications. Steve Selvin. © Steve Selvin 2019. Published in 2019 by Oxford University Press. DOI: 10.1093/oso/9780198833444.001.0001

Conditional probabilities:

P(A|B) = probability event A occurs when event B has occurred or

P(B|A) = probability event B occurs when event A has occurred.

Note: the probability P(A) = 0 means event A cannot occur (“impossible event”). For example, rolling a value more than six with a single die or being kidnaped by Martians. Also note: the probability P(A) = 1 means event A always occurs (“sure event”). For example, rolling a value less than 7 with a single die or not being kidnaped by Martians.

A repeat of joint probabilities for two events A and B usefully displayed in a 2 × 2 table:

Table 1.1 Joint distribution of events A and B

events       B                     $\bar{B}$                     sum
A            P(A and B)            P(A and $\bar{B}$)            P(A)
$\bar{A}$    P($\bar{A}$ and B)    P($\bar{A}$ and $\bar{B}$)    P($\bar{A}$)
sum          P(B)                  P($\bar{B}$)                  1.0

Table 1.2 displays a subset of data (n = 100 observations) from a national survey of pet ownership conducted by the US Bureau of Labor Statistics. It contains counts of the number of people surveyed who own at least one dog (event denoted D), own at least one cat (event denoted C), own neither a dog nor a cat (event denoted $\bar{D}$ and $\bar{C}$), or own both a cat and a dog (event denoted D and C).

Table 1.2 Joint distribution of 100 cat and dog owners

             D                       $\bar{D}$                       sum
C            D and C = 15            $\bar{D}$ and C = 45            C = 60
$\bar{C}$    D and $\bar{C}$ = 20    $\bar{D}$ and $\bar{C}$ = 20    $\bar{C}$ = 40
sum          D = 35                  $\bar{D}$ = 65                  n = 100

Some specific cat and dog probabilities

P(D) = 35/100 = 0.35 and P(C) = 60/100 = 0.60

P(D and C) = 15/100 = 0.15, P($\bar{D}$ and C) = 45/100 = 0.45

P(D and $\bar{C}$) = 20/100 = 0.20, P($\bar{D}$ and $\bar{C}$) = 20/100 = 0.20

P(D or C) = P(D and C) + P($\bar{D}$ and C) + P(D and $\bar{C}$) = 0.15 + 0.45 + 0.20 = 0.80

also, P(D or C) = 1 − P($\bar{D}$ and $\bar{C}$) = 1 − 0.20 = 0.80 because P(D or C) + P($\bar{D}$ and $\bar{C}$) = 1.0.

Conditional probabilities

Probability a cat owner owns a dog = P(D|C) = 15/60 = 0.25 (row: condition = cat = C); probability a dog owner owns a cat = P(C | D) = 15/35 = 0.43 (column: condition = dog = D).

In general, for events A and B,

$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} \quad (\text{condition} = B) \qquad \text{and} \qquad P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)} \quad (\text{condition} = A).$$

For example, again dogs and cats:

$$P(D \mid C) = \frac{P(D \text{ and } C)}{P(C)} = \frac{15/100}{60/100} = 0.25 \quad (\text{condition} = \text{cat})$$

and

$$P(C \mid D) = \frac{P(D \text{ and } C)}{P(D)} = \frac{15/100}{35/100} = 0.43 \quad (\text{condition} = \text{dog}).$$
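The dog and cat probabilities can be reproduced directly from the four cell counts of the pet survey. The following is a minimal sketch; the dictionary layout and variable names are illustrative assumptions, while the counts are those given in the text.

```python
# Cell counts from the pet survey (n = 100); keys are (dog status, cat status)
counts = {
    ("D", "C"): 15,          # owns a dog and a cat
    ("not D", "C"): 45,      # owns a cat but no dog
    ("D", "not C"): 20,      # owns a dog but no cat
    ("not D", "not C"): 20,  # owns neither
}
n = sum(counts.values())

# Marginal probabilities: add the cells that share the event
p_d = sum(v for (dog, _), v in counts.items() if dog == "D") / n
p_c = sum(v for (_, cat), v in counts.items() if cat == "C") / n

# Joint probability and the "or" probability via the complement
p_d_and_c = counts[("D", "C")] / n
p_d_or_c = 1 - counts[("not D", "not C")] / n

# Conditional probabilities: joint probability divided by the probability of the condition
p_d_given_c = p_d_and_c / p_c
p_c_given_d = p_d_and_c / p_d

print(round(p_d, 2), round(p_c, 2), round(p_d_or_c, 2))
print(round(p_d_given_c, 2), round(p_c_given_d, 2))
```

Dividing by the probability of the condition is the same as restricting attention to one row or one column of the table: 15/60 for cat owners and 15/35 for dog owners.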

Independence

Two events A and B are independent when occurrence of event A is unrelated to occurrence of event B. In other words, occurrence of event A is not influenced by occurrence of event B and vice versa. Examples of independent events: toss coin: first toss of a coin does not influence the second toss, birth: boy infant born first does not influence the sex of a second child, lottery: this week’s failure to win the lottery does not influence winning next week, politics: being left-handed does not influence political party affiliation, and genetics: being a male does not influence blood type.

Notation indicating independence of events A and B is P(A | B) = P(A) or P(B | A) = P(B).

Thus, from the pet survey data, the conditional probability P(D | C) = 0.25 is not equal to P(D) = 0.35, indicating, as might be suspected, that owning a dog and a cat are not independent events. Similarly, P(C | D) = 0.43 is not equal to P(C) = 0.60, necessarily indicating the same dog/cat association.
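An equivalent check compares the joint probability with the product of the two marginal probabilities. The sketch below uses a hypothetical helper, `is_independent`, with an arbitrary numerical tolerance; with sampled data exact equality is rarely observed, so this illustrates the definition rather than a statistical test.

```python
import math

def is_independent(p_a, p_b, p_a_and_b, tol=1e-9):
    """True when P(A and B) equals P(A) x P(B) up to a small tolerance."""
    return math.isclose(p_a_and_b, p_a * p_b, abs_tol=tol)

# Pet survey: P(D) = 0.35, P(C) = 0.60, P(D and C) = 0.15, but 0.35 x 0.60 = 0.21
pets_independent = is_independent(0.35, 0.60, 0.15)

# Two tosses of a fair coin: P(heads, heads) = 0.25 = 0.5 x 0.5
coins_independent = is_independent(0.5, 0.5, 0.25)

print(pets_independent, coins_independent)
```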

An important relationship:

$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}.$$

Furthermore, P(A | B) × P(B) = P(A) × P(B) = P(A and B) when events A and B are independent because then P(A | B) = P(A). Incidentally, the expression P(A | B) × P(B) = P(A and B) is called the multiplication rule. In addition, when events {A, B, C, D, . . .} are independent, joint occurrence:

$$P(A \text{ and } B \text{ and } C \text{ and } D \text{ and } \cdots) = P(A) \times P(B) \times P(C) \times P(D) \times \cdots$$

Statistics pays off

Capitalizing on lack of independence in a gambling game called blackjack made Professor Edward Thorp famous, at least in Las Vegas.

First a short and not very complete description of the rules of this popular casino card game that requires only a standard deck of 52 playing cards.

The object is to beat the casino dealer by:

1. a count higher than the dealer without exceeding a score of 21, or
2. the dealer drawing cards creating a total count that exceeds 21, or
3. a player’s first two cards being an ace and a 10-count card.

The card counts are face values 2 through 10, jack, queen, and king count 10, and ace counts one or eleven.

At the casino gaming table, two cards are dealt to each player and two cards to the dealer. Both players and dealer then have options of receiving additional cards. The sum determines the winner. The dealer starts a new game by dealing a second set of cards using the remaining cards in the deck, thus inducing a dependency (lack of independence) between cards already played and cards remaining in the deck.

Professor Thorp (1962) realized the house advantage could be overcome by simply counting the number of 10-count cards played in the previous game. Thus, when the remaining deck contains a large number of 10-count cards it gives an advantage to the player and when the cards remaining in the deck lack 10-count cards the advantage goes to the dealer. Professor Thorp simply bet large amounts of money when the deck was in his favor (lots of 10-count cards) and small amounts when the deck was not in his favor (few 10-count cards), yielding a small but profitable advantage. That is, cards left for the second game depend on the cards played in the first game, producing non-independent events yielding a detectable pattern.

Blackjack rules were immediately changed (using eight decks of cards, not one, for example), making card counting useless. Thus, like the casino games roulette, slot machines, keno, and wheel of fortune, blackjack no longer produces a detectable pattern. That is, each card dealt is essentially independent of previous cards dealt.

The playing cards could be shuffled after each game producing independence, but this would be time consuming, causing a monetary loss to the casino. Also of note, Professor Thorp wrote a book about his entire experience entitled “Beat the Dealer.”
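The dependency between games in single-deck blackjack can be illustrated by tracking the proportion of 10-count cards left in the deck. The sketch below is not Thorp's counting system; the function name and the numbers of cards played are hypothetical, chosen only to show the probability shifting.

```python
from fractions import Fraction

TEN_COUNT = 16  # tens, jacks, queens, and kings in a standard deck
DECK = 52

def p_ten_next(tens_played, others_played):
    """Probability the next card is a 10-count card, given the cards already played."""
    tens_left = TEN_COUNT - tens_played
    cards_left = DECK - tens_played - others_played
    return Fraction(tens_left, cards_left)

fresh = p_ten_next(0, 0)   # full deck: 16/52
rich = p_ten_next(2, 18)   # hypothetical: 20 cards played, only 2 of them 10-count
poor = p_ten_next(12, 8)   # hypothetical: 20 cards played, 12 of them 10-count

print(float(fresh), float(rich), float(poor))
```

With the same 32 cards remaining, the probability of a 10-count card ranges from 1/8 to 7/16 depending on what was played, which is the kind of pattern a card counter looks for.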

Picture of Probabilities—Events A and B (circles)

Table

Picture of Probabilities—Independent Events

Table of the same data—no association

Probability of event A restricted to occurrence of event B—again denoted P(A|B) is:

$P(A \mid B) = 3/9 = 0.33$ (column B).

The probability of event A is not influenced by event B; both P(A) and P(A|B) equal 0.33. In symbols, P(A|B) = P(A). Technically, the two events are said to be independent. More technically, they are said to be stochastically independent. Also, necessarily P(B|A) = P(B) = 0.60. An important consequence of independence of two events, as noted, is P(A and B) = P(A) × P(B). Specifically, from the example, P(A and B) = 0.20 and, therefore, P(A) × P(B) = (0.33)(0.60) = 0.20.

These two-circle representations of joint probabilities are called Venn diagrams (created by John Venn, 1880).

Roulette

A sometimes suggested strategy for winning at roulette:

Wait until three red numbers occur then bet on black.

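Red (R) and black (B) numbers appear with equal probabilities or P(B) = P(R). Let $R_1$, $R_2$, and $R_3$ represent occurrence of three independent red outcomes and let B represent occurrence of an additional independent black outcome. Then, consecutive occurrences of three red outcomes followed by a single black outcome dictate that:

$$P(B \mid R_1 \text{ and } R_2 \text{ and } R_3) = \frac{P(R_1 \text{ and } R_2 \text{ and } R_3 \text{ and } B)}{P(R_1 \text{ and } R_2 \text{ and } R_3)} = \frac{P(R_1)\,P(R_2)\,P(R_3)\,P(B)}{P(R_1)\,P(R_2)\,P(R_3)} = P(B).$$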

No change in the probability of the occurrence of black! The red outcomes must be unpredictable (independent) or, for example, everyone would bet on black after red occurred, or vice versa, making the game rather boring. If a successful strategy existed, the casino game of roulette would not.
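A short simulation gives the same answer empirically. This sketch assumes, as in the text, that red and black are the only outcomes and are equally likely (a real wheel also has green pockets); the function name, number of spins, and seed are arbitrary choices.

```python
import random

def simulate(spins=200_000, seed=1):
    """Return (proportion black overall, proportion black right after three reds)."""
    rng = random.Random(seed)
    outcomes = [rng.choice("RB") for _ in range(spins)]

    black_overall = outcomes.count("B") / spins

    # Outcomes that immediately follow three consecutive reds
    after_three_reds = [
        outcomes[i]
        for i in range(3, spins)
        if outcomes[i - 3:i] == ["R", "R", "R"]
    ]
    black_after_reds = after_three_reds.count("B") / len(after_three_reds)
    return black_overall, black_after_reds

overall, after_reds = simulate()
print(round(overall, 3), round(after_reds, 3))
```

Both proportions should land close to 0.5, differing only by sampling variation, so waiting for three reds buys nothing.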

One last note:

The game of roulette was originated by the famous 17th century mathematician/physicist Blaise Pascal and the word roulette means little wheel in French.

Summation notation (Σ)

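Data: $\{x_1, x_2, \ldots, x_n\}$

Sum (denoted S): $S = \sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$

Mean (denoted $\bar{x}$), then: $\bar{x} = \frac{S}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$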
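In code, the summation sign corresponds to adding up every observed value, and the mean is that sum divided by the number of observations. A minimal sketch with made-up data (the four values are hypothetical, not from the text):

```python
data = [2, 4, 6, 8]  # hypothetical observations x_1, ..., x_n

n = len(data)   # number of observations
S = sum(data)   # S = x_1 + x_2 + ... + x_n
mean = S / n    # mean = S / n

print(n, S, mean)
```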

Application:

Types of variables

Qualitative:

types       examples
nominal     socioeconomic status, ethnicity, occupation
ordinal     educational levels, military ranks, egg sizes
discrete    counts, reported ages, cigarettes smoked
binary      yes/no, exposed/unexposed, case/control

Quantitative:

types       examples
continuous  weight, distance, time
ratio       speed, rate, odds

Reference: Beat the Dealer, by Edward O. Thorp, Vintage Books, 1966

2

Distributions of data—four plots

Two quotes from the book entitled Gadsby by Ernest Vincent Wright:

First and last paragraphs:

If youth, throughout all history, had a champion to stand up for it; to show a doubting world that a child can think; and, possibly, do it practically; you would not constantly run across folks today who claim ‘a child do not know anything.’ A child’s brain starts functioning at birth; and has amongst its many infants convolutions, thousands of dormant atoms, into which God has put a mystic possibility for noticing an adult acts, and figuring out it purport.

A glorious full moon sails across a sky without a cloud. A crisp night air has folks turning up coats collars and kids hopping up and down for warmth. And that giant star, Sirius, winking slily, knows that soon, that light up in his honors room window will go out. Fttt! It is out! So, as Sirius and Luna hold an all night vigil, I will say soft ‘Good-night’ to all our happy bunch, and to John Gadsby, youth’s champion.

Question: Notice anything strange?

A table and plot of letter frequencies from the first and last paragraphs of the Gadsby book clearly show the absence of the letter “e.” In fact, the entire book of close to 50,000 words does not contain the letter “e.” Ironically, the author’s name contains the letter “e” three times. Although an extreme example, the plotted distribution illustrates the often effective use of a table or simple plot to identify properties of collected data.
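A letter-frequency table of this kind takes only a few lines of code. The sketch below counts letters in the first two sentences of the quoted last paragraph and prints a rough text barplot; the variable names and the star-based display are illustrative choices.

```python
from collections import Counter
import string

# First two sentences of the quoted last paragraph
text = (
    "A glorious full moon sails across a sky without a cloud. "
    "A crisp night air has folks turning up coats collars and "
    "kids hopping up and down for warmth."
)

# Count letters only, ignoring case, spaces, and punctuation
counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)

# One row per letter: the count followed by a bar of that length
for letter in string.ascii_lowercase:
    print(f"{letter} {counts[letter]:2d} {'*' * counts[letter]}")
```

The empty row for the letter "e" stands out immediately among its neighbors.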

Table 2.1 Distribution of letters (counts)

Frequency distribution of letters

Frequency plots are basic to describing many kinds of data. A large number of choices exist among graphical representations. Four popular choices are: a barplot, a histogram, a stem-leaf plot, and a frequency polygon.

Barplot

The height of each bar indicates the frequency of the values of the variable displayed. Order of the bars is not a statistical issue. The example is a comparison of the number of world champion chess players from each country (height = frequency). For example, visually, Russia has produced three times more world champions than the US.

Histogram

A histogram is distinctly different. Sample data (n = 19), ordered for convenience:

X = {21, 24, 27, 29, 42, 44, 48, 67, 68, 71, 73, 78, 82, 84, 86, 91, 95, 96, 99}

A histogram starts with an ordered sequence of numerical intervals. In addition, a series of rectangles is constructed to represent the frequencies of the sampled values within each of these intervals.

The example displays frequencies of observed values labeled X classified into four intervals of size 20. That is, the four intervals contain 4, 3, 5, and 7 values, respectively. The areas of the rectangles again provide a direct visual comparison of the data frequencies. The resulting plot, like the previous plots, is a visual description of the distribution of collected numeric values.
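The interval frequencies behind the histogram can be verified by counting. This sketch assumes interval boundaries at 20, 40, 60, 80, and 100 with each interval including its lower boundary but not its upper one; the helper `interval_counts` is a hypothetical name.

```python
X = [21, 24, 27, 29, 42, 44, 48, 67, 68, 71, 73, 78, 82, 84, 86, 91, 95, 96, 99]
edges = [20, 40, 60, 80, 100]  # boundaries of four intervals of size 20

def interval_counts(values, edges):
    """Count the values falling in each interval [low, high)."""
    return [
        sum(1 for v in values if low <= v < high)
        for low, high in zip(edges, edges[1:])
    ]

counts = interval_counts(X, edges)

# Text version of the histogram: one row per interval
for low, high, count in zip(edges, edges[1:], counts):
    print(f"{low:3d}-{high:3d} | {'#' * count} ({count})")
```

Because the intervals have equal width, bar length and bar area tell the same story here.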

Stem-leaf plot

The stem-leaf plot is another approach to visually displaying data but often differs little from a histogram. The “stem” is the leftmost column of the plot, usually creating single digit categories ordered smallest to largest. Multiple digit categories can also be used to create the stem. This “stem” is separated from the “leaves” by a vertical bar {“|”}. The “leaves” are an ordered list creating rows made up of the remaining digits from each category. That is, the stem consists of an ordered series of initial digits from observed values and leaves consist of remaining digits listed to the right of their respective stem value. For example, the stem-leaf plot of the previous data labeled X for intervals {20, 40, 60, 80, 100} is:
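As a rough sketch, a stem-leaf display of X can be generated with a few lines of Python. The version below assumes one stem per tens digit, which is finer than the 20-unit intervals named above, so its rows may not match the layout of the printed plot; `stem_leaf` is a hypothetical helper.

```python
from collections import defaultdict

X = [21, 24, 27, 29, 42, 44, 48, 67, 68, 71, 73, 78, 82, 84, 86, 91, 95, 96, 99]

def stem_leaf(values):
    """Return one 'stem | leaves' row per tens digit, ordered smallest to largest."""
    rows = defaultdict(list)
    for value in sorted(values):
        rows[value // 10].append(value % 10)  # stem = tens digit, leaf = ones digit
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(rows.items())
    ]

for row in stem_leaf(X):
    print(row)
```

Only stems that actually occur are listed, so the empty 30s and 50s rows are skipped in this sketch.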
