Gareth James
Daniela Witten
Trevor Hastie
Robert Tibshirani

An Introduction to Statistical Learning with Applications in R
Preface

Statistical learning refers to a set of tools for modeling and understanding complex data sets. It is a recently developed area in statistics and blends with parallel developments in computer science and, in particular, machine learning. The field encompasses many methods such as the lasso and sparse regression, classification and regression trees, and boosting and support vector machines.

With the explosion of "Big Data" problems, statistical learning has become a very hot field in many scientific areas as well as marketing, finance, and other business disciplines. People with statistical learning skills are in high demand.

One of the first books in this area, The Elements of Statistical Learning (ESL) by Hastie, Tibshirani, and Friedman, was published in 2001, with a second edition in 2009. ESL has become a popular text not only in statistics but also in related fields. One of the reasons for ESL's popularity is its relatively accessible style. But ESL is intended for individuals with advanced training in the mathematical sciences. An Introduction to Statistical Learning (ISL) arose from the perceived need for a broader and less technical treatment of these topics. In this new book, we cover many of the same topics as ESL, but we concentrate more on the applications of the methods and less on the mathematical details. We have created labs illustrating how to implement each of the statistical learning methods using the popular statistical software package R. These labs provide the reader with valuable hands-on experience.
This book is appropriate for advanced undergraduates or master's students in statistics or related quantitative fields or for individuals in other disciplines who wish to use statistical learning tools to analyze their data. It can be used as a textbook for a course spanning one or two semesters.

We would like to thank several readers for valuable comments on preliminary drafts of this book: Pallavi Basu, Alexandra Chouldechova, Patrick Danaher, Will Fithian, Luella Fu, Sam Gross, Max Grazier G'Sell, Courtney Paulson, Xinghao Qiao, Elisa Sheng, Noah Simon, Kean Ming Tan, and Xin Lu Tan.

It's tough to make predictions, especially about the future.

- Yogi Berra

Gareth James, Los Angeles, USA
Daniela Witten, Seattle, USA
Trevor Hastie, Palo Alto, USA
Robert Tibshirani, Palo Alto, USA
Contents

2 Statistical Learning
2.1 What Is Statistical Learning?
2.1.1 Why Estimate f?
2.1.2 How Do We Estimate f?
2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability
2.1.4 Supervised Versus Unsupervised Learning
2.1.5 Regression Versus Classification Problems
2.2 Assessing Model Accuracy
2.2.1 Measuring the Quality of Fit
2.2.2 The Bias-Variance Trade-Off
2.2.3 The Classification Setting
2.3 Lab: Introduction to R
2.3.1 Basic Commands
2.3.2 Graphics
2.3.3 Indexing Data
2.3.4 Loading Data
2.3.5 Additional Graphical and Numerical Summaries
2.4 Exercises
3 Linear Regression
3.1 Simple Linear Regression
3.1.1 Estimating the Coefficients
3.1.2 Assessing the Accuracy of the Coefficient Estimates
3.1.3 Assessing the Accuracy of the Model
3.2 Multiple Linear Regression
3.2.1 Estimating the Regression Coefficients
3.2.2 Some Important Questions
3.3 Other Considerations in the Regression Model
3.3.1 Qualitative Predictors
3.3.2 Extensions of the Linear Model
3.3.3 Potential Problems
3.4 The Marketing Plan
3.5 Comparison of Linear Regression with K-Nearest Neighbors
3.6 Lab: Linear Regression
3.6.1 Libraries
3.6.2 Simple Linear Regression
3.6.3 Multiple Linear Regression
3.6.4 Interaction Terms
3.6.5 Non-linear Transformations of the Predictors
3.6.6 Qualitative Predictors
3.6.7 Writing Functions
3.7 Exercises
4 Classification
4.1 An Overview of Classification
4.2 Why Not Linear Regression?
4.3 Logistic Regression
4.3.1 The Logistic Model
4.3.2 Estimating the Regression Coefficients
4.3.3 Making Predictions
4.3.4 Multiple Logistic Regression
4.3.5 Logistic Regression for >2 Response Classes
4.4 Linear Discriminant Analysis
4.4.1 Using Bayes' Theorem for Classification
4.4.2 Linear Discriminant Analysis for p = 1
4.4.3 Linear Discriminant Analysis for p > 1
4.4.4 Quadratic Discriminant Analysis
4.5 A Comparison of Classification Methods
4.6 Lab: Logistic Regression, LDA, QDA, and KNN
4.6.1 The Stock Market Data
4.6.2 Logistic Regression
4.6.3 Linear Discriminant Analysis
4.6.4 Quadratic Discriminant Analysis
4.6.5 K-Nearest Neighbors
4.6.6 An Application to Caravan Insurance Data
4.7 Exercises
5 Resampling Methods
5.1 Cross-Validation
5.1.1 The Validation Set Approach
5.1.2 Leave-One-Out Cross-Validation
5.1.3 k-Fold Cross-Validation
5.1.4 Bias-Variance Trade-Off for k-Fold Cross-Validation
5.1.5 Cross-Validation on Classification Problems
5.2 The Bootstrap
5.3 Lab: Cross-Validation and the Bootstrap
5.3.1 The Validation Set Approach
5.3.2 Leave-One-Out Cross-Validation
5.3.3 k-Fold Cross-Validation
5.3.4 The Bootstrap
5.4 Exercises
6 Linear Model Selection and Regularization
6.1 Subset Selection
6.1.1 Best Subset Selection
6.1.2 Stepwise Selection
6.1.3 Choosing the Optimal Model
6.2 Shrinkage Methods
6.2.1 Ridge Regression
6.2.2 The Lasso
6.2.3 Selecting the Tuning Parameter
6.3 Dimension Reduction Methods
6.3.1 Principal Components Regression
6.3.2 Partial Least Squares
6.4 Considerations in High Dimensions
6.4.1 High-Dimensional Data
6.4.2 What Goes Wrong in High Dimensions?
6.4.3 Regression in High Dimensions
6.4.4 Interpreting Results in High Dimensions
6.5 Lab 1: Subset Selection Methods
6.5.1 Best Subset Selection
6.5.2 Forward and Backward Stepwise Selection
6.5.3 Choosing Among Models Using the Validation Set Approach and Cross-Validation
6.6 Lab 2: Ridge Regression and the Lasso
6.6.1 Ridge Regression
6.6.2 The Lasso
6.7 Lab 3: PCR and PLS Regression
6.7.1 Principal Components Regression
6.7.2 Partial Least Squares
6.8 Exercises
7 Moving Beyond Linearity
7.1 Polynomial Regression
7.2 Step Functions
7.3 Basis Functions
7.4 Regression Splines
7.4.1 Piecewise Polynomials
7.4.2 Constraints and Splines
7.4.3 The Spline Basis Representation
7.4.4 Choosing the Number and Locations of the Knots
7.4.5 Comparison to Polynomial Regression
7.5 Smoothing Splines
7.5.1 An Overview of Smoothing Splines
7.5.2 Choosing the Smoothing Parameter λ
7.6 Local Regression
7.7 Generalized Additive Models
7.7.1 GAMs for Regression Problems
7.7.2 GAMs for Classification Problems
7.8 Lab: Non-linear Modeling
7.8.1 Polynomial Regression and Step Functions
7.8.2 Splines
7.8.3 GAMs
7.9 Exercises
8 Tree-Based Methods
8.1 The Basics of Decision Trees
8.1.1 Regression Trees
8.1.2 Classification Trees
8.1.3 Trees Versus Linear Models
8.1.4 Advantages and Disadvantages of Trees
8.2 Bagging, Random Forests, Boosting
8.2.1 Bagging
8.2.2 Random Forests
8.2.3 Boosting
8.3 Lab: Decision Trees
8.3.1 Fitting Classification Trees
8.3.2 Fitting Regression Trees
8.3.3 Bagging and Random Forests
8.3.4 Boosting
8.4 Exercises
9 Support Vector Machines
9.1 Maximal Margin Classifier
9.1.1 What Is a Hyperplane?
9.1.2 Classification Using a Separating Hyperplane
9.1.3 The Maximal Margin Classifier
9.1.4 Construction of the Maximal Margin Classifier
9.1.5 The Non-separable Case
9.2 Support Vector Classifiers
9.2.1 Overview of the Support Vector Classifier
9.2.2 Details of the Support Vector Classifier
9.3 Support Vector Machines
9.3.1 Classification with Non-linear Decision Boundaries
9.3.2 The Support Vector Machine
9.3.3 An Application to the Heart Disease Data
9.4 SVMs with More than Two Classes
9.4.1 One-Versus-One Classification
9.4.2 One-Versus-All Classification
9.5 Relationship to Logistic Regression
9.6 Lab: Support Vector Machines
9.6.1 Support Vector Classifier
9.6.2 Support Vector Machine
9.6.3 ROC Curves
9.6.4 SVM with Multiple Classes
9.6.5 Application to Gene Expression Data
9.7 Exercises
10 Unsupervised Learning
10.1 The Challenge of Unsupervised Learning
10.2 Principal Components Analysis
10.2.1 What Are Principal Components?
10.2.2 Another Interpretation of Principal Components
10.2.3 More on PCA
10.2.4 Other Uses for Principal Components
10.3 Clustering Methods
10.3.1 K-Means Clustering
10.3.2 Hierarchical Clustering
10.3.3 Practical Issues in Clustering
10.4 Lab 1: Principal Components Analysis
10.5 Lab 2: Clustering
10.5.1 K-Means Clustering
10.5.2 Hierarchical Clustering
10.6 Lab 3: NCI60 Data Example
10.6.1 PCA on the NCI60 Data
10.6.2 Clustering the Observations of the NCI60 Data
10.7 Exercises
1 Introduction

An Overview of Statistical Learning

Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy. With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data. To provide an illustration of some applications of statistical learning, we briefly discuss three real-world data sets that are considered in this book.
Wage Data

In this application (which we refer to as the Wage data set throughout this book), we examine a number of factors that relate to wages for a group of males from the Atlantic region of the United States. In particular, we wish to understand the association between an employee's age and education, as well as the calendar year, on his wage. Consider, for example, the left-hand panel of Figure 1.1, which displays wage versus age for each of the individuals in the data set. There is evidence that wage increases with age but then decreases again after approximately age 60. The blue line, which provides an estimate of the average wage for a given age, makes this trend clearer.
G. James et al., An Introduction to Statistical Learning: with Applications in R, Springer Texts in Statistics, DOI 10.1007/978-1-4614-7138-7_1
FIGURE 1.1. Wage data, which contains income survey information for males from the central Atlantic region of the United States. Left: wage as a function of age. On average, wage increases with age until about 60 years of age, at which point it begins to decline. Center: wage as a function of year. There is a slow but steady increase of approximately $10,000 in the average wage between 2003 and 2009. Right: Boxplots displaying wage as a function of education, with 1 indicating the lowest level (no high school diploma) and 5 the highest level (an advanced graduate degree). On average, wage increases with the level of education.
Givenanemployee’s age,wecanusethiscurveto predict his wage.However, itisalsoclearfromFigure 1.1 thatthereisasignificantamountofvariabilityassociatedwiththisaveragevalue,andso age aloneisunlikelyto provideanaccuratepredictionofaparticularman’s wage. Wealsohaveinformationregarding eachemployee’seducationleveland the year inwhichthe wage wasearned.Thecenterandright-handpanelsof Figure 1.1,whichdisplay wage asafunctionofboth year and education,indicatethatbothofthesefactorsareassociatedwith wage.Wagesincrease byapproximately$10,000,inaroughlylinear(orstraight-line)fashion, between2003and2009,thoughthisriseisveryslightrelativetothevariabilityinthedata.Wagesarealsotypicallygreaterforindividualswith highereducationlevels:menwiththelowesteducationlevel(1)tendto havesubstantiallylowerwagesthanthosewiththehighesteducationlevel (5).Clearly,themostaccuratepredictionofagivenman’s wage willbe obtainedbycombininghis age,his education,andthe year.InChapter 3, wediscusslinearregression,whichcanbeusedtopredict wage fromthis dataset.Ideally,weshouldpredict wage inawaythataccountsforthe non-linearrelationshipbetween wage and age.InChapter 7,wediscussa classofapproachesforaddressingthisproblem.
Stock Market Data

The Wage data involves predicting a continuous or quantitative output value. This is often referred to as a regression problem. However, in certain cases we may instead wish to predict a non-numerical value, that is, a categorical
FIGURE 1.2. Left: Boxplots of the previous day's percentage change in the S&P index for the days for which the market increased or decreased, obtained from the Smarket data. Center and Right: Same as left panel, but the percentage changes for 2 and 3 days previous are shown.
or qualitative output. For example, in Chapter 4 we examine a stock market data set that contains the daily movements in the Standard & Poor's 500 (S&P) stock index over a 5-year period between 2001 and 2005. We refer to this as the Smarket data. The goal is to predict whether the index will increase or decrease on a given day using the past 5 days' percentage changes in the index. Here the statistical learning problem does not involve predicting a numerical value. Instead it involves predicting whether a given day's stock market performance will fall into the Up bucket or the Down bucket. This is known as a classification problem. A model that could accurately predict the direction in which the market will move would be very useful!

The left-hand panel of Figure 1.2 displays two boxplots of the previous day's percentage changes in the stock index: one for the 648 days for which the market increased on the subsequent day, and one for the 602 days for which the market decreased. The two plots look almost identical, suggesting that there is no simple strategy for using yesterday's movement in the S&P to predict today's returns. The remaining panels, which display boxplots for the percentage changes 2 and 3 days previous to today, similarly indicate little association between past and present returns. Of course, this lack of pattern is to be expected: in the presence of strong correlations between successive days' returns, one could adopt a simple trading strategy to generate profits from the market. Nevertheless, in Chapter 4, we explore these data using several different statistical learning methods. Interestingly, there are hints of some weak trends in the data that suggest that, at least for this 5-year period, it is possible to correctly predict the direction of movement in the market approximately 60% of the time (Figure 1.3).
FIGURE 1.3. We fit a quadratic discriminant analysis model to the subset of the Smarket data corresponding to the 2001–2004 time period, and predicted the probability of a stock market decrease using the 2005 data. On average, the predicted probability of decrease is higher for the days in which the market does decrease. Based on these results, we are able to correctly predict the direction of movement in the market 60% of the time.
Gene Expression Data

The previous two applications illustrate data sets with both input and output variables. However, another important class of problems involves situations in which we only observe input variables, with no corresponding output. For example, in a marketing setting, we might have demographic information for a number of current or potential customers. We may wish to understand which types of customers are similar to each other by grouping individuals according to their observed characteristics. This is known as a clustering problem. Unlike in the previous examples, here we are not trying to predict an output variable.

We devote Chapter 10 to a discussion of statistical learning methods for problems in which no natural output variable is available. We consider the NCI60 data set, which consists of 6,830 gene expression measurements for each of 64 cancer cell lines. Instead of predicting a particular output variable, we are interested in determining whether there are groups, or clusters, among the cell lines based on their gene expression measurements. This is a difficult question to address, in part because there are thousands of gene expression measurements per cell line, making it hard to visualize the data.
The left-hand panel of Figure 1.4 addresses this problem by representing each of the 64 cell lines using just two numbers, Z1 and Z2. These are the first two principal components of the data, which summarize the 6,830 expression measurements for each cell line down to two numbers or dimensions. While it is likely that this dimension reduction has resulted in
FIGURE 1.4. Left: Representation of the NCI60 gene expression data set in a two-dimensional space, Z1 and Z2. Each point corresponds to one of the 64 cell lines. There appear to be four groups of cell lines, which we have represented using different colors. Right: Same as left panel except that we have represented each of the 14 different types of cancer using a different colored symbol. Cell lines corresponding to the same cancer type tend to be nearby in the two-dimensional space.
some loss of information, it is now possible to visually examine the data for evidence of clustering. Deciding on the number of clusters is often a difficult problem. But the left-hand panel of Figure 1.4 suggests at least four groups of cell lines, which we have represented using separate colors. We can now examine the cell lines within each cluster for similarities in their types of cancer, in order to better understand the relationship between gene expression levels and cancer.

In this particular data set, it turns out that the cell lines correspond to 14 different types of cancer. (However, this information was not used to create the left-hand panel of Figure 1.4.) The right-hand panel of Figure 1.4 is identical to the left-hand panel, except that the 14 cancer types are shown using distinct colored symbols. There is clear evidence that cell lines with the same cancer type tend to be located near each other in this two-dimensional representation. In addition, even though the cancer information was not used to produce the left-hand panel, the clustering obtained does bear some resemblance to some of the actual cancer types observed in the right-hand panel. This provides some independent verification of the accuracy of our clustering analysis.
A Brief History of Statistical Learning

Though the term statistical learning is fairly new, many of the concepts that underlie the field were developed long ago. At the beginning of the nineteenth century, Legendre and Gauss published papers on the method of least squares, which implemented the earliest form of what is now known as linear regression. The approach was first successfully applied to problems in astronomy. Linear regression is used for predicting quantitative values, such as an individual's salary. In order to predict qualitative values, such as whether a patient survives or dies, or whether the stock market increases or decreases, Fisher proposed linear discriminant analysis in 1936. In the 1940s, various authors put forth an alternative approach, logistic regression. In the early 1970s, Nelder and Wedderburn coined the term generalized linear models for an entire class of statistical learning methods that include both linear and logistic regression as special cases.

By the end of the 1970s, many more techniques for learning from data were available. However, they were almost exclusively linear methods, because fitting non-linear relationships was computationally infeasible at the time. By the 1980s, computing technology had finally improved sufficiently that non-linear methods were no longer computationally prohibitive. In the mid-1980s, Breiman, Friedman, Olshen, and Stone introduced classification and regression trees, and were among the first to demonstrate the power of a detailed practical implementation of a method, including cross-validation for model selection. Hastie and Tibshirani coined the term generalized additive models in 1986 for a class of non-linear extensions to generalized linear models, and also provided a practical software implementation.

Since that time, inspired by the advent of machine learning and other disciplines, statistical learning has emerged as a new subfield in statistics, focused on supervised and unsupervised modeling and prediction. In recent years, progress in statistical learning has been marked by the increasing availability of powerful and relatively user-friendly software, such as the popular and freely available R system. This has the potential to continue the transformation of the field from a set of techniques used and developed by statisticians and computer scientists to an essential toolkit for a much broader community.
This Book

The Elements of Statistical Learning (ESL) by Hastie, Tibshirani, and Friedman was first published in 2001. Since that time, it has become an important reference on the fundamentals of statistical machine learning. Its success derives from its comprehensive and detailed treatment of many important topics in statistical learning, as well as the fact that (relative to many upper-level statistics textbooks) it is accessible to a wide audience. However, the greatest factor behind the success of ESL has been its topical nature. At the time of its publication, interest in the field of statistical learning was starting to explode. ESL provided one of the first accessible and comprehensive introductions to the topic.
Since ESL was first published, the field of statistical learning has continued to flourish. The field's expansion has taken two forms. The most obvious growth has involved the development of new and improved statistical learning approaches aimed at answering a range of scientific questions across a number of fields. However, the field of statistical learning has also expanded its audience. In the 1990s, increases in computational power generated a surge of interest in the field from non-statisticians who were eager to use cutting-edge statistical tools to analyze their data. Unfortunately, the highly technical nature of these approaches meant that the user community remained primarily restricted to experts in statistics, computer science, and related fields with the training (and time) to understand and implement them.
In recent years, new and improved software packages have significantly eased the implementation burden for many statistical learning methods. At the same time, there has been growing recognition across a number of fields, from business to health care to genetics to the social sciences and beyond, that statistical learning is a powerful tool with important practical applications. As a result, the field has moved from one of primarily academic interest to a mainstream discipline, with an enormous potential audience. This trend will surely continue with the increasing availability of enormous quantities of data and the software to analyze it.
The purpose of An Introduction to Statistical Learning (ISL) is to facilitate the transition of statistical learning from an academic to a mainstream field. ISL is not intended to replace ESL, which is a far more comprehensive text both in terms of the number of approaches considered and the depth to which they are explored. We consider ESL to be an important companion for professionals (with graduate degrees in statistics, machine learning, or related fields) who need to understand the technical details behind statistical learning approaches. However, the community of users of statistical learning techniques has expanded to include individuals with a wider range of interests and backgrounds. Therefore, we believe that there is now a place for a less technical and more accessible version of ESL.
In teaching these topics over the years, we have discovered that they are of interest to master's and PhD students in fields as disparate as business administration, biology, and computer science, as well as to quantitatively-oriented upper-division undergraduates. It is important for this diverse group to be able to understand the models, intuitions, and strengths and weaknesses of the various approaches. But for this audience, many of the technical details behind statistical learning methods, such as optimization algorithms and theoretical properties, are not of primary interest. We believe that these students do not need a deep understanding of these aspects in order to become informed users of the various methodologies, and in order to contribute to their chosen fields through the use of statistical learning tools.
ISL is based on the following four premises.
1. Many statistical learning methods are relevant and useful in a wide range of academic and non-academic disciplines, beyond just the statistical sciences. We believe that many contemporary statistical learning procedures should, and will, become as widely available and used as is currently the case for classical methods such as linear regression. As a result, rather than attempting to consider every possible approach (an impossible task), we have concentrated on presenting the methods that we believe are most widely applicable.

2. Statistical learning should not be viewed as a series of black boxes. No single approach will perform well in all possible applications. Without understanding all of the cogs inside the box, or the interaction between those cogs, it is impossible to select the best box. Hence, we have attempted to carefully describe the model, intuition, assumptions, and trade-offs behind each of the methods that we consider.

3. While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box! Thus, we have minimized discussion of technical details related to fitting procedures and theoretical properties. We assume that the reader is comfortable with basic mathematical concepts, but we do not assume a graduate degree in the mathematical sciences. For instance, we have almost completely avoided the use of matrix algebra, and it is possible to understand the entire book without a detailed knowledge of matrices and vectors.

4. We presume that the reader is interested in applying statistical learning methods to real-world problems. In order to facilitate this, as well as to motivate the techniques discussed, we have devoted a section within each chapter to R computer labs. In each lab, we walk the reader through a realistic application of the methods considered in that chapter. When we have taught this material in our courses, we have allocated roughly one-third of classroom time to working through the labs, and we have found them to be extremely useful. Many of the less computationally-oriented students who were initially intimidated by R's command level interface got the hang of things over the course of the quarter or semester. We have used R because it is freely available and is powerful enough to implement all of the methods discussed in the book. It also has optional packages that can be downloaded to implement literally thousands of additional methods. Most importantly, R is the language of choice for academic statisticians, and new approaches often become available in R years before they are implemented in commercial packages. However, the labs in ISL are self-contained, and can be skipped if the reader wishes to use a different software package or does not wish to apply the methods discussed to real-world problems.
Who Should Read This Book?

This book is intended for anyone who is interested in using modern statistical methods for modeling and prediction from data. This group includes scientists, engineers, data analysts, or quants, but also less technical individuals with degrees in non-quantitative fields such as the social sciences or business. We expect that the reader will have had at least one elementary course in statistics. Background in linear regression is also useful, though not required, since we review the key concepts behind linear regression in Chapter 3. The mathematical level of this book is modest, and a detailed knowledge of matrix operations is not required. This book provides an introduction to the statistical programming language R. Previous exposure to a programming language, such as MATLAB or Python, is useful but not required.
We have successfully taught material at this level to master's and PhD students in business, computer science, biology, earth sciences, psychology, and many other areas of the physical and social sciences. This book could also be appropriate for advanced undergraduates who have already taken a course on linear regression. In the context of a more mathematically rigorous course in which ESL serves as the primary textbook, ISL could be used as a supplementary text for teaching the computational aspects of the various approaches.
Notation and Simple Matrix Algebra

Choosing notation for a textbook is always a difficult task. For the most part we adopt the same notational conventions as ESL.
We will use n to represent the number of distinct data points, or observations, in our sample. We will let p denote the number of variables that are available for use in making predictions. For example, the Wage data set consists of 12 variables for 3,000 people, so we have n = 3,000 observations and p = 12 variables (such as year, age, wage, and more). Note that throughout this book, we indicate variable names using colored font: VariableName. In some examples, p might be quite large, such as on the order of thousands or even millions; this situation arises quite often, for example, in the analysis of modern biological data or web-based advertising data.
In general, we will let x_ij represent the value of the jth variable for the ith observation, where i = 1, 2, ..., n and j = 1, 2, ..., p. Throughout this book, i will be used to index the samples or observations (from 1 to n) and j will be used to index the variables (from 1 to p). We let X denote an n × p matrix whose (i, j)th element is x_ij. That is,

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}.
For readers who are unfamiliar with matrices, it is useful to visualize X as a spreadsheet of numbers with n rows and p columns.

At times we will be interested in the rows of X, which we write as x_1, x_2, ..., x_n. Here x_i is a vector of length p, containing the p variable measurements for the ith observation. That is,

x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix}.   (1.1)
(Vectors are by default represented as columns.) For example, for the Wage data, x_i is a vector of length 12, consisting of year, age, wage, and other values for the ith individual. At other times we will instead be interested in the columns of X, which we write as x_1, x_2, ..., x_p. Each is a vector of length n. That is,

\mathbf{x}_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix}.
For example, for the Wage data, x_1 contains the n = 3,000 values for year. Using this notation, the matrix X can be written as

X = (\mathbf{x}_1 \;\; \mathbf{x}_2 \;\; \cdots \;\; \mathbf{x}_p),
or

X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}.
The ^T notation denotes the transpose of a matrix or vector. So, for example,

X^T = \begin{pmatrix} x_{11} & x_{21} & \cdots & x_{n1} \\ x_{12} & x_{22} & \cdots & x_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{np} \end{pmatrix},
while

x_i^T = (x_{i1} \;\; x_{i2} \;\; \cdots \;\; x_{ip}).
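The labs in this book use R, but the row, column, and transpose conventions above can be checked in any numerical environment. Purely as an illustrative sketch (not taken from the book's labs, and using a small made-up data matrix), here is the same bookkeeping in Python with NumPy:

```python
import numpy as np

# A made-up data matrix X with n = 3 observations (rows) and
# p = 2 variables (columns); X[i, j] plays the role of x_ij
# (with zero-based rather than one-based indices).
X = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])

n, p = X.shape        # n = 3 observations, p = 2 variables

x_i = X[0, :]         # a row x_i: the p measurements for the first observation
x_col = X[:, 0]       # a column: the n values of the first variable
Xt = X.T              # the transpose X^T, which is p x n

print(n, p, Xt.shape)
```

Rows index observations and columns index variables, exactly as in the spreadsheet picture of X described above.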
We use y_i to denote the ith observation of the variable on which we wish to make predictions, such as wage. Hence, we write the set of all n observations in vector form as

\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}.
Then our observed data consists of {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where each x_i is a vector of length p. (If p = 1, then x_i is simply a scalar.)
In this text, a vector of length n will always be denoted in lower case bold; e.g. \mathbf{a}.
However, vectors that are not of length n (such as feature vectors of length p, as in (1.1)) will be denoted in lower case normal font, e.g. a. Scalars will also be denoted in lower case normal font, e.g. a. In the rare cases in which these two uses for lower case normal font lead to ambiguity, we will clarify which use is intended. Matrices will be denoted using bold capitals, such as \mathbf{A}. Random variables will be denoted using capital normal font, e.g. A, regardless of their dimensions.
Occasionally we will want to indicate the dimension of a particular object. To indicate that an object is a scalar, we will use the notation a ∈ R. To indicate that it is a vector of length k, we will use a ∈ R^k (or a ∈ R^n if it is of length n). We will indicate that an object is an r × s matrix using A ∈ R^{r×s}.
We have avoided using matrix algebra whenever possible. However, in a few instances it becomes too cumbersome to avoid it entirely. In these rare instances it is important to understand the concept of multiplying two matrices. Suppose that A ∈ R^{r×d} and B ∈ R^{d×s}. Then the product
of A and B is denoted AB. The (i, j)th element of AB is computed by multiplying each element of the ith row of A by the corresponding element of the jth column of B. That is, (AB)_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}. As an example, consider

    A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}.

Then

    AB = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} = \begin{pmatrix} 1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\ 3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}.
Note that this operation produces an r × s matrix. It is only possible to compute AB if the number of columns of A is the same as the number of rows of B.
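In R, the %*% operator performs matrix multiplication; for instance, with two small 2 × 2 matrices:

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)  # rows (1 2) and (3 4)
B <- matrix(c(5, 6, 7, 8), nrow = 2, byrow = TRUE)  # rows (5 6) and (7 8)
AB <- A %*% B   # only valid because ncol(A) == nrow(B)
AB              # a 2 x 2 matrix: rows (19 22) and (43 50)
```

Note that the elementwise product A * B is a different operation in R; %*% is the matrix product defined above.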
Organization of This Book

Chapter 2 introduces the basic terminology and concepts behind statistical learning. This chapter also presents the K-nearest neighbor classifier, a very simple method that works surprisingly well on many problems. Chapters 3 and 4 cover classical linear methods for regression and classification. In particular, Chapter 3 reviews linear regression, the fundamental starting point for all regression methods. In Chapter 4 we discuss two of the most important classical classification methods, logistic regression and linear discriminant analysis.

A central problem in all statistical learning situations involves choosing the best method for a given application. Hence, in Chapter 5 we introduce cross-validation and the bootstrap, which can be used to estimate the accuracy of a number of different methods in order to choose the best one.

Much of the recent research in statistical learning has concentrated on non-linear methods. However, linear methods often have advantages over their non-linear competitors in terms of interpretability and sometimes also accuracy. Hence, in Chapter 6 we consider a host of linear methods, both classical and more modern, which offer potential improvements over standard linear regression. These include stepwise selection, ridge regression, principal components regression, partial least squares, and the lasso.

The remaining chapters move into the world of non-linear statistical learning. We first introduce in Chapter 7 a number of non-linear methods that work well for problems with a single input variable. We then show how these methods can be used to fit non-linear additive models for which there is more than one input. In Chapter 8, we investigate tree-based methods, including bagging, boosting, and random forests. Support vector machines, a set of approaches for performing both linear and non-linear classification, are discussed in Chapter 9. Finally, in Chapter 10, we consider a setting in which we have input variables but no output variable. In particular, we present principal components analysis, K-means clustering, and hierarchical clustering.
At the end of each chapter, we present one or more R lab sections in which we systematically work through applications of the various methods discussed in that chapter. These labs demonstrate the strengths and weaknesses of the various approaches, and also provide a useful reference for the syntax required to implement the various methods. The reader may choose to work through the labs at his or her own pace, or the labs may be the focus of group sessions as part of a classroom environment. Within each R lab, we present the results that we obtained when we performed the lab at the time of writing this book. However, new versions of R are continuously released, and over time, the packages called in the labs will be updated. Therefore, in the future, it is possible that the results shown in the lab sections may no longer correspond precisely to the results obtained by the reader who performs the labs. As necessary, we will post updates to the labs on the book website.
We use a special symbol to denote sections or exercises that contain more challenging concepts. These can be easily skipped by readers who do not wish to delve as deeply into the material, or who lack the mathematical background.
Data Sets Used in Labs and Exercises

In this textbook, we illustrate statistical learning methods using applications from marketing, finance, biology, and other areas. The ISLR package available on the book website contains a number of data sets that are required in order to perform the labs and exercises associated with this book. One other data set is contained in the MASS library, and yet another is part of the base R distribution. Table 1.1 contains a summary of the data sets required to perform the labs and exercises. A couple of these data sets are also available as text files on the book website, for use in Chapter 2.
Name       Description
Auto       Gas mileage, horsepower, and other information for cars.
Boston     Housing values and other information about Boston suburbs.
Caravan    Information about individuals offered caravan insurance.
Carseats   Information about car seat sales in 400 stores.
College    Demographic characteristics, tuition, and more for USA colleges.
Default    Customer default records for a credit card company.
Hitters    Records and salaries for baseball players.
Khan       Gene expression measurements for four cancer types.
NCI60      Gene expression measurements for 64 cancer cell lines.
OJ         Sales information for Citrus Hill and Minute Maid orange juice.
Portfolio  Past values of financial assets, for use in portfolio allocation.
Smarket    Daily percentage returns for the S&P 500 over a 5-year period.
USArrests  Crime statistics per 100,000 residents in 50 states of the USA.
Wage       Income survey data for males in the central Atlantic region of the USA.
Weekly     1,089 weekly stock market returns for 21 years.

TABLE 1.1. A list of data sets needed to perform the labs and exercises in this textbook. All data sets are available in the ISLR library, with the exception of Boston (part of MASS) and USArrests (part of the base R distribution).
The book website also contains a number of resources, including the R package associated with this book, and some additional data sets.
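As a quick check that the data are available, one can load a data set and inspect its dimensions. USArrests ships with the base R distribution, while the ISLR and MASS packages must be installed first (e.g. with install.packages("ISLR")):

```r
# USArrests is part of the base R distribution
data(USArrests)
dim(USArrests)    # 50 states, 4 variables
head(USArrests)   # first few rows of the crime statistics

# Once installed, the other libraries are loaded the same way:
# library(ISLR)   # Wage, Smarket, and most other data sets in Table 1.1
# library(MASS)   # the Boston data set
```
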
Acknowledgements

A few of the plots in this book were taken from ESL: Figures 6.7, 8.3, and 10.12. All other plots are new to this book.
2 Statistical Learning

2.1 What Is Statistical Learning?

In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. The data are displayed in Figure 2.1. It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.
In this setting, the advertising budgets are input variables while sales is an output variable. The input variables are typically denoted using the symbol X, with a subscript to distinguish them. So X_1 might be the TV budget, X_2 the radio budget, and X_3 the newspaper budget. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable (in this case, sales) is often called the response or dependent variable, and is typically denoted using the symbol Y. Throughout this book, we will use all of these terms interchangeably.
G. James et al., An Introduction to Statistical Learning: with Applications in R, Springer Texts in Statistics, DOI 10.1007/978-1-4614-7138-7_2
FIGURE 2.1. The Advertising data set. The plot displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. In each plot we show the simple least squares fit of sales to that variable, as described in Chapter 3. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.
More generally, suppose that we observe a quantitative response Y and p different predictors, X_1, X_2, ..., X_p. We assume that there is some relationship between Y and X = (X_1, X_2, ..., X_p), which can be written in the very general form

    Y = f(X) + \epsilon.
Here f is some fixed but unknown function of X_1, ..., X_p, and \epsilon is a random error term, which is independent of X and has mean zero. In this formulation, f represents the systematic information that X provides about Y.
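A short simulation makes the roles of f and ε concrete; the particular f used here is arbitrary and chosen purely for illustration:

```r
set.seed(1)
n   <- 1000
X   <- runif(n, 0, 10)
f   <- function(x) 5 + 2 * sin(x)    # the systematic (normally unknown) part
eps <- rnorm(n, mean = 0, sd = 1)    # random error: mean zero, independent of X
Y   <- f(X) + eps                    # the observed response

mean(Y - f(X))   # the sample average of the errors is close to zero
```

Plotting Y against X would show points scattered around the curve f, with roughly as many observations above it as below, mirroring Figure 2.2.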
As another example, consider the left-hand panel of Figure 2.2, a plot of income versus years of education for 30 individuals in the Income data set. The plot suggests that one might be able to predict income using years of education. However, the function f that connects the input variable to the output variable is in general unknown. In this situation one must estimate f based on the observed points. Since Income is a simulated data set, f is known and is shown by the blue curve in the right-hand panel of Figure 2.2. The vertical lines represent the error terms \epsilon. We note that some of the 30 observations lie above the blue curve and some lie below it; overall, the errors have approximately mean zero.
In general, the function f may involve more than one input variable. In Figure 2.3 we plot income as a function of years of education and seniority. Here f is a two-dimensional surface that must be estimated based on the observed data.
FIGURE 2.2. The Income data set. Left: The red dots are the observed values of income (in tens of thousands of dollars) and years of education for 30 individuals. Right: The blue curve represents the true underlying relationship between income and years of education, which is generally unknown (but is known in this case because the data were simulated). The black lines represent the error associated with each observation. Note that some errors are positive (if an observation lies above the blue curve) and some are negative (if an observation lies below the curve). Overall, these errors have approximately mean zero.
In essence, statistical learning refers to a set of approaches for estimating f. In this chapter we outline some of the key theoretical concepts that arise in estimating f, as well as tools for evaluating the estimates obtained.
2.1.1 Why Estimate f?

There are two main reasons that we may wish to estimate f: prediction and inference. We discuss each in turn.
Prediction
In many situations, a set of inputs X are readily available, but the output Y cannot be easily obtained. In this setting, since the error term averages to zero, we can predict Y using

    \hat{Y} = \hat{f}(X),
where \hat{f} represents our estimate for f, and \hat{Y} represents the resulting prediction for Y. In this setting, \hat{f} is often treated as a black box, in the sense that one is not typically concerned with the exact form of \hat{f}, provided that it yields accurate predictions for Y.
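To illustrate this black-box view, a fitting procedure in R returns an object that can be handed to predict() without ever inspecting the form of \hat{f}. Here, purely as a sketch, a simple linear model is fit to simulated data; any other fitting procedure could be substituted:

```r
set.seed(2)
x <- runif(50, 0, 10)
y <- 3 + 0.5 * x + rnorm(50)   # simulated data, so the true f is known

fhat <- lm(y ~ x)              # an estimate f-hat, treated as a black box
yhat <- predict(fhat, data.frame(x = c(2, 5, 8)))
yhat                           # predictions Y-hat at three new inputs
```

The same two-step pattern (fit, then predict on new inputs) applies unchanged to the more flexible methods introduced in later chapters.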