Using R for Data Analysis in Social Sciences
A Research Project-Oriented Approach
QUAN LI
Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trademark of Oxford University Press in the UK and in certain other countries.
Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America.
© Oxford University Press 2018
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by license or under terms agreed with the appropriate reproduction rights organization. Inquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.
You must not circulate this work in any other form and you must impose this same condition on any acquirer.
Library of Congress Cataloging-in-Publication Data
Names: Li, Quan, 1966– author.
Title: Using R for data analysis in social sciences: a research project-oriented approach / Quan Li.
Description: New York, NY: Oxford University Press, [2018]
Identifiers: LCCN 2017010031 | ISBN 9780190656225 (pbk.) | ISBN 9780190656218 (hardcover) | ISBN 9780190656232 (updf) | ISBN 9780190656249 (epub)
Subjects: LCSH: Social sciences–Research–Data processing. | Social sciences–Statistical methods. | R (Computer program language)
Classification: LCC H61.3 .L52 2018 | DDC 330.285/5133–dc23
LC record available at https://lccn.loc.gov/2017010031
1 3 5 7 9 8 6 4 2
Paperback printed by WebCom, Inc., Canada
Hardback printed by Bridgeport National Bindery, Inc., United States of America
CONTENTS
List of Figures
List of Tables
Acknowledgments
Introduction
1. Learn about R and Write First Toy Programs
WHEN TO USE R IN A RESEARCH PROJECT
ESSENTIALS ABOUT R
HOW TO START A PROJECT FOLDER AND WRITE OUR FIRST R PROGRAM
CREATE, DESCRIBE, AND GRAPH A VECTOR: A SIMPLE TOY EXAMPLE
SIMPLE REAL-WORLD EXAMPLE: DATA FROM IVERSEN AND SOSKICE (2006)
CHAPTER 1: R PROGRAM CODE
TROUBLESHOOT AND GET HELP
IMPORTANT REFERENCE INFORMATION: SYMBOLS, OPERATORS, AND FUNCTIONS
SUMMARY
MISCELLANEOUS Q&AS FOR AMBITIOUS READERS
EXERCISES
2. Get Data Ready: Import, Inspect, and Prepare Data
PREPARATION
IMPORT PENN WORLD TABLE 7.0 DATASET
INSPECT IMPORTED DATA
PREPARE DATA I: VARIABLE TYPES AND INDEXING
PREPARE DATA II: MANAGE DATASETS
PREPARE DATA III: MANAGE OBSERVATIONS
PREPARE DATA IV: MANAGE VARIABLES
CHAPTER 2 PROGRAM CODE
SUMMARY
MISCELLANEOUS Q&AS FOR AMBITIOUS READERS
EXERCISES
3. One-Sample and Difference-of-Means Tests
CONCEPTUAL PREPARATION
DATA PREPARATION
WHAT IS THE AVERAGE ECONOMIC GROWTH RATE IN THE WORLD ECONOMY?
DID THE WORLD ECONOMY GROW MORE QUICKLY IN 1990 THAN IN 1960?
CHAPTER 3 PROGRAM CODE
SUMMARY
MISCELLANEOUS Q&AS FOR AMBITIOUS READERS
EXERCISES
4. Covariance and Correlation
DATA AND SOFTWARE PREPARATIONS
VISUALIZE THE RELATIONSHIP BETWEEN TRADE AND GROWTH USING SCATTER PLOT
ARE TRADE OPENNESS AND ECONOMIC GROWTH CORRELATED?
DOES THE CORRELATION BETWEEN TRADE AND GROWTH CHANGE OVER TIME?
CHAPTER 4 PROGRAM CODE
SUMMARY
MISCELLANEOUS Q&AS FOR AMBITIOUS READERS
EXERCISES
5. Regression Analysis
CONCEPTUAL PREPARATION: HOW TO UNDERSTAND REGRESSION ANALYSIS
DATA PREPARATION
VISUALIZE AND INSPECT DATA
HOW TO ESTIMATE AND INTERPRET OLS MODEL COEFFICIENTS
HOW TO ESTIMATE STANDARD ERROR OF COEFFICIENT
HOW TO MAKE AN INFERENCE ABOUT THE POPULATION PARAMETER OF INTEREST
HOW TO INTERPRET OVERALL MODEL FIT
HOW TO PRESENT STATISTICAL RESULTS
CHAPTER 5 PROGRAM CODE
SUMMARY
MISCELLANEOUS Q&AS FOR AMBITIOUS READERS
EXERCISES
6. Regression Diagnostics and Sensitivity Analysis
WHY ARE OLS ASSUMPTIONS AND DIAGNOSTICS IMPORTANT?
DATA PREPARATION
LINEARITY AND MODEL SPECIFICATION
PERFECT AND HIGH MULTICOLLINEARITY
CONSTANT ERROR VARIANCE
INDEPENDENCE OF ERROR TERM OBSERVATIONS
INFLUENTIAL OBSERVATIONS
NORMALITY TEST
REPORT FINDINGS
CHAPTER 6 PROGRAM CODE
SUMMARY
MISCELLANEOUS Q&AS FOR AMBITIOUS READERS
EXERCISES
7. Replication of Findings in Published Analyses
WHAT EXPLAINS THE GEOGRAPHIC SPREAD OF MILITARIZED INTERSTATE DISPUTES? REPLICATION AND DIAGNOSTICS OF BRAITHWAITE (2006)
DOES RELIGIOSITY INFLUENCE INDIVIDUAL ATTITUDES TOWARD INNOVATION? REPLICATION OF BÉNABOU ET AL. (2015)
CHAPTER 7 PROGRAM CODE
SUMMARY
8. Appendix: A Brief Introduction to Analyzing Categorical Data and Finding More Data
OBJECTIVE
GETTING DATA READY
DO MEN AND WOMEN DIFFER IN SELF-REPORTED HAPPINESS?
DO BELIEVERS IN GOD AND NON-BELIEVERS DIFFER IN SELF-REPORTED HAPPINESS?
SOURCES OF SELF-REPORTED HAPPINESS: LOGISTIC REGRESSION
WHERE TO FIND MORE DATA
References and Readings
Index
LIST OF FIGURES
1.1 How to Write First Toy Program in R
1.2 How to Install Add-on Package
1.3 Distribution of Discrete Variable vd$v1: Bar Chart
1.4 Distribution of Continuous Variable vd$v1: Boxplot and Histogram
1.5 Distribution of Wage Inequality from Iversen and Soskice (2006)
1.6 Distribution of PR and Majoritarian Systems from Iversen and Soskice (2006)
1.7 RStudio Screenshot
2.1 Using View() Function to View Raw Data
2.2 Distribution of Variable rgdpl
3.1 Types of Errors and Alternative Sampling Distributions
3.2 Histogram for Growth
3.3 Mean and 95% Confidence Interval for Growth
3.4 Mean and 95% Confidence Interval for Growth: 1960 and 1990
4.1 Simulated Positive Correlations of Two Random Variables
4.2 Scatter Plot of Trade Openness and Economic Growth
4.3 Correlation between Trade and Growth over Time
4.4 P Value of Correlation between Trade and Growth over Time
4.5 Anscombe Quartet Scatter Plot
5.1 Original Statistical Results from Frankel and Romer (1999)
5.2 Comparing Unlogged and Logged Income per Person
5.3 Trade Openness and Log of Income per Person
5.4 Coefficients Plot for Model 1
5.5 Partial Regression Plot
5.6 Explore Pairwise Relationships among Variables
6.1 Anscombe Quartet Regressions
6.2 Anscombe Quartet Residuals versus Fitted Values Plots
6.3 Diagnostic Plots for a Well-Behaved Regression
6.4 Residuals versus Fitted Values: Linearity
6.5 Residuals versus Independent Variables: Linearity
6.6 Trade Openness and Log of Income per Person
6.7 Distribution of Residuals by Region
6.8 Scatter Plot of Trade and Income by Region
6.9 Estimated Effect of Trade on Income by Region
6.10 Influence Plot of Influential Observations
6.11 Influential Observations Above Cook’s D Threshold
6.12 Normality Assumption Diagnostic Plot
7.1 Regression Diagnostic Plot: Residuals versus Fitted Values
7.2 Diagnostic Plot for Influential Observations: Cook’s D
7.3 Normality Assumption Diagnostic Plot
8.1 Sample Page from World Values Survey Codebook
LIST OF TABLES
1.1 Country Means for Variables Used in Regression Analysis (from Iversen and Soskice, 2006)
1.2 Statistics of Imported Data from Iversen and Soskice (2006)
1.3 Important Symbols in R
1.4 Arithmetic Operators
1.5 Logical Operators
1.6 Common Statistical and Mathematical Functions
2.1 List of Data Preparation Tasks and Related R Functions
3.1 Logic of Statistical Inference
3.2 Two-Sample Difference-of-Means Tests
5.1 Coefficient Interpretation in Log or Unlogged Models
5.2 Descriptive Statistics of Final Dataset
5.3 Effect of Trade Openness on Real Income per Person
6.1 Regression Results Using Anscombe’s Quartet
6.2 Effect of Trade on Income: Robustness Checks Part I
6.3 Effect of Trade on Income: Robustness Checks Part II
7.1 Variable Measures and Expected Effects
7.2 OLS Regression of Dispute Dispersion (Original Statistical Results Table from Braithwaite, 2006)
7.3 Original Descriptive Statistics Table in Braithwaite (2006)
7.4 Causes of Spread of Military Disputes: Replication and Robustness Tests
7.5 Most Important Qualities for Children to Have (from Bénabou et al., 2015)
7.6 Variable Labels for Dataset in Bénabou et al. (2015)
7.7 Replicating Table 2 in Bénabou et al. (2015)
ACKNOWLEDGMENTS
Five original tables from four different journal articles are reprinted in the book for replication exercises. The articles include (1) Iversen, Torben, and David Soskice, 2006, “Electoral Institutions and the Politics of Coalitions: Why Some Democracies Redistribute More Than Others,” American Political Science Review 100(2): 165–81, Table A1. Copyright: Cambridge University Press. (2) Frankel, Jeffrey A., and David Romer, 1999, “Does Trade Cause Growth?” American Economic Review 89(3): 379–99, Table 3. Copyright: American Economic Association. (3) Braithwaite, Alex, 2006, “The Geographic Spread of Militarized Disputes,” Journal of Peace Research 43(5): 507–22, Table I and Table II. Copyright: SAGE Publications. (4) Bénabou, Roland, Davide Ticchi, and Andrea Vindigni, 2015, “Religion and Innovation,” American Economic Review 105(5): 346–51, Table 2. Copyright: American Economic Association. Permissions to reprint the relevant tables in Iversen and Soskice (2006) and Braithwaite (2006) have been acquired and licensed from Cambridge University Press and SAGE Publications.
Jeffrey Frankel, Roland Bénabou, and the American Economic Association deserve special thanks for graciously granting me permission to reprint the relevant tables in their articles for free.
Figures 1 through 4 in F. J. Anscombe’s “Graphs in Statistical Analysis,” published in 1973 in The American Statistician 27(1): 17–21, have been adapted and used with permission of the publisher, Taylor & Francis Ltd, http://www.tandfonline.com.
This book would not have been possible without the encouragement, help, and support of many students, colleagues, and friends. My undergraduate students in Polimetrics and Senior Research Seminar at Texas A&M University gave me the first impetus to write this book. Many students taking those two courses, especially Jacob King and Alex Goodman, caught typos and mistakes in earlier drafts. During the summer of 2016, Scarlet Amo, Corbin Cali, Chandler Dawson, and Elizabeth Gohmert experimented with using an earlier version of the manuscript to self-study R for data analysis. They provided detailed reports on each chapter and completed independent application papers. Their input has dramatically changed and improved how various materials in the book are now presented and structured. I thank them for their extraordinary work and effort. My graduate assistants, Molly Berkemeier, Kelly McCaskey, and Austin Johnson, provided excellent editorial assistance. My colleagues and friends, Tiyi Feng, Ren Mu, Erica Owen, and Carlisle Rainey, read parts of an earlier draft and provided valuable feedback and suggestions.
Many people at Oxford University Press have helped to make this manuscript possible and better. Scott Parris, who was the editor for my first book by Cambridge University Press, had been patiently encouraging and prodding me to finish this book until his retirement from Oxford. Happy retirement, Scott! Before retiring from Oxford, Scott handed my case to Anne Dellinger. Anne’s enthusiasm and encouragement were the main reason that I decided to stay with Oxford. After Anne departed from Oxford, David Pervin became my editor and offered sound advice. Scott’s assistant Cathryn Vaulman and David’s assistants Emily Mackenzie and Hayley Singer took care of many of the logistic issues in the process. Debbie Ruel corrected many errors and did a great job during copyediting, and Lincy Priya patiently dealt with my requests and smoothly handled the production of my book. Xun Pang and Jude Hays provided valuable comments and suggestions that helped to make the book enormously better.
Finally, my greatest debt of gratitude is owed to my wife, Liu, and my two children, Ellen and Andrew. Without their unyielding support, constant inquiry, and even reading parts of the book and checking my R code, I would not have finished the project. This book is dedicated to them!
INTRODUCTION
This book seeks to teach senior undergraduate and beginning graduate students in social sciences how to use R to manage, visualize, and analyze data in order to answer substantive research questions and reproduce the statistical analysis in published journal articles. Over the past several decades, statistical analysis training has become increasingly important for undergraduate and graduate students in many disciplines within social and behavioral sciences, such as economics, political science, public administration, business, public health, anthropology, psychology, sociology, education, and communication. With rapid progress in statistical computing, proficiency in using statistical software has become almost a universal requirement, albeit to varying degrees, in statistical methods courses. Popular software choices include SAS, SPSS, Stata, and R. While SAS, SPSS, and Stata all have accessible introductory textbooks targeting students in social sciences, such textbooks on R are rare.
Compared with commercial packages like SAS, SPSS, and Stata, R has at least three strengths. It is a well-thought-out, coherent system that comes with a suite of software facilities for data management, visualization, and analysis. In addition, to meet emerging needs, a large community of R users constantly develops new open source add-on packages, which already number over 10,000. Finally, perhaps the greatest perk of the software is that it is free. This financial benefit cannot be overemphasized. Cash-strapped college students often find themselves relying on lab computers for access to SAS, SPSS, and Stata, or constrained by the limitations of the student versions of those commercial packages. Even after graduation, many find it difficult to convince their employers to purchase the particular commercial package they know for everyday use.
There are many reasons why R is preferred to other statistical software packages in higher education. But R’s greatest handicap to its widespread use in the social sciences is its steep learning curve. While the market has produced numerous books on R at various levels, introductory textbooks that focus on the needs of students in the social sciences are not easy to find. This book seeks to fill this void.
This book distinguishes itself from other introductory R or statistics books in three important ways. First, it intends to serve as an introductory text on using R for data analysis projects, targeting an audience rarely exposed to statistical programming. The rationale for emphasizing the introductory nature of this book is simple; it is driven by the needs and heterogeneity of the student body we often come across in classroom teaching in social science departments. Unlike students in math and statistics, many student users of R in social sciences have no experience in any computing language or programming software, and many will never achieve a higher level of programming beyond what is necessary for their everyday use in R. However, students in social sciences will find that the opportunity to use R for data manipulation, visualization, and analysis frequently presents itself in various courses and future careers. Hence, they need to become proficient at accomplishing common tasks in data manipulation, visualization, and analysis using R, without getting overly technical. In this respect, existing introductory texts on R programming that do not involve statistics tend to be overly comprehensive in coverage and are often geared toward students in math, statistics, sciences, and engineering, thus intimidating most social science students. Alain Zuur, Elena Ieno, and Erik Meesters’ A Beginner’s Guide to R and Philip Spector’s Data Manipulation with R are good examples. Their target audiences are often students majoring in math, statistics, sciences, and engineering, who have more experience in programming than their classmates in social sciences.
This book, in contrast, adopts a minimalist approach in teaching R. It covers only the most important features and functions in R that one will need for conducting reproducible research projects, with other materials moved to chapter appendices or removed from consideration completely. R is extremely flexible, almost always allowing multiple solutions to one programming task. While this is a strength, it does challenge beginning R users rarely exposed to computer programming. The minimalist approach adopted here will typically present one way to deal with a task in the main part of a chapter, leaving alternatives to a section called “Miscellaneous Questions for Ambitious Readers.” As a result, the minimalist approach should flatten the steep learning curve, a commonly noted disadvantage of R, thereby improving the software’s accessibility to undergraduates and similar audiences. Organizationally, this book breaks down chapters into small sections that mimic lab sessions for students. Each chapter focuses on only the essential R functions one needs to know in order to manipulate, visualize, and analyze data to accomplish some primary statistical analysis tasks. In the end, through this minimalist approach, the reader will accumulate enough R knowledge and skills to complete a course research project and to self-study more advanced R materials if necessary.
A second unique feature of this book is its emphasis on meeting the practical needs of students using R to conduct statistical analysis for research projects driven by substantive questions in social sciences. In addition to homework assignments and problem sets, statistical methods courses in social sciences often require the completion of a full-blown, substantively motivated research project. Such training is critical if statistical knowledge is to prove to be of any value and relevance to substantive courses and students’ future careers. Ideally, students can utilize completed statistical analysis papers as writing samples to showcase their quantitative skills in their graduate school or job applications.
In practice, to accomplish such a project on a substantive question, a student has to collect, clean, and manipulate data, visualize and analyze data systematically to address the question asked, and report findings in an organized manner. Many R books for introductory statistics tend to emphasize the R code for statistical techniques, giving insufficient attention to the pre-analysis needs of users as well as the process of completing a research project. For example, John Verzani’s Using R for Introductory Statistics and Michael Crawley’s Introductory Statistics Using R are two popular texts in this category. In such texts, data preparation is not linked to particular research projects that address substantive questions.
In contrast, this book is written under the premise that the reader uses R primarily to address some substantive question of interest. This leads to several notable differences from other introductory statistics books using R. This book begins with the use of R to get an original raw dataset into a condition appropriate for statistical analysis, thus emphasizing how to deal with various issues that arise in such a process. Next, instead of starting with the interactive use of R, which is typical in other textbooks, this book gives exclusive attention to writing and executing R programs. This approach allows easy verification, recollection, and replication of analysis, and it is almost always how things are done in actual reproducible research. Students following this approach will write many well-documented R programs that address a variety of practical issues, and they can save those programs for future reference. Last but not least, the use of R in this book is closely integrated into a prototypical process that consists of a sequence of elements: a substantive question to be answered, a hypothesis that answers the question, the logic of statistical inference behind the empirical test of the hypothesis, the test statistic for statistical inference represented in mathematical notation and implemented computationally in R, and the presentation of findings in an organized manner. The emphasis is on an in-depth understanding of why we do statistical analysis and how R fits into actual empirical research. Hence, this research process- or project-oriented approach ought to significantly increase the likelihood that students will actually use R to solve problems in their future courses and careers.
A third unique feature of this book is its emphasis on teaching students how to replicate statistical analyses in published journal articles. Scientific progress requires that previous findings be replicable and replicated; scientific education, as in physics and chemistry, always includes lab exercises that replicate previous experiments. As social scientific knowledge becomes increasingly evidence-based and relies on extensive data analysis, learning to replicate published results is a necessary step for undergraduates and first-year graduate students in their learning to conduct social scientific research. Such training now becomes feasible because of the availability of powerful free software and a wide range of datasets in the public domain. Many journals now require authors to submit and deposit replication datasets. Many original data from surveys and archival research are downloadable from the internet. Students no longer have to be just passive consumers of social scientific research but instead can actively scrutinize published research, play with the data, and reproduce or fail to reproduce previous findings. This will convert students from passive consumers into active learners. As reproducing research findings becomes the norm rather than the exception, it will empower the students, lower the barrier to their entry into the academic community, and challenge the professors and other knowledge producers. The wide gap between teaching and research commonly observed in undergraduate courses in social sciences will be narrowed. Such changes are likely to make teaching more interesting for professors, render learning more fruitful for students, and enable both parties to become more successful in their endeavors.
This book consists of eight chapters. Chapter 1 introduces R, illustrating how to write and execute programs using the software. Chapter 2 goes through the process of, and various main tasks in, getting data ready for analysis in R. Chapter 3 provides a conceptual background on the logic of statistical inference and then demonstrates how to make statistical inference with respect to one continuous outcome variable using one- and two-sample t tests. Chapter 4 moves into analyzing the relationship between two continuous variables, focusing on covariance and correlation. Chapter 5 introduces regression analysis, covering its conceptual foundation, model specification, estimation, interpretation, and inference. Chapter 6 continues with regression analysis, delving into various diagnostics and sensitivity analyses. Chapters 4 through 6 follow the same approach, integrating conceptual and mathematical foundation, data preparation, statistical analysis, and results reporting within each chapter. Chapter 7 walks readers through the process of replicating two published analyses. Finally, Chapter 8, as an appendix, provides a brief introduction to analyzing discrete data, demonstrating the Chi-squared test of independence and logistic regression.
No textbook can be perfect; this one is no exception. The minimalist approach, emphasizing the accessibility of R, comes at a price. Many commonly used functions and features of R, such as writing functions and loops, are not covered. Similarly, by focusing on teaching the research process of how to use R to address substantive questions, this book primarily covers how to explain one continuous outcome variable and the relevant statistical techniques, such as mean, difference of means, covariance, correlation, and cross-sectional regression. Hence, comprehensiveness in both programming and statistics is sacrificed, on purpose, for greater accessibility, clarity, and depth. The goal is to make this book accessible and useful for novices in both programming and data analysis.
In sum, this book integrates R programming, the logic and steps of statistical inference, and the process of empirical social scientific research in a highly accessible and structured fashion. It emphasizes learning to use R for essential data management, visualization, analysis, and replicating published research findings. By the end of this book, students will have learned how to do the following: (1) use R to import data, inspect data, identify dataset attributes, and manage observations, variables, and datasets; (2) use R to graph simple histograms, box plots, scatter plots, and research findings; (3) use R to summarize data, conduct a one-sample t-test, test the difference of means between groups, compute covariance and correlation, estimate and interpret ordinary least squares (OLS) regression, and diagnose and correct regression assumption violations; and (4) replicate research findings in published journal articles. The principle behind this book is to teach students to learn as little R as possible but to do as much substantively driven data analysis at the beginner or intermediate level as possible. The minimalist approach should dramatically reduce the learning cost but still prove adequate for meeting the practical research needs of senior undergraduate and beginning graduate students in the social sciences. Having completed this book, students can competently use R and statistical analysis to answer substantive questions regarding some substantively interesting continuous outcome variable in a cross-sectional design. It is my hope that the newly acquired competence will motivate students to want to, rather than being forced to, learn more about R and statistics.
Learn about R and Write First Toy Programs
Chapter Objectives
In this first chapter, we will aim to achieve the following objectives:
1. Understand when to use R in a research project.
2. Learn about the basic background of R, software installation, and getting help.
3. Learn to set up a project folder for R programs and data files.
4. Learn to write and execute simple toy programs.
5. Learn to find and set the working directory for a project in R.
6. Learn to create a data vector.
7. Learn to calculate descriptive statistics and handle missing values.
8. Learn to convert a data vector into a data frame.
9. Learn to refer to a variable within a data frame.
10. Learn to install an add-on package, "stargazer," load it into R, and use it to get a descriptive statistics table.
11. Learn to graph the distribution of a variable.
12. Apply all the lessons learned to a real-world data example.
13. Learn about common coding errors and how to get help.
Materials in this chapter need about an hour and a half for a class of about 10 students to cover in a lab, including brief lecturing and hands-on practice. Larger classes or self-study could take longer.
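As a rough preview of objectives 5 through 9, the following minimal sketch creates a toy vector, summarizes it while handling a missing value, and converts it into a data frame. The names v1 and vd echo the vd$v1 label that appears in the figure titles, but the values are invented for illustration and this is not the book's own program. The stargazer step (objective 10) is left out because it requires installing an add-on package.

```r
# Show the current working directory (objective 5)
getwd()

# Create a toy data vector with one missing value (made-up numbers)
v1 <- c(1, 2, 3, 4, NA)

# Descriptive statistics; na.rm = TRUE drops the missing value first
v1_mean <- mean(v1, na.rm = TRUE)
v1_sd <- sd(v1, na.rm = TRUE)
print(c(mean = v1_mean, sd = v1_sd))

# Convert the vector into a data frame with one variable
vd <- data.frame(v1 = v1)

# Refer to a variable inside the data frame with the $ operator
summary(vd$v1)
```

Without na.rm = TRUE, mean(v1) would return NA, which is why missing values have to be handled explicitly.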
When to Use R in a Research Project
Completing an empirical research project involves several stages, often starting with the identification of a research problem and ending with the report of findings and implications:
1. Identify a research problem
2. Survey the literature (find out what is known about the problem)
3. Formulate a theoretical argument and some testable hypothesis
4. Measure concepts
5. Collect data
6. Prepare data
7. Analyze data
8. Report findings and implications
The tasks of identifying a significant and interesting research problem, surveying the extant literature, formulating a coherent theoretical argument and some testable hypothesis that explain the research puzzle, measuring concepts in the theory empirically, and collecting data for the empirical indicators of the concepts, that is, tasks (1) to (5), are generally dealt with in substantive and research design courses in a field. Those topics are beyond the scope of this little R book. Yet tasks (6) to (8) may all involve R as a research instrument. Specifically, the purpose of using R in actual research projects is to analyze particular research problems, such as evaluating the impact of a policy or testing the impact of a causal factor (or an independent variable) on an outcome (or a dependent variable) of interest, as postulated by pre-specified theoretical expectations. How to accomplish tasks (6) to (8) will be illustrated in the following chapters.
A research project of this type presents at least two challenges, for which R will be useful. First, in practice, such a project involves a range of tasks, such as importing data into software, merging different datasets together, verifying data, creating new variables, recoding and renaming variables, visualizing data, running statistical estimation procedures, carrying out diagnostic tests, and so on. Second, an analyst needs to be able to reproduce his or her own analysis, including dataset construction and estimation results, even years later. The first challenge concerns the efficiency of an analysis, whereas the second concerns the reproducibility and integrity of the analysis.
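A few of the tasks named above (merging datasets, creating a new variable, renaming a variable, and verifying the result) might look as follows in base R. This is only a sketch under stated assumptions: the data frames growth and trade, their variable names, and all their values are hypothetical and are not data used in the book.

```r
# Two tiny hypothetical datasets that share a key variable, country
growth <- data.frame(country = c("A", "B", "C"),
                     growth = c(2.1, -0.5, 3.4))
trade <- data.frame(country = c("A", "B", "C"),
                    open = c(45, 80, 120))

# Merge the two datasets on the shared key
d <- merge(growth, trade, by = "country")

# Create a new variable: 1 if growth is positive, 0 otherwise
d$grew <- ifelse(d$growth > 0, 1, 0)

# Rename the variable open to openness
names(d)[names(d) == "open"] <- "openness"

# Verify the structure of the resulting dataset
str(d)
```

Because every step is written down, rerunning the same lines rebuilds the same dataset, which speaks to the second challenge as well as the first.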
To achieve both efficiency and reproducibility, experienced analysts always choose to write down their computing code in one or more programs so that the code can be submitted, revised, and resubmitted to reproduce an analysis speedily and whenever necessary. Hence, in this book, we will focus on how to write and submit R programs for specific tasks in a program editor, rather than the interactive use or menu-driven interface of R. For all practical purposes, the programming approach is much more efficient and consistent than the interactive or menu-driven approach.
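To make the programming approach concrete, the sketch below saves a two-line program to a file and then runs the whole file with source(). The temporary file and the object names x and x_mean are assumptions for illustration only; in an actual project the program would be a .R file kept in the project folder.

```r
# Write a two-line R program to a temporary file
# (hypothetical stand-in for a .R file in a project folder)
prog <- tempfile(fileext = ".R")
writeLines(c("x <- c(2, 4, 6)",
             "x_mean <- mean(x)"),
           prog)

# Run the entire program at once; rerunning it reproduces the same result
source(prog)
print(x_mean)
```

The same file can be revised and resubmitted as often as needed, which is harder to guarantee when commands are typed one at a time or chosen from menus.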
Before we step into how to use R, we will need to clarify some related organizational and housekeeping issues. In this chapter, we will first offer a very brief introduction to R, then demonstrate how to install R, write and execute R programs, install and load add-on packages, and produce graphical and numerical output, and then turn to essential reference information about important symbols and common coding errors. Notably, each line of R code will likely appear three times: presented as a stand-alone command line preceded or followed by an explanation of its purpose and function, listed together with the output from its execution, and collated with all other program code in the chapter for the sake of convenient reference. We will end the chapter with a section about miscellaneous issues of interest to ambitious readers and a section on exercises.
Essentials about R
A One-Paragraph Introduction to R
R is a computer language and an environment for statistical computing and graphics with important advantages. Started by Robert Gentleman and Ross Ihaka of the University of Auckland in 1995, it is now maintained by the R core-development team of volunteer developers. R is referred to as a computer language because, as a dialect of the S language developed in the late 1980s at AT&T’s labs, R allows users to follow the algorithms, define and add new functions, and write new analytic methods, rather than merely supplying canned routines. R is also a coherent system that provides an environment with an integrated suite of software facilities for data storage, manipulation, analysis, and visualization. In addition, R is flexible. It runs on Windows, UNIX, and Mac OS X. It can be easily extended in terms of new functions and state-of-the-art statistical methods; the more than 10,000 add-on packages available through the CRAN family of internet sites by the end of January 2017 testify to this fact. Last but not least, R is free, as are its numerous add-on packages. Hence, R is popular among practitioners in many fields and scholars in many disciplines, including the social sciences.
Installation
As an open source software for statistical computing, R can be easily downloaded from the following site: http://www.r-project.org/. We may simply click on the highlighted download R link to reach a list of CRAN mirror sites. Clicking on any site we prefer directs us to the page for downloading the software for three different platforms: Linux, Windows, and Mac. R works slightly differently across