Insights from Data with R
An Introduction for the Life and Environmental Sciences
OWEN L. PETCHEY
Department of Evolutionary Biology and Environmental Studies, University of Zürich, Switzerland
ANDREW P. BECKERMAN
Department of Animal and Plant Sciences, University of Sheffield, UK
NATALIE COOPER
Natural History Museum, London, UK
DYLAN Z. CHILDS
Department of Animal and Plant Sciences, University of Sheffield, UK
Great Clarendon Street, Oxford, OX2 6DP, United Kingdom
Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries
© Owen L. Petchey, Andrew P. Beckerman, Natalie Cooper, Dylan Z. Childs 2021
The moral rights of the authors have been asserted
First Edition published in 2021
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above
You must not circulate this work in any other form and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2020948906
ISBN 978–0–19–884981–0 (hbk.)
ISBN 978–0–19–884982–7 (pbk.)
DOI: 10.1093/oso/9780198849810.001.0001
Printed in Great Britain by Bell & Bain Ltd., Glasgow
Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.
The preface of this book is published under an Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) licence.
Preface
Overview
Many activities, including experiments, surveys, clinical trials, and fieldwork, generate data. These data provide insights: intuitions and conclusions that come from identifying patterns in data. Insights are critical for answering questions, solving problems, guiding decisions, and formulating strategy. But getting insights from data, and doing so efficiently, reliably, and confidently, does not come easily. Yet getting insights from data is a foundational skill for all scientists.
Insights from Data with R is for life and environmental science undergraduates (though it may also help anyone beginning to learn about data analysis), and for their instructors to teach alongside. It is not about statistics per se, but about that initial transition from having collected data as part of a project to that first, and so satisfying, realization that there is a pattern in your data. It combines the elements of the successful undergraduate data analysis courses of Petchey at the University of Zürich and of Childs at the University of Sheffield, the ‘Introduction to R’ courses taught internationally for 15 years by all four authors, and the book Getting Started with R: An Introduction for Biologists, second edition, by Beckerman, Childs, and Petchey (2017), all using R with the RStudio platform.
Insights (from Data with R) first covers what insights are and why they’re so important, and moves on to discuss features of data that can make it hard or easy to gain insights. It then describes how to obtain insights from data. Obtaining them involves knowing what you are aiming for, and then a whole lot of preparation, importing, cleaning, tidying, checking, double-checking, manipulating, and ultimately summarizing and visualizing the data.
Insights from Data with R: An Introduction for the Life and Environmental Sciences. Owen L. Petchey, Andrew P. Beckerman, Natalie Cooper and Dylan Z. Childs, Oxford University Press (2021). © Owen L. Petchey, Andrew P. Beckerman, Natalie Cooper and Dylan Z. Childs. DOI: 10.1093/oso/9780198849810.001.0001
It is common to hear people who work a lot with data say that about 80% of effort and time during real-world data analysis is spent on these kinds of tasks (and only about 20% on making statistical inference). Yet many books about data analysis ignore this 80%. They also overlook that the skills involved in this 80% are valuable in their own right. We are of the opinion that these skills alone go a long way towards allowing you to gain robust, informative insights from your data.
Insights will help you develop an efficient, reliable, and confidence-inspiring workflow for managing your data and drawing those initial insights out of them, and at the same time introduce you to core R skills for data management and visualization. Efficiency comes from learning methods of analysis that are transferable between problems and their associated datasets, and putting these methods together into an equally transferable workflow. Reliability (the ability to avoid, identify, and correct mistakes, and to reproduce work) comes from being able to evaluate multiple methods and functions and use a system of checks and balances throughout your workflow. Confidence comes from practice, encouragement, and achievement. We seek confidence that our workflows successfully generate insight.
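To make the idea of ‘checks and balances’ concrete, here is a minimal sketch of a clean-then-check-then-summarize step. It is an illustration only, not a workflow taken from the book: the data frame, its column names (bat_id, sex, wingspan_cm), and the values are all made up, and it assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical raw measurements (invented for illustration): inconsistent
# sex labels, one exact duplicate row, and one missing wingspan
raw <- data.frame(
  bat_id      = c(1, 2, 2, 3, 4, 5),
  sex         = c("F", "m", "m", "M", "f", "F"),
  wingspan_cm = c(24.1, 26.3, 26.3, NA, 23.8, 25.0)
)

clean <- raw %>%
  distinct() %>%                   # drop exact duplicate rows
  mutate(sex = toupper(sex)) %>%   # make the labels consistent
  filter(!is.na(wingspan_cm))      # keep rows that have a measurement

# Checks and balances: stop immediately if an assumption is violated
stopifnot(
  !any(duplicated(pull(clean, bat_id))),
  all(pull(clean, sex) %in% c("F", "M")),
  all(pull(clean, wingspan_cm) > 0)
)

# Only summarize once the checks have passed
summary_by_sex <- clean %>%
  group_by(sex) %>%
  summarise(n = n(), mean_wingspan_cm = mean(wingspan_cm), .groups = "drop")

print(summary_by_sex)
```

Because the checks sit inside the script, they run again every time the data change, so a new duplicate or an unexpected label is caught before it reaches a summary or a graph.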
Given our expertise with R and its ever-growing importance, we use R and RStudio throughout Insights. We use RStudio to interact with R, as it makes working with R a more pleasurable experience for the user. As in our undergraduate courses, and in the second edition of Getting Started with R, we teach an approach to using R based on the ‘tidyverse’ packages that have revolutionized data exploration and analysis in R. This approach provides a very consistent, efficient, and transferable workflow that is easily taught and learned. It is also usable with online data sources and scalable to large datasets, particularly by interfacing well with various database systems. Getting to grips with the tools to manage, summarize, and visualize small datasets like the ones we use here for insights will inspire you with confidence for much bigger ones.
Although we are biologists, and the demonstrations of getting real insights from data in Insights are from the biological and environmental sciences, we imagine Insights will be appropriate for anyone seeking to gain insights from data, and at the beginning of their journey in doing so.
The learning ‘curve’
It’s worth knowing what’s coming. The learning curve (Figure 0.1) for this book is not a curve! It is a continual upward line, hopefully not too steep at the beginning, and never too steep, but also not so shallow that you get bored. As you work through the book you will learn more and more, while building on what came before. You should feel continually challenged (which may get a bit tiring), and perhaps at times feel a little overwhelmed, but always be clear that you, with our help, have the ability to make progress.
Figure 0.1 The Insights learning curve (left) and the effort-required curve (right). We try to make the beginning of the learning curve not steep, and then to keep you learning and learning, such that a reasonable and relatively constant effort is required.
There will likely be some tough times, perhaps even times when you feel like you can’t continue. You will be learning new vocabulary, new ways of using your computer, working with data that have problems, fixing these problems, and ultimately producing summaries and graphs to develop insight. If you get stuck or frustrated, don’t be afraid to take a break, have a drink and a cookie/biscuit, go for a walk, and then try again, perhaps with some help.
Untidy and dirty data
The data used in Insights are different from those associated with many other data analysis courses and books, where datasets are supplied ready for analysis. The data here are deliberately disorganized, which is more like what you might start with from lab books, machines, or online data sources. A consequence is that the data are not visualization or analysis ‘ready’. One might say the data are untidy. Also, the data are not provided by us; rather, you will download them from websites where the data are available to the public. Expect to spend time working with the data to get them ‘research ready’, getting to know the data, and learning the tricks and tips of how to do so efficiently and confidently.
No statistical tests or models
As we noted above, insights are intuitions and conclusions that come from identifying patterns in data. This does not formally require statistics. It does, however, require having developed an understanding of what question we (you) are trying to answer before making data summaries and graphs.
This book does not include any statistical tests, such as null hypothesis significance tests (or any other statistical tests or models), for a few reasons. (i) There is enough to be learned and gained from data analysis without such tests. We believe that the first steps in an introductory data analysis course should focus around the content of Insights; statistical tests can wait their turn. (ii) Statistical tests can be quite daunting and difficult, so we leave them until we have a solid hold on identifying patterns with respect to our questions that ultimately form the basis for developing appropriate statistical models and making statistical inferences. (iii) There is a risk that early learning of statistical tests encourages a rather one-dimensional view of data analysis (e.g. the dimension of a p-value), whereas in reality we need to take into account many features of the data, including why they were collected, how they were collected, and even who they were collected by. (iv) Avoiding statistics at this initial stage of data analysis forces you to focus on the questions motivating the collection of the data and expectations of patterns in the data rather than focusing on p-values and statistical significance. The great success of Hans Rosling in publicizing and explaining issues in global health and development, via brilliant and simple data visualization, is a great example of how clear messages can (sometimes) be conveyed without statistical tests.
Perhaps you are of the opinion that statistics and hypothesis testing are required for objectivity, and that without them we are just subjectively looking for patterns. If so, perhaps take a look at the article ‘Many analysts, one dataset: Making transparent how variations in analytical choices affect results.’¹ There are many rather subjective choices involved in doing statistics. To be clear, we do think there is a very important, even necessary, place for statistical models and tests, but that an introduction-to-data-analysis course is not that place.
Exploratory data analysis
Exploratory data analysis (EDA) was promoted by the statistician John Tukey in his 1977 book Exploratory Data Analysis. The broad aim of EDA is to help us formulate and refine hypotheses that will lead to informative analyses or further data collection. The core objectives of EDA are:
• to suggest hypotheses about the causes of observed phenomena;
• to guide the selection of appropriate statistical tools and techniques;
• to assess the assumptions on which statistical analysis will be based;
• to provide a foundation for further data collection.
¹ https://psyarxiv.com/qkwst/
EDA involves a mix of both numerical and visual methods. Statistical methods are sometimes used to supplement EDA, but its main purpose is to facilitate understanding before diving into formal statistical modelling. Even if we think we already know what kind of analysis we need to pursue, it’s always a good idea to explore a dataset before diving into the analysis. At the very least, this will help us to determine whether or not our plans are sensible. Very often it uncovers new patterns and insights. In a sense, this book concerns EDA. But this book is also about answering questions, including assessing the weight of evidence in support of (or against) a hypothesis. Therefore it perhaps goes a little further than EDA.
Zen and the art of ‘data science’
The emergence of ever more data about ever more things, and of more and more methods, techniques, and tools for looking at these data has led to the emergence of ‘data science’: the science of analysing complex and large data resources. Included in data science are activities such as data collection, storage, archiving, distribution, analysis, modelling, communication, and ethics. The book Data Science for Undergraduates: Opportunities and Options² states that ‘all undergraduates will benefit from a fundamental awareness of and competence in data science.’ It’s probably OK to think of Insights as a book for learning the foundations of data science, but it’s also important to know that Insights doesn’t cover lots of data science aspects (such as data archiving or ethics).
² https://www.nap.edu/catalog/25104/data-science-for-undergraduates-opportunities-and-options
Where does Zen come into this? To gain the deepest, most robust, most interesting, most valuable insights from data we need to be ‘at one with the data’. How do we achieve this heady state of mind? We need to know the details of the data while maintaining awareness of the big picture of why we’re working with the data. We need to anticipate missing values and be prepared to ask why there are missing values when one might not expect any. We need to be keen to explore the distribution of the data and perhaps ask why there are a few extreme-looking values. And we need to be OK with getting warning messages from R. Put another way, we must get stuck deeply into the details and also see the big picture. We must see every detail of every tree, and the whole forest. An article along these lines discusses how data scientists with this ability can be very competitive business consultants.³
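As a small illustration of what anticipating missing values and questioning extreme-looking values can mean in practice, consider the sketch below. It is not taken from the book: the data frame and its columns are invented, dplyr is assumed to be installed, and the 1.5 × interquartile-range rule used to flag values is just one common convention, chosen here as an assumption.

```r
library(dplyr)

# Hypothetical measurements (invented): one missing value and one value
# that looks suspiciously large
bats <- data.frame(
  bat_id      = 1:6,
  wingspan_cm = c(24.1, 26.3, NA, 23.8, 25.0, 52.0)
)

# How many values are missing, and what does that do to a mean?
missing_summary <- bats %>%
  summarise(
    n_missing     = sum(is.na(wingspan_cm)),
    mean_naive    = mean(wingspan_cm),               # NA: R does not silently drop missing values
    mean_observed = mean(wingspan_cm, na.rm = TRUE)  # mean of the observed values only
  )

# Flag values more than 1.5 interquartile ranges beyond the quartiles
# (the rule boxplots commonly use; a convention, not a verdict)
extreme <- bats %>%
  mutate(
    q1    = unname(quantile(wingspan_cm, 0.25, na.rm = TRUE)),
    q3    = unname(quantile(wingspan_cm, 0.75, na.rm = TRUE)),
    fence = 1.5 * (q3 - q1)
  ) %>%
  filter(wingspan_cm < q1 - fence | wingspan_cm > q3 + fence)

print(missing_summary)
print(extreme)
```

A flagged value is a prompt to go back to the source and ask why it looks that way, not an instruction to delete it.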
Open-science trends
There is increasing movement towards making science a more open process. Part of this movement involves making data more findable, accessible, interoperable, and reusable (the FAIR guiding principles of data management and stewardship).⁴ When working with the datasets in the Workflow Demonstrations in Insights, you might take a moment to think whether they are particularly findable, accessible, interoperable, and reusable. However, Insights is not about teaching you how to adhere to the FAIR guidelines; that is a story for another place, and one that is being increasingly told. Insights does focus on data analysis methods that are repeatable, shareable, and reliable … if there are guiding principles for data analysis, then Insights adheres to them!
Put another way, Insights teaches data analysis methods that result in high reproducibility (a study is reproducible if someone can take the same data and reproduce the same results as reported in the original study). Another way in which Insights assists with open science is that it teaches methods that make collaborative work rather easier than it might otherwise be, such as making our work easy for other people to understand and implement themselves, hopefully without breaking it.
Intended readers
³ http://www.programmingr.com/content/zen-and-the-art-of-competing-against-mbas/
⁴ https://www.nature.com/articles/sdata201618
Insights is aimed at first- or second-year undergraduates in the life and environmental sciences, to accompany their first course in ‘data analysis’, and at their instructors. As far as we are aware, there is no equivalent book available (though we describe in some detail the numerous related books on the Insights companion website (http://insightsfromdata.io)). Insights purposely excludes statistical methods, so students can first master the valuable and prerequisite skills of working with data, such as manipulating, summarizing, and visualizing data. It teaches an approach to using R based on the tidyverse of add-on packages, providing efficient, reliable, and confidence-inspiring methods and workflows. Our approach to learning and teaching has developed over more than two decades and has proven successful in both undergraduate courses and training programmes.
Some competencies required for beginning with this book:
• You should know your way around your computer (e.g. how to find files, make folders, install applications).
• You should know how to look at and enter data into a spreadsheet (e.g. in Excel).
• You should know how to use the internet, download files, find them on your computer, and move them to a specific folder on your computer.
How is the book organized?
Figure 0.2 shows the organization of this book, and the arrows show how you could (probably should) work through it. Nothing is very special about the organization of the first two chapters.
Chapter 1. An introduction to insights, to data, and to the demonstrations in the book and on the Insights companion website.⁵
Chapter 2. Getting acquainted with R and RStudio, including installing them, doing some basic calculations, and getting help.
⁵ http://insightsfromdata.io
Figure 0.2 How this book is organized, and how you should work through it. This is explained in detail in the text.
Then, with Chapters 3 and 4, the organization of the book shifts. Chapters 3 and 4 walk through getting insights using an example dataset. Chapters 5–7 contain more in-depth, complete, and detailed explanations of the mechanics of what you are doing with R and with tidyverse functions in Chapters 3 and 4. Chapters 8 and 9 return to a focus on the example dataset and further develop core skills for insight around the various types of data in the example.
Hence, as you work through Chapters 3 and 4, you may, or may not, choose to dip into a section of Chapters 5–9. All of this is reflected in the bidirectional arrows joining Chapters 3 and 4, and Chapters 5–9 in Figure 0.2. It will be up to you how you work with these chapters; each of you is different and will probably do it differently. It will, however, likely be worth all of you being organized, for example by keeping notes about what you understood during the workflows in Chapters 3 and 4 and what you did not, and then checking this off when working through Chapters 5–9.
Here is a quick summary of Chapters 3–10.
Chapter 3 demonstrates preparation tasks, such as preparing your question, study, data, and computer, and getting data into R and ready for making insights. All of this provides a solid foundation for developing a robust workflow to gain insights from data.
Chapter 4 demonstrates getting insights, including constructing new variables, graphing data, calculating summaries (e.g. means), and evaluating patterns in the graphs and tables to gain insights.
Chapter 5 provides a deeper dive into data manipulation using tools in the dplyr package, including subsetting datasets, and making summaries of these subsets.
Chapter 6 provides a deeper dive into other data manipulation requirements that often arise in the life and environmental sciences. These include working with strings (words) and dates, and rearranging data from being across columns to within columns of a dataset. We also consider some formal dos and don’ts.
Chapter 7 gives an in-depth and guided explanation of how to make multiple types of graphs and enhance their capacity to provide insights using the ggplot2 package. This builds on the introduction in Chapter 4.
Chapter 8 provides a deeper dive into evaluating features of specific variables in your data, including visualizing sample distributions and estimating numeric descriptors of central tendency (means vs medians), data dispersion, and asymmetry (variation, interquartile ranges).
Chapter 9 shifts the focus to examining patterns between two variables. The chapter includes sections on examining relationships between two numeric/continuous variables, two categorical variables (factors), and one numeric and one categorical variable. It finishes with a flurry, looking at relationships among three or more variables (including potential interactions).
Chapter 10 is the final chapter of the book, offering congratulations and some information and advice about reproducibility, an equally important subject when getting insights from data.
So, overall, you’ll be learning a language of data management and visualization using R, you’ll be working with example data, and you’ll develop robust numerical summaries and classy visualizations of data. You certainly won’t learn everything you want to know, but we can guarantee that you’ll develop some excellent autonomy in learning, a platform on which to develop your Insights from Data with R skill set.
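To give a flavour of the kind of grouped numerical summary that the chapter descriptions above mention (summaries of subsets with dplyr, and means versus medians), here is a short sketch. It is an illustration under stated assumptions rather than an excerpt from the book: the prey counts and column names are invented, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical prey counts per bat (invented); note the single large
# value in the "adult" group
prey <- data.frame(
  age    = c("adult", "adult", "adult", "adult",
             "juvenile", "juvenile", "juvenile"),
  n_prey = c(2, 3, 3, 20, 1, 2, 3)
)

# One row per age group, with two measures of central tendency and
# one measure of dispersion
prey_summary <- prey %>%
  group_by(age) %>%
  summarise(
    mean_prey   = mean(n_prey),
    median_prey = median(n_prey),
    iqr_prey    = IQR(n_prey),
    .groups     = "drop"
  )

print(prey_summary)
```

In the adult group the one large count pulls the mean well above the median, while in the juvenile group the two agree; comparing the two descriptors is a quick way to spot an asymmetric distribution before drawing it.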
Online companion material
The Insights companion website⁶ contains supplementary material including:
• an online overview of the Insights workflow;
• more topics in R;
• additional data analysis concepts;
• three additional Workflow Demonstrations;
• complete Workflow Demonstration R scripts;
• details of a live data analysis demonstration we often use in our introductory undergraduate classes;
• exercises and questions for each section of the book;
• more study questions and datasets that could be developed into new Workflow Demonstrations (perhaps for students to practise with and/or instructors to use);
• some related/suggested reading.
Boxes
Throughout the book are four types of box:
Efficiency and reliability. In these, we describe practices and methods for achieving higher efficiency and reliability in our journey from data to insights. They contain information about how to make our work more robust and reliable, such that it can still function if we get or add some new data, or otherwise make some changes in our work. They also contain information to help ensure that conclusions/insights are robust.
⁶ http://insightsfromdata.io
Be aware. These contain instructions about an opportunity/need to carefully consider an issue, for example a way to work that reduces the potential for mistakes, such as including appropriate checks and balances. These can also concern a warning or a common ‘gotcha’. There are a number of common pitfalls that trip up new users of R (and more experienced users too!). We aim to highlight these and show you how to avoid them.
Action. A box containing instructions for you to do something important. Now!
Information. These aim to offer a not-too-technical discussion of how or why something works the way it does. You do not have to understand everything in these boxes to use R, but the information will help you understand how it works.
Box icon attributions:
• ‘Action’ by Icons Producer from the Noun Project (https://thenounproject.com/icon/1899450/).
• ‘Information’ by SELicon from the Noun Project (https://thenounproject.com/icon/2119887/).
• ‘Efficiency and reliability’ by Bom Symbols from the Noun Project (https://thenounproject.com/icon/1555215/).
• ‘Be aware’ is the ‘Warning’ icon by Kristin Hogan from the Noun Project (https://thenounproject.com/icon/77514/).
All icons are licensed as Creative Commons CC BY (https://creativecommons.org/licenses/by/3.0/). Colours and sizes have been altered.
Some ideas for instructors using this book
As mentioned, we have good experiences teaching introduction-to-data-analysis undergraduate classes of 200+ students using the approach and methods in this book. Students tell us that the learning is challenging, represents a relatively high workload, is valuable, and is enjoyable. Our aim is that all students pass the course, and so far over 95% do. Here are some recommendations based on our experiences:
• The material is suitable for undergraduates with little or no prior experience of working with data, of programming, or of statistics.
• The amount of material is suitable for a six-week course of five to six hours per week (reading, practicals, and homework).
• In the first class of the first week we lead a live data analysis demonstration. Within one hour we go from question to answer, including collecting some data about each of the students. We believe this demonstration helps students connect with the importance and fun of the content of the course. Details of the demonstration are on the Insights companion website.⁷
• We have four activities each week: a lecture (sometimes in person, sometimes video lectures), pre-practical reading or viewing (e.g. a chapter or section of this book, or some video tutorials), a practical session (e.g. see the material at http://insightsfromdata.io), and a weekly graded assessment (administered in an automated online learning platform).
• All decisions about the course, e.g. organization, schedule, content, requirements, assessment, are taken in the context of maximizing student autonomy, purpose, and mastery, in order to nurture and stimulate students’ intrinsic motivation. We treat the students like the adults they are.
⁷ http://insightsfromdata.io
• All decisions are also taken with efficiency in mind ... achieving a combination of great learning outcomes and reasonable instructor effort.
• Heavy use of automated feedback and grading leaves time for instructors to give students 1:1 support, even in a class of over 200 students.
• Owen Petchey uses the exams package⁸ to organize a library of questions and to create examinations from these with (almost) the click of a button. The exams package has options for output format, including PDF and various ones compatible with many online learning platforms.
If you have any questions about using this book as a course book for your undergraduate introduction-to-data-analysis course, please get in touch. We are very happy to share expertise and experiences.
Relationship with Getting Started with R (GSwR), second edition, Beckerman, Childs, and Petchey (2017)
Insights is a completely different book from Getting Started with R. Here are the most important differences, provided with the aim of helping you know which book to work with. If you have any uncertainty after looking at these differences, don’t hesitate to contact one of us.
• What differentiates the audiences of GSwR and Insights? GSwR: folk who already do data work and statistics and want to learn to use R. Insights: folk who haven’t done any data work before.
• GSwR motivates people who already have reasonable knowledge of getting insights from data with non-R tools to learn to use R and to implement data management, visualization, and statistical analysis with R and the tidyverse set of packages. Insights motivates people to learn how to get insights from data with R and the tidyverse set of packages.
⁸ http://www.r-exams.org
• GSwR assumes some prior knowledge of statistics. Insights assumes no prior experience of working with data or of statistics.
• Insights is designed as a textbook for an undergraduate ‘introduction to getting knowledge from data’ course. GSwR was not designed for this, and seems to not work very well for such purposes (though selected chapters from it combine well with chapters from other books).
• For the small amount of overlapping content, Insights provides more detail about how and why (rather than providing an overview tour).
• Insights and the Insights companion website⁹ contain more of the content often associated with undergraduate courses than does GSwR, such as exercises and quizzes.
Acknowledgements
R is a product of the efforts of many individuals. RStudio is also the work of many individuals, organized by the vision of the company RStudio, whose mission is to create open source software for data analysis and statistical computing. The tidyverse collection of add-on packages was initiated by Hadley Wickham and has many contributors. We are extremely grateful to these individuals for making our data analysis and research so much more reliable, efficient, and fun. This book was written using the bookdown package created by Yihui Xie, which provides a suitable environment for R and RStudio users to author documents, from simple to complex.
⁹ http://insightsfromdata.io
We each have been teaching R for nearly 20 years, and in that time it is our experiences with interested, bored, critical, and all other types of student that have allowed us to become better at teaching R. We thank all the students for putting in the effort and giving their feedback about what works, about what does not, and what might work better. And we apologize to the bored and critical students for our oversights and mistakes.
We are honoured to publish with Oxford University Press and to work with its staff, particularly Ian Sherman, Charles Bath, and Lucy Nash. Douglas Meekison very skilfully copy-edited the manuscript. Several reviewers commented on the original book proposal, including making suggestions for improvements that were implemented.
Vanessa Mata was kind enough to place on Dryad the data used in her and her colleagues’ study of bat diets. This made it possible for us to use the data and questions from the study as the basis of the Workflow Demonstration in this book.
Finally, thanks to our families for letting us have the time during evenings and holidays to work on Insights. We love you all, lots.
2.4.1 Errors
2.4.2 Warnings
2.6 Add-on packages
2.6.1 Finding add-on packages
2.6.2 Installing (downloading) packages
2.6.3 Loading packages
2.6.4 An analogy
2.6.5 Updating R, RStudio, and your packages
2.7 Getting help
2.7.1 R help system and files
2.7.4 Cheat sheets
2.7.6 Asking for help from others
2.9 Summing up and looking forward
3.1.1 The three response variables
3.4 Preparing your computer
3.4.1 Making the project folder for the bat data
3.4.2 Projects in RStudio
3.4.3 Create a new R script and load packages
3.5 Get the data into R
3.5.1 View and refine the import
3.6 Getting going with data management
3.6.1 How the data are stored in R
3.7 Clean and tidy the data
3.7.1 Tidying the data
3.7.2 Cleaning the data
3.7.3 Refine the variable names
3.7.4 Fix the dates
3.7.5 Rename some values in a variable
3.7.6 Check for duplicates
3.7.7 Check for implausible and invalid values
3.8 Stop that! Don’t even think about it!
3.8.1 Don’t mess with the ‘working directory’
3.8.2 Don’t use the data import tool or file.choose
3.8.3 Don’t even think about using the attach function
3.8.4 Avoid using square brackets or dollar signs
Workflow Demonstration part 2: Getting insights
4.1.1 Our first insights: The number, sex, and age of bats
4.2 Initial insights 2: Distributions
4.2.1 Insights.... you’ve done it!
4.3 Transform the data
4.4 Insights about our questions
4.4.1 Distribution of number of prey
4.4.2 Shapes: Mean wingspan
4.4.3 Shapes: Proportion migratory
4.4.5 Communication (beautifying the graphs)
4.4.6 Beautifying the wingspan, age, sex graph
4.5 Another view of the question and data
4.5.1 Before you continue …
4.7 Summing up and looking forward
4.8 A small reward, if you like dogs
5.1 Introducing dplyr
5.1.1 Selecting variables with the select function
5.1.2 Renaming variables with select and rename
5.1.3 Creating new variables with the mutate
5.1.4 Getting particular observations with filter
5.1.5 Ordering observations with arrange
5.2 Grouping and summarizing data with dplyr
5.2.1 Summarizing data: the nitty-gritty
5.2.2 Grouped summaries using group_by magic
5.2.3 More than one grouping variable
5.2.4 Using group_by with other verbs
5.2.5 Removing grouping information
5.3 Summing up and looking forward
Chapter 6: Dealing with data 2: Expanding your toolkit
6.1 Pipes and pipelines
6.1.1 Why do we need pipes?
6.1.2 On why you shouldn’t nest functions
6.2 Subduing the pesky string
6.3 Elegantly managing dates and times
6.3.1 Date/time formats
6.3.2 Dates in the bat project data
6.3.3 Why parse dates?
6.3.4 More about parsing dates/times
6.3.5 Calculations with dates/times
6.4 Changing between wider and longer data arrangements
6.4.1 Going longer
6.4.2 Going wider
6.5 Summing up and looking forward
Chapter 7: Getting to grips with
7.1 Anatomy of a
7.1.1 Layers
7.1.3 Coordinate system
7.1.4 Fantastic faceting
7.2 Putting it into practice
7.2.1 Inheriting data and aesthetics from ggplot
7.3 Beautifying plots
7.3.1 Working with layer-specific geom properties
7.3.2 Adding titles and labels
7.3.3 Themes
7.4 Summing up and looking forward
Chapter 8: Making deeper insights part 1: Working with single variables
8.1 Variables and data
8.1.1 Numeric versus categorical variables
8.1.2 Ratio versus interval scales
8.2 Samples and distributions
8.2.1 Understanding numerical variables
8.3 Graphical summaries of numeric variables
8.3.1 Making some insights about wingspan
8.3.2 Descriptive statistics for numeric variables
8.3.3 Measuring central tendency
8.3.4 Measuring dispersion
8.3.5 Mapping measures of central tendency and dispersion to a figure
8.3.6 Combining histograms and boxplots
8.4 A moment with missing values in numeric variables (NAs)
8.5 Exploring a categorical variable
8.5.1 Understanding categorical variables
8.6 Summing up and looking forward
8.7 A cat-related reward
Chapter 9: Making deeper insights part 2: Relationships among (many) variables
9.1 Associations between two numeric variables
9.1.1 Descriptive statistics: Correlations
9.1.2 Other measures of correlation
9.1.3 Graphical summaries between two numeric variables: The scatterplot
9.2 Associations between two categorical variables
9.2.1 Numerical summaries
9.2.3 An alternative, and perhaps more valuable
9.3 Categorical–numerical associations
9.3.1 Numerical summaries
9.3.2 Graphical summaries for numerical versus categorical data
9.3.3 Alternatives to box-and-whisker plots
9.4 Building in complexity: Relationships among three or more variables
9.5 Summing up and looking forward