Insights from Data with R

An Introduction for the Life and Environmental Sciences

OWEN L. PETCHEY

Department of Evolutionary Biology and Environmental Studies, University of Zürich, Switzerland

ANDREW P. BECKERMAN

Department of Animal and Plant Sciences, University of Sheffield, UK

NATALIE COOPER

Natural History Museum, London, UK

DYLAN Z. CHILDS

Department of Animal and Plant Sciences, University of Sheffield, UK

Great Clarendon Street, Oxford, OX2 6DP, United Kingdom

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries

© Owen L. Petchey, Andrew P. Beckerman, Natalie Cooper, Dylan Z. Childs 2021

The moral rights of the authors have been asserted

First Edition published in 2021

Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above

You must not circulate this work in any other form and you must impose this same condition on any acquirer

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America

British Library Cataloguing in Publication Data

Data available

Library of Congress Control Number: 2020948906

ISBN 978–0–19–884981–0 (hbk.)

ISBN 978–0–19–884982–7 (pbk.)

DOI: 10.1093/oso/9780198849810.001.0001

Printed in Great Britain by Bell & Bain Ltd., Glasgow

Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.

The preface of this book is published under an Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) licence.

Preface

Overview

Many activities, including experiments, surveys, clinical trials, and fieldwork, generate data. These data provide insights: intuitions and conclusions that come from identifying patterns in data. Insights are critical for answering questions, solving problems, guiding decisions, and formulating strategy. But getting insights from data, and doing so efficiently, reliably, and confidently, does not come easily. Yet getting insights from data is a foundational skill for all scientists.

Insights from Data with R is for life and environmental science undergraduates (though it may also help anyone beginning their learning about data analysis), and for their instructors to teach alongside. It is not about statistics per se, but about that initial transition from having collected data as part of a project to that first, and so satisfying, realization that there is a pattern in your data. It combines the elements of the successful undergraduate data analysis courses of Petchey at the University of Zürich and of Childs at the University of Sheffield, the ‘Introduction to R’ courses taught internationally for 15 years by all four authors, and the book Getting Started with R: An Introduction for Biologists, second edition, by Beckerman, Childs, and Petchey (2017), all using R with the RStudio platform.

Insights (from Data with R) first covers what insights are and why they’re so important, and moves on to discuss features of data that can make it hard or easy to gain insights. It then describes how to obtain insights from data. Obtaining them involves knowing what you are aiming for, and then a whole lot of preparation, importing, cleaning, tidying, checking, double-checking, manipulating, and ultimately summarizing and visualizing the data.

It is common to hear people who work a lot with data say that about 80% of effort and time during real-world data analysis is spent on these kinds of tasks (and only about 20% on making statistical inference). Yet many books about data analysis ignore this 80%. They also overlook that the skills involved in this 80% are valuable in their own right. We are of the opinion that these skills alone go a long way towards allowing you to gain robust, informative insights from your data.

Insights will help you develop an efficient, reliable, and confidence-inspiring workflow for managing your data and drawing those initial insights out of them, and at the same time introduce you to core R skills for data management and visualization. Efficiency comes from learning methods of analysis that are transferable between problems and their associated datasets, and putting these methods together into an equally transferable workflow. Reliability (the ability to avoid, identify, and correct mistakes, and to reproduce work) comes from being able to evaluate multiple methods and functions and use a system of checks and balances throughout your workflow. Confidence comes from practice, encouragement, and achievement. We seek confidence that our workflows successfully generate insight.

Given our expertise and its ever-growing importance, we use R and RStudio throughout Insights. We use RStudio to interact with R, as it makes working with R a more pleasurable experience for the user. As in our undergraduate courses, and in the second edition of Getting Started with R, we teach an approach to using R based on the ‘tidyverse’ packages that have revolutionized data exploration and analysis in R. This approach provides a very consistent, efficient, and transferable workflow that is easily taught and learned. It is also usable with online data sources and scalable to large datasets, particularly by interfacing well with various database systems. Getting to grips with the tools to manage, summarize, and visualize small datasets like the ones we use here for insights will inspire you with confidence for much bigger ones.
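As a rough illustration of the tidyverse style described above, the following sketch filters, groups, and summarizes a tiny made-up dataset with dplyr. The data frame `bats`, its column names, and its values are hypothetical and are not taken from the book's data; the sketch also assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical measurements, invented for illustration only.
bats <- data.frame(
  sex = c("F", "F", "M", "M", "M"),
  wingspan_mm = c(250, 262, 241, NA, 255),
  stringsAsFactors = FALSE
)

wingspan_summary <- bats %>%
  filter(!is.na(wingspan_mm)) %>%     # drop rows with a missing wingspan
  group_by(sex) %>%                   # one group per sex
  summarise(
    n = n(),                          # number of rows left in each group
    mean_wingspan = mean(wingspan_mm) # group mean of the remaining values
  )

print(wingspan_summary)
```

The same chain of verbs would apply unchanged to a much larger table, which is one sense in which such a workflow is transferable.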

Although we are biologists, and the demonstrations of getting real insights from data in Insights are from the biological and environmental sciences, we imagine Insights will be appropriate for anyone seeking to gain insights from data, and at the beginning of their journey in doing so.

The learning ‘curve’

It’s worth knowing what’s coming. The learning curve (Figure 0.1) for this book is not a curve! It is a continual upward line, hopefully not too steep at the beginning, and never too steep, but also not so shallow that you get bored. As you work through the book you will learn more and more, while building on what came before. You should feel continually challenged (which may get a bit tiring), and perhaps at times feel a little overwhelmed, but always be clear that you, with our help, have the ability to make progress.

There will likely be some tough times, perhaps even times when you feel like you can’t continue. You will be learning new vocabulary, new ways of using your computer, working with data that has problems, fixing these problems, and ultimately developing summaries and graphs to develop insight. If you get stuck or frustrated, don’t be afraid to take a break, have a drink and a cookie/biscuit, go for a walk, and then try again, perhaps with some help.

Figure 0.1 The Insights learning curve (left) and the effort-required curve (right). We try to make the beginning of the learning curve not steep, and then to keep you learning and learning, such that a reasonable and relatively constant effort is required.

Untidy and dirty data

The data used in Insights are different from those associated with many other data analysis courses and books. The data are deliberately disorganized. This is quite different from many data analysis courses and books, where datasets are supplied ready for analysis. But it is also more like what you might start with from lab books, machines, or online data sources. A consequence is that the data are not visualization or analysis ‘ready’. One might say the data are untidy. Also, the data are not provided by us; rather, you will download them from websites where the data are available to the public. Expect to spend time working with the data to get them ‘research ready’, getting to know the data, and learning the tricks and tips of how to do so efficiently and confidently.

No statistical tests or models

As we noted above, insights are intuitions and conclusions that come from identifying patterns in data. This does not formally require statistics. It does, however, require having developed an understanding of what the question is we (you) are trying to answer before making data summaries and graphs.

This book does not include any statistical tests, such as null hypothesis significance tests (or any other statistical tests or models), for a few reasons. (i) There is enough to be learned and gained from data analysis without such tests. We believe that the first steps in an introductory data analysis course should focus around the content of Insights; statistical tests can wait their turn. (ii) Statistical tests can be quite daunting and difficult, so we leave them until we have a solid hold on identifying patterns with respect to our questions that ultimately form the basis for developing appropriate statistical models and making statistical inferences. (iii) There is a risk that early learning of statistical tests encourages a rather one-dimensional view of data analysis (e.g. the dimension of a p-value), whereas in reality we need to take into account many features of the data, including why they were collected, how they were collected, and even who they were collected by. (iv) Avoiding statistics at this initial stage of data analysis forces you to focus on the questions motivating the collection of the data and expectations of patterns in the data rather than focusing on p-values and statistical significance. The great success of Hans Rosling in publicizing and explaining issues in global health and development, via brilliant and simple data visualization, is a great example of how clear messages can (sometimes) be conveyed without statistical tests.

Perhaps you are of the opinion that statistics and hypothesis testing are required for objectivity, and that without them we are just subjectively looking for patterns. If so, perhaps take a look at the article ‘Many analysts, one dataset: Making transparent how variations in analytical choices affect results.’¹ There are many rather subjective choices involved in doing statistics. To be clear, we do think there is a very important, even necessary, place for statistical models and tests, but that an introduction-to-data-analysis course is not that place.

Exploratory data analysis

Exploratory data analysis (EDA) was promoted by the statistician John Tukey in his 1977 book Exploratory Data Analysis. The broad aim of EDA is to help us formulate and refine hypotheses that will lead to informative analyses or further data collection. The core objectives of EDA are:

• to suggest hypotheses about the causes of observed phenomena;

• to guide the selection of appropriate statistical tools and techniques;

• to assess the assumptions on which statistical analysis will be based;

• to provide a foundation for further data collection.

¹ https://psyarxiv.com/qkwst/

EDA involves a mix of both numerical and visual methods. Statistical methods are sometimes used to supplement EDA, but its main purpose is to facilitate understanding before diving into formal statistical modelling. Even if we think we already know what kind of analysis we need to pursue, it’s always a good idea to explore a dataset before diving into the analysis. At the very least, this will help us to determine whether or not our plans are sensible. Very often it uncovers new patterns and insights. In a sense, this book concerns EDA. But this book is also about answering questions, including assessing the weight of evidence in support of (or against) a hypothesis. Therefore it perhaps goes a little further than EDA.
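To make the mix of numerical and visual methods concrete, here is a minimal sketch in base R. The vector `x` is invented for illustration and contains one deliberately extreme value; nothing here comes from the book's datasets.

```r
# Hypothetical sample with one extreme-looking value (40).
x <- c(2, 3, 3, 4, 5, 6, 40)

# Numerical summaries: the mean is pulled upwards by the extreme value,
# whereas the median is not.
x_mean   <- mean(x)
x_median <- median(x)
x_iqr    <- IQR(x)   # interquartile range, a measure of spread

print(c(mean = x_mean, median = x_median, iqr = x_iqr))

# A quick text-based visual summary: a stem-and-leaf display.
stem(x)
```

Comparing the mean with the median, and then looking at the display, is the sort of check that prompts the question of why an extreme value is there at all.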

Zen and the art of ‘data science’

The emergence of ever more data about ever more things, and of more and more methods, techniques, and tools for looking at these data has led to the emergence of ‘data science’: the science of analysing complex and large data resources. Included in data science are activities such as data collection, storage, archiving, distribution, analysis, modelling, communication, and ethics. The book Data Science for Undergraduates: Opportunities and Options² states that ‘all undergraduates will benefit from a fundamental awareness of and competence in data science.’ It’s probably OK to think of Insights as a book for learning the foundations of data science, but it’s also important to know that Insights doesn’t cover lots of data science aspects (such as data archiving or ethics).

Where does Zen come into this? To gain the deepest, most robust, most interesting, most valuable insights from data we need to be ‘at one with the data’. How do we achieve this heady state of mind? We need to know the details of the data while maintaining broad awareness of why we’re working with the data. We must have awareness of the big picture of why we’re working on the data. We need to anticipate missing values and be prepared to ask why there are missing values when one might not expect any. We need to be keen to explore the distribution of the data and perhaps ask why there are a few extreme-looking values. And we need to be OK with getting warning messages from R. Put another way, we must get stuck deeply into the details and also see the big picture. We must see every detail of every tree, and the whole forest. An article along these lines discusses how data scientists with this ability can be very competitive business consultants.³

² https://www.nap.edu/catalog/25104/data-science-for-undergraduates-opportunities-and-options

Open-science trends

There is increasing movement towards making science a more open process. Part of this movement involves making data more findable, accessible, interoperable, and reusable (the FAIR guiding principles of data management and stewardship).⁴ When working with the datasets in the Workflow Demonstrations in Insights, you might take a moment to think whether they are particularly findable, accessible, interoperable, and reusable. However, Insights is not about teaching you how to adhere to the FAIR guidelines; that is a story for another place, and one that is being increasingly told. Insights does focus on data analysis methods that are repeatable, shareable, and reliable… if there are guiding principles for data analysis, then Insights adheres to them!

Put another way, Insights teaches data analysis methods that result in high reproducibility (a study is reproducible if someone can take the same data and reproduce the same results as reported in the original study). Another fashion in which Insights assists with open science is that it teaches methods that make collaborative work rather easier than it might otherwise be, such as making our work easy for other people to understand and implement themselves, hopefully without breaking it.

Intended readers

Insights is aimed at first- or second-year undergraduates in the life and environmental sciences, to accompany their first course in ‘data analysis’, and at their instructors. As far as we are aware, there is no equivalent book available (though we describe in some detail the numerous related books on the Insights companion website (http://insightsfromdata.io)). Insights purposely excludes statistical methods, so students can first master the valuable and prerequisite skills of working with data, such as manipulating, summarizing, and visualizing data. It teaches an approach to using R based on the tidyverse of add-on packages, providing efficient, reliable, and confidence-inspiring methods and workflows. Our approach to learning and teaching has developed over more than two decades and proven successful in both undergraduate courses and training programmes.

³ http://www.programmingr.com/content/zen-and-the-art-of-competing-against-mbas/

⁴ https://www.nature.com/articles/sdata201618

Some competencies required for beginning with this book:

• You should know your way around your computer (e.g. how to find files, make folders, install applications).

• You should know how to look at and enter data into a spreadsheet (e.g. in Excel).

• You should know how to use the internet, download files, find them on your computer, and move them to a specific folder on your computer.

How is the book organized?

Figure 0.2 shows the organization of this book, and the arrows show how you could (probably should) work through it. Nothing is very special about the organization of the first two chapters.

Chapter 1. An introduction to insights, to data, and to the demonstrations in the book and on the Insights companion website.⁵

Chapter 2. Getting acquainted with R and RStudio, including installing them, doing some basic calculations, and getting help.

⁵ http://insightsfromdata.io

Figure 0.2 How this book is organized, and how you should work through it. This is explained in detail in the text.

Then, with Chapters 3 and 4, the organization of the book shifts. Chapters 3 and 4 walk through getting insights using an example dataset. Chapters 5–7 contain more in-depth, complete, and detailed explanations of the mechanics of what you are doing with R and with tidyverse functions in Chapters 3 and 4. Chapters 8 and 9 return to a focus on the example dataset and further develop core skills for insight around the various types of data in the example.

Hence, as you work through Chapters 3 and 4, you may, or may not, choose to dip into a section of Chapters 5–9. All of this is reflected in the bidirectional arrows joining Chapters 3 and 4, and Chapters 5–9 in Figure 0.2. It will be up to you how you work with these chapters; each of you is different and will probably do it differently. It will, however, likely be worth all of you being organized, for example by keeping notes about what you understood during the workflows in Chapters 3 and 4 and what you did not, and then checking this off when working through Chapters 5–9.

Here is a quick summary of Chapters 3–10.

Chapter 3 demonstrates preparation tasks, such as preparing your question, study, data, and computer, and getting data into R and ready for making insights. All of this provides a solid foundation for developing a robust workflow to gain insights from data.

Chapter 4 demonstrates getting insights, including constructing new variables, graphing data, calculating summaries (e.g. means), and evaluating patterns in the graphs and tables to gain insights.

Chapter 5 provides a deeper dive into data manipulation using tools in the dplyr package, including subsetting datasets, and making summaries of these subsets.

Chapter 6 provides a deeper dive into other data manipulation requirements that often arise in the life and environmental sciences. These include working with strings (words) and dates, and rearranging data from being across columns to within columns of a dataset. We also consider some formal dos and don’ts.

Chapter 7 gives an in-depth and guided explanation of how to make multiple types of graphs and enhance their capacity to provide insights using the ggplot2 package. This builds on the introduction in Chapter 4.

Chapter 8 provides a deeper dive into evaluating features of specific variables in your data, including visualizing sample distributions and estimating numeric descriptors of central tendency (means vs medians), data dispersion, and asymmetry (variation, interquartile ranges).

Chapter 9 shifts the focus to examining patterns between two variables. The chapter includes sections on examining relationships between two numeric/continuous variables, two categorical variables (factors), and one numeric and one categorical variable. It finishes with a flurry, looking at relationships among three or more variables (including potential interactions).

Chapter 10 is the final chapter of the book, offering congratulations and some information and advice about reproducibility, an equally important subject when getting insights from data.

So, overall, you’ll be learning a language of data management and visualization using R, you’ll be working with example data, and you’ll develop robust numerical summaries and classy visualizations of data. You certainly won’t learn everything you want to know, but we can guarantee that you’ll develop some excellent autonomy in learning, a platform on which to develop your Insights from Data with R skillset.
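The rearrangement mentioned in the Chapter 6 summary, moving data from being spread across columns to sitting within a single column, might look something like the sketch below. It assumes the tidyr package is installed; the data frame `wide` and its column names are hypothetical and chosen only for illustration.

```r
library(tidyr)

# Hypothetical counts with one column per season (a "wide" arrangement).
wide <- data.frame(
  site   = c("A", "B"),
  spring = c(4, 7),
  autumn = c(6, 5)
)

# Gather the two season columns into a pair of columns:
# one holding the season name, one holding the count.
long <- pivot_longer(
  wide,
  cols      = c(spring, autumn),
  names_to  = "season",
  values_to = "count"
)

print(long)
```

In the longer arrangement each row is a single observation (one site in one season), which tends to be the shape that grouped summaries and plotting functions expect.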

Online companion material

The Insights companion website⁶ contains supplementary material including:

• an online overview of the Insights workflow;

• more topics in R;

• additional data analysis concepts;

• three additional Workflow Demonstrations;

• complete Workflow Demonstration R scripts;

• details of a live data analysis demonstration we often use in our introductory undergraduate classes;

• exercises and questions for each section of the book;

• more study questions and datasets that could be developed into new Workflow Demonstrations (perhaps for students to practise with and/or instructors to use);

• some related/suggested reading.

Boxes

Throughout the book are four types of box:

Efficiency and reliability. In these, we describe practices and methods for achieving higher efficiency and reliability in our journey from data to insights. They contain information about how to make our work more robust and reliable, such that it can still function if we get or add some new data, or otherwise make some changes in our work. And information to help ensure that conclusions/insights are robust.

⁶ http://insightsfromdata.io

Be aware. These contain instructions about an opportunity/need to carefully consider an issue, for example a way to work that reduces the potential for mistakes, such as including appropriate checks and balances. These can also concern a warning or a common ‘gotcha’. There are a number of common pitfalls that trip up new users of R (and more experienced users too!). We aim to highlight these and show you how to avoid them.

Action. A box containing instructions for you to do something important. Now!

Information. These aim to offer a not-too-technical discussion of how or why something works the way it does. You do not have to understand everything in these boxes to use R, but the information will help you understand how it works.

Box icon attributions:

• ‘Action’ by IconsProducer from the Noun Project (https://thenounproject.com/icon/1899450/).

• ‘Information’ by SELicon from the Noun Project (https://thenounproject.com/icon/2119887/).

• ‘Efficiency and reliability’ by BomSymbols from the Noun Project (https://thenounproject.com/icon/1555215/).

• ‘Be aware’ is the ‘Warning’ icon by Kristin Hogan from the Noun Project (https://thenounproject.com/icon/77514/).

All icons are licensed as Creative Commons CC BY (https://creativecommons.org/licenses/by/3.0/). Colours and sizes have been altered.

Some ideas for instructors using this book

As mentioned, we have good experiences teaching introduction-to-data-analysis undergraduate classes of 200+ students using the approach and methods in this book. Students tell us that the learning is challenging, represents a relatively high workload, is valuable, and is enjoyable. Our aim is that all students pass the course, and so far over 95% do. Here are some recommendations based on our experiences:

• The material is suitable for undergraduates with little or no prior experience of working with data, of programming, or of statistics.

• The amount of material is suitable for a six-week course of five to six hours per week (reading, practicals, and homework).

• In the first class of the first week we lead a live data analysis demonstration. Within one hour we go from question to answer, including collecting some data about each of the students. We believe this demonstration helps students connect with the importance and fun of the content of the course. Details of the demonstration are on the Insights companion website.⁷

• We have four activities each week: a lecture (sometimes in person, sometimes video lectures), before-practical reading or viewing (e.g. a chapter or section of this book, or some video tutorials), a practical session (e.g. see the material at http://insightsfromdata.io), and a weekly graded assessment (administered in an automated online learning platform).

• All decisions about the course, e.g. organization, schedule, content, requirement, assessment, are taken in the context of maximizing student autonomy, purpose, and mastery, in order to nurture and stimulate students’ intrinsic motivation. We treat the students like the adults they are.

⁷ http://insightsfromdata.io

• All decisions are also taken with efficiency in mind... achieving a combination of great learning outcomes and reasonable instructor effort.

• Heavy use of automated feedback and grading leaves time for instructors to give students 1:1 support, even in a class of over 200 students.

• Owen Petchey uses the exams package⁸ to organize a library of questions and to create examinations from these with (almost) the click of a button. The exams package has options for output format, including pdf and various ones compatible with many online learning platforms.

If you have any questions about using this book as a course book for your undergraduate introduction-to-data-analysis course, please get in touch. We are very happy to share expertise and experiences.

Relationship with Getting Started with R (GSwR), second edition, Beckerman, Childs, and Petchey (2017)

Insights is a completely different book from Getting Started with R. Here are the most important differences, provided with the aim of helping you know which book to work with. If you have any uncertainty after looking at these differences, don’t hesitate to contact one of us.

• What differentiates the audiences of GSwR and Insights? GSwR: folk who already do data work and statistics and want to learn to use R. Insights: folk who haven’t done any data work before.

• GSwR motivates people who already have reasonable knowledge of getting insights from data with non-R tools to learn to use R and to implement data management, visualization, and statistical analysis with R and the tidyverse set of packages. Insights motivates people to learn how to get insights from data with R and the tidyverse set of packages.

⁸ http://www.r-exams.org

• GSwR assumes some prior knowledge of statistics. Insights assumes no prior experience of working with data or of statistics.

• Insights is designed as a textbook for an undergraduate ‘introduction to getting knowledge from data’ course. GSwR was not designed for this, and seems to not work very well for such purposes (though selected chapters from it combine well with chapters from other books).

• For the small amount of overlapping content, Insights provides more detail about how and why (rather than providing an overview tour).

• Insights and the Insights companion website⁹ contain more of the content often associated with undergraduate courses than does GSwR, such as exercises and quizzes.

Acknowledgements

R is a product of the efforts of many individuals. RStudio is also the work of many individuals, organized by the vision of the company RStudio, whose mission is to create open source software for data analysis and statistical computing. The tidyverse collection of add-on packages was initiated by Hadley Wickham and has many contributors. We are extremely grateful to these individuals for making our data analysis and research so much more reliable, efficient, and fun. This book was written using the bookdown package created by Yihui Xie, which provides a suitable environment for R and RStudio users to author documents, from simple to complex.

We each have been teaching R for nearly 20 years, and in that time it is our experiences with interested, bored, critical, and all other types of student that have allowed us to become better at teaching R. We thank all the students for putting in the effort and giving their feedback about what works, about what does not, and what might work better. And we apologize to the bored and critical students for our oversights and mistakes.

⁹ http://insightsfromdata.io

We are honoured to publish with Oxford University Press and to work with its staff, particularly Ian Sherman, Charles Bath, and Lucy Nash. Douglas Meekison very skilfully copyedited the manuscript. Several reviewers commented on the original book proposal, including making suggestions for improvements that were implemented.

Vanessa Mata was kind enough to place on Dryad the data used in her and her colleagues’ study of bat diets. This made it possible for us to use the data and questions from the study as the basis of the Workflow Demonstration in this book.

Finally, thanks to our families for letting us have the time during evenings and holidays to work on Insights. We love you all, lots.

2.4.1 Errors

2.4.2 Warnings

2.6 Add-on packages

2.6.1 Finding add-on packages

2.6.2 Installing (downloading) packages

2.6.3 Loading packages

2.6.4 An analogy

2.6.5 Updating R, RStudio, and your packages

2.7 Getting help

2.7.1 R help system and files

2.7.4 Cheat sheets

2.7.6 Asking for help from others

2.9 Summing up and looking forward

3.1.1 The three response variables

3.4 Preparing your computer

3.4.1 Making the project folder for the bat data

3.4.2 Projects in RStudio

3.4.3 Create a new R script and load packages

3.5 Get the data into R

3.5.1 View and refine the import

3.6 Getting going with data management

3.6.1 How the data are stored in R

3.7 Clean and tidy the data

3.7.1 Tidying the data

3.7.2 Cleaning the data

3.7.3 Refine the variable names

3.7.4 Fix the dates

3.7.5 Rename some values in a variable

3.7.6 Check for duplicates

3.7.7 Check for implausible and invalid values

3.8 Stop that! Don’t even think about it!

3.8.1 Don’t mess with the ‘working directory’

3.8.2 Don’t use the data import tool or file.choose

3.8.3 Don’t even think about using the attach function

3.8.4 Avoid using square brackets or dollar signs

Workflow Demonstration part 2: Getting insights

4.1.1 Our first insights: The number, sex, and age of bats

4.2 Initial insights 2: Distributions

4.2.1 Insights.... you’ve done it!

4.3 Transform the data

4.4 Insights about our questions

4.4.1 Distribution of number of prey

4.4.2 Shapes: Mean wingspan

4.4.3 Shapes: Proportion migratory

4.4.5 Communication (beautifying the graphs)

4.4.6 Beautifying the wingspan, age, sex graph

4.5 Another view of the question and data

4.5.1 Before you continue…

4.7 Summing up and looking forward

4.8 A small reward, if you like dogs

5.1 Introducing dplyr

5.1.1 Selecting variables with the select function

5.1.2 Renaming variables with select and rename

5.1.3 Creating new variables with the mutate

5.1.4 Getting particular observations with filter

5.1.5 Ordering observations with arrange

5.2 Grouping and summarizing data with dplyr

5.2.1 Summarizing data: the nitty-gritty

5.2.2 Grouped summaries using group_by magic

5.2.3 More than one grouping variable

5.2.4 Using group_by with other verbs

5.2.5 Removing grouping information

5.3 Summing up and looking forward

Chapter 6: Dealing with data 2: Expanding your toolkit

6.1 Pipes and pipelines

6.1.1 Why do we need pipes?

6.1.2 On why you shouldn’t nest functions

6.2 Subduing the pesky string

6.3 Elegantly managing dates and times

6.3.1 Date/time formats

6.3.2 Dates in the bat project data

6.3.3 Why parse dates?

6.3.4 More about parsing dates/times

6.3.5 Calculations with dates/times

6.4 Changing between wider and longer data arrangements

6.4.1 Going longer

6.4.2 Going wider

6.5 Summing up and looking forward

Chapter 7: Getting to grips with

7.1 Anatomy of a

7.1.1 Layers

7.1.3 Coordinate system

7.1.4 Fantastic faceting

7.2 Putting it into practice

7.2.1 Inheriting data and aesthetics from ggplot

7.3 Beautifying plots

7.3.1 Working with layer-specific geom properties

7.3.2 Adding titles and labels

7.3.3 Themes

7.4 Summing up and looking forward

Chapter 8: Making deeper insights part 1: Working with single variables

8.1 Variables and data

8.1.1 Numeric versus categorical variables

8.1.2 Ratio versus interval scales

8.2 Samples and distributions

8.2.1 Understanding numerical variables

8.3 Graphical summaries of numeric variables

8.3.1 Making some insights about wingspan

8.3.2 Descriptive statistics for numeric variables

8.3.3 Measuring central tendency

8.3.4 Measuring dispersion

8.3.5 Mapping measures of central tendency and dispersion to a figure

8.3.6 Combining histograms and boxplots

8.4 A moment with missing values in numeric variables (NAs)

8.5 Exploring a categorical variable

8.5.1 Understanding categorical variables

8.6 Summing up and looking forward

8.7 A cat-related reward

Chapter 9: Making deeper insights part 2: Relationships among (many) variables

9.1 Associations between two numeric variables

9.1.1 Descriptive statistics: Correlations

9.1.2 Other measures of correlation

9.1.3 Graphical summaries between two numeric variables: The scatterplot

9.2 Associations between two categorical variables

9.2.1 Numerical summaries

9.2.3 An alternative, and perhaps more valuable

9.3 Categorical–numerical associations

9.3.1 Numerical summaries

9.3.2 Graphical summaries for numerical versus categorical data

9.3.3 Alternatives to box-and-whisker plots

9.4 Building in complexity: Relationships among three or more variables

9.5 Summing up and looking forward
