
Automated Machine Learning for Business


Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and certain other countries.

Published in the United States of America by Oxford University Press 198 Madison Avenue, New York, NY 10016, United States of America.

© Oxford University Press 2021

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by license, or under terms agreed with the appropriate reproduction rights organization. Inquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.

You must not circulate this work in any other form and you must impose this same condition on any acquirer.

Library of Congress Cataloging-in-Publication Data

Names: Larsen, Kai R., author. | Becker, Daniel S., author.
Title: Automated machine learning for business / Kai R. Larsen and Daniel S. Becker.
Description: New York, NY: Oxford University Press, [2021] | Includes bibliographical references and index.
Identifiers: LCCN 2020049814 (print) | LCCN 2020049815 (ebook) | ISBN 9780190941659 (hardback) | ISBN 9780190941666 (paperback) | ISBN 9780190941680 (epub)
Subjects: LCSH: Business planning—Data processing—Textbooks. | Business planning—Statistical methods—Textbooks. | Machine learning—Industrial applications—Textbooks. | Decision making—Statistical methods—Textbooks.
Classification: LCC HD30.28 .L3733 2021 (print) | LCC HD30.28 (ebook) | DDC 658.4/030285631—dc23
LC record available at https://lccn.loc.gov/2020049814
LC ebook record available at https://lccn.loc.gov/2020049815

DOI: 10.1093/oso/9780190941659.001.0001

1 3 5 7 9 8 6 4 2

Paperback printed by Sheridan Books, Inc., United States of America Hardback printed by Bridgeport National Bindery, Inc., United States of America

Preface

According to PricewaterhouseCoopers, there will be an enormous demand for professionals with the skills you will develop through this book, as the clear majority of future jobs will be analytics-enabled (Ampil et al., 2017). Machine learning is at the core of such jobs and how they are transforming business—no wonder some have termed “data scientist” the sexiest job of the twenty-first century (Davenport & Patil, 2012). While you may have no desire to become a data scientist, at a minimum, you must know conceptually what machine learning is, but to thrive you should be able to use machine learning to make better and faster decisions.

Automated Machine Learning for Business is for these readers (hereafter often referred to as “analysts”):

• Businesspeople wanting to apply the power of machine learning to learn about their business environment and extract visualizations allowing the sharing of their newfound knowledge.

• Businesspeople and developers wanting to learn how to create machine learning models for automating high-quality decisions.

• Subject matter experts assigned to a machine learning project. The book will help you understand the process ahead and allow you to better communicate with your data science colleagues.

• Students in introductory business analytics or machine learning classes, whether as part of a business analytics program or a stand-alone course wholly or in part focused on machine learning, in either the undergraduate or master's-level curriculum.

• Machine learning experts with no previous exposure to automated machine learning or who want to evaluate their machine learning approach against the industry-leading processes embedded in DataRobot, the automated machine learning platform used for this book.

The tools and processes in this book were developed by some of the best data scientists in the world. Even very successful colleagues with decades of experience in business analytics and machine learning had useful experiences while testing material for the book and the DataRobot automated machine learning (AutoML) platform.

This book is not about artificial intelligence (AI). AI can be thought of as a collection of machine learning algorithms with a central unit deciding which of the ML algorithms need to kick in at that time, similar to how different parts of the human brain specialize in different tasks. Machine learning is in the driver's seat as AI is becoming a reality at blinding speeds. This accelerated pace of AI development is due to recent improvements in deep learning neural networks as well as other algorithms that require less fine-tuning to work. While this book does not cover AI, it may be one of the gentlest introductions to machine learning, and as such will serve as a great starting point on your journey toward AI understanding.

Automated Machine Learning (AutoML)

In this book, we teach the machine learning process using a new development in data science: automated machine learning. AutoML, when implemented properly, makes machine learning accessible to most people because it removes the need for years of experience in the most arcane aspects of data science, such as the math, statistics, and computer science skills required to become a top contender in traditional machine learning. Anyone trained in the use of AutoML can use it to test their ideas and support the quality of those ideas during presentations to management and stakeholder groups. Because the requisite investment is one semester-long undergraduate course rather than a year in a graduate program, these tools will likely become a core component of undergraduate programs, and over time, even the high school curriculum.

It has been our experience that even after taking introductory statistics classes, only a few are capable of effectively applying a tool like regression to business problems. On the other hand, most students seem capable of understanding how to use AutoML, which will generally outperform logistic and linear regression, even in the cases where regression is the right tool for the problem. If we start with the assumption that machine learning is in the process of transforming society in the most radical way since the Industrial Revolution, we can also conclude that developing undergraduate and graduate business degrees that do not cover this material will soon be considered akin to educational malpractice. Moreover, several master’s degree programs in Analytics have started introducing their students to AutoML to speed their learning and capabilities. When AutoML outperforms whole teams of experienced data scientists, does it make sense to exclude it from analytics program training, whether at the undergraduate or graduate level?

A Note to Instructors

The full content of this book may be more than is desired for a class in the core of an undergraduate curriculum, but perhaps not enough for an introductory data science or business analytics class, as those classes would likely spend more time on Section III, Acquire and Explore Data, and also cover unsupervised machine learning. While this book will stick to easy-to-use graphical interfaces, its content is transferable to a class teaching AutoML through a coding approach, primarily with R or Python. Moreover, once your students understand the concepts covered in this book, they will avoid many pitfalls of machine learning as they move on to more advanced classes.

Acknowledgments

We are exceedingly grateful to Jeremy Achin and Tom De Godoy, who first built a system for automating the data science work necessary to compete in Kaggle data science competitions, a system that over time morphed into the DataRobot platform. The process taught in this book owes much to their original approach. The book also could not have become a reality without the support of many world-leading data scientists at DataRobot, Inc. We are particularly indebted to Zachary Mayer, Bill Surrette, João Gomes, Andrew Engel, John Boersma, Raju Penmatcha, Ina Ko, Matt Sanda, and Ben Solari, who provided feedback on drafts of the book. We are also grateful to past students Alexander Truesdale, Weston Ballard, and Briana Butler for support and editing, as well as Alex Truesdale, Briana Butler, Matt Mager, Weston Ballard, and Spencer Rafii for helping develop the datasets provided with this book. The following students helped improve the book during its first use in the classroom at the University of Colorado (in order of helpfulness): Mac Bolak, Gena Welk, Jackson McCullough, Alan Chen, Stephanie Billett, Robin Silk, Cierra Hubbard, Meredith Maney, Philip Bubernak, Sarah Sharpe, Meghan McCrory, Megan Thiede, Pauline Flores, Christine Pracht, Nicole Costello, Tristan Poulsen, Logan Hastings, Josh Cohn, Alice Haugland, Alex Ward, Meg O’Connell, Sam Silver, Tyler Hennessy, Daniel Hartman, Anna La, Ben Friedman, Jacob Lamon, Zihan Gao, and August Ridley.

Finally, but no less important, Kai is grateful to his children, and especially his youngest daughter, Katrine, who would have received more of his attention had he not worked on this book.

Book Outline

Section I discusses the ubiquitous use of machine learning in business as well as the critical need for you to conceptually understand machine learning's transformative effect on society. Applications and websites you already use now rely on machine learning in ways you will never be aware of, as well as the more obvious cases such as when the car next to you has no driver, or when the driver is watching a video (with or without holding his hands on the steering wheel). We then discuss automated machine learning, a new development that makes machine learning accessible to most businesspeople.

Section II begins with Define Project Objectives, where we specify the business problem and make sure we have the skill sets needed to succeed. We can then plan what to predict. In this section, we will carefully consider the inherent project risks and rewards, concluding with the truly difficult task of considering whether the project is worthwhile or, like most ideas, belongs in the cognitive trash pile.

Section III focuses on how to Acquire and Explore Data. It starts with a consideration of internal versus external data, asking the question of whether the data we have is the data we need while keeping benefit/cost considerations in mind. With all key data collected, the data can be merged into a single table and prepared for exploratory analysis. In this process, we examine each column in the dataset to determine its data type. The most common types of data are either numeric or categorical. For numeric data, we examine simple statistics, such as the mean, median, standard deviation, and minimum and maximum values. For categorical data, we examine how many unique values there are, how often each value occurs, and which value is the most common (referred to as the mode). For both types, we examine their distribution—are they normally distributed, left- or right-skewed, or perhaps even bimodal? We will also discuss the potential limitations of categories that have few values.
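For readers who want to see what this column-by-column exploration looks like in code, a minimal sketch in Python with pandas follows. The file name (customers.csv) and the column types are hypothetical.

```python
import pandas as pd

# Hypothetical dataset; file and column names are illustrative only.
df = pd.read_csv("customers.csv")

# Numeric columns: mean, median, standard deviation, min, and max.
numeric = df.select_dtypes(include="number")
print(numeric.agg(["mean", "median", "std", "min", "max"]))

# Skewness hints at the shape of the distribution:
# positive values suggest right skew, negative values left skew.
print(numeric.skew())

# Categorical columns: unique values, counts per value, and the mode.
for col in df.select_dtypes(include="object").columns:
    print(col, "-", df[col].nunique(), "unique values")
    print(df[col].value_counts().head())  # most common value (mode) first
```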

Traditionally, detailed knowledge of your data has been very important in data science, and this will still be true when using AutoML. However, with AutoML, the data knowledge needed shifts from understanding data to understanding relationships. A critically important task in data science is to remove target leakage. Target leakage can occur when a dataset collected over an extended period includes features that would not realistically be available at the time of prediction. We will discuss this in further depth in Chapter 3, but for now, imagine the following: you have access to information on a potential customer who just arrived at a website for which you have contracted to place one of two available ads. You have information on the previous 100,000 customers who arrived at the website and whether they bought the product advertised in each ad. You used that information to create a model that "understands" what leads a customer to buy that product. Your model has an almost uncanny ability to predict non-purchasers especially, which means that you will never place the ad in front of the group predicted not to buy, and you will save your company quite a bit of money. Unfortunately, after you have convinced your CEO to place the model into production, you find that not all the features used to create the model are available when a new visitor arrives and your model is expected to decide which ad to place in front of the visitor. It turns out that in your training data, you had a feature that stated whether visitors clicked on the ad or not. Not surprisingly, the visitors who did not click the ad also did not buy the product that was only accessible through clicking the ad. You are now forced to scrap the old model and create a new model before explaining to upper management why your new model is much less effective (but works). In short, the amount of humble pie eaten by data scientists who did not remove target leakage before presenting to management could power many late-night sessions of feature engineering.
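As a minimal sketch of the remedy, with invented data in Python: any feature that would not be available at prediction time, like the ad click above, must be dropped before training, no matter how predictive it appears.

```python
import pandas as pd

# Invented miniature version of the ad dataset described above.
ads = pd.DataFrame({
    "visits_last_month":  [3, 0, 7, 1],
    "referred_by_search": [1, 1, 0, 0],
    "clicked_ad":         [1, 0, 1, 0],  # leaky: only known AFTER the ad is shown
    "bought_product":     [1, 0, 1, 0],  # the target
})

# Drop every column unavailable when a new visitor arrives.
leaky = ["clicked_ad"]
X = ads.drop(columns=leaky + ["bought_product"])
y = ads["bought_product"]
```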

While knowledge of your data is still where you must go to tease out small improvements in your ability to predict the future, automated machine learning has moved the line on the gain from such exercises. The platform we will use to demonstrate the concepts in this book, DataRobot, does most of the feature engineering automatically. We have found that for well-executed automated machine learning, additional manual feature engineering is much more likely to reduce the performance of the algorithms than to improve it. While even the DataRobot platform will not take away the need to examine data for target leakage, in AutoML tools embedded into another platform, such as Salesforce Einstein, target leakage is removed automatically.

Section IV, Model Data, focuses on creating the machine learning models. Much of the process of becoming a data scientist emphasizes the aspects of developing machine learning models, which one does by going through classes on math, statistics, and programming. Once you have completed your undergraduate degree, as well as a specialized program in data science or business analytics, you qualify as a fledgling data scientist, which is admittedly very cool. Then you start building models and following a process. Often, a machine learning project will take three to six months before you see your model in production. For example, say that your university wants you to create a model to predict which students will have trouble finding a job after achieving their undergraduate degree. While working on the project, approximately every month you will hear about a new algorithm that is rumored to be much better than the ones you’ve tried, and you might take a detour into trying out that algorithm. It has been our experience that often the biggest problems regarding the accuracy of a machine learning project involve the data scientist being unaware of a class of algorithms that could have outperformed the algorithms used.

As part of the modeling process, it is typical to try out different features and perhaps remove those that are not very predictive or that, due to random variations in the data, initially seem to offer some predictive ability. Sometimes less is more, and less is certainly easier to explain to your leadership, to deploy, and to maintain. In this book, we will focus on the concepts of machine learning rather than the math and statistics. We will focus on the results of applying an algorithm rather than an algorithm's innards. In other words, each algorithm is treated solely in terms of its performance characteristics and is ranked accordingly. Section IV is by far the most important because here we are working toward an understanding of something called the confusion matrix (appropriately named). The confusion matrix compares the decisions made by an algorithm against the actual outcomes, showing which were correct. It gives rise to most of the metrics we use to evaluate models and will also help drive our business recommendations.
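As a first look at the idea, here is a small sketch in Python with scikit-learn; the actual and predicted labels are invented.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels: actual outcomes vs. a model's decisions.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(actual, predicted))

# Most evaluation metrics are derived from those four cells.
print("accuracy:", accuracy_score(actual, predicted))
```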

Section V explains how to Interpret and Communicate the model. In this section, we move beyond creating and selecting a model to understand what features drive a target—in other words, which statistical relationships exist between the target and the other features. For example, the model you develop will dictate your management's trust in the course of action you propose as well as in you. Hence, "we should invest $100 million in loans to the consumers that the model indicates as least likely to default" will require strong faith in either the model or the data scientist who developed the model.

Given that faith in the person develops over time, our focus will be on developing faith in and understanding of the extent to which the AutoML platform itself not only finds problems with the model but also clarifies the reason(s) it works when it does and fails in some cases. Moreover, when specific data turns out to be predictive, we want to know if this is data to which we can entrust the future of our business. For example, the purchase of floor protectors (those cheap felt dots placed under furniture) was in one project found to be as predictive of creditworthiness as an individual's credit score (Watson, 2014). This feature has instinctive appeal because one can easily reason that someone who buys floor protectors not only has a wood floor (more expensive than carpeting) but also cares enough about their property to protect it from damage. A feature like this could be worth gold if it picked up on variance unaccounted for by other features associated with creditworthiness. However, a manager would be likely to push back and ask how sure we are that we have data from all the places a potential customer would buy floor protectors. This same manager could reasonably worry that the feature only helps us predict the creditworthiness of the top 5% of income earners, a societal segment for which loans come readily, and competition is fierce. The manager may then worry that over time, word would get out about the importance of buying $5 floor protectors before applying for loans, whether one needs them or not. The short story is this: the more of these unanswered worries your model generates, the slower trust will develop in you as a data scientist.

Finally, Section VI, Implement, Document and Maintain, sets up a workflow for using the model to predict new cases. Here we describe the stage where we move the selected model into production. For example, let's say a consumer's cell phone contract is set to expire next week. Before it does, we feed the information on this consumer into the model, and the model provides the probability that this consumer will move their service to another cell phone provider. We must then translate this probability into business rules. For each consumer so scored, we weigh the income from keeping them against the risk of losing them, and then different retention programs are considered against that risk-weighted income. It could be that the consumer will be offered nothing because the risk of losing them is negligible, or it could be that they will be offered an extra gigabyte of broadband each month, or perhaps a $200 discount to stay on if the risk-weighted income is high. Once this risk-weighted income is established, the whole process is documented for future reproducibility and re-evaluation of rules. At this stage, the likelihood of a data source changing becomes a major issue. Just a simple renaming of a column in an Excel spreadsheet or a database can prevent the whole model from working or lead to significantly worsened performance, as the model will not recognize the column, and all its predictive capability is lost. We then create a process for model monitoring and maintenance. Simply put, over time, the external world modeled by the machine learning algorithm will change, and if we do not detect this change and retrain the model, the original assumptions that may have made the model profitable will no longer be true, potentially leading to major losses for the company.
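As a sketch of what such business rules might look like in code, the Python function below turns a churn probability into a retention action. The cutoffs, the 24-month revenue horizon, and the offers themselves are all invented for illustration; in practice they would come from weighing risk-weighted income against the cost of each retention program.

```python
def retention_offer(churn_probability: float, monthly_revenue: float) -> str:
    """Translate a model's churn probability into a retention action.

    Cutoffs, horizon, and offers below are hypothetical.
    """
    # Assumed 24-month customer horizon for risk-weighted income.
    risk_weighted_income = churn_probability * monthly_revenue * 24
    if churn_probability < 0.10:
        return "no offer"  # risk of losing the customer is negligible
    if risk_weighted_income < 500:
        return "extra gigabyte of broadband per month"
    return "$200 discount to stay"  # high risk-weighted income

# A customer scored at 0.45 churn probability, paying $80/month:
print(retention_offer(0.45, 80.0))
```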

We move forward with the above worries in mind, knowing that we will learn a framework named the machine learning life cycle to keep us relatively safe by providing guardrails for our work.

Dataset Download

The datasets for the book (described in Appendix A) are available for download at the following link. The zip file contains one zip file for each dataset (listed as Assets A.1–A.8, each referencing the similarly labeled Appendix A datasets). The zip file containing the datasets is 131.6MB, and the unzipped files take about 524MB of space. Download link: https://www.dropbox.com/s/c8qjxdnmclsfsk2/AutoML_DatasetsV1.0.zip?dl=0.

Copyrights

All figures and images not developed for the book, including the cover page image, are under open usage rights.

Kai Larsen is an Associate Professor of Information Systems at the Leeds School of Business with a courtesy appointment as an Associate Professor of Information Science in the College of Media, Communication, and Information at the University of Colorado, Boulder. In his research, he applies machine learning and natural language processing (NLP) to address methodological problems in the social and behavioral sciences. He earned a Ph.D. in Information Science from the Nelson A. Rockefeller College at SUNY, Albany.

Daniel Becker is a Data Scientist for Google's Kaggle division. He has broad data science expertise, with consulting experience for six companies from the Fortune 100, a second-place finish in Kaggle's $3 million Heritage Health Prize, and contributions to the Keras and TensorFlow libraries for deep learning. Dan developed the training materials for DataRobot University. Dan earned his Ph.D. in Econometrics from the University of Virginia.

SECTION I

WHY USE AUTOMATED MACHINE LEARNING?

1

What Is Machine Learning?

1.1 Why Learn This?

Machine learning is currently at the core of many if not most organizations’ strategies. A recent survey of more than 2,000 organizations’ use of machine learning and analytics found that these tools are integral for knowing the customer, streamlining existing operations, and managing risk and compliance. However, the same organizations were only marginally confident in their analytics-driven insights in these areas, including their processes for managing such projects and evaluating outcomes (KPMG, 2017). In the coming decade, there will be two classes of organizations: those that use machine learning to transform their capabilities and those that do not (Davenport, 2006). While barely beyond its inception, the current machine learning revolution will affect people and organizations no less than the Industrial Revolution’s effect on weavers and many other skilled laborers. In the 1700s, weaving required years of experience and extensive manual labor for every product. This skill set was devalued as the work moved into factories where power looms vastly improved productivity. Analogously, machine learning will automate hundreds of millions of jobs that were considered too complex for machines ever to take over even a decade ago, including driving, flying, painting, programming, and customer service, as well as many of the jobs previously reserved for humans in the fields of finance, marketing, operations, accounting, and human resources.

The organizations that use machine learning effectively and survive will most likely focus on hiring primarily individuals who can help them in their journey of continuing to derive value from the use of machine learning. The understanding of how to use algorithms in business will become an essential core competency in the twenty-first century. Reading this book and completing any course using it is the first step in acquiring the skills needed to thrive in this new world. Much has been made of the need for data scientists, and data scientist salaries certainly support the premium that industry is placing on such individuals to create and support all the above applications. A popular meme once decreed that you should “always be yourself, unless you can be Batman, then always be Batman.” Akin to this: if you can find a way to be a data scientist, always be a data scientist (especially one as good at his or her job as Batman is at his), but if you cannot be a data scientist, be the best self you can be by making sure you understand the machine learning process.

Despite the current doubt within many organizations about their machine learning capabilities, the odds are that many use one or several technologies with machine learning built-in. Examples abound and include fitness bands; digital assistants like Alexa, Siri, or Cortana; the new machine learning–powered beds; and the Nest thermostat, as well as search assistants like Google, where hundreds of machine learning models contribute to every one of your searches.

You are the subject of machine learning hundreds of times every day. Your social media posts are analyzed to predict whether you are a psychopath or suffer from other psychiatric challenges (Garcia & Sikström, 2014), your financial transactions are examined by the credit-card companies to detect fraud and money laundering, and each of us logs into a unique, personalized Amazon Marketplace tailored to our lives by its highly customizing machine learning algorithms. Companies have woven machine learning into everyday life with unseen threads. For example, machine learning models perform the majority of stock trades, and even judges contemplating the level of punishment for convicted criminals make their decisions in consultation with machine learning models. In short, machine learning algorithms already drive hugely disruptive events for humanity, with major revolutions still to come.

1.2 Machine Learning Is Everywhere

Everywhere? While there is likely not an ML algorithm at the top of Mount Everest unless there are also climbers there, there are plenty of machine learning algorithms working through satellite imagery to put together complete maps not hampered by clouds (Hu, Sun, Liang & Sun, 2014). These algorithms also predict poverty through nighttime light intensity (Jean et al., 2016), detect roads (Jean et al., 2016), detect buildings (Sirmacek & Unsalan, 2011), and generate most of the 3D structures in Google Earth. If you have tried Google Earth and have floated through a photo-realistic Paris or Milan wondering what happened to the cars and those incredible Italians who still manage to look marvelous despite riding mopeds to work, an ML algorithm erased them. They simply were too transient and irrelevant for Google Earth’s mapping purpose. Zoom in far enough, though, and you’ll see ghost cars embedded in the asphalt. The important point is that while machine learning algorithms work wonderfully for most large-scale problems, if you know where to look, you’ll find small mistakes. Finding these slips and learning from them could become one of your defining strengths as an analyst, and you will need to develop skills in figuring out which features of a dataset are unimportant, both for the creation of ML models and for determining which results to share with your boss.

We started to create a complete list of all the areas machine learning has taken over or will soon take over, but we soon realized it was a meaningless exercise. Just within the last 24 hours of this writing, news has come of Baidu machine learning that finds missing children in China through facial recognition; Microsoft is developing machine learning–powered video search that can locate your keys if you fill your home with cameras; Facebook announced that their latest language-translation algorithm is nine times faster than their competitors' algorithms; researchers announced an autonomous robot that uses machine learning to inspect bridges for structural damage; and apparently we will soon be able to translate between English and Dolphin (!?). In the meantime, albeit less newsworthy, dozens if not hundreds of teams of data scientists are engaged in making our lives better by replacing teen drivers with artificial intelligence that already drives ten times better than adolescents, with the potential to become 50- to 100-fold safer. Cars have traditionally been 4,000-pound death machines when left in the care of people who are sometimes poorly trained, tired, distracted, or under the influence of medications or drugs. If we leave the driving to machines, car travel will one day become safer and less stressful.

Currently, machine learning is involved when you ask Alexa/Cortana/Siri/Google Assistant to search for anything. ML translates your voice to text, and the Google Assistant uses ML to rank all the pages that contain the keywords you specified. Increasingly, your very use of the Internet is used to figure you out, sometimes for good purposes, such as offering treatment when you are likely to be depressed (Burns et al., 2011) or recognizing when you are likely to drop out of college (Baker & Inventado, 2014). Questionable use of ML-driven identification exists, such as when Facebook infamously shared insights with a potential customer about when children were most likely to be emotionally vulnerable and presumably also more open to certain kinds of ads (Riley, 2017). While such stories are explosive, the reality is that they remain in the territory of poor PR practices. The sad truth is that it likely is impossible to conduct large-scale machine learning to predict what and when to market a product based on something as emotionally driven as social media posts and likes without taking advantage of human weaknesses. Machine learning algorithms will zero in on these weaknesses like a honey badger on the prowl.

In potentially dubious practices, analysts apply machine learning with a training set of past consumers that either bought or didn't buy a given product to predict new purchasers. Analysts create a model (a set of weighted relationships) between the act of buying (vs. not buying) and a set of features (information on the consumers). These features could be "likes" of a given artist or organization, the text of their posts, or even behaviors and likes of the person's friends. If liking an industrial band (a potential sign of untreated depression) on a Monday morning predicts later purchase of leather boots, the model will target the people inclined toward depression on the day and time that feeling is the strongest. In this case, the machine learning will not label these consumers as depressed, or ever know that they possess a clinical condition, but will still take advantage of their depression. Evaluating the existence of such unethical shortcuts in models requires tools that allow careful examination of a model's "innards." Being able to evaluate what happens when a model produces high-stakes decisions is a critical reason for learning the content of this book.

1.3 What Is Machine Learning?

Machine learning was defined by Arthur Samuel, a machine learning pioneer, as “a field of study that gives computers the ability to learn without being explicitly programmed” (McClendon & Meghanathan, 2015, 3). A slight rewrite on that suggests that machine learning enables computers to learn to solve a problem by generalizing from examples (historical data), avoiding the need explicitly to program the solution.

There are two types of machine learning: supervised and unsupervised. Simplified, the main difference is that in supervised machine learning, the data scientist selects what they want the machine to learn, whereas unsupervised machine learning leaves it to the machine to decide what it wants to learn, with no guarantee that what it learns will be useful to the analyst. This dichotomy is a gross simplification, however, as over decades of use, humans have figured out which unsupervised approaches lead to desirable results.1 An example of supervised machine learning might be where a model is trained to split people into two groups: one group of "likely buyers" and another of "likely non-buyers." The modeling of relationships in historical data to predict future outcomes is the central concept through which machine learning is transforming the world, due to the relative clarity of the idea and the ever-expanding business potential available in its implementation. This book will focus exclusively on supervised machine learning because we as authors believe this is where most of the benefit of machine learning lies. From here on, we will refer to supervised machine learning simply as "machine learning."
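To make the supervised case concrete, here is a minimal sketch in Python with scikit-learn; the features, labels, and new-visitor data are all invented.

```python
from sklearn.linear_model import LogisticRegression

# Invented historical examples: [pages_viewed, minutes_on_site].
X_train = [[1, 2], [8, 15], [2, 3], [12, 30], [1, 1], [9, 20]]
y_train = [0, 1, 0, 1, 0, 1]  # the target we chose: 1 = bought, 0 = did not

# The "supervision" is the labeled target; the model generalizes from it.
model = LogisticRegression().fit(X_train, y_train)

# Score a new, unseen visitor.
print(model.predict([[10, 25]]))        # predicted group
print(model.predict_proba([[10, 25]]))  # probability of each group
```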

As humans, we create models all the time. Think back to how you learned to be funny as a child (potentially a life-saving skill to ensure survival). You were probably sitting in your high chair being spoon-fed mashed beans by your father. Realizing that puréed beans taste horrible, you spit the food back into his face, resulting in your mom bursting out in laughter. Your brain kicked into predictive mode, thinking that there must be a relationship between spitting stuff out and getting a positive reaction (laughter). You try it again and spit the next spoonful out on the floor. This does not elicit a reaction from mom (no laughter). On the third spoonful, you decide to replicate the original process, and again your mom rewards you (laughter). You have now established the relationship between spitting food in the face of a parent and a positive outcome. A few days later, being alone with your mother, you decide to show once again what a born comedian you are by spitting food into her face. Much to your surprise, the outcome (no laughter) indicates that your original model of humor may have been wrong. Over time, you continued spitting food in adult faces until you had developed a perfect model of the relationship between this behavior and humor. This is also how machine learning works, constantly searching for relationships between features and a target, often as naively as a toddler, but with enough data that the model will outperform most human adults.

1 In unsupervised machine learning, models are created without the guidance of historical data on which group people with certain characteristics belonged to in the past. Examples of unsupervised machine learning are clustering, factor analysis, and market basket analysis.

In machine learning, there is a target (often an outcome) we are trying to understand and predict in future cases. For example, we may want to know the value of a house given that we know the number of bedrooms, bathrooms, square footage, and location. Being good at predicting the value of a home (and typically exaggerating it quite a bit) used to be the main criterion sellers used to select a realtor. Now, websites such as Zillow.com predict home prices better than any individual realtor ever could. In business, there are other important targets such as churn (will this customer stop being a customer?). We also may want to know if a website visitor will click on an ad for Coach handbags, or whether a job applicant will perform well in your organization. Healthcare management organizations may want to know which patients return to the emergency department (ED) within a month of being released, as well as why they are likely to return. Both the who and the why are important to improve services and preserve resources. We will return to the example of diabetes patients released by a hospital ED in this book and use it to understand the aspects of good patient care. We will also share several datasets covering most business disciplines, including finance (lending club loans), operations (part backorders), sales (luxury shoes), reviews (Boston Airbnb), and human resources (HR attrition, two datasets), as well as some datasets of more general interest, such as student grades and college starting salaries.
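As a small illustration of the house-value example above, here is a sketch of a supervised regression in Python with scikit-learn. The numbers are invented, and location is omitted because it would need encoding first; a production model would, of course, use far more data and features.

```python
from sklearn.linear_model import LinearRegression

# Invented training data: [bedrooms, bathrooms, square_footage].
X = [[2, 1, 900], [3, 2, 1500], [4, 3, 2200], [3, 2, 1700]]
y = [180_000, 310_000, 450_000, 340_000]  # sale prices: the target

model = LinearRegression().fit(X, y)
print(model.predict([[3, 2, 1600]]))  # estimated value of an unseen house
```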

In this book, we will step through a process that will tease out a set of conclusions about what drives an organizational goal. This process applies to almost any setting with a definable and measurable target. For now, think of machine learning as analogous to a skill you have honed over a long life—the ability to decide whether a meal is going to taste good before the first bite (those mashed beans served you well in many ways). Every day, we use our already honed sense of taste to determine which restaurants to go to and which ones to avoid, as well as which of mom's meals are worth visiting for, and which ones require a hastily conjured excuse for escape. Our brains contain an untold number of models that help us predict the outside world ("I'm really hungry, but this berry looks a lot like the berry that killed my cousin"). If our forebears had not developed such abilities, we would have long since died out.

1.4 Data for Machine Learning

More Data Beats a Cleverer Algorithm

ML principle: With the best of feature engineering, results may still not be adequate. Two options remain: finding/designing a better algorithm or getting more data. Domingos (2012) advocates for more data.

AutoML relevance: With AutoML, more data represents the only realistic path to practical and significant improvements.

Let's use an example to outline how to use machine learning algorithms in business. Your CEO just received a report (most likely from you) that 75% of your eCommerce site customers make only one purchase and that most repeat customers return within a week. He would like you to explain what drives a customer to return. It is now your job to quickly gather as much data as possible, starting with what you know. You are aware that a year ago the organization made large design changes to the website that changed the traffic patterns quite a bit, so you decide to start collecting data on customer visits from a month after that change, allowing for the stabilization of traffic patterns. Next, you begin processing account creations from eleven months ago and discard data on accounts that never led to sales (a different question that your CEO is sure to ask about later). For each account, you add a row to a file of customers for your analysis. For each row, you examine whether the customer made another purchase during the next seven days. If they did, you code a value of True into a new column called "Retained," and if not, you code that customer as False. You stop adding customers who created accounts within a week of today's date to avoid labeling the most recent customers as non-retained when they have not yet had a whole week to return for a second purchase.
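A sketch of this labeling step in Python with pandas follows; the file names, column names, and redesign date are all hypothetical.

```python
import pandas as pd

# Hypothetical files and column names throughout.
accounts = pd.read_csv("accounts.csv", parse_dates=["account_created"])
purchases = pd.read_csv("purchases.csv", parse_dates=["purchase_date"])

# Start one month after the site redesign (date assumed) and stop a week
# before today, so recent customers are not mislabeled as non-retained.
start = pd.Timestamp("2019-02-01")
cutoff = pd.Timestamp.today() - pd.Timedelta(days=7)
accounts = accounts[accounts["account_created"].between(start, cutoff)]

# "Retained" = a second purchase within seven days of the first one.
first = (purchases.groupby("customer_id")["purchase_date"]
         .min().rename("first_purchase").reset_index())
merged = purchases.merge(first, on="customer_id")
repeat = merged[(merged["purchase_date"] > merged["first_purchase"]) &
                (merged["purchase_date"] <= merged["first_purchase"]
                 + pd.Timedelta(days=7))]
accounts["Retained"] = accounts["customer_id"].isin(repeat["customer_id"])
```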

Your next task is to gather as much data as possible for both "retained" and "lost" customers. As such, the specifics of the second purchase are of no interest because that information is only available for your retained customers. However, the specific details of the first purchase are important. Likely information that will require new columns (we call those features) includes the following (a sketch of deriving a few of these features appears after the list):

1. Which site sent them to you? Was it a Google search, Bing search, or did they come directly to your site?

2. Did this customer come to us through an ad? If so, which ad campaign was it (presumably each has a unique identifier)?

3. How many different types of products did the customer buy? Was one of the products a high-end shoe? Was one of the products a kitchen knife? (Yes, depending on your company's product mix, this set of features can become quite large, and often we must operate at the product category level and generalize, saying for example that we sold footwear or kitchen equipment.)

4. How long did it take the customer after creating an account to make a purchase?

5. What was the customer’s purchasing behavior? Did they know what they wanted to purchase, or did they browse several products before deciding? If so, what were those products?

6. Which browser did they use? What language pack do they have installed on their computer?

7. What day of the week did they make their purchase?

8. What time of day was the purchase made?

9. What geographic location did they come from (might require a lookup on their Internet Protocol (IP) address and the coding of that location with a longitude and latitude)?
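As a minimal sketch of how a few of these features (items 1, 4, 7, and 8) might be derived in Python with pandas, all names and values below are invented for illustration:

```python
import pandas as pd

# A miniature version of the customer table, with invented values.
customers = pd.DataFrame({
    "account_created": pd.to_datetime(["2019-03-01 09:15", "2019-03-02 20:40"]),
    "first_purchase":  pd.to_datetime(["2019-03-01 09:55", "2019-03-05 18:02"]),
    "referrer":        ["google_search", "direct"],
})

# Item 4: time from account creation to first purchase, in hours.
customers["hours_to_purchase"] = (
    customers["first_purchase"] - customers["account_created"]
).dt.total_seconds() / 3600

# Items 7 and 8: day of week and hour of the purchase.
customers["purchase_weekday"] = customers["first_purchase"].dt.day_name()
customers["purchase_hour"] = customers["first_purchase"].dt.hour

# Item 1: the referring site, one-hot encoded into indicator columns.
customers = pd.get_dummies(customers, columns=["referrer"])
print(customers)
```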

The process of feature creation continues as the particular case demands. You may create dozens if not hundreds of features before considering the use of a customer’s email address, credit card information, or mailing address to purchase additional information. Such information may be purchased from the likes of Equifax (credit reports and scores), Towerdata (email address–based data like demographics, interests, and purchase information), Nielsen (viewing and listening patterns recorded via television, internet, and radio behavior), or Acxiom. A variety of data on consumers is available in categories such as:

1. Demographic data including gender, educational level, profession, number of children;

2. Data on your home including the number of bedrooms, year built, square footage, mortgage size, house value;

3. Vehicle data including car make and model;

4. Household economic data including credit card type(s) and household income;

5. Household purchase data including what kind of magazines are read;

6. Average dollars spent per offline purchase as well as the average spent per online purchase; and

7. Household interest data.

Datafiniti is another vendor, providing data on consumers, competitor data scraped from websites, business location data, and property data. You may also buy data from aggregators like DemystData, which provides access to the above data vendors as well as many others. Returning to your role as an analyst, you generally will not make an argument to purchase additional data until you have examined the usefulness of internal data. Note also the case for retaining a healthy skepticism about the quality of paid-for data. With your table containing several features and a target (customer retained or lost), you now have everything you need to start your machine learning adventure.

Later in the book, we will go through a generalizable procedure for analyzing the gathered data and finding the answer to your CEO's question (what drives a customer to return?) and other related questions. You will also become familiar with the business decisions that follow from creating a model. A model is a set of relationships between the features and the target. A model also captures the strength of these relationships, and it becomes a "definition," if you will, of what drives customer retention. Beyond being usable for answering your CEO's question, the most exciting part of the model you just created is that it can be applied to first-time customers while they are still on the website. In a matter of milliseconds, the model can provide a probability (from zero to one) that this customer will be retained (for example, 0.11, or not very likely). In the case of a low probability, a coupon for their next visit may be provided to increase their motivation to return to the site. Before getting to that point, however, we will learn how to interpret the ML results. We cover which features are most important (for your presentation to the CEO as well as for your understanding of the problem) and finally, the level of confidence in the model (can its probabilities be trusted?) and what kinds of decisions can be driven by the model (what probability cutoffs may be used?). In the end, we hope that while you will still be yourself after completing this book, your CEO will think of you as Batman due to your ability to solve these kinds of problems with relative ease.

1.5 Exercises

1. List areas in which machine learning likely already affects you, personally.

2. What is machine learning?

3. What is a target?

4. What is a feature?

5. What is target leakage?

2

Automating Machine Learning

Follow a Clearly Defined Business Process

ML principle: “Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages” (Provost & Fawcett, 2013, 14).

AutoML relevance: Machine learning is at its core a tool for problem-solving. AutoML drastically reduces the necessary technical skills. This can leave more time and attention available for focusing on business process.

Davenport and Patil (2012) suggested that coding was the most basic and universal skill of a data scientist but that this would be less true in five years. Impressively, they predicted that the more enduring skill would be to “communicate in the language that all stakeholders understand”—and to demonstrate the special skills involved in storytelling with data. The next step for us, then, is to develop these abilities vital for the data scientist or a subject matter expert capable of doing some of the work of a data scientist.

True to Davenport and Patil’s vision, this book does not require coding skills unless your data is stored in complex databases or across multiple files. (We will provide some exercises allowing you to test your data skills if you are so inclined.) With that in mind, let’s review the content of the book laid out visually now in Figure 2.1. The machine learning life-cycle has its roots in extensive research and practitioner experience (Shearer, 2000) and is designed to be helpful to everyone from novices to machine learning experts.

The life cycle model, while depicted linearly here, is not an altogether linear process. For every step in the process, lessons learned may require a return to a previous step, even multiple steps back. Unfortunately, it is not uncommon to get to the Interpret & Communicate stage and find a problem requiring a return to Define Project Objectives, but careful adherence to our suggestions should minimize such problems. In this book, each of the five stages is broken down into actionable steps, each examined in the context of the hospital readmission project.

Figure 2.1. Machine Learning Life Cycle Model.

This book takes advantage of Automated Machine Learning (AutoML) to illustrate the machine learning process in its entirety. We define AutoML as any machine learning system that automates the repetitive tasks required for effective machine learning. For this reason, among others, AutoML is capturing the imagination of specialists everywhere. Even Google's world-leading Google Brain data scientists have been outperformed by AutoML (Le & Zoph, 2017). As machine learning progress is traceable mostly to computer science, it is worth seeing AutoML initially from the code-intensive standpoint. Traditionally, programming has been about automating or simplifying tasks otherwise performed by humans. Machine learning, on the other hand, is about automating complex tasks requiring accuracy and speed beyond the cognitive capabilities of the human brain. The latest in this development, AutoML, is the process of automating machine learning itself. AutoML insulates the analyst from the combined mathematical, statistical, and computer sciences that are taking place "under the hood," so to speak. As one of us, Dan Becker, has been fond of pointing out, you do not learn to drive a car by studying engine components and how they interact. The process leading to a great driver begins with adjusting the mirrors, putting on the seat belt, placing the right foot on the brake, starting the car, putting the gear shift into "drive," and slowly releasing the brake.

As the car starts moving, attention shifts to the outside environment as the driver evaluates the complex interactions between the gas, brake, and steering wheel combining to move the car. The driver is also responding to feedback, such as vibrations and the car’s position on the road, all of which require constant adjustments to accommodate, focusing more on the car’s position on the road rather than the parts that make it run. In the same way, we best discover machine learning without the distraction of considering its more complex working components: whether the computer you are on can handle the processing requirements of an algorithm, whether you picked the best algorithms, whether you understand how to tune the algorithms to perform their best, as well as a myriad of other considerations. While the Batmans of the world need to understand the difference between a gasoline-powered car and an electric car and how they generate and transfer power to the drivetrain, we thoroughly believe that the first introduction to machine learning should not require advanced mathematical skills.

2.1 What Is Automated Machine Learning?

We started the chapter by defining AutoML as the process of automating machine learning, a very time- and knowledge-intensive process. A less self-referential definition may be "off-the-shelf methods that can be used easily in the field, without machine learning knowledge" (Guyon et al., 2015, 1). While this definition may be a bit too optimistic about how little knowledge of machine learning is necessary, it is a step in the right direction, especially for fully integrated AutoML, such as Salesforce Einstein.

Most companies that have adopted AutoML tend to be tight-lipped about the experience, but a notable exception comes from Airbnb, an iconic sharing economy company that recently shared its AutoML story (Husain & Handel, 2017). One of the most important data science tasks at Airbnb is to build customer lifetime value models (LTV) for both guests and hosts. This allows Airbnb to make decisions about individual hosts as well as aggregated markets such as any city. Because the traditional hospitality industry has extensive physical investments, whole cities are often lobbied to forbid or restrict sharing economy companies. Customer LTV models allow Airbnb to know where to fight such restrictions and where to expand operations.

To increase efficiency, Airbnb identified four areas where repetitive tasks negatively impacted the productivity of their data scientists. These were areas where AutoML had a definite positive impact on productivity. While these will be discussed later, it is worth noting these important areas:

1. Exploratory data analysis. The process of examining the descriptive statistics for all features as well as their relationship with the target.

2. Feature engineering. The process of cleaning data, combining features, splitting features into multiple features, handling missing values, and dealing with text, to mention a few of potentially hundreds of steps.

3. Algorithm selection and hyperparameter tuning.1 Keeping up with the "dizzying number" of available algorithms and their quadrillions of parameter combinations and figuring out which work best for the data at hand (a toy sketch of this step follows the list).

4. Model diagnostics. Evaluation of top models, including the confusion matrix and different probability cutoffs.
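To make the third area concrete, here is a toy sketch in Python with scikit-learn of what an AutoML system automates at vastly larger scale: looping over candidate algorithms and hyperparameter settings and keeping the best cross-validated performer. The data and parameter grids are invented, and a real system would consider far more candidates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)  # invented data

# A tiny slice of the "dizzying number" of algorithm/parameter combinations.
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [50, 200]}),
]

best_score, best_model = 0.0, None
for estimator, grid in candidates:
    # Cross-validated search over each algorithm's hyperparameter grid.
    search = GridSearchCV(estimator, grid, cv=5).fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(best_model, best_score)  # the winner on cross-validated accuracy
```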

In a stunning revelation, Airbnb stated that AutoML increased their data scientists' productivity, "often by an order of magnitude" (Husain & Handel, 2017). Given data scientist salaries, this should make every CEO sit up and take notice. We have seen this multiplier in several of our own and our customers' projects; Airbnb's experiences fit with our own in using AutoML for research. In one case, a doctoral student applied AutoML to a project that was then in its third month. The student came back after an investment of two hours in AutoML with a performance improvement twice that of earlier attempts. In this case, the problem was that he had not investigated a specific class of machine learning that turned out to work especially well for the data. It is worth noting that rather than feel defeated by this result, the student could fine-tune the hyperparameters of the discovered algorithm to later outperform the AutoML. Without AutoML in the first place, however, we would not have gotten such magnified results. Our experience tracks with feedback from colleagues in industry. The Chief Analytics Officer who convinced Kai to consider the area of AutoML told a story of how DataRobot, the AutoML platform used in this book, outperformed his large team of data scientists right out of the box. This had clearly impressed him because of the size of his team, their decades of math and statistics knowledge, and their domain expertise. Similarly, AutoML allowed Airbnb data scientists to reduce model error by over 5%, the significance of which can only be explained through analogy: consider that Usain Bolt, the sprinter whose name has become synonymous with the 100-meter dash, only improved the world record by 1.6% throughout his career (Aschwanden, 2017).

1 Hyperparameter is a fancy word for all the different settings that affect how a given algorithm works.

For all the potential of AutoML to support and speed up existing teams of data scientists, its greater promise is that it enables the democratization of data science. It makes data science available and understandable to most, and it makes subject matter expertise more important, because it may now be faster to train a subject matter expert in the use of AutoML than to train a data scientist to understand the business subject matter at hand (for example, accounting).

2.2 What Automated Machine Learning Is Not

Love Improves Accuracy

ML principle: “An expert in some particular [algorithm]—maybe the person who developed it—can squeeze more performance out of it than someone else” (Witten et al., 2011, 378).

AutoML relevance: AutoML companies level the playing field by giving many algorithms a chance to perform for massive sets of data in different contexts. Only algorithms that perform are retained.

AutoML is not automatic ML. That is, there are still several decisions that must be made by the analyst, along with a requisite skill set for evaluating the results of applying machine learning to any dataset. For example, a subject matter expert must decide which problems are worth solving, determine which ideas are worthy of testing, and develop a solid understanding of common pitfalls and model evaluation skills. We base this book on a belief that most machine learning will be executed by subject matter experts who understand the machine learning life-cycle process (Figure 2.1) and have been trained to understand machine learning conceptually rather than mathematically. While some readers might sigh in relief, the more mathematically inclined among you may be thinking that you are about to be cheated out of essential details. We will recommend great resources for deepening your knowledge of specific machine learning algorithms but will focus on developing a deep understanding of the machine learning process for business and the evaluation of machine learning results. For the first time, the existence of industrial-strength AutoML systems makes this possible without giving up the accuracy associated with expert-driven machine learning.

The fledgling data scientist may worry that AutoML will take away his or her job. Over time, the job market for data scientists will be greatly affected as AutoML takes over some of the most knowledge-based and time-intensive tasks currently performed by data scientists. However, given a quadrupling of interest in machine learning during the three-year period from 2014 to 2017,2 it is currently hard for educational programs to keep up with demand. The tools are likely to affect data scientists disproportionately if they have neither deep subject matter knowledge in key business areas (e.g., accounting or marketing) nor cutting-edge skills in optimizing hyperparameters for the latest and best algorithms. Generally speaking, AutoML is good news for data scientists because it frees them from manually testing out all the latest algorithms, the vast majority of which will not improve the performance of their work. Less obviously, it also frees up time for the data scientist to focus on difficult problems not yet solved by AutoML applications, such as time series motifs (repeated segments in long time-series data) and trajectory mining (in which direction something is headed). With AutoML, data scientists can shift focus to the effective definition of machine learning problems and the enrichment of datasets through locating and adding new data. AutoML also saves time on process design and improves efficiency in handling some current tasks of the data scientist, such as monitoring dozens to hundreds of different algorithms while trying out different hyperparameters, a task at which algorithms have been found to be better than humans (Bergstra, Bardenet, Bengio, & Kégl, 2011).

2.3 Available Tools and Platforms

While there are as many different types of AutoML as there are AutoML tools and platforms, in general, there are two types: context-specific tools and general platforms. Context-specific tools are implemented within another system or for a specific purpose. For example, as mentioned above, Salesforce Einstein embeds AutoML tools within the existing Salesforce platform. The tool scores "leads," or potential customers, on the likelihood of a sale, using customer sentiment, competitor involvement, and overall prospect engagement to examine the likelihood of a deal closing. Another context-specific tool, Wise.io, now owned by

2 Result from http://trends.google.com when searching for “machine learning.” The plot represents the relative level of Google search for the term from May 13, 2014, to the same date in 2017. Interestingly, News search for the same term increased by a factor of 10 in the same period, indicating some hype in the news media.
