Practitioner Track Overview
Leaders in the data science community will present challenges their organizations have faced in becoming model-driven, and provide frameworks, innovations, tangible tips, and more.
Why Human-Centered Design is Necessary for Effective Data Science
11:15 a.m.-12:00 p.m. | May 30
Data science is typically framed in terms of using data, sound statistical methodology, and machine learning tools to produce valuable insights, predictions, and recommendations. But another hallmark of data science is its practical nature: algorithmic indications are meant to be acted upon. This implies that end users must understand how to properly interpret and act on those indications. Borrowing ideas from psychology and behavioral economics, this talk will explore the idea that human-centered design belongs in the data scientist’s conceptual toolkit.
Chief Data Scientist, Deloitte Consulting
Fast Feature Evaluation with Monte Carlo Permutation Testing
1:15-2:00 p.m. | May 30
One of the challenges of working with high-dimensional data is quickly determining which independent variables most strongly influence the dependent variable. Unfortunately, the higher the dimensionality, the more likely it is that any seemingly influential independent variable was selected by random luck. Enter Monte Carlo Permutation Testing (MCPT). By permuting the dependent variable many times, calculating information measures on these permutations, and comparing them to the actual information measure, a practitioner can be more confident that a selected variable is truly informative. The goal of this talk is both to introduce the concept of MCPT to data science practitioners and to demonstrate some recent advances in the Python ecosystem that make parallelization more seamless.
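The permutation procedure described above can be sketched in a few lines. This is an illustrative sketch, not code from the talk: the data, the choice of mutual information as the information measure, and the function name are all assumptions.

```python
# Illustrative MCPT sketch: permute the dependent variable many times,
# recompute an information measure each time, and compare to the
# unpermuted ("actual") value to estimate a p-value for the feature.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)

# Toy data: x1 truly influences y; x2 is pure noise.
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(size=n)

def mcpt_pvalue(x, y, n_permutations=200, seed=0):
    """Fraction of permuted-target MI scores >= the actual MI score."""
    rng = np.random.default_rng(seed)
    actual = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]
    exceed = 0
    for _ in range(n_permutations):
        y_perm = rng.permutation(y)
        perm = mutual_info_regression(x.reshape(-1, 1), y_perm,
                                      random_state=0)[0]
        if perm >= actual:
            exceed += 1
    # The +1 correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_permutations + 1)

print(mcpt_pvalue(x1, y))  # small p-value: x1 is genuinely informative
print(mcpt_pvalue(x2, y))  # a larger p-value is expected for the noise feature
```

Each permutation is independent, which is what makes the procedure embarrassingly parallel and a natural fit for the parallelization tooling the talk covers.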
Data Scientist, Dell
Using Data Science and DNA to Discover Our Ancestral Origins
2:00-2:45 p.m. | May 30
Similar to social networks of people and their social relationships (e.g., friends), a genetic network is based upon people and their genetic relationships (e.g., cousins). AncestryDNA has built the world’s largest genetic network of ~10M people. Ahna will discuss innovations in discovering Genetic Communities™ within this network, leveraging old census records and other historical information to tell data-driven stories about their history, and using machine learning to determine each customer’s genetic communities. The result helps AncestryDNA customers learn more about themselves and where they come from.
PhD, Senior Computational Research Scientist, AncestryDNA
Transforming Healthcare with Deep Learning on GPUs
3:00-3:45 p.m. | May 30
Essential administrative functions (aka back-office functions) in health care are very well suited to assistance and automation with deep learning. These important functions are often too complex for rule-based systems, but with lots of well-labeled data they are a good fit for deep learning. OptumLabs Center for Applied Data Science will share how they design and train deep learning neural networks, powered by GPUs, for these use cases.
Vice President, OptumLabs Center for Applied Data Science, UnitedHealth Group
Artisan Data Science is Dead
3:45-4:15 p.m. | May 30
Roar Data is an experiment in collective intelligence undertaken by J.P. Morgan and academic and industry partners. The premise is that streaming data, machine learning, microservices, and cryptography support powerful new ways of organizing predictive analytics. The engineering ambition is a Prediction Web open to all – furthering the democratization of data science started by MOOCs and open source software.
Executive Director, JPMorgan
Uplift Modeling for Driving Incremental Revenue by Display Remarketing
11:00-11:45 a.m. | May 31
Traditional marketing campaigns target responsive customers, but may not necessarily target the people for whom the campaigns are most profitable. In contrast, uplift modeling predicts the causal effect of marketing campaigns by comparing the response rate of individuals in a treatment group against those in a control group, and thus selects the most incremental customers to target. The Wayfair marketing data science team has been using uplift modeling to drive incremental revenue by optimizing display remarketing to identify such customers. These models work in conjunction with a click-through rate prediction model to send Wayfair’s display remarketing advertisements to the right people at the right time in the right place across the Internet.
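The treatment-versus-control comparison described above can be illustrated with a simple two-model ("T-learner") sketch: fit one response model on the treatment group and another on the control group, and score uplift as the difference in predicted response probabilities. The data and modeling choices here are assumptions for illustration, not Wayfair's actual approach.

```python
# Illustrative two-model uplift sketch: uplift = P(respond | treated)
# minus P(respond | not treated), estimated with separate models.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 4000

# Synthetic customers: a single feature drives how persuadable they are.
x = rng.normal(size=(n, 1))
treated = rng.integers(0, 2, size=n)

# True response probability: a baseline plus a treatment effect that
# only applies to customers with x > 0.
base = 0.2
effect = 0.3 * (x[:, 0] > 0) * treated
converted = rng.random(n) < np.clip(base + effect, 0, 1)

# Fit one response model per group.
model_t = LogisticRegression().fit(x[treated == 1], converted[treated == 1])
model_c = LogisticRegression().fit(x[treated == 0], converted[treated == 0])

# Uplift score: predicted incremental conversion probability per customer.
uplift = model_t.predict_proba(x)[:, 1] - model_c.predict_proba(x)[:, 1]

# Customers with x > 0 should look more "incremental" on average.
print(uplift[x[:, 0] > 0].mean() > uplift[x[:, 0] <= 0].mean())  # True
```

Targeting the highest-uplift customers (rather than the most responsive ones) is what drives incremental, as opposed to merely attributed, revenue.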
Senior Data Scientist, Wayfair
Experimental Design: Improving R&D Workflows for Better, Faster Results
11:45 a.m.-12:30 p.m. | May 31
The data science process is inherently more experimental and iterative than traditional software engineering. It is more science than programming. As a result, it is necessary to set up your workflows in a way that accounts for this uncertainty and the fact that you may try 99 things before the 100th experiment pays off. In this session, we’ll share best practices for tracking key quantitative results. In particular, we’ll explore implementing “training wheels” to allow for safe failure modes, hyper-parameter optimization, creating a modern lab notebook, and ensuring your work is optimized for collaboration with other data scientists and with subject matter experts across the business.
Classify All the Things (with Multiple Labels): The Most Common Type of Modeling Task No One Ever Talks About
1:45-2:30 p.m. | May 31
Every introductory data science textbook discusses two types of classification tasks: _binary classification_ tasks such as predicting whether an image depicts a dog, and _multiclass classification_ tasks, like predicting the species of an animal in an image. But they almost never discuss the problem of _multilabel classification_, in which multiple, non-mutually exclusive labels are to be predicted for every input pattern. (An example of such a task might be predicting a set of activities that a pictured subject is engaging in — e.g., eating, talking, driving, etc.) Because there can be dependencies between classes, more sophisticated methods can perform better than simply building a host of independent binary models. This talk will discuss some methods suited to multilabel prediction, and demonstrate their use with existing toolkits.
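The contrast described above — independent per-label models versus a method that exploits label dependencies — can be sketched with existing toolkits. This is an illustrative sketch using scikit-learn's `OneVsRestClassifier` (binary relevance) and `ClassifierChain`; the talk's actual methods and toolkits are not specified.

```python
# Illustrative multilabel sketch: a baseline of independent binary
# classifiers versus a classifier chain, which feeds each classifier
# the previous labels so it can model dependencies between them.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

# Synthetic data where each sample can carry several of 5 labels.
X, Y = make_multilabel_classification(n_samples=1000, n_classes=5,
                                      random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# Baseline ("binary relevance"): one independent binary model per label.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)

# Classifier chain: each model also sees the preceding labels.
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        random_state=0).fit(X_tr, Y_tr)

print(f1_score(Y_te, ovr.predict(X_te), average="micro"))
print(f1_score(Y_te, chain.predict(X_te), average="micro"))
```

Micro-averaged F1 is a common multilabel metric; whether the chain beats the independent baseline depends on how strongly the labels actually depend on one another.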
Data Science Manager, American Family Insurance
Understanding the Customer: Audience Analysis at a Modern Media Company
2:30-3:15 p.m. | May 31
“How can I better understand my customers?” This is a common question that many companies are asking of themselves and of their data in order to drive growth.
At Vox Media, we are continually striving to better understand our audience and deliver the content that is most engaging to them. We will discuss the elements of designing a good “customer database” in order to extract maximum value for the business.
Clickstream tracking can generate millions to billions of records of complex data on a daily basis. While it is extremely valuable to be able to pass and save custom data fields, the sheer volume of data often makes it difficult to extract even basic insights, not to mention implementing advanced analytic or machine learning models.
Creating a derived database from clickstream data is a valuable first step in understanding one’s customer base. If well conceived and constructed, this dataset can become the foundation upon which analysts can quickly arrive at insights and data scientists can build models.
Head of Data Science, Vox Media
Getting Models from Development to Production
3:45-4:30 p.m. | May 31
Building a high-performing model that’s been tested, iterated on and validated for production is only half the battle. Translating models from the development environment where they were built to actually run in production — whether that means serving an API, web app, or report that influences a business decision — can often be a challenge in itself. In this session, Randi will talk about the programs and mechanisms Dell is putting in place to support systematic model deployments, starting with an initial pilot project to deploy machine learning models to 200 users, and scaling rapidly from there.
Topics to be covered include:
- Technological decisions, limitations and workarounds
- Integrating with varied endpoints including APIs, web apps and cloud servers
- Usage metrics to track success
- Stakeholder feedback and alignment
- Results to date
- Long-term plans and lessons learned
Data Scientist, Dell
Jupyter Best Practices and Interoperability with Other Tools
4:30-5:15 p.m. | May 31
Jupyter users love it for exploring data, testing ideas, and writing code, and they’re used to sharing work and ideas in notebooks. But few people appreciate how powerful Jupyter is for generating interactive dashboards. Mac Rogers shares best practices for creating Jupyter dashboards, along with some lesser-known tricks for making them interactive and attractive, appealing to nontechnical users, and providing a permanent record of any information delivered. He’ll also talk through ways of leveraging Jupyter alongside other popular modeling tools.
Research Engineer, Domino Data Lab