Leaders in the data science community will present the challenges their organizations have faced in becoming model-driven and share frameworks, innovations, tangible tips, and more.

Practitioner Track Overview

Why Human-Centered Design is Necessary for Effective Data Science

11:15 a.m.-12:00 p.m. | May 30

Data science is typically framed in terms of using data, sound statistical methodology, and machine learning tools to produce valuable insights, predictions, and recommendations. But another hallmark of data science is its practical nature: algorithmic indications are meant to be acted upon. This implies that end-users must understand how to properly interpret and act on algorithmic indications. Borrowing ideas from psychology and behavioral economics, this talk will explore the idea that user-centered design belongs in the data scientist’s conceptual toolkit.

Jim Guszcza
Chief Data Scientist, Deloitte Consulting

Fast Feature Evaluation with Monte Carlo Permutation Testing

1:15-2:00 p.m. | May 30

One of the challenges of working with high-dimensional data is quickly determining which independent variables most strongly influence the dependent variable. Unfortunately, the higher the dimensionality, the more likely it is that any independent variable selected as influential appears so by random luck. Enter Monte Carlo Permutation Testing (MCPT). By permuting the dependent variable many times, calculating information measures on these permutations, and comparing these measures to the actual information measure, a practitioner can be more confident that a selected variable will truly be informative. The goal of this talk is both to introduce the concept of MCPT to data science practitioners and to demonstrate some recent advances within the Python ecosystem that enable parallelization more seamlessly.
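The permutation procedure described above can be sketched in a few lines of standard-library Python. This is only an illustration, not the speaker's implementation: the choice of mutual information as the information measure, the function names `mutual_information` and `mcpt_p_value`, and the synthetic data are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def mcpt_p_value(feature, target, n_permutations=1000, seed=0):
    """Return (observed measure, permutation p-value) for one feature."""
    rng = random.Random(seed)
    observed = mutual_information(feature, target)
    shuffled = list(target)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between feature and target
        if mutual_information(feature, shuffled) >= observed:
            at_least_as_large += 1
    # The +1 terms count the unpermuted ordering as one of the permutations.
    return observed, (at_least_as_large + 1) / (n_permutations + 1)


# Synthetic data (assumed for illustration only).
rng = random.Random(42)
target = [rng.randint(0, 1) for _ in range(200)]
# Informative feature: a copy of the target with every tenth value flipped.
informative = [1 - t if i % 10 == 0 else t for i, t in enumerate(target)]
# Noise feature: drawn independently of the target.
noise = [rng.randint(0, 1) for _ in range(200)]

mi_inf, p_inf = mcpt_p_value(informative, target, n_permutations=500)
mi_noise, p_noise = mcpt_p_value(noise, target, n_permutations=500)
print(f"informative: MI={mi_inf:.3f}, p={p_inf:.4f}")
print(f"noise:       MI={mi_noise:.3f}, p={p_noise:.4f}")
```

A small p-value means that few shuffled targets produced a measure as large as the real one. Because each permutation is independent of the others, the loop is a natural candidate for parallel execution.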

David Patschke
Data Scientist, Dell

Using Data Science and DNA to Discover Our Ancestral Origins

2:00-2:45 p.m. | May 30

Similar to social networks of people and their social relationships (e.g., friends), a genetic network is based upon people and their genetic relationships (e.g., cousins). AncestryDNA has built the world’s largest genetic network of ~10M people. Ahna will discuss innovations in discovering Genetic Communities™ within this network, leveraging old census records and other historical information to tell data-driven stories about their history, and using machine learning to determine each customer’s genetic communities. The result helps AncestryDNA customers learn more about themselves and where they come from.

Ahna Girshick
PhD, Senior Computational Research Scientist, AncestryDNA

Transforming Healthcare with Deep Learning on GPUs

3:00-3:45 p.m. | May 30

Essential administrative functions (aka back-office functions) in health care are very well suited for assistance and automation with Deep Learning. These important functions are often too complex for rules, but with lots of well-labeled data, they are a good fit for Deep Learning. OptumLabs Center for Applied Data Science will share how they design and train Deep Learning Neural Networks, powered by GPUs, for these use cases.

Sanji Fernando
Vice President, OptumLabs Center for Applied Data Science, UnitedHealth Group

Artisan Data Science is Dead

3:45-4:15 p.m. | May 30

Roar Data is an experiment in Collective Intelligence undertaken by J.P. Morgan together with academic and industry partners. The premise is that streaming data, machine learning, microservices, and cryptography support powerful new ways of organizing predictive analytics. The engineering ambition is a Prediction Web open to all, furthering the democratization of data science started by MOOCs and open source software.

Peter Cotton
Executive Director, JPMorgan

Uplift Modeling for Driving Incremental Revenue by Display Remarketing

11:00-11:45 a.m. | May 31

Traditional marketing campaigns target responsive customers, but may not necessarily target people for whom the campaigns are most profitable. In contrast, uplift modeling predicts the causal effect of marketing campaigns by comparing the response rate of individuals in a treatment group against those in a control group, and thus selects the most incremental customers to target. The Wayfair marketing data science team has been using uplift modeling to drive incremental revenue by optimizing our display remarketing to identify such customers. These models work in conjunction with a click-through rate prediction to send Wayfair’s display remarketing advertisements to the right people at the right time in the right place across the Internet.
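The treatment-versus-control comparison described above can be illustrated with a minimal sketch. It estimates uplift per customer segment as the difference between the treated and control response rates. The segment names, the record layout, and the counts are invented for the example, and this simple two-group difference is not claimed to be the model Wayfair uses.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate uplift per segment from (segment, treated, responded) records.

    Uplift here is the treated response rate minus the control response rate.
    """
    # counts[segment][treated] -> [number of responders, group size]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, responded in records:
        cell = counts[segment][treated]
        cell[0] += int(responded)
        cell[1] += 1

    uplift = {}
    for segment, groups in counts.items():
        (t_resp, t_n), (c_resp, c_n) = groups[True], groups[False]
        if t_n == 0 or c_n == 0:
            continue  # no comparison is possible without both groups
        uplift[segment] = t_resp / t_n - c_resp / c_n
    return uplift


def make_group(segment, treated, responders, total):
    """Build `total` records for one group, `responders` of which responded."""
    return [(segment, treated, i < responders) for i in range(total)]


# Hypothetical data: "loyal" customers respond often with or without ads,
# while "lapsed" customers respond noticeably more when shown an ad.
records = (
    make_group("loyal", True, 8, 10)
    + make_group("loyal", False, 8, 10)
    + make_group("lapsed", True, 4, 10)
    + make_group("lapsed", False, 1, 10)
)

uplift = uplift_by_segment(records)
best_segment = max(uplift, key=uplift.get)
print(uplift, best_segment)
```

In this toy data the "loyal" segment has the higher response rate but zero uplift, so a response-based campaign and an uplift-based campaign would pick different segments.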

Robert Yi
Senior Data Scientist, Wayfair

Experimental Design: Improving R&D Workflows for Better, Faster Results

11:45 a.m.-12:30 p.m. | May 31

The data science process is inherently more experimental and iterative than traditional software engineering. It is more science than programming. As a result, it is necessary to set up your workflows in a way that accounts for this uncertainty and the fact that you may try 99 things before the 100th experiment pays off. In this session, we’ll share best practices for tracking key quantitative results. In particular, we’ll explore implementing “training wheels” to allow for safe failure modes, hyper-parameter optimization, creating a modern lab notebook, and ensuring your work is optimized for collaboration with other data scientists and with subject matter experts across the business.

Erik Andrejko
CTO, Wellio

Classify All the Things (with Multiple Labels): The Most Common Type of Modeling Task No One Ever Talks About

1:45-2:30 p.m. | May 31

Every introductory data science textbook discusses two types of classification tasks: _binary classification_ tasks such as predicting whether an image depicts a dog, and _multiclass classification_ tasks, like predicting the species of an animal in an image. But they almost never discuss the problem of _multilabel classification_, in which multiple, non-mutually exclusive labels are to be predicted for every input pattern. (An example of such a task might be predicting the set of activities that a pictured subject is engaging in, e.g., eating, talking, or driving.) Because there can be dependencies between classes, more sophisticated methods can perform better than simply building a host of multiclass models. This talk will discuss some methods suited to multilabel prediction, and demonstrate their use with existing toolkits.
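To make the multilabel setting concrete, the sketch below encodes label sets as multi-hot vectors and checks whether one label's frequency changes when another label is present, which is the kind of dependency the abstract mentions. The annotations and the helper names (`multi_hot`, `marginal_rate`, `conditional_rate`) are hypothetical, and the sketch covers only data representation, not any particular method from the talk.

```python
def multi_hot(label_sets, classes):
    """Encode each set of labels as a 0/1 vector ordered like `classes`."""
    return [[int(c in labels) for c in classes] for labels in label_sets]


def marginal_rate(rows, target):
    """Estimate P(label at index `target` is present)."""
    return sum(r[target] for r in rows) / len(rows)


def conditional_rate(rows, given, target):
    """Estimate P(target present | given present); None if `given` never occurs."""
    given_rows = [r for r in rows if r[given] == 1]
    if not given_rows:
        return None
    return sum(r[target] for r in given_rows) / len(given_rows)


# Hypothetical annotations: activities visible in eight images.
classes = ["eating", "talking", "driving"]
annotations = [
    {"eating", "talking"},
    {"driving"},
    {"eating", "talking"},
    {"talking"},
    {"driving", "talking"},
    {"eating"},
    set(),  # an image may carry no label at all
    {"driving"},
]

Y = multi_hot(annotations, classes)
EATING, TALKING, DRIVING = range(3)

p_talking = marginal_rate(Y, TALKING)
p_talking_given_eating = conditional_rate(Y, EATING, TALKING)
p_eating_given_driving = conditional_rate(Y, DRIVING, EATING)
print(p_talking, p_talking_given_eating, p_eating_given_driving)
```

Unlike a multiclass target, each row may contain any number of ones. When a conditional rate differs from the corresponding marginal rate, the labels are not independent in the sample, which is what models that share information across labels try to exploit.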

Derrick Higgins
Data Science Manager, American Family Insurance

Understanding the Customer: Audience Analysis at a Modern Media Company

2:30-3:15 p.m. | May 31

“How can I better understand my customers?” This is a common question that many companies are asking of themselves and of their data in order to drive growth.

At Vox Media, we are continually striving to better understand our audience and deliver the content that is most engaging to them. We will discuss the elements of designing a good “customer database” in order to extract maximum value for the business.

Clickstream tracking can generate millions to billions of records of complex data on a daily basis. While it is extremely valuable to be able to pass and save custom data fields, the sheer volume of data often makes it difficult to extract even basic insights, not to mention implementing advanced analytic or machine learning models.

Creating a derived database from clickstream data is a valuable first step in understanding one’s customer base. If well conceived and constructed, this dataset can become the foundation upon which analysts can quickly arrive at insights and data scientists can build models.

Amit Bhattacharyya
Head of Data Science, Vox Media

Getting Models from Development to Production

3:45-4:30 p.m. | May 31

Building a high-performing model that’s been tested, iterated on and validated for production is only half the battle. Translating models from the development environment where they were built to actually run in production — whether that means serving an API, web app, or report that influences a business decision — can often be a challenge in itself. In this session, Randi will talk about the programs and mechanisms Dell is putting in place to support systematic model deployments, starting with an initial pilot project to deploy machine learning models to 200 users, and scaling rapidly from there.

Topics to be covered include:

  • Technological decisions, limitations and workarounds
  • Integrating with varied endpoints including APIs, web apps and cloud servers
  • Usage metrics to track success
  • Stakeholder feedback and alignment
  • Results to date
  • Long-term plans and lessons learned

Randi Ludwig
Data Scientist, Dell

Jupyter Best Practices and Interoperability with Other Tools

4:30-5:15 p.m. | May 31

Jupyter users love it for exploring data, testing ideas, and writing code, and they’re used to sharing work and ideas in notebooks. But few people appreciate how powerful Jupyter is for generating interactive dashboards. Mac Rogers shares best practices for creating Jupyter dashboards and some lesser-known tricks for building dashboards that are interactive and attractive, appeal to nontechnical users, and provide a permanent record of any information delivered. He’ll also talk through ways of leveraging Jupyter alongside other popular modeling tools.

Mac Rogers
Research Engineer, Domino Data Lab

Leadership Track Overview

Data Responsibility: Positive Social Impact Through Ethical Data Science

11:15 a.m.-12:00 p.m. | May 30

Recently the news has been full of stories about data science being used in unethical ways, and there are countless discussions of how data science may pose a risk to society. Though these scary stories grab headlines, significant progress towards data responsibility is being made. Data science is being applied by both the public and private sectors to produce societal benefits. At the same time, data ethics and standards are being established and adopted. This panel will discuss the current state of data responsibility across the domains of non-profits, for-profits, and government. Attendees will leave with practical guidance for how organizations and individuals can contribute to and benefit from data responsibility efforts.

Lisa Green
Social Impact & Public Policy, Domino Data Lab
Natalie Evans Harris
COO, BrightHive
Chad Wilsey
Director of Conservation Science, National Audubon Society
Margit Zwemer
VP of Systematic Active Equities, BlackRock

Workshop: Best Practices for Managing the Data Science Lifecycle

1:15-2:45 p.m. | May 30

In this workshop, Domino Director of Product Mac Steele will walk through a framework for successfully managing data science in the enterprise that covers people, process, and technology. We will step through the key stages of the data science lifecycle, from ideation through to delivery and monitoring, discussing common pitfalls and best practices in each based on Domino’s experience working with leading data science teams. Attendees will be provided with examples of Domino’s Lifecycle Assessment and be guided through an interactive exercise to evaluate the bottlenecks in their own organizations. They will leave with a customized physical artifact that can be used to prioritize investment in hiring, process management, or technology acquisition.

Mac Steele
Director of Product, Domino Data Lab

Operationalize Data Science on the Journey to a Model-driven Enterprise

3:00-3:45 p.m. | May 30

We are at the beginning of the 4th industrial revolution. Companies that do not transform their businesses to become fully data-driven and empowered by Machine Intelligence models will, simply put, not survive. In this session, we will define what a model-driven, agile enterprise looks like and illuminate its strengths. We’ll discuss why it’s so hard to become model-driven, with a particular focus on the organizational change management that must happen in order to operationalize the models and realize the ROI. We’ll provide stepping stones and best practices that can be applied to transform your organization’s culture into one that harnesses data and benefits from models.

Nir Kaldero
Head of Data Science, VP, Galvanize

Panel: Advice for Growing and Managing a Team

3:45-4:15 p.m. | May 30

Growing and managing any team is hard. Growing and managing a team of superstar data scientists can be orders of magnitude more challenging — their work is different, the role is relatively new, and best practices for the function are still being established. In this panel, data science leaders will discuss their experiences and share advice for:

  • Finding and hiring data scientists, and learning how to modify / tweak hiring criteria and best practices as the team grows
  • Retaining talent through motivation and effective management of career expectations
  • Establishing a data science team culture that’s focused on working collaboratively and delivering results that impact the business
  • Learning how to balance skill sets across the team to maximize yield
  • Implementing technologies and processes that will help the team scale without stifling data scientists’ ability to innovate
  • Acting as the data science liaison for the business

Michelangelo D’Agostino
Senior Director of Data Science, ShopRunner
Patrick Phelps
Lead Data Scientist, Insight Data Science
Conor Jensen
VP of Strategic Execution, AmTrust
Carlo Torniai
Global Director, Digital Product Development, Pirelli

Differentiating By Data Science

11:00-11:45 a.m. | May 31

Companies employ various means of differentiation in order to gain a competitive advantage in the market. Traditional differentiators include network economies, branding, economies of scale, and so on. But the availability of data and compute resources, combined with the emergence of new business models, have enabled data science to become a strategic differentiator for some companies.

Eric Colson explores what it means to differentiate by data science and explains why companies must now think very differently about the role and placement of data science in the organization. If data science is going to be part of your competitive strategy, it warrants rethinking how the company is organized, how it defines its roles, and how it attracts and retains top talent.

Topics include:

  • Why the Data Science team should report to the CEO
  • Why data science is different from other departments like engineering, finance, or marketing
  • How to create compelling roles for your data science team
  • How to foster innovation without structured programs
  • Considerations for measuring data science talent
  • The role of the data platform team in enabling your data scientists

Eric Colson
Chief Algorithms Officer, Stitch Fix

Data-(Science)-Driven

11:45 a.m.-12:30 p.m. | May 31

Despite the rise of data engineering and data science functions in today’s corporations, leaders report difficulty in extracting value from data. Many organizations aren’t aware that they have a blind spot with respect to their lack of data effectiveness, and hiring experts doesn’t seem to help. Instead, one of two organizational outcomes is common:

(1) Expert teams may grow increasingly isolated and begin to overfocus on publishing for an academic audience, thereby neglecting the needs of their parent organization.

(2) The data science function is diluted by embedding experts in teams that lack the requisite knowledge to direct their work, rewarding output that is more familiar to the team and failing to reward the very work that would make the best use of the expert’s skills.

In both situations, the result hampers the ability of the organization to maximize the impact and value of their data. This session examines what it takes to build a truly data-driven organizational culture and highlights a vital, yet often neglected, job function: the data science manager.

Cassie Kozyrkov
Chief Decision Scientist, Google

Aligning Data Science with IT

1:45-2:30 p.m. | May 31

A common gap that data science teams must overcome on the path to making data science a core business function is integrating their people, processes, and technologies into the rest of the business. In particular, friction often manifests between data science and IT; they use different tools and technologies, and their workflows are equally incongruous. Key to success is effective communication between stakeholders that drives alignment on shared standards and processes. This session will go through Trupanion’s 24-month journey developing its data science program. It will include lessons learned in gaining not only buy-in, but excitement, from operational teams and IT on the way to integration into core business processes.

TJ Houk
Chief Data Officer, Trupanion

Panel: Internal Practices Facilitating Data Science Collaboration and Faster Innovation

2:30-3:15 p.m. | May 31

In this panel discussion, our speakers will share and discuss best practices for building data science organizations that operate at the business core. Topics will include how to manage data science as a product, integrating domain experts into the process, driving alignment among stakeholders, and progressing your career development as a data science leader.

Nancy Hersh
Independent Advisor, Data Science and Analytics
Elena Grewal
Head of Data Science, Airbnb
Patrick Harrison
Associate Director, Data Science, S&P Global
Sivan Aldor-Noiman
Vice President, Head of Data Science, Wellio

Build a Data Science Flywheel with Better Knowledge Management

3:45-4:30 p.m. | May 31

As data science teams scale, they generate a growing body of insights and knowledge that often isn’t adequately captured, stored, or leveraged. This leads to re-work, frustration, and missed opportunities for research breakthroughs. This session describes why knowledge management in data science faces the same obstacles as other disciplines, plus its own unique challenges. Based on extensive experience managing quantitative research teams in the asset management world, the two speakers will offer advice to transform knowledge management from an afterthought to a competitive advantage, including changing incentive structures and process management. Attendees will leave with practical steps to create a data science team that accelerates its output with scale, rather than succumbing to complexity.

Matthew Granade
Chief Market Intelligence Officer, Point72

General Session Keynotes

Bad Algorithms & The Ethical Matrix

9:10-9:55 a.m. | May 30

Algorithms can embed bias, they can propagate or even exacerbate inequality, or they can just be plain inaccurate. How do we keep track of all the potential problems? How do we make sure the algorithms we build “work well”? What do we even mean by that? In this talk Cathy O’Neil will introduce the ethical matrix, a construction borrowed from moral philosophy, as a way of organizing our thoughts around important and urgent questions like these.

Cathy O’Neil
Author, Weapons of Math Destruction and mathbabe.org

Advancing the State of Data Science Through Open Source

4:15-4:45 p.m. | May 30

Data science leaders are both excited and concerned about their growing reliance on open source technologies. The best way to ensure smart, sensible use of open source is to be part of the community and to drive its innovation, as opposed to simply benefiting from what others have built. In this session, Wes will discuss open source funding and support models (with an eye towards data science), discussing today’s reality and the ways companies can get involved, either by contributing to projects through code or offering financial support.

Wes McKinney
Creator, Pandas