How to Build a Successful Data Science Workflow

Pareto Intelligence and Root discuss the importance of technology, communication and sound philosophy.

Written by Michael Hines
Published on Aug. 23, 2021

Obtain, scrub, explore, model, interpret.

These five words, better known by the acronym “OSEMN,” provide a general idea of how data science teams work, but ask any data scientist to describe their workflow and they’ll most likely need more than five words or letters to do so. Data science workflows are shaped by a variety of factors, including the questions a team is trying to answer, the technology they work with and their company culture. 
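
For readers new to the acronym, the five stages can be pictured as a simple pipeline of function stubs. The sketch below is purely illustrative: the function names and the pandas and scikit-learn usage are assumptions for the sake of example, not either company's actual workflow.

```python
# Illustrative only: the OSEMN stages as minimal Python function stubs.
# Function names and the pandas/scikit-learn approach are assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression

def obtain(path: str) -> pd.DataFrame:
    # Obtain: pull raw data from a file, database or API.
    return pd.read_csv(path)

def scrub(df: pd.DataFrame) -> pd.DataFrame:
    # Scrub: fix types, drop duplicates, handle missing values.
    return df.drop_duplicates().dropna()

def explore(df: pd.DataFrame) -> pd.DataFrame:
    # Explore: summary statistics and sanity checks before modeling.
    print(df.describe())
    return df

def model(df: pd.DataFrame, target: str) -> LinearRegression:
    # Model: fit a simple baseline to the prepared data.
    X, y = df.drop(columns=[target]), df[target]
    return LinearRegression().fit(X, y)

def interpret(fitted: LinearRegression, df: pd.DataFrame, target: str) -> None:
    # Interpret: report how well the model explains the data.
    print("R^2:", fitted.score(df.drop(columns=[target]), df[target]))
```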

Take Root and Pareto Intelligence, for example. Both companies work with immense amounts of data — Root in the insurtech industry and Pareto in the healthtech space — yet the two take very different approaches to building their data science workflows. Root focuses on the technology that underpins each step, while Pareto places an emphasis on clear communication between members of the data team and other technical colleagues. 

We recently had the chance to sit down with the directors of data science at Root and Pareto for a conversation about their data science workflows that touched on technology, processes, philosophies and so much more.

 

David Thompson
Director, Data Science • Pareto Intelligence

To David Thompson, director of data science at Pareto Intelligence, the key to a stable data science workflow is avoiding interruptions. Needless to say, snoozing Slack messages can only help a team work uninterrupted for so long. For Thompson, long-term stability has come from helping colleagues understand how Pareto’s data science team works, which in turn makes it easier to discuss pushing deadlines to spend more time on research or refining models.

 

Tell us a bit about your technical process for building data science workflows. What are some of your favorite tools or technologies that you’ve used?

Our most successful data science workflows have been built through a less-than-technical process of clear communication and strong conviction. When I say “clear communication,” I mean we really have to focus on the first step of our typical data science workflow and define the problem. In the business of healthcare data, the details of a workflow are highly contextual to the problem being solved or the question being asked. If we do not clearly and completely understand the problem, downstream steps in the workflow, like data collection, data exploration, modeling, validation and reporting, become wasted effort.

When I say strong conviction, I mean we have to stand our ground when we believe a step in the workflow needs more time. In addition to the highly contextual nature of our business, we are also integrated into the product development processes and schedules within our company. Many times we respond to aggressive deadlines, and having conviction sometimes means pleading our case over and over for more time to dive deeper into data exploration or to continue further refining a model.
 

An uninterrupted workflow is a stable workflow.

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

In addition to making sure our communication is clear and that we are standing firm in our convictions, being consistent and predictable in how we build our workflows has been effective in creating workflows that are reproducible and stable. It’s my goal to see my non-data science colleagues accurately describe and outline what our typical data science workflow would be: stating the problem, collecting the data, exploring the data, modeling, validating the model and reporting results. A great deal of transparency and even some internal education has helped my team with this. 

When everyone knows what to expect and is on board with the process, we are less likely to be challenged or blocked when carrying out our workflow. An uninterrupted workflow is a stable workflow. Additionally, comprehensive public documentation is effective in creating reproducible and stable workflows: we call ours the “Mega Guide to All Models.” No one wants to talk about spending time on documentation, but this is absolutely necessary.

 

What advice do you have for other data scientists looking to improve how they build their workflows?

Technologies change, licenses expire and tools have bugs. To improve how you build your data science workflows, you’ve got to shift your focus away from the technical tools used within each step of the workflow and zero in on refining the things that will not change. I believe the most important part of what I mentioned above is making sure your workflow processes are familiar to everyone who might experience them; this will do more for you than chasing the next best thing in technology.

 

Kyle Schmitt
Director of Data Science • Root

It can be tough for data scientists to refine their workflow because any time spent looking inward at how a team works is time not spent working on customer-facing projects. In addition to the wealth of technical advice he has on how to build a successful data science workflow, Kyle Schmitt, director of data science at Root, also offered a practical workaround to this common problem.

 

Tell us a bit about your technical process for building data science workflows. What are some of your favorite tools or technologies that you’ve used?

In building models that predict insurance risk on the basis of telematics sensor data, we might break the process into four phases: feature exploration, feature engineering, modeling and deployment. Feature exploration often involves proof-of-concept evaluations performed at relatively low scale. Data is sourced from our data warehouse on Amazon Redshift or our data lake on Amazon S3, processed in Python (often on SageMaker) and manipulated and visualized with any number of modeling libraries. We are agile at this stage, and if we fail, we try to fail fast.
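
As a rough illustration of this exploration phase, the sketch below pulls a small sample from S3 into pandas for a quick proof-of-concept look at a candidate feature. The bucket, key and column names are hypothetical placeholders, not Root's actual schema.

```python
# Illustrative only: sample telematics data from S3 for low-scale exploration.
# The bucket, key and column names are hypothetical placeholders.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="example-telematics-lake", Key="samples/trips_sample.parquet")
trips = pd.read_parquet(io.BytesIO(obj["Body"].read()))

# Quick proof-of-concept look at a candidate feature before investing in scale.
trips["hard_brake_rate"] = trips["hard_brakes"] / trips["miles_driven"]
print(trips[["hard_brake_rate", "miles_driven"]].describe())
```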

For promising features, we often need to batch process and featurize petabytes of data. To process raw data down to features, we apply distributed map jobs with AWS Batch. The outputs of these jobs are piped into our Aurora-based feature store. To interact with and aggregate the resulting “trip-level” features, we use Athena or Spark.
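
The sketch below gives a hedged sense of what that fan-out and aggregation could look like with boto3; the job queue, job definition, database, table and bucket names are invented for illustration rather than taken from Root's actual configuration.

```python
# Illustrative only: submit a distributed featurization job via AWS Batch, then
# aggregate the resulting trip-level features with an Athena query. The job queue,
# job definition, database, table and bucket names are hypothetical.
import boto3

batch = boto3.client("batch")
batch.submit_job(
    jobName="featurize-trips",
    jobQueue="example-featurization-queue",
    jobDefinition="example-featurizer:1",
    arrayProperties={"size": 1000},  # fan the map job out over 1,000 partitions
)

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="""
        SELECT driver_id, AVG(hard_brake_rate) AS avg_hard_brake_rate
        FROM example_feature_store.trip_features
        GROUP BY driver_id
    """,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```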

Modeling again is performed in a Python environment, sourcing the condensed “modeling file” from the previous step using Athena. Our favorite modeling tools include XGBoost, H2O and TensorFlow. Today, we deploy model objects natively in Python with a command line interface that is accessed by Ruby application code.
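
As a loose illustration of that pattern, a Python model object fronted by a small command line interface might look something like the sketch below; the file paths, column names and CLI flags are hypothetical, not Root's deployment code.

```python
# Illustrative only: train an XGBoost model on a condensed "modeling file" and
# expose scoring through a command line interface that other services (for
# example, Ruby application code) could shell out to. Paths, column names and
# flags are hypothetical.
import argparse

import pandas as pd
import xgboost as xgb

def train(modeling_file: str, model_path: str) -> None:
    df = pd.read_parquet(modeling_file)
    X, y = df.drop(columns=["loss_ratio"]), df["loss_ratio"]
    booster = xgb.train({"objective": "reg:squarederror"}, xgb.DMatrix(X, label=y))
    booster.save_model(model_path)

def score(model_path: str, features_file: str) -> None:
    booster = xgb.Booster()
    booster.load_model(model_path)
    X = pd.read_parquet(features_file)
    print(booster.predict(xgb.DMatrix(X)))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train or score a risk model.")
    parser.add_argument("command", choices=["train", "score"])
    parser.add_argument("--model", required=True, help="Path to the model object.")
    parser.add_argument("--data", required=True, help="Path to the input data file.")
    args = parser.parse_args()
    if args.command == "train":
        train(args.data, args.model)
    else:
        score(args.model, args.data)
```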

 

Invest in a transparent ETL pipeline with carefully positioned intermediate persistence points ... it greatly facilitates diagnostics and future modeling efforts.

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

Try to get an early rough read on performance. Many research efforts won’t pan out, since in most industries much of the low-hanging fruit has already been picked. Therefore, spending a lot of time perfecting analysis code for every project is likely to lead to a lot of carefully documented null results. This may not be the worst thing if you can afford it, but most companies probably want to fail fast and move on to the next idea. If you do find a promising result, take the time to get the “research code” right.  
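
One way to picture such a rough read: a coarse cross-validation that compares a baseline against the same model with the candidate feature added, run before any analysis code is polished. The file and column names below are placeholders, not a prescribed recipe.

```python
# Illustrative only: a quick "rough read" on whether a candidate feature adds lift,
# using coarse cross-validation before any analysis code is perfected.
# The file and column names are placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_parquet("modeling_sample.parquet")
y = df["target"]

baseline = cross_val_score(
    GradientBoostingRegressor(), df[["existing_feature"]], y, cv=3
).mean()
with_candidate = cross_val_score(
    GradientBoostingRegressor(), df[["existing_feature", "candidate_feature"]], y, cv=3
).mean()

# If the candidate adds little, fail fast and move on to the next idea.
print(f"baseline R^2: {baseline:.3f}, with candidate: {with_candidate:.3f}")
```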

Ensure that late-stage analysis is done in an environment consistent with the production environment, like Amazon ECR. Use an internal package repository like JFrog. Extract common code out into shared libraries when reasonable. Finally, invest in a transparent ETL pipeline with carefully positioned intermediate persistence points. While this can take a lot of work up front, if it’s easy for anyone to source high-fidelity data at various points in the pipeline, it greatly facilitates diagnostics and future modeling efforts.
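
To make the idea of intermediate persistence points concrete, here is a minimal sketch of a pipeline in which each stage writes its output to a well-known location so anyone can source data mid-pipeline. The stage names and S3 paths are hypothetical, and the snippet assumes pyarrow and s3fs are available.

```python
# Illustrative only: an ETL pipeline with intermediate persistence points so that
# cleaned and featurized data can be sourced directly for diagnostics or new
# modeling work. Stage names and paths are hypothetical; assumes pyarrow and s3fs.
import pandas as pd

CHECKPOINTS = {
    "raw": "s3://example-pipeline/raw/trips.parquet",
    "cleaned": "s3://example-pipeline/cleaned/trips.parquet",
    "features": "s3://example-pipeline/features/trips.parquet",
}

def clean(raw: pd.DataFrame) -> pd.DataFrame:
    # Drop incomplete and duplicate trips before feature work.
    return raw.dropna(subset=["trip_id"]).drop_duplicates("trip_id")

def featurize(cleaned: pd.DataFrame) -> pd.DataFrame:
    # Derive a simple example feature from the cleaned data.
    return cleaned.assign(hard_brake_rate=cleaned["hard_brakes"] / cleaned["miles_driven"])

def run_pipeline() -> None:
    raw = pd.read_parquet(CHECKPOINTS["raw"])
    cleaned = clean(raw)
    cleaned.to_parquet(CHECKPOINTS["cleaned"])      # persistence point: cleaned data
    features = featurize(cleaned)
    features.to_parquet(CHECKPOINTS["features"])    # persistence point: model-ready features

if __name__ == "__main__":
    run_pipeline()
```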

 

What advice do you have for other data scientists looking to improve how they build their workflows?

Many sound approaches involve a fair amount of tooling. If you have a full-blown data engineering team, great! If not, it can be daunting to take these challenges on. Much data science education centers around algorithms and modeling, whereas deployment is often relegated to an afterthought. Further, you may not feel that you have the latitude to make these investments as opposed to “getting something working.” 

However, sound data science workflows will only become more important in the future, and becoming proficient now will be a boon to your skillset down the road. Also, if you can clearly point to a past effort that had a rocky deployment or exceeded timelines, and can articulate how it could be done better in the future, this will help to get buy-in from stakeholders. 

You don’t have to solve every problem right away. Just try to keep chipping away and keep these factors in mind: 

  1. Estimate value with “rough reads” to narrow to the most promising projects. 
  2. For those promising projects, invest the time to make “research” code good. This is greatly assisted by good package abstraction and ETL tooling. 
  3. In time, you’ll find that the right investment can help narrow the research-production chasm into a manageable crack.

All responses have been edited for length and clarity.
