How to Build a Successful Data Science Workflow

Written by Madeline Hester
Published on Apr. 13, 2020

In a less mature industry like data science, there aren’t always textbook answers to problems. When undertaking new data science projects, data scientists must weigh the specifics of the project, past experiences and personal preferences when setting up the source data, modeling, monitoring, reporting and more.

While there’s no one-size-fits-all method for data science workflows, there are some best practices, like taking the time to set up auto-documentation processes and always conducting post-mortems after projects are completed to find areas ripe for improvement.

Stefon Kern, manager at The Marketing Store, said he focuses on evaluating the utility and business value of what his team has built.

“More important than any specific tool or technology is proactively, consciously adopting a mindset that’s focused on continual evaluation and optimization,” Kern said.

 Data Science Coach Ben Oren of Flatiron School agreed on the importance of assessing each project after the fact. Improving data science workflows often occurs at the consolidation step, so after every project, he documents what was done, considers where the problems and inefficiencies crept in, and imagines ways to improve processes for the future. 

Data scientists across Chicago tweak their data science workflows to align with what works best for their teams and businesses. What other best practices are they using to optimize their data workflows? Code reviews, collaboration between data scientists and data engineering teams and agile environments are just a few examples.

 

Stefon Kern
Manager, Data Science • tms

When data science projects are finished, the post-mortem phase begins, Kern said. This assessment period allows the team to identify areas for improvement rather than putting completed work aside. As potential problems arise in the future, Kern’s team will save time by already being aware of its weak spots.

 

Tell us a bit about your technical process for building data science workflows.

Our data science projects begin with a dedicated R&D phase. Most likely, whatever you’re building has aspects that others in the field have already developed. It’s important to be aware of the latest and greatest methodologies, tools and resources out there. Then you can strategically decide whether or not you need to reinvent the wheel.

By the end of this initial R&D phase, we have our objectives locked down and we have identified the approach and resources needed to achieve our end goal, including staffing, tech resources and data requirements. Next, we begin building a “minimally working” version of the product. For this stage, we use real data, following development best practices and building a viable workflow. 

Once we’re satisfied with the initial build, we enter a phase dedicated to scaling, testing and optimizing. While we favor Python and Apache Spark when it comes to enterprise-wide, large-scale data science workflows, we have an agnostic, needs-based approach to technology, and often leverage many of the specialized statistical packages developed by the research community.

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

I strongly recommend taking the time to set up an auto-documentation process: Python’s Sphinx package, for instance, will automatically extract information from docstrings to create a perfectly formatted, easy-to-navigate HTML-based documentation page. 
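Sphinx’s `autodoc` and `napoleon` extensions, for example, can turn a well-formed docstring directly into HTML documentation. A minimal sketch, where the function and its fields are hypothetical:

```python
def load_events(path, min_date=None):
    """Load raw event data from a CSV file.

    Args:
        path (str): Location of the source CSV file.
        min_date (str, optional): ISO date; earlier rows are dropped.

    Returns:
        list[dict]: One dictionary per surviving event row.
    """
    ...
```

With `sphinx.ext.autodoc` and `sphinx.ext.napoleon` enabled in `conf.py`, running `sphinx-build` over a package of such docstrings produces the navigable HTML pages Kern describes.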

Other important best practices include setting up self-contained virtual environments, appropriate project structure, modularized code, unit tests and quality-control checks. Moreover, once you’ve developed your own best practices and process guardrails, it’s critical to provide your team with training and periodic reminders to ensure that the best practices truly become your team’s practices. 

I strongly recommend taking the time to set up an auto-documentation process.

 

What advice do you have for other data scientists looking to improve how they build their workflows?

More important than any specific tool or technology is proactively, consciously adopting a mindset that’s focused on continual evaluation and optimization. When a project is complete, there can be a tendency to set it aside and not look at it again until a problem arises. To avoid that trap, I recommend conducting post-mortems and assessments, with a focus on evaluating the utility and business value of what you’ve built, and identifying things that can be improved.

 

Rene Haase
Vice President of Data Engineering • Integral Ad Science

When tackling a new project, Vice President of Data Engineering Rene Haase integrates the data science and data engineering pods in order to get as many perspectives as possible. Together, they educate each other and brainstorm on how to overcome development and deployment challenges with their machine learning models. At Integral Ad Science, this collaboration starts at recruiting, where members from both teams interview potential candidates.

 

Tell us a bit about your technical process for building data science workflows. 

In order to understand IAS’s data science platform setup, it’s important to understand the scale we are operating at: Our systems are required to process about 4 billion events per hour. Those volume requirements influence the type of tools IAS leverages when developing data science workflows. 

For developing models, we leverage H2O Driverless AI quite extensively, and we use Spark ML for training data pipelines. We are also experimenting with TensorFlow and building prototypes with Neo4j for GraphDB; we’re using the latter to build a fake news machine learning model. We use Airflow for orchestration, Hive and Impala for data exploration, and Jupyter and Zeppelin notebooks for analytics.
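Airflow’s core abstraction is a DAG of tasks with explicit dependencies, executed in topological order. Stripped of scheduling and retries, the idea can be sketched with the standard library’s `graphlib`; the task names below are hypothetical, not IAS’s actual pipeline:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical names).
dag = {
    "extract_events": set(),
    "train_model": {"extract_events"},
    "evaluate_model": {"train_model"},
    "publish_report": {"evaluate_model", "train_model"},
}

def run(dag, tasks):
    """Execute every task in dependency order, as an orchestrator would."""
    order = list(TopologicalSorter(dag).static_order())
    results = [tasks[name]() for name in order]
    return order, results

order, results = run(dag, {name: (lambda n=name: f"ran {n}") for name in dag})
```

An orchestrator like Airflow layers scheduling, retries and monitoring on top of exactly this dependency-ordering core.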

In our latest project, we’re building streaming data pipelines with Apache Flink and experimenting with building H2O Driverless AI RESTful APIs, which our streaming pipelines will leverage. The challenge is to build models that are not just accurate but also operationally efficient enough to support our hourly throughput requirements. We are partnering with experts in the area of streaming and with AWS, which represent great learning opportunities for our teams.

It’s important to build a strong data science community in your company.

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

We are not just handling large amounts of data, but also maintaining several hundred machine learning models. Managing that many models requires automating the data science SDLC. As such, we persist models in ModelDBs, keep feature development code in GitHub, run automatic monitoring, and build plenty of dashboards describing the health of our models and systems.

We’ve also found that having a designated data engineering group support the data science organization is highly beneficial. The members of that group are mostly aspiring data scientists who have a good understanding of data science concepts and are honing their skills in productionizing data science models. Our data science engineering group partners closely with our data scientists, with the two educating each other and brainstorming on how to overcome common challenges in developing and deploying machine learning models. At IAS, this collaboration starts at recruiting, where members from both teams interview potential candidates.

Lastly, it is very important to “know” your data, which means exporting model evaluation metrics, monitoring training data, developing ways to detect and respond to shifts in your data, and retaining training data for as long as possible.
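One way to “detect and respond to shifts in your data” is a simple statistical check comparing live feature values against the training distribution. A minimal sketch; the three-standard-error threshold and the toy data are illustrative choices, not IAS’s actual method:

```python
import statistics

def mean_shift(train, live, threshold=3.0):
    """Flag drift when the live mean sits more than `threshold` standard
    errors from the training mean (a crude z-test, illustrative only)."""
    mu, sigma = statistics.mean(train), statistics.stdev(train)
    standard_error = sigma / len(live) ** 0.5
    z = abs(statistics.mean(live) - mu) / standard_error
    return z > threshold

train = [10.0 + 0.1 * (i % 7) for i in range(500)]  # stable training feature
steady = train[:100]                                 # live data, no change
shifted = [x + 2.0 for x in steady]                  # upstream logic change
```

Production monitoring typically adds per-feature checks, windowing and alerting on top of a core comparison like this one.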

 

What advice do you have for other data scientists looking to improve how they build their workflows?

It’s important to build a strong data science community in your company. That means that members need to be comfortable receiving and providing constructive feedback. This will allow your data scientists to continue to grow and build cutting-edge solutions.

Software and data science models need to be maintained and enhanced wherever possible. We strive as a team to constantly evolve. Failing to maintain models for too long will mean that you fall behind and accumulate technical debt. A mature data science organization continually keeps up with new technologies and finds ways to improve both the software and the models.

 

Matt Levy
Senior Consultant and Data Science Practice Lead • Analytics8

Not every client is ready for a data science implementation, according to Matt Levy, senior consultant and data science practice lead at Analytics8. To produce real business value, Levy recommends spending more time understanding the business problems at hand and what you really want to predict before diving into a solution. When projects are data-science ready, clients’ needs and preferences dictate the technologies used on a case-by-case basis.

 

Tell us a bit about your technical process for building data science workflows. 

Analytics8 practices what we call “ethical data science.” We know that data science can result in quick failure without proper planning and preparation. We spend time upfront ensuring our clients are data-science ready, and that their projects will bring value. We help them understand the ramifications of machine learning-based decisions, and avoid bias when building models. 

We are not afraid to tell our customers that they are not ready for data science implementation if we can’t say with integrity that they have the right level of data maturity, or have identified a project that will bring business value.

While we use traditional and more modern tools to prepare, profile and model the data, our clients’ needs and preferences dictate the technologies we use on an individual basis. Our favorite implementations are those where machine learning models are deployed into customers’ existing BI reporting platform, so users gain richer analysis and more insight inside the tools they already know.

We are not afraid to tell our customers that they are not ready for data science implementation. 

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

Everyone wants to implement the “buzzworthy” part of data science: particularly, building and deploying machine learning and AI into their organization. However, not everyone realizes how critical data preparation and data profiling are to data science success. If your organization is not truly data-science ready, then you are doing a huge disservice to outcomes and your bottom line. Invest time in organizing, cleansing and democratizing your data into “one true source” that provides accurate and reliable information. 

Once this is done, make sure to go beyond basic exploratory data analysis and get a better understanding of what your data is telling you, and what features you might be able to unlock with some upfront discovery.
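As a toy illustration of that upfront discovery, a derived feature can track the target far better than the raw columns it was built from. The data and feature names below are invented for the sketch:

```python
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation, no third-party dependencies."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented example: neither raw column tracks the target as well as
# the derived spend-per-visit feature does.
visits = [5, 20, 4, 25, 10, 2, 18, 8]
spend = [50, 60, 44, 70, 80, 30, 54, 72]
target = [9.5, 3.2, 11.1, 2.9, 7.8, 14.6, 3.1, 9.3]
spend_per_visit = [s / v for s, v in zip(spend, visits)]
```

Systematically testing such derived features during EDA is one way to “unlock” predictive signal before any modeling begins.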

 

What advice do you have for other data scientists looking to improve how they build their workflows?

Spend more time understanding the business problems at hand and what you really want to predict before diving into a solution. 

Creating a perfect model for a question that might not need to be answered, or for which the information quality is low, won’t bring you the most business value and will likely result in failure. The more you emphasize solving true business problems, the better off you will be downstream in your workflow.

 

Rossella Blatt Vital
Director of Data Science and Analytics • Nordstrom

Rossella Blatt Vital said her team at Nordstrom Trunk Club strives to build data models that outperform their predecessors. Getting there, Blatt Vital said, requires maintaining an agile and collaborative environment, along with researching technical topics and holding code and design reviews to maximize the stability of their models.

 

Tell us a bit about your technical process for building data science workflows.

It’s important to understand the business problem and frame it in a data science context. We work closely with business stakeholders and the data engineering team to identify, collect and create the data needed. This often requires the use of multiple tools depending on the nature of the data (e.g., SQL, Python, Amazon S3, HDFS cluster and Cloud). 

The next phase is data processing and exploratory data analysis (EDA). This is where we explore the available data to gain relevant insights and best approaches moving forward. The insights collected during this phase are then used for model building and tuning where we use a variety of ML frameworks and Python packages (Spark, Scikit-learn and TensorFlow). 

We like to start with simpler modeling approaches, then add complexity in subsequent iterations and evaluate how the model’s performance changes as complexity is added. We consider this a paramount step that we perform throughout the workflow and double down on once the final model has been selected. The result is an iteratively refined model we deploy into production and leverage for various products.
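The “start simple, then add complexity” loop can be made concrete with a mean-value baseline followed by a hand-rolled linear fit on toy data. This is a sketch of the evaluation habit, not Trunk Club’s actual models:

```python
import statistics

def mae(pred, actual):
    """Mean absolute error, the evaluation metric for both iterations."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)

# Toy series with a linear trend plus alternating noise (invented data).
x = list(range(20))
y = [3.0 * xi + 5.0 + (1.0 if xi % 2 else -1.0) for xi in x]

# Iteration 1: the simplest possible model, predict the mean everywhere.
baseline = [statistics.mean(y)] * len(y)

# Iteration 2: add complexity only after measuring the baseline.
mx, my = statistics.mean(x), statistics.mean(y)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
    (xi - mx) ** 2 for xi in x
)
linear = [my + slope * (xi - mx) for xi in x]
```

Each added layer of complexity only survives if it beats the previous iteration on the same metric.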

It is crucial to maintain an agile and collaborative environment when looking at your workflow.

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

When building a data product, we strive to focus on building the right thing and building it right. We invest in cross-functional collaboration and embrace iterative data development. We believe that adopting a champion challenger approach during deployment, model development and production makes a huge difference. 
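A champion-challenger deployment typically routes a small, stable slice of traffic to the new model so the two can be compared on live data. One common sketch uses deterministic hashing; this is illustrative only, not Trunk Club’s implementation:

```python
import hashlib

def assign_model(user_id: str, challenger_pct: int = 10) -> str:
    """Hash each user into a stable bucket so the same user always sees
    the same model; a small slice goes to the challenger."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"

assignments = [assign_model(f"user-{i}") for i in range(1000)]
```

Because assignment is deterministic, the comparison stays clean across sessions: the challenger is promoted only if its live metrics beat the champion’s.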

We strive to build a better model that outperforms the previous model built. We start this process by researching these topics to expand our technical horizons. Code reviews and design reviews are done throughout the process to maximize the stability of our models. 

 

What advice do you have for other data scientists looking to improve how they build their workflows?

It is crucial to maintain an agile and collaborative environment when looking at your workflow, replacing hacky solutions with robust, reproducible ones every sprint. The key is to always stay curious, be open to learning, and be able to bounce ideas off cross-functional teams. Doing so boosts the quality and velocity of the data product workflow.

Our advice to machine learning leaders: establish strong communication within your teams, and make it known that mistakes are opportunities to learn and become more experienced data scientists.

 

Ben Oren
Data Science Coach • Flatiron School

Every data science workflow at Flatiron School begins with the repo, Oren said, specifically using the Cookiecutter Data Science tool on GitHub. Cookiecutter generates directories tailored to any given project so all engineers can be on the same page. From there, Luigi helps with workflow management, while enterprise tools like Tableau and open-source tools like Plotly and Flask help create visualizations.

 

Tell us a bit about your technical process for building data science workflows. 

It all starts with the repo. I always start with Cookiecutter Data Science, a scalable tool that can generate a repo structure tailored to a given project while automatically including necessary directories and conforming to a general type. For both big data and local work, Luigi is a flexible workflow management system that can combine simple tasks into complex ETL and model-building processes. 
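Luigi’s central idea is that each task declares an output target and is skipped when that target already exists, which is what makes long pipelines cheap to re-run. The idea in miniature, with hypothetical file names and without Luigi’s scheduler:

```python
import json
import tempfile
from pathlib import Path

def run_task(output: Path, build):
    """A task declares an output target and is skipped if it already
    exists, so pipelines can be re-run safely from any point."""
    if output.exists():
        return "skipped"
    output.write_text(build())
    return "ran"

workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.json"
clean = workdir / "clean.json"

first = run_task(raw, lambda: json.dumps([3, 1, 2, 2]))
second = run_task(clean, lambda: json.dumps(sorted(set(json.loads(raw.read_text())))))
rerun = run_task(raw, lambda: json.dumps([]))  # target exists: no work repeated
```

Luigi adds dependency declaration, scheduling and failure handling around this target-checking core.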

To round things out with visualization, enterprise tools like Tableau offer easy end-user deployment, dynamic capabilities and integration with coding languages, but they might not be worth the investment compared with stable open-source alternatives such as Plotly combined with Flask (or ggplot2, with a little elbow grease).

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

Data scientists are generally terrible at and fearful of unit testing, but it’s the backbone of reproducibility and stability. Ensuring that the inputs and outputs for the processes you've built are what you expect them to be is as important for building models as building websites. Beyond making robust unit tests, what I’ve found most effective for stability and reproducibility is building up libraries of common tasks. When the wheel isn’t constantly being reinvented, data cleaning, feature engineering, cross-validation and model tuning go faster and processes between collaborators become standardized.
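A unit test for a data process is often just an assertion that known-bad inputs are handled the way you expect. A small sketch with a hypothetical cleaning function:

```python
def clean_ages(rows):
    """Drop records with missing or impossible ages, coercing strings."""
    cleaned = []
    for row in rows:
        try:
            age = int(row["age"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric age
        if 0 <= age <= 120:
            cleaned.append({**row, "age": age})
    return cleaned

def test_clean_ages_filters_bad_input():
    rows = [{"age": "42"}, {"age": None}, {"age": -5}, {"age": 200}, {}]
    assert clean_ages(rows) == [{"age": 42}]

test_clean_ages_filters_bad_input()
```

Once such functions live in a shared library, the same tests also standardize behavior between collaborators, which is the point Oren makes next.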

Data scientists are generally terrible at and fearful of unit testing, but it’s the backbone of reproducibility and stability.

 

What advice do you have for other data scientists looking to improve how they build their workflows?

Remember to document and iterate the workflow, the same way as everything else. There can be a tendency on data science projects to focus on the expansion step of engineering a project: thinking through different approaches to a data set, creatively solving problems that arise, etc. But improving the process across projects, especially at a workflow level, often comes at the consolidation step: documenting what was done, considering where the problems and inefficiencies cropped up and imagining ways to improve for the future. 

Take the time to write out workflow steps as they’re happening, highlight inefficient moments and return to them after the project is done to think through alternatives.

 

Responses have been edited for length and clarity.