A Look Inside 3 Chicago Companies’ Data Science Workflows

Whether you're an enterprise organization or a fast-growing startup, you'll need an effective workflow to successfully manage how your data is collected, analyzed and ultimately deployed into production. Data scientists from three Chicago-based companies share the tools and best practices that keep theirs running smoothly.

Written by Jeff Kirshman
Published on Sep. 30, 2022

Data science, at its core, is still a science: a quest to extract meaning from data and improve understanding. Very little is concrete. There’s always more to learn, more to explore and more to discover, with new tools, techniques and methodologies coming to the forefront every day.

Data science is extremely iterative in this way. What works today might not work tomorrow, and what doesn’t work today might have worked yesterday. Yet one constant remains: Whether you’re an enterprise organization or a fast-growing startup, you’ll need an effective workflow to successfully manage how your data is collected, analyzed and ultimately deployed into production.

“In a field as fast-paced as data science, nothing is ever really done,” said Nate Givens, a data czar at Chicago-area logistics company FourKites. “You have to constantly review what you’ve already done in light of new technologies and discoveries. Depending on your outlook, this can be frustrating or exhilarating.”

Givens’ advice? “Embrace it,” he said, and lean into an “always learning” mentality. “If you do, then you will naturally end up with a culture of constant improvement, where everyone is always looking for new solutions and new approaches to doing things better than before — or to do things you couldn’t do before.”

In the spirit of continuous learning, Built In Chicago met with Givens and tech leaders from Tovala and The Options Clearing Corporation to find out which tools, advice and best practices their teams rely on to keep iterating toward successful data workflows.

Read on to discover insights that may inform the next iteration of your own workflow.

 

Mo Naveed
Director of Data • Tovala

 

Tovala is a foodtech company that offers both smart ovens and meals to cook in them.

 

What are your favorite tools for building a data science workflow? 

I’m a fan of Snowflake, dbt, ReTool and DataRobot.

Snowflake allows data scientists at Tovala to move and analyze massive datasets from disparate sources in very little time. It’s a no-nonsense data warehouse that enables us to run a lean engineering team.
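
To make that concrete, here is a minimal sketch of pulling a Snowflake query result straight into pandas with the open-source snowflake-connector-python package; the credentials, warehouse and table names are placeholders, not Tovala's.

```python
"""Hypothetical sketch: query Snowflake and land the result in a pandas DataFrame."""
import snowflake.connector

# Connection details are illustrative; in practice they would come from a secrets store.
conn = snowflake.connector.connect(
    account="my_account",
    user="analyst",
    password="********",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

cur = conn.cursor()
cur.execute(
    "select order_id, oven_id, ordered_at "
    "from orders "
    "where ordered_at >= dateadd(day, -7, current_date)"
)
df = cur.fetch_pandas_all()  # requires the connector's pandas extra
print(df.head())
```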

We use dbt for extract, load, transform projects. There are many pros to dbt, but my favorite is its plug-and-play data validations. These proactively alert us to issues in the data, saving hours of bug-hunting and reanalyzing data.

A model is only as useful as the decisions it enables. ReTool lets us build user interfaces for our less technical colleagues to interact with our models and workflows. This puts decision makers in the driver’s seat and lowers demand for data folks’ time.

DataRobot enables us to quickly train, test and deploy predictive models, and it also helps with data exploration, automatic dataset-shift detection and model-accuracy tracking. This empowers a modest-size team to deliver outsize results for the business and rack up high-impact wins with our stakeholders.

 

What are some best practices for creating reproducible and stable workflows?

A few come to mind … 

Reuse infrastructure: Our servers and orchestration pipelines are configured to handle multiple workflows. This minimizes the number of moving parts and simplifies infrastructure monitoring and maintenance.

Reuse ELT logic: We also use dbt macros religiously and rely on staging tables to half-normalize datasets for use across multiple warehouse/mart/modeling needs. 

Keep things simple: We’ll shun industry standards if we can find a creative workaround. For instance, we orchestrate our data pipelines using Jenkins, which we cannibalized to serve as a virtual cron machine, kicking off dbt DAGs and model run commands as needed; a sketch of that kind of kickoff script appears after this list.

If it ain’t broke, don’t fix it: Open-source automation server Jenkins on EC2 was meant as a holdover until we got around to installing Airflow; 28 months in, it has caused zero issues and been more flexible than most Airflow implementations we have encountered, so we’re sticking with it.

Measure twice, cut once: We scope all new workflows end-to-end before writing any code. We imagine the more creative ways a dataset or workflow might be strategically valuable beyond a minimum viable product. Having an ultimate end state in mind helps us make decisions that scale with time and business needs.
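
Here is a minimal sketch of the cron-style kickoff described above, in the form of a Python wrapper a Jenkins job might invoke; it assumes the dbt CLI is on the PATH, and the `tag:daily` selector and `score_models.py` command are hypothetical placeholders rather than Tovala's actual setup.

```python
"""Wrapper script a cron-style Jenkins job might call to kick off dbt and model runs."""
import subprocess
import sys


def run(cmd: list[str]) -> None:
    # Print the command, then fail the Jenkins build (non-zero exit) if the step errors out.
    print("running:", " ".join(cmd), flush=True)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    try:
        run(["dbt", "run", "--select", "tag:daily"])    # build the daily dbt DAG
        run(["dbt", "test", "--select", "tag:daily"])   # plug-and-play data validations
        run(["python", "score_models.py"])              # hypothetical model run command
    except subprocess.CalledProcessError as exc:
        sys.exit(exc.returncode)
```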

It behooves us to speak at a higher level of abstraction: Not ‘What do you need built?’ but ‘What questions need answers?’”

 

What advice do you have for other data scientists looking to improve their workflows?

Ask “why?” not “what?” Less tech-savvy colleagues sometimes struggle to articulate the actual problems they’re trying to solve. It behooves us to speak at a higher level of abstraction: Not “What do you need built?” but “What questions need answers?” This shift in perspective will prevent unnecessary work — particularly if there are lower-lift ways to answer the questions.

Additionally, data scientists should plan beyond MVP. If you know how your workflows will evolve, you can build with this evolution in mind. You can also set tripwires for when to kick off the next stage of evolution. This eliminates surprises and enables easier scaling down the road.

Data scientists should also revisit their answers. At a minimum, track dataset shift and model accuracy. I recommend documenting all analyses and their results, along with any code or methods you used. This allows you to quickly revisit analyses in the future and confirm that what you know has not changed.
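
A minimal Python sketch of the kind of tracking described above: a two-sample Kolmogorov-Smirnov test flags dataset shift on a feature, and a simple comparison flags accuracy decay. The thresholds, inputs and model object are illustrative assumptions.

```python
"""Illustrative dataset-shift and accuracy checks; thresholds are made up."""
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import accuracy_score


def feature_has_shifted(train_col: np.ndarray, live_col: np.ndarray, alpha: float = 0.01) -> bool:
    # Two-sample Kolmogorov-Smirnov test: has this feature's distribution drifted
    # from what the model was trained on?
    _, p_value = ks_2samp(train_col, live_col)
    return p_value < alpha


def accuracy_has_dropped(model, X_live, y_live, baseline: float, tolerance: float = 0.05) -> bool:
    # Compare live accuracy against the accuracy recorded when the model shipped.
    live_accuracy = accuracy_score(y_live, model.predict(X_live))
    return live_accuracy < baseline - tolerance
```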

Finally, become an expert at deciding what to build versus buy. Consider your team’s capacity and what an hour of your team’s time is worth. You may find that buying a tool that initially gave you sticker shock is actually less expensive than building and maintaining the tool’s functionality in-house.
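
As a back-of-the-envelope illustration of that build-versus-buy math, the comparison might look like the snippet below; every number in it is invented for illustration.

```python
# Hypothetical build-vs-buy comparison; all figures are made up.
build_hours = 500               # initial engineering effort
maintain_hours_per_year = 400   # ongoing upkeep
hourly_cost = 90                # fully loaded cost of an hour of team time, in dollars
vendor_license_per_year = 30_000

build_first_year = (build_hours + maintain_hours_per_year) * hourly_cost
build_later_years = maintain_hours_per_year * hourly_cost

print(f"build, year one:    ${build_first_year:,}")         # $81,000
print(f"build, later years: ${build_later_years:,}")        # $36,000
print(f"buy, every year:    ${vendor_license_per_year:,}")  # $30,000
```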

 

 

FourKites team members having a meeting in the office.

 

Nate Givens
Data Czar • FourKites

 

FourKites provides logistics solutions for end-to-end freight tracking. 

 

What are your favorite tools for building a data science workflow? 

At FourKites, we have a variety of data science use cases, which requires a broad range of industry-standard tools. Of course, this includes scikit-learn, PyTorch and TensorFlow for building and training our models. Spark is also really helpful for data preprocessing and feature engineering, and Kubernetes works very well for model serving. Since our core machine learning models are already performing well, a lot of our recent growth has been around orchestration and automation. On that front, we’ve gotten a lot of utility out of Apache Airflow to help tie the entire ML lifecycle together.
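
For readers unfamiliar with how Airflow ties those stages together, here is a hedged sketch of a simple DAG chaining preprocessing, training and deployment; the task bodies and names are placeholders, not FourKites code.

```python
"""Illustrative Airflow DAG chaining the ML lifecycle; all names are hypothetical."""
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def preprocess() -> None:
    ...  # e.g., kick off a Spark job for feature engineering


def train() -> None:
    ...  # e.g., fit a scikit-learn, PyTorch or TensorFlow model


def deploy() -> None:
    ...  # e.g., roll the new model out to Kubernetes for serving


with DAG(
    dag_id="ml_lifecycle",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    preprocess_task = PythonOperator(task_id="preprocess", python_callable=preprocess)
    train_task = PythonOperator(task_id="train", python_callable=train)
    deploy_task = PythonOperator(task_id="deploy", python_callable=deploy)

    preprocess_task >> train_task >> deploy_task
```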

 

What are some best practices for creating reproducible and stable workflows?

By now, the promise and potential of data science techniques and machine learning are pretty well understood. Still, there’s a gap between building a model that tests well and integrating that model into products and services to deliver concrete value to end users. This is why MLOps, an approach that applies established DevOps concepts to deploying and maintaining ML models, has been growing in prominence recently. Monitoring and automation (for example, to kick off retraining and redeployment) are the elements of MLOps that do the most to create reproducible and stable workflows.
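
As a hedged illustration of that monitoring-and-automation loop, the sketch below checks a live accuracy metric on a schedule and, when it falls below a placeholder threshold, triggers the hypothetical "ml_lifecycle" retraining DAG from the previous sketch; none of the names, numbers or metric sources come from FourKites.

```python
"""Illustrative monitoring DAG that triggers retraining when accuracy degrades."""
from datetime import datetime

from airflow import DAG
from airflow.operators.python import ShortCircuitOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

ACCURACY_FLOOR = 0.90  # hypothetical threshold


def fetch_live_accuracy() -> float:
    # Placeholder: in practice this would query a metrics store or monitoring API.
    return 0.87


def model_has_degraded() -> bool:
    # ShortCircuitOperator skips downstream tasks when this returns False,
    # so retraining is triggered only when accuracy has actually dropped.
    return fetch_live_accuracy() < ACCURACY_FLOOR


with DAG(
    dag_id="model_monitoring",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    check = ShortCircuitOperator(task_id="check_accuracy", python_callable=model_has_degraded)
    retrain = TriggerDagRunOperator(task_id="trigger_retraining", trigger_dag_id="ml_lifecycle")

    check >> retrain
```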

There’s a gap between building a model that tests well and integrating that model into products and services.”

 

What advice do you have for other data scientists looking to improve their workflows?

Any technical advice we give today could be out of date tomorrow, but if you have a positive mindset, you will be in a great place to improve your workflows today and tomorrow, too.

 

 

Callista Christ
Senior Associate, Data Analytics Engineering • OCC

 

The Options Clearing Corporation operates as an equity derivatives clearing organization and the foundation for secure markets.

 

What are your favorite tools for building a data science workflow? 

A version-control system is essential, especially if you are collaborating with other data scientists. Because our team is currently more focused on data engineering, we use open-source tools aimed at validating data and building reusable data models. And because we work in a high-security environment, we also rely heavily on security scanning tools to ensure we are not introducing any known vulnerabilities into our systems. All of these tools provide good avenues of support for getting questions answered quickly and do not introduce any significant vulnerabilities into our workflow.

 

What are some best practices for creating reproducible and stable workflows?

Building out a continuous integration and continuous deployment (CI/CD) pipeline for your work is a good way to create reproducible and stable workflows that all members of your team can benefit from. For example, as soon as we make changes to our codebase, the changes are automatically piped through a series of unit tests and security scans. Once all of the tests pass, the code is uploaded and can then be deployed using a CD tool. Every change we make automatically flows through this exact process, which gives our team peace of mind that the changes we make will not have any negative or unexpected effects.

Data scientists have a responsibility to truly know the data before using it in downstream tasks like machine learning.”

 

What advice do you have for other data scientists looking to improve their workflows?

I strongly suggest incorporating a data-validation tool into your workflow. Writing data validation tests will encourage you to examine your data carefully and will give you confidence that any data manipulation you do will produce the expected results. Data scientists have a responsibility to truly know the data before using it in downstream tasks like machine learning, and data validation software can help with this.
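
As one concrete, and hypothetical, example of such tests, here is a small schema written with the open-source pandera library; OCC's actual tooling is not named above, and the column rules are purely illustrative.

```python
"""Illustrative data-validation schema using pandera; columns and rules are made up."""
import pandas as pd
import pandera as pa

trade_schema = pa.DataFrameSchema(
    {
        "trade_id": pa.Column(str, unique=True, nullable=False),
        "quantity": pa.Column(int, checks=pa.Check.gt(0)),
        "price": pa.Column(float, checks=pa.Check.ge(0)),
        "side": pa.Column(str, checks=pa.Check.isin(["BUY", "SELL"])),
    }
)


def validate(df: pd.DataFrame) -> pd.DataFrame:
    # Raises pandera.errors.SchemaError if any rule fails, stopping bad data
    # before it reaches downstream models.
    return trade_schema.validate(df)
```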

Additionally, building out a CI/CD pipeline for your workflow can be beneficial for exactly the reason mentioned above: It creates reproducible and stable pipelines.

 

 

Responses have been edited for length and clarity. Images via listed companies and Shutterstock.
