Data Scientists Share What Technologies They’re Using to Build Their Data Pipelines — and Why

Written by Madeline Hester
Published on Jan. 16, 2020

Before SpotHero was founded in 2011, finding a good parking spot meant crossing fingers and circling the parking garage. Today, SpotHero operates in parking garages nationwide (more than 1,000 just in Chicago), airports and stadiums. 

And those thousands of parking spots mean one thing for Director of Data Science Long Hei: terabytes of data.

With a $50 million Series D funding secured in August 2019, SpotHero began expanding its digital platform and deepening its technology stack to optimize parking throughout North America. The company also invested in hiring new talent and adding features to its single protocol software. But Hei still had to interpret the data and scale for SpotHero’s future. 

“To facilitate this,” Hei explained, “we have moved our raw data out of Redshift into S3, which allows us to scale the amount of data almost infinitely.”

We asked Hei and Mastery Logistics’ Lead Machine Learning Engineer Jessie Daubner which tools and technologies they use to build data pipelines and what steps they’re taking to ensure those pipelines continue to scale with the business. Because whether these companies are making parking more seamless or reimagining freight technology, as Mastery Logistics is, one thing is certain: data is king.

 

Mastery Logistics Systems

Mastery Logistics Systems helps freight companies reduce waste by arming them with software that makes it more efficient to move goods from one place to another. Lead Machine Learning Engineer Jessie Daubner explained how her team harnesses Snowflake to deliver insights to their customers faster.

 

What technologies or tools are you currently using to build your data pipeline, and why did you choose those technologies specifically?

Since we’re an early-stage startup with a small team, we had a greenfield opportunity to evaluate the latest tools to build a modern data stack. As a result, we’ve built our analytics layer and initial data pipelines in Snowflake using an ELT pattern. 

This has enabled us to deliver insights to our customers faster by using Snowflake to directly consume data from Kafka topics and empower our data science team to focus on delivering insights rather than the DBA work typical of building new data infrastructure. 

We also use Fivetran, a managed data ingestion service, to sync data from our SaaS application and other third-party data sources like Salesforce, so that new transaction data is available for analysis across the organization with a delay of as little as 10 minutes. Lastly, our team uses dbt (data build tool) to transform data for analysis and visualization, which has enabled our team of engineers and analysts to bring software engineering best practices into our analytics workflow.
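Mastery’s actual pipelines aren’t shown in the interview, but the ELT pattern Daubner describes, where raw data lands in Snowflake first and is transformed inside the warehouse, can be sketched with the snowflake-connector-python library. The account, warehouse, table and column names below are hypothetical placeholders, not Mastery’s objects.

```python
# Minimal ELT sketch with snowflake-connector-python (pip install snowflake-connector-python).
# Credentials, warehouse, and table names are placeholders for illustration only.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ANALYTICS_WH",
    database="RAW",
    schema="KAFKA",
)
cur = conn.cursor()

# "EL": raw events land as semi-structured JSON (e.g., from a Kafka topic or Fivetran sync)
# in VARIANT columns, with no transformation before load.
cur.execute("""
    CREATE TABLE IF NOT EXISTS RAW.KAFKA.SHIPMENT_EVENTS (
        record_metadata VARIANT,
        record_content  VARIANT
    )
""")

# "T": the transformation runs inside the warehouse as SQL -- the step a tool like dbt manages.
cur.execute("""
    CREATE OR REPLACE VIEW ANALYTICS.MARTS.SHIPMENTS AS
    SELECT
        record_content:shipment_id::STRING   AS shipment_id,
        record_content:status::STRING        AS status,
        record_content:updated_at::TIMESTAMP AS updated_at
    FROM RAW.KAFKA.SHIPMENT_EVENTS
""")

cur.close()
conn.close()
```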

“Since we’re a small team, we had the opportunity to evaluate the latest tools to build a modern data stack.”

 

As your company — and thus, your volume of data — grows, what steps are you taking to ensure your data pipeline continues to scale with the business?

We’re confident in Snowflake’s ability to scale as the number and size of customers using our transportation management system (TMS) grows. However, with a near-real-time service-level agreement of 10 minutes or less required for some data sources and machine-learning-driven services, we expect to outgrow Fivetran.

Thankfully, our architecture team has implemented messaging and stream processing using Kafka, AVRO and Confluent Schema Registry, so we already have a single asynchronous messaging protocol in place to meet our SLAs as our data volume increases.
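The interview doesn’t include Mastery’s schema definitions, but a minimal sketch of Avro-serialized publishing against Confluent Schema Registry with the confluent-kafka Python client looks roughly like the following; the broker and registry addresses, the topic name and the LoadTender schema are invented for illustration.

```python
# Sketch of Avro-serialized publishing against Confluent Schema Registry, using
# confluent-kafka (pip install "confluent-kafka[avro]"). Broker, registry URL,
# topic, and schema are placeholders.
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

SCHEMA_STR = """
{
  "type": "record",
  "name": "LoadTender",
  "fields": [
    {"name": "load_id", "type": "string"},
    {"name": "status",  "type": "string"}
  ]
}
"""

schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_serializer = AvroSerializer(schema_registry, SCHEMA_STR)

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "value.serializer": avro_serializer,
})

# The registry enforces the Avro schema, so every downstream consumer
# (warehouse sink, stream processors, ML services) reads the same contract.
producer.produce(topic="load-tenders", value={"load_id": "L-123", "status": "booked"})
producer.flush()
```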

As a Python-oriented team, we’ve also committed to using Faust, an open-source Python library with functionality similar to Kafka Streams, as our default stream processing framework; it provides AVRO codec and schema registry support.
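Faust keeps the consuming side similarly compact. The sketch below is a generic Faust agent, not Mastery’s code: the app name, broker, topic and record model are assumptions, and the Avro codec and schema-registry integration mentioned above would be layered on through a codec extension rather than the default JSON serialization shown here.

```python
# Minimal Faust stream processor (pip install faust). App name, broker, topic,
# and the LoadTender model are placeholders for illustration.
import faust

app = faust.App("mastery-streams", broker="kafka://localhost:9092")


class LoadTender(faust.Record):
    load_id: str
    status: str


tenders_topic = app.topic("load-tenders", value_type=LoadTender)


@app.agent(tenders_topic)
async def enrich_tenders(tenders):
    # Events are handled as they arrive, which keeps a sub-10-minute SLA within reach.
    async for tender in tenders:
        print(f"load {tender.load_id} moved to {tender.status}")


if __name__ == "__main__":
    app.main()
```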

 

SpotHero

Facilitating parking reservations nationwide through an app requires processing large amounts of data. Director of Data Science Long Hei explains why he uses Apache Airflow to build the data pipeline at SpotHero.

 

What technologies or tools are you currently using to build your data pipeline, and why did you choose them?

We mainly use Apache Airflow to build our data pipeline. It’s an open-source solution and has a great and active community. It comes with a number of supported operators that we utilize heavily, such as the Redshift and Postgres operators.
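A minimal hand-written Airflow DAG using the Postgres operator (which also works against Redshift over a Postgres-compatible connection) might look like the following. The DAG ID, connection ID and SQL are placeholders rather than SpotHero’s pipeline, and the import path shown is the Airflow 1.10-era one; in Airflow 2+ the operator lives under airflow.providers.postgres.

```python
# Minimal Airflow DAG sketch using the built-in Postgres operator. Connection IDs,
# schedule, and SQL are placeholders for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

with DAG(
    dag_id="daily_reservations_rollup",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    build_rollup = PostgresOperator(
        task_id="build_rollup",
        postgres_conn_id="redshift_default",
        sql="""
            INSERT INTO analytics.daily_reservations
            SELECT reservation_date, COUNT(*) AS reservations
            FROM raw.reservations
            GROUP BY reservation_date;
        """,
    )
```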

At SpotHero, we have extended our pipeline with our PipeGen functionality, so we can take YAML files and generate DAGs that serve as internal self-serve ETLs.
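PipeGen itself is internal to SpotHero, so the sketch below only illustrates the general YAML-to-DAG pattern Hei describes: a config file is parsed and one DAG is registered per entry. The YAML layout, DAG IDs and SQL are invented for illustration.

```python
# Rough sketch of YAML-driven DAG generation in the spirit of PipeGen (which is
# SpotHero-internal); the config format and names below are hypothetical.
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

CONFIG = yaml.safe_load("""
dags:
  - dag_id: searches_to_redshift
    schedule: "@hourly"
    sql: "INSERT INTO analytics.searches SELECT * FROM staging.searches;"
  - dag_id: bookings_to_redshift
    schedule: "@daily"
    sql: "INSERT INTO analytics.bookings SELECT * FROM staging.bookings;"
""")

for spec in CONFIG["dags"]:
    dag = DAG(
        dag_id=spec["dag_id"],
        start_date=datetime(2020, 1, 1),
        schedule_interval=spec["schedule"],
        catchup=False,
    )
    PostgresOperator(
        task_id="run_sql",
        postgres_conn_id="redshift_default",
        sql=spec["sql"],
        dag=dag,
    )
    # Airflow discovers DAGs by scanning module globals, so register each generated DAG.
    globals()[spec["dag_id"]] = dag
```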

“As the business scales, the volume of our data also scales tremendously.”

 

What steps are you taking to ensure your data pipeline continues to scale with the business?

As a culture, we always encourage and help other business teams build their own ETL processes using Airflow and PipeGen. As the business scales, the volume of our data also scales tremendously. There are more people in the business who need access to the data. As a result, we have outgrown Redshift as a catch-all for our data. 

To facilitate scale, we have moved our raw data out of Redshift and into S3, which allows us to scale the amount of data almost infinitely. We can query and explore that data through Presto. 
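Querying that S3-resident data through Presto can be done from Python with the presto-python-client package; the host, catalog, schema and table names below are placeholders for whatever the metastore exposes over the S3 data.

```python
# Sketch of exploring S3-backed tables through Presto using presto-python-client
# (pip install presto-python-client). Host, catalog, schema, and table names are
# placeholders for illustration.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.internal.example.com",
    port=8080,
    user="data-science",
    catalog="hive",
    schema="raw_events",
)

cur = conn.cursor()
cur.execute("""
    SELECT facility_id, COUNT(*) AS searches
    FROM parking_searches
    WHERE search_date = DATE '2020-01-15'
    GROUP BY facility_id
    ORDER BY searches DESC
    LIMIT 10
""")

for facility_id, searches in cur.fetchall():
    print(facility_id, searches)
```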

Once a data set is locked down, the ETL-ready insights can be graduated into Redshift via PipeGen. This has significantly helped the rapid scaling of the data and the data pipeline.

 

Responses have been edited for length and clarity. Images via listed companies.
