Want to Build a Better Data Pipeline?

Two Chicago tech companies share their strategies.

Written by Erik Fassnacht
Published on Jul. 16, 2021

In the world of code, sometimes it helps to think of algorithms as tangible devices — or even physical tools. When it comes to the intricate work of building data pipelines, those algorithmic tools can be as fine as a faucet key or as unwieldy as a hacksaw.

According to Peter Peluso, director of algorithmic development at Magma Capital Funds, when a company’s dataset grows, the algorithms involved can no longer afford to act like hacksaws.

“A brute-force algorithm may work on a small dataset, but may be time-consuming on a large dataset,” Peluso said. “As our volume of data continues to grow, it is more important to write efficient algorithms.”
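
For a generic illustration of the gap Peluso describes (not one of Magma's actual algorithms), compare a quadratic pair-search with a single-pass version of the same task:

```python
# Illustration only: the quadratic scan is fine on a small list,
# but the single-pass set lookup scales far better as the data grows.

def has_pair_brute_force(values, target):
    # O(n^2): compare every pair of values.
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] + values[j] == target:
                return True
    return False

def has_pair_efficient(values, target):
    # O(n): one pass, remembering previously seen values in a set.
    seen = set()
    for v in values:
        if target - v in seen:
            return True
        seen.add(v)
    return False
```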

That need for increased efficiency is matched by the need for scalability, noted Sam Hillis, a data scientist at Strata Decision Technology. “Our organization supports over 50 percent of U.S. healthcare and it is crucial that our data pipeline is scalable and can meet the demands of the important work that our customers do,” Hillis said.

To learn more about which tools, technologies and strategies best help data pipelines become more scalable and efficient, Built In Chicago sat down for a conversation with Hillis and Peluso.

 


 

Sam Hillis
Data Scientist • Strata Decision Technology

 

Strata Decision Technology

Strata Decision Technology provides a software and service platform designed to help “heal” healthcare.

 

What technologies or tools are you currently using to build your data pipeline, and why did you choose those technologies specifically?

When selecting technology at Strata, our first priority is choosing a secure platform that will protect our customers’ data. For everyday analysis, our team works primarily with SQL and Python. We use these tools because most of our client data is still stored in relational databases, so SQL is the obvious choice for ad-hoc querying and analysis, while Python is a convenient tool for automating and enhancing workflows and building machine learning models.
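
As a rough sketch of that SQL-plus-Python split, the connection string, table and column names below are invented for the example, not Strata’s actual schema:

```python
# Hypothetical example: ad-hoc SQL pulls the raw data, pandas handles
# the downstream analysis or feature engineering.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:password@analytics-db/clients"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

query = """
    SELECT client_id, encounter_date, total_cost
    FROM encounters
    WHERE encounter_date >= '2021-01-01'
"""
df = pd.read_sql(query, engine, parse_dates=["encounter_date"])

# Example enhancement step: monthly cost per client, e.g. as a model feature.
monthly = (
    df.assign(month=df["encounter_date"].dt.to_period("M"))
      .groupby(["client_id", "month"])["total_cost"]
      .sum()
      .reset_index()
)
```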

All of our machine learning models are built and deployed in AWS SageMaker and accessed through our main application via a Lambda/API Gateway integration. We chose this approach because it allows for easy integration with our core software, which is written in C#. With SageMaker, we can deploy our models through Docker images, allowing us to leverage whatever tools and libraries we want. An additional advantage of using SageMaker is the ease with which we can automate the training and deployment pipeline with Lambda and Step Functions, as well as other important features such as autoscaling.
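
A minimal sketch of what such a Lambda-to-SageMaker hop can look like with boto3; the endpoint name and payload shape here are assumptions, not Strata’s actual integration:

```python
# Hypothetical Lambda handler: API Gateway forwards a JSON request to this
# function, which invokes a SageMaker endpoint and returns the prediction.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    # With an API Gateway proxy integration, the request body arrives as a string.
    payload = event["body"]

    response = runtime.invoke_endpoint(
        EndpointName="cost-model-endpoint",   # hypothetical endpoint name
        ContentType="application/json",
        Body=payload,
    )
    prediction = json.loads(response["Body"].read())

    return {"statusCode": 200, "body": json.dumps(prediction)}
```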

 

All of our machine learning models are built and deployed in AWS SageMaker and accessed through our main application via a Lambda/API Gateway integration.

 

As your company size and data volume grow, what steps are you taking to ensure your data pipeline continues to scale with the business?

In terms of scalability, our engineering team is nearing the completion of a new data architecture built on Snowflake. This allows for more convenient analysis across many clients simultaneously, which is critical for much of the work that our team does. 

Along with the added convenience, Snowflake provides huge performance benefits over our existing SQL architecture through massively parallel processing and the ability to easily scale up the cluster’s underlying cloud resources.

Snowflake will also allow for additional automation of our SageMaker Step Function pipelines. In addition, the engineering team has built a data-wrangling tool that facilitates extraction and processing of data within Snowflake, allowing our team to write queries that execute over each of the replicated databases. It also includes provisions for custom logic at the client level for clients with non-standard implementations, as well as the ability to promote workflows from a sandbox to a production environment.
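
For illustration, here is a bare-bones version of the “same query over each replicated database” idea using the Snowflake Python connector; the account, warehouse, database and table names are placeholders, not Strata’s data-wrangling tool:

```python
# Hypothetical sketch: run one query against each replicated client database.
import snowflake.connector

conn = snowflake.connector.connect(
    user="analyst",
    password="...",
    account="example_account",
    warehouse="ANALYTICS_WH",
)

client_dbs = ["CLIENT_A_DB", "CLIENT_B_DB", "CLIENT_C_DB"]  # placeholder names
results = {}

cur = conn.cursor()
try:
    for db in client_dbs:
        cur.execute(f"USE DATABASE {db}")
        cur.execute("SELECT COUNT(*) FROM PUBLIC.ENCOUNTERS")  # placeholder table
        results[db] = cur.fetchone()[0]
finally:
    cur.close()
    conn.close()
```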


 

Peter Peluso
Director of Algorithmic Development • Magma Capital Funds

 

Magma Capital Funds

Magma Capital Funds is a quantitative hedge fund that combines human research and AI-driven trading decisions to deliver above-average returns to investors.


What technologies or tools are you currently using to build your data pipeline, and why did you choose those technologies specifically?

In some areas, we choose technology that fits our needs, and in other areas we choose what our vendors use. Data that comes from our vendors includes real-time, historical and alternative data. We are often restricted in how we access the data by the vendor’s choice of technology. Some of our vendors provide data through Amazon S3, some through flat files (CSV and HDF) over FTP, and others through an API. We store this data in a combination of files and SQL databases. For certain datasets, we stick with the files because they are very easy to use. Our researchers are comfortable with SQL, making it a simple choice for data storage.
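
As one hedged example of the vendor paths Peluso describes, here is a short sketch that pulls a CSV from S3 and lands it in a SQL table; the bucket, key, connection string and table name are invented for the example:

```python
# Hypothetical ingestion step: vendor CSV from S3 into a SQL table.
import io

import boto3
import pandas as pd
from sqlalchemy import create_engine

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="vendor-market-data", Key="daily/2021-07-16.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

engine = create_engine("postgresql://user:password@research-db/marketdata")
df.to_sql("daily_prices", engine, if_exists="append", index=False)
```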

 

Data that comes from our vendors includes real-time, historical and alternative data.

 

As your company size and data volume grow, what steps are you taking to ensure your data pipeline continues to scale with the business?

There are exciting technologies to use as our data grows. For model training, we are using GPUs and other high-performance cloud solutions. For databases, we are considering a move to a time-series database. InfluxDB and KDB+ are possible choices. Moving toward one of these will offer speed improvements now and will really shine as our data volume grows.
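
Since the move is still under consideration, the following is only a hypothetical sketch of what writing and querying tick data with InfluxDB’s Python client could look like; every URL, token, bucket, measurement and value in it is an assumption, not Magma’s setup:

```python
# Hypothetical time-series sketch with the influxdb-client library.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="...", org="research")
write_api = client.write_api(write_options=SYNCHRONOUS)

# Write one illustrative trade as a point tagged by symbol.
point = (
    Point("trades")
    .tag("symbol", "AAPL")
    .field("price", 148.56)
    .field("size", 100)
)
write_api.write(bucket="market-data", record=point)

# Time-range queries stay fast as history grows because data is indexed by time.
query_api = client.query_api()
tables = query_api.query(
    'from(bucket: "market-data") |> range(start: -1h) '
    '|> filter(fn: (r) => r._measurement == "trades")'
)
client.close()
```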

 

 
