Local Data Scientists Follow These Best Practices for Managing and Organizing Data Lakes


Written by Michael Hines
Published on Dec. 17, 2020

A data lake can very quickly turn into a data swamp if a company hasn’t decided what unstructured data it’s storing and how it’ll be managed and organized.

Of course, laying this foundation is an immense challenge, one that requires a team to think critically about which data is most valuable to the company and why, along with the technology required to ingest and restructure it at scale.

While there is no single established best practice, Chicago’s data community does have advice on how to approach the process and which technologies are worth investing in. We recently spoke with three local data professionals to learn more about their best practices for managing and organizing data lakes.

 

Robert Adler
Director, Data • Avant

There is no shortage of technologies that help teams bring structure to, and get value from, unstructured data. Robert Adler, director of data at online lender Avant, shared a few of the tools his team uses to derive pointed insights from the massive amount of data generated by the company’s applicants, customers and financial products.

 

Give us a sense of scale for your company’s data lake. Where is the data coming from and what technology do you use to store it?

As a financial services company, Avant manages terabytes of data generated by dozens of different systems. Customer application data is captured through a native application while web traffic is captured in Tealium. Agents process customer applications, generating interactive voice response and application workflow data, while also receiving inquiries through email and chat apps. Once a customer is issued a product, financial data is collected regarding transactions and payments, while collections information is logged for customers and agents through a third-party contact management platform.

Avant’s product, risk, operations and finance teams want to leverage data from all of these disparate systems in their reporting and analysis. To accommodate these use cases, we ingest data from each platform using Spark and Scala programs and store it as Parquet files in our S3-based data lake.
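In broad strokes, that ingestion pattern might look like the following minimal Spark and Scala sketch. The source format, bucket names and partitioning column are hypothetical stand-ins for illustration, not details of Avant’s actual pipeline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_date

object IngestApplications {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ingest-applications")
      .getOrCreate()

    // Read a raw JSON extract from one of the source systems.
    val raw = spark.read.json("s3a://example-raw-bucket/applications/2020-12-17/")

    // Land it in the lake as date-partitioned Parquet.
    raw.withColumn("ingest_date", current_date())
      .write
      .mode("append")
      .partitionBy("ingest_date")
      .parquet("s3a://example-data-lake/applications/")

    spark.stop()
  }
}
```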
 



How does your team manage, organize and extract business value from all that unstructured data?

Avant leverages several technologies to aid in the management and organization of our data. All data ingested into the lake is logged in Alation, our data catalog. This tool profiles the incoming data and logs important metadata, which helps users get their bearings and log additional findings once they explore the tables.

Once data lands in the data lake, we perform several layers of transformation to combine these data sources and build a more structured data model that business users can leverage for their reporting and analytics. These tables are also logged in Alation and include more robust definitions, lineage and business insights.
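As a rough illustration of what one of those transformation layers could look like, here is a hedged Spark sketch that joins two hypothetical lake tables into a curated model; the table names, columns and paths are invented, not Avant’s schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object BuildCustomerModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("build-customer-model")
      .getOrCreate()

    // Two sources previously landed in the lake as Parquet.
    val applications = spark.read.parquet("s3a://example-data-lake/applications/")
    val payments     = spark.read.parquet("s3a://example-data-lake/payments/")

    // Combine them into a wider, analysis-ready table.
    val customerModel = applications
      .join(payments, Seq("customer_id"), "left")
      .groupBy("customer_id", "product_type")
      .agg(sum("payment_amount").as("total_paid"))

    // Write the curated layer back to the lake for reporting and analytics.
    customerModel.write
      .mode("overwrite")
      .parquet("s3a://example-data-lake/curated/customer_model/")

    spark.stop()
  }
}
```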

Avant leverages Dremio, which specializes in querying Parquet files from S3 and allows our team to define a virtual data structure on top of a physical data structure. This gives us incredible schema and permissions control and allows us to best support our business use cases. These tools, along with a carefully planned data model, allow us to extract key insights and provide the best possible experience for our customers.
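Because Dremio exposes those virtual datasets through standard SQL interfaces, a downstream consumer can query them much like any database. The sketch below uses Dremio’s JDBC driver from Scala; the host, credentials and dataset names are assumptions for illustration, so check the Dremio documentation for your version before borrowing the connection string.

```scala
import java.sql.DriverManager

object QueryDremio {
  def main(args: Array[String]): Unit = {
    // Dremio ships a JDBC driver; the "direct=" URL targets a coordinator node.
    Class.forName("com.dremio.jdbc.Driver")
    val conn = DriverManager.getConnection(
      "jdbc:dremio:direct=dremio.example.internal:31010", "analyst", "secret")

    // Virtual datasets are queried like ordinary tables.
    val rs = conn.createStatement().executeQuery(
      "SELECT customer_id, total_paid FROM curated.customer_model LIMIT 10")
    while (rs.next()) {
      println(s"${rs.getString("customer_id")}\t${rs.getDouble("total_paid")}")
    }
    conn.close()
  }
}
```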

 

Matthew Couture
Business Intelligence Lead • Ascent

Unstructured data powers Ascent’s regulation technology platform, which is designed to ensure financial services companies never fall out of compliance no matter how often the rules change. According to Matthew Couture, business intelligence lead, his team’s goal isn’t just to take a document full of legal text and bring structure to it, but also to capture the relationships between the data within it.

 

Give us a sense of scale for your company’s data lake. Where is the data coming from and what technology do you use to store it?

Ascent’s data is primarily stored in three different structures based on the type of data and its use case. Our raw data inputs are the legal text — i.e. “rulesets” — published by financial regulators, and because of the unstructured nature of these text documents, they are stored in a data lake. From there, the ruleset data is transformed and stored in a graph database and our data warehouse. 

The graph database is a key tool for understanding the anatomy of a rule, as well as the complex relationships between rules within a larger body of regulation. The ruleset data is also stored in our data warehouse to provide training data for our machine learning models as well as to feed business intelligence needs. We currently use Amazon Redshift for our data warehouse, but due to the complexity of the data we’re working with, it’s becoming apparent that a more flexible solution will be needed as we continue to grow.
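The article doesn’t say which graph database Ascent uses, but the rule-to-rule relationships Couture describes map naturally onto a property graph. The sketch below assumes Neo4j and its Java driver purely for illustration; the node labels, IDs and the CITES relationship are invented.

```scala
import org.neo4j.driver.{AuthTokens, GraphDatabase}

object LoadRuleGraph {
  def main(args: Array[String]): Unit = {
    val driver = GraphDatabase.driver(
      "bolt://localhost:7687", AuthTokens.basic("neo4j", "password"))
    val session = driver.session()

    // One rule component pointing at another rule it cites, so queries can
    // later walk the relationships within a larger body of regulation.
    session.run(
      """MERGE (a:Rule {id: 'reg-x-1.1', title: 'Recordkeeping'})
        |MERGE (b:Rule {id: 'reg-x-4.2', title: 'Retention periods'})
        |MERGE (a)-[:CITES]->(b)""".stripMargin)

    session.close()
    driver.close()
  }
}
```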
 



How does your team manage, organize and extract business value from all that unstructured data?

We use a combination of expert legal review, machine learning algorithms and an in-house application. Through this process we break down the rulesets into component pieces and then reconstruct them, adding valuable information and structure along the way. One way we add value is by reconstructing the ruleset within our graph database. In doing this, we add structure and consistency to the components and create relationships showing how the components of the ruleset relate to each other. 

Additionally, we store the ruleset data in a data warehouse where it can be retrieved to train new machine learning models or retrain existing ones. As our library of ruleset data grows, we can use our business intelligence tools, combined with other source data, like user data, to look for trends and insights that would only become available with the data enrichment we’ve provided.

 

Evan Roth
Director, Data, Automation & AI • Productive Edge

Being able to store a large amount of unstructured data is only really useful if you have a plan for it. Evan Roth, director of data, automation and AI at Productive Edge, shared the technology his team is leveraging to help a client extract value from their HR data.

 

Give us a sense of scale for your company’s data lake. Where is the data coming from and what technology do you use to store it?

We’re building out a data-empowered analytics platform for a client, an HR analytics team within a large Fortune 500 enterprise. The data we receive comes from a variety of HR sources, so security and governance are a critical consideration every step of the way. Some examples of data sources include associate employment history — think promotions and job changes — out of Workday, as well as raw social media-like posts from their internal Yammer system.

As the initial step to collect data in an on-prem solution, we’re using a combination of network attached storage to store raw data and Microsoft SQL Server to store some metadata. In a future phase, this will migrate to Microsoft Azure and be stored within their cloud data lake.
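In outline, that interim on-prem step could look something like the following sketch: copy a raw file onto network-attached storage, then record a metadata row in SQL Server over JDBC. The paths, table and columns here are hypothetical, not the client’s actual schema.

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}
import java.sql.DriverManager

object LandRawFile {
  def main(args: Array[String]): Unit = {
    // Copy the raw extract onto the NAS share that backs the lake.
    val source = Paths.get("/incoming/yammer_posts_2020-12-17.json")
    val dest   = Paths.get("/mnt/nas/raw/yammer/yammer_posts_2020-12-17.json")
    Files.createDirectories(dest.getParent)
    Files.copy(source, dest, StandardCopyOption.REPLACE_EXISTING)

    // Record where the file landed and when, so downstream jobs can find it.
    val conn = DriverManager.getConnection(
      "jdbc:sqlserver://sqlmeta.example.internal;databaseName=lake_metadata",
      "etl_user", "secret")
    val stmt = conn.prepareStatement(
      "INSERT INTO raw_file_log (source_system, path, loaded_at) " +
      "VALUES (?, ?, SYSUTCDATETIME())")
    stmt.setString(1, "yammer")
    stmt.setString(2, dest.toString)
    stmt.executeUpdate()
    conn.close()
  }
}
```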
 



How does your team manage, organize and extract business value from all that unstructured data?

There are two main areas of focus. First, the data is scrubbed and loaded into a structured data warehouse, which is then exposed to Power BI or to customized unstructured data feeds for downstream analytics teams, respecting any security and governance requirements. Second, machine learning pipelines leverage subsets of the data to perform predictive analytics on incoming data, such as analyzing Yammer posts for sentiment.
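As a toy example of that second focus, the sketch below trains a tiny sentiment classifier with Spark MLlib and scores a new post. The article doesn’t specify the team’s ML stack, and the sample texts and labels are invented, so treat this as a shape, not the implementation.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PostSentiment {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("post-sentiment").getOrCreate()
    import spark.implicits._

    // Tiny labeled sample: 1.0 = positive, 0.0 = negative.
    val training = Seq(
      ("great town hall today, proud of this team", 1.0),
      ("another outage, nothing is working again", 0.0)
    ).toDF("text", "label")

    // Tokenize, hash tokens into features, then fit a simple classifier.
    val pipeline = new Pipeline().setStages(Array(
      new Tokenizer().setInputCol("text").setOutputCol("words"),
      new HashingTF().setInputCol("words").setOutputCol("features"),
      new LogisticRegression().setMaxIter(10)
    ))
    val model = pipeline.fit(training)

    // Score an unseen post.
    model.transform(Seq("loving the new benefits rollout").toDF("text"))
      .select("text", "prediction")
      .show(false)

    spark.stop()
  }
}
```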

All responses have been edited for length and clarity. Headshots provided by respective companies.
