Transform Your Data Pipeline

Enable data engineering teams to scale and automate complex pipelines with sophisticated data transformations.

Enterprise Edition

For organizations and teams that require advanced features and unlimited scale.

  • Unlimited Data-Driven Pipelines
  • Unlimited Parallel Processing
  • Role-Based Access Control (RBAC)
  • Pluggable Auth: Log in with your IdP
  • Enterprise Support
Contact Sales · 30-day free trial

Community Edition

For small teams that prefer to build and support their own software.

A complete data-driven pipeline solution with data versioning and data lineage.

Free Download
Trusted by Liveperson, Fraunhofer, LogMeIn, Woven Planet, and RTL.

Optimizing Datasets Provides Bigger Benefits for Most Teams

Data Centric

Over the last decade, AI/ML researchers have focused on code and algorithms first and foremost.

The data was usually imported once and then left fixed or frozen. If there were problems with noisy data or bad labels, teams usually worked around them in the code.

Because so much time was spent working on the algorithms, they're largely a solved problem for many use cases like image recognition and text translation.

Swapping one algorithm out for another often makes little to no difference.

Data-Centric AI flips that on its head and says we should go back and fix the data itself. Clean up the noise. Augment the dataset to cover what's missing. Re-label so it's more consistent.

There are six essential ingredients needed to put data-centric AI into practice in your organization:

Creative Thinking

Data-Centric AI demands creative problem solving and thinking through the solution from start to finish.

Synthetic Data

Synthetic Data is artificially created data used to train AI models.
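As a concrete illustration, here is a minimal sketch of generating a synthetic labeled dataset with scikit-learn's make_classification; the parameter values are illustrative, not recommendations.

```python
# Generate a synthetic labeled dataset to supplement scarce real data.
# All parameter values below are illustrative.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10_000,    # number of synthetic rows
    n_features=20,       # total feature count
    n_informative=5,     # features that actually drive the label
    class_sep=0.8,       # lower values make the classes harder to separate
    random_state=42,     # reproducible output
)
print(X.shape, y.shape)  # (10000, 20) (10000,)
```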

Data Augmentation

Data Augmentation involves running data through filters or altering it slightly to create more variance and more samples for your models to work with.
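For image data, augmentation can be as simple as mirroring and small rotations. A minimal sketch using Pillow; the input filename is a placeholder:

```python
# Produce augmented variants of one image: a mirror and two small
# rotations triple the number of samples without new collection.
from PIL import Image, ImageOps

def augment(path):
    img = Image.open(path)
    yield ImageOps.mirror(img)          # horizontal flip
    yield img.rotate(10, expand=True)   # small tilt one way
    yield img.rotate(-10, expand=True)  # small tilt the other way

for i, variant in enumerate(augment("sample.png")):  # placeholder file
    variant.save(f"sample_aug_{i}.png")
```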

Testing

Build human-in-the-loop tests at every stage to validate things like labeling consistency, as well as data integrity and data quality.
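One simple human-in-the-loop test for labeling consistency is to have two annotators label the same sample and measure their agreement beyond chance. A sketch using scikit-learn's cohen_kappa_score; the labels and the 0.8 threshold are illustrative:

```python
# Measure inter-annotator agreement on a shared sample of items.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["cat", "dog", "dog", "cat", "bird"]  # toy labels
annotator_b = ["cat", "dog", "cat", "cat", "bird"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
if kappa < 0.8:  # threshold is a judgment call, not a standard
    print(f"Low agreement (kappa={kappa:.2f}); clarify the labeling guidelines.")
```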

Clarifying Instructions

Clarify your instructions to labelers and data engineers as the data goes through the iterative process of labeling, clarifying, and labeling again.

Tooling

Tooling includes both data engineering orchestration pipelines and data science experimentation pipelines.

Flexible, version-controlled machine learning with your data warehouse

Data engineering and science teams are increasingly looking to leverage their data warehouse for innovative machine learning (ML) projects such as churn analysis or customer lifetime value projections. However, getting the requisite data out of Snowflake or Redshift and into data pipelines for experimentation and model training can be challenging.
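One way to bridge that gap is sketched below: export a table from Snowflake and commit it to a versioned Pachyderm repo that downstream pipelines can consume. The connection values, table, and repo name are placeholders, and the Pachyderm calls follow the python-pachyderm client library, whose method names may differ across versions.

```python
# Sketch: pull a table out of Snowflake and commit it to Pachyderm
# so pipelines can version and process it. Placeholders throughout.
import snowflake.connector
import python_pachyderm

conn = snowflake.connector.connect(
    user="USER", password="PASSWORD", account="ACCOUNT",  # placeholder credentials
    warehouse="WH", database="DB", schema="PUBLIC",
)
cur = conn.cursor()
cur.execute("SELECT * FROM customers")  # hypothetical table
df = cur.fetch_pandas_all()             # needs snowflake-connector-python[pandas]

client = python_pachyderm.Client()      # defaults to localhost:30650
client.create_repo("warehouse-exports") # placeholder repo name
with client.commit("warehouse-exports", "master") as commit:
    client.put_file_bytes(commit, "/customers.csv",
                          df.to_csv(index=False).encode())
```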

Data-Centric Processing

Pachyderm's pipelines leverage automated versioning to drive incremental processing and data deduplication, shortening processing times and reducing storage costs.

Complex Data Workflows

With Pachyderm you can build complex workflows that support the most advanced ML applications, all visually managed and monitored in the Pachyderm Console UI.

Scales to the Job

Pachyderm scales to petabytes of data with autoscaling and data-driven parallel processing. Our approach to version control and file processing automates scaling while controlling compute costs.
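That data-driven parallelism is configured in the pipeline spec: the input glob pattern controls how Pachyderm splits a repo into datums, and the parallelism spec caps the worker count. A sketch of the relevant fields, with illustrative names and values:

```python
# Shape of a Pachyderm pipeline spec, serialized to JSON for
# `pachctl create pipeline -f`. The glob "/*" makes each top-level
# path in the repo its own datum, so datums run in parallel and
# unchanged datums are skipped on later commits (incremental processing).
import json

spec = {
    "pipeline": {"name": "wordcount"},                      # illustrative name
    "transform": {
        "image": "python:3.11-slim",                        # image holding your code
        "cmd": ["python3", "/app/count.py"],                # hypothetical entrypoint
    },
    "input": {"pfs": {"repo": "documents", "glob": "/*"}},  # one datum per top-level path
    "parallelism_spec": {"constant": 4},                    # up to 4 workers
}
print(json.dumps(spec, indent=2))
```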

See Pachyderm In Action

Watch a short five-minute demo that shows the product in action.

Biotech & Life Science

Offering mission-critical reproducibility across BioTech, Pharma, Genomics, Healthcare, and Life Sciences.

Machine Learning

The foundation of any production-scale ML platform for data processing and orchestration.

GitHub Example
Machine Learning

Breast Cancer Detection

A breast cancer detection system based on radiology scans, scaled and visualized using Pachyderm.

GitHub Example
Notebook

JupyterLab Mount Ext

A notebook showing how to use the JupyterLab Pachyderm Mount Extension to mount Pachyderm data repositories into your notebook environment.
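Once mounted, a repo reads like a local directory from notebook code. A minimal sketch, assuming the repo is mounted under /pfs; the path and filename are placeholders:

```python
# Read a file from a mounted Pachyderm repo as if it were local.
import pandas as pd

# "/pfs/training-data" is a placeholder mount path; the actual path
# depends on how the repo was mounted in the extension.
df = pd.read_csv("/pfs/training-data/labels.csv")  # hypothetical file
print(df.head())
```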

GitHub Example
Getting Started

Intro to Pachyderm Tutorial

This notebook provides an introduction to Pachyderm, using the pachctl command-line utility to illustrate the basics of data repositories and pipelines.
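The tutorial itself drives Pachyderm with pachctl; the sketch below scripts the same basics from Python, assuming a running cluster and a configured pachctl. The repo and file names are placeholders.

```python
# Script the pachctl basics the tutorial walks through.
import subprocess

def pachctl(*args):
    subprocess.run(["pachctl", *args], check=True)

pachctl("create", "repo", "images")                   # a versioned data repository
pachctl("put", "file", "images@master:/liberty.png",  # commit a file to master
        "-f", "liberty.png")                          # placeholder local file
pachctl("list", "repo")                               # repos with sizes
pachctl("list", "file", "images@master")              # files in the latest commit
```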
