Pachyderm

Automatic Detection

Data-driven pipelines trigger automatically when data changes are detected.
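As a rough illustration of how such a pipeline might be declared, the sketch below builds a minimal pipeline specification in Python and serializes it to JSON. The pipeline name `edges`, the input repository `images`, the container image, and the command are hypothetical placeholders. Check the field names against the pipeline specification reference for your Pachyderm version before relying on them.

```python
import json

# Hypothetical example values: the pipeline "edges", the repo "images",
# the container image, and the command are placeholders for illustration.
pipeline_spec = {
    "pipeline": {"name": "edges"},
    "input": {
        # The pipeline watches this input repo; new data there triggers a run.
        "pfs": {"repo": "images", "glob": "/*"},
    },
    "transform": {
        "image": "example/edges:1.0",
        "cmd": ["python3", "/edges.py"],
    },
}

# Serialize the spec so it can be saved to a file such as edges.json.
spec_json = json.dumps(pipeline_spec, indent=2)
print(spec_json)
```

A file with this content would typically be submitted with `pachctl create pipeline -f edges.json`.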

Version Control

Immutable data lineage, with versioning for any data type.
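To make the idea of immutable lineage concrete, here is a toy sketch of a commit chain in which every commit records its parent and cannot be changed after creation. It is a simplified model for illustration only, not Pachyderm's actual storage format; the `Commit` class and `commit` helper are hypothetical names.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Commit:
    parent: Optional[str]  # ID of the previous commit, or None for the first
    files: tuple           # sorted (path, content_hash) pairs

    @property
    def id(self) -> str:
        # The ID is derived from the parent and file hashes, so any change
        # to the data or its history yields a different ID.
        payload = json.dumps({"parent": self.parent, "files": self.files})
        return hashlib.sha256(payload.encode()).hexdigest()


def commit(parent: Optional[Commit], files: dict) -> Commit:
    """Create a new commit that points at its parent and leaves it untouched."""
    hashed = tuple(
        sorted((path, hashlib.sha256(data).hexdigest()) for path, data in files.items())
    )
    return Commit(parent.id if parent else None, hashed)


first = commit(None, {"data.csv": b"a,b\n1,2\n"})
second = commit(first, {"data.csv": b"a,b\n1,2\n3,4\n"})
print(second.parent == first.id)  # the new version links back to the old one
```

Because file contents are stored as raw bytes and only their hashes enter the commit, the same scheme applies to any data type.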

Autoscaling

Autoscaling and parallel processing built on Kubernetes for resource orchestration.

Automatic Deduplication

Uses standard object stores for data storage with automatic deduplication.
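One common way to achieve deduplication is content addressing, where each blob is stored under the hash of its bytes so identical content is written only once. The sketch below is a minimal in-memory model of that idea, not a description of Pachyderm's internals; `ContentStore` and the file paths are hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: identical bytes are stored once."""

    def __init__(self):
        self.blobs = {}  # content hash -> bytes
        self.paths = {}  # file path -> content hash

    def put(self, path: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # no-op if this content already exists
        self.paths[path] = digest
        return digest

    def get(self, path: str) -> bytes:
        return self.blobs[self.paths[path]]


store = ContentStore()
store.put("/a/report.csv", b"x,y\n1,2\n")
store.put("/b/copy.csv", b"x,y\n1,2\n")   # same bytes as the first file
store.put("/c/other.csv", b"x,y\n3,4\n")
print(len(store.paths), "paths,", len(store.blobs), "blobs")  # 3 paths, 2 blobs
```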

Cloud & On-prem

Runs across all major cloud providers and on-premises installations.

Flexibility

Leverage your infrastructure investments and run on your existing cloud or on-premises infrastructure.

Process data of any type, size, or scale in batch or real-time pipelines.

Support effective team collaboration through a Git-like structure of commits.

Code and Data Agnostic

Container-native pipelines empower developer autonomy.

Use whichever languages or libraries are best for the job.

Seamlessly ingest from streaming, real-time, or batch data sources.

Infrastructure Agnostic

Runs in all major cloud providers and on-premises data centers.

Integrates with existing tools – CI/CD, logging, auth, and data APIs.

Integrates with standard data processing and machine learning tools.

Composability

Easily share data sets or pipelines across teams or use cases.

Make any process data-driven by subscribing to data repo changes.

Microservices-like approach increases reuse and collaboration.
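The subscription model can be pictured as a small observer pattern: a process registers interest in a repository and runs whenever a new commit lands. The following sketch is a conceptual toy under that assumption, not the Pachyderm API; the `Repo` class, the `word_count` callback, and the repo name `raw` are hypothetical.

```python
from typing import Callable, Dict, List


class Repo:
    """Toy data repo that notifies subscribers on every commit."""

    def __init__(self, name: str):
        self.name = name
        self.commits: List[Dict[str, bytes]] = []
        self._subscribers: List[Callable[[str, Dict[str, bytes]], None]] = []

    def subscribe(self, callback: Callable[[str, Dict[str, bytes]], None]) -> None:
        self._subscribers.append(callback)

    def commit(self, files: Dict[str, bytes]) -> None:
        self.commits.append(dict(files))
        # Every subscriber runs in response to the new data.
        for callback in self._subscribers:
            callback(self.name, files)


processed = []


def word_count(repo_name: str, files: Dict[str, bytes]) -> None:
    # A stand-in for any process that should react to new data.
    for path, data in files.items():
        processed.append((repo_name, path, len(data.split())))


raw = Repo("raw")
raw.subscribe(word_count)
raw.commit({"notes.txt": b"hello data driven world"})
print(processed)  # [('raw', 'notes.txt', 4)]
```

Several independent processes can subscribe to the same repo, which is what lets teams reuse one data set across use cases.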

Console

Console is a complete web UI for visualizing running pipelines and exploring your data.

Map out the overall structure and flow of all pipelines.
View repositories, commit histories, and preview data directly in your browser.
Follow job statuses, pipeline processes, and execution history.

Notebook

A JupyterLab mount extension that selectively maps the contents of data repositories directly into your Jupyter environment.

Ideal for data scientists exploring and analyzing data.
Run and test pipeline code against versioned data.
Create reliable, shareable development environments.

Enterprise Administration

Robust tools for deploying and administering Pachyderm at scale across different teams in your organization.

Centralized licensing and administration of all clusters.
Authentication against any OIDC provider.
Role-based access control (RBAC) support for governance and data privacy.
