Collector Overview - Entegrata Support

What is the Collector?

The Collector is Entegrata’s core data ingestion system that automatically connects to your data sources, discovers available data, and synchronizes it to your data lakehouse. The Collector manages the entire data pipeline from source systems to your analytics-ready tables.

Key Concepts

Connections

A Connection represents Entegrata’s link to a data source. Each connection includes:

Authentication credentials - Secure access to your data source
Collection schedule - How often data is synchronized
Connection-level settings - Default collection behaviors for all resources

Resources

A Resource is a table, view, API endpoint, or file within a connection that Entegrata collects. Each resource can have:

Individual collection settings - Override connection-level defaults
Load type - Full load or incremental synchronization
Unique keys - Fields that identify unique records
Filters - Rules to limit what data is collected
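
The override behaviour described above can be sketched as a simple merge. This is a minimal illustration, not Entegrata's implementation: the `CollectionSettings` class, its field names, and the `effective_settings` helper are all hypothetical, and it assumes a setting left as `None` on the resource falls back to the connection-level default.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CollectionSettings:
    """Hypothetical settings object; None means 'not set at this level'."""
    load_type: Optional[str] = None          # "full" or "incremental"
    unique_keys: Optional[List[str]] = None  # fields that identify a record
    filters: Optional[List[str]] = None      # rules limiting collected data


def effective_settings(connection_defaults, resource_overrides):
    """Use the resource's value where one is set, else the connection default."""
    merged = {}
    for name in ("load_type", "unique_keys", "filters"):
        override = getattr(resource_overrides, name)
        default = getattr(connection_defaults, name)
        merged[name] = override if override is not None else default
    return CollectionSettings(**merged)


# Example: the resource overrides only the load type.
defaults = CollectionSettings(load_type="full", unique_keys=["id"], filters=[])
resource = CollectionSettings(load_type="incremental")
print(effective_settings(defaults, resource))
```

In this sketch the resource keeps the connection's unique keys and filters and changes only the load type.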

Resource Hierarchy

Some resources have nested sub-resources (child resources). For example:

A database table might have related child tables
An API endpoint might have nested data structures
A file might contain multiple sheets or sections

Sub-resources inherit collection settings from their parent and cannot be scheduled independently.
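
One way to picture this inheritance is a parent pointer that is followed up to the top-level resource. The sketch below is illustrative only: the `Resource` class, the `interval_hours` field, and both helper functions are assumptions, not part of Entegrata's product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    """Hypothetical resource node; sub-resources point at their parent."""
    name: str
    parent: Optional["Resource"] = None
    interval_hours: Optional[int] = None  # only set on top-level resources


def root_of(resource):
    """Walk up the parent chain to the top-level resource."""
    while resource.parent is not None:
        resource = resource.parent
    return resource


def inherited_interval(resource):
    """A sub-resource has no schedule of its own; it uses its root's."""
    return root_of(resource).interval_hours


orders = Resource("orders", interval_hours=6)
order_lines = Resource("order_lines", parent=orders)
print(inherited_interval(order_lines))
```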

Jobs

A Job represents a single collection execution for a connection or resource. Jobs track:

Execution status (Running, Completed, Failed, Scheduled)
Records collected and processing speed
Duration and performance metrics
Error details if the job failed
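
As a rough mental model, a job record could look like the following. The status names come from the list above; the `Job` class, its field names, and the way processing speed is derived (records divided by elapsed seconds) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    SCHEDULED = "Scheduled"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


@dataclass
class Job:
    """Hypothetical job record for one collection execution."""
    status: JobStatus
    started_at: datetime
    finished_at: Optional[datetime] = None
    records_collected: int = 0
    error: Optional[str] = None  # populated only when the job failed

    def duration(self) -> Optional[timedelta]:
        """Elapsed time, or None while the job has not finished."""
        if self.finished_at is None:
            return None
        return self.finished_at - self.started_at

    def records_per_second(self) -> Optional[float]:
        """Average processing speed, or None if it cannot be computed."""
        elapsed = self.duration()
        if elapsed is None or elapsed.total_seconds() <= 0:
            return None
        return self.records_collected / elapsed.total_seconds()
```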

Discovery

Discovery is the automated process where Entegrata:

Connects to your data source
Scans for available resources (tables, views, endpoints, files)
Analyzes schema and structure
Detects changes to existing resources

Discovery runs automatically every 3 hours and can also be triggered manually.
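
Change detection between two discovery runs can be thought of as a diff of schema snapshots. The sketch below assumes a snapshot is a plain dictionary mapping each resource name to its columns and types; that representation and the `diff_snapshots` function are hypothetical and only illustrate the idea.

```python
def diff_snapshots(previous, current):
    """Compare two discovery snapshots of the form {resource: {column: type}}."""
    prev_names, curr_names = set(previous), set(current)
    return {
        # Resources seen for the first time in the current run.
        "added": sorted(curr_names - prev_names),
        # Resources that no longer exist in the source.
        "removed": sorted(prev_names - curr_names),
        # Resources present in both runs whose schema differs.
        "changed": sorted(
            name for name in prev_names & curr_names
            if previous[name] != current[name]
        ),
    }


before = {
    "customers": {"id": "int", "name": "text"},
    "orders": {"id": "int", "total": "decimal"},
}
after = {
    "customers": {"id": "int", "name": "text", "email": "text"},
    "invoices": {"id": "int"},
}
print(diff_snapshots(before, after))
```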

Collection Lifecycle

Connect

Add a new connection by providing authentication credentials and connection details

Discover

Entegrata automatically discovers all available resources in your data source

Configure

Set up collection schedules, load types, and filters for your resources

Collect

Data is automatically synchronized according to your configured schedules

Monitor

Track job status, performance, and data quality through the admin portal

Collection Schedules

Entegrata supports two scheduling approaches. The most common is interval-based scheduling:

Interval-Based Scheduling

Collections run on a regular time interval (e.g., every 6 hours, daily at 2 AM). This is the most common approach for routine data synchronization.
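
To make the two examples concrete, the next run time for each can be computed as follows. This is a sketch under stated assumptions: the function names are hypothetical, time zones are ignored, and Entegrata's actual scheduler may behave differently.

```python
from datetime import datetime, timedelta


def next_interval_run(last_run, every_hours):
    """Next run for an 'every N hours' schedule, measured from the last run."""
    return last_run + timedelta(hours=every_hours)


def next_daily_run(now, hour):
    """Next run for a 'daily at HH:00' schedule, strictly after `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so schedule tomorrow's.
        candidate += timedelta(days=1)
    return candidate


print(next_interval_run(datetime(2024, 1, 1, 0, 0), every_hours=6))
print(next_daily_run(datetime(2024, 1, 1, 3, 30), hour=2))
```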

Load Types

Full Load

Copies all data from the source every time. Use when:

Source data is small
Historical changes aren’t tracked
You need a complete snapshot each time

Incremental Load

Copies only new or changed data since the last collection. Requires:

An incremental load field (like modified_date or updated_at)
A source system that tracks when records change

Incremental loads are faster and more efficient for large datasets.
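
The difference between the two load types can be illustrated with a small in-memory sketch. It assumes the incremental load field is a timestamp, that the highest value seen so far is remembered between runs (often called a watermark), and that rows are matched on a single unique key. The `incremental_load` function and its parameters are hypothetical, not Entegrata's API.

```python
from datetime import datetime


def incremental_load(source_rows, target, unique_key, incremental_field, watermark):
    """Upsert rows changed after `watermark` into `target`; return the new watermark.

    `target` is a dict keyed by the unique key. Passing watermark=None
    copies every row, which behaves like a full load.
    """
    new_watermark = watermark
    for row in source_rows:
        changed_at = row[incremental_field]
        if watermark is not None and changed_at <= watermark:
            continue  # unchanged since the last collection, skip it
        target[row[unique_key]] = row  # insert a new record or replace an old one
        if new_watermark is None or changed_at > new_watermark:
            new_watermark = changed_at
    return new_watermark


table = {}
rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 3)},
]
mark = incremental_load(rows, table, "id", "updated_at", watermark=None)
print(mark, sorted(table))
```

A later run that passes the returned watermark touches only rows whose `updated_at` is newer, which is why incremental loads scale better on large sources.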

Collection Status

Connections and resources can be:

Active - Currently collecting data according to schedule
Inactive - Paused, not collecting data

You can toggle the status of a connection or resource to pause collection temporarily without deleting its configuration.

Getting Started

Managing Connections

Learn how to add, update, and delete data source connections

Managing Resources

Configure resources, set schedules, and manage collection settings

Monitoring Jobs

Track collection progress and troubleshoot issues

Discovery

Understand how Entegrata discovers and tracks your data
