What is the Collector?
The Collector is Entegrata’s core data ingestion system that automatically connects to your data sources, discovers available data, and synchronizes it to your data lakehouse. The Collector manages the entire data pipeline from source systems to your analytics-ready tables.
Key Concepts
Connections
A Connection represents Entegrata’s link to a data source. Each connection includes:
- Authentication credentials - Secure access to your data source
- Collection schedule - How often data is synchronized
- Connection-level settings - Default collection behaviors for all resources
Resources
A Resource is a table, view, API endpoint, or file within a connection that Entegrata collects. Each resource can have:
- Individual collection settings - Override connection-level defaults
- Load type - Full load or incremental synchronization
- Unique keys - Fields that identify unique records
- Filters - Rules to limit what data is collected
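One way to picture how resource-level settings override connection-level defaults is a simple dictionary merge. The sketch below is illustrative only and is not Entegrata's implementation; the setting names (`load_type`, `unique_keys`, `filters`) and the `effective_settings` helper are hypothetical.

```python
from typing import Any

# Hypothetical connection-level defaults (names are illustrative only).
CONNECTION_DEFAULTS: dict[str, Any] = {
    "load_type": "full",
    "unique_keys": [],
    "filters": [],
}


def effective_settings(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return connection defaults with any resource-level overrides applied."""
    merged = dict(defaults)   # copy so the shared defaults are not mutated
    merged.update(overrides)  # resource-level values win over defaults
    return merged


# A resource that overrides the load type and declares a unique key,
# but keeps the connection's default (empty) filter list.
orders = effective_settings(
    CONNECTION_DEFAULTS,
    {"load_type": "incremental", "unique_keys": ["order_id"]},
)
print(orders)
```

Any setting the resource does not override falls through to the connection-level value.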
Resource Hierarchy
Some resources have nested sub-resources (child resources). For example:
- A database table might have related child tables
- An API endpoint might have nested data structures
- A file might contain multiple sheets or sections
Sub-resources inherit collection settings from their parent and cannot be scheduled independently.
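The inheritance rule can be sketched as a small tree in which a child always reads its settings from its parent and refuses to be scheduled on its own. This is a conceptual model under stated assumptions; the `ResourceNode` class, its fields, and the `interval_hours` setting are hypothetical names, not part of Entegrata.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResourceNode:
    """Hypothetical model of a resource and its sub-resources."""
    name: str
    settings: dict = field(default_factory=dict)
    parent: Optional["ResourceNode"] = None
    children: list = field(default_factory=list)

    def add_child(self, name: str) -> "ResourceNode":
        child = ResourceNode(name=name, parent=self)
        self.children.append(child)
        return child

    def effective_settings(self) -> dict:
        # Sub-resources inherit collection settings from their parent.
        if self.parent is not None:
            return self.parent.effective_settings()
        return self.settings

    def schedule(self, interval_hours: int) -> None:
        # Only top-level resources may carry their own schedule.
        if self.parent is not None:
            raise ValueError(
                f"{self.name} is a sub-resource and cannot be scheduled independently"
            )
        self.settings["interval_hours"] = interval_hours


workbook = ResourceNode("report.xlsx", settings={"load_type": "full"})
sheet = workbook.add_child("Sheet1")
print(sheet.effective_settings())
```

Calling `sheet.schedule(6)` raises `ValueError`, while `workbook.schedule(6)` updates the settings that the sheet then inherits.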
Jobs
A Job represents a single collection execution for a connection or resource. Jobs track:
- Execution status (Running, Completed, Failed, Scheduled)
- Records collected and processing speed
- Duration and performance metrics
- Error details if the job failed
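The fields a job tracks can be modeled as a small record. The sketch below assumes processing speed is measured in records per second, which the text does not specify; the `Job` class and its attribute names are hypothetical, and only the four status values come from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    SCHEDULED = "Scheduled"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


@dataclass
class Job:
    """Hypothetical record of one collection execution."""
    status: JobStatus
    records_collected: int = 0
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None
    error: Optional[str] = None  # populated only when the job failed

    @property
    def duration(self) -> Optional[timedelta]:
        if self.started_at is None or self.finished_at is None:
            return None
        return self.finished_at - self.started_at

    @property
    def records_per_second(self) -> Optional[float]:
        # Assumed definition of processing speed: records / elapsed seconds.
        elapsed = self.duration
        if elapsed is None or elapsed.total_seconds() == 0:
            return None
        return self.records_collected / elapsed.total_seconds()


job = Job(
    status=JobStatus.COMPLETED,
    records_collected=12_000,
    started_at=datetime(2024, 1, 1, 2, 0, 0),
    finished_at=datetime(2024, 1, 1, 2, 1, 0),
)
print(job.duration, job.records_per_second)
```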
Discovery
Discovery is the automated process where Entegrata:
- Connects to your data source
- Scans for available resources (tables, views, endpoints, files)
- Analyzes schema and structure
- Detects changes to existing resources
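Detecting changes to an existing resource amounts to comparing the schema seen now with the one recorded earlier. The following is a minimal sketch of that comparison, assuming a schema can be represented as a mapping from column name to type name; the `diff_schema` function and the sample columns are hypothetical.

```python
def diff_schema(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two column-name -> type mappings and report what changed."""
    shared = set(previous) & set(current)
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "type_changed": sorted(c for c in shared if previous[c] != current[c]),
    }


# Schema recorded by an earlier discovery run (illustrative values).
previous = {"id": "int", "name": "text", "amount": "int"}
# Schema found by the current discovery run.
current = {"id": "int", "name": "text", "amount": "decimal", "updated_at": "timestamp"}

changes = diff_schema(previous, current)
print(changes)
```

Here the diff reports `updated_at` as added and `amount` as having changed type, with nothing removed.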
Collection Lifecycle
Collection Schedules
Entegrata supports two scheduling approaches:
Interval-Based Scheduling
Collections run on a regular time interval (e.g., every 6 hours, daily at 2 AM). This is the most common approach for routine data synchronization.
Load Types
Full Load
Copies all data from the source every time. Use when:
- Source data is small
- Historical changes aren’t tracked
- You need a complete snapshot each time
Incremental Load
Copies only new or changed data since the last collection. Requires:
- An incremental load field (like modified_date or updated_at)
- A source system that tracks when records change
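The difference between the two load types can be sketched with a "watermark": the highest value of the incremental load field seen so far. This is a conceptual illustration, not Entegrata's implementation; the function names, the in-memory `target` dictionary, and the sample rows are all hypothetical.

```python
from datetime import datetime


def full_load(source_rows, unique_key):
    """Rebuild the target from scratch: a complete snapshot every time."""
    return {row[unique_key]: row for row in source_rows}


def incremental_load(source_rows, target, unique_key, incremental_field, last_watermark):
    """Upsert rows changed since last_watermark and return the new watermark."""
    new_watermark = last_watermark
    for row in source_rows:
        changed_at = row[incremental_field]
        if last_watermark is not None and changed_at <= last_watermark:
            continue  # not modified since the previous collection
        target[row[unique_key]] = row  # insert or overwrite by unique key
        if new_watermark is None or changed_at > new_watermark:
            new_watermark = changed_at
    return new_watermark


source = [
    {"id": 1, "status": "open", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "status": "open", "updated_at": datetime(2024, 1, 3)},
]
target = {}

# First run: no watermark yet, so every row is collected.
watermark = incremental_load(source, target, "id", "updated_at", None)

# Row 1 changes in the source; the second run picks up only that row.
source[0] = {"id": 1, "status": "closed", "updated_at": datetime(2024, 1, 5)}
watermark = incremental_load(source, target, "id", "updated_at", watermark)
print(target[1]["status"], watermark)
```

The sketch also shows why both requirements matter: without a reliable change timestamp there is nothing to compare against the watermark, and without a unique key a changed row could not replace its earlier version. The strict `<=` comparison is a simplification; rows written with a timestamp equal to the watermark after a run completes would be skipped.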
Collection Status
Connections and resources can be:
- Active - Currently collecting data according to schedule
- Inactive - Paused, not collecting data
