What is the Collector?

The Collector is Entegrata’s core data ingestion system that automatically connects to your data sources, discovers available data, and synchronizes it to your data lakehouse. The Collector manages the entire data pipeline from source systems to your analytics-ready tables.

Key Concepts

Connections

A Connection represents Entegrata’s link to a data source. Each connection includes:
  • Authentication credentials - Secure access to your data source
  • Collection schedule - How often data is synchronized
  • Connection-level settings - Default collection behaviors for all resources

Resources

A Resource is a table, view, API endpoint, or file within a connection that Entegrata collects. Each resource can have:
  • Individual collection settings - Override connection-level defaults
  • Load type - Full load or incremental synchronization
  • Unique keys - Fields that identify unique records
  • Filters - Rules to limit what data is collected

Resource Hierarchy

Some resources have nested sub-resources (child resources). For example:
  • A database table might have related child tables
  • An API endpoint might have nested data structures
  • A file might contain multiple sheets or sections
Sub-resources inherit collection settings from their parent and cannot be scheduled independently.
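
The override and inheritance rules described above can be pictured as a lookup that walks up the hierarchy. The following is a minimal sketch, not Entegrata's implementation: the `Settings` and `Node` classes, their field names, and `effective_settings` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Settings:
    """Hypothetical collection settings; None means 'not set at this level'."""
    load_type: Optional[str] = None        # "full" or "incremental"
    schedule_hours: Optional[int] = None   # hours between collections


@dataclass
class Node:
    """A connection, resource, or sub-resource in the hierarchy (illustrative)."""
    name: str
    settings: Settings = field(default_factory=Settings)
    parent: Optional["Node"] = None


def effective_settings(node: Node) -> Settings:
    """Walk up the hierarchy; the nearest level that sets a value wins."""
    resolved = Settings()
    current: Optional[Node] = node
    while current is not None:
        if resolved.load_type is None:
            resolved.load_type = current.settings.load_type
        if resolved.schedule_hours is None:
            resolved.schedule_hours = current.settings.schedule_hours
        current = current.parent
    return resolved


# Connection-level defaults, one resource override, one sub-resource.
connection = Node("crm_db", Settings(load_type="full", schedule_hours=6))
orders = Node("orders", Settings(load_type="incremental"), parent=connection)
order_lines = Node("order_lines", parent=orders)  # sub-resource sets nothing

print(effective_settings(orders))
print(effective_settings(order_lines))
```

In this sketch the `orders` resource overrides only the load type and keeps the connection's schedule, while the `order_lines` sub-resource resolves to exactly what its parent resolves to.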

Jobs

A Job represents a single collection execution for a connection or resource. Jobs track:
  • Execution status (Running, Completed, Failed, Scheduled)
  • Records collected and processing speed
  • Duration and performance metrics
  • Error details if the job failed
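
As an illustration of the information a job tracks, the sketch below models a job record and derives duration and processing speed from it. The `Job` class and its field names are assumptions made for this example; only the four status values come from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    # The four statuses listed above.
    SCHEDULED = "Scheduled"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative."""
    target: str                          # connection or resource name
    status: JobStatus
    started_at: datetime
    finished_at: Optional[datetime] = None
    records_collected: int = 0
    error: Optional[str] = None          # populated only for failed jobs

    @property
    def duration(self) -> Optional[timedelta]:
        """Elapsed time, or None while the job has not finished."""
        if self.finished_at is None:
            return None
        return self.finished_at - self.started_at

    @property
    def records_per_second(self) -> Optional[float]:
        """Processing speed, or None if the duration is unknown or zero."""
        elapsed = self.duration
        if elapsed is None or elapsed.total_seconds() <= 0:
            return None
        return self.records_collected / elapsed.total_seconds()


job = Job(
    target="orders",
    status=JobStatus.COMPLETED,
    started_at=datetime(2024, 1, 1, 2, 0),
    finished_at=datetime(2024, 1, 1, 2, 5),
    records_collected=30_000,
)
print(job.duration, job.records_per_second)
```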

Discovery

Discovery is the automated process where Entegrata:
  1. Connects to your data source
  2. Scans for available resources (tables, views, endpoints, files)
  3. Analyzes schema and structure
  4. Detects changes to existing resources
Discovery runs automatically every 3 hours and can also be triggered manually.
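
One way to think about the change-detection step is as a comparison between two discovery snapshots. The sketch below assumes a snapshot is a mapping from resource name to a column-to-type mapping; that representation and the `diff_snapshots` function are hypothetical and are not taken from Entegrata.

```python
from typing import Dict, List

Schema = Dict[str, str]        # column name -> type name
Snapshot = Dict[str, Schema]   # resource name -> schema


def diff_snapshots(previous: Snapshot, current: Snapshot) -> Dict[str, List[str]]:
    """Classify resources as added, removed, or changed between two scans."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        name
        for name in current.keys() & previous.keys()
        if current[name] != previous[name]
    )
    return {"added": added, "removed": removed, "changed": changed}


previous_scan: Snapshot = {
    "customers": {"id": "int", "name": "text"},
    "orders": {"id": "int", "total": "decimal"},
}
current_scan: Snapshot = {
    "customers": {"id": "int", "name": "text", "email": "text"},  # new column
    "invoices": {"id": "int"},                                     # new table
}

print(diff_snapshots(previous_scan, current_scan))
```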

Collection Lifecycle

  1. Connect - Add a new connection by providing authentication credentials and connection details
  2. Discover - Entegrata automatically discovers all available resources in your data source
  3. Configure - Set up collection schedules, load types, and filters for your resources
  4. Collect - Data is automatically synchronized according to your configured schedules
  5. Monitor - Track job status, performance, and data quality through the admin portal

Collection Schedules

Entegrata supports two scheduling approaches:

Interval-Based Scheduling

Collections run on a regular time interval (e.g., every 6 hours, daily at 2 AM). This is the most common approach for routine data synchronization.
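
To make the interval idea concrete, the sketch below computes the next run time from the last run and a fixed interval. It assumes that missed intervals are skipped rather than replayed, which may differ from how Entegrata handles overdue collections; `next_run` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def next_run(last_run: datetime, interval: timedelta, now: datetime) -> datetime:
    """Return the first scheduled slot at or after `now`.

    Slots fall on the grid last_run + k * interval. Missed slots are
    skipped (an assumption of this sketch), not replayed.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    candidate = last_run + interval
    if candidate >= now:
        return candidate
    # Jump forward by the number of whole intervals that have elapsed.
    elapsed_intervals = (now - last_run) // interval
    candidate = last_run + elapsed_intervals * interval
    if candidate < now:
        candidate += interval
    return candidate


last = datetime(2024, 1, 1, 0, 0)
every_six_hours = timedelta(hours=6)

print(next_run(last, every_six_hours, now=datetime(2024, 1, 1, 3, 0)))
print(next_run(last, every_six_hours, now=datetime(2024, 1, 1, 13, 0)))
```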

Load Types

Full Load

Copies all data from the source every time. Use when:
  • Source data is small
  • Historical changes aren’t tracked
  • You need a complete snapshot each time

Incremental Load

Copies only data that is new or has changed since the last collection. Requires:
  • An incremental load field (such as modified_date or updated_at)
  • A source system that tracks when records change
Incremental loads are faster and more efficient for large datasets.
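
A common way to implement this pattern is a high-water mark: remember the largest value of the incremental load field seen so far and copy only rows beyond it, using the unique key to overwrite earlier versions of a record. The sketch below shows that general technique with an in-memory dictionary as the target. It is not a description of Entegrata's internals, and `incremental_load` and its parameters are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, Optional, Tuple

Row = Dict[str, Any]


def incremental_load(
    source_rows: Iterable[Row],
    target: Dict[Any, Row],
    unique_key: str,
    incremental_field: str,
    watermark: Optional[datetime],
) -> Tuple[int, Optional[datetime]]:
    """Upsert rows changed after `watermark`; return (rows copied, new watermark)."""
    copied = 0
    new_watermark = watermark
    for row in source_rows:
        changed_at = row[incremental_field]
        # Skip rows that were already collected in an earlier run.
        if watermark is not None and changed_at <= watermark:
            continue
        target[row[unique_key]] = dict(row)  # insert, or overwrite by unique key
        copied += 1
        if new_watermark is None or changed_at > new_watermark:
            new_watermark = changed_at
    return copied, new_watermark


source = [
    {"id": 1, "name": "Ada", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "Grace", "updated_at": datetime(2024, 1, 2)},
]
target: Dict[Any, Row] = {}

# First run: no watermark yet, so every row is copied.
first_count, mark = incremental_load(source, target, "id", "updated_at", None)

# The source changes one record; the second run copies only that row.
source[0] = {"id": 1, "name": "Ada L.", "updated_at": datetime(2024, 1, 3)}
second_count, mark = incremental_load(source, target, "id", "updated_at", mark)

print(first_count, second_count, mark)
```

The strict comparison against the watermark means a row that shares the watermark's exact timestamp but arrives later would be skipped. A real pipeline might re-read a small overlap window and rely on the unique key to deduplicate.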

Collection Status

Connections and resources can be:
  • Active - Currently collecting data according to schedule
  • Inactive - Paused, not collecting data
You can toggle the status to pause collection temporarily without deleting the configuration.

Getting Started