Stitch Documentation
has moved!

Please update your bookmarks to https://www.stitchdata.com/docs

If you're not automatically redirected after 5 seconds, click here.

Selecting & Changing Replication Keys for Database Integration Tables

Important!

  • The info in this doc only applies to database integrations.
  • Mongo Replication Keys work a little differently than other integrations. Please refer to this doc if you’re selecting Replication Keys for a Mongo integration.

When you set a table to use Incremental Replication, you’ll also need to define a Replication Key for that table.

Replication Keys are columns that Stitch uses to identify new and updated data for replication.

As improperly setting Replication Keys can cause data discrepancies, it’s important to understand how they work, what makes a good key, and the gotchas associated with them. In this doc, we’ll cover:

Replication Key Requirements

To use Incremental Replication, a table must contain one of the following column types to be used as the Replication Key:

  • a large integer, which includes BIGINT, INT and MEDIUMINT
  • datetime
  • timestamp

When Stitch replicates your data, it will store the last recorded maximum value in the Replication Key column and compare it against the data source - not what's in your data warehouse - to identify new/updated data. Any row with a Replication Key value greater than or equal to the stored value is where Stitch will begin the next replication attempt.

Replication Keys vs. Primary Keys

When it comes to replicating your data, there are a lot of ‘keys’ involved. It can be difficult to keep them all straight, but aside from Replication Keys, there’s one more you should keep in mind: Primary Keys.

In Stitch, Replication Keys and Primary Keys serve two different purposes:

  • Replication Keys are used during the extraction part of the replication process - or when Stitch is querying your data source - to identify new and updated data for replication.
  • Primary Keys are used during the last step of the replication process, which is when Stitch loads replicated data into your data warehouse. Primary Keys identify unique rows within a table and ensure that only the most recently updated version of that record appears in your data warehouse.

While a column can sometimes be used as both a Replication Key and a Primary Key, these are not necessarily always the same column.

Replication Key Recommendations

While a column only need be an integer, datetime, or timestamp to be a Replication Key, we have some other recommendations:

  • For tables where existing records are updated: We’re big fans of using updated_at or modified_at. This is the best way to ensure that both new records and updates to existing records are captured.

  • For append-only tables: When a table is append-only, the only time the table is updated is when new records are added. Existing records are never updated.

    We recommend using a unique, auto-incrementing integer as the Replication Key for these types of tables. However, a created_at date or timestamp column may also be suitable.

Replication Key Gotchas

Before selecting a Replication Key for a table, there are a few things you should keep in mind:

  • Rows with NULL values in the Replication Key column will only be replicated during the first sync of an integration.
    This means subsequent syncs will not capture rows where the Replication Key is NULL. Stitch uses the Replication Key column to detect new and updated data - without it, data can't be correctly detected and replicated.

  • Auto-incrementing integers are only suitable Replication Keys for append-only tables.
    If you want to use an auto-incrementing integer column as the Replication Key for your table, ensure that the table is append-only. If an auto-incrementing integer is used and existing records are updated, Stitch won't detect the new data because the Replication Key won't change. This can lead to data discrepancies.

  • Mongo Replication Keys work a little differently.
    Replication Keys for Mongo work a little differently than they do for other integrations. Read this doc for more info.

  • Stitch does not capture hard deletes for incrementally-replicated tables.

Changing an Existing Replication Key

Changing an existing Replication Key for a table is simple - just open up the Table Settings page for the table and select the new Replication Key column from the drop-down menu.

When you change a table's Replication Key, Stitch will queue a full re-sync of the table's data. We do this to ensure that there aren't any gaps because of the Replication Key switch.

Replication Keys, Data Discrepancies, & Row Count Impact

Because Replication Keys are so important and can directly contribute to data discrepancies and row count issues if improperly set, we felt the next two points merited their own section:

  • Incorrectly selecting a Replication Key can cause data discrepancies.
    For example: you set the Replication Key for an append-only table to an id column, which is an auto-incrementing integer. Existing rows in this table are updated, but the id column never changes after the record is created. Stitch will not detect the updated values because the id column hasn’t changed.

  • Incorrectly selecting a Replication Key can impact your row count.
    For example: you set the Replication Key column for a table to a BIGINT column that’s used in a boolean fashion, meaning it contains 0s and 1s. Every time the values in this column change, the row will be re-replicated to your data warehouse and count against your monthly limit.

If you encounter a data discrepancy, we recommend you start by verifying that the Replication Method and Key for the table are properly set. For further assistance, check out the Data Discrepancy Troubleshooting Guide.

Additionally, you can also reset Replication Keys for database integrations if needed.

 

Related

Was this article helpful?
0 out of 0 found this helpful

Comments

Questions or suggestions? If something in our documentation is unclear, let us know in the comments!