Stitch Documentation
has moved!

Please update your bookmarks to https://www.stitchdata.com/docs

If you're not automatically redirected after 5 seconds, click here.

BigQuery: Data Replication & Table Structural Changes

Important!
Please note that our BigQuery integration is currently in open beta. The information in this article is subject to change.

As replicated data arrives at your BigQuery project, it will be stored in automatically-created tables. When new data is detected and ready to be loaded, Stitch does a few things depending on the replication method being used.

Replication Staging

Each table inside the integration schemas in your data warehouse contain a few columns prepended with sdc:

  • _sdc_batched_at
  • _sdc_received_at
  • _sdc_sequence
  • _sdc_table_version

Stitch uses these columns to replicate your data. 

Note that if any columns in your data source are prepended with rjm or sdc, Stitch will NOT replicate them to your data warehouse. Stitch reserves all columns with this naming convention for internal use.

Incremental Replication (Append Only)

For SaaS and database tables that are set to incrementally replicate, Stitch replicates data into BigQuery in an append only fashion. Unlike other data warehouses that Stitch supports - such as Amazon Redshift - data that updates existing rows in the table (based on Primary Key information) do NOT overwrite the existing data in the table. Instead, that data is appended to the end of the table.

This means that there can be many different rows in a BigQuery table with the same Primary Key, each representing what the data was at that moment in time. Stitch provides the _sdc_sequence column to help distinguish the age of the records. Let's take a look at an example.

Our example table contains the following rows to start with: 

Name | Age | Type  | Magic? | _sdc_sequence
-----+-----+-------+---------+--------------
Finn | 14  | Human | False   | 1473773877524
Jake | 6   | Dog   | Tru     | 1473773877524

These rows are replicated to BigQuery during the initial sync of the table. Some time later, new rows are added to the table:

Name      | Age | Type     | Magic? | _sdc_sequence
----------+-----+----------+--------+---------------+
Bubblegum | 16  | Princess | True   | 1473776585195
Beamo     | 1   | Game Bot | False  | 1473776585195

During the next replication job, the new rows are appended to the table, bringing the total row count to 4:

Name      | Age | Type     | Magic? | _sdc_sequence
----------+-----+----------+--------+---------------+
Finn      | 14  | Princess | False  | 1473773877524
Jake      | 6   | Dog      | True   | 1473773877524
Bubblegum | 16  | Princess | True   | 1473776585195  // new rows are appended
Beamo     | 1   | Game Bot | False  | 1473776585195

But what happens when an existing row is updated? If Finn's age were updated to 15, the existing row would not be updated.

Instead, a new row would be appended to the end of the table (with a higher sequence number), bringing the total row count to 5.  : 

Name      | Age | Type     | Magic? | _sdc_sequence
----------+-----+----------+--------+---------------+
Finn      | 14  | Human    | False  | 1473773877524
Jake      | 6   | Dog      | True   | 1473773877524
Bubblegum | 16  | Princess | True   | 1473776585195         
Beamo     | 1   | Game Bot | False  | 1473776585195
Finn      | 15  | Human    | False  | 1473827984228  // updated data as new row

Full Table Replication

The lone exception to the append only replication that is performed by Stitch is for database integrations that have tables set to Full Table Replication.  Some SaaS integrations also exhibit this behavior. This method of replication provides a single, point in time copy of the source table.

During the initial replication of a table using Full Table replication, you might see what looks like an “incomplete” table. This is because Stitch is still processing and replicating your data. Once the replication is complete, the table will be marked with a version 1 stamp.

During the next replication attempt, a new version of the table will be created. Each batch is marked with a version stamp as Stitch picks it up. At this point:

  1. All data from the existing table, or prior version, is deleted.
  2. All data from the new version is copied into the new table in one transaction.

Handling Structural Changes

New Column

When data arrives with additional columns that aren’t reflected in the existing BigQuery table, Stitch automatically appends those column to the end of the table structure.

Removed Columns

If a column is removed from the source, data for that column will no longer arrive in BigQuery. Stitch will not remove the column from the table and it will contain NULL values from that point forward. 

What happens when data can’t fit into a table?

You’ll experience some missing data on your end. To troubleshoot, we recommend taking a look at some of the more common causes for dropped (or missing) data:

  • Two or more columns in the table differ only in case OR are the same after replacing spaces and special characters with underscores
  • Data arrives with a datatype that doesn't match the existing datatype of the column
  • Nested data is rejected by BigQuery because of incompatible schema changes within the record
  • Stitch's permissions to your data warehouse are revoked

If you’re stumped or need some help pinpointing the issue, please reach out to support.

Was this article helpful?
0 out of 0 found this helpful

Comments

Questions or suggestions? If something in our documentation is unclear, let us know in the comments!