Stitch Documentation
has moved!

Please update your bookmarks to https://www.stitchdata.com/docs

If you're not automatically redirected after 5 seconds, click here.

Postgres: Handling of Nested Data Structures & Row Count Impact

Important!

  • This article is NOT applicable to Postgres ARRAY and JSON data types. For more info, click here.
  • Our Postgres destination is currently in open beta. The information in this article is subject to change.

To understand how Stitch interprets the data it receives, you need to understand a little bit about JSON.

MongoDB and many SaaS integrations use nested structures, which means each attribute (or column) in a table could have its own set of attributes. Stitch is designed to deconstruct these nested structures into separate tables to easily query the data.

In this article, we'll cover:

JSON Data Structures

When Stitch pulls data from an integration, it's pulling a series of JSON records. JSON records can contain structures called arrays and objects.

Arrays

An array is an ordered collection of values. Values are separated by commas and can be a string (contained in double quotes), numbers, boolean, an object, or another array. Arrays begin with a left square bracket ( [ ) and end with a right square bracket ( ] ).

Here's an example:

{  
   "order_id":"1234",
   "customer_id":"100",
   "line_items":[     // array begins 
      {  
         "product_id":"5008798",
         "price":"5.99"
      }
   ]     // array ends
}

When Stitch receives a nested array - or an array that's inside a JSON record - like the one above, it will "denest" it from the parent structure and create a subtable. We'll go into more detail in a moment.

Objects

An object is an unordered set of name and value pairs; each set is called a property. Objects begin with a left curly bracket ( { ) and end with a right curly bracket ( } ).

{  
   "product_id":"5008798",
   "name":"Awesome Dino Shirt",
"price":"5.99", "attributes":{ // object begins "color":"blue", "size":"large", "type":"clothing", "ounces":"5" } // object ends }

When Stitch receives an object, the properties in the object are "flattened" into the table and columns are created. Columns created from object properties follow this naming convention: [object_name]__[property_name]

If a table were created for the object in the example above, the schema would look like this:

product_id name price attributes__color attributes__size attributes__type attributes__ounces
 5008798 Awesome Dino Shirt  5.99 blue  large clothing 5

Deconstruction of Nested Structures

To give you a better understanding of how Stitch denests records, we'll walk you through an example using a Shopify order record. In this example, the order record is composed of three parts:

  • Core order data
  • Line items
  • Tax lines

Here's what the JSON for our Shopify order looks like:

{  
   "order_id":"1234",
   "created_at":"2015-01-01 00:00:00",
   "customer_id":"100",
   "line_items":[     // line item record begins 
      {  
         "product_id":"5008798",
         "price":"5.99",
         "quantity":"1",
         "tax_lines":[     // tax line record begins
            {  
               "price":"5.99",
               "rate":"0.06",
               "title":"State Tax"
            }
         ]     // tax line record ends
      }
   ]     // line item record ends
}

We can see that this record contains three levels of data due to the nested arrays. Stitch will denest the arrays from the top level record - in this case, the core order info - and create subtables. From this one order record, three tables will be created:

  • orders - This table contains the core order data: order ID, created timestamp, and customer ID.
  • orders__line_items - This table contains the line item info: product ID, price, and quantity.
  • orders__line_items__tax_lines - This table contains the tax line info: price, rate, and title.

Connecting Subtables to Top Level Records

When subtables are created, Stitch will append a few columns to be used as composite keys that enable you to connect subrecords back to their parent. Let's take a look at the schemas for each of the Shopify tables to get a better idea of how this works.

Top Level: Core Order Data

This table contains the order record's Primary Key, order_id.

order_id [pk] created_at customer
1234 2015-01-01 00:00:00 100

Second Level: Line Items

In addition to the attributes in the nested record - in this case, product ID, price, and quantity for line items - Stitch will add these columns to second level tables:

  • _sdc_source_key_[primary_key] - This contains the top level record's Primary Key. In this example, the column would be _sdc_source_key_order_id.
  • _sdc_level_0_id - This forms part of a composite primary key for this row and can be used to associate further down the line nested rows to this parent. This will auto-increment for each unique record in the table, beginning with 0.

    For our Shopify example, the first line item record would be 0, the second 1, the third 2, and so on.

We recommend always joining the top level table to the nested table - this will allow you to avoid queries that may have outdated data.

Here's what the orders__line_items table would look like if we added another line item to the order record:

_sdc_source_key_order_id _sdc_level_0_id product_id price quantity
1234 0 5008798 5.99 1
1234 1 3445689 10.99 1


If we wanted to return all line items for order number 1234, we’d run the following query:

     SELECT * 
     FROM orders__line_items li 
     WHERE _sdc_source_key_order_id = 1234

Third Level: Tax Lines

In addition to the attributes in the nested record - in this case, price, rate, and title for tax lines - Stitch will add these columns to third level tables:

  • _sdc_source_key_[primary_key] - This contains the top level record's Primary Key. In this example, the column would be _sdc_source_key_order_id.
  • _sdc_level_0_id - This is the foreign key for the second level (orders__line_items) table. Combined with the source key (_sdc_source_key_order_id), it can be used to find the parent.
  • _sdc_level_1_id - This forms part of a composite primary key for this row and can be used to associate further down the line nested rows to this parent

    For our Shopify example, the first tax line record would be 0, the second 1, the third 2, and so on.

Here's what the orders__line_items__tax_lines table would look like if we added another tax line record to the order:

_sdc_source_
key_order_id
_sdc_level_0_id _sdc_level_1_id price rate title
1234 0 0 5.99 .06 State Tax
1234 1 0 10.99 .06 State Tax


If we wanted to return all line items and tax lines for order number 1234, we’d run the following query:

     SELECT * 
     FROM orders__line_items li 
     INNER JOIN orders__line_items__tax_lines tl 
     ON tl._sdc_level_0_id = li._sdc_level_0_id
     AND tl._sdc_source_key_order_id = li._sdc_source_key_order_id 
     WHERE _sdc_source_key_order_id = 1234

Impact on Total Row Count

Because Stitch is built to "break apart" nested arrays into separate tables, you can expect to see more rows moving through Stitch than what might be in your database or SaaS integration.

Consider our Shopify example. Order 1234 isn't just a just a single row in our data warehouse. Because Stitch had to denest subrecords and create tables to accommodate these records, we can expect to see more than one row moving through Stitch. Let's take a look:

From the top level record, we have the row in the orders table: 

order_id [pk] created_at customer
1234 2015-01-01 00:00:00 100

From the second level record, we have the rows in the orders__line_items table:

_sdc_source_key_order_id _sdc_level_0_id product_id price quantity
1234 0 5008798 5.99 1
1234 1 3445689 10.99 1

From the third level record, we have the rows in the orders__line_items__tax_lines table:

_sdc_source_
key_order_id
_sdc_level_0_id _sdc_level_1_id price rate title
1234 0 0 5.99 .06 State Tax
1234 1 0 10.99 .06 State Tax

In total, 5 rows will be replicated for one order record.

When setting up your integrations and defining their Replication Frequencies, you should take some time to learn about how the data for that integration is structured and how it's replicated.

Check out the Expected SaaS Data section for detailed info on Replication Methods and the tables you can expect to see. Every Expected Data article links to that service's API documentation, which is the best way to learn about the data for that integration is structured.

RELATED

Was this article helpful?
0 out of 0 found this helpful

Comments

Questions or suggestions? If something in our documentation is unclear, let us know in the comments!