Stitch Documentation
has moved!

Please update your bookmarks to https://www.stitchdata.com/docs

If you're not automatically redirected after 5 seconds, click here.

Redshift: Applying Encodings, Sort, & Distribution Keys

Important!

  • The info in this doc only applies to Redshift data warehouses.
  • The process we outline in this tutorial - which includes dropping tables - can lead to data corruption and other issues if done incorrectly. Please proceed with caution or reach out to us if you have questions.

Want to improve your query performance? In this article, we’ll walk you through how to use encoding, Sort, and Distribution Keys to streamline query processing:

  1. Overview of encodings, Sort, and Distribution Keys
  2. Things to think about
  3. Applying encodings and Keys

Overview

Encodings

Encodings, or compression types, are used to reduce the amount of required storage space and the size of data that’s read from storage. This in turn can lead to a reduction in processing time for queries.

Sort Keys

Sort Keys determine the order in which rows in a table are stored. When properly applied, Sort Keys allow large chunks of data to be skipped during query processing. Less data to scan means a shorter processing time, thus improving the query’s performance.

Distribution Keys

Distribution, or DIST Keys determine where data is stored in Redshift. When data is replicated into your data warehouse, it’s stored across the compute nodes that make up the cluster. If data is heavily skewed - meaning a large amount is placed on a single node - query performance will suffer. Even distribution prevents these bottlenecks by ensuring that nodes equally share the processing load.

Things to Think About

Consider the following before diving in:

  1. Optimizing for every single query isn’t possible. We suggest selecting the most important queries and selecting Sort/Dist Keys that will improve the performance of those queries.
  2. Columns with few unique values aren’t good Sort keys. Because Sort Keys store records together based on similar values, selecting a column with few unique values as the Sort key will heavily skew the data. This will lead to an increase in query processing time.
  3. Tables using Full Table Replication aren’t good candidates for this process. Due to the nature of Full Table Replication, encodings, Sort, and Dist Keys in these tables may be overwritten during the replication attempts that follow application.

Applying Encodings and Keys

Let’s take a look at an example to walk through the application process. Note that the process outlined here can be used across the board to apply encodings and Keys.

We’ll use a table called orders, which is contained in the rep_sales schema. Log into your Redshift database using your SQL tool to get started.

Retrieving the Table Schema

We’ll use this command to retrieve the table schema for orders:

\d+ rep_sales.orders

Which will produce something like this:

Column              | Data Type                  
--------------------+----------------------------+
id                  | BIGINT               
rep_name            | VARCHAR(128)              
order_amount        | BIGINT                     
order_confirmed     | BOOLEAN                    
created_at          | TIMESTAMP                  
_sdc_sequence       | NUMERIC                    
_sdc_received_at    | TIMESTAMP WITHOUT TIMEZONE 
_sdc_batched_at     | TIMESTAMP WITHOUT TIMEZONE 
_sdc_table_version  | BIGINT                     
_sdc_replication_id | VARCHAR(128)               

Creating a Table Copy & Redefining the Schema

In this step, you’ll create a copy of the table, redefine its structure to include the DIST and SORT Keys, insert/rename the table, and then drop the "old" table.

But first, retrieve the table's Primary Key using the following query:

SELECT description FROM pg_description WHERE objoid = 'old_orders'::regclass;             
   // This will retrieve the Primary Key comment: {"primary_keys":["XXXXX"]}

This will be used in the next step to indicate which column(s) are the table's Primary Keys.

Remember: Redshift doesn’t enforce the use of Primary Keys, but Stitch requires them to replicate data. In the following example, you'll see COMMENT being used to note the table's Primary Key.

Make sure you include the Primary Key comment in the next step, as missing Primary Keys will cause issues with data replication.

Here's the transaction we're using in this example:

SET search_path to rep_sales;
BEGIN;
ALTER TABLE orders RENAME TO old_orders;
CREATE TABLE new_orders (
    id bigint,
    rep_name character varying(128) encode bytedict,     // Sets the encoding
    order_amount bigint,
    order_confirmed boolean,
    created_at timestamp without time zone,
    _sdc_sequence numeric(18,0),
    _sdc_received_at timestamp without time zone,
    _sdc_batched_at timestamp without time zone,
    _sdc_table_version bigint,
    _sdc_replication_id character varying(128)
)  distkey (id)     // Sets the DIST Key
   sortkey (id);     // Sets the SORT Key
INSERT INTO new_orders (SELECT * FROM old_orders);

COMMENT ON table new_orders IS '{"primary_keys":["XXXXX"]}';     // Sets Primary Key comment

ALTER TABLE new_orders RENAME TO orders;
ALTER TABLE orders OWNER TO stitch_user;      // Grants table ownership to Stitch
DROP TABLE old_orders;      // Drops the "old" table
END;

Verifying the Table Owner

When you perform this process yourself, make sure that the Stitch Redshift user retains ownership of the table.

If Stitch isn't the owner of the table, issues with data replication will arise.

Verifying the Application

To verify that your changes were applied correctly, retrieve the table’s schema again using this command:

\d+ rep_sales.orders

In this example, if the Keys and encodings were applied correctly, the response would look something like this:

Column              | Data Type                  | Encoding | Distkey | Sortkey
--------------------+----------------------------+----------+---------+---------+
id                  | BIGINT                     | none     | true    | true
rep_name            | VARCHAR(128)               | bytedict | false   | false
order_amount        | BIGINT                     | none     | false   | false
order_confirmed     | BOOLEAN                    | none     | false   | false
created_at          | TIMESTAMP                  | none     | false   | false
_sdc_sequence       | NUMERIC                    | none     | false   | false
_sdc_received_at    | TIMESTAMP WITHOUT TIMEZONE | none     | false   | false
_sdc_batched_at     | TIMESTAMP WITHOUT TIMEZONE | none     | false   | false
_sdc_table_version  | BIGINT                     | none     | false   | false
_sdc_replication_id | VARCHAR(128)               | none     | false   | false
Was this article helpful?
0 out of 0 found this helpful

Comments

Questions or suggestions? If something in our documentation is unclear, let us know in the comments!