in redshift AWS Analytics sql ~ read.

Generating series in Redshift

Redshift in 99% of cases means data warehouse, where generating series (especially date series) is quite a common case. You will be disappointed to find out no information about series in official documentation (the only thing that you will be able to find - "generate_series function is not supported by Redshift"). It is a strange situation that generate_series() not supported in database engine with great demand on such function, but generate_series() implemented in regular Postgres with fewer cases to use.

Actually generate_series() function can be used in Redshift with some limitation, but it is still available. Before I start the explanation of using generate_series() I'm going to make a quick note regarding Redshift specific function support. Redshift is a multinode database: master node is an instance that performs orchestration, while the worker node is performing actions with data. It is not obvious, but the master node supports several additional functions comparing to the worker node. However, you do not find generate_series() function in this list. Nevertheless, this knowledge will help us in the future.

generate_series() function can be used as undocumented function. It can be dangerous, but you should know all the available tools. So let's check it:

    SELECT CURRENT_DATE :: TIMESTAMP - (i * INTERVAL '1 day') AS generated_datetime
    FROM generate_series(1, 50) i

You will get a perfect date series. So let go further and use it with some data:

    with gen as (
    SELECT CURRENT_DATE :: TIMESTAMP - (i * INTERVAL '1 day') AS generated_datetime
    FROM generate_series(1, 50) i
)
  select * from gen
  INNER JOIN user u ON gen.generated_datetime = su.datetime_created

And you will get a not clear error message: Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables. Do you remember my small note regarding fact that leader node has its own set of function. So the error message is telling you that data nodes do not support some function that you use in sql. In such small sql script, it is clear that generate_series() is such kind of function. You can run function generate_series() in any sql that will be calculated using only master node, it means that you can not join or reuse results of generate_series in the same script where you are referring to data tables (such requests will be redirected to worker nodes without support of generate_series()).

How to deal with such limitation? Well, I see two options: use generate_series() to generate data into some new table and then use such table in scripts involving data tables or use some trick to get number series from an already existing table. Take a look:

    SELECT
      DATEADD('day', -n, DATE_TRUNC('month', CURRENT_DATE)) AS generated_date
    FROM (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100) n
    ORDER BY generated_date DESC

You will get perfect date series, the only trick that is used here - using system table stl_scan as a source of number series (yes, it can not be generated directly). Some system tables like stl_scan are populated with technical data right after database creation, so you can be sure (or almost sure) that at any moment you'll at least several rows of data there. Such workaround will perfectly work in case of join with some business data. Let's check it:

    SELECT 
        user_id,
        dt.generated_date
    FROM users us
    LEFT JOIN (
      SELECT
        DATEADD('day', -n, DATE_TRUNC('month', CURRENT_DATE)) AS generated_date
      FROM (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100) n
      ORDER BY generated_date DESC) dt
    ON us.subscription_day >= dt.generated_date and us.cancel_day <= dt.generated_date