<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Developer's Notes]]></title><description><![CDATA[not4j.com - Not just 4 Java blog]]></description><link>http://not4j.com/</link><image><url>http://not4j.com/favicon.png</url><title>Developer&apos;s Notes</title><link>http://not4j.com/</link></image><generator>Ghost 1.16</generator><lastBuildDate>Tue, 21 Apr 2026 21:15:29 GMT</lastBuildDate><atom:link href="http://not4j.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Enhancing DBT Workflows with Dynamic Tag-Based Macros]]></title><description><![CDATA[<div class="kg-card-markdown"><p>In the world of modern data analytics and transformation, the Data Build Tool (DBT) has emerged as a cornerstone for managing SQL-based workflows within data warehouses. A powerful feature of DBT lies in its ability to leverage Jinja, a templating engine for Python, enabling developers to create dynamic and reusable</p></div>]]></description><link>http://not4j.com/enhancing-dbt-workflows-with-dynamic-tag-based-macros/</link><guid isPermaLink="false">65d4c5cb2f326106d7304c1d</guid><category><![CDATA[Analytics]]></category><category><![CDATA[DBT]]></category><category><![CDATA[ETL]]></category><category><![CDATA[Jinja]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Tue, 20 Feb 2024 15:47:20 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>In the world of modern data analytics and transformation, the Data Build Tool (DBT) has emerged as a cornerstone for managing SQL-based workflows within data warehouses. A powerful feature of DBT lies in its ability to leverage Jinja, a templating engine for Python, enabling developers to create dynamic and reusable code snippets. I will cover one such example:
a macro I crafted to collect DBT models based on tags. In this article, we'll delve into the significance of DBT tags, explore why you sometimes need to execute code within DBT macros, and outline use cases where the <code>get_models_by_tag</code> macro proves invaluable.</p>
<h3 id="understandingdbttags">Understanding DBT Tags</h3>
<p>First things first: what are DBT tags? DBT tags are metadata labels assigned to DBT resources, including models, sources, and seeds. They provide a means of organizing and categorizing these resources by criteria such as functionality, departmental ownership, or business unit. Tags bring clarity and structure to DBT projects, aiding documentation, testing, deployment, and monitoring.</p>
<p>A tag can be set really simply:</p>
<pre><code>{{ config(
    tags=[&quot;finance&quot;]
) }}

select * from ....
</code></pre>
<p>As a result, you can perform any DBT operation over the group of models defined by a tag. Let's say you need to rebuild all models dedicated to finance data:</p>
<pre><code>dbt run --select tag:finance
</code></pre>
<p>Or run your hourly pipeline:</p>
<pre><code>dbt run --select tag:hourly
</code></pre>
<p>That is super flexible and easy.</p>
<h3 id="dbttagsandjinjamacro">DBT Tags and Jinja macro</h3>
<p>But what if you need to perform some action over tagged models from within a DBT Jinja macro? Let's say the pipeline has built some new models, and you want a macro that performs a complex cleanup of a temporary table, something that cannot be done by a DBT <code>post-hook</code>. Sorry, you cannot collect a list of models by tag from macro code... until you create your own macro.</p>
<pre><code>{% macro get_models_by_tag(tag) %}
    {% set models = [] -%}
    {% if execute %}
        {% for node in graph.nodes.values()
            | selectattr(&quot;resource_type&quot;, &quot;equalto&quot;, &quot;model&quot;)
            | selectattr(&quot;config.enabled&quot;, &quot;equalto&quot;, True) %}
            {% if tag in node.tags %}
                {% do models.append(node.name) %}
            {% endif %}
        {% endfor %}
    {% endif %}
    {{ return(models) }}
{% endmacro %}
</code></pre>
<p>You can collect the full list of models from the <code>graph</code> object, which is populated during <code>dbt compile</code>/<code>dbt run</code> and stored in the manifest file. Next, we filter out disabled nodes and keep only <code>model</code> resources. The last step is to check whether the target tag exists in each node's tag list. And that is it.</p>
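<p>For readers more comfortable in Python, the same filtering logic can be sketched against a mocked-up graph structure. Note that the node layout below is a simplified assumption for illustration, not the full DBT manifest schema:</p>

```python
# Minimal sketch: filter a DBT-like graph of nodes by tag.
# The node dictionaries below are a simplified assumption of what
# dbt exposes via graph.nodes; the real manifest has many more fields.

def get_models_by_tag(graph_nodes, tag):
    """Return names of enabled models that carry the given tag."""
    return [
        node["name"]
        for node in graph_nodes.values()
        if node["resource_type"] == "model"
        and node["config"]["enabled"]
        and tag in node["tags"]
    ]

# Hypothetical graph content for illustration only.
graph_nodes = {
    "model.proj.fct_sales": {
        "name": "fct_sales", "resource_type": "model",
        "config": {"enabled": True}, "tags": ["finance", "hourly"],
    },
    "model.proj.stg_users": {
        "name": "stg_users", "resource_type": "model",
        "config": {"enabled": True}, "tags": ["staging"],
    },
    "model.proj.old_model": {
        "name": "old_model", "resource_type": "model",
        "config": {"enabled": False}, "tags": ["finance"],
    },
}

print(get_models_by_tag(graph_nodes, "finance"))  # ['fct_sales']
```

<p>The disabled <code>old_model</code> is skipped even though it carries the <code>finance</code> tag, mirroring the <code>config.enabled</code> filter in the Jinja macro above.</p>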
</div>]]></content:encoded></item><item><title><![CDATA[Subscription Analytics: the little things]]></title><description><![CDATA[<div class="kg-card-markdown"><p><strong>Originally that article was <a href="https://medium.com/p/9195873929ce">posted</a> at my <a href="https://medium.com/@Not4j">blog</a> on Medium.</strong></p>
<p><em>I have spent 5+ years in the subscription business. Mainly, I was building in-house analytics from scratch. Now I found more and more projects that try to fulfill empty areas of services for the subscription business. Some people ask me</em></p></div>]]></description><link>http://not4j.com/subscription-analytics-the-little-things/</link><guid isPermaLink="false">63b6851b2f326106d7304c11</guid><category><![CDATA[subscription]]></category><category><![CDATA[Analytics]]></category><category><![CDATA[DWH]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Thu, 05 Jan 2023 08:11:15 GMT</pubDate><media:content url="http://not4j.com/content/images/2023/01/subscription.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="http://not4j.com/content/images/2023/01/subscription.jpg" alt="Subscription Analytics: the little things"><p><strong>Originally that article was <a href="https://medium.com/p/9195873929ce">posted</a> at my <a href="https://medium.com/@Not4j">blog</a> on Medium.</strong></p>
<p><em>I have spent 5+ years in the subscription business. Mainly, I was building in-house analytics from scratch. Now I see more and more projects trying to fill the empty niches in services for the subscription business. Some people ask me to share some insights, and the main question is usually: why do subscription businesses have to build in-house data solutions? Let me try to answer that question here.</em></p>
<p>How complex the analytics stack in a subscription business can be was <a href="https://www.linkedin.com/pulse/subscription-engineer-13-data-analytics-everyone-andrei-rebrov/">covered</a> by Andrei Rebrov, CTO of Scentbird. However, that is a high-level overview of the tools subscription services are using. Let’s go a little deeper into the technical details.</p>
<p>Let’s start with a really simple statement: a subscription business is by its nature more complex than regular e-commerce. Why?</p>
<p>A subscription business does not create its revenue in one purchase. In most cases, a single sale does not even cover the cost of acquiring the customer. The idea of a subscription business is to attract one-time customers and turn them into regular, loyal ones. To do that, subscription businesses have to make every purchase quite cheap (compared to regular e-commerce) and offer the customer value over time: discounts, exclusive products, fast shipping, good support, etc. You cannot just take money for one purchase and forget about the client because you have already made your profit. It is a relationship with a history, and a subscription business puts a lot of effort into it.<br>
A subscription business has to offer a product really different from a regular shop’s. E-commerce can offer just about everything you could find in a grocery store, but with delivery to the door, and it will work. For a subscription service, being successful sometimes means the product does not exist yet. You have to invent it (or smartly combine existing products), promote it, sell it, and support it. If we dig into the term “invent”, it can mean setting up supply chains of raw ingredients for the producers who craft that product for you, and organizing delivery, sorting, and storage of your unique product.<br>
I hope these two points show how complex the back office of a subscription service can be. It is not just reselling.</p>
<p>Ok, now we can go a little deeper into the issues of analytics for subscription services.</p>
<h2 id="users">Users</h2>
<p>How do you know how many users your subscription service has? It is the most basic metric, but it is not simple at all. Let’s compare a regular e-commerce service to a subscription one in terms of the calculation of users.</p>
<p>E-commerce service metrics about users (feel free to point out anything I missed):</p>
<p>How many users make a purchase?<br>
How many users sign up, and how many purchase without signing up?<br>
How many users asked for a refund?</p>
<p>Subscription service metrics about users:</p>
<p>How many users sign up?<br>
How many make the first purchase?<br>
How many make a second, third, …, nth purchase?<br>
How many users use a trial subscription (most services have one)?<br>
How many users asked for a refund?<br>
How many users canceled subscriptions intentionally?<br>
How many users canceled subscriptions unintentionally (e.g. the payment system reports that a credit card is not chargeable)?<br>
How many users could not pay for the next subscription cycle because of technical issues with payments?<br>
…<br>
More and more questions can appear here. Before getting the total number of subscribers the service has this month, we need to answer every question on that list and process all the corner cases (e.g. is a user with a trial subscription already a subscriber, or does he become one after the first real charge? is a user who asked for a refund of the first purchase a valid subscriber?). The mechanics have a lot of moving parts, and you should take all of them into account.</p>
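<p>To make the corner cases concrete, here is a small Python sketch of one possible counting policy. The field names and the policy itself are illustrative assumptions (trials count as subscribers, refunded first purchases do not), not a standard:</p>

```python
def count_active_subscribers(users):
    """Count subscribers under one illustrative policy:
    trial users count, users whose only purchase was refunded do not,
    and cancelled users are excluded."""
    active = 0
    for u in users:
        if u.get("cancelled"):
            continue  # intentional or unintentional cancellation
        if u.get("on_trial"):
            active += 1  # policy choice: trials count as subscribers
        elif u.get("paid_purchases", 0) > 0 and not u.get("first_purchase_refunded"):
            active += 1
    return active

# Hypothetical user records for illustration only.
users = [
    {"on_trial": True},                                      # counts
    {"paid_purchases": 3},                                   # counts
    {"paid_purchases": 1, "first_purchase_refunded": True},  # excluded
    {"paid_purchases": 2, "cancelled": True},                # excluded
]
print(count_active_subscribers(users))  # 2
```

<p>Change any of those policy choices (do trials count? do refunds disqualify?) and the headline subscriber number changes too, which is exactly why the metric is not simple.</p>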
<h2 id="paymentsandtaxes">Payments and taxes</h2>
<p>Payment is one of the most complex aspects of a subscription service. The root of this lies in the nature of a subscription service: you do not get a product right after payment; you make an advance payment for a product you will get next week or next month. From a technical perspective it is the same (you pay money and get some product), but from a financial perspective it is not.</p>
<p>If a user pays you on the 29th of the current month and your delivery happens next month, how should you treat that payment? Is it revenue for the current month or for the next month? Is it revenue at all, or just an advance payment? If the user changes his mind and wants his money back before shipment, but already in the next month, how should you amend the finance stats for the previous month? You could say regular e-commerce services have all these problems too. Correct, but for them it is a one-time headache. In a subscription service, a user can pay for 6 or 12 months in advance, and financial reports should be shaped to process such advance payments properly over a long period of time for a big number of users.</p>
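<p>The advance-payment problem can be illustrated with a deferred-revenue sketch: a prepayment is recognized month by month over the period it covers, not at charge time. This is a deliberately simplified straight-line model that ignores taxes, refunds, and price changes:</p>

```python
from datetime import date

def recognize_monthly(payment_amount, months, start: date):
    """Spread an advance payment evenly over the subscription months
    it covers (straight-line recognition; taxes, refunds and price
    changes are ignored for simplicity)."""
    per_month = round(payment_amount / months, 2)
    schedule = {}
    year, month = start.year, start.month
    for _ in range(months):
        schedule[f"{year}-{month:02d}"] = per_month
        month += 1
        if month > 12:  # roll over into the next year
            month, year = 1, year + 1
    return schedule

# A hypothetical $120 payment covering six months starting in February:
schedule = recognize_monthly(120.0, 6, date(2023, 2, 1))
print(schedule["2023-02"])  # 20.0
```

<p>Even this toy version shows the bookkeeping burden: the cash arrives once, but six monthly reports each need their $20 slice, and a mid-period refund would force all remaining slices to be recomputed.</p>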
<p>Moreover, during a 12-month period you can hit some extra corner cases: the price of the subscription plan changes, a partial refund is issued for some shipments, or some bonus shipments are done for free during that period (so those shipments should not affect finance reports). It is a very, very complex area. And taxes… Let’s just skip them, as they raise many more questions on top of every question mentioned above.</p>
<h2 id="supplychainsandlogistics">Supply chains and logistics</h2>
<p>You have a unique product, remember? It should also be taken into account in analytics and finance reports. You have to pick up knowledge about an area that regular e-commerce does not face. You might be able to outsource all that activity to some 3rd-party company, and it will work up to a certain level. But then you could face several issues:</p>
<p>Your product is so unusual that most logistics companies have no tools or ability to work with it efficiently. For example, Scentbird deals with perfume. Perfume is flammable: transportation, sorting, and storing have to be done in a very special way. Most 3rd-party companies do not deal with flammable products in the enormous amounts a subscription service does.<br>
The number of monthly shipments is so huge that the regular way of processing shipments is not efficient. A process that manages to pack and ship 10k packages of any kind of product monthly cannot be scaled up to 100k packages just by increasing headcount 10x.<br>
If you produce your own product from scratch, the supply chains you have built may be good for producing a small amount of product, but once again they do not scale well. You can outsource to a bigger company that can handle it (and spend more and more money), or reinvent all the processes and dive into that area yourself.</p>
<h2 id="intheend">In the end</h2>
<p>There are tons of minor issues in subscription services, and all of them make a big difference. Any big subscription business is unique and heavily customized. That means they cannot all share the same boxed analytics solution.</p>
<p>Don’t get me wrong: there are still a lot of common areas that can be served by some 3rd-party service, like marketing analytics or website analytics. However, there can hardly ever be a tool flexible enough to be used by any subscription service when it comes to finance.</p>
<p>P.S. You can always reach me for <strong>consulting</strong> in these areas: <strong>subscription services, analytics, and big data architecture</strong>.</p>
</div>]]></content:encoded></item><item><title><![CDATA[What we are missing in data CI/CD pipelines?]]></title><description><![CDATA[<div class="kg-card-markdown"><p><strong>Originally that article was <a href="https://medium.com/@Not4j/what-we-are-missing-in-data-ci-cd-pipelines-c3d7f02e0894">posted</a> at my <a href="https://medium.com/@Not4j">blog</a> on Medium.</strong></p>
<p><em>I told the <a href="https://medium.com/p/39b96980f905">story</a> of how we set up our CI/CD with a data pipeline using classical CI/CD tools — Gitlab previously. As a final idea, I just put a note that Gitlab is not a great tool for</em></p></div>]]></description><link>http://not4j.com/what-is-not-ok-in-de/</link><guid isPermaLink="false">61a0c5aa2f326106d7304bd3</guid><category><![CDATA[Snowflake]]></category><category><![CDATA[redshift]]></category><category><![CDATA[DBT]]></category><category><![CDATA[CI]]></category><category><![CDATA[CI/CD]]></category><category><![CDATA[DWH]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Thu, 05 Jan 2023 08:04:57 GMT</pubDate><media:content url="http://not4j.com/content/images/2023/01/pipeline.jpeg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="http://not4j.com/content/images/2023/01/pipeline.jpeg" alt="What we are missing in data CI/CD pipelines?"><p><strong>Originally that article was <a href="https://medium.com/@Not4j/what-we-are-missing-in-data-ci-cd-pipelines-c3d7f02e0894">posted</a> at my <a href="https://medium.com/@Not4j">blog</a> on Medium.</strong></p>
<p><em>I previously told the <a href="https://medium.com/p/39b96980f905">story</a> of how we set up CI/CD for a data pipeline using a classical CI/CD tool, Gitlab. As a final thought, I noted that Gitlab is not a great tool for that job: you have to create several workarounds. That statement is quite general and does not fit the concept of “Offering Solutions Rather Than Criticism”. I’m not able to provide a solution at the moment. However, the first step in creating a solution is understanding the cons of the current one and how it could be improved. So I took a pause to think about what data engineers are missing in CI/CD tools.</em></p>
<p>Let’s start with a setup that we will be talking about:</p>
<ul>
<li>batch data processing: new data arrives in some period of time and there is a way to deal with new data only;</li>
<li>quite heavy and complex data transformation: it takes a lot of time to process;</li>
<li>there are several contributors/developers for data models, so there is a need for a test environment (or several environments);</li>
</ul>
<h2 id="incrementalbuild">Incremental build</h2>
<p>At first, this may sound a little fantastic, but hold on for a while. Let’s say a full rerun of the data model takes 30–40 minutes. It is not much fun to develop a new data model if you have to wait that long to see the result. Tools like DBT have a way to select only a particular model to run:</p>
<p><code>dbt run -t prod --model some_test_model</code></p>
<p>Or to rebuild a model together with all dependent models:</p>
<p><code>dbt run -t prod --model some_test_model+</code></p>
<p>That simple feature helps you speed up development. I’m pretty sure everyone uses it to check the result of their work. Why can’t it be used in the CI/CD process? Actually, it can. You need to create a script for your current CI/CD system that detects the list of files changed in a commit, keeps only the files related to data models, and runs every model one by one, including dependent models.</p>
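<p>Such a script could look roughly like the sketch below. The project layout (<code>models/*.sql</code>) and the input list are assumptions; a real script would get the file list from <code>git diff --name-only</code> and shell out to <code>dbt</code>:</p>

```python
def dbt_commands_for_commit(changed_files, target="prod"):
    """Turn a list of files changed in a commit into naive per-model
    `dbt run` commands (dependent models included via the `+` suffix)."""
    commands = []
    for path in changed_files:
        # Assumed project layout: models live under models/ as .sql files.
        if path.startswith("models/") and path.endswith(".sql"):
            model = path.rsplit("/", 1)[-1][:-len(".sql")]
            commands.append(f"dbt run -t {target} --model {model}+")
    return commands

# Hypothetical commit contents:
changed = ["models/marts/data_model1.sql", "README.md",
           "models/marts/data_model2.sql"]
print(dbt_commands_for_commit(changed))
```

<p>Non-model files like <code>README.md</code> are ignored; each changed model gets its own run command, which is exactly the naive behavior discussed next.</p>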
<p>In the case of big changes, there is a high chance of running the same model several times during one CI/CD cycle. Let’s imagine a commit with 2 files:</p>
<pre><code>data_model1
          |_ data_model2
</code></pre>
<p>data_model2 is dependent on data_model1, so both could be covered with one run only:</p>
<p><code>dbt run -t prod --model data_model1+</code></p>
<p>However, with the naive approach we will get 2 runs, because of the lack of knowledge about the relationship between those models:</p>
<pre><code>dbt run -t prod --model data_model1+
dbt run -t prod --model data_model2+
</code></pre>
<p>We could rework the initial approach to take that into account. Let’s say we keep a table that maps the hash of the current version of every data model file (calculated from the file’s content) to the corresponding table in our data warehouse:</p>
<pre><code>&lt;hash1&gt;:table1
&lt;hash2&gt;:table2
... 
</code></pre>
<p>So every time a new version arrives in git, you can compare its hash against that table and figure out that the current table was built from a previous version of the data model. It should then be recreated along with all its children. Hashcodes can also be recalculated after every rebuild of a model. So in the case above, after the run:</p>
<p><code>dbt run -t prod --model data_model1+</code></p>
<p>you will get new hashcodes:</p>
<pre><code>&lt;new_hash1&gt;:table1
&lt;new_hash2&gt;:table2
...
</code></pre>
<p>So before running the rebuild process for data_model2, it is possible to see that table2 already has a new hashcode and that run would be useless.</p>
<p>Anyway, you could come up with some other strategy, but it is still possible to implement this approach with a bash script. However, it requires some effort, quite deep knowledge of the CI/CD tool’s API, and may require you to invent several workarounds. Still, it is doable…</p>
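<p>The hash bookkeeping itself is small. Here is a hedged sketch where the <code>&lt;hash&gt;:table</code> mapping lives in an in-memory dict; how the mapping is stored and where file contents come from are assumptions of this sketch:</p>

```python
import hashlib

def file_hash(content: str) -> str:
    """Content hash of a model file."""
    return hashlib.sha256(content.encode()).hexdigest()

def models_needing_rebuild(model_files, built_hashes):
    """Compare current file hashes against the stored hash-to-table
    mapping and return the models whose built table is stale."""
    return [
        name for name, content in model_files.items()
        if built_hashes.get(name) != file_hash(content)
    ]

# Hypothetical state: table2's model file was edited since its last build.
built = {"table1": file_hash("select 1"), "table2": file_hash("select 2")}
current = {"table1": "select 1", "table2": "select 2 -- edited"}
print(models_needing_rebuild(current, built))  # ['table2']
```

<p>After the rebuild of table2 you would write its new hash back into the mapping, so a later step in the same CI/CD cycle can see the table is already fresh and skip it.</p>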
<p>So why couldn’t the system of our dreams have that feature built in? Just provide adapters for the most popular data transformation tools and implement the incremental-build logic under the hood.</p>
<h2 id="syncofdataacrossdifferentenvironments">Sync of data across different environments</h2>
<p>Working with data involves the same stages as classic software development. You develop some approach, test it, and deliver it to production. It is similar from a high-level perspective, but there are a lot of differences at the ground level. One of the most significant: in most cases, you cannot develop new data models and features locally. This is caused by 2 main reasons:</p>
<p>the amount of data you need to develop your approach can be quite big and will not fit into local storage;<br>
the most popular DW engines are available only as cloud-based solutions.</p>
<p>That makes it almost impossible to have your own development environment. The obvious way to resolve it is to have a separate cluster/database/schema for every environment. That solution introduces a new issue: how do you sync the data into stage/test environments? How do you make it smooth and easy? It really depends on the DW engine you are using.</p>
<p>E.g. Redshift requires you to make a full cluster clone, and that data cloning takes quite a long time. It also requires paying for another cluster of a size similar to the production one, to be able to fit all the data. There is no internal tool to do it in one click; it is done by dump and copy commands running within the prod Redshift cluster. It is also obvious that cloning data in Redshift is better done during low-activity hours, so that users are not affected by the dump command on production.</p>
<p>On the other hand, Snowflake has a great internal feature for getting an extra environment, called ‘zero-copy clone’. In short, it just makes a new data “branch” that tracks changes in the new environment separately rather than copying the whole dataset. The cloning operation happens really fast, does not require you to pay extra money, and can be done with one command.</p>
<p>As you can see, there are totally different ways to clone the data in different engines, so each needs a different way to interact with it. Why couldn’t our CI/CD tools have adapters for different DW engines, offering different strategies to create environments?</p>
<p>If we go deeper, we can get to a smarter approach. What if you need to refresh, during working hours, only the one table you are working with? It looks easy: just copy one table. But in the case of data models, that could lead to inconsistency with connected tables. The way to deal with it is to include all related tables in that refresh activity.</p>
<p>For example, the DBT tool is able to build a DAG that holds such information. So let’s imagine our CI/CD tool has a “short refresh flow” that refreshes only the target data, and we are using DBT. That flow could be implemented by collecting the list of tables connected to the target data via a DBT command (getting the data from the DBT DAG) and then using the DW adapter to perform the “refresh data” activity on that limited set of data models only.</p>
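<p>The “short refresh flow” boils down to collecting the connected tables from the DAG. Here is a toy sketch with a hand-written dependency graph; in practice you would get this information from <code>dbt ls</code> or the manifest rather than hard-code it:</p>

```python
def refresh_set(dag, target):
    """Collect the target model plus everything it depends on,
    so a partial refresh stays consistent with connected tables."""
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag.get(node, []))  # walk upstream dependencies
    return seen

# Hypothetical DAG: model -> list of upstream models it selects from.
dag = {
    "daily_income": ["sales", "refunds"],
    "sales": ["raw_orders"],
    "refunds": [],
}
print(sorted(refresh_set(dag, "daily_income")))
```

<p>Refreshing everything in that set, rather than <code>daily_income</code> alone, is what prevents the inconsistency with connected tables described above.</p>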
<h2 id="abilitytocheckresultsofdatatestsduringcicd">Ability to check results of data tests during CI/CD</h2>
<p>If you are using data tests, you know how important they are! But a failed test is not just a trigger for a developer to take action. It does not work the same way as in the classical software development flow (get an error-&gt;fix it-&gt;run it again). In data engineering, a failed data test sometimes means fixing not only the root cause but also the side effects (this is actually often the case in classical software development too, but there it frequently goes undetected). In the hardest cases, an error in input data can force a rerun of the whole pipeline, starting from reimporting fresh source data. Not all data test failures are so critical, so you should be able to react in different ways.</p>
<p>It would be great to have the ability to use failed tests as triggers for actions. Let’s say you have a classical data pipeline that delivers new daily data in the morning. You have imported new source data (e.g. the list of yesterday’s sales) and generated new business data from it (the daily income amount) for the current day. Afterward, you run the data tests to check for issues like null or unexpected values in the business data. A test detects an issue and you need to investigate it. That takes some time, so it is better to wipe out the generated data until you can provide the correct version. It is always better to make your users wait a little without the data than to let them use incorrect data and make incorrect decisions.</p>
<p>The best way to avoid that situation is to have a “pre-prod” environment: run the generation process and the tests there, check that there are no issues with critical tables, and only then trigger the generation process on prod. However, even that is not 100% protection, because it is really hard to create an exact copy of the production layout. In that case, you could use a second line of defense: the CI/CD pipeline checks the test results of the production data generation and, if there are issues with particular tables, automatically starts the rollback process.</p>
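<p>That second line of defense could be sketched as a small check over test results. The result structure below is a simplified assumption (real DBT test output in <code>run_results.json</code> is richer), and which tables count as “critical” is a policy choice of this sketch:</p>

```python
def tables_to_rollback(run_results, critical_tables):
    """Scan simplified data-test results and return the critical
    tables whose tests failed, so a rollback can be triggered for them."""
    failed = set()
    for result in run_results["results"]:
        if result["status"] == "fail" and result["table"] in critical_tables:
            failed.add(result["table"])
    return sorted(failed)

# Simplified, hypothetical test results:
run_results = {"results": [
    {"table": "daily_income", "status": "fail"},
    {"table": "stg_events", "status": "fail"},
    {"table": "sales", "status": "pass"},
]}
print(tables_to_rollback(run_results, {"daily_income", "sales"}))
```

<p>A failure in a non-critical table like <code>stg_events</code> would be reported but would not trigger the rollback, which is the “react in different ways” behavior argued for above.</p>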
<p>How could it be done? Once again, by some adapter module in the CI/CD pipeline that understands the DBT testing process or the Great Expectations framework.</p>
<h2 id="intheend">In the end</h2>
<p>As you can see, there are several reasons why it is hard to fit classical CI/CD processes to a data pipeline. I believe the data engineering community is waiting for a custom-shaped CI/CD tool. Did I miss something you would want from such a tool?</p>
</div>]]></content:encoded></item><item><title><![CDATA[Scentbird Analytics 2.0. Migrate from Redshift to Snowflake. Redesign ETL process.]]></title><description><![CDATA[<div class="kg-card-markdown"><p>As some of you may know I'm building analytics in Scentbird and my <a href="http://not4j.com/high-load-2019/">initial steps</a> in building ETLs and choosing a data warehouse engine was not perfect. Initially, I'm a Java developer and I start that journey having zero experience in building data stack, I hope it gives me an</p></div>]]></description><link>http://not4j.com/migrate-from-redshift-to-snowflake-pros-and-cons-redesign-etl-process/</link><guid isPermaLink="false">6179753e2f326106d7304bbf</guid><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Fri, 05 Nov 2021 18:01:55 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>As some of you may know, I'm building analytics at Scentbird, and my <a href="http://not4j.com/high-load-2019/">initial steps</a> in building ETLs and choosing a data warehouse engine were not perfect. I'm originally a Java developer, and I started that journey with zero experience in building a data stack; I hope that gives me an excuse for committing some mistakes (by the way, it is great that building analytics systems requires both skill sets, and I get challenges like building micro-services at the same time as creating another data pipeline).</p>
<p>Ok, let me show you what we had in our data stack and what issues we met. Our data architecture at the beginning of 2019:<br>
<img src="http://not4j.com/content/images/2021/11/step5-1-small.png" alt="step5-1-small"></p>
<p>Some side notes regarding that schema:</p>
<ol>
<li>We have quite a classic ETL process based on AWS Glue jobs: extract data from the Postgres read-only replica, load it into the memory of a Spark cluster, transform it, and load it into the target warehouse.</li>
<li>Once we discovered the DBT tool, it appeared that most of the transformations could be done in DBT, but we still unwrap JSON with Glue jobs (due to Redshift limitations).</li>
<li>We had a legacy self-built analytics portal. As Looker started to cover all business needs, we used it less and less.</li>
</ol>
<p>It was built pretty straightforwardly to cover the basic need for analytics data. But since then, we have seen a dramatic increase in users (2020 pushed everyone online) and in the amount of data we have to collect. Moreover, we have accumulated some issues with the current stack.</p>
<p><strong>Redshift</strong></p>
<p>Let's start with the backbone of our stack: Redshift. What disadvantages did we accumulate?</p>
<ol>
<li><strong>Computation power and storage space melted into one entity.</strong> You get a Redshift node with a fixed storage amount, and you have no way to scale computation power without buying another full node with extra, useless disk space, and vice versa. We were using <code>DC2.large</code> nodes with fixed 160Gb SSD storage (the current price is ~$180 per node per month). You have to live with that until you become big enough to switch to <code>RA3.xlplus</code> nodes, which cost ~$800 per node per month (and you are required to buy at least 2 nodes to unlock the ability to manage storage separately from computation power). As a result, if you spend less than $1600 per month, you have to buy an additional <code>DC2</code> node to extend your storage space, or find some way to unload your data to cold storage like Redshift Spectrum. In most cases we did not utilize more than ~30-40% of our cluster's computation power; however, our storage capacity sat above 85% all the time. In effect, we were buying disk space at a huge markup.</li>
<li><strong>Scaling and cloning a fixed cluster is quite a slow operation.</strong> Redshift does a simple copy when scaling or cloning your cluster. No tricks, except one: no downtime, just read-only mode. If you'd like to add another <code>DC2</code> node, you stay in read-only mode until the new cluster is created and all data is copied there. In our case, it took almost 2 hours to scale the cluster. If you live with batch data processing, that is quite OK. What is trickier is cloning production data to the staging cluster. Waiting 2 hours until it completes... No, we created a workaround: the full copy is done during non-working hours, and if you need to refresh a table during development, you run a script to dump and copy only the target table. That takes less time, but it was still not suitable in the case of big tables.</li>
<li><strong>Hard to develop data models in parallel.</strong> Taking into account points #1 and #2, I'd say it was not so easy to develop new data models in parallel. Ideally, every developer should have an isolated copy of the data. In the case of Redshift, that multiplies your cost: to create one data copy per developer you need space, and you remember what that means, yes, buying additional nodes to provide the space. Another option is to deal with limited dumps, but it is really hard (we have almost 300 data models and about the same number of tables) to create one tailored to your task and pull in all the required tables. You would have to create a special dump for every task, which is also no fun. The simplest way we found was a shared dev environment. However, you sometimes get extra issues during parallel development, and you end up debugging problems caused by clashes with teammates.</li>
<li><strong>Poor ability to process JSON.</strong> Redshift is really, really bad at processing JSON documents. First of all, there are only a few JSON functions in Redshift, and the syntax is ugly. Secondly, JSON is not processed as fast as you might expect. And lastly, Redshift is not able to access all fields of JSON in the case of tricky JSON objects with a bunch of internal arrays and objects. It can be worked around with data preprocessing (as we do JSON unwrapping in AWS Glue), but given the current trend of moving from ETL to ELT, that is not a popular solution.</li>
</ol>
<p><strong>AWS Glue</strong></p>
<p>AWS Glue allows you to easily create ETL jobs inside the Amazon ecosystem via a UI. Back in 2017, it looked like a good solution to start with. But let me focus on AWS Glue and the disadvantages we discovered later.</p>
<ol>
<li><strong>Glue is primarily a UI tool. Generated code is redundant and repeats in every job you create.</strong> For sure, you can generate most of the code via the UI, but as a result you get something that looks like these <a href="https://github.com/aws-samples/aws-glue-samples/tree/master/examples">examples</a>: pretty complex for a simple Extract and Load, though with a couple of jobs it is OK. In our case we ended up with 60+ jobs, and creating code for every job via the UI was not our way. To eliminate a lot of repeated code and put all job code in Git, we created our own wrapper around the Glue SDK to shorten job code to a minimum. As a result, a job looks similar to:</li>
</ol>
<pre><code>from sbcommon import SBJob
sb_job = SBJob('SourceDatabaseName')
sb_job.extract_and_save(&quot;source_table&quot;, &quot;target_table&quot;)
sb_job.finish()
</code></pre>
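<p>For illustration, a stripped-down sketch of how such a wrapper can be structured. The class body below is hypothetical: our real <code>SBJob</code> wraps the awsglue SDK (GlueContext, DynamicFrame), which is stubbed out here so only the shape of the abstraction shows:</p>

```python
class SBJob:
    """Hypothetical sketch of a thin job wrapper; Glue calls are stubbed."""

    def __init__(self, source_database):
        self.source_database = source_database
        self.steps = []  # real version: initialize GlueContext and Job here

    def extract_and_save(self, source_table, target_table):
        # Real version: read via create_dynamic_frame.from_catalog(...) and
        # write via write_dynamic_frame; connection options, job bookmarks,
        # and other boilerplate live here, once, instead of in every job.
        self.steps.append((self.source_database, source_table, target_table))

    def finish(self):
        # Real version: job.commit()
        return len(self.steps)

sb_job = SBJob("SourceDatabaseName")
sb_job.extract_and_save("source_table", "target_table")
print(sb_job.finish())  # 1 step recorded
```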
<ol start="2">
<li><strong>Glue is primarily a UI tool, once again. It is not easy to integrate jobs into your CI/CD process.</strong> To automatically create new jobs from your code, amend existing ones, delete useless ones, and update shared libraries, you have to write your own code against the AWS SDK. Once again, we resolved this with our Python wrapper.</li>
<li><strong>It is quite hard to develop/debug new jobs.</strong> There is no way to set up Glue locally or mock it somehow. You have to create a development instance of Glue in the AWS cloud and pay for it. You also have to mock all calls to Glue in your CI/CD tests just to test the code of your job, so you get almost useless tests. Another point: there is no way to separate production and development environments. You can do it only by creating duplicate jobs attached to test sources and target storage.</li>
<li><strong>Long warm-up time of a Glue job.</strong> After start, a Glue job waits for the AWS cloud to allocate resources and start up a Spark cluster. I calculated the average warm-up time for our production jobs over their whole history, and it is ~7 minutes. <em>In early 2021 AWS released AWS Glue jobs v2.0, which promises a warm-up time under 1 minute.</em></li>
<li><strong>Excessive computing power of a Spark cluster for simple Extract and Load operations.</strong> Every Glue job has Spark running under the hood, so you can perform complex in-memory calculations on your data if you need to. All we were doing inside the Glue cluster, as mentioned before, was unwrapping JSON documents because Redshift could not do it. To tell the truth, JSON appears in only 8-10 of our Glue jobs. The remaining 50+ were simple Extract and Load, so we were using a sledgehammer to crack a nut.</li>
<li><strong>It is tricky to do joins for late-arriving data/events.</strong> Processing arriving data is not only about joining (if at all) with data from another source/table that arrived in the same time window. Sometimes you have to join a new delta with data that arrived yesterday or a week ago, and you need to read it into the Glue job's memory before doing so. That can be a really time-, memory-, and money-consuming operation. It is better to just load raw data into target storage and perform all joins via DBT or another tool. This fact makes Glue transformations useless in a lot of cases.</li>
<li><strong>Glue is not suitable for job orchestration.</strong> Glue provides triggers and lets you create a workflow, but the functionality is very limited. It is not easy to integrate a Glue workflow into a larger meta pipeline where something has to run before or after it. Another issue: you have to be very careful with the order of started jobs. A job cannot depend directly on the successful completion of another job; it can be done only via triggers. You have to create a trigger that starts job B once job A completes, which means an extra entity and an extra chance to get something wrong. Resuming failed jobs (and the rest of the jobs after them) has to be done manually, and sometimes it is annoying.</li>
<li><strong>Glue jobs can often be unstable due to Amazon internal issues.</strong> A couple of times a month we got random job failures with strange symptoms (such as being unable to read data from an S3 bucket). Most of them were recognized by AWS Support as internal issues, with the recommendation to simply rerun the job. Sometimes that means restarting the whole workflow and shifting the expected time of delivering data to business users.</li>
</ol>
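<p>To make the orchestration point concrete, here is what the extra trigger entity looks like when created programmatically. This is a sketch: the payload follows the boto3 <code>glue.create_trigger</code> parameters, but the job names are made up:</p>

```python
# To run job_b after job_a succeeds, you cannot declare a dependency on the
# job itself; you have to create a separate conditional trigger entity.
trigger = {
    "Name": "run-job-b-after-job-a",
    "Type": "CONDITIONAL",
    "StartOnCreation": True,
    "Predicate": {
        "Conditions": [
            {"LogicalOperator": "EQUALS", "JobName": "job_a", "State": "SUCCEEDED"}
        ]
    },
    "Actions": [{"JobName": "job_b"}],
}

# With AWS credentials configured this would be submitted as:
#   boto3.client("glue").create_trigger(**trigger)
print(trigger["Predicate"]["Conditions"][0]["State"])  # SUCCEEDED
```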
<p><strong>External sources for ELT</strong></p>
<p>Do you remember my initial architecture schema at the beginning of the article? Actually, it started to change rapidly in 2019. We launched several sub-brands, reworked our application into a microservice-oriented architecture, adopted <a href="https://snowplowanalytics.com/">Snowplow</a>, and had to include more and more external data obtained via APIs from Shopify, Facebook, Google Ads, and several more. We arrived at this architecture schema:</p>
<p><img src="http://not4j.com/content/images/2021/10/Analytics_achitecture.png" alt="Analytics_achitecture"></p>
<p>External APIs became a new challenge. We scanned for possible solutions and ended up with the <a href="https://www.singer.io/">Singer</a> framework. We adopted it in our stack and it works well for external APIs. It is a code-based tool: cheap to run, easily scaled, fast, and stable. It became obvious that most of the AWS Glue functionality could easily be replaced in our stack, except for our JSON data processing needs.</p>
<p><strong>Snowflake</strong></p>
<p>We had heard a lot about Snowflake and it looked very sweet: a modern data warehouse with a lot of cool features compared to Redshift. We got the idea to run a PoC with Snowflake: how good is it for our case? Will it cost us more?</p>
<p>From the Snowflake team we got a demo and learned that:</p>
<ul>
<li>SQL syntax is 90% the same compared to Redshift;</li>
<li>Computing power and storage space scale independently from the very beginning;</li>
<li>Computing instances can be suspended automatically when there are no incoming requests;</li>
<li>Autoscaling happens with zero downtime (or read-only mode);</li>
<li>Zero-copy clone of data: as data in Snowflake is stored as a series of incremental changes, you can create your own data &quot;branch&quot; in a few minutes, and it requires disk space only to store &quot;pointers&quot; to the increments you created your &quot;branch&quot; from;</li>
<li>The time-travel feature lets you get a snapshot of a table at any moment up to 90 days back. It means you can stop doing everyday backups as long as your retention policy is below 90 days;</li>
<li>Drop operations (table/schema) can be undone;</li>
<li>It is JSON-friendly: great JSON support, auto-formatting of stored JSON data;</li>
<li>Snowflake was compatible with every tool in our stack except AWS Glue.</li>
</ul>
<p>We wanted to go further, but we couldn't because of AWS Glue: it could not be connected to Snowflake without undocumented tricks. We decided to give Singer a try as the main pipeline framework for delivering data to Snowflake during the PoC.</p>
<p><strong>Singer</strong></p>
<p>As we were already running Singer as our ETL tool for external sources, it was quite easy to adapt it to transfer data from our internal databases (a lot of Postgres databases) to the data warehouse. We decided to compare its results against AWS Glue with the same sources and targets. We took 3 tables: one with our average number of rows per table, one with 2x the average, and one with really few rows, to surface the overhead of Glue on small tables.</p>
<table>
<thead>
<tr>
<th><strong>Table</strong></th>
<th><strong>Glue warm up time + ETL time (avg over 70 runs)</strong></th>
<th><strong>Singer time (warm + ETL)  (avg over 70 runs)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Table_A (12M rows)</td>
<td>6.1 minutes + 4 minutes</td>
<td>Total time for 3 tables: 8.85 minutes</td>
</tr>
<tr>
<td>Table_B (6M rows)</td>
<td>20 seconds + 4.9 minutes</td>
<td></td>
</tr>
<tr>
<td>Table_C (58 rows)</td>
<td>30 seconds + 1 minute</td>
<td></td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>~7 minutes + ~10 minutes</strong></td>
<td><strong>8.85 minutes</strong></td>
</tr>
</tbody>
</table>
<p>You may notice that some AWS Glue jobs (Table_B and Table_C) have short warm-up times. I suspect they were somehow assigned to instances already warmed up by the Table_A job (all jobs were running in a chain Table_A-&gt;Table_B-&gt;Table_C). Singer is much more efficient (almost 2x faster). Even comparing only the pure time for moving data, Singer does it more efficiently (8.85 minutes vs 10 minutes). Additionally, we found <a href="https://transferwise.github.io/pipelinewise/">Pipelinewise</a>, a great wrapper over the most popular Singer taps that makes installation and configuration of the Singer framework much easier. So we decided we were ready to move to an ELT process fully built over Singer.</p>
<p><strong>Snowflake PoC</strong></p>
<p>Once we got a working pipeline connected to Snowflake, we were ready to build a PoC (proof of concept). The plan for building it:</p>
<ul>
<li>Pick a couple of the most complex DBT data models and get the list of tables involved in building those models;</li>
<li>Dump historical data for the picked tables from Redshift to Snowflake;</li>
<li>Set up Singer as the ETL tool to move the everyday delta from the sources to Snowflake for the picked tables;</li>
<li>Convert the chosen DBT data models from Redshift SQL to Snowflake SQL;</li>
<li>Compare the results of building the data models and executing complex SQL from Looker.</li>
</ul>
<p>We managed to build the PoC and run the tests quite quickly, and got data produced <strong>4 times faster</strong> with Snowflake's Small instance (which should be equal in price to Redshift's DC2.Large) and <strong>6 times faster</strong> with the Medium instance type (just in case additional money could save us some time in the future).</p>
<p><strong>Job orchestration. Argo PoC</strong></p>
<p>So far so good. We were almost ready to switch to Snowflake + Singer instead of Redshift + Glue. But we were missing one part: job triggers and orchestration. Glue had been orchestrating our jobs, not as well as we would have liked, but we had that ability. Singer does not provide any built-in tool for running jobs in the proper way at the proper time. While visiting the <a href="https://www.datacouncil.ai/">Data Council</a> conference in 2019, I came across an interesting DAG tool: <a href="https://github.com/argoproj/argo-workflows">Argo Workflows</a>. It looked like a good fit, given our switch to Kubernetes. We gave it a chance as another tool in our data stack, and it works perfectly:</p>
<ul>
<li>every job in a workflow runs in a separate Kubernetes pod, so it is isolated, resource-efficient, and easily restartable;</li>
<li>every job is a Docker container, which means you can pack any technology inside;</li>
<li>it is really easy to build a DAG from your job configurations;</li>
<li>a workflow can be started by a classic cron trigger or via an API call;</li>
<li>Argo has a really tiny and fast UI.</li>
</ul>
<p>We decided to convert our jobs to Argo jobs in the following way:</p>
<ol>
<li>Every combination of Singer tap and target is packed into a separate Docker image.</li>
<li>Every data source gets its own Argo job based on the proper Singer Docker image, with custom pod resource settings (you can tune CPU and memory consumption depending on the number of rows you need to transfer). This also gives you the ability to resync a particular data source in one click via Argo.</li>
<li>In case we have a lot of tables to transfer from one source, we create several Argo jobs based on the same Docker image but pass each a different list of tables to sync. This makes it possible to extract data from the same source in parallel via several Argo jobs, and in case of any issue with an Argo job you have to resync only part of the tables rather than all of them, which could be expensive when hundreds of tables are packed into one job.</li>
<li>We created a special Docker image to run DBT after all the data migration is completed.</li>
<li>All Argo jobs are linked into one big workflow that starts once a day by a cron trigger, giving the following logical schema: <code>Cron Trigger-&gt;ETL Jobs-&gt;DBT Job-&gt;Slack Notification Job</code>.</li>
</ol>
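<p>The logical schema above can be sketched as a plain dependency graph. This is a simplified Python rendering, not our actual Argo manifest (which is YAML), and the job names are examples:</p>

```python
# Each EL job is an independent pod; DBT waits for all of them, and the
# Slack notification closes the run.
el_jobs = ["singer_postgres_db1", "singer_postgres_db2", "singer_shopify"]
dag = {job: [] for job in el_jobs}     # EL jobs have no dependencies
dag["dbt_build"] = list(el_jobs)       # DBT runs after every EL job
dag["slack_notify"] = ["dbt_build"]    # notification runs last

def runnable(task, done):
    """A task may start only when all of its dependencies are done."""
    return all(dep in done for dep in dag[task])

print(runnable("dbt_build", set(el_jobs)))        # True
print(runnable("dbt_build", {"singer_shopify"}))  # False
```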
<p>As a result, our Argo workflow:<br>
<img src="http://not4j.com/content/images/2021/11/argo-wrokflow.png" alt="argo-wrokflow"></p>
<p>You can explore and reuse our template for Argo+Singer ETL <a href="https://github.com/sphinks/argo-singer-template">here</a>.</p>
<p><strong>Migration itself</strong></p>
<p>OK, we had done several proofs of concept, researched different tools, and found a new solution for organizing our ETL pipeline. The next step was the migration itself. We came up with the following plan:</p>
<ol>
<li>Create a role model in Snowflake and create users for every tool we use.</li>
<li>Set up Argo and Singer to move data into Snowflake from internal databases, in parallel to Redshift.</li>
<li>Set up Singer to move data into Snowflake from external sources (GA, Facebook, Shopify, etc.), also in parallel.</li>
<li>Dump Redshift historical data into Snowflake, and start Singer runs to make everyday data updates in Snowflake.</li>
<li>Convert all DBT models to Snowflake SQL in a separate branch and set up a regular DBT build in Snowflake from that branch.</li>
<li>Compare the data arriving in parallel in Redshift and Snowflake (there should be no diffs in the numbers). We needed to run both databases for one month to be able to check several end-of-month reports.</li>
<li>Switch off the Glue jobs, stop collecting Redshift backups, and stop DBT runs against Redshift.</li>
<li>After a retention period (2-3 weeks), stop the Redshift instances and save the latest backup.</li>
</ol>
<p>Our initial ETA was to complete the migration in 3 months; however, it took us almost 5. Naturally, there were some hidden issues we had to face, and working around them took additional time. I wrote separate posts about a couple of them (<a href="http://not4j.com/connect-snowplow-to-snowflake/">here</a> and <a href="http://not4j.com/timestamp-to-string-converting-redshift-vs-snowflake/">here</a>).</p>
<p><strong>In the end</strong></p>
<p>We have been running the Snowflake, Argo, Singer, and DBT stack for 6 months already. Our current architecture schema:</p>
<p><img src="http://not4j.com/content/images/2021/11/arch_2.0.png" alt="arch_2.0"></p>
<p>Why this architecture is better:</p>
<ol>
<li><strong>A single EL</strong> (Extract and Load) tool for all types of data sources (Singer).</li>
<li><strong>Ability to scale</strong> any component independently. Decoupled storage and computing power (Snowflake).</li>
<li><strong>The whole EL + T process now takes 1.5 hours</strong> instead of the 3-3.5 hours it took with AWS Glue and Redshift. It could be cut down to 1 hour for additional money (using Medium instead of Small instances in Snowflake), but there is no need for us to spend more just to gain an extra half hour at night.</li>
<li><strong>Cost reduced by 50% for the whole stack</strong> (Singer + Argo cost almost nothing compared to AWS Glue). Snowflake reduced our cost dramatically thanks to its ability to suspend unused instances.</li>
<li><strong>Improved stability of the whole EL process</strong> (no more failed EL jobs caused by the EL engine itself).</li>
<li><strong>Durability of the Argo + Singer solution</strong> is great: zero effort to maintain it.</li>
</ol>
</div>]]></content:encoded></item><item><title><![CDATA[Timestamp to string conversion (Redshift vs Snowflake) issue]]></title><description><![CDATA[<div class="kg-card-markdown"><p>Everyone faces some strange behavior in software products from time to time. Any software has bugs, so it does not surprise us much. However, in some cases you face not a clear bug but some kind of strange implementation that surprises you. I have one such case from our Redshift to Snowflake</p></div>]]></description><link>http://not4j.com/timestamp-to-string-converting-redshift-vs-snowflake/</link><guid isPermaLink="false">6054b4a32f326106d7304b97</guid><category><![CDATA[Snowflake]]></category><category><![CDATA[redshift]]></category><category><![CDATA[hascode]]></category><category><![CDATA[DBT]]></category><category><![CDATA[migration]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Fri, 02 Jul 2021 13:34:50 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>Everyone faces some strange behavior in software products from time to time. Any software has bugs, so it does not surprise us much. However, in some cases you face not a clear bug but some kind of strange implementation that surprises you. I have one such case from our Redshift to Snowflake migration:</p>
<pre><code>select
    cast(TIMESTAMP '2021-03-19 17:07:10.123321' as varchar),
    cast(TIMESTAMP '2021-03-19 17:07:10.123000' as varchar),
    cast(TIMESTAMP '2021-03-19 17:07:10.100000' as varchar),
    cast(TIMESTAMP '2021-03-19 17:07:10.000000' as varchar);
</code></pre>
<p>Output (Redshift):</p>
<pre><code>2021-03-19 17:07:10.123321,
2021-03-19 17:07:10.123,
2021-03-19 17:07:10.10,
2021-03-19 17:07:10
</code></pre>
<p>Output (Snowflake):</p>
<pre><code>2021-03-19 17:07:10.123321,
2021-03-19 17:07:10.123000,
2021-03-19 17:07:10.100000,
2021-03-19 17:07:10.000000
</code></pre>
<p>At first glance, Redshift just cuts trailing zeros. Not so fast! Pay attention to the 3rd value, <code>2021-03-19 17:07:10.100000</code>: Redshift converts it to the really strange form <code>2021-03-19 17:07:10.10</code>, with a trailing zero. I cannot explain why it happens this way. But it is what it is: Redshift converts timestamps to strings in a quite unexpected way.</p>
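<p>The observed rule can be reproduced in a few lines of Python (a sketch of the behavior, derived only from the outputs above):</p>

```python
def redshift_ts_to_str(ts: str) -> str:
    """Emulate Redshift's cast-to-varchar of a timestamp with fraction."""
    whole, frac = ts.split(".")
    frac = frac.rstrip("0")          # trailing zeros are trimmed...
    if not frac:
        return whole                 # '.000000' -> no fraction at all
    if len(frac) == 1:
        return f"{whole}.{frac}0"    # '.100000' -> '.10', oddly
    return f"{whole}.{frac}"

for ts in ["2021-03-19 17:07:10.123321",
           "2021-03-19 17:07:10.123000",
           "2021-03-19 17:07:10.100000",
           "2021-03-19 17:07:10.000000"]:
    print(redshift_ts_to_str(ts))
# 2021-03-19 17:07:10.123321
# 2021-03-19 17:07:10.123
# 2021-03-19 17:07:10.10
# 2021-03-19 17:07:10
```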
<p>You could say: <code>Well, actually it is not a big deal. You can read all of the values in both cases, so there is no harm.</code> Correct: in terms of plain string output, it is totally OK. But we were not so lucky: our data model heavily uses hash keys over several fields of a table. To combine them into a hash code (we actually use a <a href="https://github.com/dbt-labs/dbt-utils#surrogate_key-source">DBT function</a> that does it for us) we convert every field to a string... I bet you already have an idea of what went wrong with our hashes. Due to the different string-conversion algorithms, we were getting unequal hashes over the same columns with the same values in them. The old Redshift hashes had to be somehow converted to the Snowflake base. We created the following script to run in Snowflake and fix all the broken references built on top of hashes:</p>
<pre><code>alter account set TIMESTAMP_NTZ_OUTPUT_FORMAT = 'YYYY-MM-DD HH24:MI:SS.FF6';

alter table public.some_table add column unique_id_redshift varchar;

update public.some_table c set unique_id_redshift = md5(cast(
                          coalesce(cast(c.some_regular_column1 as varchar), '') || '-'
                      || coalesce(

                            case
                                when LEN(RTRIM(cast(c.date_started as varchar), '0')) - POSITION('.', RTRIM(cast(c.date_started as varchar), '0')) &gt; 1 then RTRIM(cast(c.date_started as varchar), '0')
                                when LEN(RTRIM(cast(c.date_started as varchar), '0')) - POSITION('.', RTRIM(cast(c.date_started as varchar), '0')) = 0 then coalesce(substr(RTRIM(cast(c.date_started as varchar), '0'), 0, LEN(RTRIM(cast(c.date_started as varchar), '0'))-1), '')
                                else RTRIM(cast(c.date_started as varchar), '0') || '0'
                                end

                              , '') || '-'
              || coalesce(cast(c.some_regular_column2 as varchar), '') || '-'
      || coalesce(cast(c.some_regular_column2 as varchar), '') as varchar));

update table_with_invalid_key_reference f set f.some_table_unique_id = c.snowflake_calculated_hash
from some_table c
where f.some_table_unique_id = c.unique_id_redshift;

...

alter table public.some_table drop column unique_id_redshift;
</code></pre>
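<p>For illustration, the surrogate-key recipe itself can be sketched in Python (the column values are hypothetical; <code>dbt_utils.surrogate_key</code> does the equivalent in SQL). It shows how the two string conversions produce different hashes for the same row:</p>

```python
import hashlib

def surrogate_key(*columns) -> str:
    """Cast every column to a string (NULL -> ''), join with '-', and MD5."""
    joined = "-".join("" if c is None else str(c) for c in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# The same row hashes differently once a timestamp column is stringified
# the Redshift way vs. the Snowflake way:
redshift_key = surrogate_key(42, "2021-03-19 17:07:10.10", "paid")
snowflake_key = surrogate_key(42, "2021-03-19 17:07:10.100000", "paid")
print(redshift_key == snowflake_key)  # False
```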
</div>]]></content:encoded></item><item><title><![CDATA[Connect Snowplow to Snowflake]]></title><description><![CDATA[<div class="kg-card-markdown"><p>I have already mentioned that we are using Snowplow as our backbone system for collecting events. What I did not mention is that setting up Snowplow with Redshift last time was a long and awful ride, mostly because of quite bad Snowplow documentation and some mess with existing Snowplow</p></div>]]></description><link>http://not4j.com/connect-snowplow-to-snowflake/</link><guid isPermaLink="false">5ed106362f326106d730497b</guid><category><![CDATA[Snowplow]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Snowflake]]></category><category><![CDATA[Iglu]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Thu, 24 Dec 2020 13:08:59 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>I have already mentioned that we are using Snowplow as our backbone system for collecting events. What I did not mention is that setting up Snowplow with Redshift last time was a long and awful ride, mostly because of quite bad Snowplow documentation and some mess with the existing Snowplow components. I was thinking about writing a proper tutorial on the Snowplow setup once we finished, but the issues with the Snowplow docs were too complex to gather all the details into one list several weeks later. So I let it go.</p>
<p>Last week I needed to check whether it is possible to set up the Snowplow Loader with Snowflake. I expected some issues with the documentation and was prepared to write down all my pain during the journey. So here it is, my tutorial &quot;How to set up Snowplow components with Snowflake&quot;. I hope it will save some time for anyone who follows it step by step. But before we go, a couple of notes:</p>
<ol>
<li>Start with the official <a href="https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-snowflake-loader/">documentation</a>: to begin with, read all 5 articles. It is not a step-by-step tutorial; sometimes you need knowledge from the last articles to perform some tuning for an action mentioned in the first one.</li>
<li>Just accept that none of it should be started locally (or at least, it will need some special tricks). I had to spend a lot of time just trying to get all the jars mentioned in the official documentation. It cannot be done, because some of the jars are stored in private buckets, and for others it is really hard to find a download link (links are spread all around the old and new documentation).</li>
</ol>
<p>Let's start with the <a href="https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-snowflake-loader/setup/">setup article</a>: just go through the document until <code>Setting up Snowflake</code> and finish the Snowflake storage integration. I would add that you should also perform some missing actions: after creating the integration you need <a href="https://community.snowflake.com/s/article/Error-SQL-access-control-error-Insufficient-privileges-to-operate-on-integration-Integration-name">to grant ownership to sysadmin</a>. And make sure your Snowflake stage name does not contain <code>-</code> (it makes the loader produce an error).</p>
<p>Strangely, the <code>Setting up Snowflake</code> section should be skipped. It can be misread to suggest that you need to run some CLI command to set up Snowflake, but that is not so. Just skip it and go ahead.</p>
<p>Your next stop is <code>DynamoDB</code>. I'm not sure what the Snowplow developers mean by the word <code>manifest</code>. It confused me a lot, giving me the idea that I was missing some part, because there was no definition of the term... Later, I found out that <code>manifest</code> is the name of the config field where you should put the DynamoDB table name. Anyway, all you need to do is create a table in DynamoDB with one column called <code>RunId</code> and write down the name of the table. You will need it later.</p>
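<p>All the &quot;manifest&quot; really amounts to is one DynamoDB table with a <code>RunId</code> hash key. A sketch of the request (the parameters follow boto3 <code>dynamodb.create_table</code>; the table name is an example):</p>

```python
manifest_table = {
    "TableName": "snowplow-run-manifest",
    "KeySchema": [{"AttributeName": "RunId", "KeyType": "HASH"}],
    "AttributeDefinitions": [{"AttributeName": "RunId", "AttributeType": "S"}],
    "BillingMode": "PAY_PER_REQUEST",
}

# With AWS credentials configured this would be submitted as:
#   boto3.client("dynamodb").create_table(**manifest_table)
print(manifest_table["TableName"])
```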
<p>The next stop is <code>Snowflake</code>. It looks like you need to create the <code>atomic.events</code> table in Snowflake on your own (or not?). It is not clear what you should do here... Nothing. It is mentioned in the setup instructions just FYI; the table will be created automatically.</p>
<p>Next, <code>Dataflow Runner</code> has two big JSON configs. Filling in all the gaps in them is not a problem. I'd recommend raising <code>amiVersion</code> from 5.9.0 to 5.18.0: it unlocks EC2 instances of the 5th generation (with 5.9.0 you can use only the old 4th-gen instances). It is also better to amend the default Spark configuration in <code>cluster.json</code>. It can significantly increase the performance of a cluster, but it should be tuned for the particular task. My config, which works well for me, just for reference:</p>
<pre><code>&quot;configurations&quot;:[
  {
    &quot;classification&quot;:&quot;core-site&quot;,
    &quot;properties&quot;:{
      &quot;io.file.buffer.size&quot;:&quot;65536&quot;
    }
  },
  {
    &quot;classification&quot;:&quot;mapred-site&quot;,
    &quot;properties&quot;:{
      &quot;mapreduce.user.classpath.first&quot;:&quot;true&quot;
    }
  },
  {
    &quot;classification&quot;:&quot;yarn-site&quot;,
    &quot;properties&quot;:{
      &quot;yarn.resourcemanager.am.max-attempts&quot;:&quot;1&quot;
    }
  },
  {
    &quot;classification&quot;:&quot;spark&quot;,
    &quot;properties&quot;:{
      &quot;maximizeResourceAllocation&quot;:&quot;false&quot;
    }
  },
  {
    &quot;classification&quot;:&quot;spark-defaults&quot;,
    &quot;properties&quot;:{
      &quot;spark.dynamicAllocation.enabled&quot;: &quot;false&quot;,
      &quot;spark.executor.instances&quot;: &quot;6&quot;,
      &quot;spark.yarn.executor.memoryOverhead&quot;: &quot;1024&quot;,
      &quot;spark.executor.memory&quot;: &quot;9G&quot;,
      &quot;spark.executor.cores&quot;: &quot;5&quot;,
      &quot;spark.yarn.driver.memoryOverhead&quot;: &quot;2048&quot;,
      &quot;spark.driver.memory&quot;: &quot;13G&quot;,
      &quot;spark.default.parallelism&quot;: &quot;60&quot;,
      &quot;spark.network.timeout&quot;: &quot;800&quot;
    }
  }
]
</code></pre>
<p>Next, it looks like you need to run something from your CLI (<code>dataflow-runner</code>), but you do not. It should be started in the cloud with access to Amazon EMR, so just leave it as is. In the playbook you can see a mention of the loader-config file; it is created on the <a href="https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-snowflake-loader/configuration/">next page</a>.</p>
<p>Do you remember the DynamoDB <code>manifest</code>? You should put it here. All the rest of the items are obvious.</p>
<p>The next page is <code>Cross-batch deduplication</code>. It was not important for me, so I skipped it.</p>
<p><a href="https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-snowflake-loader/backpopulate-the-manifest/">Backpopulate the manifest</a> is very important in case you are migrating from another storage to Snowflake. Instead of importing all raw Snowplow events with the EMR cluster, you can copy them from your storage directly into the Snowflake table and mark that data as already imported for the Snowflake loader. Be careful: there are several errors on that page.</p>
<p>First of all, you won't be able to complete these CLI commands:</p>
<pre><code>$ pip install boto3 snowplow_analytics_sdk
$ wget https://raw.githubusercontent.com/snowplow-incubator/snowplow-snowflake-loader/release/0.4.0/backfill.py   # Won't actually be downloaded as repository is private
</code></pre>
<p>The link is incorrect (judging by the comment, the author knew it was broken but forgot to fix it). Instead, use:</p>
<pre><code>pip3 install boto3 snowplow_analytics_sdk

wget https://raw.githubusercontent.com/snowplow-incubator/snowplow-snowflake-loader/release/0.7.1/backfill.py
</code></pre>
<p>The next CLI command also has an error:</p>
<pre><code>./backfill.py \
    --startdate 2017-08-22-01-01-01 \
    --region $AWS_REGION \
    --manifest-table-name $DYNAMODB_MANIFEST \
    --enriched-archive $TRANSFORMER_INPUT \
    --aws-access-key-id=$AWS_ACCESS_KEY_ID \
    --aws-secret-access-key=$AWS_SECRET_KEY
</code></pre>
<p>You won't be able to run a Python file as a regular bash file, so the correct command looks like this:</p>
<pre><code>python3 backfill.py --startdate 2020-12-23-01-01-01 --region us-east-1 --manifest-table-name your-dynamodb-table --enriched-archive s3://path/to/archive --aws-access-key-id=${AWS_ACCESS_KEY_ID} --aws-secret-access-key=${AWS_SECRET_ACCESS_KEY}
</code></pre>
<p>At last, you are ready to push your configs to an EC2 instance that will start the EMR cluster. What you should have in your push:<br>
<code>cluster-config.json</code><br>
<code>playbook.json</code><br>
<code>iglu-resolver.json</code> (just take a look at my other <a href="http://not4j.com/running-status-iglu-repository-on-aws-s3/">article</a>)<br>
<code>run_emr_cluster.sh</code></p>
<p>In the last <code>.sh</code> file you should put all the CLI commands you have seen throughout the official tutorial. Take a look at mine, for example:</p>
<pre><code>&lt;some regular commands&gt;...
pip3 install boto3 snowplow_analytics_sdk

wget https://raw.githubusercontent.com/snowplow-incubator/snowplow-snowflake-loader/release/0.7.1/backfill.py

python3 backfill.py --startdate 2020-12-23-01-01-01 --region us-east-1 --manifest-table-name your-dynamodb-table --enriched-archive s3://path/to/archive --aws-access-key-id=${AWS_ACCESS_KEY_ID} --aws-secret-access-key=${AWS_SECRET_ACCESS_KEY}

aws s3 cp s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.7.1.jar .

java -jar snowplow-snowflake-loader-0.7.1.jar setup --base64 --config $(cat ./loader-config.json | base64 -w 0) --resolver $(cat ./iglu-resolver.json | base64 -w 0)

/dataflow/dataflow-runner run-transient --emr-config ./cluster-config.json --emr-playbook ./playbook.json
</code></pre>
</div>]]></content:encoded></item><item><title><![CDATA[Extracting data from BigQuery into Redshift]]></title><description><![CDATA[<div class="kg-card-markdown"><p>Another morning, another challenge! :) This time I had to integrate the data collected in Google Ads into our data warehouse based on Redshift. I started with Google and found several tricky ways to get the data into Redshift.</p>
<ol>
<li>Some paid ETL services.</li>
<li>Self-hosted script to capture data from Google Ads directly via</li></ol></div>]]></description><link>http://not4j.com/extracting-data-from-bigquery-into-redsfhit/</link><guid isPermaLink="false">5ebe8cad5153dc06e285db19</guid><category><![CDATA[redshift]]></category><category><![CDATA[bigquery]]></category><category><![CDATA[ETL]]></category><category><![CDATA[AWS]]></category><category><![CDATA[GCS]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Mon, 18 May 2020 07:59:31 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>Another morning, another challenge! :) This time I had to integrate the data collected in Google Ads into our data warehouse based on Redshift. I started with Google and found several tricky ways to get the data into Redshift.</p>
<ol>
<li>Some paid ETL services.</li>
<li>Self-hosted script to capture data from Google Ads directly via API.</li>
<li>Transfer Google Ads data into Google BigQuery, which turns my task into &quot;Transfer data from BigQuery into Redshift&quot;.</li>
</ol>
<p>As you can see from the title, I started digging into the last option, as it looked like something I could easily implement. We use the <a href="https://www.singer.io/">singer.io</a> framework in our tech stack, and I was sure Singer would have a BigQuery adapter! I would just put another adapter into my Singer repo and the data would start to flow.</p>
<p>Singer actually has an official <a href="https://www.singer.io/target/google-bigquery/">BigQuery adapter</a>... but it is a target adapter (in Singer framework terms this means the adapter can write into BigQuery, not extract from it). However, I noticed that there is a very <a href="https://github.com/anelendata/tap_bigquery">simple BigQuery extract adapter</a> made by a single person. I gave it a try, and it works in a very simple and reliable way (extract adapters in Singer are called 'taps'). It had several issues:</p>
<ol>
<li>It does not fully support the Singer contract, so not all Singer target adapters are compatible with it.</li>
<li>There is no ready package on PyPI: you have to build and install it yourself.</li>
<li>The description is outdated and copy-pasted from another tap.</li>
</ol>
<p>I was able to resolve all those issues and make a pull request with updates to the original repo, and it was merged by the author. I also published a <a href="https://pypi.org/project/tap-bigquery/">package</a>, so you are able to use it via 'pip install'.</p>
<p>Under the hood, it just extracts data as CSV files via the BigQuery API, puts them on an S3 bucket, and runs the <code>COPY</code> command to import the data into Redshift.</p>
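<p>As a rough illustration of that last step, a Redshift <code>COPY</code> statement can be assembled like this (the table, bucket, and IAM role names below are hypothetical, not taken from the adapter's source):</p>

```python
def build_copy_statement(table, s3_path, iam_role):
    """Build a Redshift COPY command that loads one CSV file from S3.

    All names are placeholders; the real target adapter assembles an
    equivalent statement internally.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV "
        "IGNOREHEADER 1 "
        "TIMEFORMAT 'auto'"
    )

# The statement the target might run for one extracted file:
stmt = build_copy_statement(
    "analytics.google_ads",                          # hypothetical table
    "s3://my-etl-bucket/google_ads/2020-05-01.csv",  # hypothetical S3 object
    "arn:aws:iam::123456789012:role/redshift-copy",  # hypothetical IAM role
)
print(stmt)
```

<p>In real life the target adapter manages the S3 upload and credentials itself; the sketch only shows the shape of the final statement.</p>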
<p>To create a pipeline between BigQuery and Redshift locally, you need the following commands:</p>
<pre><code>&gt;python -m venv ./app/env_bigquery
&gt;source ./app/env_bigquery/bin/activate; pip install tap-bigquery;
&gt;python -m venv ./app/env_redshift
&gt;source ./app/env_redshift/bin/activate; pip install pipelinewise-target-redshift;

&gt;export GOOGLE_APPLICATION_CREDENTIALS=&quot;./client_secret.json&quot;

&gt;tap_bigquery -c config.json -d &gt; catalog.json

&gt;tap_bigquery -c config.json --catalog catalog.json --start_datetime '2020-05-01T00:00:00Z' --end_datetime '2020-05-01T01:00:00Z' | target-redshift --config redshift_config.json
</code></pre>
<p>It is quite easy: install the tap and the target in separate Python environments, export the secret key for the Google service account (check the <a href="https://github.com/anelendata/tap_bigquery">readme</a> in the tap repo), and amend the config for the tap. Next, run the tap in discovery mode (the <code>-d</code> option); it will produce the schema of the BigQuery table as output. Check it! I had to amend the types of several fields. As the final step, just start <code>tap-bigquery</code> with all the options, including start and end date-times (they limit the extracted data by the <code>datetime_key</code> you can specify for every table in the config), and redirect its output into the Redshift target.</p>
<p>For incremental extraction you have to use the keys <code>start_datetime</code> and <code>end_datetime</code> (proper support for the Singer state message is not implemented). I created the following shell script to work around it:</p>
<pre><code>#!/bin/bash

# Load endDate saved by the previous run (if any).
source ./state.conf

if [ -z &quot;$endDate&quot; ]
then
      echo &quot;\$endDate is empty, use default value&quot;
      endDate=$(date --utc +%F --date=&quot;15 days ago&quot;)
      echo $endDate
fi

# The extraction window: from the previous end date up to yesterday.
startDate=$endDate
endDate=$(date --utc +%F --date=&quot;1 days ago&quot;)

export GOOGLE_APPLICATION_CREDENTIALS=&quot;./client_secret.json&quot;

./env_bigquery/bin/tap_bigquery -c bigquery_config.json --catalog catalog.json \
--start_datetime $startDate --end_datetime $endDate | \
./env_redshift/bin/target-redshift --config redshift_config.json

# Persist endDate for the next run.
declare -p endDate &gt; ./state.conf

</code></pre>
<p>It just saves and loads the value for the parameter <code>start_datetime</code> from a file via the standard bash commands <code>source</code> and <code>declare</code>. Quite simple.</p>
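<p>The same state-handling logic can be sketched in Python as well (the state file name here is my own choice, just like <code>state.conf</code> above):</p>

```python
import datetime as dt
import json
import pathlib

STATE_FILE = pathlib.Path("state.json")  # hypothetical state file


def next_window(today=None):
    """Return (start, end) dates for the next incremental run and save state.

    start = the end date of the previous run (or a 15-day backfill on the
    first run); end = yesterday, mirroring the bash script above.
    """
    today = today or dt.date.today()
    if STATE_FILE.exists():
        prev_end = dt.date.fromisoformat(json.loads(STATE_FILE.read_text())["end_date"])
    else:
        prev_end = today - dt.timedelta(days=15)  # first run: 15-day backfill
    start = prev_end
    end = today - dt.timedelta(days=1)
    STATE_FILE.write_text(json.dumps({"end_date": end.isoformat()}))
    return start, end


start, end = next_window(dt.date(2020, 5, 18))
print(start, end)
```

<p>Each run picks up where the previous one ended, so no rows are skipped between windows.</p>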
</div>]]></content:encoded></item><item><title><![CDATA[How we built the analytics system at Scentbird]]></title><description><![CDATA[<div class="kg-card-markdown"><p>A couple of months ago <a href="http://scentbird.com">Scentbird</a> celebrated 2 years of building its internal analytics system. I decided it was time to start sharing our experience with the community, and it became the topic of my <a href="https://www.highload.ru/moscow/2019">HighLoad 2019</a> (the biggest conference in Russia on high-load systems) talk. Here it is</p></div>]]></description><link>http://not4j.com/high-load-2019/</link><guid isPermaLink="false">5c72b0255153dc06e285da81</guid><category><![CDATA[Analytics]]></category><category><![CDATA[Looker]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Glue]]></category><category><![CDATA[redshift]]></category><category><![CDATA[DBT]]></category><category><![CDATA[architecture]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Mon, 29 Jul 2019 22:15:21 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>A couple of months ago <a href="http://scentbird.com">Scentbird</a> celebrated 2 years of building its internal analytics system. I decided it was time to start sharing our experience with the community, and it became the topic of my <a href="https://www.highload.ru/moscow/2019">HighLoad 2019</a> (the biggest conference in Russia on high-load systems) talk. Here it is (in Russian):</p>
<p><a href="http://www.youtube.com/watch?feature=player_embedded&v=1KlBJWg5QtM" target="_blank"><img src="http://img.youtube.com/vi/1KlBJWg5QtM/0.jpg" width="480" height="360" border="10"></a></p>
<p>However, I suppose the story is worth retelling in English, so I will try to do it below. To begin with, I'd like to say that this is not about best practices (we are building our first analytics system); it is just our experience, with its full set of mistakes and issues. I hope it will help others to avoid the same traps.</p>
<p>What did we begin with? It was quite a simple setup:</p>
<img src="http://not4j.com/content/images/2019/07/step1-4-.png" style="width: 90%">
<p>We started with a self-made analytics portal. It was connected directly to the same database (the website database) our main web service used. Web analytics was done by external systems like Mixpanel and Google Analytics. It was quite simple to build, but sometimes heavy analytics SQL affected the performance of the main website. It was easy to resolve: introduce a read-only replica for the database (Postgres supports such replicas out of the box):</p>
<img src="http://not4j.com/content/images/2019/07/step2-1-.png" style="width: 90%">
<p>The new setup helped us avoid the performance issues caused by analytics SQL on the website side. However, analytics SQL still performed badly even with the dedicated read-only replica. It became clear: we needed another way to store our analytics data. That is how we found the Redshift database in the Amazon ecosystem. In short, it had the following advantages for us:</p>
<ul>
<li>Amazon ecosystem (we are using AWS services a lot);</li>
<li>column-based DB;</li>
<li>PostgreSQL-based: quite easy to reuse initial SQL;</li>
<li>simple performance tests promised ~10x faster query execution.</li>
</ul>
<p>Looks great, but data should somehow be transferred from the website database to the newly created analytics-specific Redshift database. That is how we discovered what <code>ETL</code> (extract, transform, load) means. We checked several ETL solutions: AWS Data Pipeline, Stitch, AWS Glue. AWS Glue was the winner in the end, so we introduced it into our stack along with Redshift:</p>
<img src="http://not4j.com/content/images/2019/07/step3-1-.png" style="width: 90%">
<p>We increased the complexity of our stack, but it gave us a great speed boost. We were able to move further with more complex analytics without being afraid of the performance of analytics reports. Users liked the performance and started asking for more new reports. As you remember, we were using our own analytics portal and developing all the UI stuff ourselves. Additional reports led to a growing number of UI components that had to be developed and maintained. The dev team spent more and more time on UI instead of working with data. That was the moment to take a look at BI tools. Once again we compared several options: Metabase, Tableau, AWS QuickSight, Looker. We decided to jump into Looker, and we are still using it and happy:</p>
<img src="http://not4j.com/content/images/2019/07/step4-1-.png" style="width: 90%">
<p>It gave the dev team the ability to concentrate on data tasks instead of building UI. It is possible to resolve about 80% of users' requests without any development (there can be some assistance from the dev side, but no UI development anymore). We are still using our old self-made analytics portal, but we are gradually migrating its reports to Looker.</p>
<p>Having good performance and being able to concentrate on data development only, we opened a new level of complexity in our analytics. We tried to build long transformation pipelines with AWS Glue (it has Apache Spark under the hood) but failed to organize a proper flow of transformation and validation phases. That is how we came to the <a href="https://www.getdbt.com/">DBT tool</a>.</p>
<p>It helped us move data transformation from AWS Glue (which is good at extracting and loading data) to a separate tool, allowing us to do transformation (the letter <code>T</code> in <code>ETL</code>) like a pro:</p>
<img src="http://not4j.com/content/images/2019/07/step5-1-.png" style="width: 90%">
<p>The DBT tool allows us to:</p>
<ul>
<li>store data transformation models as code, with version control support;</li>
<li>create transformation models with SQL;</li>
<li>reuse our SQL code with the integrated Jinja templating;</li>
<li>create incremental data models simply;</li>
<li>use the built-in data test framework.</li>
</ul>
<p><strong>What can you pick up from our journey for your own experience?</strong><br>
I will try to make a summary:</p>
<ol>
<li>Do not share the same DB instance between analytics and the main service that produces the data. At least start with a separate read-only replica for analytics.</li>
<li>You definitely should go for an analytics-oriented DB. You may not use it from the very beginning, but keep in mind that at some point you will move to another kind of database. Once reports get complex and slow enough to make users feel uncomfortable, start your migration.</li>
<li>ETL looks like an easy part of the whole journey at the very beginning. Do not let yourself be fooled: it is an important and complex part. Choose it carefully.</li>
<li>You should test and control your data. Start as soon as possible.</li>
<li>Moving to a 3rd-party BI tool will be an expensive step. However, it will give you a lot of time to work with data, not UI. You will get a feeling when you need it. Start with some free solution (open source or your own custom one).</li>
</ol>
</div>]]></content:encoded></item><item><title><![CDATA[Jenkins: How to start a Jenkins build using the commit id of the last successful build on a test environment?]]></title><description><![CDATA[<div class="kg-card-markdown"><p>Last week I faced a quite non-trivial case in CI practice. There is a production build running once a day by schedule, and it uses the master branch as a source. The main challenge here is that the development team is not using branches in everyday practice (why? it is</p></div>]]></description><link>http://not4j.com/jenkins-how-to-start-jenkins-build-using-commit-id-of-last-successful-build-on-test-enviroment/</link><guid isPermaLink="false">5cf284545153dc06e285da8d</guid><category><![CDATA[jenkins]]></category><category><![CDATA[CI]]></category><category><![CDATA[groovy]]></category><category><![CDATA[continuous-integration]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Sat, 01 Jun 2019 14:53:18 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>Last week I faced a quite non-trivial case in CI practice. There is a production build running once a day by schedule, and it uses the master branch as a source. The main challenge here is that the development team is not using branches in everyday practice (why? that is quite another topic, but at the moment we stay with one branch).</p>
<p>Such a setup was working fine until the dev team size hit 4 engineers. It became a frequent case that the Jenkins build started while the master branch contained some unfinished task still in progress. It became obvious that the production build should use only the commit that produced the last successful build on the stage environment. Sounds good and should be quite easy to implement, but...</p>
<p>Jenkins is great in case you have a pipeline of builds (one build triggers another right after successful completion). However, it is not so easy to make a scheduled job check the state of another job and obtain a commit id in pre-build actions. At least, I did not find anything out of the box.</p>
<p>Let's try to create it on our own. First of all, we need to obtain the last successful build with all additional info. Checking the <a href="https://wiki.jenkins.io/display/JENKINS/Remote+access+API">Jenkins API</a> gives us a link to do it: <code>http://&lt;jenkins_url&gt;/job/&lt;job_name&gt;/lastSuccessfulBuild/api/json</code>.</p>
<p>It will produce a huge JSON output, but the most interesting part is here: <code>&quot;lastBuiltRevision&quot;:{&quot;SHA1&quot;:&quot;&lt;commit_hash&gt;&quot;}</code>. We can reduce the JSON size by passing the exact names of fields and the depth of search over the JSON: <code>http://&lt;jenkins_url&gt;/job/&lt;job_name&gt;/lastSuccessfulBuild/api/json?tree=actions[lastBuiltRevision[SHA1]]&amp;depth=3</code>. OK, the commit hash can be obtained.</p>
<p>Now we need to trigger this API call during pre-build actions, parse the commit hash, and use it to start the build. It can be done with the <a href="https://wiki.jenkins.io/display/JENKINS/EnvInject+Plugin">Jenkins EnvInject Plugin</a>. Set it up, and in the build parameters enable the option <code>Prepare an environment for the run</code>.</p>
<p><img src="http://not4j.com/content/images/2019/06/---------_602-2.png" alt="---------_602-2"></p>
<p>Do not forget to enable the option <code>Override Build Parameters</code>. OK, now we have an opportunity to override environment variables before the build starts, and for this purpose we can use: hardcoded values, a script (not sure what kind of script can be used, because I found the next option), or a Groovy script. Groovy is a perfect language for me for this kind of automation, so I went with this option. What else do we need? Ah... We can override an env variable, but which variable should we override? Let's create one.</p>
<p>In the build parameters, enable the option <code>This project is parameterized</code>. Add a Git parameter section. Now name the parameter (I used <code>COMMIT_ID</code>) and set its default value (in my case, it was a reference to master's head).</p>
<p><img src="http://not4j.com/content/images/2019/06/---------_603-2.png" alt="---------_603-2"></p>
<p>Let's inject a new environment variable. It can be done in the <code>Source Code Management</code> section: put a reference to the env variable in the field <code>Branch Specifier (blank for 'any')</code>.</p>
<p><img src="http://not4j.com/content/images/2019/06/---------_604-2.png" alt="---------_604-2"></p>
<p>And the last step: perform the API call, parse the hash, and put it into the env variable <code>COMMIT_ID</code>. Let's return to the Jenkins EnvInject Plugin and find the <code>Groovy script</code> section. Put the following script in the field:</p>
<pre><code>def post = new URL(&quot;http://&lt;jenkins_url&gt;/job/&lt;job-name_to_check&gt;/lastSuccessfulBuild/api/json?tree=actions[lastBuiltRevision[SHA1]]&amp;depth=3&quot;).openConnection()

String user = &quot;&lt;user&gt;&quot;
String pass = &quot;&lt;API token&gt;&quot;
String authStr = user +&quot;:&quot;+  pass
String encoding = authStr.getBytes(&quot;utf-8&quot;).encodeBase64().toString()


post.setRequestMethod(&quot;POST&quot;)
post.setDoOutput(true)
post.setRequestProperty(&quot;Authorization&quot;, &quot;Basic &quot; + encoding)
def postRC = post.getResponseCode()
if(postRC.equals(200)) {
    def result = post.getInputStream().getText()
    def shaBegin = result.indexOf('SHA1&quot;:&quot;') + 7
    def shaEnd = result.indexOf('&quot;', shaBegin)
    println(result.substring(shaBegin, shaEnd))
    return [&quot;COMMIT_ID&quot;: result.substring(shaBegin, shaEnd)]
}
</code></pre>
<p>The script is pretty straightforward. You need to pass your username and API token (click on your username in Jenkins at the top -&gt; Configure -&gt; Show Legacy API Token). Next, we just parse the hash from the result string and return a map with the parameter name <code>COMMIT_ID</code> and the hash value.</p>
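<p>As a side note, if you ever call the same endpoint from outside Jenkins, it may be safer to parse the JSON properly instead of searching for substrings. A sketch in Python (the URL, user, and token are placeholders):</p>

```python
import base64
import json
import urllib.request


def last_successful_sha(jenkins_url, job, user, token):
    """Fetch and parse the SHA1 of a job's last successful build."""
    url = (jenkins_url + "/job/" + job + "/lastSuccessfulBuild/api/json"
           "?tree=actions[lastBuiltRevision[SHA1]]&depth=3")
    req = urllib.request.Request(url)
    # The same basic-auth header the Groovy script assembles by hand.
    cred = base64.b64encode((user + ":" + token).encode()).decode()
    req.add_header("Authorization", "Basic " + cred)
    with urllib.request.urlopen(req) as resp:
        return extract_sha(json.load(resp))


def extract_sha(payload):
    """Return SHA1 from the first action carrying a lastBuiltRevision."""
    for action in payload.get("actions", []):
        rev = action.get("lastBuiltRevision")
        if rev:
            return rev["SHA1"]
    return None


# Shape of the response, trimmed to the fields the tree filter requests:
sample = {"actions": [{}, {"lastBuiltRevision": {"SHA1": "1a2b3c4d"}}]}
print(extract_sha(sample))  # 1a2b3c4d
```

<p>Walking the <code>actions</code> array is needed because not every action object carries a <code>lastBuiltRevision</code> key.</p>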
</div>]]></content:encoded></item><item><title><![CDATA[Generating series in Redshift]]></title><description><![CDATA[<div class="kg-card-markdown"><p>Redshift in 99% of cases means a data warehouse, where generating series (especially date series) is quite a common task. You will be disappointed to find no information about series in the official documentation (the only thing that you will be able to find - &quot;generate_series function is not</p></div>]]></description><link>http://not4j.com/generating-series-in-redshift/</link><guid isPermaLink="false">5c4a40925153dc06e285da73</guid><category><![CDATA[redshift]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Analytics]]></category><category><![CDATA[sql]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Fri, 01 Feb 2019 13:35:09 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>Redshift in 99% of cases means a data warehouse, where generating series (especially date series) is quite a common task. You will be disappointed to find no information about series in the official documentation (the only thing that you will be able to find is &quot;generate_series function is not supported by Redshift&quot;). It is a strange situation that <code>generate_series()</code> is not supported in a database engine with great demand for such a function, while <code>generate_series()</code> is implemented in regular Postgres, where there are fewer cases to use it.</p>
<p>Actually, the <code>generate_series()</code> function can be used in Redshift with some limitations. Before I start explaining how to use <code>generate_series()</code>, I am going to make a quick note regarding Redshift-specific function support. Redshift is a multi-node database: the master (leader) node is an instance that performs orchestration, while the worker nodes perform actions on the data. It is not obvious, but the master node supports several <a href="https://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions_leader_node_only.html">additional functions</a> compared to the worker nodes. However, you will not find the <code>generate_series()</code> function in this list. Nevertheless, this knowledge will help us later.</p>
<p><code>generate_series()</code> can be used as an undocumented function. It can be dangerous, but you should know all the available tools. So let's check it:</p>
<pre><code>    SELECT CURRENT_DATE :: TIMESTAMP - (i * INTERVAL '1 day') AS generated_datetime
    FROM generate_series(1, 50) i
</code></pre>
<p>You will get a perfect date series. So let's go further and use it with some data:</p>
<pre><code>    with gen as (
    SELECT CURRENT_DATE :: TIMESTAMP - (i * INTERVAL '1 day') AS generated_datetime
    FROM generate_series(1, 50) i
)
  select * from gen
  INNER JOIN user u ON gen.generated_datetime = u.datetime_created
</code></pre>
<p>And you will get a not very clear error message: <code>Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables</code>. Do you remember my small note regarding the fact that the leader node has its own set of functions? The error message is telling you that the data nodes do not support some function you used in the SQL. In such a small SQL script, it is clear that <code>generate_series()</code> is that function. You can run <code>generate_series()</code> in any SQL that will be calculated using only the master node; it means you cannot join or reuse the results of <code>generate_series()</code> in the same script where you refer to data tables (such requests are redirected to the worker nodes, which do not support <code>generate_series()</code>).</p>
<p>How to deal with this limitation? Well, I see two options: use <code>generate_series()</code> to generate data into some new table and then use that table in scripts involving data tables, or use a trick to get a number series from an already existing table. Take a look:</p>
<pre><code>    SELECT
      DATEADD('day', -n, DATE_TRUNC('month', CURRENT_DATE)) AS generated_date
    FROM (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100) n
    ORDER BY generated_date DESC
</code></pre>
<p>You will get a perfect date series; the only trick used here is the system table <code>stl_scan</code> as a source of the number series (yes, it cannot be generated directly). Some system tables like <code>stl_scan</code> are populated with technical data right after database creation, so you can be sure (or almost sure) that at any moment you will have at least several rows of data there. This workaround works perfectly in case of a join with some business data. Let's check it:</p>
<pre><code>    SELECT 
        user_id,
        dt.generated_date
    FROM users us
    LEFT JOIN (
      SELECT
        DATEADD('day', -n, DATE_TRUNC('month', CURRENT_DATE)) AS generated_date
      FROM (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100) n
      ORDER BY generated_date DESC) dt
    ON us.subscription_day &gt;= dt.generated_date and us.cancel_day &lt;= dt.generated_date 
</code></pre>
</div>]]></content:encoded></item><item><title><![CDATA[Running a static Iglu repository on AWS S3]]></title><description><![CDATA[<div class="kg-card-markdown"><p>While setting up the Snowplow analytics system, I had to set up a private Iglu repository. The main idea behind this is described at (<a href="https://github.com/snowplow/iglu/wiki/Setting-up-an-Iglu-repository">https://github.com/snowplow/iglu/wiki/Setting-up-an-Iglu-repository</a>). That manual misses several steps that are really important for building an Iglu repository on AWS infrastructure. I had spent a lot of</p></div>]]></description><link>http://not4j.com/running-status-iglu-repository-on-aws-s3/</link><guid isPermaLink="false">5b94af8d19bb7d06745463f8</guid><category><![CDATA[Snowplow]]></category><category><![CDATA[AWS]]></category><category><![CDATA[S3]]></category><category><![CDATA[Iglu]]></category><category><![CDATA[Repository]]></category><category><![CDATA[Analytics]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Tue, 06 Nov 2018 04:05:06 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>While setting up the Snowplow analytics system, I had to set up a private Iglu repository. The main idea behind this is described at (<a href="https://github.com/snowplow/iglu/wiki/Setting-up-an-Iglu-repository">https://github.com/snowplow/iglu/wiki/Setting-up-an-Iglu-repository</a>). That manual misses several steps that are really important for building an Iglu repository on AWS infrastructure. I had spent a lot of time trying to figure out those steps. So here they are:</p>
<ol>
<li>You should upload data to S3 in the layout that is described here <a href="https://github.com/snowplow/iglu/wiki/Static-repo">https://github.com/snowplow/iglu/wiki/Static-repo</a></li>
<li>Enable the S3 bucket as a static hosting solution. It can be done from the <code>Properties</code> menu of the S3 bucket.</li>
<li>Amend the policy of the S3 bucket to allow public access. It is located in the <code>Permission</code> section within the <code>Bucket Policy</code> submenu.</li>
</ol>
<pre><code>{
    &quot;Version&quot;: &quot;2012-10-17&quot;,
    &quot;Statement&quot;: [
        {
            &quot;Sid&quot;: &quot;PublicReadGetObject&quot;,
            &quot;Effect&quot;: &quot;Allow&quot;,
            &quot;Principal&quot;: &quot;*&quot;,
            &quot;Action&quot;: &quot;s3:GetObject&quot;,
            &quot;Resource&quot;: &quot;arn:aws:s3:::bucket-name/*&quot;
        }
    ]
}
</code></pre>
<ol start="4">
<li>Create a CORS policy. It is also located in the <code>Permission</code> section, in the <code>CORS configuration</code> submenu.</li>
</ol>
<pre><code>&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;CORSConfiguration xmlns=&quot;http://s3.amazonaws.com/doc/2006-03-01/&quot;&gt;
&lt;CORSRule&gt;
    &lt;AllowedOrigin&gt;*&lt;/AllowedOrigin&gt;
    &lt;AllowedMethod&gt;GET&lt;/AllowedMethod&gt;
    &lt;AllowedHeader&gt;*&lt;/AllowedHeader&gt;
&lt;/CORSRule&gt;
&lt;/CORSConfiguration&gt;
</code></pre>
<ol start="5">
<li>Update Iglu resolver config for enricher (<a href="https://github.com/snowplow/iglu/wiki/Iglu-client-configuration">https://github.com/snowplow/iglu/wiki/Iglu-client-configuration</a>).</li>
</ol>
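<p>Steps 3 and 4 can also be applied from code. Here is a sketch with boto3 (the bucket name is a placeholder; the function needs AWS credentials that are allowed to manage the bucket):</p>

```python
import json

BUCKET = "my-iglu-repo"  # hypothetical bucket name

# Step 3: the public-read bucket policy (same document as above).
public_read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "PublicReadGetObject",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::" + BUCKET + "/*",
    }],
}

# Step 4: the GET-only CORS rule, in boto3's dict form.
cors_config = {"CORSRules": [{
    "AllowedOrigins": ["*"],
    "AllowedMethods": ["GET"],
    "AllowedHeaders": ["*"],
}]}


def apply_bucket_config():
    """Push both documents to the bucket (requires AWS credentials)."""
    import boto3  # AWS SDK for Python; imported lazily in this sketch
    s3 = boto3.client("s3")
    s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(public_read_policy))
    s3.put_bucket_cors(Bucket=BUCKET, CORSConfiguration=cors_config)
```

<p>Scripting these two steps makes the repository setup repeatable if you ever need to rebuild the bucket.</p>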
<p>You can check that everything is working correctly with a simple wget ping:<br>
<code>wget http://you-prefix.s3-amazon-region-prefix.amazon.com/schemas/com.yourcompany/schema_name/jsonschema/1-0-0</code><br>
(in case you have created your own schema with the reference <code>iglu:com.yourcompany/schema_name/{schema_version}</code> and the schema version is 1-0-0). You should be able to download your schema now.</p>
</div>]]></content:encoded></item><item><title><![CDATA[build-test-data plugin: how to generate IDs for domain entities in a batch]]></title><description><![CDATA[<div class="kg-card-markdown"><p>I have faced an issue on a Grails project: domain entities have assigned ids, so in tests I have to manually set up an id for every created entity:</p>
<pre><code>new Entity(id: id_value, ...).save()
</code></pre>
<p>It is very verbose and creates an additional task to create and maintain an id variable. In addition, there</p></div>]]></description><link>http://not4j.com/build-test-data-plugin-how-to-generate-ids-in-tests-for-domain-entites-in-a-batch/</link><guid isPermaLink="false">5ac49dd368216506db8a93c4</guid><category><![CDATA[grails]]></category><category><![CDATA[groovy]]></category><category><![CDATA[build-test-plugin]]></category><category><![CDATA[ID]]></category><category><![CDATA[batch]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Mon, 09 Apr 2018 19:28:52 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>I have faced an issue on a Grails project: domain entities have assigned ids, so in tests I have to manually set up an id for every created entity:</p>
<pre><code>new Entity(id: id_value, ...).save()
</code></pre>
<p>It is very verbose and creates an additional task to create and maintain an id variable. In addition, we have the <a href="https://github.com/longwa/build-test-data">build-test-data plugin</a> in our tests, so I was wondering if it is possible to get help from this plugin to generate IDs. And I found a solution: you have to add the following expression to <code>TestDataConfig.groovy</code>:</p>
<pre><code>testDataConfig {
    sampleData {
        ...
        Holders.grailsApplication.domainClasses.each { domainClass -&gt;
            if (domainClass.fullName.startsWith('domain.package.to.get.ids')) {
                this.&quot;${domainClass.fullName}&quot; {
                    long i = 1
                    this.id = { -&gt; i++ }
                }
            }
        }
    }
}
</code></pre>
<p>The expression will generate the following code for all domain classes in the package <code>domain.package.to.get.ids</code>:</p>
<pre><code>'domain.package.to.get.ids.SomeEntity' {
        long i = 1
        this.id = { -&gt; i++ }
    }</code></pre>
</div>]]></content:encoded></item><item><title><![CDATA[Typecasting in SQL can make the database ignore an index or sort column]]></title><description><![CDATA[<div class="kg-card-markdown"><p><em>I faced this issue while working with Redshift. But it can be found in any SQL database engine.</em></p>
<p>Redshift is known as a columnar database that runs queries in parallel thanks to distribution and sort keys. Of course, the most widely used sort key is a date, and you should</p></div>]]></description><link>http://not4j.com/redshift-does-not-take-into-account-sort-column-2/</link><guid isPermaLink="false">5a0066bcc0606b08948dfb75</guid><category><![CDATA[sql]]></category><category><![CDATA[redshift]]></category><category><![CDATA[timestamp]]></category><category><![CDATA[postgres]]></category><category><![CDATA[casting]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Sat, 24 Mar 2018 10:23:33 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p><em>I faced this issue while working with Redshift. But it can be found in any SQL database engine.</em></p>
<p>Redshift is known as a columnar database that runs queries in parallel thanks to distribution and sort keys. Of course, the most widely used sort key is a date, and you should be really careful with it in queries. A sort key is associated only with a specific date/time type, so converting this field to any other type will cause a huge drop in performance: Redshift has to convert all the stored values to the new type during query execution, and the converted values have no sort index, so the sort key gives no advantage.</p>
<p>As an example, let's imagine that there is a column of type timestamp without time zone. In the SQL query we need to convert a string to a timestamp, so let's use the function <code>TO_TIMESTAMP</code> (<a href="http://docs.aws.amazon.com/redshift/latest/dg/r_TO_TIMESTAMP.html">http://docs.aws.amazon.com/redshift/latest/dg/r_TO_TIMESTAMP.html</a>):</p>
<pre><code>select * from
some_table
where timestamp_field &gt; TO_TIMESTAMP('2018-03-24', 'yyyy-MM-dd')
</code></pre>
<p>Actually, in this case Redshift has to compare the field <code>timestamp_field</code> of type <code>TIMESTAMP</code> against the value <code>2018-03-24</code> in <code>timestampTZ</code> format, because the <code>TO_TIMESTAMP</code> function returns a <code>timestampTZ</code> value. It requires implicit casting of <code>timestamp_field</code> to the type <code>timestampTZ</code>, and the result is effectively a brand-new field without any sort key.</p>
<p>To avoid the performance drop, you can use <code>::TIMESTAMP</code> (e.g. <code>where timestamp_field &gt; '2018-03-24'::TIMESTAMP</code>) or directly the value <code>'2018-03-24'</code> with no casting; it will be cast implicitly to <code>TIMESTAMP</code>.</p>
<p>I have faced this issue on a table with ~60M rows: the <code>to_timestamp</code> function caused performance to drop to 90 seconds per query, against 6 seconds when using <code>::TIMESTAMP</code>.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Crafting an architecture solution at a tech interview]]></title><description><![CDATA[<div class="kg-card-markdown"><p><em>From time to time at technical interviews, you can face architecture tasks. The main aim of such tasks is to find out your ability to create scalable and solid solutions with attention to bottlenecks. Moreover, such a task gives an opportunity to gauge your familiarity with modern tools and frameworks.</em></p></div>]]></description><link>http://not4j.com/crafting-architecture-solution-at-tech-interview/</link><guid isPermaLink="false">5a00587ac0606b08948dfae6</guid><category><![CDATA[interview]]></category><category><![CDATA[architecture]]></category><category><![CDATA[high-load]]></category><category><![CDATA[scalability]]></category><category><![CDATA[amazon services]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Wed, 01 Mar 2017 15:27:00 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p><em>From time to time at technical interviews, you can face architecture tasks. The main aim of such tasks is to find out your ability to create scalable and solid solutions with attention to bottlenecks. Moreover, such a task gives an opportunity to gauge your familiarity with modern tools and frameworks. In addition, this type of problem in the real world comes with a lot of unclear/unknown facts and numbers, and you should expect the same in an interview task. I have faced this kind of task and am going to show an example of a possible solution.</em></p>
<p><strong>What is the task?</strong></p>
<p>The goal is to create a service for &quot;recording&quot; user actions on customers' sites and then replaying those actions on the backoffice side of the service, allowing customers to understand how to improve the UX of their sites. The service should be ready to track 100K sites (the level of a top service with the same functionality).</p>
<p>A simple version of the recording feature looks like an easy task at first glance. It is not.</p>
<p><strong>High-level feature requirements</strong></p>
<ul>
<li>The customer uses the service to record user actions on the target website,</li>
<li>The customer should be able to replay user actions on the service side at any time,</li>
<li>The service should collect as many actions from the website as possible (mouse events, user inputs),</li>
<li>There is no information about how many users and records the service should be able to serve. To be defined. (One of the main challenges of this kind of task.)</li>
</ul>
<p><strong>Collecting events from customer’s site</strong></p>
<p>A JS script can be used to capture and send user actions. The customer should include it on the tracked site.</p>
<p>The JS script should track and transmit the following events:</p>
<ul>
<li>Mouse and keyboard events over the whole document (onmousemove).</li>
<li>Sniffing of input fields (jQuery's <code>bind</code> function can do this).</li>
<li>Mutation observers for every DOM element on the page (tracking DOM changes).</li>
<li>The end of a user session (e.g. an inactivity timeout and the onbeforeunload event).</li>
</ul>
<p>All these events can be easily captured via JS and sent to the service. Events should be reported with some throttling (~50-100 ms will be fine); this reduces the number of events and decreases the browser load.</p>
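<p>The batching logic can be sketched roughly as follows (a hypothetical sketch: the function names and the batch shape are my assumptions, and the browser wiring - <code>addEventListener</code>, <code>setInterval</code>, <code>navigator.sendBeacon</code> - is shown only in comments):</p>

```javascript
// Minimal sketch of the event batching used by the tracking script.
// Events are buffered and flushed once per interval instead of being
// sent one by one, which cuts the request count and the browser load.
function createBatcher(flushIntervalMs, send) {
  let buffer = [];
  const flush = () => {
    if (buffer.length === 0) return;
    const batch = buffer;
    buffer = [];
    // In the browser: navigator.sendBeacon('/collect', JSON.stringify(batch))
    send(batch);
  };
  return {
    record(type, payload) {
      // Timestamp each event so the replay side can reproduce the timing.
      buffer.push({ type, payload, t: Date.now() });
    },
    flush, // in the browser: setInterval(flush, flushIntervalMs)
  };
}
```

<p>The <code>onbeforeunload</code> handler would call <code>flush()</code> one last time, so the session-end event is not lost.</p>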
<p><strong>Replaying events by user request</strong></p>
<ol>
<li>After picking up all the session's events, load the site in a headless browser on the backend side.</li>
<li>Put a transparent canvas in the foreground over the whole web page.</li>
<li>Emulate the user's actions:<br>
3.1. Draw an image of a mouse cursor on the canvas and emulate mouse moves by moving the image according to the saved events.<br>
3.2. Clicks on hyperlinks can be replayed by loading the linked page in the headless browser.<br>
3.3. Input values can be emulated via JS by filling the input fields.</li>
<li>Screenshots should be taken every ~50-100 ms (approximately 10-20 screenshots per second) via the API of the headless browser during the emulation process. The collected images will be used to produce a video showing the user's actions on the web page in the customer's back office.</li>
</ol>
<p><a href="http://phantomjs.org/">PhantomJS</a> provides a headless browser that can load the target web page, manipulate its DOM, and create screenshots of the loaded page. Next, a video can be created using a server-side utility (e.g. <code>ffmpeg</code> with the command <code>ffmpeg -f image2 -i image%d.jpg video.mpg</code>).</p>
<p><strong>Architecture solution</strong></p>
<p>Let's try to estimate the possible load for this kind of service. There are no requirements in the original task, so let's assume a load similar to one of the most popular services with the same functionality (~100K sites).</p>
<p><strong>Assumptions about load</strong></p>
<ul>
<li>Let's use the 80/20 concept and assume that ~20K of the sites are in the Alexa top 1M.</li>
<li>Alexa top-1M sites are visited ~10-100K times per day.</li>
<li>Assume that, on average, these 20K sites have 50K users per day.</li>
<li>The other 80K sites have significantly fewer visitors; let's take 50 users/day on average.</li>
<li>Like similar services, let's provide a free plan allowing a customer to keep 100 captured sessions per account.</li>
<li>Let's collect events once per 100 ms.</li>
<li>The service should be ready to scale to a larger number of processed sites and more stored sessions for non-free plans.</li>
</ul>
<p><strong>Peak load</strong></p>
<ul>
<li>Every site from the top-visited 20K produces ~0.6 sessions/s (50K visitors per day / (24 * 3600 s))</li>
<li>The top 20K sites in total produce: 0.6 sessions/s * 20K sites = ~12K sessions/s</li>
<li>Each visitor could produce at most 10 events/s: 12K sessions/s * 10 events/s = 120K incoming events/s</li>
<li>The remaining 80K sites will produce: 80K sites * 50 sessions/day / (24 * 3600 s) = ~47 sessions/s =&gt; 47 sessions/s * 10 events/s = ~470 events/s</li>
<li><strong>Total peak request number:</strong> 120.5K requests/s (still <strong>~120K requests/s</strong>)</li>
</ul>
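<p>These peak numbers are easy to re-derive; the little script below repeats the arithmetic (all inputs are the assumptions listed above; the exact result, ~116K, is what the list rounds up to 120K):</p>

```javascript
// Re-deriving the peak-load estimate from the stated assumptions.
const SECONDS_PER_DAY = 24 * 3600;

// Top 20K sites: 50K visitors/day each, 10 events/s per visitor.
const topSessionsPerSec = (50000 / SECONDS_PER_DAY) * 20000; // ~11.6K sessions/s
const topEventsPerSec = topSessionsPerSec * 10;              // ~116K events/s

// The remaining 80K sites: 50 visitors/day each.
const restSessionsPerSec = (80000 * 50) / SECONDS_PER_DAY;   // ~46 sessions/s
const restEventsPerSec = restSessionsPerSec * 10;            // ~463 events/s

const peakEventsPerSec = topEventsPerSec + restEventsPerSec;
console.log(Math.round(peakEventsPerSec)); // ~116K requests/s at peak
```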
<p><strong>What about the data?</strong></p>
<p>Let's estimate the Event object size:</p>
<ul>
<li>session hash (e.g. UUID) - 16 bytes</li>
<li>page address - ~500 bytes</li>
<li>event object (mouse event, click, input, session termination) - ~100 bytes</li>
<li>other possible information and object-structure overhead</li>
</ul>
<p>So, let's roughly estimate the size of one event at ~1 KB.</p>
<p><strong>Incoming peak traffic</strong> will be 1 KB * 120K events/s = <strong>120 MB/s</strong>.<br>
With an unlimited number of collected sessions, this would give us ~10 TB of incoming data per day.</p>
<p>In the case of a limit of 100 sessions per site:</p>
<p>Assume that, on average, users clean old records once a day. Then:</p>
<ul>
<li>Daily sessions: 100K sites * 100 sessions/day = 10M sessions/day.</li>
<li>The average time spent on a site by a user is ~1 min.</li>
<li>Daily events: total events for one session: ~60 s * 10 events/s = 600 events per session.</li>
<li>Daily load: 10M sessions/day * 600 events = 6 BN events/day (~65-70K events/s).</li>
<li>Data size per day: 6 BN * 1 KB = 6 TB of incoming data/day (~70 MB/s on average).</li>
<li>The peak incoming traffic is still ~120 MB/s.</li>
</ul>
<p>Incoming events should be considered temporary data needed only for creating videos, so there is no need to persist the events themselves. Videos should be persisted in a disk storage system; the path to the video and some basic information are the only data persisted in the database.</p>
<p>An in-memory database (e.g. Redis) is the most suitable place for temporarily storing events. 120 MB/s (120K events/s) is a normal load for an in-memory database hosted on one medium-level server. Let's say the max lifetime of stored events is ~1 hour; the <strong>in-memory database</strong> should then have a capacity of 120 MB/s * 3600 s ~ <strong>500 GB</strong>.</p>
<p>Let's estimate the amount of data written at the database layer:</p>
<p>Session object size:</p>
<ul>
<li>a path to the video - 200 bytes</li>
<li>session id - 16 bytes</li>
<li>some additional info (e.g. date stamps)</li>
</ul>
<p>In total, let's consider it ~0.5 KB.</p>
<p><strong>Incoming daily data</strong> on the database side: 10M sessions/day * 0.5 KB = <strong>~5 GB/day</strong>.<br>
The stored amount will not grow the next day due to the 100-session limit; deleted sessions will free disk space.<br>
<strong>Amount of data written</strong> per second: ~100 sessions/s =&gt; <strong>~50 KB/s</strong> will be written at the DB layer.</p>
<p>These are very comfortable numbers for the DB layer. A regular server can manage the incoming data.</p>
<p>Disk space consumed by video:</p>
<ul>
<li>Let's assume the average size of a one-minute session video is ~5 MB (it could be less; video-compression algorithms need investigation).</li>
<li><strong>Daily data</strong>: 10M sessions * 5 MB = <strong>50 TB/day</strong></li>
</ul>
<p>It is a large amount of data; however, it will not grow, thanks to the 100-sessions-per-site limit.</p>
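<p>The storage figures above can be cross-checked in a few lines (inputs are the assumptions made earlier: 1 KB per event, 0.5 KB of metadata per session, 5 MB per one-minute video; decimal units for simplicity). Note the cache figure comes out at ~432 GB, which is what rounds up to the 500 GB quoted above:</p>

```javascript
// Cross-checking the storage estimates from the stated assumptions.
const sessionsPerDay = 100000 * 100;       // 100K sites * 100 kept sessions
const eventsPerSession = 60 * 10;          // 1-minute session at 10 events/s

const eventDataTBPerDay = (sessionsPerDay * eventsPerSession * 1) / 1e9; // 1 KB events, KB -> TB
const cacheCapacityGB = (120 * 3600) / 1000;          // 120 MB/s kept for ~1 hour
const dbDataGBPerDay = (sessionsPerDay * 0.5) / 1e6;  // 0.5 KB metadata, KB -> GB
const videoDataTBPerDay = (sessionsPerDay * 5) / 1e6; // 5 MB videos, MB -> TB

console.log(eventDataTBPerDay, cacheCapacityGB, dbDataGBPerDay, videoDataTBPerDay);
// -> 6 TB/day of events, 432 GB of cache, 5 GB/day of metadata, 50 TB/day of video
```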
<p><strong>Summary of math</strong></p>
<ul>
<li><strong>Servers:</strong> able to process 120K requests/s</li>
<li><strong>In-memory database:</strong> capacity 500 GB, write speed 120 MB/s (120K cache puts/s) + approximately the same read speed</li>
<li><strong>Database:</strong> starts at ~5 GB/day, not increasing heavily</li>
<li><strong>Disk:</strong> starting amount ~50 TB, not increasing heavily, write throughput ~600 MB/s</li>
</ul>
<p>The real-life load will be several times lower, but the setup should be ready to increase capacity up to these numbers.</p>
<p>The app-layer servers could be built with the Disruptor framework. From my experience, it can serve about 15K requests/s with complex business logic. Here the logic is quite simple, so performance could be much higher (I would expect ~50K events/s, and some optimisation could increase it significantly).</p>
<p>The in-memory database can manage the example load: Redis hosted on a medium-level server can process up to 500K puts &amp; gets per second.</p>
<p>Database: no worries. We are safe.</p>
<p>Disk storage: the required write throughput (~600 MB/s) can be reached by routing the writing processes to different disks.</p>
<p><strong>Algorithm of processing</strong></p>
<ol>
<li>Incoming events are picked up by the app servers, which put them into the in-memory DB using the session id as a key.</li>
<li>Once an end-session event is received, the app server gets all corresponding events from the in-memory DB by session id, packs them into a job object, and puts it into a job queue.</li>
<li>Worker servers pick up the job objects, load the website in a headless browser, emulate the user actions, take screenshots, and render a video to disk storage.</li>
</ol>
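<p>The algorithm can be sketched as follows (a simplified sketch: a plain <code>Map</code> stands in for Redis and an array for the message queue; the class and field names are mine, not from any specific library):</p>

```javascript
// Sketch of the event-processing pipeline described above.
class EventPipeline {
  constructor() {
    this.store = new Map(); // session id -> events (Redis in production)
    this.jobQueue = [];     // job objects for workers (a message queue in production)
  }
  handleEvent(sessionId, event) {
    if (event.type === 'session-end') {
      // Drain the session's events and enqueue a rendering job.
      const events = this.store.get(sessionId) || [];
      this.store.delete(sessionId); // events are temporary data
      this.jobQueue.push({ sessionId, events });
    } else {
      // Append the event to the session's buffer.
      if (!this.store.has(sessionId)) this.store.set(sessionId, []);
      this.store.get(sessionId).push(event);
    }
  }
}
```

<p>A worker would then pop a job from the queue, replay the events in a headless browser, and write the rendered video to disk storage.</p>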
<p><strong>Architecture of service</strong></p>
<p>According to the limitations described above, the service architecture should have 5 main components:</p>
<ol>
<li>App servers for processing incoming events (scalability important)</li>
<li>In-memory DB (scalability important)</li>
<li>Worker servers to create videos (scalability important)</li>
<li>Disk storage for created videos (scalability &amp; resilience important)</li>
<li>Database to store meta info about videos (resilience important)</li>
</ol>
<p><strong>How to connect all these components?</strong></p>
<ol>
<li>All incoming requests should be routed by a load balancer across the app servers.</li>
<li>According to the math, 1-2 active servers will be enough to process all incoming events (to be validated with load tests). They can be scaled easily.</li>
<li>To make the in-memory database scalable, solutions like Redis Cluster or Amazon ElastiCache can be used. They take care of scalability and resilience out of the box.</li>
<li>A message queue can store the prepared job objects.</li>
<li>Stateless workers can be easily scaled to process the incoming jobs, creating videos from the recorded events.</li>
<li>The database layer storing metadata about videos does not seem to be a high-load part of the system. A usual database server will be enough to store all expected data, with a huge reserve.</li>
<li>The load on the disk storage system should be split, because the potential write speed can be several times higher than the write speed of modern SSD disks. Amazon EBS can be used, where max throughput can be scaled up automatically to 1.2 GB/s. The same can be organised without Amazon: the main idea is to split the writing threads across several physical disks. This can be done at the moment of putting a job into the message queue (just put the disk name in a field of the job object) or by using a load balancer between the worker servers and the disk storage system.</li>
</ol>
<p><img src="http://not4j.com/content/images/2016/11/Untitled-Diagram--1-.png" alt="alt"></p>
<p>Resilience is critical for the database and the disk storage system (it can easily be achieved by using replication servers); however, all other components are not so critical. In the worst case, user sessions currently being processed will be lost. Considering the limit of 100 sessions per site, this is not critical for the system - the aim is not to save every user session, the service should just provide a set of sessions for analysis. Lost sessions will be replaced with new incoming sessions once the service components are restarted.</p>
<p>In conclusion, I'd like to mention that the calculated numbers make it possible to find the bottleneck paths and to design an architecture solution that takes these paths into account. However, to estimate the exact number of servers in the application cluster, the number of cache instances, and the capacity and latency of the system, some experimental research should be done. Please remember that the possible peak values were calculated with a pessimistic load scenario (no unused accounts, no dead/unvisited sites, customers capture sessions all the time, etc.).</p>
<p><em>During preparation, I used knowledge obtained from:</em></p>
<p><em><a href="http://highscalability.com/">highscalability.com</a></em></p>
</div>]]></content:encoded></item><item><title><![CDATA[Continuous integration for Spring Boot app using Codeship + Openshift]]></title><description><![CDATA[<div class="kg-card-markdown"><p>I have already talked about <a href="http://not4j.com/fast-continuous-integration-for-java-groovy-developer/">CI using the Codeship + Openshift pair</a> and I like this combination. Recently, I started a new project based on Spring Boot and faced an issue: Openshift does not provide an out-of-the-box pattern to run a Spring Boot application.</p>
<p>Spring boot application could be built in</p></div>]]></description><link>http://not4j.com/continius-integratio-of-spring-boot-app-using-codeship-openshift-2/</link><guid isPermaLink="false">5a00587ac0606b08948dfae2</guid><category><![CDATA[Openshift]]></category><category><![CDATA[java]]></category><category><![CDATA[Spring boot]]></category><category><![CDATA[Codeship]]></category><category><![CDATA[continuous-integration]]></category><category><![CDATA[CI]]></category><dc:creator><![CDATA[Ivan Zerin]]></dc:creator><pubDate>Sun, 04 Sep 2016 10:30:21 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>I have already talked about <a href="http://not4j.com/fast-continuous-integration-for-java-groovy-developer/">CI using the Codeship + Openshift pair</a> and I like this combination. Recently, I started a new project based on Spring Boot and faced an issue: Openshift does not provide an out-of-the-box pattern to run a Spring Boot application.</p>
<p>A Spring Boot application can be built in two ways: with an embedded Tomcat or without it. After reading some tutorials on how to build an application without Tomcat, I decided not to go this way - it requires some code changes plus some settings tuning on the Openshift side. The solution with the embedded Tomcat looks more consistent and reliable to me in this case.</p>
<p>To set up a Spring Boot app, you need a DIY gear with Maven and Java 8. I was lucky to find a <a href="https://github.com/kolorobot/openshift-diy-spring-boot-sample">repo on GitHub</a> with a sample Spring Boot application on a DIY Openshift gear, so I used this code as the starting point for my application. The main magic is located in the file <a href="https://github.com/kolorobot/openshift-diy-spring-boot-sample/blob/master/.openshift/action_hooks/deploy"><code>.openshift/action_hooks/deploy</code></a>. It downloads <code>Maven</code> and sets up all the needed environment variables.</p>
<p>As the next step, you should build and test your Spring Boot project on Codeship. All technical details about setting up Codeship can be found <a href="http://not4j.com/fast-continuous-integration-for-java-groovy-developer/">here</a>. Note that in that tutorial the build was created for a Grails project, so the test and deploy steps for a Spring Boot project will be slightly different.</p>
<p>First of all, we need to use Java 8 on the Codeship server for tests instead of the default Java 7. To enable Java 8, include this in the setup section of <code>Project settings -&gt; Test</code>:</p>
<pre><code>jdk_switcher home oraclejdk8
jdk_switcher use oraclejdk8
</code></pre>
<p>The test script has nothing special except a Spring profile for CI execution:</p>
<pre><code>mvn clean initialize
mvn clean test -Dspring.profiles.active=ci
mvn package -Dmaven.test.skip=true
</code></pre>
<p>The deployment script is the most interesting part here. Usually, when you use the standard Tomcat Openshift gear, all you need is to place the <code>WAR</code> file in the correct directory, and it will be deployed automatically. I want to use the embedded Tomcat, so I get a packed <code>JAR</code> that has to be started manually - and before starting it, the old running version of the application should be stopped. Performing all these operations from a deployment script located on <code>Codeship</code> is not possible due to insufficient access rights to start and stop processes. All I can do remotely is file operations and controlling the <code>Openshift</code> gear lifecycle. This gives me a way to resolve the deployment problem:</p>
<ol>
<li>Stop the running gear.</li>
<li>Delete the old <code>JAR</code>.</li>
<li>Copy the freshly built <code>JAR</code> from <code>Codeship</code> to <code>Openshift</code>.</li>
<li>Start the gear.</li>
<li>An <code>Openshift</code> action hook triggered by the gear start will launch the application from the new <code>JAR</code>.</li>
</ol>
<p>The first 4 steps of the plan are implemented in the <code>Codeship</code> deploy script:</p>
<pre><code>ssh &lt;some_hash&gt;@project_name-acc_name.rhcloud.com 'gear stop'
scp -rp /home/rof/src/&lt;codeship_account&gt;/&lt;codeship_project&gt;/target/*.jar &lt;some_hash&gt;@project_name-acc_name.rhcloud.com:~/app-root/runtime/repo/target/
ssh &lt;some_hash&gt;@project_name-acc_name.rhcloud.com 'gear start'
wget --retry-connrefused --no-check-certificate -T 60 https://project_name-acc_name.rhcloud.com/somepath/to/ping
</code></pre>
<p>Actually, there is no need to remove the old <code>JAR</code>, because it always has the same name, so the new file will overwrite the old one. I added the ping action as the last command to be sure the application is up and running once I see that the build has passed.</p>
<p>Now let's take a look at the 5th step of the plan - starting the application with an action hook. All <code>Openshift</code> hooks are placed in <code>.openshift/action_hooks</code> in the git repository of your <code>Openshift</code> project. As I started with <a href="https://github.com/kolorobot/openshift-diy-spring-boot-sample">the sample</a> Spring Boot layout for <code>Openshift</code>, I already have an action-hook script for the gear start: <code>.openshift/action_hooks/start</code>. The main part of this script is this line:</p>
<pre><code>nohup java -Xms384m -Xmx412m -jar target/*.jar --server.address=${OPENSHIFT_DIY_IP} --spring.profiles.active=openshift &amp;
</code></pre>
<p>So all I need here is to make sure that the path and the Spring profile are correct.</p>
</div>]]></content:encoded></item></channel></rss>