What we are missing in data CI/CD pipelines?

Originally that article was posted at my blog on Medium.

I told the story of how we set up our CI/CD with a data pipeline using classical CI/CD tools — Gitlab previously. As a final idea, I just put a note that Gitlab is not a great tool for doing that job. You have to create several workarounds. That statement is quite general and does not fit into the concept of “Offering Solutions Rather Than Criticism”. I’m not able to provide a solution at the moment. However, the first step of creating the solution is to understand what the cons of the current solution are and how it could be improved. So I take a stop to think about: what data engineers are missing in CI/CD tools.

Let’s start with a setup that we will be talking about:

batch data processing: new data arrives in some period of time and there is a way to deal with new data only;
quite heavy and complex data transformation: it takes a lot of time to process;
there are several contributors/developers for data models, so there is a need for a test environment (or several environments);

Incremental build.

At the beginning, it sounds a little bit fantastic maybe, but hold on for a while. Let’s say to rerun the whole data model could take you 30–40 minutes. It would not a big fun to develop a new data model if you need to wait that amount of time to get the result. Tools like DBT has a way to select only a particular model to run:

dbt run -t prod --model some_test_model

Or to rebuild some model and all dependant models also:

dbt run -t prod --model some_test_model+

That simple feature helps you to speed up development. I’m pretty sure everyone uses that way to check the result of your development. Why it could not be used in CI/CD process? Actually, it could. You need to create some script for the current CI/CD system, that will detect the list of changed files in a commit, next get only files related to data models and run every model one by one including dependent models.

In case of big changes, there is a high chance to run the same model several times during on CI/CD cycle. Let’s imagine that there is a commit with 2 files:

data_model1
          |_ data_model2

data_model2 is dependant on data_model1 so it could be covered with one run only:

dbt run -t prod --model data_model1+

However, in the first approach we will two 2 runs, because of lack of knowledge relationship between that models:

dbt run -t prod --model data_model1+
dbt run -t prod --model data_model2+

We could try to rework the initial approach to take that into account. Let’s say we get a table that has mapping between hash (calculated based on content of file) of the current version of every data model file and table located in our data warehouse:

<hash1>:table1
<hash2>:table2
...

So every time you get a new version arrived into git, you could compare its hash with that table and figure out that the current table was built with the previous version of the data model. It should be recreated with all its children. Hashcodes could be also checked after every rebuild of the model. So in the case above the run:

dbt run -t prod --model data_model1+

You will get such hashcode:

<new_hash1>:table1
<new_hash2>:table2
...

So before running the rebuilding process for data_model2 it is possible to see that table2 has a new hashcode and that run is useless.

Anyway, you could come up with some other strategy, but it is still possible to implement that approach with some bash script. However, that requires some effort, quite a deep knowledge of CI/CD tools API, and may require your to find out several workarounds, but it is real…

So why could not it be that system of our dream will have that feature built-in? Just provide adapters for the most popular data transformation tools and implement the logic of incremental build under the hood.

Sync of data across different environments

Working with data involves the same stage as classic software development. You are developing some approach, testing it, and delivering it to production. It is similar from a high-level perspective, but there are a lot of aspects differences at the ground level. One of the most significant differences: you could not develop new data models and features locally in most cases. It is caused by 2 main reasons:

amount of data you need to develop your approach could be quite big and could not fit into local data storage;
most popular DW engines are available only as cloud-based solutions;
That makes it almost impossible to have your own development environment. The obvious way to resolve it is to have a separate cluster/database/schema for every environment. That solution introduces a new issue: how to sync the data into stage/test environments? How to it smooth and easy? It really depends on DW engine you are using.

E.g. Redshift requires you to make a full cluster clone and takes quite a long period of time to do that data cloning. It requires paying for another cluster similar in size to the production one to be able to fit all the data. There is also no internal tool to do it in one click, it is done by dump and copy commands running within prod Redshift cluster. It is also obvious that cloning of data in Redshift better to provide in low-activity hours. Users would not be affected by dump command on production.

On the other hand, Snowflake has a great internal feature to get an extra environment called ‘zero-copy clone’. In short, it just makes a new data “branch” that tracks changes to the new environment separately rather than copying the whole data. That cloning operation happens really fast and does not require your to pay some extra money. That operation could be done with one command.

As you can see there are totally different ways to clone the data in different, so they need to have a different way to interact. Why could not our CI/CD tools has adapters for different DW engines offering different strategies to create environments?

In case, we go deeper, we could get a smarter approach. What if you need to refresh the data during the working hours with one table you working with? It looks easy: just copy one table. But in the case of data models, it could lead to inconsistency with connected tables. The way to work with it is to get all related tables in that refresh activity.

For example, DBT tool has the ability to build DAG to get such information. So let’s image our CI/CD tool has “short refresh flow” to get only target data and we are using DBT. That flow could be implemented by collecting a list of connected tables to target data by running the DBT command (getting the data from DBT DAG) and next using DW adapter to perform “refresh data” activity on a limited amount of data models only.

Ability to check results of data tests during CI/CD

In case you are using data tests, you know how important they are! But it is not only some kind of trigger for developer to take action. It is not the same as it acts in the classical flow of software development (get an error->fix it->run it again). In data engineering fallen data test means sometimes not only fixing the root cause, but the side effects also (in classical software development actually it is also the case, but it is quite often left undetected). In the hardest cases getting an error in input data could lead to rerun the whole pipeline starting from reimporting fresh source data. Not all issues with data tests could be so critical, so you should react in a different fashion.

It will be great to have the ability to use fallen tests as triggers to some actions that could be taken. Let’s say that you have a classical data pipeline that is delivering new daily data in the morning. You have imported new source data (e.g. list of sales for yesterday) and generated new business data using source data (daily income amount) for the current day. Afterward, you are running the data tests to check if there are any issues, like null or unexpected values in business data. That test detects that there is some issue and you need to investigate it. It took some time so it is better to wipe out generated data before you could provide the right one. It is always better to make your users wait a little bit without the data, rather than use incorrect ones and make incorrect decisions.

The best way to eliminate that situation is to have a “pre-prod” environment to run the generation process and test afterward, check if there are no issues with critical tables, and trigger the generation process on a prod. However, it is also not 100% protection against such cases, because it is really hard to create an exact copy of the production system layout. In that case, you could use a second line defense: CI/CD pipeline checks test results for production data generation, and in case there are some issues with particular tables — automatically starts the rollback process.

How it could be done? Once again by some adapter module in CI/CD pipeline that supports DBT Tool testing processes or GreatExpectation framework.

In the end

As you could see there are several reasons why it is hard to fit classical CI/CD processes in the data pipeline. I believe the data engineering community is waiting for custom shaped CI/CD tool. Did I miss something you want to get from such tool?