Enhanced dbt Tests

Enhanced dbt Tests

Grai exposes many of our pipeline tests, including those defined in dbt, to be run anywhere else in our data stack. That means, whenever we change upstream data assets like our production database we can validate that those changes will not break our data pipelines.

This guide will walk you through configuring your data lineage graph from scratch using Postgres as our production database and Google BigQuery as our data warehouse with data transformations managed in dbt.

Finally, we will configure automate testing against the production database using GitHub Actions.

ℹ️

This guide is written based on using Grai cloud (opens in a new tab), however, it will run equally well on your own self-hosted implementation.

Connect Data Warehouse

Starting on the home screen selection Connections from the top of the screen and then Add Connection.

Add Connection button

Now Choose the Google BigQuery connection button.

BigQuery Connection button

Fill out the the connection details for your BigQuery data warehouse. This guide uses test database credentials, so that you can easily follow along. Substitute your own credentials as needed.

ℹ️

Notice the Namespace field, this is an important part of the Grai lineage structure. This is how you manage two connections pulling from the same source. For more details checkout the Namespaces section.

FieldValue
Namespacedefault
NameGoogle BigQuery
projectgrai-demo
datasetgrai_bigquery_demo
credentialsCopy and paste from below
Log ParsingFalse
Log Parsing Window

Sample BigQuery Credentials

ℹ️

If you want to use your own BigQuery instance you can follow the instructions here (opens in a new tab) to create a service account and then copy the credentials into the form.

{
"type": "service_account",
"project_id": "grai-demo",
"private_key_id": "808cdbe329203ca6823ca4a5298a8f1c17430a5b",
"private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvQIBADANBgkqhkiG9w0BAQEFAASCBKcwggSjAgEAAoIBAQCbGvNvBJMkv7tW\njLeT1U/Ds7/zO5rd+Q8Fm3QHjvKp5VX1HTqo4mGqUnpavSlnf9acUHRLKXIMeuzm\nexVWpZHXkGvMf3dR632ZkiYHopPMCuid5iQvDD5jp3wFgp2sq438wXnR0t+hnOoF\nyk4yEfCKkdbWm4wvqOpVivwjv5Up+wV9GHIYi3gIWJsm0Ftp4ImDOx/FPNGjLWs9\nxpXWCe9YVUO4ZFl0quThDMyuUsWi9eSF7y2GeVNaJA9Lqzj+Ybvbor8qe4PKcFS3\n+pu9CFqIcsBU9n0xOIrkvIQt06JIL7LsBH0LrBj/LJaxIaoJYhKnJOhVRVb7/WmD\nWGYUMa1/AgMBAAECggEAAgUcVX5MQkbC9FIZ0/GLu+M75TmJ/0kOkoafQLfyXhcQ\nAwSx7mAgYHz5ffQHtYLbMybzFMWB6Dqzmh7RPNMkLpgBZ2Tjk2wLMzRNjXznCtm7\nhM7p+rjgmSClw1abohOL4lZMtVhXND4caB3l3c4RAWT71MiqzzYWcy2/seF9WlIW\n4mu1ruqUJTCpXb46mlPaR1N2J89gz32R06aiuMfrR3Yt0v5Rj+vnNZA+srLTJX19\n6J4R2UedJ7IZIL9QBbZHxumY0jgw/GbSFV40kBk+wVuJ0E1pbX8I0gLxmMR/EwFk\n5YyReNzEsocPDalF6rcmcSvqNZ9zsiybLkpdQ9HkDQKBgQDRQOZnfJjQd3o4rXqa\nDe4FhOD3J3QxqP/hPtMU5IOIttZQHX32koTZZyBNgMtS+heHCIGnDpm1ram8u7vw\nXy+1OcAx9eZs5pFuM/zTD5fD68/8bZe6X1+NQ31ENXa5jkGTaXKx+MUtp5xwi4jx\ngRTcSKDt+PEmu6fr/OJ2O9bMRQKBgQC9wVdaWxT6uBGoY4+KTZBAwMrKzZ0/SRky\npZR13uLqMOFlh/X3bq56nVSl4yMDljsT0lFXRctjsTNwc+h0QYXc4fXT5YtHmksI\npv90TYQg2qK2wQ9bhHldTCJsTFGPCxLlMD8AyL6PK9hCuatxmYc0LFF5OXy8Itlb\nqbC1c4co8wKBgCflNsh+QehlDyFlOd3LUBkvR3D3zbh2HysDvlzaYJWdPmkR5mUv\ndDK67ba5GorcccXmAkomh3nS/WylYmSm0UK9Gv6rgl4663lWYhqfe3D4MbRP9MCs\n1FvrhSOPCe7Ax5HiZeK2qmlU7oeqotZgpOiG1F/quZeH6bEditO9/ur9AoGAOcrA\nyAwlf5bACgEInp6w6IfPO6UT10p0GjDD3oJbqefpPfsCtrFHAqEYPs3GxDjlFUxg\n6augHmTBveYPThkGpBdNv5ORr+UWJTR3aPyS2U69b9usybq3G+ssML+tt1swDg17\nosmBACniW7AgvyB7RTCaP8l6a/JRMNGluB3PdHECgYEAnB7nYBirbRkpDrRWfbiV\n6X9iV0284NaLqLIU9Y1USCGAnv38bxlz+kNtHUX2QKpnoXwbXVr/yHH9b92teCLx\nnL30lTZPVy9dweuybgdIwGNlyq28pirNbLXYH/zRo9YMJ0KzArpgBi8U5w1mNud3\n0MgqDH8AZvYngs7700WDW+o=\n-----END PRIVATE KEY-----\n",
"client_email": "grai-352@grai-demo.iam.gserviceaccount.com",
"client_id": "113796530500416825812",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/grai-352%40grai-demo.iam.gserviceaccount.com"
}

Press Continue - if you see an error double check your provided connection credentials.

For now we won't set a schedule for this connection. However you could set your connection to run every day ensuring your lineage is up to date.

Press Finish to complete setup. Once your new connection is created you should see the connection page. Press Refresh in the upper right hand corner to run your connection and populate your data lineage graph.


BigQuery Connection page

The run will take a few seconds to run and then you should see the run complete successfully. Now that the connection has completed running you can go to the Graph page to see your BigQuery tables. It will likely look something like this


BigQuery Graph page

Currently, there are no connections between the tables, this is because BigQuery doesn't have any foreign keys setup.

Import dbt Manifest

Again go to the connections page and click "Add Connection." This time, choose the dbt connection.


dbt Connection button

Leave the Namespace value as default for now. For the purposes of this demo, we've created a manifest.json (opens in a new tab) file you can use which was created by running dbt compile on the source repository (opens in a new tab).

ℹ️

Using the same Namespace value for both BigQuery and dbt ensures the tables and columns are linked together.

Press Finish to upload the file and run the import. This should take a few seconds before completing successfully.

Finally, check out the graph view again. You should now see that your BigQuery tables have been joined together based on the dbt metadata. For example where one dbt model depends on another Grai has created a link between the two tables.


dbt Graph page

Connect the Production Database

We will be using Postgres for our production database in this instance so again go to the connections page, click "Add Connection" This time, choose the PostgreSQL connection.


Postgres Connection button

Enter the following values

FieldValue
Namespaceprod
NamePostgreSQL
hostsample-database.cudyk77thtpt.us-west-2.rds.amazonaws.com
port5432
Database Namejaffle_shop
userdemo
passwordzfYD%qW2VOfUmK1Y

You've probably getting the hang of this now, but just in case, press continue. Assuming your credentials are valid, you can press Finish to complete the setup and then run the connection.

If we check out the Graph page, we can see that the Postgres tables have been added to the graph.


Postgres Graph page

There should be a number of connections between the Postgres tables, these have been produced from the foreign keys.

Connect our production database to the data warehouse

There won't be any links between the Postgres tables and the BigQuery ones. Typically, an ETL tool like Fivetran would be used to transfer data between the production database and the data warehouse but for the purposes of this demo we will setup these edges using yaml files.

Just like before, Connections -> Add Connection -> YAML File

YAML File Connection button

Again, leave Namespace as default. Upload the edges.yaml (opens in a new tab) file from the source repository (opens in a new tab). Press Finish and check out the Graph page to see the results.


YAML File Graph page

This combination of integrations, and yaml files, allows you to create a complete data lineage graph for your data warehouse no matter what tools you use.

Setup GitHub action

ℹ️

You will need a GitHub repository for this stage. Forking the example repository might be a good starting point jaffle_shop_bigquery_demo (opens in a new tab).

Now, we are going to use GitHub Actions to automate running tests against our dev database using Grai.

Create a .github/workflows directory in your repository on GitHub if this directory does not already exist.

In the .github/workflows directory, create a file named grai-actions-demo.yml. You can find more information about workflow yaml files in GitHub's documentation, see Creating new files (opens in a new tab).

Now copy the following YAML contents into the grai-actions-demo.yml file:

on: [pull_request]
 
jobs:
  test:
    name: Evaluate Postgres Changes
    runs-on: ubuntu-latest
    steps:
      - name: data lineage check
        uses: grai-io/grai-actions/postgres@master
        with:
          namespace: "prod"
          workspace: [WORKSPACE_UUID]
          client-host: "api.grai.io"
          api-key: [API_KEY]
          grai-frontend-host: "https://app.grai.io"
          db-host: sample-database.cudyk77thtpt.us-west-2.rds.amazonaws.com
          db-user: postgres
          db-password: jnicGnmRdXclhURSRr86
          db-database-name: jaffle_shop_dev
ℹ️

Normally you would use the credentials for the development database that changes with each pull request or commit in order to evaluate your tests. In this case, we've created a sample database for you to use.

To finish this off, we will need to find your Workspace UUID and create an API Key for your Grai account.

API Key

To generate an Api Key in the Web App, navigate to Settings from the Profile Menu in the top right and select API Keys. Press Add API Key and choose a suitable name, for example demo, then press Save.

You should record the returned key, as it will not be displayed again. This can be copied into the [API_KEY] slot in the grai-actions-demo.yml file. In production you should use a GitHub Action secret to store this value, so that it isn't checked into version control, see Encrypted secrets (opens in a new tab).

Workspace UUID

Staying in can be seen at the top of the API keys page you should see a field that says WorkspaceId followed by a UUID value at the top of the page. Copy that UUID into the [WORKSPACE_UUID] slot in the grai-actions-demo.yml file.

Running the Action

To see the action run you will need to create a pull request in GitHub. Once the action has run, if you have used the demo database credentials, you should see a number of comments on your PR showing some test failures. You can click on the link to view more details on the Web App.

Example GitHub Comment

Wrapping Up

In this tutorial, we have shown how to use Grai to create a data lineage graph, and then use that graph to run tests against a development database. We have also shown how to automate this process using GitHub Actions.

If you have any questions or feedback, please reach out to us at hello@grai.io and in the meantime, happy data engineering!