Enhanced dbt Tests
Grai exposes many of our pipeline tests, including those defined in dbt, to be run anywhere else in our data stack. That means, whenever we change upstream data assets like our production database we can validate that those changes will not break our data pipelines.
This guide will walk you through configuring your data lineage graph from scratch using Postgres as our production database and Google BigQuery as our data warehouse with data transformations managed in dbt.
Finally, we will configure automate testing against the production database using GitHub Actions.
This guide is written based on using Grai cloud (opens in a new tab), however, it will run equally well on your own self-hosted implementation.
Starting on the home screen selection
Connections from the top of the screen and then
Now Choose the Google BigQuery connection button.
Fill out the the connection details for your BigQuery data warehouse. This guide uses test database credentials, so that you can easily follow along. Substitute your own credentials as needed.
Notice the Namespace field, this is an important part of the Grai lineage structure. This is how you manage two connections pulling from the same source. For more details checkout the Namespaces section.
|credentials||Copy and paste from below|
|Log Parsing Window|
If you want to use your own BigQuery instance you can follow the instructions here (opens in a new tab) to create a service account and then copy the credentials into the form.
Press Continue - if you see an error double check your provided connection credentials.
For now we won't set a schedule for this connection. However you could set your connection to run every day ensuring your lineage is up to date.
Press Finish to complete setup. Once your new connection is created you should see the connection page. Press Refresh in the upper right hand corner to run your connection and populate your data lineage graph.
The run will take a few seconds to run and then you should see the run complete successfully.
Now that the connection has completed running you can go to the
Graph page to see your BigQuery tables.
It will likely look something like this
Currently, there are no connections between the tables, this is because BigQuery doesn't have any foreign keys setup.
Again go to the connections page and click "Add Connection." This time, choose the dbt connection.
Namespace value as default for now.
For the purposes of this demo, we've created a manifest.json (opens in a new tab) file you can use which
was created by running
dbt compile on the source repository (opens in a new tab).
Using the same
Namespace value for both BigQuery and dbt ensures the tables
and columns are linked together.
Press Finish to upload the file and run the import. This should take a few seconds before completing successfully.
Finally, check out the graph view again. You should now see that your BigQuery tables have been joined together based on the dbt metadata. For example where one dbt model depends on another Grai has created a link between the two tables.
We will be using Postgres for our production database in this instance so again go to the connections page, click "Add Connection" This time, choose the PostgreSQL connection.
Enter the following values
You've probably getting the hang of this now, but just in case, press continue. Assuming your credentials are valid, you can press Finish to complete the setup and then run the connection.
If we check out the Graph page, we can see that the Postgres tables have been added to the graph.
There should be a number of connections between the Postgres tables, these have been produced from the foreign keys.
There won't be any links between the Postgres tables and the BigQuery ones. Typically, an ETL tool like Fivetran would be used to transfer data between the production database and the data warehouse but for the purposes of this demo we will setup these edges using yaml files.
Just like before,
Connections -> Add Connection -> YAML File
This combination of integrations, and yaml files, allows you to create a complete data lineage graph for your data warehouse no matter what tools you use.
You will need a GitHub repository for this stage. Forking the example repository might be a good starting point jaffle_shop_bigquery_demo (opens in a new tab).
Now, we are going to use GitHub Actions to automate running tests against our dev database using Grai.
.github/workflows directory in your repository on GitHub if this directory does not already exist.
.github/workflows directory, create a file named
You can find more information about workflow yaml files in GitHub's documentation, see Creating new files (opens in a new tab).
Now copy the following YAML contents into the
on: [pull_request] jobs: test: name: Evaluate Postgres Changes runs-on: ubuntu-latest steps: - name: data lineage check uses: grai-io/grai-actions/postgres@master with: namespace: "prod" workspace: [WORKSPACE_UUID] client-host: "api.grai.io" api-key: [API_KEY] grai-frontend-host: "https://app.grai.io" db-host: sample-database.cudyk77thtpt.us-west-2.rds.amazonaws.com db-user: postgres db-password: jnicGnmRdXclhURSRr86 db-database-name: jaffle_shop_dev
Normally you would use the credentials for the development database that changes with each pull request or commit in order to evaluate your tests. In this case, we've created a sample database for you to use.
To finish this off, we will need to find your
Workspace UUID and create an
API Key for your Grai account.
To generate an Api Key in the Web App, navigate to
Settings from the Profile Menu in the top right and select API Keys.
Press Add API Key and choose a suitable name, for example
demo, then press Save.
You should record the returned key, as it will not be displayed again.
This can be copied into the
[API_KEY] slot in the
In production you should use a GitHub Action secret to store this value, so that it isn't checked into version control, see Encrypted secrets (opens in a new tab).
Staying in can be seen at the top of the API keys page you should see a field that says
WorkspaceId followed by a UUID value at the top of the page.
Copy that UUID into the
[WORKSPACE_UUID] slot in the
To see the action run you will need to create a pull request in GitHub. Once the action has run, if you have used the demo database credentials, you should see a number of comments on your PR showing some test failures. You can click on the link to view more details on the Web App.
In this tutorial, we have shown how to use Grai to create a data lineage graph, and then use that graph to run tests against a development database. We have also shown how to automate this process using GitHub Actions.
If you have any questions or feedback, please reach out to us at firstname.lastname@example.org and in the meantime, happy data engineering!