
Recap

In my previous post, I wrote about using GitHub Actions in conjunction with your data setup, specifically with your CI workflows. GitHub Actions can be helpful for teams looking to lessen their dependency on dbt Cloud (and to save money). Although there is a little bit of setup, the only prerequisites are knowing your way around a YAML file and having a basic knowledge of Python.

Today, I’m going to follow up on that post and talk about enhancing the previous setup by 1) refactoring the previous CI workflow and 2) adding a workflow that is triggered after a pull request is merged.

Refactoring our CI workflow

The key changes I made to the previous code include:

  • Installed the package act through the Codespaces devcontainer.json configuration file. This allowed me to run and test my GitHub Actions workflows locally without having to push a bunch of commits to my repo (there’s a sketch of this setup right after the workflow file below).
  • Set up a virtual environment (venv) and added a requirements.txt file (a sample is sketched below). Previously, I had a step that just installed dbt. I opted to install dbt along with other packages using requirements.txt since this is common practice in engineering organizations.
  • Removed the profiles.yml from the repo and configured the CI dbt profile directly within the GitHub Actions workflow file. In the previous iteration, we checked the profiles.yml file into the repository, but your organization may prefer not to do that. This method also allowed me to customize the name of the schema (aka “dataset” in BigQuery), which I describe in more detail next.
  • Named the BigQuery schema/dataset using a prefix and the pull request number. You’ll notice this line in the dbt profile configuration: PR_SCHEMA="${{ env.SCHEMA_PREFIX }}_${{ github.event.number }}" . In this case, I’m using the prefix dbt_oriana_ga_pr before each PR number. You may want to use a prefix like dbt_<your name/username>_<your project name>_pr. When GitHub Actions kicks off this workflow, it’ll append the number associated with the pull request to the schema prefix. This is helpful in differentiating multiple pull requests in the same codebase!
Here’s how my BigQuery data warehouse looks after opening a few pull requests.
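Since dbt is now installed from a requirements.txt file (second bullet above), here’s a minimal sketch of what that file might contain. The packages and version pins are assumptions; use whichever dbt adapter and versions your project actually needs.

# requirements.txt (illustrative pins)
dbt-core~=1.7
dbt-bigquery~=1.7

Locally, you’d typically pair this with a virtual environment:

# Create and activate a virtual environment, then install the pinned packages
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt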

Here is the newly updated dbt-ci.yml file. To use this, you’ll also have to add a requirements.txt file and create a new GitHub secret containing your BigQuery project ID. To see everything that’s changed, here is the pull request.
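If you’d rather create those secrets from the command line than through the repository settings UI, the GitHub CLI can do it. This is just a sketch, assuming gh is installed and authenticated; the secret names match the ones referenced in the workflow, and the values/paths are placeholders.

# Create the repository secrets the workflow reads
gh secret set BIGQUERY_PROJECT_ID --body "your-gcp-project-id"
gh secret set BIGQUERY_SERVICE_ACCOUNT_KEY < path/to/service-account-key.json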

name: Run dbt test suite on PRs

on:
  pull_request:
    types:
      - opened
      - reopened
      - synchronize
      - ready_for_review
  push:
    branches-ignore:
      - main # Run on pushes to any branch except main

jobs:
  build:
    runs-on: ubuntu-latest
    env:
      DBT_PROFILES_DIR: ./coffee_shop # Specify the location of the profiles.yml file
      DBT_PROJECT_DIR: ./coffee_shop # Specify the location of the project directory
      SCHEMA_PREFIX: dbt_oriana_ga_pr # Specify the prefix for the PR schema
      PROJECT_ID: ${{ secrets.BIGQUERY_PROJECT_ID }} # Update with the name of the secret containing the project ID
      SERVICE_ACCOUNT_KEY: ${{ secrets.BIGQUERY_SERVICE_ACCOUNT_KEY }} # Update with the name of the secret containing the service account key
    steps:
      - name: Checkout PR code
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }} # Checkout the specific commit of the pull request

      - name: Create GCP keyfile JSON
        run: | # Creates the keyfile and moves it to the profiles directory
          echo "$SERVICE_ACCOUNT_KEY" > gcp-credentials.json
          mv gcp-credentials.json $DBT_PROFILES_DIR

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.11 # Choose the Python version you need

      - name: Install requirements
        run: |
          pip install -r requirements.txt

      - name: Configure dbt Profile
        run: |
          PR_SCHEMA="${{ env.SCHEMA_PREFIX }}_${{ github.event.number }}"
          echo "
          coffee_shop:
            target: ci
            outputs:
              ci:
                type: bigquery
                method: service-account
                keyfile: $DBT_PROFILES_DIR/gcp-credentials.json
                project: $PROJECT_ID
                dataset: $PR_SCHEMA
                threads: 4
                timeout_seconds: 300
          " > $DBT_PROFILES_DIR/profiles.yml

      - name: Run dbt build
        run: |
          dbt build --profiles-dir "$DBT_PROFILES_DIR" --target ci --project-dir "$DBT_PROJECT_DIR"
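Because act came up in the first bullet above, here’s a rough sketch of how the local-testing setup might look. Treat it as an assumption rather than a copy of my actual devcontainer: the docker-in-docker feature and the install script come from the devcontainers and nektos/act docs respectively, and act needs Docker available inside the Codespace to run workflow containers.

// .devcontainer/devcontainer.json (excerpt; illustrative only)
{
  "features": {
    // act runs each job in a Docker container, so Docker needs to work inside the Codespace
    "ghcr.io/devcontainers/features/docker-in-docker:2": {}
  },
  // Install act once the container is created, using the install script from the nektos/act README
  "postCreateCommand": "curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/nektos/act/master/install.sh | sudo bash"
}

With act installed, you can simulate a pull request event against the CI workflow from the repo root. The .secrets file below is a local, gitignored file holding the same secrets the workflow expects (BIGQUERY_PROJECT_ID and BIGQUERY_SERVICE_ACCOUNT_KEY).

# Run the dbt-ci.yml workflow locally as if a pull_request event had fired
act pull_request -W .github/workflows/dbt-ci.yml --secret-file .secrets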

Adding a new workflow when merging PRs

When a pull request is merged, you’ll probably want to trigger a set of actions, including:

  • running your new changes in production
  • updating your dbt docs (which you can host on GitHub Pages!)
  • cleaning up your PR schemas

You can see the dbt-merge.yml file in the repo here. Below is the workflow in full.

name: Run production models, update dbt docs, and clean up CI schemas when a PR is merged

on:
  pull_request:
    types:
      - closed
  push:
    branches:
      - main

env:
  DBT_PROFILES_DIR: ./coffee_shop # Specify the location of the profiles.yml file
  DBT_PROJECT_DIR: ./coffee_shop # Specify the location of the project directory
  SCHEMA_PREFIX: dbt_oriana_ga_pr # Specify the prefix for the PR schema
  PROJECT_ID: ${{ secrets.BIGQUERY_PROJECT_ID }} # Update with the name of the secret containing the project ID
  SERVICE_ACCOUNT_KEY: ${{ secrets.BIGQUERY_SERVICE_ACCOUNT_KEY }} # Update with the name of the secret containing the service account key

jobs:
  run-production-models:
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    steps:
      - name: Checkout PR code
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }} # Checkout the specific commit of the pull request

      - name: Create GCP keyfile JSON
        run: |
          echo "$SERVICE_ACCOUNT_KEY" > gcp-credentials.json
          mv gcp-credentials.json $DBT_PROFILES_DIR

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.11 # Choose the Python version you need

      - name: Install requirements
        run: |
          pip install -r requirements.txt

      - name: Configure dbt Profile
        run: |
          echo "
          coffee_shop:
            target: prod
            outputs:
              prod:
                type: bigquery
                method: service-account
                keyfile: $DBT_PROFILES_DIR/gcp-credentials.json
                project: $PROJECT_ID
                dataset: dbt_oriana_ga_prod
                threads: 4
                timeout_seconds: 300
          " > $DBT_PROFILES_DIR/profiles.yml

      - name: Run dbt
        run: |
          dbt build --profiles-dir "$DBT_PROFILES_DIR" --target prod --project-dir "$DBT_PROJECT_DIR"

      - name: Generate dbt docs
        run: dbt docs generate

      - name: Deploy dbt docs to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ${{ env.DBT_PROJECT_DIR }}/target

      - name: Install Google Cloud SDK
        run: |
          echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg && sudo apt-get update -y && sudo apt-get install google-cloud-sdk -y

      - name: Authenticate to GCP
        uses: google-github-actions/auth@v0.4.0
        with:
          credentials_json: '${{ secrets.BIGQUERY_SERVICE_ACCOUNT_KEY }}'

      - name: Remove PR Schema
        run: |
          PR_SCHEMA="${{ env.SCHEMA_PREFIX }}_${{ github.event.number }}"
          echo $PR_SCHEMA
          gcloud config set project $PROJECT_ID
          gcloud alpha bq datasets delete $PR_SCHEMA --quiet --remove-tables

To run new changes and/or any models in production, most of the code is very similar to the dbt-ci.yml workflow. The only thing that changes is the dataset field in the “Configure dbt Profile” step, which you can update to the name of your production dataset. In my case, I’m using dbt_oriana_ga_prod. Depending on the setup of your staging and production models, you may also need to configure a different BigQuery project (or $PROJECT_ID) for this workflow.
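For example, if production lives in its own GCP project, one option (an assumption on my part, not something from the repo) is to add a second secret, say BIGQUERY_PROD_PROJECT_ID, and point the merge workflow’s env block at it. The profile’s project field already reads from $PROJECT_ID, so nothing else needs to change.

# dbt-merge.yml (excerpt): use a separate, hypothetical secret for the production project
env:
  PROJECT_ID: ${{ secrets.BIGQUERY_PROD_PROJECT_ID }}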

Publishing dbt docs

GitHub Pages offers a free way to host static pages, which we can use to host our dbt documentation. Additionally, you can set up a GitHub Action to automatically push new documentation when you merge your code. We do this with the steps “Generate dbt docs” and “Deploy dbt docs to GitHub Pages.”

After this code is merged, there is an additional step to actually publish the dbt documentation. Navigate to Settings > Pages for your repository. Under the section “Build and deployment”, select “Deploy from a branch” for “Source”. Under “Branch,” look for a branch called “gh-pages”.

Additional configurations to set for publishing GitHub Pages.

The URL of your published docs will look something like https://<username or GitHub organization>.github.io/<repo name>. In my case, it’s https://oreoana.github.io/aec-github-actions/.

Note: Private GitHub Pages are currently only available for Enterprise GitHub accounts.

Tidying up your data warehouse

As you can imagine, if your team is creating pull requests on a regular basis and you’ve implemented a similar CI workflow, your data warehouse might start to feel a little cluttered with all these PR schemas. The last three steps in the dbt-merge.yml workflow allow us to install the Google Cloud SDK, authenticate using your service account key, and delete the schema associated with your pull request. Since we put in place a naming convention for our PR schemas, we are able to identify the correct schemas to remove!

Note: In the command to delete BigQuery datasets, gcloud alpha bq datasets delete $PR_SCHEMA --quiet --remove-tables, we use the --quiet flag to skip the confirmation prompt and the --remove-tables flag to also remove the underlying tables in the dataset. Additionally, at the time of writing, this command is in alpha and may change without notice.
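If you’d prefer to avoid an alpha command altogether, a roughly equivalent alternative (my suggestion, not what the repo uses) is the stable bq CLI that ships with the Cloud SDK:

# Delete the PR dataset: -r removes its tables, -f skips the prompt, -d says the target is a dataset
bq rm -r -f -d "${PROJECT_ID}:${PR_SCHEMA}"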

Confirmation of the deletion of the PR schema/dataset in a successful run of the GitHub Action.

Conclusion

Some of our readers have expressed interest in seeing how we could expand the GitHub Actions workflow to use the manifest.json file to be able to keep track of state and run only modified models. We hope to write about this soon!
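In the meantime, here’s a rough sketch of the core idea: dbt’s state selection method compares your project against a saved manifest.json (for example, one downloaded from the last production run’s artifacts) and selects only what changed.

# Build only modified models (and their downstream dependents) relative to the saved manifest
dbt build --select state:modified+ --state ./prod-artifacts --profiles-dir "$DBT_PROFILES_DIR" --target ci --project-dir "$DBT_PROJECT_DIR"

The ./prod-artifacts path is a placeholder for wherever you stash the previous manifest.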

Let us know if you and your team start to implement GitHub Actions into your development workflows. As always, we love to hear from you, so drop us a line!