This is another post in our series written by alums of Analytics Engineers Club. Read on to see what Oluwatomisin Soetan has to say about the course and its applications at work. If you want to be featured in the next one, let us know!
TL;DR
My data team had a data modeling process that involved querying the database for each data request, often re-writing the same queries, until our data engineer saved us hours by storing repeated queries as tables in the database. Turns out, he was using dbt behind the scenes. Before leaving, he introduced me to dbt–the magic he had been performing. During a random Google search, we found the AEC course. I enrolled in December, and by January, I was confidently working with dbt, CLI, Git, Python, and even BigQuery (to some extent). Now, I am reorganizing my company’s data models, and employing software engineering best practices to ensure optimal data modeling. Want the full story? Read on, dear reader.
Initial State: Repetitive SQL Queries
In January 2023, when I joined my data team, our data access methodology involved receiving a data request from the end user, querying the database to retrieve the requested data, downloading the data to an Excel sheet, and then sharing the sheet with the business user. If any requirements were misunderstood, the data analyst would adjust the SQL query and repeat the process until the end user was satisfied with the result. This was the system I met and adopted as a new data analyst at the company. There were three major problems with this very frustrating process
1. Long and complicated queries, dependency on on The Developers’ Knowledge of Table Relationships
Data generated by our application are broken into object, event, and validation data tables. This meant that if I wanted the list of farmers financed for a particular loan program in a particular region, I had to join the loans events table to the loan program object table, the farmer object table, and the region validation table. With a detailed data architecture, this process would be seamless for the data analyst or analytics engineer. However, without this architecture, the data analyst would have to rely on the developers for guidance. Below is an example of a simple query I wrote starting out, which really stressed me out as a SQL newbie at the time. The query was to retrieve a list of loans in 2021, for the LoanProgram2021, and the countries they originated from.
SELECT
loan.loan_id,
loan.created_at,
loan.loan_amount,
loan.repayment_amount,
loan.repaid_amount,
bundle.name loan_bundle,
program.name loan_program,
program.code loan_program_code,
wh.name warehouse,
co.name country,
(CASE
WHEN loan.repayment_amount > loan.repaid_amount THEN (loan.repayment_amount - loan.repaid_amount)
WHEN (loan.repayment_amount < loan.repaid_amount) THEN (loan.repaid_amount - loan.repayment_amount)
ELSE 0
END) outstanding_amount,
(CASE
WHEN (loan.repayment_amount > loan.repaid_amount) THEN 'Is owing'
WHEN (loan.repayment_amount < loan.repaid_amount) THEN 'Overage' ELSE 'Not owing'
END) loan_status,
loan.is_approved,
loan.is_approval_completed,
loan.is_rejected,
loan.is_reverted
FROM main.loan loan
LEFT JOIN main.project project ON loan.project_id = project.id
LEFT JOIN main.loan_bundle bundle ON loan.loan_bundle_id = bundle.id
LEFT JOIN main.warehouse wh ON loan.warehouse_id = wh.id
LEFT JOIN main.local_govt lg ON wh.lga_id = lg.id
LEFT JOIN main.state st ON lg.state_id = st.id
LEFT JOIN main.country co ON st.country_id = co.id
WHERE DATE_PART('year', loan.created_at) = '2021' AND project.name = 'LoanProject2021'
Code language: PHP (php)
Every time loan information was required, the seven tables had to be joined to retrieve the required information. This meant endless trips to the backend developers to understand the relationships between tables; and the next time the request came, I might have forgotten the lesson on this relationship, and would have to ask again and again. That was problem number 1.
2. Prone to errors
Most of our requests were time-sensitive, so missing basic considerations became almost a norm. After the data or resulting information had been processed, we would often receive the infamous response, “This figure cannot be correct”. In the example above, I had missed a critical part of all loan data analysis: filtering out unapproved loans. Adjusting for this, the query then becomes:
SELECT
loan.loan_id,
loan.created_at,
loan.loan_amount,
loan.repayment_amount,
loan.repaid_amount,
bundle.name loan_bundle,
program.name loan_program,
program.code loan_program_code,
wh.name warehouse,
co.name country,
CASE
WHEN loan.repayment_amount > loan.repaid_amount THEN (loan.repayment_amount - loan.repaid_amount)
WHEN loan.repayment_amount < loan.repaid_amount THEN (loan.repaid_amount - loan.repayment_amount)
ELSE 0
END as outstanding_amount,
CASE
WHEN loan.repayment_amount > loan.repaid_amount THEN 'Is owing'
WHEN loan.repayment_amount < loan.repaid_amount THEN 'Overage' ELSE 'Not owing'
END as loan_status,
loan.is_approved,
loan.is_approval_completed,
loan.is_rejected,
loan.is_reverted
FROM main.loan loan
LEFT JOIN main.project project ON loan.project_id = project.id
LEFT JOIN main.loan_bundle bundle ON loan.loan_bundle_id = bundle.id
LEFT JOIN main.warehouse wh ON loan.warehouse_id = wh.id
LEFT JOIN main.local_govt lg ON wh.lga_id = lg.id
LEFT JOIN main.state st ON lg.state_id = st.id
LEFT JOIN main.country co ON st.country_id = co.id
WHERE DATE_PART('year', loan.created_at) = '2021' AND project.name = 'LoanProject2021'
Code language: PHP (php)
This happened frequently, usually before the data reached the end user. However, it meant the query had to be run again and the data re-exported, costing me more precious time.
3. Repetitiveness
The worst part was how many of these requests, which occurred almost daily, had foundational similarities, but I found myself recreating the queries over and over again. To address this, I decided to save my queries in a folder titled “SQL Queries” within an “Ad-Hoc Requests” folder which also contained the resulting CSV files from the queries, with the same filename. This way, when new requests came in, I would review my queries and CSVs to identify the best fit for my use case, then adjust it accordingly.
This approach saved me some time and helped ensure I did not forget the necessary filters most times. However, my queries were still much too long and complicated, often overwhelming, and I found myself reviewing them repeatedly. We couldn’t continue like this.
First Dive Into dbt
We had a data engineer, let’s call him Q, who played a pivotal role in organizing our work as a team. When he noticed how complicated and repetitive our queries were, he offered to convert them into living tables in the database. As a result, my complex query from the example above became:
SELECT loan_id, created_at, loan_amount, repayment_amount, repaid_amount, outstanding_amount, loan_bundle, loan_project, loan_project_code, warehouse, country, loan_status
FROM farmer.fact_loan
WHERE extract (year from created_at) = '2021' AND loan_project = 'LoanProject2021'
Code language: JavaScript (javascript)
I had no idea how he achieved this; all I knew was that my very complicated query became much simpler with the introduction of fact and dimension tables. All I had to do was curate the query and share it with Q, who converted them into living tables that I could query directly. Life was good, and everything was well until Q had to leave—this led to two problems:
1. Who would manage the current data models? How were they even created? Was there some script running behind the scenes?
2. What if a new business case arose and we needed new repetitive data queries? How would I convert these queries into simple tables that I could easily query? Would I be stuck with a long array of CTEs?
Q had a solution, and it was the same tool he had been using all along: dbt! At that time, I had no idea what dbt was. The tool was unfamiliar to the rest of us, and the fear of misconfiguring the system—leading to manual data retrieval and long explanations to business users about PowerBI report updates—meant we would avoid using it without sufficient training which had to happen ASAP.
Ideally, we would leave everything untouched, but realistically, it was only a matter of time before something broke along the pipeline. As Q’s understudy, I needed to understand the process fast. With limited time for training, we looked for an online course. Q first recommended a YouTube channel, but the content was too technical and full of data engineering jargon. Our next option was the Analytics Engineering Club’s Introductory course found through a quick Google search and a LinkedIn comments to my inquiry post.
Saved by Analytics Engineering Club
The dbt projects we had were quite siloed, although version-controlled with Git. The most advanced structure we had for the project was:
farmers
├── README.md
├── analyses
├── seeds
├── dbt_project.yml
├── macros
├── models
│ ├── docs
│ ├── marts
│ │ ├── dim_farmers.sql
│ │ ├── fact_finance.sql
│ │ └── fact_transactions.sql
├── schema.yml
├── packages.yml
├── snapshots
└── tests
Code language: CSS (css)
The course was well-structured, providing step-by-step knowledge in the order of realistic requirements. Learning the often-intimidating CLI early on was crucial for creating and managing dbt projects, editing profiles.yml on the server, and setting permissions for running processes.
Following the course guidance, I reorganized our models directory into three parts: staging, intermediate, and mart. This structure proved highly beneficial for the commodities indices modeling project I was working on. The new models directory looks like this:
- Staging: Cleaned source data.
- Intermediate: Handled joins, feature engineering, and error corrections.
- Mart: Combined intermediate models and performed final calculations.
We separated the source.yml file from the schema.yml file for proper documentation and testing, with our schema.yml file inheriting descriptions from the source.yml file. We also adopted dbt packages for testing and quicker modeling. Our new project structure became:
farmers
├── README.md
├── analyses
├── seeds
├── dbt_project.yml
├── macros
├── models
│ ├── docs
│ ├── intermediate
│ │ ├── int_farmers.sql
│ │ ├── int_location.sql
│ │ └── int_loan.sql
│ ├── marts
│ │ ├── schema.yml
│ │ ├── dim_farmers.sql
│ │ ├── fact_finance.sql
│ │ └── fact_transactions.sql
│ ├── staging
│ │ ├── stg_farmer.sql
│ │ ├── stg_ward.sql
│ │ ├── stg_crop_type.sql
│ │ ├── stg_transactions.sql
│ │ ├── stg_finance.sql
│ │ ├── stg_warehouse.sql
│ │ ├── stg_lga.sql
│ │ ├── stg_state.sql
│ │ ├── stg_country.sql
│ │ └── customers.sql
│ └── sources.yml
├── packages.yml
├── snapshots
└── tests
Code language: CSS (css)
Another advantage of a solid grasp of dbt is using ephemeral materialization instead of views, as I preferred CTEs over having intermediate or staging models sit in the database as views. We are implementing custom SQL-based tests as part of our dbt project restructuring, which includes proper documentation and basic test definitions for data quality checks. Currently, two of us contribute to the dbt projects, using Git for collaboration and version control. We follow consistent naming conventions for tables and columns and use dbt’s relationship graph to understand table relationships. Integrating dbt with OpenMetadata also helps our growing team quickly visualize modeled tables relationships, data quality results and data documentations.
Future Steps
Largely, I am pleased with our dbt project and workflow. We start by identifying a business case, then collaborate with the user and backend developer to manipulate the source data as needed. The models are built, and the resulting data is used to create PowerBI reports or dashboards. We then fine-tune the model as new considerations arise.
Looking ahead, there are three more areas I would like to incorporate into our process:
Adopting a Deployment Pipeline for CI/CD
Embracing the analytics engineering mindset means adopting software best practices is essential. Implementing a Git-based deployment pipeline will automate updates to the server, eliminating manual pull processes and direct editing of live projects.
Building Incremental Models
Although full loads are currently feasible for our event tables, incremental updates would be more efficient. This is covered in the AEC course, but due to potential increased run times, it has been deprioritized for now. Did you know that Airbyte loads are based on dbt?
Snapshotting Data
Snapshotting will likely be implemented before incremental refreshes as a valuable quick win. It is particularly useful for fraud detection by tracking changes in client details. We plan to apply snapshotting to our object and possibly validation tables.
Conclusion
To summarize, my team transitioned from manually writing repetitive queries to using dbt, which has streamlined our process and improved efficiency. With dbt, we have moved from basic models to a well-structured framework that follows analytics engineering best practices, thanks to the AEC course. Our upgraded data modeling process now provides more accurate business insights, saves time, and minimizes errors. Next, we plan to adopt CI/CD processes, snapshotting, and incremental updates.
That’s a wrap on our dbt evolution. How has dbt evolved in your organization? Have you adopted it yet?
For more about the AEC program, feel free to reach out. I’m happy to share my insights. See you in the next one.