Data education is broken

A few weeks ago I was catching up with my friend Sam — she’s currently studying biomedicine, and has a class on data analysis this semester. She knows that I work in “data”, and now that she has some experience in the area, she asked me what exactly it is that I do at work.

Despite this seemingly shared context, I struggled to explain the work that I do as an analytics engineer. As I fumbled my way through talking about SQL, analytical databases, and extract and load tools, I realized that we had almost zero shared context about what it means to do data analytics, because we have almost zero shared context of what data is.

From my friend’s perspective, data is a CSV that was created at some point in time. It’s five columns of data on iris sepal lengths, a list of Titanic passengers, or a set of results from a clinical study. In almost every case, the data is static, small-scale, and represented by one table (or CSV). The nature of these datasets influences the tools that are used in class for analysis: R, pandas, and SPSS are the common ones. You find the ✨ insights ✨, and move on to the next lesson.

It’s not just Sam’s college class that teaches this version of data analysis — what she’s learning is the same version of data analysis I learnt in university in Sydney, and it’s the same version that data boot camps and online courses are still teaching.

From my perspective (and the perspective of those that work on modern data teams), data looks totally different.

It’s constantly updating: new rows are being added, and the underlying structure of the data can change at any time. It’s stored in a database. It is our best translation of business processes to objects computers can understand. It’s not just one table (or CSV), but many. Sometimes you can use reliable IDs to understand how these datasets relate to each other; other times, you need to make a best guess based on email or IP address.
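To make the "reliable IDs versus best guess" point concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a real database, and every table, column, and email address in it is made up for illustration.

```python
import sqlite3

# In-memory database standing in for a real warehouse (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, email TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
CREATE TABLE support_tickets (ticket_id INTEGER, requester_email TEXT);

INSERT INTO customers VALUES (1, 'sam@example.com'), (2, 'alex@example.com');
INSERT INTO orders VALUES (100, 1), (101, 1), (102, 2);
-- The ticketing tool has no customer_id, only a free-text email.
INSERT INTO support_tickets VALUES (900, ' Sam@Example.com'), (901, 'unknown@example.com');
""")

# Case 1: a reliable ID exists, so the relationship is an ordinary key join.
orders_per_customer = conn.execute("""
SELECT c.customer_id, COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id
ORDER BY c.customer_id
""").fetchall()

# Case 2: no shared ID, so we make a best guess by matching on a cleaned-up email.
# Unmatched tickets come back with customer_id = NULL (None in Python).
ticket_matches = conn.execute("""
SELECT t.ticket_id, c.customer_id
FROM support_tickets t
LEFT JOIN customers c
  ON LOWER(TRIM(t.requester_email)) = LOWER(TRIM(c.email))
ORDER BY t.ticket_id
""").fetchall()

print(orders_per_customer)
print(ticket_matches)
```

The second join is only as good as the email data: it will miss people who use different addresses in different systems, which is exactly why it is a best guess rather than a key.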

This fundamental difference in what my friend Sam thinks data is, versus what I think data is, is the point of divergence between our experiences. From this divergence, everything else about what it means to work with data differs:

Storage: A CSV doesn’t make sense when working with huge datasets; instead, you need to store data in an analytical database (Snowflake, BigQuery, Redshift).
Data loading: Real datasets are constantly updating, so you need to use tools that make sure your data is always up to date (Stitch, Fivetran), something that just isn’t considered when working with a static dataset.
Data transformations: Once the data is in an analytical database, it makes far more sense to query it with SQL using tools like dbt (rather than pandas or R), and SQL is usually more performant in this case too! Since business decisions are being made on top of this data, you’re probably going to want some level of quality assurance on top of these transformations (think: testing, CI pipelines, etc.).
Analysis: With an updating dataset, it’s not enough to build one chart, or get one “insight” and call it a day. Analysts need to know how to build a good dashboard and how to do the one-off deep dives.
Other skills and tools: It’s no surprise that most data analysis courses don’t teach anything about version control, the command line, orchestration, code editors, etc.
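As a rough illustration of what "testing" a transformation can mean, here is a small sketch in Python using the built-in sqlite3 module. The idea, which is an assumption about one common approach rather than a description of any specific tool's internals, is that each test is a query that returns the rows breaking a rule, so an empty result means the test passes. The customers table and its contents are invented for the example.

```python
import sqlite3

# In-memory database with a deliberately messy, made-up table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, email TEXT);
INSERT INTO customers VALUES
  (1, 'a@example.com'),
  (2, 'b@example.com'),
  (2, 'b2@example.com'),   -- duplicate ID
  (NULL, 'c@example.com'); -- missing ID
""")

# Each test is a query that selects the offending rows.
duplicate_ids = conn.execute("""
SELECT customer_id, COUNT(*) AS n
FROM customers
WHERE customer_id IS NOT NULL
GROUP BY customer_id
HAVING COUNT(*) > 1
""").fetchall()

null_ids = conn.execute("""
SELECT email
FROM customers
WHERE customer_id IS NULL
""").fetchall()

# Report results; a CI pipeline would fail the build if any check has rows.
checks = {
    "customer_id is unique": duplicate_ids,
    "customer_id is not null": null_ids,
}
for name, rows in checks.items():
    status = "PASS" if not rows else f"FAIL ({len(rows)} failing rows)"
    print(f"{name}: {status}")
```

Running checks like these on every change is what turns "I think this query is right" into something a team can rely on.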

And that’s only scratching the surface. Loading, transforming, and analyzing data is one part of it, but harder-to-define skills like getting buy-in from your C-suite, requirements gathering, and debugging are ones you don’t get to learn until you’re on a team.

The work that modern data teams are doing looks nothing like what’s being taught.

	(College student) Expectations	vs. Reality
What does the data look like?	Static; one dataset/table/CSV to analyze; represents observations or an experiment	Constantly changing (new records and changing schema); dozens, if not hundreds, of datasets to work with; represents business operations
How is it stored?	Stored in a CSV	Stored as objects (tables/views) in a database
What languages and tools are used to analyze it?	R, pandas, SPSS	Languages: SQL; tools: git, the command line, dbt, Airflow, VS Code, etc.
What do analysts do?	Deliver “insights”	So much! Version control, testing, modeling, requirements gathering, getting buy-in

Why hasn’t data education caught up?

I don’t think it’s just a case of academia being a few years behind industry (though that’s part of it). I think there’s a harder-to-solve problem here: access to real data.

You just don’t get to see data like this until you work for a real company. A company that has orders and customers and products; that has tracking in place that records when people visit their website, and from which computer; a company whose marketing team spends real dollars on advertising. Maybe you can read a book on the topic, but that’s not the same thing as doing the work. Maybe you can try to generate a dataset to work with, but that’s running before you can walk.

Contrast this with learning to build websites. Recently, I decided I wanted to learn exactly this, so I went and built a website. I didn’t need to wait to work for a company first.

This is the Catch-22 of working in data: want to learn to be a great data professional? First, you need to already have a job that gives you access to data.

How does this impact the industry?

An industry of side-steppers

Most data professionals I know sidestepped into their roles from adjacent roles. It’s really common to hear of someone who was in one role, learned SQL, and then transitioned onto the data team at the same company when a position became available — personally, I studied civil engineering, and ended up in a jill-of-all-trades “operations” role at a startup, learned SQL on the job, and then transitioned onto the data team when the data analyst left. I know people that found their way into data from ops, customer support, and accounting!

This trend is a really interesting one: it means that many data professionals in my network are people that are driven by curiosity (which is one of my favourite traits when working with people) — it takes a certain type of person to teach themselves the skills required to wrangle their data.

But there are a lot of downsides.

Lack of supply

Since courses don’t teach a relevant version of analytics, there is a gap in supply. Often employers are left filling the gap, taking it upon themselves to train more junior employees into the role, or leaving it to the employee to figure it out.

Undertraining

There are a lot of analysts writing bad SQL out there. And I don’t blame them: if you are in a role where you’re teaching yourself SQL, how are you to know that CTEs are usually much more readable than subqueries, or that window functions exist (and are wonderful)? (Fun fact: an earlier version of this post was called “why do so many analysts write bad SQL?”)
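For anyone who hasn’t met CTEs or window functions yet, here is a small sketch of the difference. It answers one question ("what was each customer’s first order worth?") twice: once with nested subqueries, and once with CTEs plus a window function. It runs the SQL through Python’s built-in sqlite3 module and assumes a SQLite version of 3.25 or newer, which is when window functions were added. The orders table is invented for the example.

```python
import sqlite3

# Made-up orders table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, ordered_on TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2021-01-05", 20.0),
        (2, 10, "2021-01-20", 35.0),
        (3, 11, "2021-01-07", 50.0),
        (4, 11, "2021-02-11", 15.0),
        (5, 10, "2021-02-14", 40.0),
    ],
)

# Version 1: nested subqueries. It works, but you have to read it inside-out.
subquery_sql = """
SELECT customer_id, first_order_amount
FROM (
    SELECT o.customer_id, o.amount AS first_order_amount
    FROM orders o
    WHERE o.ordered_on = (
        SELECT MIN(ordered_on) FROM orders WHERE customer_id = o.customer_id
    )
)
ORDER BY customer_id
"""

# Version 2: CTEs name each step, and ROW_NUMBER() ranks orders per customer.
cte_sql = """
WITH ranked_orders AS (
    SELECT
        customer_id,
        amount,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY ordered_on
        ) AS order_rank
    FROM orders
),
first_orders AS (
    SELECT customer_id, amount AS first_order_amount
    FROM ranked_orders
    WHERE order_rank = 1
)
SELECT customer_id, first_order_amount
FROM first_orders
ORDER BY customer_id
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(cte_rows)  # [(10, 20.0), (11, 50.0)]
```

Both queries return the same rows on this data. The second reads top to bottom, and each step has a name you can query on its own when debugging.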

Barriers to entry

Finally, if the way to get a data job is to have a foot in the door already, and then transition internally to a data role, the reality is that this process is going to perpetuate existing systems of inequality. How do we make it so that more people can enter analytics, from a wider range of backgrounds?

How do we fix data education?

We start by fixing that divergence: by making the data we use for teaching analytics look as real as possible — orders, products, customers, subscriptions, page views; constantly updating, stored in an analytical warehouse. Then we create a new curriculum around it.

If you haven’t guessed it yet, we (Michael Kaminsky and I) are building this training course.

Our course will turn analysts into analytics engineers. We can’t fix the whole data education landscape at once, so we’re starting with the persona we most identify with. If you’re an analyst that knows how to write the SQL to answer a question like “what’s my monthly revenue”, but feels intimidated by the command line or git, then this is the course for you.

In addition to using data that feels real, this course will also:

Be designed and taught by people who have actually been a part of, and led, data teams (us!)
Use the same tools that the best data teams are using (BigQuery, Fivetran, dbt, git, VS Code)
Create cohorts of students so they can learn from each other as they progress through the course
Have experts join us for round-table discussions on some of the fuzzier challenges of being on a data team, for example, the venerable question of “who should a data team report to?”

This is just the first part of a bigger project — we’re starting with a training course that is focussed on one area of data, but we won’t finish until we’ve changed an entire industry.

Want to get involved?

Whether you are:

an analyst who knows how to write SQL, but wants to figure out all the other stuff (version control, code editors, testing, orchestration, and more!), and is interested in joining us;
a data lead who has someone on their team to upskill quickly, and wants to send that team member our way;
or a data professional interested in keeping an eye on this project,

the best way to stay in touch is to join our Analytics Engineers’ Club mailing list below. We can’t wait to share more with you soon!