This article was inspired by recently completing my largest data engineering project. I’ve assisted a half dozen sales teams with prospecting over the past year. I handled each project as an ad hoc, urgent request. I knew the requests would continue as they got hooked on the data.
I decided to gather all the previous lessons learned and build a cohesive, enterprise-wide solution. I locked myself away for a few weeks with Google BigQuery (GBQ) and a handful of data warehouses. As the dataset developed, I recognized a few inefficiencies in my workflow and the risk of Confirmation Bias (CB).
This article will share how I solved these inefficiencies to streamline error checking (a.k.a. QA’ing) and avoid the trap of CB. CB is a tendency to process information by looking for, or interpreting, information consistent with one’s existing opinions¹. Typically, our existing opinion is that our work is “of course” correct, and we then subconsciously skew our QA’ing to confirm that. This bias is exacerbated when QA’ing is cumbersome, so CB is easier to overcome when QA’ing is streamlined.
I’ll focus on Alteryx and GBQ since those are the tools I used. Any similar products would work just as well for this use case.
The most crucial part of any data professional’s job is to check for errors thoroughly. Though the process is tedious, producing a dataset with the fewest errors possible is paramount, but it’s easier said than done. A common approach is to evaluate a sample of the data from all possible perspectives. In-database error checking typically requires a lot of additional, complex coding. However, there is a better way!
Alteryx allows QA’ing to be repeatable, flexible, and lightweight. Let Alteryx do the hard work so that you can focus on the concept of the data rather than the tedium.
How to QA Inefficiently
I chose to code in GBQ rather than Alteryx because the interface is better and the volume of data was large. I began QA’ing the data as it started coming together. After a few iterations, I recognized inefficiencies in my process.
GBQ allows only 50 rows of data to be viewed at a time. Rather than flipping through page after page, it’s best to export the data. However, GBQ only allows roughly 10k rows to be downloaded to a local CSV, far fewer rows than my data sample size. Working around that limit means either excessive CSV downloads or an excessive number of additional tables created throughout the iterative process.
QA’ing entirely within GBQ requires far too much scripting beyond very high-level summaries.
Alteryx: Your R2D2 of QA’ing
Before long, a better approach dawned on me. The full dataset was too large for Alteryx, but the data sample wasn’t.
Within a few minutes, I set up a handful of basic summaries, similar to Excel Pivot Tables. Without coding, the summaries allowed me to evaluate the data from numerous perspectives simultaneously. Instantly, I discovered erroneous records containing blanks, spaces, or wrong characters. With a single click, I realized that some data types were wrong. I easily ingested older data into Alteryx so that it could discover variances between the two datasets. Plus, Alteryx’s Browse tool automatically finds data quality issues, provides summary stats, and visualizes the data. As errors came to light, I could fix the main code in GBQ before continuing the data engineering project. At the next waypoint, I could effortlessly rerun the QA process in Alteryx.
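For readers without Alteryx, the same battery of checks can be sketched in a few lines of code. Below is a minimal Python/pandas sketch, using a hypothetical sample with illustrative column names (not my actual dataset): flag blanks and whitespace, catch wrong data types, build a pivot-style summary, and diff record counts against an older extract.

```python
import pandas as pd

# Hypothetical sample exported from the warehouse; values and columns are illustrative.
current = pd.DataFrame({
    "company": ["Acme", "Acme", "Globex", " ", "Initech"],
    "region":  ["East", "East", "West", "West", None],
    "revenue": ["100", "250", "75", "310", "90"],  # loaded as strings: wrong dtype
})

# 1. Flag nulls, blanks, and whitespace-only strings in every column.
flags = pd.DataFrame({
    col: current[col].isna() | current[col].astype(str).str.strip().eq("")
    for col in current.columns
})
bad_rows = current[flags.any(axis=1)]

# 2. Check data types: columns expected to be numeric but stored as strings.
expected_numeric = ["revenue"]
wrong_types = [col for col in expected_numeric if current[col].dtype == object]

# 3. Pivot-style summary: record counts per company, like an Excel Pivot Table.
summary = current.groupby("company").size().rename("records")

# 4. Diff against an older extract to surface variances between the two datasets.
previous = pd.DataFrame({"company": ["Acme", "Globex"], "records": [2, 3]})
diff = summary.reset_index().merge(
    previous, on="company", how="outer",
    suffixes=("_new", "_old"), indicator=True,
)
```

Because each check is only a few lines, rerunning the whole battery after every fix to the main GBQ code costs almost nothing, which is exactly the property that makes streamlined QA’ing resistant to CB.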
So why was Alteryx the perfect solution?
- Repeatable — no additional effort is needed after the initial build
- Flexible — quickly adapt to data or requirement changes
- Lightweight — no requirement to download or save anything
- Transparent — the complete data can be seen in one place
- Documentation — numerous solutions for taking notes, whether freeform, images, or annotation
- Cache — the data can be cached in memory to save time and cost
- R2D2 — native functionality that automatically discovers data quality issues, identifies duplicate records, tracks record count throughout the workflow, and effortlessly sorts, pivots, and filters data
CB is a slippery slope for data professionals, whether engineers, analysts, or data scientists. Our job demands being objective and self-critical. However, CB leads to unintentionally being less critical of our work than we should be, or than we would be of another’s. Consciously or subconsciously, we go easy on ourselves and assume that our work product must be correct.
CB is exacerbated when checking for errors is laborious. The more cumbersome the QA process is, the less likely one will follow through wholeheartedly.
Without Alteryx, I might have checked only a few columns, cross-referenced only a few companies, and examined only one company’s records. With Alteryx, I could check every column, cross-reference every company, and dive into more companies in a fraction of the time and effort.
Extrapolating significance from so few observations to prove one’s preconceived notions is perilous. CB creates an inaccurate perception that the benefit of QA’ing is minimal since the work is “probably correct” anyway. Meanwhile, the cost of QA’ing rises with the level of effort required. Before long, the rising cost outweighs the misperceived low benefit, and one takes the easy way out by doing a half-hearted job.
Alteryx or similar tools are instrumental in more thoroughly and efficiently QA’ing data. It’s human nature to believe that our work is correct and skew our QA’ing to confirm that bias. The easier QA’ing is, the more likely it is done and done well, avoiding CB.
Next time you need to QA your work, overcome Confirmation Bias by using tools rather than willpower.
¹ Confirmation Bias | https://www.britannica.com/science/confirmation-bias