Column Names as Contracts: From dplyr to dbt
Emily Riederer’s presentation, “Column Names as Contracts - Inspired by dplyr, Available in dbt,” delves into a topic that significantly impacts data usability: how column names can embody contracts that govern data interpretation and manipulation.
Riederer emphasizes the common pitfalls of poor column naming conventions: ambiguous abbreviations, inconsistent casing, and the lack of intrinsic meaning that can impede analysis. For anyone who has worked with messy data, this is a familiar frustration.
The Importance of Column Names
First, she notes how the dplyr package in R reshaped her approach to column naming. In dplyr, select helpers like starts_with(), ends_with(), and contains() allow users to dynamically select and manipulate columns based on their names. This is not merely a convenience; it reinforces the idea that every variable name can and should provide meaningful context.
For instance, if Riederer prefixes indicator variables with ‘ind_’, she can easily extract all relevant data points for an analysis. And because each indicator holds a 0 or 1, she doesn’t just retrieve them; taking their means gives her proportions seamlessly. The process flows logically from well-defined column names, showing how semantic clarity improves workflow efficiency.
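As a minimal SQL sketch of that calculation (the ind_ columns and the customer_summary table here are hypothetical), averaging a 0/1 indicator yields the share of rows meeting the condition:

-- Hypothetical indicator columns; multiplying by 1.0 avoids integer division
-- on engines where AVG of an integer column returns an integer.
SELECT
  AVG(1.0 * ind_login_not_null) AS prop_login_not_null,
  AVG(1.0 * ind_purchase)       AS prop_purchase
FROM customer_summary;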
Communicating Meaning through Naming
As her responsibilities expanded into data production, Riederer recognized column names as the “user interface” of datasets. This idea shifts focus from the values themselves to how those values are presented and understood by users. Developers often overlook the fact that end users are interacting primarily with column names, not with the underlying data. Therefore, naming conventions must encapsulate data semantics and intent.
In practice, she proposes a “two-step approach”:
- Define Simple Stubs: Create foundational labels that encapsulate the semantics and boundaries of the data.
- Permute Stubs for Variability: Construct column names by mixing and matching these stubs to produce descriptive, user-friendly identifiers.
For instance, consider a structure where ‘ind’ signifies an indicator variable. Attaching a condition such as “not null” then yields a variable like ‘ind_login_not_null’. This approach ensures that all parties interacting with the data agree on what “login” means, minimizing ambiguity and aligning expectations.
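As a hedged sketch of how such a column might be derived (the source table and fields are hypothetical), the name states exactly what the CASE expression computes:

-- Derive the indicator: 1 when the login timestamp is present, 0 otherwise.
SELECT
  customer_id,
  CASE WHEN last_login IS NOT NULL THEN 1 ELSE 0 END AS ind_login_not_null
FROM raw_logins;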
Using dbt for Enhanced Column Naming
Turning to dbt, Riederer shares how she adapted these principles from dplyr into the dbt ecosystem. dbt allows SQL-based transformations while incorporating a level of structure similar to that seen in R, including control flow, modularization, and improved readability.
By porting dplyr’s selection helpers into her dbtplyr package, Riederer gives users the syntax needed to apply those thoughtful naming conventions within SQL. As a first illustration of the intent, the set of matching columns can be pulled straight from the information schema:
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'dataset'
  AND column_name LIKE 'ind_%';
This enables analysts to handle data dynamically, applying transformations across all relevant columns based on meaningful patterns.
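Inside a dbt model, that pattern-based selection can be driven by dbtplyr’s macros directly. The sketch below assumes the package’s get_column_names, starts_with, and across macros behave as documented (the {{var}} placeholder inside across is my reading of the docs and may need adjusting); the model name is hypothetical:

-- Fetch column names, keep those prefixed ind_, and average each one.
{% set cols = dbtplyr.get_column_names(ref('customer_summary')) %}
{% set ind_cols = dbtplyr.starts_with('ind_', cols) %}

select
  {{ dbtplyr.across(ind_cols, "avg(1.0 * {{var}}) as prop_{{var}}") }}
from {{ ref('customer_summary') }}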
A Case for Data Validation
One of the critical points Riederer stresses is the importance of validation in data engineering. With poorly defined column names, analysts might conduct analyses without understanding the implications of their choices. Assertions, such as checking that a date variable is actually stored in a date format, are pivotal. Misclassifications can radically alter outcomes, introducing significant analysis errors without clear indicators of problems in the dataset.
Riederer suggests using carefully crafted tests in dbt to validate column formats systematically. This means that rather than manually writing checks for each new column, a rule-based approach automatically applies to all similarly named variables.
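One way to express such a rule is a dbt singular test that loops over a model’s columns and asserts every ind_ column is strictly binary. This is a hedged sketch, not Riederer’s exact test: the model name is hypothetical, and it relies on dbt’s standard adapter.get_columns_in_relation context method:

-- tests/assert_ind_columns_are_binary.sql
-- The test fails if any row is returned, i.e. if any ind_ column
-- holds a value other than 0 or 1 (NULLs pass; tighten if needed).
{% set cols = adapter.get_columns_in_relation(ref('customer_summary')) %}

select *
from {{ ref('customer_summary') }}
where 1 = 0
{% for col in cols if col.name.startswith('ind_') %}
   or {{ col.name }} not in (0, 1)
{% endfor %}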
Risks and Challenges
Automation can streamline workflows, but it also brings risks: casting columns without confirming their types, for example, can mask data pipeline errors. Riederer’s structured, systematic approach mitigates these risks by establishing predefined contracts that specify expected data types and patterns. This strategy reinforces reliability and reduces the likelihood of errors in downstream analyses.
For instance, she illustrates that transforming variables without prior validation can lead to garbage-in, garbage-out scenarios. If a variable that should contain integers inadvertently includes strings, the automation will propagate this error rather than address it.
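A pre-cast check makes that failure mode visible instead of silent. The sketch below assumes a warehouse that supports TRY_CAST (e.g., Snowflake or SQL Server; BigQuery spells it SAFE_CAST), and the table and column names are hypothetical:

-- Surface rows where a supposedly-integer column refuses to cast cleanly.
SELECT *
FROM raw_events
WHERE n_logins IS NOT NULL
  AND TRY_CAST(n_logins AS INTEGER) IS NULL;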
The Takeaway
Riederer’s exploration of column names as contracts emphasizes a paradigm shift in data workflow design. By instilling clarity and intent into naming conventions, analysts can significantly reduce friction in understanding, using, and interpreting data. Clarity in naming fosters transparency, facilitates communication among diverse users, and promotes data quality.
By translating dplyr’s capabilities to dbt, Riederer not only enhances individual project workflows but also contributes to the broader data community’s standards, marking a critical evolution in how we think about and engage with data.