By Pete Stiglich
One question I ask of data architects and data engineers that I’m interviewing is “What is the most important thing to know about your data?” It’s a bit of a trick question, it is valid. I frame the question a bit to give some context. For example, I might say “Suppose you have a spreadsheet called “CustomerSales.xlsx” with one tab and 300 columns (with column headings) and you’ve been tasked to understand how to integrate that with other data in your data warehouse. What is the most important thing you need to know about that data?”
The answer I’m looking for is “The most important thing I need to know about that is what is/are the key(s) of that data”.
If you don’t know the keys (primary AND natural/alternate / business), then you really don’t know much about the data – you’re left with making potentially catastrophic assumptions about the meaning and grain/level of detail. For example, if you assume the grain of CustomerSales.xlsx you would probably assume the grain is sales by customer – but that’s just an assumption. Not knowing what the key(s) are means that you can make some poor design decisions (which will need to get fixed eventually – of course, it’s much more expensive to fix once you’re application and data are in production….) and let poor data into your system (e.g., duplicates). If you don’t know what the key(s) are – how can you know whether you have a unique record or not?
Even if you know, for example, that the key of CustomerSales.xlsx is SaleId (which for the sake of argument we’ll assume is a surrogate key) you still don’t know what the business key is. A surrogate key is a unique sequential number with no inherent business meaning in itself. All it tells you is that you have a unique number representing a record – but that same record could still be duplicated (just with a different SaleId). You still need to know the natural / business key in order to understand the meaning and grain of the data. For example, is the natural key CustomerId, SaleDate, LocationId, SaleNumber or is it CustomerId, SaleDate, LocationId, SaleNumber, and ProductId – i.e., is it a header or a line item, or are there other perhaps non-obvious columns that should be part of the natural / business key e.g., VersionStartDt (in which case it represents a version of a sale record). You should, whenever possible, enforce unique indexes on the natural /business key. Of course, for some large cloud databases or NoSQL platforms unique indexes on an alternate key might not be possible or the data volumes might require unique indexes be dropped. If that is the case, you should develop programs to at least occasionally test your data to ensure you don’t have dups.
Nearly every data set should have at least one unique key – but I’ve seen cases where having what looks like a duplicate record is actually acceptable e.g., the same customer executes the same transaction multiple times for the exact same dollar amount. This is poor design but might be outside of your control – there should at least be a unique surrogate key to differentiate the rows.

Pete Stiglich: Trusted Expert in Data Architecture & Modeling
Pete has over 30 years of data architecture, data management, and analytics experience, most of that time as a consultant in industries such as government, finance, healthcare, insurance, and more. He is an industry thought leader in data architecture and data modeling and has developed and taught many courses on these topics. Pete enjoys helping clients solve complex data problems leveraging proven approaches such as “Modeling the business before modeling the solution” which provides a benefit to clients that many IT professionals miss.
Join Our Data Community
At Data Principles, we believe in making data powerful and accessible. Get monthly insights, practical advice, and company updates delivered straight to your inbox. Subscribe and be part of the journey!
