In the data world, we love to argue about definitions. What is unstructured data? Is JSON structured or semi-structured? Are PDFs unstructured, or do they contain “implicit structure”? Do LLM embeddings of text count as structured data? And WTF is a semantic layer? Entire threads, articles, and even conference talks spin out of these debates.
This obsession forms what I call the Pedantic Layer, the fog of war, where we endlessly quibble over minute variations of wording and meaning, in ways that float above the actual work of using data. It’s where we burn cycles nitpicking whether “unstructured” is even a valid category, while the real world is busy proving this bickering doesn’t matter.
The other day on LinkedIn, my friend Malcolm Hawker wrote a post calling out the bajillionth argument over the definition of unstructured data. I replied that these sorts of arguments are an example of why the data industry has held itself back for decades. The data industry often feels like a retirement home, where residents keep arguing about the same old things, year after year. [1]
The thing is, businesses and people have been making money from “unstructured” data for ages. Customer support teams mine call transcripts to improve service. Industrial companies use robotics and computer vision. Giant search engines provide image and video search to billions of users. LLMs themselves are the most obvious proof point, trained on an Internet-worth of text, photos, and more.
The Pedantic Layer slows us down by focusing attention on taxonomy and semantics over outcomes. It makes practitioners feel like they need to pass some purity test of definitions before they can get value. It feeds a culture where thought leaders posture about whether CSVs are structured “enough,” while the practitioners who ignore the debate are quietly building models that drive millions in revenue.
This isn’t to say definitions don’t matter. At the right level, they certainly do. Clear definitions are useful when designing schemas, interoperability standards, or regulatory frameworks. But outside those contexts, pedantry becomes senseless gatekeeping that gets in the way of delivering results.
We should measure success by the value created from data, not whether we’ve labeled it correctly. Instead of fixating on whether unstructured data is “really” unstructured, ask:
Can I extract signal from it?
Can I connect it to other forms of data?
Does it improve decision-making, automation, or customer experience?
If the answer to these questions is yes, the data likely has value for some purpose. Call the data whatever you want.
The irony is that the most innovative work in data today comes from collapsing categories, not defending them. This is especially pertinent as AI forces data practices and the various forms of data to converge, producing new systems and practices that treat text, images, logs, and tables as peers. We’re embedding unstructured data into vector spaces where it becomes queryable alongside structured records. This is the convergence I describe with Mixed Model Arts. We’re entering a world where all data is just data. The sooner we move past the Pedantic Layer, the sooner we can focus on what matters.
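To make the point about querying embedded text alongside structured records concrete, here is a minimal sketch. It rests on loud assumptions: the `tickets` records, the `plan` field, and the `embed` and `search` helpers are all hypothetical, and the word-count "embedding" is a toy stand-in for whatever real embedding model and vector store you would actually use.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding (assumption): word counts stand in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical records: a structured field ("plan") next to free text ("text").
tickets = [
    {"id": 1, "plan": "enterprise", "text": "invoice total is wrong after upgrade"},
    {"id": 2, "plan": "free", "text": "cannot reset my password"},
    {"id": 3, "plan": "enterprise", "text": "password reset email never arrives"},
]

def search(query, plan, records):
    """Filter on the structured field, then rank the survivors by text similarity."""
    q = embed(query)
    hits = [(cosine(q, embed(r["text"])), r["id"])
            for r in records if r["plan"] == plan]
    # Best match first; drop records with no overlap with the query at all.
    return [rid for score, rid in sorted(hits, reverse=True) if score > 0]

print(search("password reset", "enterprise", tickets))
```

Nothing in the query cares whether the ticket text is labeled "unstructured." The filter and the similarity ranking run over the same records, which is the whole point.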
So next time you see someone arguing on LinkedIn about what unstructured data is, remember that while they’re debating, someone else is shipping.
[1] Sidenote: I used to work in a retirement home. Definitely the most rewarding job I’ve ever had.
Word.
Taking a bit from sociolinguistics: it irks me quite a bit how technologists will sometimes use jargon as a form of linguistic discrimination. It's a bothersome feature of the debates on definitions, and it alienates others (especially business-side stakeholders) from the discourse.
I find that those who are proficient in the tech sociolect are included in the conversation, while those who are not are excluded.
It would behoove us all to get on with it and actually do something with technology to help people instead of just talking about it. (The irony is not lost on me regarding my own comments.)
I see it a bit differently. To make my point, I'll describe several groups we all likely belong to at different times.
First are the innovators, who encounter a problem and design a practical solution. Then come the academics, who study that solution, write about it, and establish formal definitions.
Next, entrepreneurs monetize the solution, often expanding the definitions with marketing language. They partner with consultants, who then teach these frameworks to leaders. Leaders, driven by a fear of falling behind, urge their teams—the practitioners—to apply these packaged solutions to new problems.
This is precisely where clear definitions become critical. Practitioners need a shared, precise vocabulary to verify that a solution truly fits the specific use case.
Personally, I love to innovate. Give me a problem and I want to shut the door and solve it. But the challenges perceived as the biggest will always invite collaboration, and this is where common definitions are imperative.