Joe's Nerdy Rants #9
Data Modeling - What If We Just Burn It All Down?, plus weekend reads and other stuff
Data Modeling - What If We Just Burn It All Down?
Imagine two extremes. On one end, data modeling is done perfectly and harmoniously across the data lifecycle. On the other end, data modeling is ignored and thrown into the dustbin of history. Along this spectrum, where do you think we are as a data industry?
As I’ve been thinking about the state of data modeling for the last several years and where we’re going, I definitely think we’re on the latter end of the spectrum. Universally, when I talk with anyone who handles data (developers, “data people,” etc.), data modeling is forgotten, ignored, and sometimes scoffed at as being “too difficult and slow.” The default is to cobble together whatever data looks good for the task at hand.
I wonder if this is due to a lack of awareness of data modeling, incentives to “just ship” and leave rigorous, formal, resilient practices for another time, or something else. Regardless, the consequences are all over the place.
In this morning’s YouTube show with the Seattle Data Guy, I called today’s data modeling “query-driven modeling.” I suppose we can also call it “just-in-time modeling.” The notion is to react to the question at hand, then move on to the next query…and the next query…and the next query…Sort of like how a puppy gets excited about the world, including pooping all over your nice rug.
If this is where we are as a data industry, it begs the question - does data modeling matter? Apparently, companies that eschew data modeling perform just fine. They make a ton of money, certainly enough to throw costly compute at whatever query needs to be run. And with AI replacing knowledge work anyway, what’s the point of data modeling? Or working, for that matter? AI’s going to replace knowledge work with better knowledge and less work shortly.
My open-ended question to you this week - 🔥What if we just burn it all down?🔥 What if we just forget the old practices and techniques of data modeling ever existed? Would we be fine? If not, why?
I look forward to your answers in the comments :)
Listen to the audio clip above on this topic, which is also my 5 Minute Friday on Spotify.
Cool Weekend Reads
Hope you all had a great week.
Here are some cool things I read this week…
Tech, AI & Data
Software engineers hate code (Dan Cowell)
Code is fun until it’s not…
“Don't write new code when you can use, improve or fix what already exists. If you must write new code, write only what you need to get the job done.”
Software Ate All the Easy Shit (Andrew Rea)
“Personally, I’m not willing to bet my career on beating Open AI, Google, Anthropic, etc. in LLMs. Nor am I willing to bet on being the 1 startup in 100 that finds a way to build an enduring company in “gen AI copywriting” or “gen AI outbound marketing for SDRs” faster than the incumbents or the dozens of venture-backed startups pursuing this.
Most of you probably shouldn’t either.
I am willing to bet that I can find a boring, unsexy problem, in a great market, build a better product, and distribute the shit out of it. Playing my role as a good soldier in the fight to spread the gospel of our Lord and Savior Shareholder Value.”
PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news
Welcome to the future…
Gossip Protocol (System Design)
Distributed protocols are something I strangely enjoy nerding out on. Enjoy!
Business & Startups
ChatGPT was an 'oh crap' moment for hundreds of CEOs (Insider)
Shareholder-driven development is a thing… See my old post, the Golden Rule of Value.
This CEO rightfully gets a lot of flak for being a douchebag. However, I expect these sorts of layoffs will become the norm, for better or worse.
Sarah Silverman is suing OpenAI and Meta for copyright infringement (The Verge)
The crazy, though unsurprising, part: “ThePile, the complaint points out, was described in an EleutherAI paper as being put together from ‘a copy of the contents of the Bibliotik private tracker.’ Bibliotik and the other ‘shadow libraries’ listed, says the lawsuit, are ‘flagrantly illegal.’”
I had a pretty rough childhood and was in countless fights growing up. As a result of my experiences and seeing the world for the gritty place it is, my kids had to learn jiu-jitsu at very early ages. They think they’re better for it, and I agree.
Now that billionaires are challenging each other to fights and dick-measuring contests (is this the modern-day duel?), fighting is cool. That said, Marc’s spot on. The world is dangerous…learn to deal with it accordingly.
New Content, Events, and Upcoming Stuff
This week
Monday Morning Data Chat - #134 - Should Your Business Chase Generative AI? w/ Andreas Welsch (Spotify, YouTube)
In case you missed it…
Monday Morning Data Chat - #133 - Intro to Data Contracts w/ Andrew Jones (Spotify, YouTube)
Monday Morning Data Chat - #132 - Data Collaboration From the Outside-In w/ Andrew Padilla (Spotify, YouTube)
The Joe Reis Show
Maya Mikhailov - Cutting Through the BS of AI, and Making it Useful in Business and Life (Spotify)
Peter Hanssens - Building Awesome Data Communities in Australia and Beyond! (Spotify)
Ranjith Raghunath - Modeling and Improving The Customer Experience (Spotify)
In case you missed it…
Tristan Handy - Balancing Competing Tensions and Handling Complexity (Spotify)
Solomon Kahn - The Keys to a Good Data Team, Why Embedded Analytics Suck, and More (Spotify)
Colleen Fotsch - From Athlete to Analyst (Spotify)
Kris Jenkins - Programming as Art, Finding Your Niche, and MANY Wonderful Tangents (Spotify)
5 Minute Friday - We Get the Community We Deserve (Spotify)
Upcoming
Monday Morning Data Chat - Dataframe Deep Dive w/ Devin Petersohn (Live on LinkedIn and YouTube)
The Joe Reis Show - Lots coming up!
Here are some cool upcoming in-person events I’ll be at in June and beyond for 2023
Taking July off…🏔️, except for the virtual Portable Conference and a few other things. My calendar is otherwise completely blocked off from July to early August, so let’s chat over email or similar.
Joe Reis + dbt roadshow - Atlanta (8/10), Seattle (9/7). More details are coming soon.
DataEngBytes. I’ll be on the continental tour in Perth, Brisbane, Melbourne, and Sydney for a couple of weeks. August 2023 (more info and registration)
Big Data London - I’m keynoting. Big up the London Massive. September 2023.
Europe - September 2023 TBA
Dubai - October 2023
India - October 2023
Canada - November 2023
Vegas - re:Invent 2023
More to come…
Thanks! If you don’t mind helping out…
Thanks for supporting my content. If you aren’t a subscriber, please consider subscribing to this Substack.
You can also find me here:
Monday Morning Data Chat (YouTube / Spotify and wherever you get your podcasts). Matt Housley and I interview the top people in the field. Live and unscripted. Zero shilling tolerated.
The Joe Reis Show (Spotify and wherever you get your podcasts). My other show. I interview guests, and it’s totally unscripted with no shilling.
Fundamentals of Data Engineering (Amazon, O’Reilly, and wherever you get your books)
Be sure to leave a nice review if you like the content.
Thanks! - Joe Reis
One of the major issues with data modeling is it's split into two camps, with a never-ending push/pull of who does what, or where the logic goes. The first is the "transformation" camp, where data modeling is the act of producing a set of tables according to some methodology like Kimball or Data Vault. The second is the "semantic layer" camp, where data modeling is the act of linking tables together and defining metrics on top of them.
Neither one of them is great at the end-to-end pipeline -- Team Transformation can never anticipate all the different dimensional cuts required by business users, and ends up in a never-ending spin cycle of fulfilling data requests. Team Semantic Layer usually runs into performance issues when querying fact-level data, and thus inevitably pushes some logic into the transformation layer, at which point, metric definitions are now split across tools.
It's an artificial divide. Both teams are trying to accomplish the same thing, but the current state-of-the-art tooling falls short. The industry needs something that unifies both of these camps. I wrote about this a bit more on my blog, in a post I called "The Data Modeling Divide": https://carlineng.com/?postid=data-modeling-divide#blog
They perform “just fine” in spite of, not because of, “just-in-time modeling.” No company is ever going to publicly expose its struggles. I worked for a company, a very big one, and we provided reporting “one query at just the right time,” and it was a nightmare. We spent so much time asking why this result didn’t match that result and looking like fools to each other and in front of customers. The business units are now cannibalizing each other because market conditions have shifted. You can get away with a lot, or should I say you can get by with a lot, when the external factors are a wind at your back. Those chickens always come home to roost. Fundamentals are fundamental for a reason.