Hey y’all. I’m back. Had to skip last weekend’s newsletter because I was in the middle of traveling back from Europe for the bajillionth time. This week’s travel is a bit lighter, with me going to Denver and Silicon Valley for some events. I’m exhausted and need a break from international travel; it’s so bad that I often don’t even know where I am or what day it is. So yeah, a break is necessary.
For the first half of 2024, my international travel schedule will be much lighter. I’ll hide in my Bat Cave, focusing on my new book, courses, and other awesome projects. I’ll still be traveling domestically during this time (with a few international trips mixed in), so be sure to find me if I’m in a city near you. Expect some big announcements in the first half of 2024.
I appreciate all of you for supporting my work. You’re truly awesome.
Thanks,
Joe Reis
AI is Data Management’s Hail Mary Pass
This morning, my good friend Samir Sharma wrote a LinkedIn post about data management, saying:
“In the last week I've spoken to a number of data management professionals about data and LLMs.
Most say we can't create LLMs as the data isn't perfect! I have a question about that, but, it might be a little too reflective for those folks than is necessary right now!
So, what I want to do is talk about the elephant in the room: the myth of 100% perfect data for building out LLMs.
The reality is that perfect data doesn't exist! There I said it! It never has!
In the realm of data management, the pursuit of flawless data is akin to chasing unicorns.
The truth is, perfect data is a myth. 😬”
I agree. The track record of data management’s pursuit of perfection is beyond poor. Data management has been stuck in the same rut for the last few decades. Something needs to change. Instead of repeatedly rehashing the same practices (and getting the same results), data management needs a second look as a field, and maybe its goals need to be reassessed. Perfect datasets very rarely exist. I’m not even sure what a perfect dataset looks like, but I do know what useful datasets look like. For machines, useful datasets power applications and machine learning models in production; for humans, useful datasets facilitate good decision-making and outcomes.
Consider the growth, complexity, and variety of the data types we now work with. The world moved beyond tabular data a long time ago. There’s endless text, audio, images, video, log data, etc. While we’re at it, throw in AI-generated content for fun, with one prediction holding that over 90% of the data on the internet will be AI-generated by 2025. Data is at a scale that’s beyond the ability of humans to make sense of it. Data management’s ideal of “grown-ups in the room” won’t work anymore (it barely worked in the past). Unscalable problems for humans are perfect use cases for machine learning and AI.
Let’s take it a step further. I believe we’re going to see AI-driven data management. Because this doesn’t really exist yet, at least as a widespread practice or technology, here’s my idea of what this looks like. AI agents work alongside humans (or instead of humans) to evaluate, create, and curate useful datasets. This week, I chatted with a company that used LLMs to curate vast amounts of text data, cutting what would have been tens of thousands of people-hours down to hundreds of hours. And it feels like things are just getting started. Chain together various purpose-built data management AI agents, and you’ve got an army of data management “professionals” working tirelessly on managing all of your data assets. This day is coming very soon, and I think AI is the Hail Mary pass that data management needs to succeed. Lord knows data management needs a reboot.
Listen to the audio clip above on this topic, which is also my 5-Minute Friday on Spotify.
Cool Weekend Reads
Here are some cool things I read this week. Enjoy!
Tech, AI & Data
Sam Altman Is Out at OpenAI; Mira Murati Will Be Interim CEO (WSJ)
Wow, what a doozy of a bombshell for a late Friday afternoon. This is probably the biggest CEO firing in a long, long time. Very curious to see how this unfolds. Stay tuned…
Microsoft is leading the charge with making everything AI-first for developers.
Google, Intel, Nvidia Battle in Generative AI Training (IEEE)
Interesting MLPerf results. Hardware is getting faster, better, and cheaper for generative AI training.
The Architecture of Serverless Data Systems (Jack Vanlightly)
This article discusses the architecture of serverless data systems like BigQuery, Spanner, and Amazon DynamoDB. It's a good read if you’re an architecture geek like me.
GraphCast: AI model for faster and more accurate global weather forecasting (Google)
“While GraphCast’s training was computationally intensive, the resulting forecasting model is highly efficient. Making 10-day forecasts with GraphCast takes less than a minute on a single Google TPU v4 machine. For comparison, a 10-day forecast using a conventional approach, such as HRES, can take hours of computation in a supercomputer with hundreds of machines.”
Impressive…
I Accidentally Saved Half A Million Dollars (Lucidity)
This came out a few weeks ago, and I somehow failed to post it. Whoever this person is, the blog posts are priceless. This person speaks about the problems in the data industry in a way that other grizzled vets and I find amazing.
Business & Startups
Business questions worth asking (Gabriel Mays)
Asking great questions gets you to better answers. This is one of the best short lists of business questions you should ask yourself and your team.
Oops! We Automated Bullshit (University of Cambridge)
“The problem isn’t AI. What we need to regulate is the bullshit. Perhaps the next British PM should convene a summit on bullshit, to figure out whose jobs are worthwhile, and which ones we could happily lose?”
New Content, Events, and Upcoming Stuff
Monday Morning Data Chat
Coming up…
Dave McComb, Sarah Nagy, and more…
In case you missed it…
EU AI Act w/ Kai Zenner (Spotify, YouTube)
Apache Hudi Deep Dive w/ Nadine Farah (Spotify, YouTube)
Why is Data Security So Hard? w/ Yoav Cohen (Spotify, YouTube)
Data Conference Recap (Coalesce, Gitex Dubai, DEWCon) w/ Kevin Hu (Spotify, YouTube)
The Joe Reis Show
Coming up…
Karin Wolok, Matt Harrison, Ben Rogojan, and more…
This week…
Peggy Tsai - Setting CDOs Up For Success (Spotify)
5 Minute Friday - Things I Didn’t Expect, AI & Data Management, and More (Spotify)
In case you missed it…
5 Minute Friday - The Biden AI Executive Order w/ Katharine Jarmul (Spotify)
Dave McComb - The Unreasonable Effectiveness of Knowledge Graphs (Spotify)
5 Minute Friday - The CEO and C-Data Stuff Divide (Spotify)
Bill Inmon - History Lessons of the Data Industry. This is a real treat and a very rare conversation with the godfather himself (Spotify)
Events
November
Las Vegas - AWS re:Invent, 11/28 to 11/30
December
No speaking events. Taking the month off :)
2024
dbt + Joe Reis Roadshow (Dallas) - TBA
Data Day Texas (Austin) - register here
Data Modeling Zone (Arizona) - register here
Skiers in Data (Switzerland) - March, TBA
Malaga, Spain - May, TBA
Vancouver, BC - June, TBA
South Africa - TBA
Dubai - TBA
Australia - TBA
Asia - TBA
Thanks! If you want to help out…
Thanks for supporting my content. If you aren’t a subscriber, please consider subscribing to this Substack.
You can also find me here:
Monday Morning Data Chat (YouTube / Spotify and wherever you get your podcasts). Matt Housley and I interview the top people in the field. Live and unscripted. Zero shilling tolerated.
The Joe Reis Show (Spotify and wherever you get your podcasts). My other show. I interview guests, and it’s unscripted with no shilling.
Fundamentals of Data Engineering (Amazon, O’Reilly, and wherever you get your books)
Be sure to leave a nice review if you like the content.
Thanks! - Joe Reis