When the automobile went mainstream in the U.S. during the 1920s, the nation re-built roads, wrote new laws, and erected signs and signals to control the flow of traffic. Without these process improvements and infrastructural overhauls, the advent of the automobile would’ve been untenable. In the last decade, we’ve experienced something analogous in the world of data: the advent of the cloud data warehouse.
Modern cloud data warehouses are fast. The speed at which you can interrogate your data is unprecedented. But, just as with the advent of the automobile, we need to rethink how we work with data to ensure we don’t just recklessly drive [answer data requests] faster. We need:
I have no silver bullets to address the above within this article — just some observations that point to an emerging theme: the refactoring of our analytics processes.
When I was at Wayfair (2017–2019), pulling data was painfully slow. It could take minutes to hours to extract even the most trivial slices of data from Hive or Vertica. You might suspect that this slowness simply limited the amount of work that could be done, but the repercussions were broader. The slowness shaped the nature of the work we prioritized. I’d often favor work that didn’t require pulling from the warehouse: playing with data in-memory in Jupyter notebooks, clever algorithmic solutions, machine learning.
And I don’t think I was alone in orienting this way. The culture of data work in this era seemed to inherently favor the deep and exhaustive. I’d wager that the use cases we emphasized by doing this sort of slow work even shaped collective corporate attitudes on data work: problem-solving over stakeholder management; sophistication over interpretation; perhaps even data science over analytics. What we saw was all there was, and all we had the patience (or time) to see was work that lived outside of the warehouse. Iterative SQL workflows were prohibitively slow and therefore took a back seat.
A trivial yet powerful principle was at play here:
If you cannot get data quickly, you can’t focus on things that rely on getting data quickly.
And this quiet idea has been pervasive in data work over the last decade, with enduring ramifications for how we use data. Because data has historically been so cumbersome to pull out of legacy data warehouses, we designed everything, from our decision-making processes to our org structure, on this premise.
A step-function change in how I personally related to data came when I moved to Airbnb in early 2019. While Airbnb had a similar stack to Wayfair, its optimized Presto + Druid (Minerva) layers enabled queries to execute more quickly and reliably than at Wayfair. And because Airbnb had been this way for a while, the culture had time to organically adapt. The practical ramifications: product requests were nearly non-stop, exploratory work backed nearly every major decision, SQL was a foundational language rather than simply an access point, and data science folks (increasingly of the “analytics” flavor) were being embedded in every team.
But the shift presented a ton of new problems. The revolution was really stakeholder-led, meaning data leadership never gave us a clear definition of our role in decision-making conversations. We instead followed the cues of our product counterparts and juggled often orthogonal team initiatives with our embedded work. In the absence of crisp messaging from leadership, we were constantly navigating the tension between where we felt we were needed, what our managers wanted us to do, and what we felt was truly important. The prevailing sentiment from leadership at Airbnb (and, to be fair, at data science and analytics organizations across its tech unicorn peers) was “answer as many requests as you can, but also do your own projects.” At the end of the day, we built far too many dashboards, yet still did too much repeat work, and, ultimately, focused on the wrong problems.
The root cause: while we were empowered to answer substantially more questions than before, we did not set up our teams, infrastructure, or objectives in a way that aligned to this new world.
If you’ve followed the literature in the data community over the last couple of years, you may have noticed emerging narratives trying to establish better practices around how we should work in this brave new world. We’ve reached the consensus that stakeholder alignment is critical. Much like product teams live on user feedback, every piece of data work can be viewed as a “Data Product” used by stakeholders. If stakeholder problems aren’t addressed, the product is failing. We’ve also started to structure our teams differently and hire differently, building tightly-coupled working groups that can grease the pipeline from data preparation to decision. Still, we’ve made mistakes, focusing far too heavily on reactive speed and building far too many dashboards in the pursuit of self-service. We haven’t cracked the formula for a scalable, streamlined process, and I’m excited to see how we’ll tackle some of the big remaining problems:
Decision science is going to go mainstream.
The 2010s belonged to data science. Then we saw the rise of analytics eng. But aeng -> less time on prep + better data -> more questions -> decision science. Will purple people cleave into magenta and fuchsia people?
— robert yi 🐳 (@imrobertyi) February 28, 2022
I’m excited to see how the narratives will evolve. The industry has undergone astronomical shifts in how we procure and prepare data, but it’s time for us to reconsider the workflow around decision-making and analytics — the reasons why we got all this data in the first place.
We’re making a bet that the world needs to move towards better alignment and processes, and we’ve built Hyperquery to reflect our ambitions to this end. Hyperquery is a doc workspace for SQL + analytics work. By doing analytics work in an organized doc workspace, we make it easier not only to write and share query-based analyses but also to drive alignment, scale impact, and avoid reproducing work. Check out what we’re building at Hyperquery.ai.