The Rise and Fall of Standalone Data Discovery

Robert Yi

September 28, 2021

Last year, we launched a data discovery tool called Dataframe. We wanted to solve the following pain point: it’s hard to find and get context on tables in your warehouse. But by working with customers over the last year we’ve learned that standalone data discovery, while effective at controlling data sprawl in massive organizations, only provides a stopgap in the modern data stack without giving us what we all really want: a better, research-friendly SQL-writing experience. And better SQL-writing requires two simple problems to be addressed in the traditional data discovery workflow:

Richer discovery. You need to find your old work and the work of others as much as you need to learn about data.
Context in context. Having an app separate from your IDE adds heavy context switching costs.

In this article, I’ll discuss each of these and how Hyperquery, our SQL-enabled doc workspace, addresses both.

Towards a better SQL-writing experience, step 1: richer discovery.

You need to find past work, not just data.

We initially idealized the user journey as something like this:

And so data discovery was born. But we found our simplified view of this journey missed two points: (1) discovery is not complete without discovery of past work, and (2) data discovery often starts with query discovery.

First, the data discovery journey is part of a more holistic discovery process that involves queries and write-ups. Table-level context only provides a small fraction of the context needed to write a query. Full context comes from reading relevant write-ups and queries that reference the table in question, not hoping for sufficiently accurate and informative documentation.

Secondly, data discovery doesn’t always start with data. The best way to reliably find greenfield data sources is often not to search blindly for tables; it’s to search through queries or documents other people have written. Yes, you can apply labels within your catalog to signify that certain tables are good-to-go, but when there are hundreds or thousands of these, you’re going to have a tough time wading through them with any confidence. It’d be much simpler to see if anyone’s written a doc on the subject you’re interested in, then look for table usage within.

These problems are not insurmountable. But in our experience, a standalone data discovery tool needs heavy integration work to get even partial coverage of the queries being written and their metadata, and any queries you scrape externally won’t be qualified or contextualized in any way. It would be better if queries were written and contextualized in the same tool, so that proper context is retained at every step. And that brings me to my second point…

Towards a better SQL-writing experience, step 2: context in context.

Having a discovery app separate from your IDE adds heavy context switching costs.

“There are a lot of tools out there that address data awareness, and I’m excited to keep following them as they mature. … The thing that I’m not seeing in these products today is that the information is not in context. You need to go to a different tool to find information. And I think it’d be a lot better if that information about your data lived in your text editor or your query tool.”
- Drew Banin, CPO, dbt Labs

Even if you manage to transfer all query-writing context and metadata to your discovery tool, having separate tools for query-writing, doc-writing, and discovery adds heavy context switching costs to your analytics workflow. Here’s what your workflow actually looks like:

The dotted lines indicate the distinct paths we’ve all found ourselves on during the query-writing process. You look for work that you/others have done before, look for queries, then look for information about the tables, start writing a query, then repeat the entire process numerous times for a single query. If you’re doing all these things in different apps, you’re in context-switching hell.

Moreover, when your IDE and your data discovery tool are separate apps, moving between these workflows isn’t just difficult; it’s often impossible. How do you find the queries you wrote that use a certain table? How do you search all the docs you’ve written if they’re scattered across Google Docs?

Our solution: Hyperquery — the networked doc workspace for analytics.

We realized the solution to the problems above is a single app that lets you do three things: research data and past work, write queries, and write up your findings. Putting discovery and query-writing in the same workflow was a no-brainer, but our key realization was that the loop is only fully closed when users can also write up their findings, so that those findings can be discovered later.

To this end, we centered our app around a WYSIWYG, SQL-enabled doc experience, making it simple to write queries and add context as needed. We linked it all together through our data discovery backend, meaning:

You can do work in Hyperquery and trust that it’ll be discoverable later.

And as a result, the user journey is vastly simplified:

Let’s walk through a couple sample journeys in our app so you can get excited about this.

Seamless doc + SQL-writing.
Stop keeping a ton of tabs open in your IDE, only to have your queries get lost forever. Write your SQL in a delightful markdown format so you can find it later. Or publish it, so your team can read up on your work.

Get table context while writing queries.
Because we have a data discovery tool underpinning our query-writing experience, you can get rich (read: better than IDE) context while you write. Click on the link in the hover-over to get to the catalog side of our app.

Find docs/queries from tables.
As soon as you execute a query in a doc, we link it to tables that it references in the catalog side of our app. Next time you go diving for that analysis you did with a particular table, easily traverse the graph instead of scrounging. It’s lineage, but over all your write-ups and ad-hoc work.
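To make the idea concrete, here is a minimal sketch of how executed queries could be linked back to the tables they reference. It is not Hyperquery’s implementation: the names `referenced_tables` and `TableDocIndex` are hypothetical, and the regex is a deliberately naive stand-in for a real SQL parser (it will, for example, mistake CTE names for tables).

```python
import re
from collections import defaultdict

# Naive assumption: a table is any identifier (optionally schema-qualified)
# that directly follows FROM or JOIN. A real system would use a SQL parser.
_TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def referenced_tables(sql):
    """Return the lower-cased table names that appear after FROM/JOIN."""
    return {name.lower() for name in _TABLE_REF.findall(sql)}


class TableDocIndex:
    """Hypothetical inverted index from table name to the docs that query it."""

    def __init__(self):
        self._docs_by_table = defaultdict(set)

    def record_execution(self, doc_id, sql):
        # Called whenever a query inside a doc is executed.
        for table in referenced_tables(sql):
            self._docs_by_table[table].add(doc_id)

    def docs_for(self, table):
        # Look up without creating an empty entry for unknown tables.
        return sorted(self._docs_by_table.get(table.lower(), set()))


index = TableDocIndex()
index.record_execution(
    "doc-1",
    "SELECT * FROM analytics.orders o JOIN analytics.users u ON o.user_id = u.id",
)
index.record_execution("doc-2", "select count(*) from Analytics.Orders")
print(index.docs_for("analytics.orders"))
```

Because the index is keyed by table, going from a table page to every doc that touched it is a single lookup rather than a search through scattered files.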

Access everything through a universal search.
We still retain a Google-like search over your tables, but include your queries and documents in the same space, so everything you might need to learn/remember/discover is accessible from a single entrypoint.
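As a rough illustration of a single entry point, the sketch below searches tables, queries, and docs from one list. The `Item` type, the `search` function, and the term-count scoring are all assumptions made for the example; a production search would more likely rely on a proper full-text index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    kind: str  # "table", "query", or "doc"
    name: str
    text: str  # description, SQL body, or doc contents


def search(items, query, limit=10):
    """Rank items of every kind by how often the query terms occur in them."""
    terms = query.lower().split()
    scored = []
    for item in items:
        haystack = f"{item.name} {item.text}".lower()
        score = sum(haystack.count(term) for term in terms)
        if score > 0:
            scored.append((score, item))
    # The sort is stable, so items with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:limit]]


catalog = [
    Item("table", "analytics.orders", "one row per order"),
    Item("query", "weekly revenue", "select sum(amount) from analytics.orders"),
    Item("doc", "Churn analysis", "retention by cohort"),
]
for hit in search(catalog, "orders"):
    print(hit.kind, hit.name)
```

The point of the design is that the caller never chooses which kind of thing to search: a table, the query that uses it, and the write-up that explains it all come back from the same call.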

And that’s just a small glimpse of what Hyperquery provides: docs are also nest-able and organizable with rich markdown capabilities, and table detail pages have rich catalog capabilities — even a record of historical events associated with the table. We’ve been using Hyperquery ourselves, and we are narcissistically in love. And given heavy enthusiasm from our early customers, I can guarantee you’ll love it too.

Tweet @imrobertyi / @hyperquery to say hi.👋
Follow us on LinkedIn. 🙂
To learn more about Hyperquery, visit hyperquery.ai.
