The $3 Trillion Data Quality Opportunity and Our Investment in Validio
After Validio spent 2.5 years building its platform in stealth, J12 partnered with the Swedish data validation company, leading a $1.5m seed round alongside DHS Venture Partners.
Our decision was largely based on:
Opportunity and Timing
Team and Execution
Fit with J12
We have decided to share some of our reflections here.
…
Opportunity
“Data is eating the world”
Marc Andreessen famously declared in 2011 that “software is eating the world”. Since then, data has emerged as this new world’s most valuable asset. Companies big and small are investing at record levels to collect, categorize, stream, integrate, analyze, and act upon data, and this trend is only accelerating. Many remember how in 2010 people talked about the awesome scale of big data. Now that seems quaint. By 2025, our “datasphere” will be two orders of magnitude larger than in 2010. The global datasphere (global data stored in all formats) is expected to be almost four times bigger in 2025 than in 2019, growing from 45 zettabytes to 175 zettabytes. Put more simply: from 2019 to 2025, almost 4X more data will be created than in history to date! Hence, it should come as no surprise that 86% of enterprises plan to increase their data operations investments in the coming 12 months and that data engineer is the fastest-growing role in tech right now. The exponential growth of data and its inherent complexity cannot be managed with linear methods (i.e. humans) alone; we need intelligent software to assist at different stages of the data lifecycle.
“The good old machine learning adage holds true: Crap in, crap out”
Poor data quality costs companies A LOT. According to an estimate published in Harvard Business Review, $3,000,000,000,000 per year for US companies alone. Poor data quality can stem from various problems, such as system mergers, external data streams, and human error, and fixing those is naturally key. But in a world where data will rarely be perfect, one needs to validate it constantly. That’s where Validio fits in. Data-driven companies must be proactive rather than reactive when input data changes and disrupts, for example, machine learning models in production. Companies big and small are realizing that their IP was never the models; it is their data. A 2019 survey by O’Reilly showed that companies with mature machine learning practices (as measured by how long they’ve had models in production) cited a “lack of data or data quality issues” as the main challenge preventing them from further machine learning adoption.
“Data scientists & data engineers, the white-collar cleaners & plumbers?”
For many companies, attracting and retaining data scientists and data engineers is extremely costly. Still, many data scientists and data engineers end up spending up to 80% of their time wrangling and fixing bad data, a task that is rarely enjoyed and that causes frustration on top of the enormous financial cost for the company. We believe that there is a significant willingness to pay for any service that reduces this pain point. We’ve learned that validating data manually takes significant time and effort away from other business-critical and frankly more enjoyable work. Data engineers are responsible for ensuring that data is trustworthy and delivered at the expected quality. Paradoxically, given how important the role is, we’ve observed how ill-equipped many are to do that necessary work. Automating data validation and quality monitoring with software will reduce the burden on data science teams, allowing them to focus on more value-adding (and enjoyable) work.
That is what the Validio founders are obsessed with building: the best data validation and data quality monitoring software in the world.
…
Timing
“The rise of the modern data infrastructure stack”
Many of today’s fastest-growing B2B SaaS startups have data at their core. For example, Snowflake has leveraged the rise of the multi-cloud ecosystem to simplify access to the data warehouse and business analytics. Databricks has made it possible for large enterprises to efficiently batch query large amounts of data. Confluent, which according to Sequoia partner Matt Miller is the fastest-growing enterprise subscription company Sequoia has ever seen, has connected data across the enterprise and created the opportunity to act upon it in real time. The concept of the “modern data infrastructure stack” has been many years in the making; it started appearing as far back as 2012, when Amazon Redshift launched. During 2019 and 2020 in particular, however, the popularity of cloud warehouses grew explosively, accelerated by Snowflake’s blockbuster IPO, and so did a whole ecosystem of tools and companies around them, going from leading edge to mainstream. Validio is well-positioned to benefit from this strong secular trend and to define the data validation and quality monitoring category, which is still early in the making.
“Machine learning is finally breaking out of the hype bubble”
For years, normal companies (not the Googles, Amazons, Ubers, or Facebooks of the world) have struggled to apply machine learning or deep learning operationally, for various organizational and data infrastructure reasons. However, during the past few years, we’ve seen how machine learning in particular is finally breaking out of the PoC lab into mission-critical applications for many companies. These companies typically embarked years ago on a journey that started with big data infrastructure and has evolved along the way to include data science and machine learning. Those companies are now in the machine learning deployment phase, reaching a level of maturity where ML gets deployed in production.
Consistent with this, we’ve heard from various market leaders that the necessity of, and willingness to invest in, better data quality has grown significantly during the past 18 months, having previously been a relatively nascent topic for top management.
Hence we believe that Validio is entering the market at the perfect time, as data is the single most important component in the modern machine learning lifecycle. Data validation and quality monitoring software helps data producers and consumers understand whether there are issues or anomalies in the data and correct them. These can be, for example, nulls, schema changes, timeliness issues, or distribution changes. All of these have an impact on the performance of machine learning models in production.
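To make these categories of issues concrete, here is a minimal sketch of what batch-level checks for nulls, schema changes, and timeliness could look like. It is purely our own illustration and not Validio's implementation; the schema, field names, thresholds, and the `validate_batch` helper are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expected schema for an incoming batch of records (an assumption).
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "created_at": datetime}


def validate_batch(rows, now, max_null_rate=0.05, max_age=timedelta(hours=1)):
    """Return a list of human-readable issues found in a batch of dict records."""
    if not rows:
        return ["batch is empty"]
    issues = []

    # Schema check: every record must carry exactly the expected fields.
    for i, row in enumerate(rows):
        if set(row) != set(EXPECTED_SCHEMA):
            issues.append(f"row {i}: schema mismatch, got fields {sorted(row)}")

    # Null-rate and type checks, column by column.
    for column, expected_type in EXPECTED_SCHEMA.items():
        values = [row.get(column) for row in rows]
        null_rate = sum(v is None for v in values) / len(values)
        if null_rate > max_null_rate:
            issues.append(
                f"{column}: null rate {null_rate:.0%} exceeds {max_null_rate:.0%}"
            )
        if any(v is not None and not isinstance(v, expected_type) for v in values):
            issues.append(f"{column}: unexpected value type")

    # Timeliness check: the newest record must be recent enough.
    timestamps = [
        r["created_at"] for r in rows if isinstance(r.get("created_at"), datetime)
    ]
    if timestamps and now - max(timestamps) > max_age:
        issues.append("created_at: newest record is older than allowed")
    return issues
```

In a sketch like this, an empty result means the batch passed; anything else would be surfaced to the data team before the batch reaches a model in production.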
“The pandemic destroyed many machine learning models in production, showcasing the need for continuous data validation”
The start of the pandemic showed how fragile machine learning models, especially consumer-facing ones, can be when circumstances and human behavior change fast. The pandemic and the sudden change in input data caused hiccups for machine learning algorithms that run behind the scenes in, for example, inventory management, fraud detection, customer segmentation, product iteration testing, sales and demand forecasting, churn prediction, and logistics optimization. ML models trained on normal human behavior data found that normal had changed, and some stopped working as they should. ML models are designed to respond to changes, but most are also fragile: they perform badly when input data differs too much from the data they were trained on. The same thing happens if an ML engineer spends time training and serving an ML model built with bad data: the resulting model will be ineffective in production and can have negative secondary implications for user experience and revenue. As seen with the pandemic, ML models can also fail even if the data they were trained on is fine but the input data suddenly changes radically due to unexpected events. Recent events have further underlined the importance of data validation and accelerated the need for Validio’s real-time focus, which sets the service apart from existing alternatives.
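The failure mode described above, where live input data drifts away from the data a model was trained on, can be monitored with simple statistics. As an illustration only (this is our own sketch, not Validio's method), the population stability index (PSI) compares the binned distribution of a reference sample with that of a current sample; the function name, bin count, and floor value are assumptions.

```python
import math


def population_stability_index(reference, current, bins=10):
    """PSI between two numeric samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    # Fall back to a width of 1.0 if all reference values are identical.
    width = (hi - lo) / bins or 1.0

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

Identical samples give a PSI of zero, and the value grows as the distributions diverge. A commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, though the right alerting threshold depends on the use case.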
Further readings: Data quality - a primer by Astasia Myers at Redpoint VC, The 2020 Data & AI Landscape by Matt Turck at FirstMark, Data Quality at Airbnb Part 1 & Part 2 by Vaughn Quoss, Jonathan Parks, Paul Ellwood at Airbnb, Data Management Trends From An Investor Perspective - Episode 136 by Data Engineering Podcast (in particular from minute 17:20), The Great Data Debate - Episode 608 by The a16z Podcast
…
Team
“A team with deep industry insight and thought leadership in the community”
Validio was founded by Patrik Tran (CEO), Urban Eriksson, and Emmanuel Chappat.
Patrik Tran holds a B.Sc. in Engineering Physics, an M.Sc. in Machine Learning, and a Ph.D. from Stockholm School of Economics (officially from Dec 18th, congrats btw!). He’s an acknowledged speaker and thought leader in the data science community and is actively involved as Chairman of Stockholm AI.
Urban Eriksson holds an M.Sc. in Machine Learning and a Ph.D. in Laser Physics from KTH, where he is also an AI researcher, and has to date been granted 7 international patents on different algorithms. In 2000 he was part of the founding team of optical component company Optillion (raised $68m), and he was later the chief data scientist at Finisar Corporation (NASDAQ: FNSR).
Emmanuel Chappat is known in the data science community for having single-handedly built the deep learning platform AI fiddle. He has previously held several technical co-founding roles.
“The founders have shown grit and ambition thus far”
The founders have been building the company and investing their time and resources at a high opportunity cost. From early on in our discussions, the team outlined their ambition to build a global category leader, with the necessary sacrifices along the way. We believe that they have the foundation in place to do so.
…
Execution
“Product-market fit — customers are asking for a solution”
While the product is in its early stages and this is reflected in the software’s UX, the solution has impressed early-adopter customers working with the Validio team. Meanwhile, we see an increasing appetite from the market as data quality is becoming one of the most central topics in the MLOps space, which concerns lifecycle management of machine learning systems in production.
“Sales market fit — healthy balance between customer value and decision complexity”
Investing in an enterprise SaaS company at the seed stage typically means that the company is far from having reached sales-market fit, i.e. having found the right channels and an efficient method to reach the right customers at the right cost (CAC). In layman’s terms: a repeatable sales process. Instead, we have therefore been looking at the potential for a healthy sales operation moving forward. We observe that for many firms solving the data validation and quality problem is of great value (€1M+). At the same time, Validio’s solution is not particularly intrusive and can be tested without lengthy decision-making and procurement processes on the customer side. In practice, this means that a single customer can derive substantial value relative to the complexity and length of their decision-making process, which makes it easy for Validio to become a successful part of their modern data stack.
Further readings: a16z Podcast: Product-Market SALES Fit (What Comes First?), All Things Sales! 16 Mini-Lessons for Startup Founders
…
Fit with J12
“Early proof of product but pre-product-market-sales fit”
We rarely get involved with companies that have already figured it all out. As outlined in the Execution section above, J12 likes to partner with companies in the earliest stages of building a company, typically before product-market-sales fit is achieved. We’re not product specialists, or specialists in any vertical or industry for that matter, so we will never be better at building, testing, and iterating products than the entrepreneurs we partner with. But we have a good understanding of what it takes to build a team early on, iterate on sales and marketing strategy, nail pricing early and during proof-of-concept phases, transition from founder-led sales to a sales organization, and so on: the typical things you have to figure out early on no matter what you are building.
“Great fit with our angel network DHS Venture Partners”
As part of our strategy, we collaborate closely with the SSE alumni network DHS, which J12 helped found and still manages to this date. In the case of Validio, three of the 25 members invested alongside J12: ex-BCG leader Per Hallius, SaaS and data investor Fredrik Uhrström, and entrepreneur and angel investor Mattias Miksche. Two of them are also joining the board of directors.