[Photo from CBS]
A huge thanks goes out to the Jedis of Data: Russ Glass of Bizo; Jason Cavnar of Singly; Gil Elbaz, Eva Ho, Tyler Bell and Vikas Gupta at Factual; Miten Sampat of Quova; and Ivan Mitrovic of Where.com/PayPal; as well as my colleagues David Pakman and Brian Ascher for critiquing and suggesting content for this article.
This decade will be the decade of Data. Why? Because software development has become commoditized and super-cheap, so the main dimension of competition has shifted to data and the intelligence that comes with it. Secondary reasons are that more data is being collected than ever before, and that the tools to analyze this ginormous amount of data have now become widely available.
I think large companies will be built around data that is dynamic or difficult to aggregate, combined with algorithms that have network effects, feedback loops and machine learning built in. I call these closed-loop data businesses, adjacent to Big Data. Adding closed-loop data to your product and business model can help your startup break through the software arms-race and margin pressure of a standard data business.
More on closed-loop data businesses further down in this post. Let's first start with a general overview of data businesses.
DATA BUSINESSES
Typically, data businesses arise when:
- companies realize that data is critical to their business but their own efforts at building a usable database will be too costly or time-consuming (e.g. governments/businesses needing to do geospatial planning but not having real detailed geographic maps) OR
- multiple competitors need to pool data in a neutral company to fix something fundamentally broken in their business models (e.g. credit bureaus to fix/address the problem of high delinquency rates) OR
- new methods of data collection (e.g. networks of sensors like GPS devices or smart meters) and processing arise that are 10x faster/cheaper/easier than existing methods OR
- new pools of disparate data can be exclusively or de-facto exclusively accessed for the first time
Strategically, data businesses have several advantages. If one can build a network-effect-based data business, then it is hard for competitors to catch up, because all the data that has already been processed has in turn improved the proprietary algorithms through machine learning. Also, at scale, margins tend to be quite high (85%+) as data can be produced once and consumed multiple times with almost zero marginal cost. This low marginal cost also allows data companies to give away free access to certain slices of data to drive initial customer value, drive paid acquisition and generate leads for paying customers. And finally, once customers have incorporated this data into their own businesses, switching costs are usually high because of semi-proprietary formats (sometimes evolving into de-facto industry standards such as FICO scores).
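To make the margin argument concrete, here is a minimal sketch of the arithmetic. Every figure in it (the yearly cost of producing the dataset, the price per customer and the per-customer delivery cost) is a made-up assumption for illustration, not a number from any real data business.

```python
def gross_margin(customers, price_per_customer, fixed_data_cost, marginal_cost_per_customer):
    """Gross margin as a fraction of revenue."""
    revenue = customers * price_per_customer
    # The dataset is produced once (fixed cost); each extra customer adds
    # only a small delivery cost (marginal cost).
    cost = fixed_data_cost + customers * marginal_cost_per_customer
    return (revenue - cost) / revenue


# Hypothetical figures: the dataset costs 1,000,000 a year to produce,
# each customer pays 50,000 a year and costs 500 a year to serve.
for n in (10, 50, 200, 1000):
    margin = gross_margin(n, 50_000, 1_000_000, 500)
    print(f"{n:>5} customers: gross margin {margin:.0%}")
```

Under these assumed numbers the business loses money at 10 customers and reaches roughly 89% gross margin at 200, which is the "produce once, consume many times" effect in miniature.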
There are challenges, though. It usually takes a long time (sometimes years) to build out a big enough and accurate enough database - there is a problem of acquiring enough data to build out tuned algorithms and an accurate value-added database. Ingested data changes (sometimes quite dynamically) and the value-added database needs to get updated quite often. In the early days, labor costs are usually high in manually normalizing, processing and cleaning this data. And sometimes data collection is not possible from third parties and has to be done manually (Navteq famously spends upwards of $400 million per year on its van network to collect and process map data). Because third-party data sources are often paid sources, companies have to figure out how to scale their revenue without scaling data costs in line with revenue. And finally, data businesses are usually further down the stack and therefore at the mercy of competitive or governmental forces threatening their customers.
Some more recent challenges that have emerged: First, data businesses with direct revenue from their customers are now being threatened by data businesses that monetize through reach-driven advertising, not direct revenue. Second, think carefully before trying to exactly replicate a US or Europe data business in an emerging market - in many cases, the sophistication of the specific vertical eco-system may be in earlier stages and markets may not have evolved enough to understand the value of data or to respect it. And finally, many consumer businesses have swapped or augmented advertising with data sales as a business model - having user data doesn't mean someone will pay a lot for this data, nor that the data is at scale or differentiated (thanks, Brian).
Here are some examples of data businesses:
- financial services (Equifax, CIBIL India, CCIS China, FICO, Bloomberg, Venrock-backed BillFloat)
- retail (Catalina Marketing, Venrock-backed RSi)
- location (Skyhook, Navteq, Factual, OpenStreetMap, MapMyIndia, Venrock-backed INRIX)
- search/advertising (Google search, Dataxu, Rapleaf, BlueKai, Venrock-backed Bizo and Media6Degrees)
- healthcare (Venrock-backed Castlight and Kyru.us)
- travel (ITA Software, Galileo, Sabre)
- social (LinkedIn, Bit.ly, Klout, Singly, Gnip)
Let's turn now to a special subset of data businesses:
CLOSED-LOOP DATA BUSINESSES
These businesses:
- collect data from third-party sources (and potentially their own proprietary application/sources)
- process that data with their own 'special sauce' i.e. proprietary rules and algorithms to produce value-added data
- monetize either by selling data directly to third parties, or by using the data to improve the core product (like Google and LinkedIn do) or to provide other value-added services such as an ad network or an application
- receive data back from consumers of their data to aid in machine learning and data quality verification.
Here's a rough diagram of the data flow in a closed-loop data business:
Feedback loops are the crux of the issue here: depending on the vertical you're in, you can spot these in the form of click-through rates, ground-truth testing, fraud rates, acceptance rates, user complaint rates, redemption rates, multi-source cross-checking etc.
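The loop described above (ingest, process, serve, receive feedback) can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any of the companies mentioned actually work: the `Record` and `ClosedLoopDataset` names are hypothetical, the "special sauce" is reduced to trivial text normalization, and the confidence score is simply a smoothed share of positive feedback.

```python
from dataclasses import dataclass


@dataclass
class Record:
    value: str
    confirmations: int = 0
    complaints: int = 0

    @property
    def confidence(self) -> float:
        # Smoothed share of positive feedback; 0.5 when no feedback has arrived yet.
        return (self.confirmations + 1) / (self.confirmations + self.complaints + 2)


class ClosedLoopDataset:
    def __init__(self):
        self.records = {}

    def ingest(self, key, raw_value):
        # Stand-in for proprietary processing: normalize the third-party value.
        self.records[key] = Record(value=raw_value.strip().title())

    def serve(self, key):
        # Consumers get the value-added data plus a quality signal.
        record = self.records[key]
        return record.value, record.confidence

    def feedback(self, key, correct):
        # The closing of the loop: consumers report whether the data held up.
        record = self.records[key]
        if correct:
            record.confirmations += 1
        else:
            record.complaints += 1


dataset = ClosedLoopDataset()
dataset.ingest("cafe-42", "  main street cafe ")
for correct in (True, True, False, True):
    dataset.feedback("cafe-42", correct)
print(dataset.serve("cafe-42"))
```

In a real business the feedback signal would be whichever of the vertical-specific measures listed above applies (click-through rates, fraud rates, redemption rates and so on) rather than an explicit true/false report.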
In starting a data business, think about whether you can turn it into a closed-loop data business. Here are some questions to ask yourself in doing so:
Data sourcing and normalization:
- Can you source some of your data from businesses that collect this data as a byproduct of their business?
- Can you source data from free and easily-accessible databases e.g. TIGER maps, info.gov, etc.?
- Can you get exclusive access to some of this data?
- Can you find or buy orphan data businesses from other companies in your eco-system? As a corollary, can you get other players in the eco-system to outsource their data function to you?
- Can you barter with data suppliers so you reduce your COGS and increase gross margins?
- Can you involve the community to explicitly or implicitly input more data and verify/correct the data?
Data processing & machine learning:
- Can you introduce feedback loops from consumers of your data? Can it be made automatic, rather than manual? Can your algorithms learn from this feedback loop?
- How will you verify data quality and prove this to customers? Different customer segments may require different levels of quality.
- Do you have data scientists (and perhaps algorithms) to model and produce value-added data? Are you giving your data scientists the mandate to come up with new products? They are the only ones that can discover new products in data that consumer product managers typically don't see and can't plan for. (thanks, Ivan)
- Be ready to innovate on data pipeline and data storage technologies. Off-the-shelf "Big Data" technologies may not cut it. Example: LinkedIn's Voldemort project. (thanks, Ivan)
- What scale do you need to reach to have minimum viable database size?
Pricing:
- Does your pricing model reflect the business model already accepted by buyers in the industry? It should.
- Can you give away any lower value slice of the data to drive adoption and lower customer acquisition costs/barriers? (thanks, Russ)
- Are you pricing below other higher-cost data providers in your sector? You may or may not need to engage in price-based competition.
- Is the value in the data itself or in the application of the data? Should you keep the data and build out the application instead to capture full value? (thanks, Vikas)
Packaging and Standards:
- Your value-added data is not one big monolithic product you sell to customers. Think about how you would slice this up to provide different levels of value to different customers (e.g. historical slices, real-time feeds, segmented data, different data delivery types).
- Can you provide applications or value-added algorithms for specific segments of customers?
- Flexible APIs and tools will drive faster adoption e.g. provide sample code in all popular languages, write plug-ins for popular CMS programs, make the data available in popular formats etc. Don't just provide a read API. Think about mandating a write in exchange for certain types of data that your customers read from you.
- Are you adopting industry standards for how your data is stored and transferred? In other words, are you on the side of open or on the side of proprietary? (thanks, Gil).
- Invest early on in visualization capabilities - your customers will need to see and manipulate the data before they understand the true value it may bring them. Your data scientists and data visualizations should be able to produce not just charts but customer insights. (thanks, Miten).
[Data visualization from LinkedIn blog post on InMaps]
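One way to picture the "write in exchange for reads" idea from the API point above is a simple credit ledger: every contribution a customer writes back earns them a number of reads. The sketch below is only an assumption about how such a policy could be tracked; the `WriteForReadLedger` class, its parameters and the client name are hypothetical.

```python
class WriteForReadLedger:
    """Tracks read credits that API clients earn by contributing data."""

    def __init__(self, reads_per_write=10, free_reads=5):
        self.reads_per_write = reads_per_write
        self.free_reads = free_reads
        self.credits = {}

    def _balance(self, client_id):
        # New clients start with a small free allowance to lower adoption barriers.
        return self.credits.setdefault(client_id, self.free_reads)

    def record_write(self, client_id):
        # Each accepted contribution earns more reads.
        self.credits[client_id] = self._balance(client_id) + self.reads_per_write

    def try_read(self, client_id):
        # Returns True and spends one credit if the client may read, else False.
        balance = self._balance(client_id)
        if balance <= 0:
            return False
        self.credits[client_id] = balance - 1
        return True


ledger = WriteForReadLedger(reads_per_write=2, free_reads=1)
print(ledger.try_read("acme"))   # True: uses the free read
print(ledger.try_read("acme"))   # False: out of credits
ledger.record_write("acme")
print(ledger.try_read("acme"))   # True: the write earned new reads
```

The exchange rate between writes and reads is a pricing decision as much as a technical one, so it is kept as a parameter here rather than hard-coded.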
Legal:
- How will you protect your data from being copied? Dynamic, fast-changing data is hard to copy; static/slowly-changing data is easier to copy or commoditize. Map companies insert phantom geographic features/streets into their maps to track copying by other map database companies.
- Construct the terms of service for easy adoption and maximum feedback. Also take into account residual rights on data produced by your customers' algorithms that have your data as an input. OpenStreetMap was in limbo for a while with the change to an Open Database License (ODbL) from a CC-BY-SA 2.0 license. (thanks, Miten)
- Be especially careful about storing Personally Identifiable Information (PII) - preferably, don't do this. Also take into account different legal definitions and standards-of-use for PII in different countries. Consult a lawyer on this one.
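The phantom-feature trick that map companies use to track copying can be illustrated with a small check: plant a few fictitious entries in your own data, then look for them in a dataset you suspect was copied. This is a minimal sketch, assuming exact name matching after normalization is good enough; the `find_copied_traps` function and the street names are invented for the example.

```python
def normalize(name):
    # Ignore case and repeated whitespace so trivial reformatting does not hide a copy.
    return " ".join(name.lower().split())


def find_copied_traps(suspect_names, trap_names):
    """Return the planted trap entries that also appear in a suspect dataset."""
    suspect = {normalize(name) for name in suspect_names}
    return sorted(trap for trap in trap_names if normalize(trap) in suspect)


# Hypothetical phantom streets that exist only in our own database.
traps = ["Lye Close", "Moat Lane West"]
# Hypothetical street names taken from a competitor's database.
suspect_dataset = ["High Street", "lye  close", "Station Road"]

print(find_copied_traps(suspect_dataset, traps))
```

A single matching trap is a signal worth investigating rather than proof, since a real feature could share a name with a planted one by coincidence.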
TO FINISH UP...
Congratulations to those of you who have made it through this dense prose! You'll need some of this same fortitude to build your own data business!
As you think about your new startup idea or how to put extra legs on your existing venture, really think about whether you can introduce closed-loop data into your product and business model. In my opinion, pure data processing businesses have a limited shelf life and ever-decreasing margins whereas closed-loop data businesses have increased odds of breaking through.
Come talk to me if you have a world-changing idea in this area and need a partner-in-crime! And go hire a mad data scientist!
[Photo from Cayusa]