How Much Data is Enough Data?
We recently discussed using predictive analytics to improve sales leads. You obviously need a dataset to mine in order to have predictive analytics, but how much data is enough data? When big data exploded in the business world, companies collected everything under the sun and were often stuck with unusable data swamps rather than the clean data lakes they had hoped for. Now, according to some scientists, we need to think smaller and focus on quality over quantity. We shouldn’t be concerned with analyzing all available data; instead, our efforts should go toward figuring out the amount of data needed to get to something worth noting. It all sounds well and good, but how do we do it?
Focus on the Right Data
Concentrate on curating the right kind of data. Clean and unbiased data is out there; you just may need to wade through the murky waters to find it. Following the steps I outlined for cleaning up your data swamp is a great place to begin your search.
Start by clearly defining what kind of data is useful to your company, as well as who will manage it. Working within the specific confines of your data goals and assigning ownership early on helps narrow your focus and avoid falling behind on management. It’s equally important to assign metadata when organizing your big data; that way, it’s searchable and ultimately useful to your company. I think silos in the workplace, particularly in the C-suite, have led to companies collecting too much data. Leaders need to talk to each other and work together to determine what data can be used by multiple departments to accomplish a goal. Doing so will help keep your data lake clear.
Another tip for paring down big data is to use the technology already available to you and, where possible, automate. I can’t stress this enough. Automation alleviates much of the burden on your human employees and manages your data more efficiently. Finally, once the data is clean, keep it that way. Avoid creating a “data swamp” by establishing clear guidelines for where and how data is to be collected. The more foresight you put into cleaning your data swamp, or avoiding one altogether, the easier it will be to home in on the most important data. Trust me: putting in the work now to keep your data lake clean will benefit you in the long run.
Know When Enough is Enough
Michael Berry, analytics director for the travel website TripAdvisor, knows a thing or two about testing data analytics. As he puts it, “Testing the predictive model’s performance by incrementally adding more data can shed light on when enough is enough.” When computing averages for a specific hotel and specific customer bid, Berry hit a steady plateau at 100,000 and realized it was enough. Any more data would be superfluous, having very little influence over the results. This is just one example of the Law of Diminishing Returns.
Uber is a great example of when companies should cry uncle in terms of data quantity. Though it upgraded transportation from the traditional taxi service, Uber essentially works using the same datasets as its predecessor. Understanding that more isn’t necessarily better, Uber cut back on the quantity of data it had been collecting and stopped running a “biological anomaly detection algorithm on visual data.” Today, it asks only for essential data: Who needs a ride and where in the city are they? It’s not about the amount of data they’re collecting; it’s about the type.
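To make the idea of incrementally adding data more concrete, here is a minimal Python sketch of a plateau check on a running average. It is not Berry's actual method or TripAdvisor's code. The function name find_plateau, the step size, the tolerance, and the synthetic "bid" values are all assumptions chosen purely for illustration.

```python
import random


def find_plateau(values, step=10_000, tolerance=0.001):
    """Return the first sample size at which the running mean stops moving.

    The mean is checked every `step` records. The search stops when the
    relative change from the previous checkpoint is at most `tolerance`.
    Returns None if the mean never settles within the data provided.
    Both `step` and `tolerance` are illustrative defaults, not recommendations.
    """
    previous = None
    total = 0.0
    for n, value in enumerate(values, start=1):
        total += value
        if n % step:
            continue  # not at a checkpoint yet
        current = total / n
        if previous is not None and abs(current - previous) <= tolerance * abs(previous):
            return n
        previous = current
    return None


if __name__ == "__main__":
    # Synthetic data standing in for real records (hypothetical values).
    random.seed(0)
    bids = [random.gauss(150, 40) for _ in range(200_000)]
    print("Mean settled after", find_plateau(bids), "records")
```

The same pattern applies to a predictive model: swap the running mean for a validation score measured at each checkpoint, and stop collecting once extra data no longer changes that score in a way that matters to you. Where the plateau falls depends entirely on your own data and the tolerance you pick.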
For the likes of TripAdvisor and Uber to make determinations about which datasets would most benefit their companies, they had to figure out which ones hindered their competitive advantage. By concentrating on waste, they could unearth the kinds and amounts of data that meant less spending and more productivity.
The first step in this excavation process is identifying company waste. Be it in production, retail, or other services, determining the sources of waste paves the way to useful data. Once the sources are targeted, you need to begin automating certain decisions. Those that are simple, repetitive, and operational in nature are the best candidates for automation. Finally, ask yourself what piece of information is necessary to effectively and consistently reduce waste. Master this trifecta and you’ll be left with more manageable, bite-size pieces of data.
Know the Limitations of AI
A common misconception about AI is that the more data it has, the more it can do with it. But it’s important to note that AI doesn’t analyze ALL the data, just the right data. Frankly, people must work correctly before AI will. HBR said it best: “companies that can zero-in on the impact they want to see and focus on curating the right datasets mapping to those goals have the best opportunity for generating really impactful results from AI.”
That’s not to say big data is a farce, but AI produces more worthwhile results with clearly established objectives and simple datasets. AI is not about winning the data war; it’s about winning the individual battles that bring you closer to victory. We’ve seen AI programs figure out how much data they need to analyze in order to see something worth noting. Again, it’s clear that small, high-precision pieces of data garner the best results.
As my partner Shelly Kramer stated in her article Why Deep Learning and AI Will Change Everything, “It’s no longer a question of who is using big data—it’s who’s using it well? Who is getting the most insights?” Despite data volumes increasing, most miners still use the same size datasets as they did a decade ago. Predictive models don’t require hundreds of fields of data; they require robust datasets and people who know how to use technology to make the most of the information. Often, companies find they do not need an abundance of or even new information. It’s likely they’ve already collected the data they need in terms of product, customers, and competitors. When you have clearly defined data objectives, a competent team of data scientists, and an understanding of how technologies work together, you have a company able to produce results that will maintain their competitive position.