The News: The team at Stanford University recently announced the retirement of DAWNBench, signaling change ahead for benchmarking the performance of integrated AI hardware/software stacks. Stanford announced that DAWNBench will stop accepting rolling submissions on March 27th in order to help consolidate industry benchmarking efforts. The team that developed the end-to-end deep learning performance benchmarking methodology introduced with DAWNBench has been working with MLPerf to expand that methodology into a more comprehensive set of tasks and scenarios. Now that both the MLPerf Training benchmark suite and the MLPerf Inference benchmark suite have been put through testing, the team decided to “pass the torch” from DAWNBench to MLPerf moving forward. Read the full post from the Stanford DAWN team here.
The Retirement of DAWNBench — What’s Next for Benchmarking the Next Gen Infrastructure for Industrialized Data Science
Analyst Take: The retirement of DAWNBench by its team of creators at Stanford was of particular interest to me as it relates to what’s next for benchmarking the next generation of infrastructure for industrialized data science. When launched in 2017, DAWNBench was tremendously valuable. Now it’s exciting to look ahead to the next iteration of this industry standard. But first, a look at the back story.
The DAWNBench History
DAWNBench was launched in 2017 as an integral component of the larger five-year DAWN research project, which is part of the industrial affiliates program at Stanford University and is financially supported in part by founding members including Intel, Microsoft, NEC, Teradata, VMware, and Google. DAWN’s charter is to address the challenges of the age of machine learning and artificial intelligence.
DAWNBench was the first open benchmark to compare end-to-end training and inference across multiple deep learning frameworks and tasks. Its benchmarks support comparison of the performance of AI stacks across diverse model architectures, software frameworks, hardware platforms, and optimization procedures. DAWNBench provided benchmark specifications for image classification and question answering workloads, and it benchmarked on such stack metrics as accuracy, computation time and cost.
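To make the "accuracy, computation time and cost" metrics concrete, here is a minimal Python sketch of a time-to-accuracy measurement: train until a target accuracy is reached, then report elapsed time and the implied cost. It is an illustration only, not DAWNBench's actual harness. The names `time_to_accuracy`, `train_epoch`, `evaluate`, and `hourly_rate_usd` are hypothetical, and the sketch assumes cost is simply elapsed hours multiplied by an hourly instance price.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BenchmarkResult:
    epochs: int        # epochs needed to reach the target
    seconds: float     # wall-clock time, including evaluation passes
    cost_usd: float    # elapsed hours * assumed hourly price
    accuracy: float    # accuracy measured when the target was reached


def time_to_accuracy(
    train_epoch: Callable[[], None],
    evaluate: Callable[[], float],
    target: float,
    hourly_rate_usd: float,
    max_epochs: int = 100,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[BenchmarkResult]:
    """Train until `evaluate()` returns at least `target`.

    Returns None if the target is not reached within `max_epochs`.
    """
    start = clock()
    for epoch in range(1, max_epochs + 1):
        train_epoch()            # one pass over the training data
        accuracy = evaluate()    # validation accuracy after this epoch
        if accuracy >= target:
            elapsed = clock() - start
            cost = elapsed / 3600.0 * hourly_rate_usd
            return BenchmarkResult(epoch, elapsed, cost, accuracy)
    return None


if __name__ == "__main__":
    # Stand-in "model" whose accuracy improves by a fixed step each epoch.
    state = {"acc": 0.70}

    def fake_train() -> None:
        state["acc"] += 0.05

    result = time_to_accuracy(fake_train, lambda: state["acc"],
                              target=0.93, hourly_rate_usd=3.0)
    print(result)
```

Reporting time and cost only once a fixed accuracy target is met keeps a stack from looking fast merely because it stopped training early, which is the point of comparing stacks end to end.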
The AI solution market has matured to the point where users demand reliable benchmarks of the comparative performance of alternative hardware/software stacks.
DAWNBench Was a Key Catalyst for Open AI Benchmarking
Though its importance will soon be a historical footnote, DAWNBench was a key catalyst in the larger industry push for open AI performance benchmarking frameworks. This pioneering project predated and substantially inspired MLPerf and other AI industry benchmarking initiatives that have taken off in the past few years.
Today, and with increasing regularity, AI vendors claim to offer the fastest, most scalable, and lowest-cost platforms for handling natural language processing, machine learning, and other data-intensive algorithmic workloads. To bolster these claims, more vendors are turning to MLPerf benchmarks. And we are already seeing major AI vendors such as NVIDIA and Google boast of their superior performance on the MLPerf training and inferencing benchmarks.
With the Retirement of DAWNBench, MLPerf Is Now the Go-To Open Framework for AI Benchmarking
The retirement of DAWNBench was inevitable and should be considered a success. Its discontinuance after three years is a clear sign that its sponsors agree that its core mission has been accomplished.
I expect MLPerf to carry on DAWNBench’s practice of expanding the range of AI performance metrics that can be benchmarked. DAWNBench provided a framework for assessing competitive AI system performance on such metrics as accuracy, computation time, and cost, whereas previous AI accelerator benchmarks had focused purely on accuracy. As edge deployments become more common for AI, I expect that MLPerf will add memory footprint and power efficiency as metrics for benchmarking a wide range of device-level AI workloads.
Is There Any Follow-On AI Benchmarking Potential Within Remaining DAWN Workstreams?
What’s not clear following the announcement of the retirement of DAWNBench is how the larger DAWN project should proceed now that its most important contribution to the development of standardized AI infrastructures is behind it.
Open AI benchmarks are clearly useful for assessing whether any deployed AI hardware/software stack meets any or all of the applicable productionization metrics. Having benchmarks gives AI professionals confidence that a deployed hardware/software stack can meet the stringent productionization requirements of industrialized data science pipelines.
But the other subprojects under DAWN feel like a potpourri of interesting initiatives without any clear organizing vision:
- AI insight acceleration: The project has developed an analytic monitoring engine (MacroBase) that directs user attention to the most useful, interesting trends in high-volume data sets and streams.
- AI development productivity enhancement: The project has developed a high-level domain-specific language (Spatial) that uses a parameterized high-level abstraction to support programming of reconfigurable AI-accelerator hardware.
- AI solution acceleration: The project has developed a fast, parallel runtime engine (Weld) that uses a lightweight intermediate representation to optimize code for a wide range of data science workloads, as well as a scalable AI-accelerated engine (NoScope) for high-volume, real-time inferencing and querying over network video streams.
- AI training automation: The project has developed a tool (Snorkel) for programmatically modeling, generating, and managing labels for synthetic training data in domains for which large manually labeled training sets are unavailable or not easy to obtain. Leveraging weak supervision, it allows subject matter experts to write functions that assign labels to data automatically, and it uses a generative model to reconcile the noisy, sometimes conflicting labels those functions produce.
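The programmatic labeling idea behind Snorkel can be sketched in a few lines of plain Python. This is a simplified illustration that does not use the Snorkel library: the task (flagging customer complaints), the labeling functions, and the helper names are all hypothetical. It also combines votes by simple majority, a stand-in for the learned label model that Snorkel itself uses to weigh labeling functions.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Label values; ABSTAIN means a labeling function has no opinion.
ABSTAIN, NOT_COMPLAINT, COMPLAINT = -1, 0, 1

LabelingFunction = Callable[[str], int]


def lf_mentions_refund(text: str) -> int:
    """Heuristic: asking for a refund suggests a complaint."""
    return COMPLAINT if "refund" in text.lower() else ABSTAIN


def lf_says_thanks(text: str) -> int:
    """Heuristic: thanking the company suggests no complaint."""
    return NOT_COMPLAINT if "thank" in text.lower() else ABSTAIN


def lf_reports_breakage(text: str) -> int:
    """Heuristic: reporting a defect suggests a complaint."""
    lowered = text.lower()
    if "broken" in lowered or "not working" in lowered:
        return COMPLAINT
    return ABSTAIN


def majority_label(votes: Sequence[int]) -> int:
    """Combine votes by majority; abstain when nobody votes or on a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def label_texts(texts: Sequence[str],
                lfs: Sequence[LabelingFunction]) -> List[int]:
    """Apply every labeling function to every text and combine the votes."""
    return [majority_label([lf(text) for lf in lfs]) for text in texts]


if __name__ == "__main__":
    lfs = [lf_mentions_refund, lf_says_thanks, lf_reports_breakage]
    docs = [
        "I want a refund, the device is broken",
        "Thanks, works great",
        "Where is the manual?",
    ]
    for doc, label in zip(docs, label_texts(docs, lfs)):
        print(label, doc)
```

The appeal for subject matter experts is that each heuristic is cheap to write and can be wrong some of the time. The combination step, whether a majority vote as here or a learned model, is what turns many noisy votes into usable training labels.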
Conceivably, the DAWN project could try to bring some benchmarking focus to each of these projects, including answering the following:
- How rapidly does MacroBase direct user attention to useful, interesting trends in data sets and streams compared to other analytic monitoring engines?
- How fast does Spatial enable AI-accelerator field-programmable gate arrays to be reconfigured compared to other domain-specific languages?
- How well does Weld optimize data-science code compared to other parallel runtime engines and intermediate representations?
- How scalably and with what latency does NoScope enable querying of network video streams compared to other approaches?
- How scalably and with what latency does Snorkel enable modeling, generating, and managing of labeling for training data compared to purely manual methods?
Snorkel Has the Most Traction of Remaining DAWN Projects
That said, trying to retrofit each of the DAWN projects with a benchmarking focus may not be what AI professionals need most.
Perhaps it would be best for Stanford to bring one of these projects, like Snorkel, for instance, to the forefront of its attentions. High-volume programmatic data labeling definitely scratches a key itch—that of industrialization of training data generation and preparation—felt by working data scientists.
So it’s no surprise that Snorkel has the most active community of any of the remaining DAWN projects. Most notably, Google, working with researchers from Stanford University and Brown University, has extended the open source Snorkel tool to suit it for enterprise-grade data integration workloads. Under the “DryBell” project, Snorkel has been modified to support higher-volume data processing and label creation. The researchers changed the optimization function used in Snorkel’s generative model to double the speed at which Snorkel processes data and applies labels. They also integrated Snorkel with the MapReduce distributed computation method so that it can be run loosely coupled across multiple computers.
The Takeaway – What’s Next for Benchmarking Next-Generation Infrastructure for Industrialized Data Science
DAWNBench’s contribution to the AI industry ecosystem is clear. DAWNBench catalyzed creation of an open framework and forum—MLPerf—within which hardware/software stacks can be benchmarked for a wide range of AI workloads.
What’s next for benchmarking next-generation infrastructure for industrialized data science? At the midway point in the DAWN project’s five-year journey, it would behoove Stanford to assess which, if any, of the remaining workstreams provides value in the development of an open ecosystem for industrialized AI pipelines.
Conceivably, the DAWN project could try to bring some benchmarking focus where appropriate to any or all of the remaining projects: MacroBase, Spatial, Weld, NoScope, and Snorkel. But I believe it would be more fruitful to emphasize Snorkel, given the significant industry traction it has already received as a next-generation open-source platform for programmatic labeling of training data.
In fact, there is no clear alternative to Snorkel in today’s open AI ecosystem. Training workflows are becoming more automated, manual labeling is running into scalability limits, and synthetic data generation is proving itself sufficient for many AI DevOps challenges. That’s why I’m bullish on Snorkel and the benefit a focus there could provide to the industrialized data science community as a whole.
Futurum Research provides industry research and analysis. These columns are for educational purposes only and should not be considered in any way investment advice.