Futurum Tech Webcast — A Light on Cloud Infrastructure: How Marvell is Scaling AI with New Optical DSP
In this episode of the Futurum Tech Webcast – Interview Series, we address the topic of cloud optics innovation and why it is critical to fulfilling the emerging demands of next-generation AI/ML systems. Marvell’s new Nova offering, a 1.6T PAM4 DSP optimized for high-performance fabrics in AI/ML environments, delivers breakthroughs in optical connectivity by enabling the highest speed of data movement in cloud AI/ML and data center networks. Powered by a groundbreaking 200 Gbps/lambda optical DSP, Nova doubles the optical bandwidth of current solutions to enable 1.6 Tbps pluggable modules for scaling AI clusters. Nova extends the multi-source pluggable optics ecosystem and provides the advanced technology needed to alleviate data center network bottlenecks as the industry transitions to 51.2 Tbps networking architectures.
My guest today is Nigel Alvares, VP of Global Marketing and Business Planning at Marvell Technology, a top-tier semiconductor company. Nigel is a returning guest on this show, and he astutely shares his insights and perspective on the direction of the cloud optics market segment and its vital role across AI/ML and data center environments.
To start our discussion, we focused on the Nova PAM4 DSP product and how the PAM4 DSPs sit inside of an optical module that plugs into the back of a switch. Nigel demonstrated how one works and deftly explained they essentially transfer data from the switch to optical modules so that data can be moved anywhere in the world.
Today, Marvell’s Nova product is ready to meet the bandwidth boom clouds are facing as data traffic is forecast to grow at over 40% per year. Notably, traffic is growing even faster inside the data center, where PAM4 DSPs are mostly used. Some estimates put East-West traffic between racks at 70% of data center traffic. As a result, clouds are investing heavily in switching and optical interconnects powered by DSPs such as Nova.
Our conversation drilled down on the following key topics:
- The impact the Marvell Nova solution will have on AI, especially in handling and scaling the required cloud workloads.
- Why Nova is the first chip with 200G per lambda or wavelength and how Marvell accomplished this breakthrough.
- The benefits Nova can deliver, such as doubling data capacity, getting the same amount of bandwidth with half the equipment, and saving valuable real estate in data centers.
- The latest developments in co-packaged optics and linear direct drive designs, including their pros and cons. Choosing linear direct drive (LDD) or co-packaged optics (CPO) approaches comes down to customer preferences, although it could be two more years before they become mainstream.
- An assessment of active electrical cable (AEC) technology and how it can enable server-to-server bandwidth to double every few years, akin to rack interconnects, as well as how optics could arrive inside servers as PAM4 commoditizes optical bandwidth.
- Why modules have a long life ahead of them, especially throughout AI clusters, and how Marvell’s 3nm innovations play a key role.
You can watch the video of our conversation here (and subscribe to our YouTube channel while you’re there):
Or listen here:
Or grab the audio on your streaming platform of choice here:
Disclaimer: The Futurum Tech Webcast is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors and we do not ask that you treat us as such.
Other insights from Futurum Research:
Marvell Breaks on Through to the 3nm Process Realm
Marvell Fiscal Q4 2023 & FY 2023: Delivers Record Revenue in Fiscal 2023; Cloud-led Data Center Revenue Flourishes
Marvell Unleashes Custom ASIC Portfolio Innovations to Fulfill New Global Data Infrastructure Demands
Ron Westfall: Hello everyone, and welcome to another episode of the Futurum Tech Webcast. I’m your host, Ron Westfall, Research Director and Senior Analyst here at Futurum Research. And in this episode we’re going to talk about cloud optics, we’re going to talk about AI, and we’re going to talk about silicon-driven AI innovation, what Marvell is bringing to the market, and so much more. And joining us today is my friend and distinguished colleague Nigel Alvares. Good day Nigel.
Nigel Alvares: Good day Ron. Glad to be on.
Ron Westfall: Right on. It is just that great. And thank you for joining us. And to start, tell us about yourself and how you came to join the Marvell team.
Nigel Alvares: Absolutely. So my name again is Nigel Alvares, and I lead all our global marketing and business planning. I joined Marvell about six years ago, just as Matt Murphy joined and started the transformation of Marvell. I joined to lead our flash business unit, so I was leading all the SSD controllers and data center initiatives in storage. And about three years ago I moved to a central function to lead our marketing activities. I’m super excited about all the new technologies we’re bringing to the table, and we’ll be talking about one of the key innovations we just rolled out.
Ron Westfall: It’s been an amazing journey and so much more. And let’s jump in with that, and talk about Marvell’s Nova offering. I know our audience is keen to know more about Nova.
Nigel Alvares: Yeah, absolutely. So at OFC, two months ago, early March in San Diego, we announced the industry’s first PAM4 DSP for 1.6 terabits per second, enabling 200 gigabits per second wavelengths. That was an industry first. And this continues Marvell’s leadership in enabling these PAM4 DSPs, which are the critical engines for cloud data center optical interconnects inside a data center.
Ron Westfall: And that’s an excellent kickoff, Nigel. And I think that triggers a couple of major questions. What impacts will the Nova technology have on cloud providers and their customers? And what are the challenges they’re facing out there?
Nigel Alvares: So just at the highest level, bandwidth inside cloud data centers is growing at a huge CAGR, greater than 40% per year, and it continues at that healthy pace as more and more data moves inside the data center, what we like to call east-west, or server-to-server sharing of data. If I like Ron’s tweet, that data has to go back and forth to make sure the relevant topics Ron has been talking about get put on my feed. So there’s a ton of inside-the-data-center traffic and bandwidth driving the optical interconnects inside the data center. And then on top of the traditional cloud applications, you have this new emerging AI era that has started. We’re all familiar with ChatGPT, which has really taken the world by storm, one million users within five days, and it continues to grow at a fast pace. That’s the other big engine driving optical connectivity roadmaps at a much faster cadence, and it’s actually the big driver for the Nova product line.
Ron Westfall: That’s excellent insight and I think that is definitely something that’s an ultra hot topic, AI. Every conversation I’ve had over the last two months has involved that automatically. And so naturally I would like to know more about what Marvell is doing to stand out in terms of enabling AI and meeting all the emerging AI demands out there in the market.
Nigel Alvares: Absolutely. So as I mentioned a little earlier, we innovate on these optical interconnects. We were the first to PAM4 technology, the pioneers of this through the Inphi acquisition. Our strategy is to be the first to the next speed, and we’re on a cadence of moving to the next speed, or doubling the bandwidth, every two years. That also lets us deliver 30% lower power per bit every generation, and a 30% lower cost for the modules. So we’re enabling these cloud data center companies to lower their footprints from a power perspective, and from a cost perspective, when you look at how much data they’re moving. As their data movement needs keep increasing, we’re lowering the cost for them to move to those faster speeds. That’s the number one value proposition we’re bringing to the table: scaling their bandwidth requirements and meeting them.
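The cadence Nigel describes, bandwidth doubling every two years with roughly 30% lower power per bit and 30% lower module cost each generation, compounds quickly; a small sketch makes the point. The gen-0 baseline values below are illustrative placeholders, not Marvell figures:

```python
# Sketch of the generational cadence described above: bandwidth doubles
# every ~2 years while power-per-bit and module cost drop ~30% per step.
# The gen-0 baseline numbers are hypothetical placeholders only.

def project_generations(n_gens, start_bw_gbps=800, start_pj_per_bit=10.0,
                        start_cost=1.0):
    """Return (generation, bandwidth Gbps, pJ/bit, relative cost) rows."""
    bw, power, cost = start_bw_gbps, start_pj_per_bit, start_cost
    rows = []
    for gen in range(n_gens):
        rows.append((gen, bw, round(power, 2), round(cost, 3)))
        bw *= 2          # next speed: double the bandwidth
        power *= 0.7     # ~30% lower power per bit
        cost *= 0.7      # ~30% lower module cost
    return rows

for gen, bw, pj, cost in project_generations(3):
    print(f"gen {gen}: {bw} Gbps, {pj} pJ/bit, {cost}x relative cost")
```

After two generations the sketch lands at 4x the bandwidth for roughly half the power per bit and half the relative cost, which is the footprint argument Nigel is making.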
Ron Westfall: I think those are clear differentiators. I think they really jump out. And another thing that leapt out at me with the Nova announcement is that it’s the first chip with 200 gig per lambda or wavelength. Can you explain that? Provide more insight on that capability?
Nigel Alvares: Absolutely. So today’s mainstream, and when I say mainstream, I’m still talking about the bleeding-edge AI applications out there, is a hundred gigabits per second per wavelength. And a wavelength is like the light going out of the module, so the speed would be a hundred gigabits per second. Nova enables 200 gigabits per second. There are multiple ways you can look at this: you can get the same amount of data in half the time, or you can double the amount of data you send in a given period of time. Those are the two ways to look at it at the highest level. And as AI drives more and more bandwidth and data movement, they want the faster speeds so they can keep their completion times, from a training perspective, as minimal as possible.
Because time is money for these cloud providers. They’re looking to rent out their GPUs and their CPUs, so they don’t want the interconnect to be the bottleneck. And I can tell you right now, if you look at a traditional server with two CPU sockets inside, a NIC is maybe a hundred gigabits per second, maybe 200 gigabits per second if you’re on one of the leading providers. A GPU’s bandwidth requirement is 3.6 to four terabits per second. Just think about that for a minute. That’s like 30 to 40 times more bandwidth for just one GPU. And guess what? A training set needs more than one GPU. Leading-edge clusters from the number one GPU provider out there put eight GPUs in a cluster. Multiply by eight, and now you’re looking at roughly 30 terabits per second. So from a 100 gig or 200 gig NIC to 30 terabits, you’re looking at a 150x to 300x expansion of bandwidth.
So guess what? You need faster optical interconnects to connect all these things. And it starts with these modules that we enable. Our DSP goes in here and then spits out optical links running at 200 gigabits per second. So really, really innovative stuff, driven by the appetite, or the need, to scale your cloud data centers. And back to your first question, I know I rattled through it a little bit because I’m super excited: 200 gigabits per second also reduces the number of lasers you need. Just think about that. Lasers are the number one failing component in a module. So if you’re able to cut the number of lasers in half, because you’re doubling the rate, back to your original question of 200 gig versus 100 gig, the reliability of these modules goes up.
And you’re also able to reduce the number of fibers. Instead of using maybe eight fibers to get 800 gigabits per second, you can now use four fibers, because each carries 200 gigabits per second. So there are so many wins when you move to 200 gigabits per second. We’re super excited by this innovation our engineering teams put together, and we’re bringing it up with all our leading AI and cloud customers right now.
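The back-of-the-envelope numbers Nigel walks through can be checked with a quick sketch; the NIC and GPU figures are the approximate ones quoted in the conversation, not datasheet values:

```python
# Back-of-the-envelope check of the bandwidth gap described above.
# Figures are the approximate ones quoted in the conversation.

nic_gbps = 200                 # leading server NIC, ~100-200 Gbps
gpu_tbps = 4.0                 # per-GPU bandwidth need, ~3.6-4 Tbps
gpus_per_cluster = 8           # leading-edge training cluster size

cluster_tbps = gpu_tbps * gpus_per_cluster          # ~32 Tbps total
expansion = cluster_tbps * 1000 / nic_gbps          # vs. a single NIC

print(f"cluster needs ~{cluster_tbps:.0f} Tbps, "
      f"~{expansion:.0f}x a {nic_gbps}G NIC")

# Doubling the per-wavelength rate halves the lasers and fibers needed
# for the same module bandwidth, e.g. an 800G module:
module_gbps = 800
lanes_100g = module_gbps // 100   # 8 lasers/fibers at 100G per lambda
lanes_200g = module_gbps // 200   # 4 lasers/fibers at 200G per lambda
print(f"800G module: {lanes_100g} lanes at 100G vs {lanes_200g} at 200G")
```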
Ron Westfall: Yeah, from my perspective, those really stand out; it’s truly a difference with distinction. And I like that point, Nigel. In my conversations with the hyperscalers and cloud providers, real estate is a big deal, a big concern. The data centers built over the last 20 years were built to have staying power, but now it’s becoming more challenging to expand. I believe this is exactly the type of technology and innovation needed to address that critical concern. What is Marvell doing to address the real estate challenges of the cloud providers out there?
Nigel Alvares: An awesome point, Ron; I didn’t touch on that when we talked about 200 gig versus 100 gig. One of the big benefits is that instead of needing, say, 64 modules for a hundred-gigabit-per-second type of architecture, you can now use 32 modules. That enables you to stay at a 1U form factor, which means you can get 2x the capacity or keep the same density as you scale. And it goes back to your point: you want to minimize the amount of space you take up for connectivity. So this module will help you build a 1U box for the leading-edge 51.2T switches that are coming out, versus a 2U if you use the 100-gigabit-per-second type of module. So this does save real estate from that form-factor perspective.
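The real-estate math here is simple to sketch: a 51.2 Tbps switch needs half as many front-panel modules at 1.6T as at 800G. The module counts below follow the figures in the conversation:

```python
# Front-panel module count for a 51.2 Tbps switch at two module speeds.
switch_tbps = 51.2

def modules_needed(module_gbps):
    """Number of pluggable modules needed to expose the full switch capacity."""
    return int(switch_tbps * 1000 // module_gbps)

print(modules_needed(800))    # 64 modules of 800G (100G per lambda)
print(modules_needed(1600))   # 32 modules of 1.6T (200G per lambda)
```

Halving the module count is what lets the 51.2T generation stay in a 1U faceplate instead of growing to 2U.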
Ron Westfall: Yeah, I think Nova is definitely introducing some key breakthroughs, and I think this aligns with developments many people are discussing in regard to co-packaged optics and linear direct drive designs. From your perspective, Nigel, what are the pros and cons here? What are you seeing in this major area?
Nigel Alvares: Yeah, there’s a lot of buzz around this linear direct drive (LDD) type of technology and co-packaged optics. The technology in those areas was introduced at a hundred gig, so all the cool stuff that Nova is bringing at 200 gig isn’t there today for LDD and co-packaged optics. But let’s just say they were there; let’s assume they were able to get there. LDD requires a lot of fine-tuning, very manual work, to make it all come together. And when you look at these cloud data center companies, the number one thing they’re looking for is reliability. They want uptime; they want to make sure their systems aren’t down. This LDD has a lot of failure potential, that’s one.
Two, they want an ecosystem of suppliers; they want multiple folks able to address their needs. They don’t want to have one supplier and have all their engineers fine-tuning things when something goes wrong. They want to avoid that, back to reliability. And the biggest thing going forward is they want to move to the faster speeds. The DSP architecture enables you to get to those faster rates at that 18-to-24-month cadence.
So those are some of the fundamental challenges that LDD and co-packaged optics have. Now, with that said, there might be some niche use cases, but for the mainstream, high-volume use cases, the DSP-based architecture is what our customers are all gravitating towards. It’s really a race to the next bandwidth, which the DSP is solving. And that’s the big value we’re bringing, as well as the big ecosystem of suppliers and multiple sources.
Ron Westfall: From my perspective, Marvell’s covering all the bases: you’re meeting exactly what’s required today while keeping a close eye on LDD and CPO developments. It could be two years before they become more mainstream, and regardless, Marvell’s ready. I think that’s very important to the customers and prospects out there. And on that positive note, let’s shift a bit and talk about something different that I know is very much top of mind in my conversations, something decision makers are keen to know more about, and that is AEC. Nigel, can you tell us more about AEC?
Nigel Alvares: Yeah, so AEC is another piece of the interconnect inside the data center, and it’s mainly server to top-of-rack. Today that uses a DAC, a direct attach copper cable, which is usually sub-three meters, up to three meters from where your server connects to the top-of-rack switch. The same phenomenon is happening there: the speeds are going up, and electrical or copper cables are running out of steam, so they need a DSP to boost or amplify the signal. What you see in DSP land in optical connectivity is happening in the electrical copper domain.
And we’re seeing a significant move there as these NICs start to move to a hundred or 200 gig per lane; they need this active cable. With an active cable, again, you have a DSP that goes inside the module connecting the two points, so there are DSPs on both sides of the cable. It’s different from just connecting any cable: both ends have to use the same type of module that plugs into the cable. I think we’re in the early innings of this shift, but as the NICs and GPUs and so on move to a hundred gig per lane coming out of their devices, you’ll start to see acceleration for AEC-type solutions.
Ron Westfall: Yeah, we’ve been talking about AEC for a while, and it’s great to see this development. I think it will be a difference maker. It addresses the fact that chips are becoming more intricate and more specialized; I anticipate more chips will be built on a customized basis, while in the meantime bandwidth has become more commoditized. So this is really good timing in terms of addressing both of those major trends. And with that, let’s look at something I think is really important: the modules Marvell’s bringing to market. What kind of shelf life can we anticipate? How long do you anticipate them making a difference across all these cloud environments, driving AI innovation and enabling all these top-priority applications?
Nigel Alvares: Yeah, great question. Taking a step back, I think you’ve heard us say this in the past: every cloud is relatively unique. They all have different workloads; they specialize. I could be a search company, a social media company, a platform company, and they all have unique architectures. You touched on that in teeing this up: they’re looking to optimize their infrastructure. So the cadence for all these different rates varies drastically by cloud provider. And if we take a step back, AI is driving the faster speeds at a much higher cadence than the traditional network side.
And what we’re seeing is that the life of these DSPs is long. We started building these DSPs over six years ago, and we’re still shipping some of those DSPs to some of these cloud providers, because they have different clusters within their data centers, and they’re looking to optimize each of those clusters, or each of those racks, for the workloads they’re supporting. So it’s hard to put a shelf life on it, but I can state that I would expect these new products we’re announcing to last for the next five to ten years in different pieces of the network. But there is this cadence to keep moving, the treadmill’s on, and we have to get to 3.2T. That’s what our team’s actively working on.
And one thing to close out on the AEC piece that I didn’t touch upon: it, too, is based on PAM4 technology. So all the great innovation we’ve been doing in the optical space can be leveraged here. And because these cloud providers have unique requirements, we do cloud optimizations inside these DSPs, taking what we’ve done in the optical domain into the electrical domain. That’s actually another big value prop, or differentiator, for Marvell: all the learnings from the optical domain get applied to the AEC side of things, with a lot of cloud-optimized silicon around that as well.
Ron Westfall: Yeah, I believe that’s good news for the ecosystem. And I think in terms of wrapping up, I think an important takeaway for the audience out there, Nigel, is why Marvell when it comes to AI? What is the difference that Marvell is making in terms of making sure the decision makers out there have the confidence to meet the unique and distinct demands of AI?
Nigel Alvares: Thanks for that question, Ron, because right now everyone talks about GPUs and TPUs. The reality is those GPUs, TPUs and AI accelerators need interconnects. It goes back to these training sets involving more than just one device. These are really clustered together, and the real bottleneck starting to emerge is the data movement between these devices. That’s where Marvell is innovating with the devices we’re talking about today. And not only that, the cloud providers are looking to optimize those AI accelerator devices, and that’s another piece of our strategy: we’ve been innovating at the platform level.
Just last week we announced this three-nanometer platform, all our building blocks around three nanometer. That three-nanometer technology is not only going to be used in ASICs; it’s going to be used in our DSPs and in our switches. So we’re bringing this whole platform together, and it’s really going to enable these cloud providers to partner more closely with Marvell, not only on the interconnects, which is a big problem right now, but also on taking their AI engines and optimizing them for their emerging workloads. Because they have their own internal AI workloads, and then they have their AI-as-a-service type of workloads, and those have different requirements. Marvell is partnering with these companies starting from the interconnect side and moving into the cloud-optimized silicon pieces, as well as the switching fabrics.
So there are a lot of capabilities when you start to innovate and invest at the platform level. And we have a great partner in TSMC on three nanometer; we’re on this treadmill with them. We introduced those building blocks a couple of weeks ago, and we continue to advance that. That’s going to be the big piece you’ll hear from us in the near future: all these new products around three nanometer, some cloud-optimized, whose specifics we can’t disclose, and some standard products like the Nova PAM4 DSP. We’re super excited and thrilled to have the opportunity to work with these cloud companies to advance their infrastructure, and really advance society, as they’re at the leading edge of all our infrastructure today.
Ron Westfall: Yeah, that’s excellent. And I’m definitely keeping a close eye on the three-nanometer breakthrough. I believe that will help Marvell make a strong case for meeting all the demands out there, cloud, AI, you name it. And on that high note, thank you so much, Nigel, for joining us. It’s always great to have you on to share what Marvell’s doing to advance cloud optics and the impact it’s having across the entire data center and cloud ecosystems, not to mention the AI trends we’re seeing. Thank you everyone, and have a great day.
Ron is an experienced research expert and analyst, with over 20 years of experience in the digital and IT transformation markets. He is a recognized authority at tracking the evolution of and identifying the key disruptive trends within the service enablement ecosystem, including software and services, infrastructure, 5G/IoT, AI/analytics, security, cloud computing, revenue management, and regulatory issues. Read Full Bio.