Expanding Capacity and Reach with a New Generation of Coherent Pluggables
Fri, 20 Sep 2024

Coherent Multi-Source Agreement (MSA) pluggable modules have played a key role in expanding deployment scenarios for network operators, with the introduction of 400G modules driving recent network transformation opportunities. We have seen how the introduction of a wide range of 400G MSA pluggable products has driven the recent adoption of IP-over-DWDM architectures, enabling direct router-to-router metro connections over optical fiber, as well as higher port-density transponder designs.

The Optical Internetworking Forum (OIF) kicked off the 400G MSA pluggable generation with development of the 400ZR implementation agreement, enabling point-to-point amplified links up to 120 km operating at 60+ Gbaud symbol rates. Around the same time, the OpenROADM MSA defined 400G interfaces for ROADM networks and extended reaches; the OpenZR+ MSA leveraged these higher-performance interfaces to enable interoperable, enhanced-performance links for 400G pluggable modules (Figure 1).

The introduction of high-transmit-optical-power (>0 dBm) ZR+ modules such as Acacia’s Bright 400ZR+ module further expanded the 400G MSA pluggable space to include brownfield ROADM network architectures, where existing transponder channels typically launch at ~0 dBm. Driven by increasing bandwidth demands from applications such as AI, network operators are now looking toward a new generation of MSA pluggable products that further expands the networking scenarios they can leverage to scale and meet these demands.

How Industry Standards Benefit MSA Pluggable Module Adoption
The latest array of MSA pluggable products introduces a new set of capabilities that network operators can utilize to increase capacity and extend reach. These products enable 800G deployment with ZR, ZR+, and high-transmit-optical-power options, and extend existing 400G router interfaces to support ultra-long-haul (ULH) reaches. This new generation of modules continues to leverage industry standardization while also borrowing capabilities from performance-optimized coherent solutions: high-baud-rate transmission that doubles symbol rates from the previous Class 2 generation (~60+ Gbaud) to Class 3 (~120+ Gbaud), probabilistic constellation shaping (PCS) for enhanced transmission performance, and L-band support for spectrum range expansion.
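As a rough sense of how the Class 2 to Class 3 baud-rate step supports a doubling of per-wavelength capacity, the sketch below computes approximate net rates for dual-polarization 16QAM at representative symbol rates. The symbol rates and the overhead figure are illustrative assumptions, not values taken from the OIF or MSA specifications.

```python
# Rough, illustrative line-rate arithmetic for coherent DWDM interfaces.
# Symbol rates and overhead below are representative assumptions, not MSA specs.

def net_rate_gbps(baud_g, bits_per_symbol, polarizations=2, overhead=0.15):
    """Approximate net payload rate in Gb/s after FEC/framing overhead."""
    raw = baud_g * bits_per_symbol * polarizations
    return raw * (1 - overhead)

# ~Class 2 (60+ Gbaud) with DP-16QAM (4 bits/symbol per polarization)
print(round(net_rate_gbps(60, 4)))   # ≈ 408 -> roughly a 400G-class payload

# ~Class 3 (120+ Gbaud) with the same modulation
print(round(net_rate_gbps(120, 4)))  # ≈ 816 -> roughly an 800G-class payload
```

Doubling the symbol rate while keeping the modulation format roughly doubles the payload, which is why the Class 3 baud-rate range is the foundation of the 800G pluggable generation.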

Figure 1.  Interoperability approaches at 400G vs. 800G.

Industry standardization of coherent solutions plays a key role in enabling economies of scale. Users of 400G coherent MSA pluggable modules such as 400ZR/ZR+ have benefited from the efforts of the OIF, OpenZR+ MSA, and OpenROADM MSA to provide industry agreements on module specifications, resulting in a diverse supply base. Similar standardization efforts are underway as users transition to 800G MSA pluggables. Three main elements differentiate 800G from 400G, each adapted from previously developed performance-optimized solutions.

  1. Interop PCS for Enhanced Performance
    A key difference between the 400G and 800G interoperability approaches for an enhanced-performance “ZR+” mode is the mechanism used to improve performance: instead of relying on the enhanced-performance open forward error correction (oFEC) code used at 400G, 800G uses industry-standard interoperable probabilistic constellation shaping (PCS). PCS is a transmission shaping technique that provides additional link performance beyond traditional transmission modes such as uniform 16QAM (a conceptual sketch follows Figure 3 below). Industry standardization of an interoperable PCS shaping function, a capability once confined to proprietary performance-optimized transponder platforms including those for submarine applications, is a major step forward for MSA pluggable module capabilities. Because the 800G ZR+ performance-enhancement mode uses this industry-standard interoperable PCS mode, multi-vendor 800G module supply-chain diversity is possible at the DSP ASIC level.
  2. High Baud Rate Design
    PCS is not the only technology adapted from performance-optimized solutions for MSA pluggables. Both 800G and 400G ULH pluggable solutions require a high-baud-rate design operating in the Class 3 ~120+ Gbaud range. Acacia’s performance-optimized CIM 8 module, capable of 140 Gbaud operation, has already proven deployed technology that exceeds the requirements of this new generation of MSA pluggables. Operation at these high baud rates benefits heavily from the advanced integration and RF signal optimization techniques that Acacia introduced in our 400G MSA pluggable product family.

Figure 2.  Tightly integrated components enable 120+ Gbaud data-rate capabilities.

  3. C & L Band Support
    A third element of the latest 800G MSA pluggable generation borrowed from performance-optimized designs is the ability to transmit in the L-band wavelength range, in addition to the traditional C-band DWDM range. Adding L-band infrastructure to a network approximately doubles its capacity. Network operators that want to use L-band expansion to increase network capacity now have an option beyond a transponder platform.

Figure 3.  The new generation of coherent MSA pluggable modules takes advantage of the L-band transmission window, adding to existing C-band support.
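To make the PCS concept from item 1 above concrete, here is a minimal conceptual sketch: it applies a Maxwell-Boltzmann distribution over the points of a 16QAM constellation, so that lower-energy symbols are transmitted more often, and computes the resulting information rate per symbol. The shaping parameter and constellation are arbitrary assumptions for illustration; this is not the interoperable PCS mode defined by the OpenROADM MSA.

```python
import numpy as np

# Conceptual illustration of probabilistic constellation shaping (PCS):
# low-energy 16QAM points are used more frequently than high-energy points,
# following a Maxwell-Boltzmann distribution. The shaping parameter is an
# arbitrary assumption, not a value from any interoperability agreement.

levels = np.array([-3, -1, 1, 3])                             # per-dimension amplitudes
points = np.array([(i, q) for i in levels for q in levels])   # 16QAM constellation
energy = (points ** 2).sum(axis=1)                            # symbol energies

nu = 0.06                                                     # shaping parameter (assumed)
prob = np.exp(-nu * energy)
prob /= prob.sum()                                            # Maxwell-Boltzmann probabilities

entropy = -(prob * np.log2(prob)).sum()                       # information bits per symbol
print(f"Uniform 16QAM: 4.00 bits/symbol; shaped: {entropy:.2f} bits/symbol")
```

Varying the shaping parameter trades spectral efficiency against noise tolerance, which is how PCS lets a single modem tune its performance to the link rather than jumping between fixed QAM orders.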

Pluggable Interoperable Interfaces are Driving Adoption of 800G Modules
Acacia’s latest family of coherent solutions is powered by its 9th generation DSP ASIC, Delphi. These modules include support for OIF 800ZR, interoperable 800G ZR+ using the OpenROADM interop PCS mode, and 400G ULH for ultra-long-haul reaches. They utilize Acacia’s 3D Siliconization, a highly integrated design that enables high-baud-rate modulation. With support for QSFP-DD and OSFP form factors, >+1 dBm transmit optical power, and L-band support, Acacia’s Delphi generation of products leverages the deployment successes of our performance-optimized CIM 8 module to provide MSA pluggable products with increased capacity and longer reaches.

Figure 4.  Acacia’s latest generation of MSA pluggable 800G and 400G ULH modules.

Similar to the successful path of 400G pluggables, these modules are delivering the performance and interoperability that are critical for driving economies of scale and widespread adoption. With data center bandwidth continuing to grow rapidly, fueled by emerging applications such as AI, these high-performance pluggable modules are on track to become an important tool for network operators to cost-efficiently scale their networks and meet this surging demand.

See Us at ECOC 2024!
Acacia is excited to be participating in the OIF interoperability demo at ECOC 2024, showcasing both its 400G and 800G pluggables in the OIF booth #B83. Acacia will also be demonstrating the interoperable 800G ZR+ module in its meeting room at ECOC; contact us to set up a meeting.

We hope to see you in Frankfurt!

Future Proofing Transport Networks for AI
Tue, 10 Sep 2024

With the rise of generative artificial intelligence (AI) applications and the massive buildout of AI infrastructure, the optics industry is at the forefront of this evolution, since improved optical interconnections can mitigate bandwidth constraints within an AI cluster. This was one of the hottest topics at OFC 2024, with LightCounting forecasting that total sales of optical transceivers for AI cluster applications may reach approximately $52 billion over the next 5 years.

While the near-term focus has been on how AI will affect the technology around short-reach interconnects, there will certainly be an impact on interconnections beyond the AI clusters and beyond the AI data center, in hyperscaler networks.

The question is: beyond the short-distance, high-bandwidth interconnections, how will AI traffic affect the optical transport environment outside the intra-building network, in the metro, long-haul, and longer-reach applications where coherent optical transmission is heavily utilized?

Figure 1.  Sales Forecast for Ethernet Optical Transceivers for AI Clusters (July 2024 LightCounting Newsletter, “A Soft Landing for AI Optics?”).

Effect of Past Applications on the Transport Network
Bandwidth-intensive computing is associated with both AI training and AI inference, where inference refers to the post-training phase in which the model is “ready for the world,” producing inference-based output from input data using what it learned during training. In addition to AI training’s requirements for a large amount of computing power and a large number of high-bandwidth, short connections, bandwidth requirements beyond the AI data center are also foreseen. To understand how network traffic patterns may evolve beyond the AI data center, let’s review some examples of how the wider transport network was affected by the growth of past applications. Although these applications may not closely match the traffic profile of AI applications, they can provide some insight into the effects that the growth of AI applications may have on optical transport, and thus on the growth of coherent technology.

If we look at search applications, the AI training process is broadly analogous to a search engine’s crawling bots combing the internet to gather data to be indexed (AI training being far more computationally intensive). The AI inference process is analogous to the search engine being queried by the end user, with results made available with minimal latency. While the required transport bandwidth for search bots and user queries is minimal compared to higher-bandwidth applications, the cumulative effect of search-related traffic contributes to overall transport traffic, including bandwidth from regional/local caching to minimize latency, as well as the subsequent traffic created by acting on search results.

Understanding how network traffic was affected by the growth of video content delivery is another example that can inform potential AI network transport traffic patterns. A main concern arising from video content distribution was the burden imposed on the network in delivering the content (especially high-resolution video) to the end user. To address this concern, content caching, in which higher-demand content is cached closer to the end user, was implemented to reduce overall network traffic from the distribution source to the end user, as well as to reduce latency. While it is too early to predict how much network traffic will increase due to expansive queries to and responses from AI inference applications, the challenge is to ensure that the latency for this access is minimal. One could draw an analogy between content caching and edge computing, where the AI inference model sits closer to the user, with increased transport bandwidth required for these edge computing sites. The challenge, however, would be to understand how this affects the efficiency of the inference function.

Turning to cloud computing for insights on traffic patterns, the rise of (multi-)cloud computing resulted in intra- and inter-datacenter traffic (a.k.a. east-west traffic) increasing as workloads traversed the datacenter environment. A similar rise in this type of traffic is possible with AI, as training data may be dispersed among multiple cluster sites and inference models may be distributed to physically diverse sites to reduce latency to end users.

In each of these examples, as demand for the application increases, the transport bandwidth requirements also increase, driven not only by the target data (e.g., search results, video) but also by the overhead and intra-datacenter traffic needed to support the application (e.g., content caching, cloud computing, backend overhead). Traffic behavior for aggregating AI training content, as well as for distributing AI inference models and their results, may be similar to the traffic patterns of these earlier applications, putting pressure on network operators to increase capacity in their data center interconnect, metro, and regional networks. Long-haul and subsea networks may also need to expand to meet the demands of AI-related traffic.

Figure 2.  A scenario in which the network fabric physically expands due to facility power constraints, requiring high-capacity optical interconnections.

The Balance of Power and Latency
While the application examples above relate to how the AI application itself may affect bandwidth growth, what is becoming apparent is that the power requirements to run AI clusters and data centers are significant. In the past, as demand for cloud services grew, access to inexpensive local power sources helped drive site selection for large-scale data centers. Power facility and availability constraints, however, helped drive the adoption of physically distributed architectures, which then relied on high-capacity transport interconnects between data centers to maintain the desired network architecture (Figure 2). We anticipate a similar situation with AI buildouts requiring distributed facilities to address power constraints, with potential trade-offs of reduced efficiency for both AI training and inference. The distributed network would then rely on high-capacity coherent interconnect transport to extend the AI network fabric. Unlike cloud applications, physical expansion of the network fabric for AI applications has a different set of challenges due to the compute and latency requirements of both training and inference.

Figure 3. Extremely low latency is required within the AI cluster to expeditiously process incoming datasets during the training mode. Since datasets are collected before being fed into the training cluster, the process of collecting these datasets may not be as latency sensitive.

As we plan for AI buildouts, one common question is how the physical extension of an AI networking fabric may affect AI functions. While geographic distribution of AI training is not ideal, facility power constraints are certain to lead to growing adoption of distributed AI training techniques that attempt to mitigate the introduced latency. As part of the training process, sourcing the datasets fed into the training cluster may not be latency sensitive and would not be as affected by physical network extension (Figure 3). After training, when the inference model is complete, the goal is to minimize the latency between the user’s query to the inference model and the results transmitted back to the user (Figure 4). That latency is a combination of the complexity of the query and the number of “hops” between the inference model and the user. Latency reduction when accessing the inference model, as well as methods to effectively distribute both the training and inference functions beyond a centralized architecture to address single-site power constraints, are ongoing discussions within the industry.
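For a rough sense of the distance component of that latency, the sketch below estimates one-way fiber propagation delay. The ~5 µs/km figure follows from the speed of light in silica fiber (group index around 1.47), and the distances are arbitrary illustrative values rather than any specific deployment.

```python
# Rough one-way propagation delay over optical fiber.
# Light travels at roughly c/1.47 in silica fiber, i.e. about 5 microseconds per km.

C_KM_PER_S = 299_792        # speed of light in vacuum, km/s (approx.)
FIBER_INDEX = 1.47          # assumed group index for silica fiber

def one_way_delay_ms(distance_km: float) -> float:
    return distance_km * FIBER_INDEX / C_KM_PER_S * 1000

for km in (80, 500, 2000):  # illustrative metro, regional, long-haul distances
    print(f"{km:>5} km  ->  {one_way_delay_ms(km):5.2f} ms one way")
```

Queuing and processing at each intermediate hop add on top of this propagation floor, which is why the physical placement of inference capacity matters for user-facing response time.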

Whether driven by power constraints, dataset sourcing, or inference response efficiency, the sheer growth of AI applications will drive traffic beyond AI cluster sites and into the wider network, requiring high-capacity interconnects.

Figure 4.  Minimizing latency for AI inference is a key objective.

Trading off power requirements, access to inexpensive and abundant power, and latency is familiar territory for bandwidth-intensive applications. The outcome that optimizes these trade-offs is application dependent and can even vary deployment by deployment. We continue to watch the evolving AI space to see how these network architecture trade-offs play out and how they shape the design of the transport network. High-capacity coherent transport can certainly influence these trade-offs. As we have already seen with cloud architectures, coherent high-capacity transport allowed networks to physically expand and alleviate power-source constraints by providing fat-pipe links between sites. We anticipate a similar scenario as AI network architectures expand.

The Ripple Effect
While the near-term focus on high-capacity interconnects for AI applications has been on short-reach connections within AI clusters, we are already seeing bandwidth requirements begin to increase, requiring additional coherent connectivity between data centers supporting AI. And while there is general agreement that the resulting bandwidth demand from AI applications translates to increased traffic across the network, we are at the early stages of understanding how specific segments of the network will be affected. Coherent optical interconnects for high-capacity transport beyond the data center already provide performance-optimized transponder solutions at 1.2T per wavelength, as well as 400G router-to-router wavelengths moving to 800G using MSA pluggable modules. This technology will continue to play a role in the transport solutions supporting AI applications, whether the expanding traffic is in the metro portion, data center interconnects, long haul, or beyond.
