AI and Other Evolving Applications Demand
New Approaches to Network Design and Testing

By: Asim Rasheed

Cloud-native infrastructure, open radio networks, edge computing, 6G, and the Internet of Things (IoT)—for large network operators, the list of technology trends to contend with is already long, and still growing. But for telecommunications providers struggling to keep pace with so much change, another massive trend is building on the horizon with the potential to dwarf all others: artificial intelligence and machine learning (AI/ML).

It's not that AI/ML is new; you could call it an overnight success 40 years in the making. But with the rise of today’s Large Language Models (LLMs), deep neural networks, and generative AI tools, the AI revolution has officially begun. Analysts with S&P Global forecast generative AI offerings will reach $3.7 billion in 2023, growing to $36 billion by 2028—a staggering 58% compound annual growth rate (CAGR). According to Dell’Oro Group, fully one-fifth of all Ethernet data center switch ports will connect to AI servers by 2027.
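
For readers who want to check the math, the roughly 58 percent figure follows directly from the two forecast values. The short Python sketch below applies the standard compound-growth formula; the dollar amounts are the ones cited above, and everything else is purely illustrative.

    # Sanity check of the growth figures cited above. The dollar values and
    # years come from the article; the formula is the standard CAGR calculation.
    start_value = 3.7   # forecast generative AI revenue in 2023, in $ billions
    end_value = 36.0    # forecast revenue in 2028, in $ billions
    years = 2028 - 2023

    cagr = (end_value / start_value) ** (1 / years) - 1
    print(f"Implied CAGR: {cagr:.1%}")  # roughly 57.6%, in line with the ~58% cited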

This new generation of AI applications and related workloads places tremendous stress on the world’s largest networks. Generative AI in particular brings new traffic patterns and networking requirements unlike anything operators have dealt with before. As a result, these trends are spurring a parallel revolution, still in its early stages, in the way networks are designed and tested. Today, hyperscalers bear the brunt of these growing pains, and they’re reimagining their networks in response. But telecommunications providers should pay close attention. In the not-so-distant future, they’ll need to address many of the same issues in their own networks: to bring new AI services to customers, tap into the power of AI-driven automation and optimization, and enable other dynamic applications of tomorrow.

AI Brings Unique Network Challenges

If you pay any attention to the tech sector, you’ve seen the headlines about the explosive growth of OpenAI’s ChatGPT. Within weeks of its November 2022 launch, the generative AI chat client shattered Internet growth records, ramping up to 100 million monthly active users. By June, the site had notched 1.6 billion visits. And that’s just one example. Already, dozens of other AI projects have been unveiled by every major tech company, with LLMs representing just one type of AI application.

Operators of the world’s largest hyperscale data centers (Amazon, Microsoft, Meta, and others) were already scrambling to add compute and network capacity to keep pace with growing cloud utilization. New generative AI workloads, however, represent an entirely different kind of challenge—with business and technical demands that can’t be met just by adding more servers and bumping up interface speeds.

To start with, the most effective compute clusters for AI workloads use Graphics Processing Units (GPUs), which are much better suited to running many tasks in parallel than conventional server Central Processing Units (CPUs). There are currently far fewer GPUs on the planet than CPUs, however, and they were already more expensive and harder to source before the latest AI tools started capturing headlines; today, surging demand has made them harder still to obtain. That scarcity means organizations paying a premium to acquire GPUs need to extract the best possible performance from them.

Even when data center operators have the processing power to scale up AI support, however, these workloads represent a very different kind of computing job with different networking requirements. AI/ML clusters function more like a single high-performance machine than a collection of servers, with GPUs needing to both crunch data individually and share huge amounts of information with many other processors in the cluster—all within very tight windows. This places extreme demands on networks, with four pain points standing out in particular:

  • Throughput: Training workloads for LLMs and other generative AI models are huge, typically requiring clusters with thousands of GPUs. Within these clusters, GPUs continually exchange parameters with one another and with CPUs and other specialized processors such as Tensor Processing Units (TPUs) and Deep Learning Processing Units (DPUs), generating massive bursts of data across the network.
  • Latency: The primary metric used to measure AI/ML processing performance, job completion time, is very sensitive to latency. If latency is too high or constantly fluctuating, the overall performance of the system quickly deteriorates, as the sketch following this list illustrates.
  • Timing synchronization: For many AI/ML workloads, these massive bursts of data need to happen as close to line rate as possible and must be synchronized across the thousands of processors in the cluster.
  • Network congestion: Relatedly, conventional network architectures that cannot keep pace with AI workloads inevitably introduce congestion, leading to variable throughput and jitter. While this variability has not typically been a major issue in modern hyperscale data centers, it may well affect AI/ML applications; left unaddressed, it creates inconsistent cluster performance and degrades job completion time.
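
To make the latency and jitter sensitivity concrete, consider a toy model of a single synchronous training step in which every GPU must wait for the slowest network transfer before computation can resume. The Python sketch below is a simplified illustration, not a model of any particular fabric or vendor implementation; the cluster size, base latency, and jitter values are arbitrary assumptions chosen only to show the effect.

    import random

    # Toy model of one synchronous collective step (for example, an all-reduce):
    # every GPU must receive data from its peers before the next compute phase
    # can begin, so each step is gated by the slowest transfer in the cluster.
    # All figures below are illustrative assumptions, not measurements.
    NUM_GPUS = 1024          # assumed cluster size
    BASE_LATENCY_US = 100.0  # assumed per-transfer latency, in microseconds
    STEPS = 1000             # number of training steps to simulate

    def step_time(max_jitter_us: float) -> float:
        """One synchronous step takes as long as the slowest of its transfers."""
        return max(BASE_LATENCY_US + random.uniform(0.0, max_jitter_us)
                   for _ in range(NUM_GPUS))

    def total_comm_time(max_jitter_us: float) -> float:
        """Total communication time across all simulated steps, in microseconds."""
        return sum(step_time(max_jitter_us) for _ in range(STEPS))

    random.seed(42)
    for jitter in (0.0, 10.0, 50.0):
        total_ms = total_comm_time(jitter) / 1000.0
        print(f"jitter up to {jitter:4.0f} us -> total communication time {total_ms:8.1f} ms")

Because each step waits on the slowest of thousands of transfers, tail latency matters far more than the average, which is why congestion-induced jitter shows up directly in job completion time.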

