The Linux Foundation has announced the formation of the Ultra Ethernet Consortium (UEC) to optimize Ethernet standards for high-performance networking. The consortium comprises industry leaders such as AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta, and Microsoft, who aim to refine Ethernet standards to better support the growing demands of artificial intelligence (AI), machine learning, and high-performance computing.
Collaborative Approach to High-Performance Networking
The UEC aims to build a complete Ethernet-based communication stack architecture that can handle a wide variety of workloads while being scalable and cost-effective. The consortium was founded by companies with long-standing experience in high-performance solutions, each contributing to the broader high-performance ecosystem in an egalitarian manner.
“This isn't about overhauling Ethernet,” said Dr. J Metz, Chair of the Ultra Ethernet Consortium. “It's about tuning Ethernet to improve efficiency for workloads with specific performance requirements. We're looking at every layer – from the physical all the way through the software layers – to find the best way to improve efficiency and performance at scale.”
Technical Goals and Working Groups
The consortium will focus on minimizing communication stack changes while maintaining and promoting Ethernet interoperability. The technical goals include the development of specifications, APIs, and source code to define protocols, signaling characteristics, interfaces, and data structures for Ethernet communications. The consortium will also work on link-level and end-to-end network transport protocols, congestion, telemetry, signaling mechanisms, and software, storage, management, and security constructs. The founding companies have initiated four working groups: Physical Layer, Link Layer, Transport Layer, and Software Layer.
The white paper from the Ultra Ethernet Consortium (UEC) discusses the forthcoming Ultra Ethernet Consortium Specification and its relevance to the networking demands of modern AI and high-performance computing (HPC) jobs. Here are the key points:
- Networking Demands of Modern AI Jobs: The paper emphasizes the growing importance of networking for efficient and cost-effective training of AI models. The increasing size of models, in terms of parameters, entries of embedding tables, and words of context buffers, necessitates large clusters for training and drives larger messages on the network. The network interconnecting these resources must be as efficient and cost-effective as possible.
- The Ethernet Advantage: The paper highlights the advantages of Ethernet-based IP networks, which include a broad, multi-vendor ecosystem of interoperable Ethernet switches, NICs, cables, transceivers, optics, management tools, and software from many participating parties. It also mentions the proven addressing and routing scale of IP networks, enabling rack-scale, building-scale, and data center-scale networks.
- Key Needs of AI and HPC Networks of the Future: The paper identifies several areas where improvements can be made to better deliver unprecedented performance for the increased scale and higher bandwidth of future networks. These include multi-pathing and packet spraying, flexible delivery order, modern congestion control mechanisms, end-to-end telemetry, and larger scale, stability, and reliability.
- Ultra Ethernet Transport (UET): The UEC proposes a new protocol, the Ultra Ethernet Transport, designed to deliver the performance that AI and HPC applications require while preserving the advantages of the Ethernet/IP ecosystem. This protocol includes features such as multipath, packet-spraying delivery, efficient rate control algorithms, APIs for out-of-order packet delivery, and scalable security.
- Security for AI and HPC: The UEC transport incorporates network security by design and can encrypt and authenticate all network traffic sent between computation endpoints in an AI training or inference job.
- Further Efforts in UEC-HPC and Beyond: The UEC is also developing technology to support the network needs of High-Performance Computing (HPC) of the future. It expects the UEC transport protocol to serve the networking demands of both AI and HPC jobs.
- About Ultra Ethernet Consortium: The UEC brings together companies for industry-wide cooperation on interoperability and to build a complete Ethernet-based communication stack architecture that best matches the rapidly evolving AI/HPC workloads at scale. The founding members are AMD, Arista, Broadcom, Cisco, Eviden (an Atos Business), HPE, Intel, Meta, and Microsoft.
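To make the multipath, packet-spraying, and out-of-order delivery ideas in the key points above more concrete, here is a minimal Python sketch. It is a toy model and not the UET algorithm, which the white paper does not specify at this level. The function names (`path_load`, `reassemble`), the modulo-based path choice, and the flow sizes are all assumptions made purely for illustration.

```python
def path_load(flows, num_paths, spray):
    """Count packets per path for a set of flows.

    flows: dict mapping a (hypothetical) integer flow id to its packet count.
    spray=False pins each flow to one path, as per-flow hashing would.
    spray=True spreads each flow's packets over all paths, round-robin.
    """
    load = [0] * num_paths
    for flow_id, num_packets in flows.items():
        for seq in range(num_packets):
            # Toy path selection: modulo stands in for a real hash or
            # load-aware choice, which this sketch does not model.
            path = seq % num_paths if spray else flow_id % num_paths
            load[path] += 1
    return load


def reassemble(packets, total):
    """Place packets by sequence number, so arrival order does not matter."""
    buffer = [None] * total
    for seq, payload in packets:
        buffer[seq] = payload
    return b"".join(buffer)


if __name__ == "__main__":
    # Two large flows that collide on the same path, plus one small flow.
    flows = {0: 1000, 4: 1000, 1: 10}
    print("per-flow :", path_load(flows, 4, spray=False))
    print("sprayed  :", path_load(flows, 4, spray=True))

    # Packets arriving in reverse order still reassemble correctly.
    arrived = [(3, b"d"), (2, b"c"), (1, b"b"), (0, b"a")]
    print(reassemble(arrived, 4))
```

In this toy setup, per-flow pinning leaves two of the four paths idle while one carries both large flows, whereas spraying evens out the load at the cost of packets arriving out of order. That is why the receiver places data by sequence number rather than relying on arrival order.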
Industry analysts and founding members have expressed their support and excitement for the UEC initiative. Dr. Earl Joseph, CEO of Hyperion Research, Addison Snell, CEO of Intersect360 Research, and Karl Freund, Founder and Principal Analyst at Cambrian-AI Research, all emphasized the importance of the UEC in meeting the growing network demands of AI and high-performance computing at scale.
Founding members also expressed their commitment to the UEC. Robert Hormuth from AMD, Hugh Holbrook from Arista, Ram Velaga from Broadcom, Rakesh Chopra from Cisco, Eric Eppe from Eviden at Atos Group, Justin Hotard from HPE, Jeff McVeigh from Intel, Alexis Björlin from Meta, and Steve Scott from Microsoft all shared their thoughts on the importance of the UEC and the role their respective companies will play in the consortium.
About the Linux Foundation
The Linux Foundation (LF) is a non-profit technology consortium that provides a neutral, trusted hub for developers and organizations to code, manage, and scale open technology projects and ecosystems. Its roots go back to Open Source Development Labs, founded in 2000, which merged with the Free Standards Group in 2007 to form the LF. The LF aims to standardize Linux, support its growth, and promote its commercial adoption. The LF also hosts and supports hundreds of other open source projects across various domains, such as cloud computing, security, blockchain, and artificial intelligence. The LF offers a range of services and programs for its members and the open source community, such as training and certification, events and networking, project insights and management tools, and ecosystem curation and community building.