With the rapid take-up of AI at scale across enterprises, one consequence is that AI is consuming a growing share of data center workloads.
Not only is AI set to accelerate demand for data centers, creating fresh impetus for investment, but it will also have implications for data center sustainability strategies and the nature of the infrastructure to be deployed.
Tirias Research, for example, forecasts that at its current clip, generative AI data center server infrastructure plus operating costs will exceed USD76 billion by 2028, more than twice the current estimated annual operating costs of Amazon's AWS, which accounts for roughly one-third of the global cloud services market.
The forecast assumes a 400% increase in hardware compute performance over that period, which is dwarfed by Tirias's estimate that processing workloads will increase fifty-fold.
Higher densities
According to a new white paper from Schneider Electric, the explosion in large training clusters and small edge inference servers will also mean a shift to higher rack power densities.
“AI start-ups, enterprises, colocation providers and internet giants must now consider the impact of these densities on the design and management of the data center physical infrastructure,” the white paper says.
Schneider's Energy Management Research Center has made its own forecasts on AI's impact on energy demand. It estimates that AI represents 4.3GW of power demand today and will grow at a CAGR of 26% to 36% by 2028.
This would result in a total demand of 13.5 GW to 20 GW, growing two to three times faster than overall data center power demand. By 2028, AI workloads will represent as much as 20% of total data center energy consumption.
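Schneider's headline range can be sanity-checked with simple compound-growth arithmetic. A minimal sketch, assuming a 2023 baseline and a five-year horizon to 2028 (the white paper does not state the exact baseline year, so those are assumptions here):

```python
def project(base_gw: float, cagr: float, years: int) -> float:
    """Compound a baseline power-demand figure at a constant annual growth rate."""
    return base_gw * (1 + cagr) ** years

# Schneider's figures: 4.3 GW today, growing at a CAGR of 26% to 36%.
low = project(4.3, 0.26, 5)   # lower-bound CAGR
high = project(4.3, 0.36, 5)  # upper-bound CAGR
print(f"Projected 2028 AI power demand: {low:.1f} GW to {high:.1f} GW")
# Prints roughly 13.7 GW to 20.0 GW, consistent with the quoted 13.5-20 GW range.
```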
Although inference workloads are expected to consume more power overall than training clusters, Schneider points out that they operate at a wide range of rack densities.
“AI training workloads, on the other hand, consistently operate at very high densities, ranging from 20-100 kW per rack or more,” it said.
"Networking demands and cost drive these training racks to be clustered together. These clusters of extreme power density fundamentally challenge the power, cooling, racks and software management design in data centers."
Power train challenges
Schneider outlines the likely impacts across four key areas: power, cooling, racks and software management.
Regarding power, AI workloads present challenges to the power train across switchgear and distribution.
Some currently used voltages will prove impractical to deploy, while smaller power distribution block sizes risk wasting IT space. Higher rack temperatures will also increase the chance of failures and hazards.
Cooling will be critical, and it is one of the areas where significant changes will be required as data centers transition to liquid cooling, a technology that has been used for over half a century in specialized high-performance computing.
"Although air cooling will still exist in the near future, we predict a transition from air cooling to liquid cooling as a preferred or necessary solution for data centers with AI clusters," says Schneider.
“Compared with air cooling, liquid cooling provides many benefits such as improved processor reliability and performance, space savings and higher rack densities, more thermal inertia with water in piping and reduced water usage.”
With AI clusters, servers will need to be deeper, power demands will be greater, and cooling will be more complex.
As a result, racks will need to have greater density and weight capacity.
Digital twinning
Finally, software such as DCIM, BMS and electrical design tools will become critical in managing AI clusters.
Appropriately configured and implemented software, says Schneider, can deliver a digital twin of the data center to identify power constraints and the performance of cooling resources, which can inform better layout decisions.
Less margin for error increases operational risk in an increasingly dynamic environment. For this reason, Schneider recommends creating a digital twin of the entire IT space, including the equipment and the VMs in the racks.
"By digitally adding or moving IT loads, you can validate that there are sufficient power, cooling, and floor weight capacities to support them," says Schneider.
“This informs decisions to avoid stranding resources and to minimize human error that might lead to downtime.”
Lachlan Colquhoun is the Australia and New Zealand correspondent for CDOTrends and the NextGenConnectivity editor. He remains fascinated with how businesses reinvent themselves through digital technology to solve existing issues and change their business models. You can reach him at [email protected].
Image credit: iStockphoto/thexfilephoto