Friday, December 20, 2024

 

Many of the Big Tech companies maintain data centers for their private use. Changes to data center infrastructure to support AI workloads receive far less coverage than the public cloud does. Yet advances in machine learning, increased computational power, and the growing availability of and reliance on data are not limited to the public cloud. This article dives into the changes in so-called “on-premises” infrastructure.

AI has been integrated into a diverse range of industries, with key applications standing out in each: personalized learning experiences in education, autonomous vehicle navigation in automotive, predictive maintenance in manufacturing, design optimization in architecture, fraud detection in finance, demand forecasting in retail, improved diagnostics and monitoring in healthcare, and natural language processing in technology.

Most AI deployments can be split into training and inference. Inference depends on training, and while inference can run on mobile and edge devices as new data arrives, training places heavy demands on both computational power and data ingestion. Lately, data ingestion itself has become so computationally expensive that the hardware has had to evolve. Servers are increasingly being outfitted with GPUs, TPUs, or NPUs alongside traditional CPUs. Graphics processing units (GPUs) were initially developed for high-quality video graphics but are now used for highly parallel computing tasks. A neural processing unit (NPU) is a specialized hardware accelerator for neural-network workloads. A tensor processing unit (TPU) is an application-specific integrated circuit (ASIC) designed to run AI workloads efficiently. The more intensive the algorithm or the higher the throughput of data ingestion, the greater the computational demand. This results in roughly 400 watts of power consumption per server and about 10 kW for high-end performance servers. Traditional fan-based data center cooling no longer suffices, and liquid cooling is called for.
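The power figures above can be turned into a back-of-the-envelope rack-sizing check. This is a minimal sketch using the article's figures (~400 W per commodity server, ~10 kW per high-end AI server); the 20 kW per-rack air-cooling ceiling is an illustrative assumption, not a vendor specification.

```python
# Illustrative rack power and cooling estimate. The air-cooling limit
# below is an assumed figure for illustration only.

AIR_COOLING_LIMIT_KW = 20.0  # assumed practical per-rack ceiling for air cooling

def rack_power_kw(servers: int, watts_per_server: float) -> float:
    """Total rack power draw in kilowatts."""
    return servers * watts_per_server / 1000.0

def needs_liquid_cooling(power_kw: float) -> bool:
    """True when the rack exceeds the assumed air-cooling ceiling."""
    return power_kw > AIR_COOLING_LIMIT_KW

# A rack of 40 commodity servers at ~400 W each vs. 4 high-end AI servers
# at ~10 kW each (figures from the article).
cpu_rack = rack_power_kw(servers=40, watts_per_server=400)     # 16 kW
gpu_rack = rack_power_kw(servers=4, watts_per_server=10_000)   # 40 kW

print(f"CPU rack: {cpu_rack:.0f} kW, liquid cooling needed: {needs_liquid_cooling(cpu_rack)}")
print(f"AI rack:  {gpu_rack:.0f} kW, liquid cooling needed: {needs_liquid_cooling(gpu_rack)}")
```

Even a sparsely populated rack of AI servers blows past an air-cooling budget that a full rack of commodity servers stays comfortably under, which is why the cooling conversation dominates AI retrofits.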

Enterprises that want to introduce AI into their on-premises IT infrastructure quickly discover the same ballooning costs they feared with the public cloud, and those costs come not just from real estate and energy consumption but also from a growing inventory of both hardware and software. While traditional on-premises data was housed in dedicated storage systems, the rise of AI workloads poses DevOps requirements and challenges that can only be addressed by an exploding number of data and code pipelines. Colocation of data is a significant challenge, both in terms of networking and the volume of data in transit. The lakehouse architecture is finding popularity on-premises as well. Large-scale colocation sites are becoming more cost-effective than on-site data center improvements. The gray area between AI workload and infrastructure on the one hand and data center consolidation on the other is still being explored. Colocation data centers are gaining popularity because they offer strong physical security, adequate renewable power, scalability as density increases, low-latency connectivity options, dynamic cooling technology, and backup and disaster recovery options. They are even dubbed built-to-suit data centers. Hyperconvergence in these data centers is yet to be realized, but there are significant improvements in rack placement and in-rack cooling efficiency. These efficiencies drive up the mean time between failures.
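The cost-effectiveness claim for colocation comes down to simple amortization arithmetic. Here is a hedged sketch of that comparison; every dollar figure is a made-up placeholder, since actual capex and opex vary widely by site and contract.

```python
# Hypothetical annual-cost comparison: on-site data center build-out vs.
# a colocation contract. All figures are illustrative assumptions.

def annual_cost(capex: float, amortize_years: int, annual_opex: float) -> float:
    """Yearly cost = amortized capital expense plus operating expense."""
    return capex / amortize_years + annual_opex

# Assumed: on-prem needs heavy up-front facility upgrades (power, cooling),
# while colocation shifts most of that into recurring fees.
on_prem = annual_cost(capex=5_000_000, amortize_years=10, annual_opex=1_200_000)
colo    = annual_cost(capex=500_000,   amortize_years=10, annual_opex=1_400_000)

print(f"On-prem:    ${on_prem:,.0f}/year")
print(f"Colocation: ${colo:,.0f}/year")
```

The point of the sketch is the shape of the trade-off, not the numbers: colocation trades a large facility capex for higher recurring fees, and wins whenever the avoided build-out costs dominate.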

There is talk of even greater efficiency being required for quantum computing, which will certainly drive up the demands on computational power, data colocation, networks capable of supporting billions of transactions per hour, and cooling efficiency. Sustainability is also an important perspective, driven and sponsored by both the market and leadership. Hot air from the data center, for instance, finds new applications in comfort heating, as do solar cells and battery storage.
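The heat-reuse idea can also be quantified roughly: nearly all electrical power drawn by IT equipment ends up as heat. The sketch below estimates the thermal energy recoverable per year; the 1 MW IT load and 70% capture fraction are illustrative assumptions.

```python
# Rough estimate of reusable data center waste heat. The IT load and
# capture fraction below are assumed figures for illustration.

HOURS_PER_YEAR = 8760

def annual_heat_mwh(it_load_kw: float, capture_fraction: float) -> float:
    """Thermal energy (MWh/year) recoverable from the IT load,
    treating essentially all electrical draw as heat output."""
    return it_load_kw * HOURS_PER_YEAR * capture_fraction / 1000.0

# Assumed: a 1 MW IT load with 70% of its heat captured for reuse.
recovered = annual_heat_mwh(it_load_kw=1000, capture_fraction=0.7)
print(f"Recoverable heat: {recovered:,.0f} MWh/year")
```

Even with conservative capture fractions, a megawatt-scale facility throws off thousands of megawatt-hours of heat a year, which is why district- and comfort-heating reuse keeps coming up in sustainability plans.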

The only constant seems to be the demand on the people for understanding the infrastructure behind high-performance computation.

Previous article: IaCResolutionsPart219.docx

 
