Many of the BigTech companies maintain datacenters for their private use. Yet the changes that AI workloads demand of datacenter infrastructure are not covered in nearly as much detail as those in the public cloud. Advances in machine learning, increased computational power, and the growing availability of and reliance on data are not limited to the public cloud. This article dives into the corresponding changes in so-called “on-premises” infrastructure.
AI has been integrated into a diverse range of industries, with key applications standing out in each: personalized learning experiences in education, autonomous vehicle navigation in automotive, predictive maintenance in manufacturing, design optimization in architecture, fraud detection in finance, demand forecasting in retail, improved diagnostics and monitoring in healthcare, and natural language processing in technology.
Most AI deployments can be split into training and inference. Inference depends on training, and while it can serve mobile and edge computing with new data, training places heavy demands on both computational power and data ingestion. Lately, data ingestion itself has become so computationally expensive that the hardware has had to evolve. Servers are increasingly being upgraded with GPUs, TPUs, or NPUs alongside traditional CPUs. Graphics processing units (GPUs) were initially developed for high-quality video graphics but are now used for high volumes of parallel computing tasks. A neural processing unit (NPU) is a specialized hardware accelerator for neural-network workloads. A tensor processing unit (TPU) is an application-specific integrated circuit (ASIC) built to increase the efficiency of AI workloads.
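As a minimal sketch, this is how a training script might target whichever accelerator is present, assuming PyTorch; the fallback order is an illustrative assumption, and TPUs are omitted because they typically require a separate runtime such as torch_xla.

import torch

def pick_device() -> torch.device:
    # Prefer a CUDA-capable GPU if one is installed in the server.
    if torch.cuda.is_available():
        return torch.device("cuda")
    # Apple-silicon NPUs are reached through the Metal Performance Shaders backend.
    if torch.backends.mps.is_available():
        return torch.device("mps")
    # Otherwise fall back to the traditional CPU.
    return torch.device("cpu")

device = pick_device()
model = torch.nn.Linear(128, 10).to(device)  # place the model on the chosen accelerator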
The more intensive the algorithm or the higher the throughput of data ingestion, the greater the computational demand. This results in about 400 watts of power consumption per server and about 10 kW for high-end performance servers. Datacenter cooling with traditional fans no longer suffices, and liquid cooling is called for.
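To make the cooling argument concrete, here is a back-of-the-envelope sketch in Python using the wattage figures above; the servers-per-rack counts and the roughly 20 kW air-cooling ceiling are assumptions for illustration.

# Rough rack-density arithmetic with the per-server figures quoted above.
AIR_COOLING_LIMIT_KW = 20  # assumed practical ceiling for fan-based cooling

racks = [
    ("CPU rack", 42, 400),    # assumed 42 general-purpose 1U servers at ~400 W each
    ("GPU rack", 8, 10_000),  # assumed 8 high-end accelerated servers at ~10 kW each
]
for label, servers, watts in racks:
    rack_kw = servers * watts / 1000
    cooling = "liquid cooling" if rack_kw > AIR_COOLING_LIMIT_KW else "air cooling"
    print(f"{label}: {rack_kw:.1f} kW per rack -> {cooling}")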
Enterprises that want to introduce AI to their on-premises IT infrastructure quickly discover the same ballooning costs that they feared with the public cloud, and these costs don't come only from real estate and energy consumption but also from more inventory, both hardware and software. While traditional on-premises data was housed in dedicated storage systems, the rise of AI workloads poses DevOps requirements and challenges that can only be addressed by an exploding number of data and code pipelines.
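As a minimal sketch of one such data pipeline, assuming plain Python with hypothetical paths; a real deployment would typically hand these stages to an orchestrator such as Airflow.

import json
from pathlib import Path

def ingest(raw_dir: Path) -> list[dict]:
    # Read newline-delimited JSON records produced by upstream systems.
    records = []
    for f in raw_dir.glob("*.jsonl"):
        records.extend(json.loads(line) for line in f.read_text().splitlines())
    return records

def transform(records: list[dict]) -> list[dict]:
    # Keep only records that carry the fields the training job expects.
    return [r for r in records if "features" in r and "label" in r]

def load(records: list[dict], out_file: Path) -> None:
    # Write the curated dataset where the training cluster can reach it.
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text("\n".join(json.dumps(r) for r in records))

load(transform(ingest(Path("raw"))), Path("curated/train.jsonl"))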
Colocation of data is a significant challenge, both in terms of networking and the volume of data in transit. The concept of the Lakehouse architecture is finding popularity on-premises as well.
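A minimal sketch of the Lakehouse pattern on-premises, assuming a Spark cluster with the open-source Delta Lake libraries on its classpath; the MinIO bucket and paths are hypothetical.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("onprem-lakehouse-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Land raw files on commodity on-prem object storage, then expose them as an ACID table.
raw = spark.read.json("s3a://onprem-minio/raw/events/")  # hypothetical bucket
raw.write.format("delta").mode("append").save("s3a://onprem-minio/lake/events")

# Downstream training and reporting jobs read the same governed table.
events = spark.read.format("delta").load("s3a://onprem-minio/lake/events")
events.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()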
Large-scale colocation sites are becoming more cost-effective than on-site datacenter improvements. The gray area between workload and infrastructure for AI versus consolidation in datacenters is still being explored. Colocation datacenters are gaining popularity because they offer the highest physical security, adequate renewable power, scalability as density increases, low-latency connectivity options, dynamic cooling technology, and backup and disaster-recovery options. Some are even dubbed built-to-suit datacenters. Hyperconvergence in these datacenters is yet to be realized, but there are significant improvements in redesigning rack placement and improving in-rack cooling efficiency. These efficiencies drive up the mean-time-between-failures.
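One way to see why a higher mean-time-between-failures matters is the standard availability formula A = MTBF / (MTBF + MTTR); the MTBF and repair-time figures in this sketch are illustrative assumptions.

# Availability improves as MTBF grows relative to the mean time to repair (MTTR).
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

MTTR_HOURS = 8.0  # assumed time to swap a failed node
for mtbf in (10_000.0, 50_000.0):  # illustrative MTBF before/after cooling improvements
    print(f"MTBF {mtbf:>8.0f} h -> availability {availability(mtbf, MTTR_HOURS):.5%}")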
There is talk of even greater efficiency being required for quantum computing, which will certainly drive up the demands on computational power, data colocation, networks capable of supporting billions of transactions per hour, and cooling efficiency. Sustainability is also an important perspective, driven and sponsored by both the market and leadership. Hot air from the datacenter, for instance, is finding new applications in comfort heating, as are solar cells and battery storage.
The only constant seems to be the demand on people to understand the infrastructure behind high-performance computing.
Previous article: IaCResolutionsPart219.docx