Forget the cloud. Fancy building your own rig for machine learning work? Independent ML researcher Emil Wallner recently decided to do just that.
In a lengthy blog entry, Wallner detailed his thinking and the various options he weighed as he designed and deployed one such system.
Your very own ML system
The system he eventually built is based on the AMD EPYC 2 platform with 256GB of system RAM and four Nvidia RTX A6000 professional-grade graphics cards offering 192GB of GPU memory in total. For the uninitiated, each A6000 outscores the GeForce RTX 3090, the card so highly sought after by gamers, on GPU benchmarks.
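For anyone sanity-checking a comparable build, here is a minimal sketch that enumerates the installed GPUs and tallies their memory. It assumes a Python environment with PyTorch and CUDA installed, which the article does not specify:

```python
# Minimal sketch: enumerate CUDA GPUs and total their memory.
# Assumes PyTorch with CUDA support is installed.
import torch

if torch.cuda.is_available():
    count = torch.cuda.device_count()
    total_gb = 0.0
    for i in range(count):
        props = torch.cuda.get_device_properties(i)
        gb = props.total_memory / 1024**3
        total_gb += gb
        print(f"GPU {i}: {props.name}, {gb:.0f} GB")
    # On a rig like Wallner's, this should report four RTX A6000s
    # and roughly 192 GB in total.
    print(f"{count} GPU(s), {total_gb:.0f} GB total GPU memory")
else:
    print("No CUDA-capable GPU detected")
```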
Total cost? A cool US$29,000, after tapping into Nvidia’s startup and education discount program.
Of course, there are multiple considerations to juggle, ranging from which GPU delivers the best cost-benefit for ML and the limitations of particular GPU options, to motherboards that can support multiple GPUs and infrastructure concerns such as power and cooling. For instance, Wallner notes that water-cooling requires maintenance and risks leaking during transportation.
For now, Nvidia is the only real option until AMD’s GPU ML libraries are “more stable”, he says. But the consumer-centric RTX 3090 is in short supply and simply runs too hot to use reliably with four cards fitted next to each other, unless special PCIe risers are used – which expose the cards to dust. That leaves the prosumer A6000 or the enterprise-level A100.
Why not buy it online?
Wallner also addressed the multi-GPU consumer systems being sold online. These are not suitable for ML, he explained, because they are built for cryptocurrency mining and designed with only that objective in mind.
The main difference is the low-bandwidth USB adapters used to connect the various GPUs, which carry data but no power. Wallner notes that these adapters are often of poor quality and can destroy the hardware or catch fire due to short circuits. They are no substitute for the proper PCIe risers an ML rig needs, which deliver a robust amount of power.
“Crypto rigs also use mining power supplies from Alibaba with poor standards or retrofit enterprise power supplies. Since people tend to place them in garages or containers, they accept the added safety risk,” he summed up.
An eye on experimentation
But why build, when one can access similar – and even more powerful – systems on a public cloud platform such as AWS for a reasonable hourly fee?
The main reason to own hardware boils down to one’s workflow, wrote Wallner. He elaborated in a Reddit comment: “Workflow. Especially for R&D where deep learning is at the core of what you do. Owning hardware encourages robust experimentation, while AWS becomes a distracting cost-saving game.”
“AWS users squeeze everything out of pre-emptive instances with clever scripts. They can spend days where they struggle to get an instance, they have to turn the instances on and off all the time, download data for local storage, they lose work, and forget resources that accumulate cost. It’s stressful.”
An ML engineer agreed in a tweet: “People usually calculate in GPU hours but really if you have a blank jupyter notebook open and are thinking for extended periods. It's really stressful and you feel you're wasting $ by thinking.”
In case you were thinking of putting such a system at home or in the office, think again.
Wallner recommends placing such a system in a proper data center due to the noise and heat it generates, and even outlined typical colocation rates. The reason? “I find 4 GPUs too loud and generate too much heat to have in an office or at home without proper cooling... Think (about it as), a small leaf blower with hot air, equal to a 1,600W radiator.”
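That 1,600W figure tracks with a back-of-envelope estimate. The sketch below uses the RTX A6000’s rated 300W TDP; the overhead figure for the CPU, RAM and drives is an illustrative guess, not a number from the article:

```python
# Rough power/heat estimate for a four-GPU rig.
A6000_TDP_W = 300         # Nvidia's rated TDP per RTX A6000
NUM_GPUS = 4
SYSTEM_OVERHEAD_W = 400   # CPU, RAM, drives, fans (assumed figure)

total_w = A6000_TDP_W * NUM_GPUS + SYSTEM_OVERHEAD_W
print(f"Estimated draw under load: {total_w} W")
# Nearly all of that power leaves the box as heat, hence the
# comparison to a 1,600W radiator.
```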
Image credit: iStockphoto/solarseven