Building a machine learning server in 2024

James Spinella
12 min read · Jun 1, 2024


Nvidia GPU stock graphic (source: Nvidia)

It’s hard to believe we’re already in June- a lot’s happened this year in the world of AI. Nvidia stock has surpassed $1,000/share and may soon take Microsoft’s place as the most valuable company in the world (currently it’s #3, behind Microsoft and Apple, respectively). OpenAI released GPT-4o, a speedier alternative to GPT-4. Elsewhere, Tesla continues to catch flak for its autonomous driving functionality being decidedly less-than.

Nevertheless, as is tradition, everyone and their mother seems to be trying to take advantage of the “new hotness” to do things AI frankly can’t or shouldn’t be doing. Software Engineers everywhere are quickly learning that building something even remotely as powerful as ChatGPT takes substantial resources, even with the so-called “offerings” from the likes of AWS, Azure and GCP. Heck, even ChatGPT is only right about half the time.

At the risk of being the pot calling the kettle black, I am working on a startup attempting to leverage this recent AI renaissance to create something that (I think) is genuinely useful. As part of this pursuit, I’ve found cloud services (including OpenAI’s API) to be incredibly expensive. I’ve always been partial to on-prem because I do love playing with hardware, but still, the numbers speak for themselves: querying OpenAI’s API with our startup’s dataset would cost tens of thousands of dollars- and there’s no guarantee that would be the last of it. Odds are, we will want to re-process our data with newer, more-accurate models as time goes on.

Picking the hardware

As it turns out, the single most valuable spec when it comes to a GPU is its VRAM. Nvidia knows this, and indeed finding a GPU with at least 20GB of VRAM for under $1,000 is difficult to pull off. VRAM matters for both model training and model inference, but training typically consumes far more of it: besides the model weights, the GPU has to hold gradients, optimizer state, and the activations for each batch, whereas inference only needs the weights plus a comparatively small working set. Generally speaking, we should aim for the most VRAM per dollar. This is something that is always changing. Perhaps 2 years ago, the most cost-effective GPU in this regard was the RTX 3080, while today it might be the RTX 4090 (I honestly have no idea, and it’s going to depend on the country you are in, for starters).
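To make that concrete, here’s a back-of-the-envelope sketch. The parameter count, precision, and activation allowance below are illustrative assumptions of mine, not measurements, but they show why training is so much hungrier than inference:

```python
# Rough VRAM estimate for a transformer-style model (illustrative numbers only).
# Training has to hold weights + gradients + optimizer state + activations;
# inference only needs the weights plus a comparatively small working set.

def gib(n_bytes: float) -> float:
    return n_bytes / 1024**3

params = 7e9               # hypothetical 7B-parameter model
bytes_per_param = 2        # fp16/bf16 weights

weights = params * bytes_per_param
gradients = params * bytes_per_param      # one gradient per weight
adam_state = params * 4 * 2               # Adam keeps two fp32 moments per weight
activations = 0.2 * weights               # crude allowance; depends on batch size and sequence length

print(f"Inference (weights only): ~{gib(weights):.0f} GiB")
print(f"Training (weights + grads + optimizer + activations): "
      f"~{gib(weights + gradients + adam_state + activations):.0f} GiB")
```

Even with a generous hand-wave on activations, a full fine-tune of a 7B model blows well past a single 24GB card- which is why tricks like LoRA, quantization, and gradient checkpointing (or simply smaller models) are the norm on hardware like this.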

Part of the reason it’s so difficult to tell you which GPU is going to get you the most VRAM for your dollar is that the final cost of your server depends on how much power your GPU or GPUs consume, how many GPUs you put in the server, and what kind of balance you want to strike between VRAM and GPU power. For example, both the RTX 3090 and the RTX A5000 have 24GB of VRAM, but the A5000 is about twice the price and only ~80% as powerful as the RTX 3090. In this case, the choice may be clear: you probably want the RTX 3090.

Still, the A5000 is expensive for a few reasons. Being a Quadro-class card, the A5000 supports Nvidia’s GPU virtualization functionality, meaning you could assign a single A5000 to several virtual machines on a server. This is a nice-to-have, but it’s probably not something you need. I would love to have a VM for model training, a VM for model inference, and a VM for my database, all leveraging the same GPU while having their own OSes, but it’s not a big deal at all to have everything on a single, bare-metal OS- especially for homelab/startup scenarios.

Another nice-to-have with the A5000 is that it’s only two PCIe slots wide, while the RTX 3090 is three slots wide. If you happened to want or need 72GB of VRAM and wanted all the GPUs to fit in a single ATX build, you would be unable to use three 3090s: a standard ATX board and case offer only seven expansion slots, and three triple-width cards need nine. Three A5000s (or another double-width GPU) would occupy only six slots and would therefore fit.

Finally, the A5000 has a TDP (thermal design power- that is, the maximum sustained power draw) of 230W vs the 3090’s 350W TDP. Power supplies get considerably more expensive past 1000W, and 1000W is already pushing it for a server with two RTX 3090s- in fact, it’s generally recommended to power-limit (or undervolt) the 3090s to around 250W each, which should have only a slight impact on performance for model training and inference.
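If you want to compare options systematically, a few lines of Python are enough to rank cards by VRAM per dollar. The prices below are placeholders I made up for illustration- plug in current listings from your own region before drawing any conclusions:

```python
# Toy VRAM-per-dollar comparison. The prices are made-up placeholders;
# check current new/used listings in your region before deciding.
candidates = {
    # name: (VRAM in GB, assumed price in USD)
    "RTX 3060 12GB": (12, 280),
    "RTX 3090":      (24, 750),
    "RTX A5000":     (24, 1400),
    "RTX 4090":      (24, 1700),
}

ranked = sorted(candidates.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (vram_gb, price) in ranked:
    print(f"{name:14s} {vram_gb:>2d} GB  ${price:>4d}  -> "
          f"{vram_gb / price * 1000:.1f} GB per $1,000")
```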

Brass tacks- if you’re reading this article, you’re probably new to the ML scene and just looking to get your feet wet without breaking the bank. If this sounds like you, my advice is to start with the lower end of “ML-appropriate” GPUs (any Nvidia GPU with 12GB or more of VRAM) and build a server that will be able to accommodate upgrades- an additional GPU or two, lots of system RAM, and a power supply with room to grow. Here’s how I followed these guidelines:

My sweet spot

Here in the year 2024, my hours and hours of research have led me to purchase two RTX 3090 GPUs, a 4-slot NVLink bridge, and an ASUS WRX80 motherboard with a Ryzen Threadripper PRO 3945WX CPU. Here’s why:

RTX 3090

There are many Nvidia GPU SKUs out there. Too many, some would say. It’s hard to keep up with them all, and Nvidia releases new GPUs each year. What I like about the RTX 3090 in particular is that it supports NVLink (similar to SLI, but with the advantage of a dedicated high-bandwidth link between GPUs), can be purchased for around $700–800 used on eBay, has a hefty stock cooler, and comes with a whopping 24GB of VRAM. I’m not aware of any other GPU that checks all of these boxes. Furthermore, the 3090 has more CUDA cores and higher overall performance than any of the Quadros in this price range.

Why NVLink? Some of you reading this may be aware that the RTX 3090 was one of the last consumer Nvidia GPUs to support NVLink. The 4000 series does not have NVLink, even at the high end (RTX 4090). Quadro (e.g. A5000, A6000) GPUs have NVLink, but again the value isn’t there for what you get, as these are enterprise-grade GPUs with enterprise features we’d be paying for but don’t need. NVLink is valuable because it is the best-performing means of combining two or more GPUs for tasks like training and inference. The NVLink bridge is many times faster than using the server’s PCIe lanes to transfer data between the GPUs. Through the 3090, we get NVLink while avoiding Quadro-series prices.
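Once a second card and the bridge are installed, it’s worth verifying the two GPUs can actually talk to each other directly. `nvidia-smi topo -m` will show the link topology, and the PyTorch sketch below (my own quick check, not an official diagnostic) confirms peer-to-peer access and gives a crude feel for transfer speed:

```python
# Quick sanity check that two GPUs can reach each other directly (peer-to-peer).
# With an NVLink bridge seated, P2P traffic should route over NVLink rather than PCIe.
import time
import torch

assert torch.cuda.device_count() >= 2, "this check needs two visible GPUs"

print("GPU 0 -> GPU 1 peer access:", torch.cuda.can_device_access_peer(0, 1))

# Crude bandwidth smoke test: copy a 1 GiB tensor from GPU 0 to GPU 1.
x = torch.empty(256 * 1024**2, dtype=torch.float32, device="cuda:0")
torch.cuda.synchronize(0)
t0 = time.perf_counter()
y = x.to("cuda:1")
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
elapsed = time.perf_counter() - t0
gib = x.numel() * x.element_size() / 1024**3
print(f"copied {gib:.1f} GiB in {elapsed * 1000:.1f} ms (~{gib / elapsed:.1f} GiB/s)")
```

Peer access being true doesn’t by itself prove the traffic rides the bridge rather than PCIe, but combined with `nvidia-smi topo -m` showing an NV-series link between the two cards, it’s a decent sanity check.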

My recommendation is to start out with just one 3090, knowing you can always add a second 3090 should you need or want it, and be comfortable knowing you don’t need to upgrade your motherboard, CPU, or power supply to do so.

A 4-Slot NVLink Bridge

As demand for Nvidia GPUs rose in the last couple of years, so did the demand for NVLink bridges, which come in 3-slot and 4-slot configurations. The naming isn’t great- “3-slot” means the two GPUs’ primary PCIe slots sit three slot positions apart, so counting inclusively there are FOUR slots from the NVLink port of GPU A to the NVLink port of GPU B.

NVLink bridge options (source: Nvidia)

The 3-slot NVLink bridge costs significantly more than the 4-slot NVLink bridge. As of this writing, a 3-slot bridge runs around $220 (used, on eBay), while a 4-slot bridge runs around $100 (used, on eBay). $100 isn’t so bad in the grand scheme of things, but consider also that with two RTX 3090s, the 4-slot bridge leaves a one-slot gap between the cards- this is great for cooling, as two 3090s sandwiched directly against each other will typically run hotter, particularly the one on top. With the 4-slot NVLink bridge, we’re saving money and improving thermals.
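If you do end up with two cards stacked closely, keep an eye on how much hotter the top one runs. Here’s a small monitoring sketch using the nvidia-ml-py (pynvml) bindings- nothing fancy, just a loop you can leave running in a terminal during a training run:

```python
# Poll GPU temperatures and power draw so you can see whether the top card runs hot.
# Requires the nvidia-ml-py bindings: pip install nvidia-ml-py
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        readings = []
        for i, h in enumerate(handles):
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000  # milliwatts -> watts
            readings.append(f"GPU{i}: {temp}C {power:.0f}W")
        print(" | ".join(readings))
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```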

If you’re starting off with just one 3090, there’s no need to buy an NVLink bridge now. Come back to this section if and when you’re buying a second 3090!

ASUS Pro WS WRX80E-SAGE SE WIFI Motherboard

What a mouthful.

The ASUS WRX80E is one of relatively few E-ATX motherboards out there that can accommodate two 3090 GPUs with the 4-slot NVLink bridge AND support a CPU with enough PCIe lanes to give each GPU a full 16 lanes. Most motherboards have only two PCIe x16 slots, and almost all of them are spaced such that we would need the 3-slot NVLink bridge, which immediately adds roughly $100 to the total build cost. It’s worth noting that if we stuck with a consumer-grade CPU such as a Ryzen 7 or Intel Core i7, we’re unlikely to find a motherboard with the slot spacing required for a 4-slot bridge without paying around $100 more for the motherboard- at that point it’s essentially a wash versus just buying the 3-slot bridge.

BUT, we’re using a workstation CPU and motherboard to get enough PCIe lanes to fully feed both 3090s. This gives us roughly a 10% performance uplift versus a consumer CPU, which will almost always have only 16 PCIe lanes for ALL of its PCIe slots- meaning in a two-GPU scenario, each 3090 would get only 8 lanes (half!) of bandwidth to the CPU. 10% may not seem like a lot, but for the money we’re paying, why take that hit? With sufficient planning (e.g. this article), we can have our cake and eat it too.
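Since the whole point of the workstation platform is feeding each card a full x16 link, it’s worth confirming the negotiated link width once everything is assembled. A sketch using the nvidia-ml-py (pynvml) bindings:

```python
# Confirm each GPU actually negotiated a full x16 link. A lane-starved consumer
# board (or a mis-seated card/riser) will show up here as x8 or x4.
# Requires the nvidia-ml-py bindings: pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(h)
    if isinstance(name, bytes):          # older pynvml versions return bytes
        name = name.decode()
    width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
    max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
    gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
    # Note: the link *generation* often drops while the card idles (power saving);
    # the width is the figure to care about here.
    print(f"GPU{i} {name}: PCIe gen {gen}, x{width} (max x{max_width})")
pynvml.nvmlShutdown()
```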

This motherboard gives us a few other features that are great for an on-prem server setup: IPMI (for remote, headless management- typically found only on enterprise-grade servers and workstations), a U.2 connector (for enterprise-grade SSDs), and an external CMOS clear button. If you plan on tweaking the CPU or RAM clocks at all, having an external CMOS clear button is fantastic. I am sick and tired of opening my computer case, finding the CMOS clear jumper, and shorting it with a screwdriver each time I take my overclock settings a little too far. Even if overclocking isn’t otherwise your thing, I would still recommend overclocking your RAM, especially in an AMD-based system such as this one, where RAM speed correlates with CPU performance.

As with all Threadripper and EPYC motherboards, the ASUS WRX80E is expensive: used, on eBay, it will run around $500–600. Still, this is about as cheap as it gets for a Threadripper or EPYC motherboard.

For about the same price, ASRock has a WRX80 E-ATX motherboard too. The main difference is that the ASRock motherboard has two Thunderbolt 3 ports rather than two USB-C ports. This is personal preference, but I’m much less-enthused by Thunderbolt today than I was three or four years ago. The technology just hasn’t panned out. If you’re not sure whether you need it, you don’t need it.

Ryzen Threadripper PRO 3945WX CPU

I thought by 2024 we would see consumer-grade CPUs with more than 16 PCIe lanes for the PCIe slots, but alas… Intel has a few CPUs with 20 PCIe lanes, but only 16 are for the PCIe slots, while 4 are reserved for an NVMe SSD. This leaves us with two choices: AMD’s Threadripper or EPYC CPUs, or Intel’s Xeon CPUs. These being enterprise CPUs, the price tags often start north of $1,000. The good news is, because the enterprise realm typically upgrades every two to four years, we can get a four-year-old CPU for a palatable $500–600. Enter the Threadripper PRO 3945WX. The least-powerful and cheapest of the 3000-series Threadripper PRO CPUs, the 3945WX balances performance and feature set with cost: 12 cores, 24 threads, and 128 PCIe lanes- far more than enough. At around $500 used on eBay, it’s an incredible value for a GPU-oriented server. NOTE: many eBay listings are for CPUs that are LOCKED to a specific vendor such as Lenovo. Make sure you aren’t buying a vendor-locked CPU. If the CPU is $100–200, it’s likely vendor-locked to Lenovo and will not work outside of a Lenovo motherboard.

A 1000W Power Supply

1000W isn’t all that much these days, with both CPUs and GPUs inching up in their TDP figures since ~2019. Let’s tally up our power requirements so far:

RTX 3090: 350W TDP (with occasional spikes to 650W)

Threadripper PRO 3945WX: 280W TDP

The total figure on paper for a single-3090 setup is around 650W. That’s not including a little bit for hard drives and SSDs, the motherboard itself, and fans. Conservatively, we’re looking at a 700W power draw under full load- with transient spikes to roughly 1000W if we do not power-limit the 3090. So again, a 1000W power supply is the MINIMUM here. We are riding the fence between maximum wattage and not paying $200+ for a power supply. If you don’t mind going used, 1200W or even 1500W for under $200 is easily achievable. If you anticipate adding a second 3090 in the future, the more the merrier- try for 1500W. Still, I am riding with a mere 1000W PSU. Consider that it’s unlikely the CPU and GPU(s) will be 100% floored at the same time. Sure, the CPU can consume a maximum of 280W, but during model training and inference it will probably be drawing around 100W while the GPUs draw 350W apiece (or less if we power-limit and/or undervolt them). The CPU can safely be undervolted as well for additional power savings- we certainly don’t need all 12 CPU cores running flat-out for ML workloads.

I’m awaiting the final part for this build, but I expect to be just fine with 1000W. I am estimating the max power draw after undervolting to be around 800–850W (200W from the CPU and 300W from each of the 3090 GPUs).
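For what it’s worth, here’s the back-of-the-envelope math behind that estimate. The TDP figures come from the sections above; the allowance for drives, fans, and the motherboard is my own rough guess. (Applying the GPU limit itself is typically done with `nvidia-smi -pl <watts>`.)

```python
# Back-of-the-envelope power budget. CPU and GPU TDPs are from the text;
# the "misc" allowance for drives, fans, and the motherboard is a rough guess.
CPU_TDP = 280        # Threadripper PRO 3945WX
MISC = 75            # drives, fans, motherboard, etc.

def total_draw(n_gpus: int, watts_per_gpu: int) -> int:
    return CPU_TDP + n_gpus * watts_per_gpu + MISC

for n_gpus in (1, 2):
    for gpu_watts in (350, 300, 250):   # stock TDP vs two power-limit levels
        print(f"{n_gpus} x 3090 @ {gpu_watts}W -> ~{total_draw(n_gpus, gpu_watts)}W total")
```

With both cards at stock, the paper total already crosses 1000W before transient spikes are counted- which is exactly why the power limit matters on a 1000W unit.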

Fractal North XL

Which case you choose is inconsequential so long as it will accommodate the motherboard, which is E-ATX- meaning it is wider than an ATX board. A typical ATX mid-tower case may work, but if so, it will be a tight fit. For example, the regular Fractal North will accommodate some E-ATX motherboards, but the cable-routing holes will be blocked.

I like the Fractal North XL because it’s one of the few stylish E-ATX cases out there that does NOT have tempered glass (though it does have a tempered glass variant if that’s your thing). This computer is going in a closet. I don’t need it to look pretty, I need it to have excellent airflow. Mesh in lieu of glass is a smart choice.

Note: The Fractal North XL has a non-removable PSU shroud that obscures virtually all of the ports and connectors facing down along the bottom edge of the ASUS Pro WRX80E motherboard. These are all ports we do not need (and that you likely never will), such as USB 2.0 headers, the front-panel audio header, headers for front-panel LEDs found only on select enterprise enclosures, and two 6-pin PCIe power connectors that feed extra power to the motherboard’s PCIe slots- a provision intended for slot-power loads well beyond our two (or three, or four) GPUs.

What about RAM?

Oh right. It’s really whatever*. This is the least-important part of our ML server setup. It’s also the easiest to upgrade over time- our motherboard has a whopping 8 RAM slots. This means that even if you initially purchase two 8GB DDR4 RAM sticks (aka DIMMs), you could still upgrade to 64GB while keeping all the RAM sticks identical (8 x 8 = 64). Still, I would recommend each stick have at least 16GB so you could upgrade to 128GB while still using the initial sticks you purchase. It stings a little to start off with two 8GB sticks and end up throwing them out because you ended up with eight sticks of 16GB. Additionally, as this CPU supports ECC RAM, I would pay the slight (looks to be around ~10%) market upcharge to get ECC RAM (Error-Correcting Code RAM- I won’t dive into the benefits here, just know they are worth paying a little extra for).

Popular, open-source models like Llama are being run (“inferenced”) all over the world off of the CPU and system RAM, and while the RAM requirement varies, I’ve been seeing ~64GB cited as the “minimum”. If you intend on running these models on the CPU using system RAM, do some research on the current and expected future RAM requirements. Again, as a general rule, I would not go below 64GB (e.g. 2x 32GB sticks) if you want to be able to run something like Llama on the CPU. Ideally, models are loaded onto GPUs for inference, but 64GB+ of VRAM costs far more than this article’s target audience is likely to want to spend.
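As a rough guide, you can estimate the RAM a CPU-only setup needs from the parameter count and the quantization level. The 20% overhead factor below is a loose assumption of mine (context/KV cache and runtime buffers vary by tool), so treat the output as a ballpark, not a spec:

```python
# Very rough RAM estimate for running a quantized model on the CPU.
# The 20% overhead factor is a loose assumption for context and runtime buffers.
def est_ram_gib(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1024**3

for params_b in (8, 70):                 # e.g. the 8B and 70B Llama 3 sizes
    for bits in (4, 8, 16):
        print(f"{params_b}B model @ {bits}-bit: ~{est_ram_gib(params_b, bits):.0f} GiB")
```

By this math an 8B model at 4-bit fits in a few GiB, while a 70B model wants roughly 40 GiB at 4-bit and far more at higher precision- which lines up with the ~64GB “minimum” once you leave headroom for the OS and everything else.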

*Ryzen Threadripper PRO CPUs provide a dedicated memory channel to each RAM slot on the motherboard. This is unheard of in the consumer part space and means maximum memory bandwidth can only be achieved by populating all of the RAM slots. As we are building an ML server, the performance gains here will be unnoticed, but it is worth mentioning. In other words, if you aren’t on a tight budget, and want the absolute maximum performance possible for CPU-oriented tasks you’ll run on this server, get eight RAM sticks.

Cost Breakdown

All told, you’re looking at around $2,900 for this setup (with two RTX 3090s), which is not bad at all for a capable on-prem ML server.

Case: $100–200

Motherboard: $500–600

CPU: $500–600

GPU: $700–800 ea.

RAM: $75–150+

PSU: $150–200+

NVMe SSD: $75–150

NVLink Bridge: $100–200

Total: ~$1,800 — $3,700 (mean of $2,750)

--

James Spinella

Growing up, I loved building computers, and now I write code for a living. I am also interested in the “human element” of software development.