How does interconnect topology such as NUMA affect CPU performance in multi-socket systems?

#1
12-08-2023, 04:39 PM
The performance of multi-socket systems can be a bit of a tangled web, and understanding how interconnect topology like NUMA shapes that performance is crucial. I know most folks think about cores and clock speeds like they’re the whole picture, but the way CPUs access memory plays a massive role in how everything performs in those setups.

When you have multiple sockets, each with its own CPU, memory access becomes more complex than in a single-socket system. I was looking into AMD's EPYC processors recently and their NUMA layout, and it really hit me how much it matters. Each socket has its own local memory, so a core reading from that local memory is faster than reaching across to another socket's memory. You’ve got to consider bandwidth, latency, and where your data actually sits when you're running multiple workloads.
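If you want to see what your own box looks like, libnuma makes the topology easy to inspect. Here’s a minimal sketch of my own (Linux-only, compile with gcc topo.c -lnuma); numa_distance() reports the relative access cost between nodes, where 10 means local:

```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "this machine has no NUMA support\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();
    printf("%d NUMA node(s)\n", nodes);
    for (int a = 0; a < nodes; a++) {
        long long freemem;
        long long size = numa_node_size64(a, &freemem);  /* bytes */
        printf("node %d: %lld MiB total, %lld MiB free\n",
               a, size >> 20, freemem >> 20);
        for (int b = 0; b < nodes; b++)
            printf("  cost to node %d: %d\n", b, numa_distance(a, b));
    }
    return 0;
}
```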

Let’s take a closer look. Imagine you’re running a high-performance database or a machine learning model that crunches numbers for hours. In a dual-socket server, each CPU has a direct link to its own memory but can also reach the other CPU’s memory over the socket interconnect (Infinity Fabric on EPYC, UPI on recent Xeons). That remote access comes with a latency penalty. This non-uniformity is exactly what NUMA (Non-Uniform Memory Access) describes: local memory is fast, remote memory costs extra hops, and software that respects the distinction runs noticeably better.
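You can measure the local-versus-remote gap yourself with a pointer-chase microbenchmark. This is only a sketch under assumptions worth flagging: a two-node system, libnuma installed, node 0 as the node we pin to; compile with gcc -O2 chase.c -lnuma:

```c
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (8UL * 1024 * 1024)   /* 8M pointers = 64 MiB per buffer */

/* Sattolo's shuffle builds one big cycle, so every load depends on the
   previous one and the prefetcher can't hide the latency. */
static void build_chain(size_t *buf) {
    for (size_t i = 0; i < N; i++) buf[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }
}

static double ns_per_access(size_t *buf) {
    struct timespec a, b;
    size_t idx = 0;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (size_t i = 0; i < N; i++) idx = buf[idx];
    clock_gettime(CLOCK_MONOTONIC, &b);
    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    return ns / N + (idx == (size_t)-1);   /* keep idx live */
}

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    numa_run_on_node(0);                   /* run on node 0's cores only */
    int nodes = numa_num_configured_nodes();
    for (int node = 0; node < nodes && node < 2; node++) {
        size_t *buf = numa_alloc_onnode(N * sizeof(size_t), node);
        if (!buf) { perror("numa_alloc_onnode"); return 1; }
        build_chain(buf);
        printf("memory on node %d: %.1f ns/access\n", node, ns_per_access(buf));
        numa_free(buf, N * sizeof(size_t));
    }
    return 0;
}
```

The node 1 line is the remote case; the spread between the two numbers is the NUMA penalty on that particular machine.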

When I started getting into server architecture, I learned how essential it is to keep your tasks balanced across CPUs. If one CPU is overloaded while the other sits there twiddling its thumbs, you’re going to see performance drop. Say you’re running SQL Server with a large dataset on an Intel Xeon Scalable system. If the threads running those transactions all hit memory attached to one socket, you’ll face delays because that socket’s memory controllers are saturated while the other CPU is perfectly capable of taking some of that load.

A practical example is using something like a Dell PowerEdge R740xd, which supports dual-socket configurations. If your databases are configured to only pull data from one socket’s memory because they’re pinned to specific cores by your operating system, you’re potentially underutilizing the hardware. This situation can become even trickier as your workloads vary in size and type.

Now, think about modern applications and how they operate. Cloud-native apps, for instance, are designed to have workloads dynamically spread out, and they thrive in environments that deliver low latency. If you’re deploying containers on a multi-socket server with NUMA, the orchestration layer needs to place workloads deliberately, so that cross-socket memory access is minimized. If you’ve ever worked with Kubernetes, you know how important it is to specify resource requests and limits, and its CPU Manager (with the static policy) and Topology Manager exist precisely so that a pod’s exclusive cores and memory can be kept on a single NUMA node.
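Under the hood, all of that pinning machinery boils down to the OS restricting a process to a set of cores. A hypothetical sketch of the idea (Linux-only; the assumption that cores 0–7 belong to socket 0 varies by machine, so check lscpu first):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &mask);        /* allow cores 0-7 only (assumed socket 0) */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("process restricted to socket 0's cores\n");
    return 0;
}
```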

From my experience, tuning your operating system for NUMA is critical. Most modern OSs like Linux or Windows offer NUMA-aware scheduling, but it doesn’t always work perfectly out of the box, because the defaults can’t anticipate every workload pattern. If you’re running a demanding application, you’ll want to tweak things. On Linux, for example, sysctls like kernel.numa_balancing (automatic migration of pages toward the threads using them) and vm.zone_reclaim_mode (whether to reclaim local pages before falling back to a remote node) change how memory is handled across nodes, and I’ve seen cases where those adjustments led to significant performance boosts.
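Policy can also be set per allocation rather than system-wide. As a hedged example (libnuma, compile with -lnuma): a big shared structure that every thread touches evenly is often better interleaved page-by-page across nodes than left entirely on whichever node first touched it:

```c
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) return 1;
    size_t len = 1UL << 30;                 /* 1 GiB shared buffer */
    /* Pages are spread round-robin across all nodes: worse best-case
       latency than pure local memory, but even bandwidth use and no
       single hot socket. */
    char *cache = numa_alloc_interleaved(len);
    if (!cache) { perror("numa_alloc_interleaved"); return 1; }
    memset(cache, 0, len);                  /* fault the pages in */
    printf("interleaved across %d node(s)\n", numa_num_configured_nodes());
    numa_free(cache, len);
    return 0;
}
```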

It’s not just about the OS, though. Application developers also need to be conscious of NUMA effects. Some databases, like PostgreSQL or Oracle 19c, have their own degree of NUMA awareness and can spread work effectively across multiple sockets. If you write your application to be NUMA-aware, using interfaces like libnuma to keep allocations local to the threads that use them, you let it run faster by maximizing local memory access.
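A common pattern here is the first-touch idiom: on Linux, a page typically lands on the node of the thread that first writes it, so having each worker initialize its own slice keeps that data local. A sketch under that assumption (default local allocation policy; real code would also pin each worker to a socket before touching its slice; compile with -pthread):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4
#define SLICE (64UL * 1024 * 1024)      /* 64 MiB per worker */

static char *buf;

static void *init_slice(void *arg) {
    long id = (long)arg;
    /* The first write faults each page in on the node this thread runs on. */
    memset(buf + id * SLICE, 0, SLICE);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    buf = malloc(NTHREADS * SLICE);
    if (!buf) return 1;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, init_slice, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    puts("each slice now resides near the thread that touched it");
    free(buf);
    return 0;
}
```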

Now imagine you’re running a video encoding application on a dual-socket AMD EPYC server. You may start with threads spread fairly evenly, each CPU chewing through HD video out of its local memory, but if one workload saturates the memory bandwidth on one socket, you could be facing some real delays. What’s fascinating here is that some workloads don’t scale linearly: you won’t always get double the performance from adding a second CPU, because of those latency issues and bandwidth contention.
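Bandwidth effects are easy to demonstrate with a crude streaming probe. This is a rough sketch, not a proper benchmark like STREAM (libnuma again, gcc -O2 -lnuma, assumes at least two nodes): it stays pinned to node 0 and writes a large buffer placed on each node in turn, and the remote number usually comes out visibly lower:

```c
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define LEN  (1UL << 30)   /* 1 GiB */
#define REPS 4

int main(void) {
    if (numa_available() < 0) return 1;
    numa_run_on_node(0);                    /* stay on socket 0's cores */
    int nodes = numa_num_configured_nodes();
    for (int node = 0; node < nodes && node < 2; node++) {
        char *buf = numa_alloc_onnode(LEN, node);
        if (!buf) { perror("numa_alloc_onnode"); return 1; }
        memset(buf, 1, LEN);                /* fault pages in before timing */
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int rep = 0; rep < REPS; rep++)
            memset(buf, rep, LEN);          /* stream 1 GiB of writes */
        clock_gettime(CLOCK_MONOTONIC, &b);
        double s = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        printf("writing to node %d memory: %.1f GB/s\n",
               node, (double)REPS * LEN / s / 1e9);
        numa_free(buf, LEN);
    }
    return 0;
}
```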

In my day-to-day work, I often come across clients who expect a simple linear boost in performance with multi-socket setups, but that’s usually not how it works. Communication delays and memory access inefficiencies can create bottlenecks that nullify the advantages of adding more CPUs. You don’t just throw hardware at problems; you need to think about how each component interacts in the bigger picture.

Another aspect to keep in your back pocket is that different CPU architectures deal with NUMA in ways that can influence performance benchmarks. Consider Intel’s Ice Lake parts against AMD’s Zen 3 processors: each has its own strategy for managing memory across sockets (EPYC can even be carved into multiple NUMA domains per socket via the NPS BIOS setting), and the performance characteristics can vary significantly depending on the workload. If you’re in a mixed environment or considering a migration, doing due diligence by running benchmarks suited to your specific applications is key.

With all that in mind, you can also take advantage of profiling tools to understand where the bottlenecks are coming from. Tools like perf on Linux (perf mem and perf c2c in particular) or Visual Studio’s performance profiler on Windows go a long way toward analyzing memory access patterns. They let you see which socket’s memory is being hit hardest, giving you the insight needed to redistribute the workload more evenly.
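Even without a profiler, Linux will tell you where a process’s pages actually live: /proc/<pid>/numa_maps lists each mapping with per-node page counts. A trivial sketch that dumps its own placement:

```c
#include <stdio.h>

int main(void) {
    /* Each line shows a mapping, its memory policy, and counts like
       N0=... N1=... of pages resident on each node. */
    FILE *f = fopen("/proc/self/numa_maps", "r");
    if (!f) { perror("numa_maps"); return 1; }
    char line[512];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}
```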

At the end of the day, I think it’s crucial to keep a balanced perspective. Multi-socket systems can offer great power and performance, especially when you tune for NUMA. They’re incredible when it comes to scaling applications or processing large datasets. The interconnect topology has a significant influence on performance, but it requires a blend of the right hardware, a well-tuned operating system, and memory-aware applications to truly shine.

If you approach the design and implementation with an understanding of NUMA and what it means for memory access and workload distribution, you’re setting yourself up for success. It’s one of those aspects that might not get the flashy attention but is essential for making sure those systems run as effectively as possible. I’ve seen first-hand how tackling this fundamental architecture concept changes the game for teams focused on performance.

savas
Joined: Jun 2018