05-20-2022, 05:20 PM
When we're talking about CPUs and their interaction with non-uniform memory access (NUMA) systems, I'm always excited to share my thoughts, because it's one of those topics that can get a bit technical, but once you wrap your head around it, it opens up a whole new understanding of how modern computing works.
Let me set the stage. In a traditional system, all of the CPUs share one pool of memory, and any of them can reach any part of it in roughly the same amount of time. With NUMA, it's a bit different: each CPU socket has its own local memory, which it can access much faster than memory attached to another socket (so-called remote memory). This can impact performance greatly, especially in multi-socket machines like Dell's PowerEdge or HPE's ProLiant servers, which people often use in data centers.
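Just to make that concrete, here's roughly how I'd check the layout on a Linux box (assuming the numactl package is installed); on a single-node machine you'll just see "node 0":

```bash
# Inspect the NUMA layout of a Linux host.
lscpu | grep -i numa        # quick summary: node count and CPU-to-node mapping
numactl --hardware          # per-node memory sizes plus the node distance matrix
```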
The way CPUs access memory in a NUMA architecture is crucial because, in essence, it's all about efficiency. When I'm working on an application, be it a large database or something even more complex like a data analytics workflow, I want the data the CPU needs to be as close to it as possible. If the CPU has to fetch data from a remote memory node, the access takes longer, and that extra memory latency is just a fancy way of saying the CPU has to wait longer to get the information it needs.
Now, you might wonder how the system manages this whole process. Well, it begins with something called locality. Locality is the observation that if a CPU accesses a piece of data, it will probably touch that same data again soon (temporal locality) or data stored right next to it (spatial locality). This is why smart operating systems and hypervisors try their best to allocate memory on the node nearest to the CPU that will be using it.
Take Linux, for instance. When you're running on a NUMA machine, the kernel's default policy is to allocate memory on the node where the allocating thread happens to be running, and it gives you tools to override that when you need to. I often recommend `numactl`, which lets you set explicit memory-allocation policies per process. For example, if you're running a heavy-duty application like PostgreSQL, you can use `numactl` to make sure it allocates from the node closest to the CPUs processing the queries. That can noticeably speed up data retrieval, because the application ends up working out of local memory.
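To give a rough idea of what I mean (this is a sketch, not an official PostgreSQL recipe, and `$PGDATA` just stands for wherever your data directory lives), binding the server to one node looks something like this:

```bash
# Start PostgreSQL with both its CPUs and its memory bound to NUMA node 0.
# The memory policy is inherited by the forked server processes.
numactl --cpunodebind=0 --membind=0 pg_ctl start -D "$PGDATA"

# If the working set is bigger than one node's RAM, interleaving the
# allocations across nodes can be the lesser evil:
# numactl --interleave=all pg_ctl start -D "$PGDATA"
```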
You also have to consider CPU scheduling. Threads and processes can be given a processor affinity, and the scheduler generally tries to keep a thread on the core it last ran on, which ties back to that locality I mentioned. When I configure a system, I make sure the operating system does the heavy lifting here, scheduling application threads on cores that sit closest to the memory they need, which keeps that memory latency down.
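A quick sketch of what I mean, with `./my_app` standing in for whatever you're running and the core numbers matching my example box rather than yours:

```bash
# Pin a process to the cores of NUMA node 0 (here assuming node 0 = cores 0-7;
# check lscpu or numactl --hardware for your actual layout).
taskset -c 0-7 ./my_app

# Or let numactl handle both CPU and memory placement in one shot:
numactl --cpunodebind=0 --membind=0 ./my_app

# See which cores an already-running process is allowed to use:
taskset -cp <pid>
```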
I remember setting up a virtual machine on an AMD EPYC architecture for a big data project. The performance bumped up significantly once I ensured that the virtual CPUs and application memory were aligned with the NUMA nodes on that CPU. The EPYC’s architecture is designed with lots of cores and memory channels, and playing around with the memory distribution and thread affinities made a noticeable difference.
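On the libvirt/KVM side, the pinning I'm describing looks roughly like this; `bigdata-vm` is a made-up domain name, and the vCPU, core, and node numbers are purely illustrative (`virsh capabilities` shows the host's actual NUMA layout):

```bash
# Pin guest vCPUs to host cores that belong to NUMA node 0.
virsh vcpupin bigdata-vm 0 0     # guest vCPU 0 -> host core 0
virsh vcpupin bigdata-vm 1 1     # guest vCPU 1 -> host core 1

# Restrict the guest's memory to the same node.
virsh numatune bigdata-vm --mode strict --nodeset 0 --live
```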
Speaking of AMD and Intel, each vendor has its own knobs that influence how we run and optimize applications on NUMA systems. Intel's Xeon processors, for example, offer Cache Allocation Technology (CAT), which lets you partition the shared last-level cache between workloads. That's not strictly a NUMA feature, but it helps data locality in the same spirit: the cache space reserved for the tasks pinned to one set of cores is protected, so other workloads can't keep evicting their hot data.
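If you want to poke at CAT yourself, Intel ships a `pqos` tool in the intel-cmt-cat package. The mask and core list below are purely illustrative assumptions and need a CAT-capable Xeon:

```bash
# Show the current cache allocation configuration.
pqos -s

# Give class of service 1 the low four ways of the last-level cache.
pqos -e "llc:1=0x00f"

# Associate cores 0-3 (say, our latency-sensitive task) with that class.
pqos -a "llc:1=0,1,2,3"
```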
You might be saying, “Okay, but what if it all goes wrong?” That's a great question, and it happens. Sometimes performance doesn't meet expectations even when everything seems set up correctly. You can run into thread contention, or your workload may simply not be laid out well for NUMA. In those cases, I reach for profiling tools to find the bottlenecks. `perf` on Linux can give you insight into CPU cycles, cache misses, and local versus remote memory accesses, which helps show whether the application is spending too much time waiting for data.
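Here's the kind of thing I run, with `<pid>` as a placeholder for the process you're investigating (and note that `perf c2c` needs hardware that supports it):

```bash
# Count NUMA-relevant events for 30 seconds; comparing node-loads against
# node-load-misses gives a feel for how often loads stay on the local node
# versus going remote.
perf stat -e cycles,instructions,cache-references,cache-misses,node-loads,node-load-misses -p <pid> -- sleep 30

# perf c2c highlights cachelines bouncing between cores and sockets,
# which often points at false sharing.
perf c2c record -p <pid> -- sleep 30
perf c2c report
```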
Another point to consider is memory allocation patterns. If you're using a language like Java with a garbage collector, the way objects are allocated can cause performance hiccups if it isn't managed. If your application spreads its objects across memory nodes while the threads using them sit somewhere else, you end up with excessive remote memory access and degraded performance. I typically lean on the runtime's NUMA-related options to keep objects allocated close to where they will be used.
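For HotSpot specifically, the relevant switch is `-XX:+UseNUMA`; the heap sizes and jar name below are made up for the example:

```bash
# Enable the NUMA-aware allocator so new objects land on the memory node of
# the thread that creates them (works with the Parallel collector, and with
# G1 on newer JDKs).
java -XX:+UseParallelGC -XX:+UseNUMA -Xms16g -Xmx16g -jar my-service.jar

# Alternative from the outside: interleave the whole heap across nodes.
# numactl --interleave=all java -jar my-service.jar
```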
For those of us who watch system performance day to day, monitoring tools are indispensable. You've probably heard of Prometheus and Grafana, which can visualize CPU and memory usage in real time. With those, I often spot trends in memory usage or CPU load and adjust accordingly. If one node is getting hammered while others sit underutilized, that's your cue to move some workloads around or fine-tune your memory allocation policies.
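The low-tech check I reach for first is `numastat`, which ships with the numactl package; `<pid>` below is a placeholder:

```bash
# System-wide per-node counters; numa_miss and other_node climbing on a node
# means processes there are being fed from remote memory.
numastat

# Per-node memory footprint of a specific process.
numastat -p <pid>

# Meminfo-style breakdown of free and used memory per node.
numastat -m
```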
And here's something I've noticed over time: application design can really impact how well a NUMA architecture performs. If you're coding an application that assumes memory access will always be uniform, you might run into a lot of trouble when it's deployed on a NUMA system. That’s why it’s essential to write scalable and efficient code. Sometimes I even find it useful to collaborate closely with developers to ensure they’re aware of how memory is set up in production environments, allowing them to architect their code with the right level of optimization for NUMA.
Recently, I worked on a project with heavy machine-learning workloads. Using NVIDIA GPUs with the CUDA toolkit alongside AMD EPYC processors got tricky once we started hitting memory bottlenecks. The fix turned out to be better memory management around the CPU-to-GPU hand-off: keeping each worker process and its host buffers on the NUMA node closest to the GPU it was feeding. Once the data stopped crossing sockets on its way to the device, we got a significant boost in throughput.
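Roughly, the workflow looked like the sketch below; `train_worker.py` and the node numbers are placeholders, and the actual GPU-to-node mapping comes from your own topology output:

```bash
# Check which NUMA node each GPU hangs off of; the CPU/NUMA affinity columns
# in the topology matrix are the interesting part here.
nvidia-smi topo -m

# Run the worker feeding GPU 0 on the node GPU 0 is attached to (say node 0),
# so its host-side buffers are allocated local to that GPU's PCIe root.
CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 python train_worker.py
```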
In the end, the performance you achieve with NUMA systems boils down to understanding how your CPU architecture works and utilizing effective strategies for memory access, whether it’s through scheduling, allocation policies, or profiling. It’s all about making sure that memory isn't a bottleneck in getting your applications up and running smoothly. Once you grasp these principles, you can really make the most of your computing architecture, be it for high-performance computing, gaming, or enterprise-level applications.
I’m always here to share thoughts on it anytime you want to chat or work on a project that involves NUMA architectures. It’s definitely something that keeps me on my toes, and I love exploring it. Whether you are designing a multi-core server setup or optimizing a complex application, knowing how CPUs interact with memory can truly unlock performance improvements you may not have thought possible.