Bad Concurrency: Building a CPU Topology on MacOS X

Within a project I've been working on I've had the need to simulate the capabilities of Linux's /proc/cpuinfo on Mac OS. Specifically I needed to build a topology of the CPUs on a given system. I.e. I need to map the operating system's processors to hardware threads, then build a picture of which cores and sockets those threads reside. For example my Mac looks something like:

CPU0 (Thread 0) ---+
                   |---> Core 0 ---+
CPU1 (Thread 1) ---+               |
                                   | ----> Socket 0
CPU2 (Thread 2) ---+               |
                   |---> Core 1 ---+
CPU3 (Thread 3) ---+

While this sounds very simple, it's actually fraught with a number of little niggles. Not only did it require getting down and dirty with a bit of C++ and X86 Assembly, it also required writing a MacOS kernel extension.

The first step was to understand what information was available from the CPU. Intel exposes an instruction called CPUID. The is the primary mechanism for getting information about the CPU. There is a raft of information available from listing of the CPU features available (e.g. hyperthreading) to sizes of the various levels of cache and the associated cache lines. To access the CPUID instruction we need a little bit of inline assembler.

The code shows how to get the vendor string from the CPU. On my Mac I get the following:

// Output:
Vendor String: GenuineIntel

For those unfamiliar with Intel inline assembly, the Intel CPU defines a number of registers. The ones used for the CPUID instruction are EAX, EBX, ECX, and EDX (referenced as RAX, RBX, etc if using 64 bit instructions via the REX extension). These used for both input and output. An inline asm segment consists of 3 parts. The first part is the instruction to be executed. In this case the "cpuid" instruction. The second line defines the output parameters. The snippet "=a" (data[0]) means store the result in the EAX register in the variable data[0]. The "=a" refers to the 2nd letter of the register designation. The 3rd and final section are the input parameters. The CPUID instruction takes 2 parameters, one in EAX and one in ECX.

The particular CPUID reference that provides information needed to building the topology is 0xB (11) - the extended topology enumeration leaf. The data returned from this instruction is:

     0                   1                   2                   3   
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
EAX | Shift |                     Reserved                          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
EBX |   No. Process at this level   |            Reserved           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
ECX |   Level No.   |  Level type   |            Reserved           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
EDX |                           x2APIC ID                           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The extended topology enumeration leaf is one of the CPUID indexes that also makes use of ECX as an input parameter. This indicates the level of the CPU that you wish to work at. I started at level 0 and worked my way up. The 'Level type' describes whether the current level is a CPU thread (1) or a physical core (2) or invalid (0). All values greater than 2 are reserved, presumably for later use. The 2 other useful values are the x2APIC ID and Shift. The x2APIC ID is a unique identifier for the smallest logic unit in the CPU. The Shift value is used to group x2APIC ID values into units at the next logical level. This is done by shifting the x2APIC ID right by the value specified by Shift. For example on my using the following code on my workstation (2 sockets, 8 cores, 16 threads):

Outputs the following:

Shift: 1, Count: 2, Level Id: 0, Level Type: 1, x2APIC: 33, Group: 16
Shift: 5, Count: 8, Level Id: 1, Level Type: 2, x2APIC: 33, Group: 1

This indicates that the hardware thread indicated by APIC 33 has the core id of 16 and socket id of 1. The socket and core ids aren't really references, but values that indicate threads have have the same id share that unit. This gets me most of the way there, I can group all of my threads into a hierarchy of cores and sockets. However there is one small niggle remaining. I need this code to run all of the CPUs on my system. On Linux this is reasonably simple, you count the number of CPUs then iterate through all of them and use the pthread_setaffinity_np(...) to specify specify which CPU to run the CPUID instructions on.

However, on Mac OS X actual thread affinity is not supported; just a mechanism logically grouping or separating threads, powerful in its own right, but not what I'm after here. This is where the Kernel module comes in. The XNU Kernel defines a funky little method called mp_rendevous(...). This method takes in a number function pointers. The key one (action_func), is run once on each of the CPUs. So to get the topology information for all of the CPUs we can use it like so:

Because the method mp_rendevous() is defined in the kernel the code above needs to be packaged as a kernel extension. Even then, getting access to the method is interesting. It's not defined in a header that can be easily included, however it is available at link time when compiling a kernel module. Therefore in order compile correctly, it's necessary to provide your own forward declaration of the method. The same is true of the cpu_number(). Calling into the kernel method from user space requires use of the IOKit framework, but I'll leave the details of that as an exercise for the reader.

3 comments:

cynikal said...: 13 You mentioned calling into kernel code from user space requires IOKit.framework.. is that to suggest that one could do this w/o having to create a kernel module? Or to access the functionality of your kernel module is tricky?

Could you expand on this please? I want to do something similar except instead of cpuid, i want to read some c state residency msr and other performance counters. The lack of user-space affinity support makes this tricky getting the counter off the right logical proc/cores :-\; 16 October 2012 at 15:39
Michael Barker said...: To read MSR values it has to be done in the kernel as rdmsr and wrmsr are ring 0 instructions. If you are running linux then you can use the /proc/cpu/XX/msr file system. On Mac OS X you need your own kernel module. I did a bit of hacking using MSRTool to get some information. Be very careful though, you give it an invalid MSR number and it will panic your kernel.; 17 October 2012 at 08:49
Unknown said...: Hi Michael,

Thanks for this valuable post.

Could you please share the complete source and show us how to really map threads on CPU cores?

Thx in advance and happy new year.
Mica; 2 January 2013 at 08:54

Sunday, 8 January 2012

Building a CPU Topology on MacOS X

3 comments: