Little things I didn’t know: difference between _enable_NUMA_support and numactl

In preparation for a research project and potential UKOUG conference papers, I am researching the effect of NUMA on x86 systems.

NUMA is one of the key features to understand in modern computer organisation, and I recommend reading “Computer Architecture, Fifth Edition: A Quantitative Approach” by Hennessy and Patterson (make sure you grab the 5th edition). Read the chapter about cache optimisation and also the appendix about the memory hierarchy!

Now why should you know about NUMA? First of all, there is an increasing number of multi-socket systems. AMD has pioneered the move to many cores, but Intel is not far behind. Although AMD currently leads in the number of cores (“modules”) per die, Intel doesn’t need to catch up: the Sandy Bridge-EP processors are far more powerful in a core-for-core comparison than anything AMD has at the moment.

In this example I am using a blade system with Opteron 61xx processors. According to the AMD hardware reference, the processor has 12 cores per package. The output of /proc/cpuinfo lists 48 “processors”, so it should be fair to say that there are 48/12 = 4 sockets in the system. An AWR report on the machine lists it as 4 sockets, 24 cores and 48 processors. I didn’t think the processor was using SMT; when I find out why AWR reports 24c/48t I’ll update the post.
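A couple of quick checks I find handy on such a box (my own addition, assuming lscpu from util-linux is available on the release in use): lscpu summarises the socket/core/thread and NUMA layout without counting /proc/cpuinfo entries, and a grep against /proc/cmdline confirms whether numa=off is on the boot line (more on that in a moment).

# topology summary and boot-line check
$ lscpu | egrep -i 'socket|core|thread|numa'
$ grep -o 'numa=off' /proc/cmdline || echo "numa=off not set"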

Anyway, I ensured that the kernel command line (/proc/cmdline) didn’t include numa=off, which the oracle-validated RPM sets. Then, after a reboot, here’s the result:

$ numactl --hardware
available: 8 nodes (0-7)
node 0 size: 4016 MB
node 0 free: 378 MB
node 1 size: 4040 MB
node 1 free: 213 MB
node 2 size: 4040 MB
node 2 free: 833 MB
node 3 size: 4040 MB
node 3 free: 819 MB
node 4 size: 4040 MB
node 4 free: 847 MB
node 5 size: 4040 MB
node 5 free: 834 MB
node 6 size: 4040 MB
node 6 free: 851 MB
node 7 size: 4040 MB
node 7 free: 749 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  20  20  20  20  20  20
  1:  20  10  20  20  20  20  20  20
  2:  20  20  10  20  20  20  20  20
  3:  20  20  20  10  20  20  20  20
  4:  20  20  20  20  10  20  20  20
  5:  20  20  20  20  20  10  20  20
  6:  20  20  20  20  20  20  10  20
  7:  20  20  20  20  20  20  20  10

Right, I have 8 NUMA nodes numbered 0-7; total RAM on the machine is 32 GB. (Eight nodes on four sockets makes sense if, as I understand it, each Opteron 61xx package combines two six-core dies, each with its own memory controller.) There are huge pages allocated for another database to allow for a 24 GB SGA. A lot of information about NUMA can be found in SYSFS, which is now mounted by default on RHEL and Oracle Linux. Have a look at /sys/devices/system/node:

$ ls
node0  node1  node2  node3  node4  node5  node6  node7

$ ls node0
cpu0  cpu12  cpu16  cpu20  cpu4  cpu8  cpumap  distance  meminfo  numastat

For each NUMA node shown in the output of numactl --hardware there is a subdirectory nodeN, where you can also see the processors that form the node. Oracle Linux 6.x offers a file called cpulist; previous releases with the RHEL-compatible kernel should have subdirectories cpuX instead. Interestingly, you find memory information local to the NUMA node in the file meminfo, as well as the distance matrix you can also query with numactl --hardware. So far I have only seen distances of 10 and 20; if anyone knows where these numbers come from or has seen other figures, please let me know!
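As a quick aside (my own sketch, not part of the original exploration), the per-node files can be pulled together in one loop, and the hugepage counters in each node’s meminfo show how the 24 GB of huge pages mentioned above are spread across the nodes:

$ for n in /sys/devices/system/node/node*; do
>   # cpulist exists on OL6; on older kernels list the cpuN subdirectories instead
>   echo "$n: cpus=$(cat $n/cpulist 2>/dev/null) distance=$(cat $n/distance)"
> done
$ grep HugePages_Total /sys/devices/system/node/node*/meminfo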

Another useful tool to know is numastat, which presents per-node memory allocation statistics, including cross-node memory requests.

$ numastat
                           node0           node1           node2           node3
numa_hit                 3048548        25344114        14523218        13498057
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit              8196          390371          415719          458362
local_node               2415628        24965781        14059618        12907752
other_node                632920          378333          463600          590305

                           node4           node5           node6           node7
numa_hit                 9295098         4072364         3730878         3659625
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit            512399          451099          417627          390960
local_node               8637176         3483582         3152133         3159090
other_node                657922          588782          578745          500535
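Just as an aside (my own quick calculation, not output from the tool): with numa_miss at zero, numa_hit is the sum of local_node and other_node, so the share of each node’s allocations that was requested from a CPU on a different node drops out of the figures above:

# other_node / numa_hit per node, one result line per block of numastat output
$ numastat | awk '/^numa_hit/ {split($0,hit)} /^other_node/ {for (i=2;i<=NF;i++) printf "%.1f%% ", 100*$i/hit[i]; print ""}'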

Oracle and NUMA

Oracle has an if-then-else approach to NUMA, as a post from Kevin Closson has already explained. I’m on 11.2.0.3 and need to use “_enable_NUMA_support” to enable NUMA support in the database. Before that, however, I thought I’d give the numactl command a chance and bind the instance to node 7 (both for processor and memory).

This is easily done:

[oracle@server1 ~]> numactl --membind=7 --cpunodebind=7 sqlplus / as sysdba <<EOF
> startup
> EOF

Have a look at the numactl man page if you want to learn more about the options.
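One option worth pointing out (my addition): numactl --show prints the policy a shell and its children would inherit, which makes for a quick cross-check of the binding before starting anything under it:

$ numactl --membind=7 --cpunodebind=7 numactl --show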

Now how can you check whether it respected your settings? Simple enough: the tool is called “taskset”. Despite what the name may suggest, you can not only set a task’s CPU affinity, you can also retrieve it. A simple one-liner does that for my database SLOB:

$ for i in `ps -ef | awk '/SLOB/ {print $2}'`; do taskset -c -p $i; done
pid 1434's current affinity list: 3,7,11,15,19,23
pid 1436's current affinity list: 3,7,11,15,19,23
pid 1438's current affinity list: 3,7,11,15,19,23
pid 1442's current affinity list: 3,7,11,15,19,23
pid 1444's current affinity list: 3,7,11,15,19,23
pid 1446's current affinity list: 3,7,11,15,19,23
pid 1448's current affinity list: 3,7,11,15,19,23
pid 1450's current affinity list: 3,7,11,15,19,23
pid 1452's current affinity list: 3,7,11,15,19,23
pid 1454's current affinity list: 3,7,11,15,19,23
pid 1456's current affinity list: 3,7,11,15,19,23
pid 1458's current affinity list: 3,7,11,15,19,23
pid 1460's current affinity list: 3,7,11,15,19,23
pid 1462's current affinity list: 3,7,11,15,19,23
pid 1464's current affinity list: 3,7,11,15,19,23
pid 1466's current affinity list: 3,7,11,15,19,23
pid 1470's current affinity list: 3,7,11,15,19,23
pid 1472's current affinity list: 3,7,11,15,19,23
pid 1489's current affinity list: 3,7,11,15,19,23
pid 1694's current affinity list: 3,7,11,15,19,23
pid 1696's current affinity list: 3,7,11,15,19,23
pid 5041's current affinity list: 3,7,11,15,19,23
pid 13374's current affinity list: 3,7,11,15,19,23

Is that really node 7? Checking the CPUs in node7:

$ ls node7
cpu11  cpu15  cpu19  cpu23  cpu3  cpu7

That’s us! Ok that worked.
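To save the manual lookup, a little helper along these lines (my own sketch) maps each SLOB process back to the SYSFS node directory that owns the first CPU in its affinity list:

$ for pid in $(ps -ef | awk '/SLOB/ {print $2}'); do
>   # first CPU of the affinity list, then the node directory containing it
>   cpu=$(taskset -c -p $pid | awk -F': ' '{print $2}' | cut -d, -f1)
>   ls -d /sys/devices/system/node/node*/cpu$cpu 2>/dev/null
> done | sort -u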

_enable_NUMA_support

The next test I did was to see how Oracle handles NUMA in the database. There was a bit of an enable/don’t-enable back and forth from 10.2 to 11.2. If the MOS notes are correct, NUMA support is now turned off by default. The underscore parameter _enable_NUMA_support turns it on again; at least on my 11.2.0.3.2 system on Linux there was no relinking of the oracle binary necessary.
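For reference, this is roughly how I would flip the switch on a test instance, assuming an spfile is in use (and, as always with underscore parameters, only after checking with Oracle Support):

$ sqlplus / as sysdba <<EOF
> alter system set "_enable_NUMA_support"=true scope=spfile;
> shutdown immediate
> startup
> exit
> EOF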

But to my surprise I saw this after starting the database with NUMA support enabled:


$ for i in `ps -ef | awk '/SLOB/ {print $2}'`; do taskset -c -p $i; done
pid 17513's current affinity list: 26,30,34,38,42,46
pid 17515's current affinity list: 26,30,34,38,42,46
pid 17517's current affinity list: 26,30,34,38,42,46
pid 17521's current affinity list: 26,30,34,38,42,46
pid 17523's current affinity list: 26,30,34,38,42,46
pid 17525's current affinity list: 26,30,34,38,42,46
pid 17527's current affinity list: 26,30,34,38,42,46
pid 17529's current affinity list: 26,30,34,38,42,46
pid 17531's current affinity list: 0,4,8,12,16,20
pid 17533's current affinity list: 24,28,32,36,40,44
pid 17535's current affinity list: 1,5,9,13,17,21
pid 17537's current affinity list: 25,29,33,37,41,45
pid 17539's current affinity list: 2,6,10,14,18,22
pid 17541's current affinity list: 26,30,34,38,42,46
pid 17543's current affinity list: 27,31,35,39,43,47
pid 17545's current affinity list: 3,7,11,15,19,23
pid 17547's current affinity list: 24,28,32,36,40,44
pid 17549's current affinity list: 26,30,34,38,42,46
pid 17551's current affinity list: 26,30,34,38,42,46
pid 17553's current affinity list: 26,30,34,38,42,46
pid 17555's current affinity list: 26,30,34,38,42,46
pid 17557's current affinity list: 26,30,34,38,42,46
pid 17559's current affinity list: 26,30,34,38,42,46
pid 17563's current affinity list: 26,30,34,38,42,46
pid 17565's current affinity list: 26,30,34,38,42,46
pid 17568's current affinity list: 0,4,8,12,16,20
pid 17577's current affinity list: 0,4,8,12,16,20
pid 17584's current affinity list: 0,4,8,12,16,20
pid 17597's current affinity list: 0,4,8,12,16,20
pid 17599's current affinity list: 24,28,32,36,40,44

Interesting: the database, with an otherwise identical pfile (and a SLOB PIO SGA of 270 MB), is now distributed across lots of NUMA nodes. Watch out for that interleaved memory traffic!

It doesn’t help to use numactl to force the creation of the processes on a single node either: Oracle now seems to use NUMA API calls internally and overrides your command:

$ numactl --membind=7 --cpunodebind=7 sqlplus / as sysdba <<EOF
> startup
> EOF
...
$ for i in `ps -ef | awk '/SLOB/ {print $2}'`; do taskset -c -p $i; done
pid 20155's current affinity list: 3,7,11,15,19,23
pid 20157's current affinity list: 3,7,11,15,19,23
pid 20160's current affinity list: 3,7,11,15,19,23
pid 20164's current affinity list: 3,7,11,15,19,23
pid 20166's current affinity list: 3,7,11,15,19,23
pid 20168's current affinity list: 3,7,11,15,19,23
pid 20170's current affinity list: 3,7,11,15,19,23
pid 20172's current affinity list: 3,7,11,15,19,23
pid 20174's current affinity list: 0,4,8,12,16,20
pid 20176's current affinity list: 24,28,32,36,40,44
pid 20178's current affinity list: 1,5,9,13,17,21
pid 20180's current affinity list: 25,29,33,37,41,45
pid 20182's current affinity list: 2,6,10,14,18,22
pid 20184's current affinity list: 26,30,34,38,42,46
pid 20186's current affinity list: 27,31,35,39,43,47
pid 20188's current affinity list: 3,7,11,15,19,23
pid 20190's current affinity list: 24,28,32,36,40,44
pid 20192's current affinity list: 3,7,11,15,19,23
pid 20194's current affinity list: 3,7,11,15,19,23
pid 20196's current affinity list: 3,7,11,15,19,23
pid 20198's current affinity list: 3,7,11,15,19,23
pid 20200's current affinity list: 3,7,11,15,19,23
pid 20202's current affinity list: 3,7,11,15,19,23
pid 20206's current affinity list: 3,7,11,15,19,23
pid 20208's current affinity list: 3,7,11,15,19,23
pid 20211's current affinity list: 0,4,8,12,16,20
pid 20240's current affinity list: 0,4,8,12,16,20
pid 20363's current affinity list: 0,4,8,12,16,20
sched_getaffinity: No such process
failed to get pid 20403's affinity
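One more thing worth noting (my addition): taskset only reports CPU affinity, not where the memory actually ended up. The kernel exposes per-mapping placement in /proc/<pid>/numa_maps, so inspecting one of the processes above, for example pid 20155 from the listing, should show the per-node page counts (N0=, N1=, ...) for the shared memory segments:

# the SGA should show up as SYSV shared memory mappings with per-node page counts
$ grep SYSV /proc/20155/numa_maps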

Little things I didn’t know! So next time I benchmark I will keep that in mind!