In a previous article called ‘memory allocation on startup’ I touched on the subject of NUMA: Non-Uniform Memory Access. This article is about how to configure NUMA for the Oracle database, how to look into NUMA usage from the Linux side, and a real-life case of NUMA optimisation using in-memory parallel execution.
At this point in time (the start of the summer of 2016) the CPU clock speed race has stagnated and settled somewhere below 4 gigahertz, while the number of cores and the amount of memory keep growing. The most commonly used server type in the market I work in is the two socket server. It is not unthinkable that servers with more than two sockets will increase in popularity in the near future, since they provide more (parallel) computing capacity and a larger maximum amount of memory.
Of course 2+ socket servers already exist, and some have been multi-socket for quite a while: Oracle/Sun M series, IBM P series, and a large diversity of Intel based 2+ socket servers, like the Oracle/Sun “-4” and “-8” series. The investigations in this blogpost are done on an SGI UV 300RL, which is a system with (up to) 32 sockets, 576 cores, and 1152 CPU threads.
There already have been a lot of blogposts about Linux and NUMA:
Kevin Closson
Martin Bach: linux-large-pages-and-non-uniform-memory-distribution, _enable_numa_support and numactl
Yves Colin
Bertrand Drouvot
I am sure there are more; these are the ones I have read in the past.
This article will take you through the configuration of NUMA for the Oracle database and how to measure specifics from the Linux side. It touches on some of the things mentioned in the other blogposts, in order to provide a consistent story.
The database version is: 12.1.0.2.160419
The operating system is Oracle Linux 7.1
Kernel: 3.8.13-98.4.1.el7uek.x86_64
1. NUMA or not.
In order to sensibly use NUMA optimisation with the Oracle database, you first need hardware that is actually NUMA capable. This means multiple sockets in use, each with local memory (please mind this includes common two socket server hardware). The next requirement is NUMA support enabled in the Linux kernel. With the current version of the oracle-preinstall rpm, NUMA is disabled on the kernel load line in grub (numa=off). When NUMA is disabled this way, numactl will show only one NUMA node (=socket).
First make sure you are on NUMA enabled hardware by issuing ‘numactl --hardware’ and checking that multiple nodes are listed.
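For example, on Oracle Linux 7 with grub2, a minimal check-and-fix sequence could look like the sketch below (the grub file locations are the stock ones on this platform; adjust them to your environment):

# check whether the running kernel was booted with NUMA disabled
$ grep -o numa=off /proc/cmdline
# if it shows numa=off: remove it from GRUB_CMDLINE_LINUX in /etc/default/grub,
# regenerate the grub configuration and reboot
# grub2-mkconfig -o /boot/grub2/grub.cfg
# after the reboot, verify that multiple NUMA nodes are visible
$ numactl --hardware | grep '^available'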
2. Default Oracle.
When you don’t modify the (undocumented) NUMA parameters, Oracle will NOT run in NUMA-aware mode. In other words, you explicitly need to enable NUMA in the database. This is also visible if you look at the numa_maps file of a database process, which shows the process’ memory mappings together with NUMA information:
$ grep huge /proc/$(pgrep pmon_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=904 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=3339 mapmax=904 N0=112 N1=98 N2=98 N3=98 N4=98 N5=98 N6=98 N7=97 N8=96 N9=96 N10=96 N11=101 N12=108 N13=106 N14=105 N15=105 N16=107 N17=107 N18=106 N19=109 N20=106 N21=106 N22=107 N23=107 N24=106 N25=107 N26=108 N27=108 N28=109 N29=112 N30=111 N31=113
9c460000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
9c480000000 prefer:1 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=904 N0=1
A few things to note: I took the pmon process, and my instance name is slob (pmon_slob). The SGA is allocated from huge pages only, so I grepped for ‘huge’. numa_maps only shows the memory the process has actually touched (paged in); my SGA is way bigger. The important thing to see here is that the main SGA memory allocation at 80000000 reads ‘interleave:0-31’, which means the memory is spread over all the NUMA nodes. This can also be seen on the same line, because the allocation per node is shown as Nnn=number of pages.
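Parsing the per-node figures by eye gets tedious on a 32 node system. A small awk sketch (my own helper, not an Oracle or Linux utility) can sum the Nnn=pages fields per node:

$ grep huge /proc/$(pgrep pmon_slob)/numa_maps | \
  awk '{ for (i=1; i<=NF; i++) if ($i ~ /^N[0-9]+=/) { split($i, f, "="); pages[f[1]] += f[2] } }
       END { for (n in pages) printf "%s: %d pages\n", n, pages[n] }'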
3. Set parameters
In Oracle 12.1.0.2 you need to set two parameters to fully enable NUMA support in the database:
alter system set "_enable_NUMA_support"=true scope=spfile;
alter system set "_px_numa_support_enabled"=true scope=spfile;
These are undocumented parameters. However, these parameters are explicitly mentioned in MOS article ‘What Is Correct Setting for _px_numa_support_enabled for 12c Databases That Wish to Use NUMA?’ (Doc ID 1956463.1).
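If you want to verify the values after the restart, the usual way to look at hidden parameters is via the x$ksppi and x$ksppcv fixed tables as SYSDBA; a sketch:

$ sqlplus -S / as sysdba <<'EOF'
col ksppinm format a30
col ksppstvl format a10
select i.ksppinm, v.ksppstvl
from   x$ksppi i, x$ksppcv v
where  i.indx = v.indx
and    i.ksppinm in ('_enable_NUMA_support', '_px_numa_support_enabled');
EOF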
Once you have set these parameters and restarted the instance (they are set with scope=spfile), Oracle will start up in NUMA-aware mode. One of the things that is different in NUMA-aware mode is how the buffer cache is created. This can be seen by looking at the numa_maps file of pmon again:
$ grep huge /proc/$(pgrep pmon_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=904 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=3587 mapmax=904 N0=112 N1=112 N2=112 N3=112 N4=112 N5=112 N6=112 N7=112 N8=112 N9=112 N10=112 N11=112 N12=113 N13=112 N14=113 N15=112 N16=112 N17=112 N18=113 N19=112 N20=112 N21=112 N22=112 N23=112 N24=112 N25=112 N26=112 N27=112 N28=112 N29=112 N30=112 N31=112
8860000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=1536 N25=1536
d260000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=512 N26=512
11c40000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=512 N27=512
16620000000 prefer:28 file=/SYSV00000000\040(deleted) huge dirty=512 N28=512
1b000000000 prefer:29 file=/SYSV00000000\040(deleted) huge dirty=512 N29=512
1f9e0000000 prefer:30 file=/SYSV00000000\040(deleted) huge dirty=256 N30=256
243c0000000 prefer:31 file=/SYSV00000000\040(deleted) huge dirty=512 N31=512
28da0000000 prefer:0 file=/SYSV00000000\040(deleted) huge dirty=256 N0=256
2d780000000 prefer:1 file=/SYSV00000000\040(deleted) huge dirty=512 N1=512
32160000000 prefer:2 file=/SYSV00000000\040(deleted) huge dirty=256 N2=256
36b40000000 prefer:3 file=/SYSV00000000\040(deleted) huge dirty=512 N3=512
3b520000000 prefer:4 file=/SYSV00000000\040(deleted) huge dirty=256 N4=256
3ff00000000 prefer:5 file=/SYSV00000000\040(deleted) huge dirty=512 N5=512
448e0000000 prefer:6 file=/SYSV00000000\040(deleted) huge dirty=256 N6=256
492c0000000 prefer:7 file=/SYSV00000000\040(deleted) huge dirty=512 N7=512
4dca0000000 prefer:8 file=/SYSV00000000\040(deleted) huge dirty=256 N8=256
52680000000 prefer:9 file=/SYSV00000000\040(deleted) huge dirty=512 N9=512
57060000000 prefer:10 file=/SYSV00000000\040(deleted) huge dirty=256 N10=256
5ba40000000 prefer:11 file=/SYSV00000000\040(deleted) huge dirty=512 N11=512
60420000000 prefer:12 file=/SYSV00000000\040(deleted) huge dirty=256 N12=256
64e00000000 prefer:13 file=/SYSV00000000\040(deleted) huge dirty=512 N13=512
697e0000000 prefer:14 file=/SYSV00000000\040(deleted) huge dirty=256 N14=256
6e1c0000000 prefer:15 file=/SYSV00000000\040(deleted) huge dirty=512 N15=512
72ba0000000 prefer:16 file=/SYSV00000000\040(deleted) huge dirty=256 N16=256
77580000000 prefer:17 file=/SYSV00000000\040(deleted) huge dirty=512 N17=512
7bf60000000 prefer:18 file=/SYSV00000000\040(deleted) huge dirty=256 N18=256
80940000000 prefer:19 file=/SYSV00000000\040(deleted) huge dirty=512 N19=512
85320000000 prefer:20 file=/SYSV00000000\040(deleted) huge dirty=256 N20=256
89d00000000 prefer:21 file=/SYSV00000000\040(deleted) huge dirty=512 N21=512
8e6e0000000 prefer:22 file=/SYSV00000000\040(deleted) huge dirty=256 N22=256
930c0000000 prefer:23 file=/SYSV00000000\040(deleted) huge dirty=512 N23=512
97aa0000000 prefer:24 file=/SYSV00000000\040(deleted) huge dirty=512 N24=512
9c460000000 prefer:26 file=/SYSV00000000\040(deleted) huge
9c480000000 prefer:2 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=904 N5=1
If we look at the output, a few segments are still allocated interleaved over all nodes (these are the fixed SGA and the shared pool), and in addition we see allocations that prefer a specific node, for every NUMA node. These per-NUMA-node allocations are the buffer cache.
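The alert.log also reflects NUMA-aware startup; the exact wording of the messages differs per version, so the simplest check is to grep for NUMA (the path below assumes the standard diag destination layout for my slob instance; substitute your own database and instance name):

$ grep -i numa $ORACLE_BASE/diag/rdbms/slob/slob/trace/alert_slob.log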
Okay. Now the database is set up for NUMA, and we checked what this looks like at the operating system level using Linux tools. However, we need to realise that NUMA is no magic that shifts your database into the next gear. Most things you do are not NUMA optimised by the database, simply because it wouldn’t make sense.
For example, when I perform a buffered full table scan without parallel query, the blocks are spread out over the allocated segments. Why? Because my session is running on an arbitrary CPU thread, and could be running on another CPU thread the next time it is scheduled. What I am saying here is: only specific things are NUMA optimised.
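You can observe this from the operating system: the psr column of ps shows the processor a process last ran on, and the CPU affinity of a normal foreground process is not restricted to one node. A sketch, with <server process pid> standing for the pid of your foreground (server) process:

# the processor the session last ran on; repeat this and it changes over time
$ ps -o pid,psr,comm -p <server process pid>
# its cpu affinity is not restricted to a single NUMA node
$ grep Cpus_allowed_list /proc/<server process pid>/status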
The next logical question then is: what actually is NUMA optimised? One obvious thing is the database writer (this is mentioned in the alert.log at startup). Another NUMA optimised thing is parallel query. If you look at the parallel query worker processes, the status file gives a hint:
# grep allowed_list /proc/$(pgrep p000_slob)/status
Cpus_allowed_list: 450-467,1026-1043
Mems_allowed_list: 0-31
This shows that the first parallel query worker process is limited in which CPU threads it is allowed to use. All parallel query worker processes are limited to ranges of CPU threads. Please mind it is NOT limited in which memory it can use, so it can use both local and remote memory. It may look odd that it can still access remote memory, seemingly ignoring NUMA optimised (=local) memory access.
First the CPU thread ranges. If you look at what the range means (it’s rather obvious), you see this:
# lscpu | grep 450
NUMA node25 CPU(s):    450-467,1026-1043
The CPU thread range is exactly a NUMA node. So whatever happens, parallel query worker process p000 can only run on the CPU threads of NUMA node 25!
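A quick way to see this binding for all parallel query worker processes at once is a small shell loop (a sketch, assuming the workers follow the ora_pNNN_slob naming of this instance):

$ for pid in $(pgrep -f 'ora_p[0-9][0-9][0-9]_slob'); do
    printf '%-14s ' "$(ps -o comm= -p $pid)"
    grep Cpus_allowed_list /proc/$pid/status
  done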
Then the absence of memory limits: this means the parallel query worker process can still access anything it needs. As a result, anything that worked with parallel query before the NUMA optimisations still runs as it did without them. In specific cases, however, parallel query can take advantage of the NUMA setting.
One such case is in-memory parallel execution. With in-memory parallel execution, the parallel query workers read blocks from the buffer cache instead of performing a direct path read. In-memory parallel execution is enabled when the parameter parallel_degree_policy is set to auto; the database then heuristically determines whether it is feasible for a given scan.
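As a sketch, enabling it and running a parallel scan could look like this (t1 is just an example table name; whether the scan really runs as in-memory parallel execution is decided by the heuristics mentioned above):

$ sqlplus -S / as sysdba <<'EOF'
alter system set parallel_degree_policy=auto;
-- statement level parallel hint; with parallel_degree_policy=auto the
-- database may satisfy the scan from the buffer cache
select /*+ parallel(8) */ count(*) from t1;
EOF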
I set up a table of approximately 700GB and performed an in-memory parallel query scan on a freshly started instance, so that the parallel query worker processes had touched as little memory as possible beforehand. At the end of the scan, this is what numa_maps of parallel query worker process p000 looks like:
# grep huge /proc/$(pgrep p000_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=906 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=4608 N0=144 N1=144 N2=144 N3=144 N4=144 N5=144 N6=144 N7=144 N8=144 N9=144 N10=144 N11=144 N12=144 N13=144 N14=144 N15=144 N16=144 N17=144 N18=144 N19=144 N20=144 N21=144 N22=144 N23=144 N24=144 N25=144 N26=144 N27=144 N28=144 N29=144 N30=144 N31=144
8880000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=50176 N25=50176
d2a0000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=6145 mapmax=70 N26=6145
11c80000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=4608 N27=4608
16660000000 prefer:28 file=/SYSV00000000\040(deleted) huge dirty=5121 mapmax=70 N28=5121
1b040000000 prefer:29 file=/SYSV00000000\040(deleted) huge dirty=14336 N29=14336
1fa20000000 prefer:30 file=/SYSV00000000\040(deleted) huge dirty=5121 mapmax=72 N30=5121
24400000000 prefer:31 file=/SYSV00000000\040(deleted) huge dirty=4608 N31=4608
28de0000000 prefer:0 file=/SYSV00000000\040(deleted) huge dirty=6145 mapmax=70 N0=6145
2d7c0000000 prefer:1 file=/SYSV00000000\040(deleted) huge dirty=4099 N1=4099
321a0000000 prefer:2 file=/SYSV00000000\040(deleted) huge dirty=2528 mapmax=73 N2=2528
36b80000000 prefer:3 file=/SYSV00000000\040(deleted) huge dirty=2222 N3=2222
3b560000000 prefer:4 file=/SYSV00000000\040(deleted) huge dirty=2504 mapmax=69 N4=2504
3ff40000000 prefer:5 file=/SYSV00000000\040(deleted) huge dirty=2249 N5=2249
44920000000 prefer:6 file=/SYSV00000000\040(deleted) huge dirty=4754 mapmax=69 N6=4754
49300000000 prefer:7 file=/SYSV00000000\040(deleted) huge dirty=2296 N7=2296
4dce0000000 prefer:8 file=/SYSV00000000\040(deleted) huge dirty=2463 mapmax=70 N8=2463
526c0000000 prefer:9 file=/SYSV00000000\040(deleted) huge dirty=2246 N9=2246
570a0000000 prefer:10 file=/SYSV00000000\040(deleted) huge dirty=2440 mapmax=72 N10=2440
5ba80000000 prefer:11 file=/SYSV00000000\040(deleted) huge dirty=2233 N11=2233
60460000000 prefer:12 file=/SYSV00000000\040(deleted) huge dirty=2469 mapmax=69 N12=2469
64e40000000 prefer:13 file=/SYSV00000000\040(deleted) huge dirty=2241 N13=2241
69820000000 prefer:14 file=/SYSV00000000\040(deleted) huge dirty=2434 mapmax=69 N14=2434
6e200000000 prefer:15 file=/SYSV00000000\040(deleted) huge dirty=2342 N15=2342
72be0000000 prefer:16 file=/SYSV00000000\040(deleted) huge dirty=2461 mapmax=70 N16=2461
775c0000000 prefer:17 file=/SYSV00000000\040(deleted) huge dirty=2329 N17=2329
7bfa0000000 prefer:18 file=/SYSV00000000\040(deleted) huge dirty=2499 mapmax=71 N18=2499
80980000000 prefer:19 file=/SYSV00000000\040(deleted) huge dirty=2334 N19=2334
85360000000 prefer:20 file=/SYSV00000000\040(deleted) huge dirty=2466 mapmax=73 N20=2466
89d40000000 prefer:21 file=/SYSV00000000\040(deleted) huge dirty=2328 N21=2328
8e720000000 prefer:22 file=/SYSV00000000\040(deleted) huge dirty=2180 mapmax=69 N22=2180
930e0000000 prefer:23 file=/SYSV00000000\040(deleted) huge dirty=2208 mapmax=69 N23=2208
97aa0000000 prefer:24 file=/SYSV00000000\040(deleted) huge dirty=2167 mapmax=68 N24=2167
9c460000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=147 N26=2
9c480000000 prefer:0 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=906 N2=1
If you look at memory address 8880000000, you see that parallel query worker process p000 touched 50176 pages there, while on the other (buffer cache/‘prefer’) segments it touched somewhere between 2200 and 6150 pages. So it clearly used its local memory (8880000000 is the segment that prefers node 25, the node p000 is bound to) far more than the remote memory. This is proof that in-memory parallel query is NUMA optimised.
In fact, when I time the query using in-memory parallel query with NUMA enabled, it consistently takes a little less than 8 seconds once the data is loaded into the buffer cache. When I disable the NUMA optimisations (_enable_NUMA_support=false) and execute the same query, still using in-memory parallel query and only scanning the buffer cache, the time goes up to a little less than 24 seconds.
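One way to reproduce such a timing comparison is plain sqlplus timing; a sketch (t1 again being the example table):

$ sqlplus -S / as sysdba <<'EOF'
set timing on
-- run the scan twice: the first execution populates the buffer cache,
-- the second one times the pure in-memory parallel scan
select /*+ parallel(8) */ count(*) from t1;
select /*+ parallel(8) */ count(*) from t1;
EOF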