Tuesday, January 12, 2016

Solaris 11.2: Control the size of the ZFS ARC cache dynamically

Solaris 11.2 deprecates the zfs_arc_max kernel parameter in favor of user_reserve_hint_pct and that’s cool.
ZFS has a very smart cache, the so-called ARC (Adaptive Replacement Cache). In general the ARC consumes as much memory as is available, and it also takes care to free memory when other applications need more.
In theory this works very well: ZFS simply uses free memory to speed up slow disk I/O. But it has side effects once the ARC has consumed almost all unused memory. Applications that request more memory have to wait until the ARC frees some up. For example, if you restart a big database, the startup may be significantly delayed, because in the meantime the ARC may already have grabbed the memory freed by the database shutdown. In addition, such a database will likely request large memory pages; if the ARC only releases scattered segments, the memory easily becomes fragmented.
That is why many users limit the total size with the zfs_arc_max kernel parameter, which configures the absolute maximum size of the ARC in bytes. For a long time I personally refused to use this parameter, because it feels like "breaking the legs" of ZFS, it is hard to standardize (an absolute value), and changing it requires a reboot. But for memory-intensive applications this hard limit was simply necessary, until now.
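For reference, the legacy limit was set in /etc/system and only took effect after a reboot. A minimal sketch (the 2 GB value is just an example, not a recommendation):

```
* /etc/system: cap the ZFS ARC at 2 GB (2 * 1024 * 1024 * 1024 bytes)
set zfs:zfs_arc_max=2147483648
```

Exactly this static, per-host byte value is what makes the parameter hard to standardize across machines with different memory sizes.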
Solaris 11.2 finally addresses this pain point: the zfs_arc_max parameter is now deprecated. The new dynamic user_reserve_hint_pct kernel parameter lets the system administrator tell ZFS what percentage of the physical memory should be reserved for user applications. Without a reboot!
So if you know your application will use 90% of your physical memory, you can just set this parameter to 90.
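To make the percentage semantics concrete, here is a tiny shell sketch of what a given value reserves in absolute terms. The 2031 MB figure is simply the physical memory of my test machine (visible in the measurements below); on a real system you would take it from prtconf:

```shell
# Hypothetical example: how much memory a given user_reserve_hint_pct
# reserves for applications, using integer arithmetic.
physmem_mb=2031   # physical memory of the test machine in MB
pct=50            # candidate value for user_reserve_hint_pct
reserved_mb=$(( physmem_mb * pct / 100 ))
echo "user_reserve_hint_pct=${pct} reserves ${reserved_mb} MB for applications"
```

With pct=15 the same arithmetic yields 304 MB, and with pct=25 it yields 507 MB, matching the user_reserve_hint_pct (MB) column in the measurements below.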
Oracle provides a script called set_user_reserve.sh together with additional documentation. Both can be found on My Oracle Support: "Memory Management Between ZFS and Applications in Oracle Solaris 11.2 (Doc ID 1663862.1)". The script adjusts the parameter gradually, giving the ARC enough time to shrink.
According to my first tests, it works really nicely:
# ./set_user_reserve.sh -f 50
Adjusting user_reserve_hint_pct from 0 to 50
08:43:03 AM UTC : waiting for current value : 13 to grow to target : 15
08:43:11 AM UTC : waiting for current value : 15 to grow to target : 20
...
# ./set_user_reserve.sh -f 70
...
# ./set_user_reserve.sh 0

The following chart shows the memory consumption of the ZFS ARC and the memory available for user applications during my test. The line graph is the value of user_reserve_hint_pct, which is gradually set by the script. During the test I set it to 50%, 70% and back to 0%. At the same time I generated some I/O on the ZFS filesystem to make the ARC cache data.

As you can see, the ARC shrinks and grows according to the new
parameter.
For generating the chart data, I wrote the following DTrace script:
DTrace script zfs_user_reserve_stat.d

#!/usr/sbin/dtrace -s
#pragma D option quiet
dtrace:::BEGIN
{
        pagesize = `_pagesize;
        mb = 1024 * 1024;
        printf("    PHYS      ARC  USR avail(MB)  USR(%%) user_reserve_hint_pct (MB) user_reserve_hint_pct (%%)\n");
}
profile:::tick-$1sec
{
        /* total physical memory in MB */
        physmem = (`physmem * pagesize) / mb;
        user_reserved_pct = `user_reserve_hint_pct;
        user_reserved = (user_reserved_pct * physmem) / 100;
        /* current ARC size in MB */
        arc_size = `arc_stats.arcstat_size.value.ui64 / mb;
        /* memory locked by the kernel, in MB */
        used_by_kernel = (`kpages_locked * pagesize) / mb;
        current_mem_userspace = physmem - used_by_kernel;
        current_mem_userspace_pct = (current_mem_userspace * 100) / physmem;
        printf("%8d %8d %15d %7d %27d %26d\n", physmem, arc_size, current_mem_userspace,
            current_mem_userspace_pct, user_reserved, user_reserved_pct);
}

You can run the script while adjusting user_reserve_hint_pct, with the interval of your choice, for example every ten seconds:

# ./zfs_user_reserve_stat.d 10

    PHYS      ARC  USR avail(MB)  USR(%) user_reserve_hint_pct (MB) user_reserve_hint_pct (%)
    2031     1136            288      14                          0                          0
    2031     1051            364      17                        304                         15
    2031      934            489      24                        507                         25
    2031      863            561      27                        609                         30
    2031      798            627      30                        710                         35
...

I definitely need to play more with this parameter, but so far it looks like a big improvement over zfs_arc_max and a very good replacement.
