Tuesday, January 12, 2016

Solaris 11: Monitoring & Increasing Swap Space Using ZFS Volumes

During installation, Oracle Solaris 11 usually sets the swap space to around one quarter of the RAM size. System and, particularly, application requirements vary for each environment, so it's often appropriate to alter the swap space size by adding or removing space.
The swap space is an area of disk dedicated to paged anonymous memory and to processes that are moved out of RAM when memory runs low.
Monitoring Swap Space
There are several ways to see the current size of the swap space for your system, for example:
root@solaris11-1:~# swap -l
swapfile                     dev  swaplo    blocks     free
/dev/zvol/dsk/rpool/swap   285,2       8   2097144  2097144
where:
  • swapfile indicates the swap space comes from a ZFS volume at /dev/zvol/dsk/rpool/swap.
  • dev shows the major number, which in this case confirms that the swap object is based on a ZFS volume:
root@solaris11-1:~# more /etc/name_to_major | grep 285
zfs 285
  • swaplo is the offset, in 512-byte blocks, at which the usable swap area begins. Here it equals one memory page (8 sectors x 512 bytes = 4K). To check the page size, run the following:
root@solaris11-1:~# pagesize
4096

A value of 4K is typically found on Intel machines. However, with Oracle Solaris 11 on SPARC machines, the page size can vary from 8K to 2 GB (this upper limit also applies for Intel processors). This upper limit is mainly used as the page size for the System Global Area (SGA), a dedicated shared-memory area for an instance of Oracle Database 11g. Additionally, it is worth noting that 2 GB pages are supported with Oracle Solaris 10 8/11 or later Oracle Solaris releases and Oracle's SPARC T4 processor, but this page size isn't enabled by default. If it's suitable for some applications, we have to enable it by inserting set max_uheap_lpsize=0x80000000 in the /etc/system file and then rebooting the system.
Furthermore, Oracle Solaris 11 supports multiple page sizes, which can be set manually according to the application profile or automatically through a new built-in memory prediction technology that is able to analyze the demands of applications in order to assign a suitable value.
The supported page sizes can be shown by running the following command (in this case, on an Intel processor):
root@solaris11-1:~# pagesize -a
4096
2097152

The example above shows us that two page sizes are supported: 4K and 2 MB. The main reason for using larger memory pages is to improve Memory Management Unit (MMU) performance by reducing Translation Lookaside Buffer (TLB) misses. The number of TLB misses can be verified by using the trapstat command (although trapstat is not usually implemented on Intel platforms).
  • blocks is the total size of the swap space (2097144 x 512 bytes = 1 GB).
  • free represents the free swap space (1 GB).
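The block arithmetic in the list above can be scripted. The following is a minimal sketch, assuming the five-column swap -l layout shown earlier (swapfile, dev, swaplo, blocks, free) and 512-byte blocks. The function name swap_l_summary is made up for this illustration and is not part of Oracle Solaris.

```shell
#!/bin/sh
# swap_l_summary (hypothetical helper): read `swap -l` output on stdin and
# print, per swap device, the total and free size in KB.
# Each block is 512 bytes, so KB = blocks / 2.
# Assumes the five-column layout: swapfile dev swaplo blocks free.
swap_l_summary() {
    awk 'NR > 1 && NF >= 5 {
        printf "%s total=%dK free=%dK\n", $1, $4 / 2, $5 / 2
    }'
}

# On a live system (not executed here):
#   swap -l | swap_l_summary
```

For the output above, the 2097144 blocks work out to 1048572K, which is the 1 GB volume minus the first page that swaplo skips.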
Another very good way to monitor the swap space is the following command:
root@solaris11-1:~# swap -s
total: 680180k bytes allocated + 266516k reserved = 946696k used, 2321756k available
 
From this command output, we can see the following:
  • 680180k bytes allocated indicates the amount of swap space that has already been used (that is, touched previously but not necessarily still in use at this time) and that continues to be reserved for use. A rough comparison would be a high-watermark threshold.
  • 266516k reserved indicates swap space that has not been allocated yet, but has been claimed for possible future use. Remember that swap space is reserved when the virtual memory (heap segment or anonymous memory) for a process is created, and the reserved swap space is then allocated when the process is run. Anonymous memory is made of pages that don't have a counterpart in any file system and that are migrated to the swap space due to a shortage of physical memory (RAM), probably because the sum of the stack, the shared memory, and the process heap (from the malloc function, for example) is larger than the amount of available memory.
  • 946696k used indicates the total amount of swap space that is either allocated or reserved.
  • 2321756k available indicates the swap space available for future allocation.
Additionally, we must remember that some swap space is reserved when the virtual memory for a process is created, but only part of this reserved space is actually associated with the address space of the process. Otherwise, the swap -s output can be misinterpreted: it is telling us that 946696k is, in the end, reserved (in order to allocate space, the space must have been reserved previously) and that 680180k of swap space has been touched.
Another very important point is that the swap -l command reports the physical swap space (on disk) while swap -s reports virtual swap space, which is the sum of the physical swap space and the physical memory.
Therefore, the available swap space from swap -s is the sum of free physical swap space plus free physical memory. That's the reason the swap -s command is not recommended for evaluating the physical swap space; instead, swap -l should be used for this goal.
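To make the relationship between the swap -s counters concrete, here is a small sketch that splits the summary line into its four values and checks that allocated plus reserved equals used. It assumes the summary is printed on a single line in the word order shown earlier. The function name swap_s_check is an illustrative choice, not a Solaris utility.

```shell
#!/bin/sh
# swap_s_check (hypothetical helper): read one `swap -s` summary line on
# stdin, print the four counters in KB, and fail if allocated + reserved
# does not equal used.
# Assumed field order:
#   total: <alloc>k bytes allocated + <resv>k reserved = <used>k used, <avail>k available
swap_s_check() {
    awk '/allocated/ {
        # Adding 0 makes awk keep only the leading number and drop the "k".
        alloc = $2 + 0; resv = $6 + 0; used = $9 + 0; avail = $11 + 0
        printf "allocated=%d reserved=%d used=%d available=%d\n", alloc, resv, used, avail
        if (alloc + resv != used) {
            print "mismatch: allocated + reserved != used"
            exit 1
        }
    }'
}

# On a live system (not executed here):
#   swap -s | swap_s_check
```

With the figures above, 680180 + 266516 = 946696, so the check passes.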

If we want to try another way to get the swap information, we can use the echo ::swapinfo | mdb -k command, for example:

root@solaris11-1:~# echo ::swapinfo | mdb -k
            ADDR            VNODE     PAGES     FREE NAME
ffffc10007798260 ffffc10007a7db40    262143   262143 /dev/zvol/dsk/rpool/swap
It's simple to confirm that this matches the swap -l output: 262143 pages x 8 blocks per page (4K / 512 bytes) = 2097144 blocks.
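The ::swapinfo page count can be cross-checked against the blocks column of swap -l with plain shell arithmetic. This is a minimal sketch, assuming the 4096-byte page size reported by pagesize earlier and 512-byte blocks; the helper name pages_to_blocks is hypothetical.

```shell
#!/bin/sh
# pages_to_blocks PAGES PAGESIZE (hypothetical helper):
# convert a page count into 512-byte blocks.
pages_to_blocks() {
    echo $(( $1 * $2 / 512 ))
}

pages_to_blocks 262143 4096   # prints 2097144
```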
As mentioned earlier, it's good to remember that anonymous memory doesn't have a counterpart in the file system. Usually, anonymous pages are the private data of a process, which includes the process heap (anonymous data) and the thread structure (the stack area, for example).
Swapping is an operation in which the swapper process (sched) swaps out processes that have been sleeping for more than 20 seconds (first their thread structures and then the stack and heap data, that is, their anonymous pages). It shouldn't be confused with paging, which is moving pages (normally 4 KB or 8 KB each) from memory to disk and usually results in very efficient memory management. However, one kind of paging has a horrible effect on system performance: anonymous paging (mainly anonymous page-in), because it increases application latency while data is read back from disk.

Also, swapping shouldn't be confused with reaping, which is a technique to free memory from the kernel slab allocator caches and which is done by the function kmem_reap().
How can you verify whether a system is using anonymous pages? In the following output, the interesting columns are apo (anonymous page-out) and api (anonymous page-in), both of which ideally should be equal to zero. The latter is responsible for an increase in application latency.

root@solaris11-1:~# vmstat -p 1
     memory             page           executable     anonymous      filesystem
    swap    free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
 2973844 2609240   3  18   0   0   3    0    0    0    0    0    0    0    0    0
 2895156 2544236  26  47   0   0   0    0    0    0    0    0    0    0    0    0
 2895156 2544092   0   0   0   0   0    0    0    0    0    0    0    0    0    0
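Watching these columns by eye gets tedious over long runs, so the check can be sketched as an awk filter. This assumes the 16-column vmstat -p layout shown above, where api is the 11th field and apo is the 12th. The function name anon_paging_alert is invented for this example.

```shell
#!/bin/sh
# anon_paging_alert (hypothetical helper): read `vmstat -p` output on stdin
# and print a line for every sample whose api (field 11) or apo (field 12)
# counter is nonzero. Header lines are skipped because their first field
# is not a number.
anon_paging_alert() {
    awk '$1 ~ /^[0-9]+$/ && ($11 > 0 || $12 > 0) {
        printf "anonymous paging: api=%d apo=%d\n", $11, $12
    }'
}

# On a live system (not executed here):
#   vmstat -p 1 | anon_paging_alert
```

For the three samples above, the filter prints nothing, because api and apo are zero in every row.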

To find out what process is doing anonymous page-in, use the following command:
root@solaris11-1:~# dtrace -n 'vminfo:::anonpgin { @[pid, execname] = count(); }'
Swapping is the last resort when paging is not able to free enough memory to meet the demands of an application, which can be indicated by a high level of page scanning (searching for free memory pages).
Usually, when the amount of free memory goes below the amount specified by the desfree kernel parameter and then below the amount specified by the minfree kernel parameter, page scanning becomes more intensive. If the amount of free memory stays below the desfree value for 30 seconds or more, the system starts swapping.
The worst form of swapping is hard swapping, which is when some inactive kernel modules are unloaded and moved to the swap space. We can monitor whether the system is hard swapping by using the following command:
root@solaris11-1:~# echo "hardswap/D" | mdb -k
hardswap:
hardswap:       0

Hard swapping is rare because all of the following conditions must be met:
  • The amount of free memory needs to be below desfree for more than 30 seconds, AND
  • There must constantly be two pending processes on the run queue (the r column in the vmstat output below), AND
  • freemem must be below minfree OR the number of page-ins plus page-outs must be greater than maxpgio, where maxpgio is the number of page-out requests that can be queued by the paging system.
In other words, maxpgio is used to limit how many memory pages can be sent to swap, so that paging does not cause a disk I/O bottleneck. Therefore, maxpgio depends on the number of swap devices using their own disk controller. Its default value is 40 pages.
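The three conditions above combine as (A AND B AND (C OR D)). The following sketch encodes that logic as a shell predicate. It only illustrates the rule as described in this article and is not the kernel's implementation. The function name, the argument order, and the reading of "two pending processes" as "at least two" are all assumptions.

```shell
#!/bin/sh
# hard_swap_expected (hypothetical helper), arguments in order:
#   SECS_BELOW_DESFREE RUNQ FREEMEM MINFREE PGIO MAXPGIO
# Returns 0 (true) only when all the conditions described above hold:
#   free memory below desfree for more than 30 seconds, AND
#   at least two processes on the run queue (assumed reading), AND
#   (freemem < minfree OR page-ins plus page-outs > maxpgio).
hard_swap_expected() {
    secs=$1 runq=$2 freemem=$3 minfree=$4 pgio=$5 maxpgio=$6
    [ "$secs" -gt 30 ] || return 1
    [ "$runq" -ge 2 ] || return 1
    [ "$freemem" -lt "$minfree" ] || [ "$pgio" -gt "$maxpgio" ]
}

# Example: 45 seconds below desfree, run queue of 2, freemem under minfree.
if hard_swap_expected 45 2 4000 4079 10 40; then
    echo "conditions for hard swapping are met"
fi
```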
More often, we might see a light kind of swapping called soft swapping, which happens when the amount of free memory is below the desfree value.
We can check for soft swapping by executing the following command:
root@solaris11-1:~# echo "softswap/D" | mdb -k
softswap:
softswap:       0

By way of introduction (more details would be beyond the scope of this article), the minfree value equals desfree/2, and the desfree value equals lotsfree/2. The following is the formula for calculating lotsfree:
lotsfree = (memory - kernel) / (64 * page size)

These values can be seen by running the following commands:
root@solaris11-1:~# prtconf | grep -i memory
Memory size: 4096 Megabytes

root@solaris11-1:~# echo lotsfree/D | mdb -k
lotsfree:
lotsfree:       16318

root@solaris11-1:~# echo desfree/E | mdb -k
desfree:
desfree:        8159      

root@solaris11-1:~# echo minfree/D | mdb -k
minfree:
minfree:        4079       
   
root@solaris11-1:~# bc
16318 * 4096 * 64
4277665792
root@solaris11-1:~#
 
The best method for getting the values of lotsfree, desfree, and minfree is executing the following command:

root@solaris11-1:~# kstat -n system_pages
module: unix                            instance: 0
name:   system_pages                    class:    pages
availrmem                       409132
crtime                          0
desfree                         8159
desscan                         25
econtig                         4229439488
fastscan                        522183
freemem                         243665
kernelbase                      0
lotsfree                        16318
minfree                         4079
nalloc                          110633425
nalloc_calls                    31285
nfree                           107403292
nfree_calls                     23611
nscan                           0
pagesfree                       243665
pageslocked                     635234
pagestotal                      1044366
physmem                         1044366
pp_kernel                       649290
slowscan                        100
snaptime                        26017.87927546
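The three thresholds in this output can be reproduced from physmem with integer arithmetic. The sketch below assumes that lotsfree works out to physmem/64 when everything is counted in pages, which is a simplification of the formula given earlier that happens to match this system. The helper name paging_thresholds is made up for the example.

```shell
#!/bin/sh
# paging_thresholds PHYSMEM_PAGES (hypothetical helper):
# print lotsfree, desfree and minfree in pages.
# Assumption: lotsfree = physmem / 64, desfree = lotsfree / 2,
# minfree = desfree / 2, all with integer division.
paging_thresholds() {
    lotsfree=$(( $1 / 64 ))
    desfree=$(( lotsfree / 2 ))
    minfree=$(( desfree / 2 ))
    echo "lotsfree=$lotsfree desfree=$desfree minfree=$minfree"
}

paging_thresholds 1044366   # prints lotsfree=16318 desfree=8159 minfree=4079
```

The result agrees with the lotsfree, desfree, and minfree lines of the kstat output for physmem = 1044366.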

Furthermore, returning to the page scanning subject, there are different scan rate values that apply at different times. For example, fastscan is the number of pages scanned per second when free memory is equal to zero, desscan is the scan rate goal during page scanning, and nscan is the number of pages scanned during the last page scan action. In this example, there is enough memory and there isn't any page scanning activity (nscan equals 0).
This same information from kstat can be collected by running the following commands:

root@solaris11-1:~# echo fastscan/E | mdb -k
fastscan:
fastscan:       522183         
root@solaris11-1:~# echo slowscan/E | mdb -k
slowscan:
slowscan:       100            
root@solaris11-1:~# echo desscan/E | mdb -k
desscan:
desscan:        25             
root@solaris11-1:~# echo nscan/E | mdb -k
nscan:
nscan:          0              

To monitor the swap space, we can check the past and the present (real time) swapping statistics by executing this command:
root@solaris11-1:~# vmstat 1
kthr  memory  page    disk          faults      cpu
r b w swap free re mf pi po fr de sr s0 s2 s3 s4 in sy cs us sy id
0 0 0 2972960 2608516 3 18 0 0 0 0 3 0 0 0 0 659 480 723 1 4 95
0 0 0 2895104 2544208 26 49 0 0 0 0 0 0 0 0 0 660 648 694 1 4 95
0 0 0 2895104 2544056 0 2 0 0 0 0 0 0 0 0 0 690 1839 847 4 4 92
 
The important column for us is w, which shows the number of swapped-out threads. A nonzero value points to memory pressure, probably because the amount of free memory dropped below minfree or desfree for more than 30 seconds, which causes idle processes to be swapped out to the swap space.
The following command shows the real-time swap status:
root@solaris11-1:~# vmstat -S 1
 kthr  memory   page            disk          faults      cpu
r b w swap free si so pi po fr de sr s0 s2 s3 s4 in sy cs us sy id
0 0 0 2972572 2608200 0 0 0 0 0 0 3 0 0 0 0 659 480 723 1 4 95
0 0 0 2895032 2544000 0 0 0 0 0 0 0 0 0 0 0 706 875 901 2 5 93
0 0 0 2895032 2544000 0 0 0 0 0 0 0 0 0 0 0 615 511 671 1 3 96

Columns so and si represent swapped-out pages and swapped-in pages, respectively, in real time. Again, ideally both should be zero for good performance.

Adding or Removing Swap Space Using a ZFS Volume
Now that we know how to monitor the swap space, it's time to learn how to add and remove disk space allocated to the swap area. The Oracle Solaris 11 host we are using (solaris11-1) has the following file system-related components:

root@solaris11-1:~# zfs list -r rpool
NAME                        USED  AVAIL  REFER  MOUNTPOINT
rpool                       28.5G  49.7G 4.91M  /rpool
rpool/ROOT                  25.4G  49.7G 31K  legacy
rpool/ROOT/solaris          25.4G  49.7G  24.4G /
rpool/ROOT/solaris-backup-1 138K   49.7G  24.2G  /
rpool/ROOT/solaris-backup-1/var 64K 49.7G 291M  /var
rpool/ROOT/solaris/var      486M   49.7G  234M  /var
rpool/VARSHARE              92K    49.7G  92K  /var/share
rpool/dump                  2.06G  49.8G  2.00G  -
rpool/export                805K   49.7G  32K  /export
rpool/export/home           773K   49.7G 32K  /export/home
rpool/export/home/ale       741K   49.7G 741K /export/home/ale
rpool/swap                  1.03G  49.7G 1.00G  -

The last line indicates that the swap space is 1 GB and that it's a ZFS volume. This information can be verified by executing the following:
root@solaris11-1:~# ls -l /dev/zvol/rdsk/rpool/swap
lrwxrwxrwx   1 root root 0 Dec  2 06:31 /dev/zvol/rdsk/rpool/swap -> ../../../..//devices/pseudo/zfs@0:2,raw

Thus, it's feasible to change its size because the rpool has some free space and the swap volume belongs to the rpool storage pool:
root@solaris11-1:~# zfs get volsize rpool/swap
NAME        PROPERTY  VALUE SOURCE
rpool/swap  volsize   1G local

root@solaris11-1:~# zfs set volsize=2G rpool/swap
root@solaris11-1:~# zfs get volsize rpool/swap
NAME        PROPERTY  VALUE SOURCE
rpool/swap  volsize   2G local

root@solaris11-1:~# swap -l 
swapfile                  dev     swaplo   blocks     free
/dev/zvol/dsk/rpool/swap  285,2         8  2097144  2097144
/dev/zvol/dsk/rpool/swap  285,2   2097160  2097144  2097144

root@solaris11-1:~# swap -s
total: 451556k bytes
allocated + 259888k reserved = 711444k used, 3886000k available

root@solaris11-1:~# zfs list -r rpool/swap
NAME         USED  AVAIL REFER  MOUNTPOINT
rpool/swap  2.06G 48.7G  2.00G  -
root@solaris11-1:~#

However, it is not always possible to change the properties of the swap space, because it could be busy. So sometimes it's necessary to add a second volume to the rpool storage pool and, afterwards, to insert a line at the end of /etc/vfstab so that this volume is activated as swap automatically at boot:

root@solaris11-1:~# zfs create -V 2G rpool/newswap
root@solaris11-1:~# swap -a /dev/zvol/dsk/rpool/newswap 
root@solaris11-1:~# swap -l
swapfile                  dev    swaplo  blocks   free
/dev/zvol/dsk/rpool/swap  285,2  8       2097144  2097144
/dev/zvol/dsk/rpool/swap  285,2 2097160  2097144  2097144
/dev/zvol/dsk/rpool/newswap 285,4  8     4194296  4194296

root@solaris11-1:~# swap -s
total: 453668k bytes
allocated + 260304k reserved = 713972k used, 5962264k available

root@solaris11-1:~# zfs list -r rpool
NAME                    USED  AVAIL  REFER MOUNTPOINT
rpool                   31.6G  46.6G 4.91M  /rpool
rpool/ROOT              25.4G  46.6G 31K  legacy
rpool/ROOT/solaris      25.4G  46.6G 24.4G  /
rpool/ROOT/solaris-backup-1 138K 46.6G  24.2G  /
rpool/ROOT/solaris-backup-1/var  64K 46.6G 291M  /var
rpool/ROOT/solaris/var  486M   46.6G 234M  /var
rpool/VARSHARE           92K   46.6G 92K  /var/share
rpool/dump               2.06G 46.7G 2.00G  -
rpool/export             805K  46.6G 32K  /export
rpool/export/home        773K  46.6G 32K  /export/home
rpool/export/home/ale    741K  46.6G 741K  /export/home/ale
rpool/newswap            2.06G 46.7G 2.00G  -
rpool/swap               2.06G 46.7G 2.00G  -
root@solaris11-1:~# more /etc/vfstab
#device                      device   mount              FS       fsck  mount    mount
#to mount                    to fsck  point              type     pass  at boot  options
#
/devices                     -        /devices           devfs    -     no       -
/proc                        -        /proc              proc     -     no       -
ctfs                         -        /system/contract   ctfs     -     no       -
objfs                        -        /system/object     objfs    -     no       -
sharefs                      -        /etc/dfs/sharetab  sharefs  -     no       -
fd                           -        /dev/fd            fd       -     no       -
swap                         -        /tmp               tmpfs    -     yes      -
/dev/zvol/dsk/rpool/swap     -        -                  swap     -     no       -
/dev/zvol/dsk/rpool/newswap  -        -                  swap     -     no       -
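The vfstab entry for a swap volume always has the same shape, so it can be generated instead of typed by hand. The following sketch prints the line for a given ZFS dataset, using the field layout of the swap entries shown above with tabs as separators. The function name vfstab_swap_entry is an assumption of this example, and the script deliberately only prints the line instead of editing /etc/vfstab.

```shell
#!/bin/sh
# vfstab_swap_entry DATASET (hypothetical helper): print the /etc/vfstab
# line for a ZFS swap volume, for example rpool/newswap.
# Fields: device, device to fsck, mount point, FS type, fsck pass,
#         mount at boot, mount options.
vfstab_swap_entry() {
    printf '/dev/zvol/dsk/%s\t-\t-\tswap\t-\tno\t-\n' "$1"
}

vfstab_swap_entry rpool/newswap

# To append it as root, after checking the printed line (not executed here):
#   vfstab_swap_entry rpool/newswap >> /etc/vfstab
```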

Obviously, the process of removing swap space is the reverse. For example, the following command is executed and then the last line in the /etc/vfstab file is deleted:
root@solaris11-1:~# swap -d /dev/zvol/dsk/rpool/newswap
