Monday, January 26, 2015

Solaris: Parallel Compression/Decompression


This topic is not Solaris-specific, but it should help Solaris users who are frustrated with the single-threaded implementations of the officially supported compression tools such as compress, gzip, and zip.
pigz (pronounced "pig-zee") is a parallel implementation of gzip that is well suited to the latest multi-processor, multi-core machines. By default, pigz breaks the input into chunks of 128 KB and compresses each chunk in parallel using lightweight threads. The number of compression threads defaults to the number of online processors. Both the chunk size and the number of threads are configurable.
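To get a feel for the chunk-and-compress idea, here is a rough sketch that imitates it with nothing but standard tools. It is not how pigz works internally (pigz does the work in-process with threads and writes a single gzip stream); it relies on the fact that concatenated gzip members form a valid gzip file. The function name parallel_gzip_sketch is made up for this illustration.

```shell
# Sketch only: imitates "split into chunks, compress chunks in parallel"
# with split, gzip and cat. parallel_gzip_sketch is a hypothetical helper,
# not part of pigz or gzip.
parallel_gzip_sketch() {
    input=$1                  # file to compress (left untouched)
    chunk_size=${2:-128k}     # chunk size handed to split -b
    workdir=$(mktemp -d) || return 1

    # 1. Break the input into fixed-size chunks: chunk.aaaa, chunk.aaab, ...
    split -a 4 -b "$chunk_size" "$input" "$workdir/chunk." || return 1

    # 2. Compress every chunk in its own background job, then wait for all.
    #    One job per chunk is fine for a small demo file, but far too many
    #    processes for a multi-gigabyte input.
    for chunk in "$workdir"/chunk.*; do
        gzip "$chunk" &
    done
    wait

    # 3. Concatenate the compressed chunks in order. gzip -d reads the
    #    members one after another and reproduces the original data.
    cat "$workdir"/chunk.*.gz > "$input.gz"
    rm -rf "$workdir"
}
```

Running parallel_gzip_sketch somefile 64k leaves somefile in place and writes somefile.gz next to it, which plain gzip -d can restore. For real workloads pigz is the better choice, since it caps the number of workers and avoids the temporary files.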
Compressed files can be restored to their original form with the -d option of either pigz or gzip. As per the man page, decompression itself is not parallelized, but pigz uses separate threads for reading, writing, and check calculation, so it may still show some improvement over the older tools.
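As a small sketch of the options mentioned so far: pigz takes -p for the number of compression threads and -b for the block size in KB, and -d (here combined with -c to write to stdout) for decompression. The wrapper names compress_fast and restore, and the defaults of 8 threads and 256 KB, are assumptions made for this illustration. The wrappers fall back to gzip when pigz is not in the PATH.

```shell
# compress_fast and restore are hypothetical wrapper names for this sketch.
compress_fast() {
    file=$1
    threads=${2:-8}       # assumed default for this sketch only
    block_kb=${3:-256}    # pigz -b takes the block size in KB (pigz default: 128)

    if command -v pigz >/dev/null 2>&1; then
        # -p: number of compression threads
        # -b: block (chunk) size in KB
        # -c: write to stdout, so the original file is kept
        pigz -p "$threads" -b "$block_kb" -c "$file" > "$file.gz"
    else
        # Fallback so the sketch still runs where pigz is not installed.
        gzip -c "$file" > "$file.gz"
    fi
}

restore() {
    # Either tool can decompress the result; print it to stdout.
    if command -v pigz >/dev/null 2>&1; then
        pigz -dc "$1"
    else
        gzip -dc "$1"
    fi
}
```

For example, compress_fast PT8.53.04.tar 32 512 would ask for 32 threads and 512 KB blocks. Whether larger blocks or fewer threads help depends on the data and the machine, so it is worth timing a few combinations.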
The following example demonstrates the advantage of using pigz over gzip in compressing and decompressing a large file.
e.g.,

Original file, and the target hardware.

$ ls -lh PT8.53.04.tar
-rw-r--r-- 1 psft dba 4.8G Feb 28 14:03 PT8.53.04.tar

$ psrinfo -pv
The physical processor has 8 cores and 64 virtual processors (0-63)
The core has 8 virtual processors (0-7)
    ...
The core has 8 virtual processors (56-63)
SPARC-T5 (chipid 0, clock 3600 MHz)

gzip compression.

$ time gzip --fast PT8.53.04.tar

real 3m40.125s
user 3m27.105s
sys 0m13.008s

$ ls -lh PT8.53*
-rw-r--r-- 1 psft dba 3.1G Feb 28 14:03 PT8.53.04.tar.gz

/* the following prstat, vmstat outputs show that gzip is compressing the
    tar file using a single thread - hence low CPU utilization. */

$ prstat -p 42510

PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
42510 psft 2616K 2200K cpu16 10 0 0:01:00 1.5% gzip/1

$ prstat -m -p 42510

PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
42510 psft 95 4.6 0.0 0.0 0.0 0.0 0.0 0.0 0 35 7K 0 gzip/1

$ vmstat 2

r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id
0 0 0 776242104 917016008 0 7 0 0 0 0 0 0 0 52 52 3286 2606 2178 2 0 98
1 0 0 776242104 916987888 0 14 0 0 0 0 0 0 0 0 0 3851 3359 2978 2 1 97
0 0 0 776242104 916962440 0 0 0 0 0 0 0 0 0 0 0 3184 1687 2023 1 0 98
0 0 0 775971768 916930720 0 0 0 0 0 0 0 0 0 39 37 3392 1819 2210 2 0 98
0 0 0 775971768 916898016 0 0 0 0 0 0 0 0 0 0 0 3452 1861 2106 2 0 98

 
pigz compression.

$ time ./pigz PT8.53.04.tar

real 0m25.111s    <== wall clock time is 25s compared to gzip's 3m 40s
user 17m18.398s
sys 0m37.718s

 
/* the following prstat, vmstat outputs show that pigz is compressing the
tar file using many threads - hence busy system with high CPU utilization. */

 
$ prstat -p 49734

PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
49734 psft 59M 58M sleep 11 0 0:12:58 38% pigz/66

 
$ vmstat 2

kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id
0 0 0 778097840 919076008 6 113 0 0 0 0 0 0 0 40 36 39330 45797 74148 61 4 35
0 0 0 777956280 918841720 0 1 0 0 0 0 0 0 0 0 0 38752 43292 71411 64 4 32
0 0 0 777490336 918334176 0 3 0 0 0 0 0 0 0 17 15 46553 53350 86840 60 4 35
1 0 0 777274072 918141936 0 1 0 0 0 0 0 0 0 39 34 16122 20202 28319 88 4 9
1 0 0 777138800 917917376 0 0 0 0 0 0 0 0 0 3 3 46597 51005 86673 56 5 39

 
$ ls -lh PT8.53.04.tar.gz
-rw-r--r-- 1 psft dba 3.0G Feb 28 14:03 PT8.53.04.tar.gz

 
$ gunzip PT8.53.04.tar.gz     <== shows that the pigz-compressed file is compatible with gzip/gunzip

 
$ ls -lh PT8.53*
-rw-r--r-- 1 psft dba 4.8G Feb 28 14:03 PT8.53.04.tar
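
A successful gunzip is a good sign, but the check can be made a little stricter. The sketch below assumes a copy of the original file is still around (for instance, because the archive was written with -c to a separate file). It tests the archive with gzip -t and then compares checksums of the original and the decompressed data. verify_roundtrip is a made-up helper name.

```shell
# verify_roundtrip is a hypothetical helper for this sketch.
# Usage: verify_roundtrip ORIGINAL ARCHIVE.gz
verify_roundtrip() {
    original=$1
    archive=$2

    # gzip -t checks the integrity of the compressed file without writing output.
    gzip -t "$archive" || return 1

    # Compare the POSIX cksum (CRC and byte count) of both data streams.
    want=$(cksum < "$original")
    got=$(gzip -dc "$archive" | cksum)
    [ "$want" = "$got" ]
}
```

The function returns 0 when the archive is intact and decompresses to the same bytes as the original, and non-zero otherwise.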

 
Decompression.

$ time ./pigz -d PT8.53.04.tar.gz

real 0m18.068s
user 0m22.437s
sys 0m12.857s

 
$ time gzip -d PT8.53.04.tar.gz

real 0m52.806s <== compare gzip's 52s decompression time with pigz's 18s
user 0m42.068s
sys 0m10.736s

 
$ ls -lh PT8.53.04.tar
-rw-r--r-- 1 psft dba 4.8G Feb 28 14:03 PT8.53.04.tar

 
Of course, other tools such as Parallel BZIP2 (PBZIP2), a parallel implementation of the bzip2 tool, are worth a try too. The idea here is to highlight the fact that there are better tools out there that get the job done more quickly than the existing/old tools bundled with the operating system distribution.
