Tuesday, May 17, 2016

Solaris 11.2/11.3 – IBM Guardium & Heartaches!!!

In my current environment, I have many Solaris boxes. The boxes that were installed many years ago are running LDOMs, but some of them are running Solaris zones – both native and kernel zones.
These zones/LDOMs run a host of databases and applications. To monitor these databases, we have a product called Guardium (now owned by IBM). IBM InfoSphere Guardium is an appliance that requires an STAP agent to be installed on every box it needs to monitor.
The feature set IBM Guardium provides is excellent. But, trust me, it's a very unstable product. To make matters worse, the PDF documents packaged inside the Guardium installation tool are soooo (not a typo) outdated that they nowhere reflect the actual installation procedure.
The Guardium STAP agent, upon installation, gets embedded into the Solaris kernel and starts automatically during system boot-up. So far, so good. On the downside, the Guardium agent is so unpredictable that it crashes, most of the time without any specific reason.
And because the Guardium STAP agent is embedded in the kernel, such a crash panics the entire box and hangs the system. This has been acknowledged by IBM – snapshot below:

[screenshots of the IBM support case]

If you don’t believe me, check the case “modified date” at the bottom right-hand side.
The most incredible part is the implementation methodology: the LDOM/kernel zone implementation is different from that for native zones.
For LDOMs and kernel zones, the Guardium STAP agent is installed inside the LDOM/kernel zone itself.
But for native zones, the Guardium agent is installed in the global zone.
So imagine a system running hundreds of native zones with Guardium installed/configured in the global zone – Guardium crashes and takes the entire global zone, along with all the native zones, straight to hell and back!!!
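To figure out which camp a given zone falls into, the zone brand (as reported by `zoneadm list -cv`) tells you: `solaris` is a native zone, `solaris-kz` a kernel zone. A small sketch of that triage – the zone names and sample output below are made up for illustration; on a real box you would pipe the live `zoneadm` output into the same awk filter:

```shell
# Sample output mimicking `zoneadm list -cv` (zone names are made up);
# on a real box, pipe the live command into the same awk filter.
zoneadm_output='  ID NAME     STATUS  PATH             BRAND      IP
   0 global   running /                solaris    shared
   1 dbzone1  running /zones/dbzone1   solaris    excl
   2 kzone1   running -                solaris-kz excl'

echo "$zoneadm_output" | awk 'NR > 1 {
    if ($5 == "solaris" && $2 != "global")
        print $2 ": native zone -> install STAP in the global zone"
    else if ($5 == "solaris-kz")
        print $2 ": kernel zone -> install STAP inside the zone"
}'
```

For the sample data this prints one line per non-global zone, flagging dbzone1 as monitored from the global zone and kzone1 as needing its own agent.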
Crash Dump Analysis of the system reveals:

crash file:     /sr/3-12417908495/uc/vmdump.1_57700696/this/vmcore.1
release:        5.11 (64-bit)
version:        11.2
usr/src:       36224:29488d54d94a:0.175.2.15.0.4.0:S11.2SRU15.4+0
usr/closed:    2817:a6f8db7e698d:0.175.2.15.0.4.0:S11.2SRU15.4+0
machine:        sun4v
node name:      prctldom3
hw_provider:    Oracle Corporation
system type:    ORCL,SPARC64-X (SPARC64-X+)
hostid:         9007170b
dump_conflags:  0x10100 (DUMP_KERNEL|DUMP_ZFS) on /dev/zvol/dsk/rpool/dump(63.1G)
cluster_bflgs:  0x3 (CLUSTER_CONFIGURED|CLUSTER_BOOTED)
pxfs_software_mount_level: v1 (consolidated version)
current node:   1
dump_uuid:      d8abf8af-255f-4fae-a757-b366ee3d621a
time of crash:  Fri Mar 25 20:13:05 UTC 2016
age of system:  2 hours 8 minutes 15 seconds
panic CPU:      49 (40 CPUs, 126G memory, 2 nodes)
panic string:   BAD TRAP: type=31 rp=2a100643080 addr=643380 mmu_fsr=0 occurred in module "genunix" due to an illegal access to a user address

sanity checks: settings...
NOTE: /etc/system: module ge not loaded for "set ge:ge_intr_mode=0x833"
vmem...CPU...
WARNING: 3 CPUs have 6 threads in their dispatch queues
sysent...
WARNING: unknown module ktap_71327 seen 10 times in sysent table
clock...misc...
WARNING: 23 severe kstat errors (run "kstat xck")
NOTE: system has 6 non-default processor sets (2 CPUs in cp_default)
NOTE: system has 6 non-global zones
NOTE: system has 6 non-default CPU pools
NOTE: system has 1 non-default project
disks...
WARNING: 1 disk is busy with 1 command pending (run "dev busy").
NOTE: user_reserve_hint_pct is set to 80% (99.2G)
done
CAT(vmcore.1/11V)>


CAT(vmcore.1/11V)> panic
panic on CPU 49
panic string:   BAD TRAP: type=31 rp=2a100643080 addr=643380 mmu_fsr=0 occurred in module "genunix" due to an illegal access to a user address
==== panic user (LWP_SYS) thread: 0xc401c6b856c0  PID: 11123  on CPU: 49 ====
cmd: /usr/lib/dbus-daemon --system
fmri: svc:/system/dbus:default
t_procp: 0xc401bf654010
  p_as: 0xc4012c1c7800  size: 3874816  RSS: 2596864
     a_hat: 0xc401bf3ee940
     cnum: CPU48:1/16239
     cpusran: 48,49
  p_zone: 0xc401c1229880 (prdbaml1)
t_stk: 0x2a100643290  sp: 0x2050aa21  t_stkbase: 0x2a10063a000
t_pri: 59 (TS)  pctcpu: 0.006347
t_transience: 2  t_wkld_flags: 0
t_lwp: 0xc40129267ea0  t_tid: 1
  machpcb: 0x2a100643290
  lwp_ap:   0x643380
  t_mstate: LMS_SYSTEM  ms_prev: LMS_KFAULT
  ms_state_start: 0.000660480 seconds earlier
  ms_start: 4.460619520 seconds earlier
t_cpupart: 0xc4012260ea80(6)  t_bind_pset: 6  last CPU: 49
idle: 4459386880 hrticks (4.459386880s)
start: Fri Mar 25 20:13:01 2016
age: 4 seconds (4 seconds)
t_state:     TS_ONPROC
t_flag:      0x1800 (T_PANIC|T_LWPREUSE)
t_proc_flag: 0x104 (TP_TWAIT|TP_MSACCT)
t_schedflag: 0x13 (TS_LOAD|TS_DONT_SWAP|TS_SIGNALLED)
t_acflag:    3 (TA_NO_PROCESS_LOCK|TA_BATCH_TICKS)
p_flag:      0x42000400 (SZONETOP|SMSACCT|SMSFORK)

pc:      unix:panicsys+0x40:   call     unix:setjmp

void unix:panicsys+0x40((const char *)0x10101b38, (va_list)0x2a100642e48, (struct regs *)0x2050b3d0, (int)1, 0x9900001605, , , , , , , , 0x10101b38, 0x2a100642e48)
unix:vpanic_common+0x78(0x10101b38, 0x2a100642e48, 0x213e7, 0x21467, 0x2a100643120, 0x1600)
void unix:panic+0x1c((const char *)0x10101b38, (void *)0x31, 0x2a100643080, 0x643380, 0, 0x20834cc8, 0x10101ba8, ...)
int unix:die+0x7c((unsigned)0x31, (struct regs *)0x2a100643080, (caddr_t)0x643380, (uint_t)0)
void unix:trap+0xabc((struct regs *)0x2a100643080, (caddr_t)0x643380, (uint32_t), (uint32_t))
unix:ktl0+0x64()
-- trap data  type: 0x31 (data access MMU miss)  rp: 0x2a100643080  --
addr: 0x643380
pc:  0x1012af24 genunix:auditsys+0x24:   ldx    [%i0], %i2
npc: 0x1012af28 genunix:auditsys+0x28:   subcc    %i2, 0x1d, %g0  ( cmp   %i2, 0x1d )
global:                       %g1               0xba
       %g2                  0  %g3         0x1011fb70
       %g4             0x1740  %g5         0x20843b78
       %g6                  0  %g7     0xc401c6b856c0
out:  %o0           0x100000  %o1                  2
       %o2         0xff356840  %o3         0xff1b2a40
       %o4                  1  %o5     0xc401bf654010
       %sp      0x2a100642921  %o7         0x1011ee28
loc:  %l0                  0  %l1                  1
       %l2         0xff3ee354  %l3                  2
       %l4               0xff  %l5         0xff1b2a40
       %l6                  0  %l7                  1
in:   %i0           0x643380  %i1      0x2a100643288
       %i2    0xa2aa025b027ff  %i3         0x20840438
       %i4                  2  %i5               0x30
       %fp      0x2a1006429d1  %i7         0x1011fbc8
<trap>int genunix:auditsys+0x24((struct auditcalls *)0x643380, (rval_t *)0x2a100643288)
int64_t genunix:syscall_ap+0x58()
unix:_syscall_no_proc_exit32+0x78()
-- switch to user thread's user stack --

CAT(vmcore.1/11V)>
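The panic string names the module that took the bad trap. When triaging a pile of these dumps, a one-liner pulls the blamed module out of a saved summary (the string below is copied verbatim from the CAT `panic` output above):

```shell
# Panic string copied from the CAT "panic" output above.
panic='BAD TRAP: type=31 rp=2a100643080 addr=643380 mmu_fsr=0 occurred in module "genunix" due to an illegal access to a user address'

# Extract the module named in the panic string.
echo "$panic" | sed -n 's/.*occurred in module "\([^"]*\)".*/\1/p'
```

Here the blamed module is genunix, but the frames underneath – together with the "unknown module ktap_71327 seen 10 times in sysent table" warning in the sanity checks – point at the third-party ktap driver as the real trigger.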


CAT(vmcore.1/11V)> findstk
building sp list...found 4 stacks
==== stack @ 0x2a100643080 (sp: 0x2a100642881) ====
NULL(genunix:lwp_getdatamodel?)()
struct sysent *genunix:lwp_getsysent+4((klwp_t *))
genunix:auditsys((struct auditcalls *)0x643380, (rval_t *)0x2a100643288) - frame recycled
int64_t genunix:syscall_ap+0x58()
unix:_syscall_no_proc_exit32+0x78()
-- switch to user panic_thread's user stack --

==== stack @ 0x2a100642d10 (sp: 0x2a100642511) ====
NULL(unix:panic?)()
int unix:die+0x7c((unsigned)0x31, (struct regs *)0x2a100643080, (caddr_t)0x643380, (uint_t)0)
void unix:trap+0xabc((struct regs *)0x2a100643080, (caddr_t)0x643380, (uint32_t), (uint32_t))
unix:ktl0+0x64()
-- trap data  type: 0x31 (data access MMU miss)  rp: 0x2a100643080  --
pc:  0x1012af24 genunix:auditsys+0x24:   ldx    [%i0], %i2
npc: 0x1012af28 genunix:auditsys+0x28:   subcc    %i2, 0x1d, %g0  ( cmp   %i2, 0x1d )
 

### Very definitely a user space address rather than a kernel one. 


global:                       %g1               0xba
       %g2                  0  %g3         0x1011fb70
       %g4             0x1740  %g5         0x20843b78
       %g6                  0  %g7     0xc401c6b856c0
out:  %o0           0x100000  %o1                  2
       %o2         0xff356840  %o3         0xff1b2a40
       %o4                  1  %o5     0xc401bf654010
       %sp      0x2a100642921  %o7         0x1011ee28
loc:  %l0                  0  %l1                  1
       %l2         0xff3ee354  %l3                  2
       %l4               0xff  %l5         0xff1b2a40
       %l6                  0  %l7                  1
in:   %i0           0x643380  %i1      0x2a100643288
       %i2    0xa2aa025b027ff  %i3         0x20840438
       %i4                  2  %i5               0x30
       %fp      0x2a1006429d1  %i7         0x1011fbc8
<trap>int genunix:auditsys+0x24((struct auditcalls *)0x643380, (rval_t *)0x2a100643288)
int64_t genunix:syscall_ap+0x58()
unix:_syscall_no_proc_exit32+0x78()
-- switch to user panic_thread's user stack --

==== stack @ 0x2a100642b80 (sp: 0x2a100642381) ====
NULL(genunix:anon_private?)()
faultcode_t genunix:segvn_faultpage+0x83c((struct hat *)0x3100642000, (struct seg *)0x2a100642e80, (caddr_t)0xb, (u_offset_t)0x2000, (struct vpage *)0x2000, (page_t **)0x643380, (uint_t), (enum fault_type), (enum seg_rw), (int))
faultcode_t genunix:segvn_fault+0xb28((struct hat *)0x3100642000, (struct seg *), (caddr_t), (size_t)0x643380, (enum fault_type), (enum seg_rw))
-- error reading next frame @ 0x0 --

==== stack @ 0x2a1006425c0 (sp: 0x2a100641dc1) ====
NULL(genunix:segvn_faultpage?)()
faultcode_t genunix:segvn_fault+0xb28((struct hat *)0xc401bf3ee940, (struct seg *)0, (caddr_t)3, (size_t)0xc401bf3ee940, (enum fault_type)0x20118c00, (enum seg_rw)2)
faultcode_t genunix:as_fault+0x3f0((struct hat *), (struct as *)0x4003ee796fc, (caddr_t)0x2013ac10, (size_t)0x40000000001, (enum fault_type), (enum seg_rw))
-- error reading next frame @ 0x0 --

CAT(vmcore.1/11V)>

CAT(vmcore.1/11V)> modinfo | grep ktap
263 LI 0xc400c6e552c0 0x10a3be90 0x6cb58 1 ktap_71327 (guard_tap driver v9.0 64-bit) >>>>>>>>>>> IBM driver
CAT(vmcore.1/11V)>
 

The above stack closely matches the following bug:
Bug 22319453 - 11.2 SRU 12.4 T5-4 panic in get_syscall_args due to invalid lwp_ap



Conclusion:
---------------------------

Update from the bug:

***************
All third-party modules that do their own syscall interposing can hit this panic, due to a new variable padded into the middle of the lwp structure.
The IBM ktap and Veritas modules have hit this issue so far; they have to recompile their drivers when upgrading past 11.2 SRU 8.

Solaris 11.2.8.4.0 and Later Releases may Cause a System Panic for Systems Using Third Party Kernel Drivers (Doc ID 2111676.1)
***************
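Note that the crash summary above already encodes the patch level in its usr/src line – this box was on 11.2 SRU 15.4, well past the SRU 8 cut-off the bug describes, while the old Guardium driver was still loaded. Pulling the release/SRU tag out of a saved summary:

```shell
# usr/src line copied from the crash summary above; the 4th colon-separated
# field carries the Solaris release/SRU tag.
usrsrc='36224:29488d54d94a:0.175.2.15.0.4.0:S11.2SRU15.4+0'

echo "$usrsrc" | awk -F: '{ print $4 }'
```

For this dump the filter prints S11.2SRU15.4+0, i.e. Solaris 11.2 SRU 15.4.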

We also have a KM document related to this issue:
Solaris 11.2.8.4.0 and Later Releases may Cause a System Panic for Systems Using Third Party Kernel Drivers (Doc ID 2111676.1)

Oracle Doc ID 2111676.1 reveals:
[screenshot of Doc ID 2111676.1]

That leaves me with 2 viable options on the table:
1. Throw the Guardium boxes out of the window.
2. Convert the native zones to kernel zones or LDOMs to minimize exposure to instability & panics.

Until we reach an agreement with management regarding (1), I guess I will have to work on (2).
Will post my “Native zones conversion to kernel zones/LDOMs” plan soon.

Update - 19 May 2016:
Created a detailed series of posts to achieve the "native Zone to Kernel Zone to LDOM" conversion.
Links as below:

Let me know how it worked for you,
regards
Sandeep

2 comments:

  1. Great article, I had some serious issues with this product of IBM.

  2. Hi Sandeep and Marangani - I'm on the Guardium Product Management Team and would like the opportunity to work with you to overcome some of these challenges. We have many of the largest companies in the world successfully using Guardium on Solaris. We do run into challenges when there are kernel changes made that are out of our control, but we do regular testing and put out updates to ensure the highest level of dependability. Please contact me with your contact info so we can set up a call to discuss. We are here to help. damirgholi@us.ibm.com
