Wednesday, March 16, 2016

Solaris / SF12K-15K: Domain Stop analysis using Redx tools

This one is pretty old. The SF12K/15K required special tools to analyze a domain crash.

A software error on one domain (such as a heartbeat failure, panic timeout, or error-reset) can cause another domain to Dstop on Sun Fire 15K systems running SMS 1.1.  When the issue manifests, POST running on one domain may Dstop all other running domains.
Although the occurrence is rare, the impact is platform-wide.  Depending on domain configuration and applications, downtime can be several hours.  The problem is intermittent and may be related to a domain sync operation on the centerplane (reset of unused ports).

"Running POST on one domain" here means that the power-on self-tests are being executed on any domain in the system.  POST runs when a domain is first brought online, during a DR attach of a board (not currently supported), or as a recovery action performed by the SMS software to get a domain back up and running after a reboot, panic, or Dstop.

A message in the platform message log (/var/opt/SUNWSMS/adm/platform/messages) would report:

    Jan 17 20:25:55 2002 swmtft901 hwad[22514]: [1156 1693005732870614 ERR
    InterruptHandler.cc 2127] Domain Stop interrupt detected, domain XXX
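
To find such entries without reading the whole log by eye, a small script can scan for the Dstop message. The following is a minimal sketch, not an SMS tool: the pattern is inferred only from the single sample entry above, it assumes each entry sits on one line in the real log (the sample is wrapped here for display), and the names DSTOP_RE and find_dstops are made up for illustration.

```python
import re
from datetime import datetime

# Pattern inferred from the one sample entry in this post (an assumption);
# real hwad messages may vary in layout.
DSTOP_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"(?P<host>\S+) hwad\[\d+\]: "
    r".*Domain Stop interrupt detected, domain (?P<domain>\S+)"
)


def find_dstops(lines):
    """Return (timestamp, SC hostname, domain) for each Dstop entry found."""
    events = []
    for line in lines:
        m = DSTOP_RE.match(line.strip())
        if m:
            # %b expects English month names (C/English locale).
            when = datetime.strptime(m.group("stamp"), "%b %d %H:%M:%S %Y")
            events.append((when, m.group("host"), m.group("domain")))
    return events


if __name__ == "__main__":
    sample = (
        "Jan 17 20:25:55 2002 swmtft901 hwad[22514]: "
        "[1156 1693005732870614 ERR InterruptHandler.cc 2127] "
        "Domain Stop interrupt detected, domain XXX"
    )
    for when, host, domain in find_dstops([sample]):
        print(when.isoformat(), host, domain)
```

In practice the lines would come from open("/var/opt/SUNWSMS/adm/platform/messages") on the SC rather than from a hard-coded sample.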
              
SMS then creates a Dstop dump file in /var/opt/SUNWSMS/adm/[XXX]/dump.
The file is named dsmd.dstop.YYMMDD.hhmm.ss (dsmd.dstop.020117.2025.55 for this example).  If this dump file is loaded in "redx" and the "wfail" command is issued, the output below is reported.  For example:

        sc% redx -cl
        redx> dumpf load dsmd.dstop.020117.2025.55
        redx> wfail
        ...output below...
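
When several dump files have piled up, the timestamp encoded in the file name helps match a dump to its entry in the platform log. Here is a minimal sketch, assuming the dsmd.dstop.YYMMDD.hhmm.ss layout described above holds for every dump; the helper name dump_time is hypothetical.

```python
import re
from datetime import datetime

# File name layout taken from the post: dsmd.dstop.YYMMDD.hhmm.ss
DUMP_RE = re.compile(r"^dsmd\.dstop\.(\d{6}\.\d{4}\.\d{2})$")


def dump_time(filename):
    """Return the time encoded in a Dstop dump file name."""
    m = DUMP_RE.match(filename)
    if m is None:
        raise ValueError("not a dsmd.dstop dump name: %r" % filename)
    # %y maps the two-digit year 02 to 2002.
    return datetime.strptime(m.group(1), "%y%m%d.%H%M.%S")


if __name__ == "__main__":
    print(dump_time("dsmd.dstop.020117.2025.55"))
```

For the example file this gives 17 January 2002, 20:25:55, the same time as the "Domain Stop interrupt detected" log entry.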

The Dstop signature of this issue is as follows:

        SDI EX03/S0  Master_Stop_Status0[31:0] = 7004004F
              MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
        SDI EX03/S0  Dstop0[31:0] = 12018200
              Dstop0[16]: D    DARB texp requests all Dstop (M)  
              Dstop0[25]: D 1E AXQ requests all Dstop (M)
              Dstop0[28]: D    Slot0 asserted Error, enabled to cause Dstop (M)
        AXQ EX03 ( 3) Error_Flag_02[31:0] = 04008400  Mask = 0000FFFF
              Err2[26]: D 1E AMX 0-3 hs flow control didn't arrive simultaneously 
        FAIL EXB EX3:  Dstop/Rstop detected by AXQ.
        Primary service FRU is EXB EX3.
        SDI EX04/S0  Master_Stop_Status0[31:0] = 0004000F
              MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
        SDI EX04/S0  Dstop0[31:0] = 02018200
              Dstop0[16]: D    DARB texp requests all Dstop (M)
              Dstop0[25]: D 1E AXQ requests all Dstop (M)
        AXQ EX04 ( 4) Error_Flag_03[31:0] = 30009000  Mask = 21005EFF
              Err3[28]: D 1E AMX data ECC uncorrectable error           
              Err3[29]: R    AMX data ECC correctable error     
        FAIL EXB EX4:  Dstop/Rstop detected by AXQ.
        Primary service FRU is EXB EX4.
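
The bit numbers in brackets (Dstop0[16], Err2[26], and so on) are positions in the 32-bit hex value printed on the register line. As a sanity check, the set bits can be listed with a few lines of Python. This sketch only decodes raw bits; it does not reproduce how wfail decides which bits to annotate or how it applies the Mask value, and the name set_bits is made up for illustration.

```python
def set_bits(hex_value):
    """Return the positions of the bits set in a 32-bit register value."""
    value = int(hex_value, 16)
    return [bit for bit in range(32) if (value >> bit) & 1]


if __name__ == "__main__":
    # Register values copied from the wfail output above.
    print(set_bits("12018200"))  # SDI EX03/S0 Dstop0
    print(set_bits("04008400"))  # AXQ EX03 Error_Flag_02
    print(set_bits("30009000"))  # AXQ EX04 Error_Flag_03
```

For Dstop0 = 12018200 this yields bits 9, 15, 16, 25 and 28; wfail annotates 16, 25 and 28. For Error_Flag_02 = 04008400 it yields bits 10, 15 and 26, and bit 26 is the AMX flow control error.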
       
The AMX flow control error shown above is the key message.  The system will recover automatically via ASR (automatic system recovery).  After recording the Dstop information, SMS restarts the domain(s).       
       
Any SMS 1.1 installations without patch 112080 or later installed are susceptible to this problem.  SMS 1.2 and higher are not affected by this issue.

The true cause of the problem is the AMX ASIC, which doesn't handle port resets correctly.  The bug fix changes how POST performs the reset to ensure it's done safely.

A Dstop, or Domain Stop, occurs when the hardware detects an unrecoverable error.  The ASICs in the system cease processing transactions as quickly as possible to prevent further corruption of data and to facilitate debugging.  With this bug, a Dstop also occurs during the centerplane reset of ports: the AMX has a problem when the port reset is not done under domain sync.  Changing the reset so that it is done under domain sync makes the problem go away.
