Tuesday, January 12, 2016

Solaris Cluster : pm_tick delay of xxxx ms exceeds 2147 ms message is continuously logged in /var/adm/messages


SYMPTOMS
The message "pm_tick delay of xxxx ms exceeds 2147 ms" is continuoulsy written in the file /var/adm/messages since several days. The delay number reported (here xxxx) could be variable.
This message is printed thousands of times and the size of the messages files is growing but there is no other impact or symptom on the cluster nodes.
The key symptom is the fact the same message is repeated over time but the delays reported are bounded and almost the same over time rather than showing increasing delays.
In other words the delays are not getting worse.
For example, checking in /var/adm/messages you could see:
grep pm_tick messages |more
Dec 27 01:07:38 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 3589 ms exceeds 2147 ms <==== 1st delay logged
Dec 27 01:07:38 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 3549 ms exceeds 2147 ms
............ snip.....
This message was printed many times and the delay is now stable, as the following checks show:
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep -c pm_tick messages
3112523
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep pm_tick messages |awk '{print $13}' | sort -u
3549 <========= this is the lowest delay
3550
3551
3589 <========= this is the highest delay
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep pm_tick messages |awk '{print $13}' | grep -c 3550
1555990
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep pm_tick messages |awk '{print $13}' | grep -c 3549
1556529
It is clear that the delay is not getting worse: most of the time the reported delays are 3549 and 3550 ms, and the delay range is bounded.
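To get a quick picture of the delay distribution in a single step instead of grepping each value separately, a pipeline along the following lines can be used (the awk field, $13, matches the message format shown above and may need adjusting for a different syslog layout):
grep pm_tick /var/adm/messages | awk '{print $13}' | sort -n | uniq -c
The first column is the number of occurrences of each delay value; a short, stable list of values confirms that the delays are bounded.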

CAUSE
The pm_tick function mentioned in the message is used by the cluster framework to perform sanity checks, ensuring that the node itself is not hung and that cluster heartbeats are sent in a timely fashion.
Repeated delays in the execution of pm_tick are treated as a warning of a potential hang or serious performance problem.
The reported message is printed when pm_tick is delayed by more than 2147 milliseconds.
If the delays grew over time and eventually went beyond 5050 milliseconds, the node would panic with the message: Aborting node because pm_tick delay of %lld ms exceeds %lld ms
That message is documented as follows:
898738 Aborting node because pm_tick delay of %lld ms exceeds %lld ms
Description: The system is unable to send heartbeats for a long time. (This is half of the minimum of the timeout values of all the paths. If the timeout value for all the paths is 10 secs then this value is 5 secs.) There is probably heavy interrupt activity causing the clock thread to get delayed, which in turn causes irregular heartbeats. The node is aborted because it is considered to be in a 'sick' condition and it is better to abort this node instead of causing other nodes (or the cluster) to go down.
Solution: Check to see what is causing high interrupt activity and configure the system accordingly.
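As a starting point for that check, interrupt activity on Solaris can be inspected with the standard observability tools, for example (the 5-second interval is only illustrative):
# intrstat 5
# mpstat 5
intrstat shows per-device interrupt rates and the CPU time spent in interrupt handlers, while mpstat reports per-CPU statistics including interrupts (intr) and cross-calls (xcal).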
These figures are calculated from the heartbeat_timeout value, which defaults to 10 seconds. You can check that value by running:
cluster show -t global |grep heartbeat_timeout
heartbeat_timeout: 10000
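With the default value shown above, the "half of the minimum timeout" rule from the documentation entry quoted earlier gives:
heartbeat_timeout = 10000 ms
10000 ms / 2 = 5000 ms
which is the roughly 5-second abort threshold discussed above, while the warning covered in this article fires much earlier, at 2147 ms.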
In the scenario described in the Symptoms section, the reported delays are bounded and not growing over time, and the messages are not the symptom of an ongoing problem.
In such a case, the cause is a one-shot real delay that was large enough to trigger the first warning but left the pm_tick counters in an incorrect state that was never reset.
As a result, further messages are printed as if a constant delay were detected at every subsequent pm_tick execution, even though no further actual delays occur.

SOLUTION
The problem can be fixed by disabling and then re-enabling path monitoring on the fly (despite the name, this also works on a single-node cluster):
# /usr/cluster/lib/sc/cmm_ctl -d
Then, after a couple of minutes:
# /usr/cluster/lib/sc/cmm_ctl -e
This resets the internal counters used by pm_tick, zeroing out the historical delays that were carried over incorrectly.
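Putting the steps together, the complete procedure (including a quick check that the messages have stopped) could look like this; the two-minute wait is only an example:
# /usr/cluster/lib/sc/cmm_ctl -d
# sleep 120
# /usr/cluster/lib/sc/cmm_ctl -e
# tail -f /var/adm/messages | grep pm_tick
Once the counters have been reset, no new pm_tick delay messages should appear unless a real delay occurs again.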
