[GE users] qmaster repeatedly losing contact with all queues
ami at fc.hp.com
Thu Nov 15 00:21:45 GMT 2007
Over the past few months I have started having serious problems with my
5.3p6 installation. We're trying to finish out a very long project
before moving to 6.1 in hopefully a few months.
All of the queues (in the hundreds) in the cell are frequently going to
an alarm/unknown state for no obvious reason. Network connectivity is
fine between the exec hosts and qmaster. Stopping and restarting the
qmaster daemons almost always clears up the problem for a little while.
The basic configuration of the cell, including the hardware for the
qmaster, has been essentially static for the past 2 years, but the
problem has started showing up only recently. The exception is that we
went to RHEL4 on almost all of our exec hosts this summer, probably 1-2
months before the alarm state problem was noticed.
I found a previous thread on this same problem,
but there didn't seem to be any direct resolution at that time, since
rebooting the qmaster seemed to fix the problem. Rebooting hasn't
worked for us.
There does seem to be a rough correlation between the number of pending
jobs and the likelihood of the problem showing up (if there are more
than ~100 jobs pending for about 20 minutes I will almost certainly see
the problem), but I also see it with far less pending job pressure.
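One way to check that correlation more rigorously is to log the pending-job count at regular intervals, flag sustained backlogs, and compare those times with when the queues go to alarm/unknown. A minimal sketch, assuming the samples have already been collected as (timestamp, pending count) pairs sorted by time; the function name, the 100-job threshold, and the 20-minute window are illustrative and come from the rough figures above, not from anything in Grid Engine itself:

```python
from datetime import datetime, timedelta


def sustained_backlog(samples, threshold=100, window=timedelta(minutes=20)):
    """Return the start time of each run in which the pending-job count
    stayed above `threshold` for at least `window`.

    `samples` is an iterable of (datetime, int) pairs sorted by time.
    Each qualifying run is reported once, by its start time.
    """
    periods = []
    run_start = None   # start of the current above-threshold run, if any
    reported = False   # whether the current run has already been recorded
    for ts, pending in samples:
        if pending > threshold:
            if run_start is None:
                run_start = ts
                reported = False
            if not reported and ts - run_start >= window:
                periods.append(run_start)
                reported = True
        else:
            # Backlog dropped back below the threshold: end the run.
            run_start = None
    return periods


# Hypothetical data: one sample every 5 minutes.
start = datetime(2007, 11, 15, 0, 0)
counts = [50, 120, 130, 140, 150, 160, 40, 110, 120]
samples = [(start + timedelta(minutes=5 * i), c) for i, c in enumerate(counts)]
backlogs = sustained_backlog(samples)
```

If the recorded backlog start times line up with the times the queues went to alarm/unknown, that supports the pending-job theory; alarms with no matching backlog would point elsewhere.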
I know I'm using a very old release, but if anyone has suggestions on
what could be causing the problem, or how to go about debugging it, I
would appreciate it.
Other info that might be relevant:
- qmaster is running Debian Sarge
- majority of the exec hosts are RHEL4
- The load on the qmaster machine stays pegged around 2, with the cpu%
of sge_qmaster between 60-80% and sge_schedd between 20-45%.
- A process that continually pings the qmaster has shown no drops
in connectivity or significant increase in latency
- Hardware diagnostics don't show anything wrong with the qmaster
- Nothing in the qmaster messages file looks particularly
suspicious. I do see the "could not decrease "max_u_jobs" job counter"
warning quite frequently, which was also mentioned in the previous
thread. I have max_u_jobs set to 0 in both the cluster config and the
- The problem now happens between 4 and 12 times a day, and the
frequency has been increasing
- I've tried dumping some info with commdcntl but that hasn't been
meaningful to me
Thanks in advance.