[GE users] qmaster repeatedly losing contact with all queues

Aron Miller ami at fc.hp.com
Thu Nov 15 00:21:45 GMT 2007


Over the past few months I have started having serious problems with my 
5.3p6 installation.  We're trying to finish out a very long project 
before moving to 6.1 in hopefully a few months.

All of the queues (in the hundreds) in the cell are frequently going to 
an alarm/unknown state for no obvious reason.  Network connectivity is 
fine between the exec hosts and qmaster.  Stopping and restarting the 
qmaster daemons almost always clears up the problem for a little while.
The basic configuration of the cell, including the hardware for the 
qmaster, has been essentially static for the past 2 years, but the 
problem has started showing up only recently.  The exception is that we 
went to RHEL4 on almost all of our exec hosts this summer, probably 1-2 
months before the alarm state problem was noticed.

I found a previous thread on this same problem, 
but there didn't seem to be any direct resolution at that time, since 
rebooting the qmaster seemed to fix the problem.  Rebooting hasn't 
worked for us.

There does seem to be a rough correlation between the number of pending 
jobs and the likelihood of the problem showing up (if there are more 
than ~100 jobs pending for about 20 minutes I will almost certainly see 
the problem), but I do see it with far fewer pending jobs too.

I know I'm using a very old release, but if anyone has suggestions on 
what could be causing the problem, or how to go about debugging it, I 
would appreciate it.

Other info that might be relevant:
- qmaster is running Debian Sarge
- majority of the exec hosts are RHEL4
- The load on the qmaster machine stays pegged around 2, with the cpu% 
of sge_qmaster between 60-80% and sge_schedd between 20-45%.
- A process that is continually pinging the qmaster has shown no drops 
in connectivity or significant increase in latency
- Hardware diagnostics don't show anything wrong with the qmaster
- Nothing in the qmaster messages file looks particularly 
suspicious.  I do see the "could not decrease "max_u_jobs" job counter" 
warning quite frequently, which was also mentioned in the previous 
thread.  I have max_u_jobs set to 0 in both the cluster config and the 
scheduler config.
- The problem now happens between 4 and 12 times a day, and the 
frequency has been increasing
- I've tried dumping some info with commdcntl but that hasn't been 
meaningful to me
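
For anyone wanting to pin down when the queues flip state, here is a minimal sketch of how a saved "qstat -f" listing could be scanned for queues in alarm or unknown state.  It is not something from my setup: the function name, the sample text and the assumed column layout (queue name, type, used/total slots, load average, arch, optional state letters) are all illustrative and may need adjusting for what 5.3p6 actually prints.

```python
import re

# Hypothetical sample listing; the real column layout may differ by release.
SAMPLE = """\
queuename            qtype used/tot. load_avg arch       states
----------------------------------------------------------------------------
node01.q             BIP   0/2       0.05     glinux
----------------------------------------------------------------------------
node02.q             BIP   1/2       -NA-     glinux     au
"""

# A queue line is assumed to carry a "used/total" slot field in column 3.
SLOTS = re.compile(r"^\d+/\d+$")


def count_alarm_unknown(qstat_text):
    """Return (flagged, total): queues whose state has 'a' or 'u', and all queues."""
    flagged = 0
    total = 0
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip headers, separators, job lines and anything else that
        # does not look like a queue line under the assumed layout.
        if len(fields) < 5 or not SLOTS.match(fields[2]):
            continue
        total += 1
        # The state column is absent when the queue is healthy.
        states = fields[5] if len(fields) > 5 else ""
        if "a" in states or "u" in states:
            flagged += 1
    return flagged, total
```

Running this once a minute on fresh "qstat -f" output and logging the two counts with a timestamp and the pending job count would show whether the queues all flip together and how that lines up with the pending job backlog.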

Thanks in advance.

