[GE users] SGE 5.3p5 stops scheduling jobs, schedd error?

Eric Wu ewu at bbn.com
Tue Jul 13 15:09:23 BST 2004

Hello all,

Twice in the last week, the scheduler has stopped dispatching
jobs to the exec hosts.  The execd daemons on those hosts appear
to be running, according to qmon.  We can work around
the problem by migrating to the failover server.
This error looks somewhat similar to issue #1141.

We are running SGE 5.3p5 on Red Hat Linux 7.3 (Intel Xeon).

We see these errors in our logs:

Tue Jul 13 03:49:29 2004|schedd|d03|C|!!!!!!!!!! lGetRef(): got NULL element for JRL_category !!!!!!!!!!

Tue Jul 13 03:59:30 2004|qmaster|d03|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "d03.bbn.com"
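
For anyone who wants to pull entries like these out of a large messages file, here is a minimal Python sketch. It assumes the pipe-separated layout seen in the two lines above (timestamp, daemon, host, level, message) and that each entry sits on a single line in the actual file. The field names, the `LogEntry` type, and the helpers `parse_line` and `serious_entries` are my own hypothetical names, and treating the level letters "E" and "C" as the ones worth flagging is an assumption based only on these two entries.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    """One parsed log line; field names are guesses from the sample entries."""
    timestamp: datetime
    daemon: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split a 'timestamp|daemon|host|level|message' line, or return None."""
    # Split at most 4 times so that any '|' inside the message is preserved.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, daemon, host, level, message = parts
    try:
        # Timestamp layout as seen in the post, e.g. "Tue Jul 13 03:49:29 2004".
        timestamp = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None
    return LogEntry(timestamp, daemon, host, level, message)


def serious_entries(lines: Iterable[str]) -> List[LogEntry]:
    """Keep only entries whose level is 'E' or 'C' (assumed to be the severe ones)."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry is not None and entry.level in {"E", "C"}]


if __name__ == "__main__":
    sample = [
        "Tue Jul 13 03:49:29 2004|schedd|d03|C|!!!!!!!!!! lGetRef(): got NULL "
        "element for JRL_category !!!!!!!!!!",
        "Tue Jul 13 03:59:30 2004|qmaster|d03|E|acknowledge timeout after 600 "
        'seconds for event client (schedd:1) on host "d03.bbn.com"',
    ]
    for entry in serious_entries(sample):
        print(entry.timestamp.isoformat(), entry.daemon, entry.level, entry.message)
```

Sorting the parsed entries from the schedd and qmaster logs by timestamp makes it easier to see which schedd message comes just before each qmaster acknowledge timeout.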

We have not tried qconf -sss, qstat -j or qstat -r, but we will.  We have 
not been able to reliably reproduce the error.  Our queues have a few 
thousand jobs waiting.

schedd_job_info was set to true, but I changed it to false after the latest crash.

Any help would be appreciated.


