[GE users] SGE 5.3p5 stops scheduling jobs, schedd error?

Eric Wu ewu at bbn.com
Tue Jul 13 15:09:23 BST 2004


Hello all,


Twice in the last week, the scheduler has stopped dispatching
jobs to the exec hosts.  The execd daemons on those hosts appear
to be running, according to qmon.  We can work around
the problem by migrating to the failover server.
This error looks somewhat similar to issue #1141.


We are running SGE5.3p5 on Intel Xeon RedHat 7.3.

We see these errors in our logs:

schedd:
Tue Jul 13 03:49:29 2004|schedd|d03|C|!!!!!!!!!! lGetRef(): got NULL element for JRL_category !!!!!!!!!!

qmaster:
Tue Jul 13 03:59:30 2004|qmaster|d03|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "d03.bbn.com"
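
The two log lines above appear to use a pipe-separated layout
(timestamp, daemon, host, level, message).  Below is a minimal Python
sketch, assuming that layout and assuming that the level codes C and E
mark critical and error entries, for pulling such entries out of the
schedd and qmaster messages files.  The names LogEntry, parse_line and
severe_entries are illustrative only and are not part of SGE.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class LogEntry(NamedTuple):
    """One parsed line; field names are assumptions based on the samples."""
    timestamp: str
    daemon: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    # Split on the first four '|' only, so a '|' inside the message survives.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the assumed five-field layout
    return LogEntry(*(p.strip() for p in parts))


def severe_entries(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield entries whose level is C or E (assumed: critical, error)."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.level in {"C", "E"}:
            yield entry


# The two lines quoted in this message, with the mail wrapping undone.
sample = [
    "Tue Jul 13 03:49:29 2004|schedd|d03|C|!!!!!!!!!! lGetRef(): got NULL "
    "element for JRL_category !!!!!!!!!!",
    "Tue Jul 13 03:59:30 2004|qmaster|d03|E|acknowledge timeout after 600 "
    'seconds for event client (schedd:1) on host "d03.bbn.com"',
]

for e in severe_entries(sample):
    print(e.timestamp, e.daemon, e.level, e.message)
```

Passing an open messages file instead of the sample list should work the
same way, since the function only iterates over lines.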


We have not tried qconf -sss, qstat -j or qstat -r, but we will.  We have 
not been able to reliably reproduce the error.  Our queues have a few 
thousand jobs waiting.

schedd_job_info was true, but I set it to false after the latest crash.

Any help would be appreciated.

Thanks

Eric



