[GE users] SGE 5.3p5 stops scheduling jobs, schedd error?

Andy Schwierskott andy.schwierskott at sun.com
Tue Jul 13 15:53:10 BST 2004


Eric,

we need a stack trace. Do you have a schedd core file (in
<qmaster_spool_dir>/schedd)?

You won't get a core file if this is an admin user system. To get one (on
machines other than Solaris, where you can enable global core dump creation
with coreadm), you need to start the scheduler as the admin user.
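
As a rough sketch of how the stack trace could be pulled out of a core file
with gdb: the paths below (SGE_ROOT, the "glinux" architecture directory, the
default cell spool location) and the output file name are assumptions, so
adjust them to your installation. The "ulimit" line only affects daemons
started from the same shell afterwards.

```shell
# Assumed locations; none of these paths come from the original report.
SGE_ROOT="${SGE_ROOT:-/usr/sge}"
ARCH="${ARCH:-glinux}"
SCHEDD_BIN="$SGE_ROOT/bin/$ARCH/sge_schedd"
CORE_FILE="${CORE_FILE:-$SGE_ROOT/default/spool/qmaster/schedd/core}"

# Allow core files for processes started from this shell (run as the admin user).
ulimit -c unlimited 2>/dev/null || echo "could not raise the core size limit" >&2

# If a core file exists, write a backtrace of all threads to a text file.
if [ -f "$CORE_FILE" ] && command -v gdb >/dev/null 2>&1; then
    gdb -batch -ex "thread apply all bt" "$SCHEDD_BIN" "$CORE_FILE" \
        > schedd_backtrace.txt 2>&1 || true
    echo "backtrace written to schedd_backtrace.txt"
else
    echo "no core file at $CORE_FILE (or gdb is not installed)"
fi
```

The resulting text file is what should be attached to the bug report.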

Once you have the stack trace, please submit a P1 or P2 bug to Issuezilla
with the stack trace output. Please provide other information as well (such
as the output of "qstat -f" or "qstat -r").
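
One possible way to collect that output in one go, as a sketch only: the
directory name "sge_report" and the file naming are made up for illustration,
and "qconf -sss" is included because it is mentioned in the original report.

```shell
# Hypothetical output directory for the files to attach to the bug report.
OUT_DIR="${OUT_DIR:-./sge_report}"
mkdir -p "$OUT_DIR"

# Run each command and save its output, e.g. "qstat -f" -> qstat_-f.txt.
for name in "qstat -f" "qstat -r" "qconf -sss"; do
    file="$OUT_DIR/$(echo "$name" | tr ' ' '_').txt"
    if command -v "${name%% *}" >/dev/null 2>&1; then
        # $name is intentionally unquoted so it splits into command and option.
        $name > "$file" 2>&1 || true
    else
        echo "${name%% *} not found in PATH" > "$file"
    fi
done
```

Running this while the scheduler is hung, before failing over, is likely to
give the most useful snapshot.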

Thanks,
Andy

> Hello all
>
>
> Twice in the last week, we have had the scheduler stop sending
> jobs out to the exec hosts.  The execd hosts seem
> to be running, according to qmon.  We are able to solve
> the problem by migrating to the failover server.
> This error is somewhat similar to issue #1141
>
>
> We are running SGE5.3p5 on Intel Xeon RedHat 7.3.
>
> We see these errors in our logs
>
> schedd:
> Tue Jul 13 03:49:29 2004|schedd|d03|C|!!!!!!!!!! lGetRef(): got NULL element 
> for JRL_category !!!!!!!!!!
>
> qmaster:
> Tue Jul 13 03:59:30 2004|qmaster|d03|E|acknowledge timeout after 600 seconds 
> for event client (schedd:1) on host "d03.bbn.com"
>
>
> We have not tried qconf -sss, qstat -j or qstat -r, but we will.  We have not 
> been able to reliably reproduce the error.  Our queues have a few thousand 
> jobs waiting.
>
> sched_job_info was true, but I just set it to false after the latest crash.
>
> Any help would be appreciated.
>
> Thanks
>
> Eric
