[GE users] Invisible Jobs

Reuti reuti at staff.uni-marburg.de
Thu May 12 17:16:59 BST 2005


Can you check on the nodes whether all the job tasks are still children 
of the execd and shepherd, or whether they have escaped from process 
control? - Reuti
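
One rough way to do that check on a node is to walk each job process up 
its parent chain and see whether it reaches the execution daemon. The 
sketch below is only an illustration, not an official Grid Engine tool. 
It assumes the daemon shows up in ps as "sge_execd", that 
"ps -eo pid,ppid,comm" is available on the node, and that you know the 
command names of your jobs. The helper names (parse_ps, under_execd, 
escaped, snapshot) are made up for this example.

```python
import subprocess


def parse_ps(text):
    """Parse 'pid ppid comm' lines into {pid: (ppid, comm)}; skips the header."""
    procs = {}
    for line in text.strip().splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # header line or malformed row
        procs[int(parts[0])] = (int(parts[1]), parts[2].strip())
    return procs


def under_execd(pid, procs, execd_name="sge_execd"):
    """True if pid or any of its ancestors is the execution daemon.

    execd_name is an assumption; adjust it if your daemon is named differently.
    """
    seen = set()
    while pid in procs and pid not in seen:
        seen.add(pid)  # guard against loops in a stale snapshot
        ppid, comm = procs[pid]
        if comm == execd_name:
            return True
        pid = ppid
    return False


def escaped(procs, job_commands):
    """Sorted PIDs whose command is in job_commands but not below the execd."""
    return sorted(
        pid
        for pid, (_ppid, comm) in procs.items()
        if comm in job_commands and not under_execd(pid, procs)
    )


def snapshot():
    """Take a live process snapshot on the current node."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out)
```

On a node you would call something like escaped(snapshot(), {"myjob"}), 
where "myjob" stands in for whatever your job binaries are called. Any 
PID it returns is running outside the execd/shepherd process tree.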

Jim Marconnet wrote:
> Using 6.0u3. We've had a lot of NFS craziness lately, plus a server loss due
> to a power failure. I just realized that once jobs are submitted using qsub
> and are running, they become invisible to qmon and to qstat. What I thought
> was a completely idle cluster is actually a beehive of invisible jobs
> running.
> 
> qstat -f shows the queue instances and their loading, but not the jobs.
> 
> Since things like queue subordination and slots are being ignored, our nodes
> are getting way oversubscribed. As a result, many of the jobs currently
> running will run for a LONG time.
> 
> Any idea what typically causes this to happen? How to prevent it?
> 
> Any suggestion, in layman's terms, of what to ask the IT folks to do to fix
> it? Hopefully once and for all!
> 
> Any way to make these jobs visible again and controllable by qmon?
> 
> Or do we just have to wait for all jobs to complete (or go to all the
> individual nodes and kill them manually?) and then reboot everything?
> 
> Thanks!
> Jim Marconnet
> 
> 
> 

