[GE users] Invisible Jobs

Reuti reuti at staff.uni-marburg.de
Sun May 15 20:42:26 BST 2005



Hi Jim,

If you can log in to a slave node, you can run this command (on Linux):

$ ps f -eo pid,ppid,pgrp,command

and look at a part of the process tree:

25495     1 25495 /home/reuti/sge/bin/lx24-amd64/sge_execd
28589 25495 28589  \_ sge_shepherd-498 -bg
28590 28589 28590      \_ /bin/sh /home/reuti/sge/default/spool/node050/job_scripts/498
28591 28590 28590          \_ sleep 120

Are your running jobs still in some way children of the sge_execd? In any
case, if the jobs are still producing useful output, you can of course let
them run. Does "qstat -s z" show them as already finished?
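
If reading the tree by eye on every node is tedious, the same check could be scripted. The following is only a rough sketch in Python, not part of SGE: the helper names (parse_ps, exe_name, under_execd, running_under_execd) are made up for illustration, and it assumes the same Linux ps columns as the command above. It walks the parent chain of each process and reports whether an sge_execd is found among its ancestors.

```python
import os
import subprocess


def parse_ps(text):
    """Parse `ps -eo pid,ppid,pgrp,command` output into {pid: (ppid, command)}."""
    procs = {}
    for line in text.splitlines():
        fields = line.split(None, 3)
        # Skip the header row, blank lines and wrapped continuation lines.
        if len(fields) < 4 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        procs[int(fields[0])] = (int(fields[1]), fields[3].strip())
    return procs


def exe_name(command):
    """Return the basename of the program in a ps command column."""
    # `ps f` prefixes the command with tree art such as " \_ " or "|"; drop it.
    tokens = command.lstrip(" |\\_").split()
    return os.path.basename(tokens[0]) if tokens else ""


def under_execd(pid, procs):
    """Return True if pid, or one of its ancestors, is an sge_execd process."""
    seen = set()
    while pid in procs and pid not in seen:  # `seen` guards against loops
        seen.add(pid)
        ppid, command = procs[pid]
        if exe_name(command) == "sge_execd":
            return True
        pid = ppid
    return False


def running_under_execd():
    """List (pid, command) for every local process still below an sge_execd."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,pgrp,command"],
        capture_output=True, text=True, check=True,
    ).stdout
    procs = parse_ps(out)
    return [(pid, cmd) for pid, (_, cmd) in sorted(procs.items())
            if under_execd(pid, procs)]
```

Calling running_under_execd() on a node lists what is still attached to the execd; a job process that is running on the node but missing from that list has left the execd's process tree.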

Cheers - Reuti


Quoting Jim Marconnet <jmarconnet at knology.net>:

> I talked to our IT guy Friday. In non-technical terms, he said he restarted
> SGE and that caused the invisible jobs. They are still running as far as I
> know.
> 
> Thanks for your diagnostic suggestions. Unfortunately I'm sufficiently
> ignorant to not know how to do either one of them, much less to know what
> it meant if I did so. So for now I'll just let the hidden jobs run.
> 
> I qsubed a few test jobs, and they ran OK and they showed up like normal
> using qmon and qstat, so it looks like we're OK again.
> 
> Thanks!
> Jim Marconnet
> 
> --------------
> Reuti asked:
> 
> Can you check on the nodes whether all the tasks are still children of
> the execd and shepherd, or whether they have escaped process
> control? - Reuti
> 
> -and-
> ----- Original Message ----- 
> From: "Rayson Ho" <raysonho at eseenet.com>
> To: <users at gridengine.sunsource.net>
> Sent: Thursday, May 12, 2005 12:09 PM
> Subject: Re: [GE users] Invisible Jobs
> 
> 
> > What does "qacct -j <job id>" show?
> >
> > Also, by looking at the qacct output, you can find the execution host the
> > job runs on, and then you can examine the log file of the execd on that
> > host.
> >
> > Rayson
> >
> >
> >
> > >Using 6.0u3. We've had a lot of NFS craziness and a server loss due to a
> > >power failure and other craziness lately. I just realized that once jobs
> > >are submitted using qsub and are running, that they become invisible to
> > >qmon and to qstat. What I thought was a completely idle cluster is
> > >actually a beehive of invisible jobs running.
> > >
> > >qstat -f shows the queue instances and their loading, but not the jobs.
> > >
> > >Since things like queue subordination and slots are being ignored, our
> > >nodes are getting way oversubscribed. So many of the jobs currently
> > >running will run for a LONG time.
> > >
> > >Any idea what typically causes this to happen? How to prevent it?
> > >
> > >Any suggestion in layman's terms what to ask the IT folks to do to fix
> > >it? Hopefully once and for all!
> > >
> > >Any way to make these jobs visible again and controllable by qmon?
> > >
> > >Or do we just have to wait for all jobs to complete (or go to all the
> > >individual nodes and kill them manually?) and then reboot everything?
> > >
> > >Thanks!
> > >Jim Marconnet
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net






