[GE users] Invisible Jobs

Jim Marconnet jmarconnet at knology.net
Sun May 15 04:10:56 BST 2005


I talked to our IT guy Friday. In non-technical terms, he said he restarted
SGE and that caused the invisible jobs. They are still running as far as I
know.

Thanks for your diagnostic suggestions. Unfortunately, I don't know how to
do either one of them, much less what the results would mean if I did. So
for now I'll just let the hidden jobs run.

I submitted a few test jobs with qsub; they ran OK and showed up as normal
in qmon and qstat, so it looks like we're OK again.

Thanks!
Jim Marconnet

--------------
Reuti asked:

Can you check on the nodes whether all the tasks are still children of the
execd and shepherd, or whether they have dropped out of process
control? - Reuti
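
One possible way to run the check Reuti describes, as a rough sketch only:
capture the process table on a node with `ps -eo pid,ppid,comm` and walk
each job process back through its parents. The sketch assumes the daemons
appear under names beginning with sge_execd and sge_shepherd. The sample
table, the PIDs, and the job names (myjob.sh, solver) are invented for
illustration and do not come from this thread.

```python
def parse_ps(text):
    """Parse `ps -eo pid,ppid,comm` output into {pid: (ppid, command)}."""
    procs = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        pid, ppid, comm = line.split(None, 2)
        procs[int(pid)] = (int(ppid), comm.strip())
    return procs


def ancestors(procs, pid):
    """Return the command names of the ancestors of pid, nearest first."""
    chain, seen = [], set()
    while pid in procs and pid not in seen:
        seen.add(pid)  # guard against a malformed, cyclic table
        ppid = procs[pid][0]
        if ppid not in procs:
            break
        chain.append(procs[ppid][1])
        pid = ppid
    return chain


def under_sge_control(procs, pid):
    """True if pid has both a shepherd and an execd among its ancestors.

    Assumption: the daemon names start with sge_shepherd / sge_execd.
    """
    names = ancestors(procs, pid)
    has_shepherd = any(n.startswith("sge_shepherd") for n in names)
    has_execd = any(n.startswith("sge_execd") for n in names)
    return has_shepherd and has_execd


# Invented sample: PID 952 is still below a shepherd, PID 1200 is not.
SAMPLE = """\
  PID  PPID COMMAND
    1     0 init
  900     1 sge_execd
  950   900 sge_shepherd-101
  951   950 myjob.sh
  952   951 solver
 1200     1 solver
"""

procs = parse_ps(SAMPLE)
print(under_sge_control(procs, 952))   # True
print(under_sge_control(procs, 1200))  # False
```

A job process for which this returns False is no longer below the execd
and shepherd in the process tree, which is the case the question is
asking about.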

-and-
----- Original Message ----- 
From: "Rayson Ho" <raysonho at eseenet.com>
To: <users at gridengine.sunsource.net>
Sent: Thursday, May 12, 2005 12:09 PM
Subject: Re: [GE users] Invisible Jobs


> What does "qacct -j <job id>" show??
>
> Also, by looking at the qacct output, you can find the execution host the
> job runs on, and then you can examine the log file of the execd on that
> host.
>
> Rayson
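
As a hedged illustration of Rayson's suggestion: `qacct -j <job id>` prints
one "name value" pair per line, with a line of "=" characters ahead of each
record, and the hostname field names the execution host. The sketch below
pulls that field out of captured output. The sample record (node07, jim,
test.sh, job 101) is made up for illustration and is not real output from
this cluster.

```python
def parse_qacct(text):
    """Split `qacct -j` style output into one dict per accounting record."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("====="):  # separator line opens a new record
            if current:
                records.append(current)
            current = {}
            continue
        parts = line.split(None, 1)  # field name, then the rest of the line
        if len(parts) == 2:
            current[parts[0]] = parts[1].strip()
    if current:
        records.append(current)
    return records


# Invented sample record; real output contains many more fields.
SAMPLE = """\
==============================================================
qname        all.q
hostname     node07
owner        jim
jobname      test.sh
jobnumber    101
exit_status  0
"""

for record in parse_qacct(SAMPLE):
    print(record["jobnumber"], record["hostname"])  # 101 node07
```

With the host name in hand, the next step is to look at the execd log on
that host; where that log lives depends on how the installation's spool
directories were set up, so ask whoever administers the cluster.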
>
>
>
> >Using 6.0u3. We've had a lot of NFS craziness and a server loss due to a
> >power failure and other craziness lately. I just realized that once jobs
> >are submitted using qsub and are running, they become invisible to qmon
> >and to qstat. What I thought was a completely idle cluster is actually a
> >beehive of invisible jobs running.
> >
> >qstat -f shows the queue instances and their loading, but not the jobs.
> >
> >Since things like queue subordination and slots are being ignored, our
> >nodes are getting way oversubscribed. So many of the jobs currently
> >running will run for a LONG time.
> >
> >Any idea what typically causes this to happen? How to prevent it?
> >
> >Any suggestion in layman's terms what to ask the IT folks to do to fix
> >it? Hopefully once and for all!
> >
> >Any way to make these jobs visible again and controllable by qmon?
> >
> >Or do we just have to wait for all jobs to complete (or go to all the
> >individual nodes and kill them manually?) and then reboot everything?
> >
> >Thanks!
> >Jim Marconnet
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net