[GE users] jobs in queue always going to "transfer" status

Ron Chen ron_chen_123 at yahoo.com
Thu Oct 2 02:12:51 BST 2008


And don't forget the qping tool!

If all else fails, qping -dump could help!
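
For example, here is a rough sketch (not tested against a live cluster) that pulls the unreachable hosts out of the qmaster messages file and prints a qping command for each one. The function names, the fallback port 6445, and the component id 1 are my assumptions; adjust them for your installation.

```shell
#!/bin/sh
# Sketch only: helper names and defaults are illustrative assumptions.

# List each host for which qmaster logged a "max. unheard timeout".
# $1 = path to the qmaster messages file.
unheard_hosts() {
    sed -n 's/.*unheard timeout for target "execd" on host "\([^"]*\)".*/\1/p' "$1" | sort -u
}

# Print (do not run) one qping status check per unreachable host.
# Uses SGE_EXECD_PORT if set, otherwise falls back to 6445 (assumed).
qping_commands() {
    port="${SGE_EXECD_PORT:-6445}"
    unheard_hosts "$1" | while read -r host; do
        echo "qping -info $host $port execd 1"
    done
}
```

Source the file and call qping_commands with the path to your qmaster messages file (the location depends on your spool setup), then run the printed commands by hand. As far as I recall, qping -dump has to be run as root on the host where the daemon being observed is running.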

 - Ron



--- On Thu, 10/2/08, Rayson Ho <rayrayson at gmail.com> wrote:
> The error messages are saying that qmaster can't contact
> the execution host. (See: max_unheard in sge_conf(5),
> and also see reschedule_unknown to make sure that jobs
> get restarted correctly.)
> 
> So it seems that it's a problem with the execution host(s).
> For starters: how different is the configuration (OS, network
> segment, DNS)? Then, after checking the network, DNS, and other
> obvious issues, the place to start would be the execd.
> 
> When this problem happens again, log onto the execution host.
> - Is execd running?
> - If it is, is a shepherd running?
> 
> Also, attach a debugger to see if execd or shepherd is hanging
> somewhere (for example, stuck trying to read an NFS partition).
> 
> And if execd is not running, see if there is a core file. Or you
> may want to restart execd, attach a debugger right away, and then
> let the host accept jobs; sooner or later you should be able to
> reproduce the problem...
> 
> Rayson
> 
> 
> 
> On Wed, Oct 1, 2008 at 8:34 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> > And a couple more lines of interest, all from qmaster:
> >
> > 10/01/2008 20:24:17| timer|shakespeare|W|failed to deliver job 3265.1 to queue "all.q@grass.nci.nih.gov"
> > 10/01/2008 20:24:17| timer|shakespeare|E|got max. unheard timeout for target "execd" on host "grass.nci.nih.gov", can't deliver job "3265"
> >
> > The eight jobs before this one went into "run" status, one completed,
> > and the next one was job 3265; it remains in "transfer" status.
> >
> > Sean
> >
> >> Thanks, Rayson.  This looks suspicious.  I'm not sure what to do with
> >> this.  How does one end up with an unknown queue?  The timing was such
> >> that I had submitted several jobs for testing to one of the machines
> >> in question (i.e., qsub -q all.q@machine sleeper.sh).
> >>
> >> Sean
> >>
> >>>>
> >>>> Thanks,
> >>>> Sean
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>>> For additional commands, e-mail: users-help at gridengine.sunsource.net


      




