[GE users] jobs in queue always going to "transfer" status

Rayson Ho rayrayson at gmail.com
Thu Oct 2 01:57:54 BST 2008



The error messages say that qmaster can't contact the execution
host. (See max_unheard in sge_conf(5), and also check
reschedule_unknown to make sure that jobs get restarted correctly.)
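
To see what those two settings currently are, something like the following could help. It is only a rough sketch: it assumes you have saved the text printed by "qconf -sconf" (one "name value" pair per line, with a "#global:" header), and the helper names parse_sge_conf and heartbeat_settings are made up for illustration. It does not handle values continued over several lines with a backslash.

```python
def parse_sge_conf(text):
    """Parse 'name value' lines as printed by `qconf -sconf` (assumed format)."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#global:' style header.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def heartbeat_settings(text):
    """Return only the two parameters mentioned above (None if absent)."""
    conf = parse_sge_conf(text)
    return {key: conf.get(key) for key in ("max_unheard", "reschedule_unknown")}
```

Feed heartbeat_settings() the saved output and compare the values with what sge_conf(5) describes.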

So it looks like a problem with the execution host(s). For
starters: how different is their configuration (OS, network segment,
DNS) from the other hosts? Then, after ruling out the network, DNS,
and other obvious issues, the place to start would be the execd.
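
For the DNS part, a quick forward/reverse lookup run on both the qmaster host and the execution host can show whether the two sides agree on names. This is a generic sketch using only the Python standard library, not a Grid Engine tool, and check_dns is a hypothetical helper name:

```python
import socket


def check_dns(hostname):
    """Resolve hostname to an address, then resolve that address back to a name."""
    result = {"hostname": hostname, "forward": None,
              "reverse": None, "consistent": False}
    try:
        addr = socket.gethostbyname(hostname)
        result["forward"] = addr
        rev_name, aliases, _ = socket.gethostbyaddr(addr)
        result["reverse"] = rev_name
        # Consistent if the reverse lookup yields the name we started with.
        names = {n.lower() for n in [rev_name] + aliases}
        result["consistent"] = hostname.lower() in names
    except OSError as exc:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        result["error"] = str(exc)
    return result
```

If the result differs between the two hosts, or "consistent" is False for the name that qmaster uses, that would be worth sorting out before looking at the execd.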

When this problem happens again, log onto the execution host.
- See if execd is running.
- If it is, check whether a shepherd is running.
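
The two checks above can be scripted roughly as follows. This sketch assumes the output of "ps -eo pid,comm" has been captured on the execution host and that the daemons show up under names starting with sge_execd and sge_shepherd; daemon_status is a hypothetical helper name.

```python
def daemon_status(ps_output, names=("sge_execd", "sge_shepherd")):
    """Map each daemon name to the PIDs found in `ps -eo pid,comm` output."""
    found = {name: [] for name in names}
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # Skip the header line and anything that is not 'PID COMMAND'.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, comm = int(parts[0]), parts[1].strip()
        for name in names:
            if comm.startswith(name):
                found[name].append(pid)
    return found
```

An empty list for sge_execd means the daemon is gone. An empty list for sge_shepherd is only interesting while a job is supposed to be starting or running on that host.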

Also, attach a debugger to see whether execd or the shepherd is hanging
somewhere (for example, stuck trying to read an NFS partition).

And if execd is not running, see if there is a core file. Or you may
want to restart execd, attach a debugger right away, and then let
the host accept jobs; sooner or later you should be able to
reproduce the problem.

Rayson



On Wed, Oct 1, 2008 at 8:34 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> And a couple more lines of interest, all from qmaster:
>
> 10/01/2008 20:24:17| timer|shakespeare|W|failed to deliver job 3265.1
> to queue "all.q at grass.nci.nih.gov"
> 10/01/2008 20:24:17| timer|shakespeare|E|got max. unheard timeout for
> target "execd" on host "grass.nci.nih.gov", can't deliver job "3265"
>
> The eight jobs before this one went into "run" status, one completed,
> and the next one was job 3265; it remains in "transfer" status.
>
> Sean
>
>> Thanks, Rayson.  This looks suspicious.  I'm not sure what to do with
>> this.  How does one end up with an unknown queue?  The timing was such
>> that I had submitted several jobs for testing to one of the machines
>> in question (i.e., qsub -q all.q at machine sleeper.sh).
>>
>> Sean
>>
>>>>
>>>> Thanks,
>>>> Sean
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>
>>>>




