[GE users] jobs in queue always going to "transfer" status

Sean Davis sdavis2 at mail.nih.gov
Thu Oct 2 19:06:58 BST 2008


On Wed, Oct 1, 2008 at 8:57 PM, Rayson Ho <rayrayson at gmail.com> wrote:
> The error messages are saying that qmaster can't contact the execution
> host. (See: max_unheard in sge_conf(5), and also see
> reschedule_unknown to make sure that jobs get restarted correctly.)
>
> So it seems that the problem is with the execution host(s). For
> starters: how different is the configuration (OS, network segment,
> DNS)? Then, after checking the network, DNS, and other obvious
> issues, the place to start would be the execd.
>
> When this problem happens again, log onto the execution host.
> - See if execd is running.
> - If it is, check if shepherd is running.
>
> Also, attach a debugger to see if execd or shepherd is hanging
> somewhere (e.g., stuck trying to read an NFS partition).
>
> And if execd is not running, see if there is a core file. Or, you may
> want to restart execd, attach a debugger right away, and then let
> the host accept jobs; sooner or later you should be able to
> reproduce the problem.
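
For anyone wanting to script the first two checks suggested above (is execd
running, is a shepherd running), here is a minimal sketch. It assumes a Linux
host with /proc mounted and that the daemons show up under the usual binary
names sge_execd and sge_shepherd; the function name find_processes is just
made up for illustration and is not part of Grid Engine.

```python
import os


def find_processes(names, proc_root="/proc"):
    """Return {name: [pids]} for processes whose command name is in names."""
    found = {name: [] for name in names}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are process directories
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                stat = fh.read()
        except OSError:
            continue  # process exited in the meantime, or not readable
        # /proc/<pid>/stat looks like "pid (comm) state ..."; take the
        # text between the first "(" and the last ")" as the command name.
        comm = stat[stat.find("(") + 1:stat.rfind(")")]
        if comm in found:
            found[comm].append(int(entry))
    return found


if __name__ == "__main__":
    for name, pids in find_processes(["sge_execd", "sge_shepherd"]).items():
        if pids:
            print(f"{name}: running, pid(s) {pids}")
        else:
            print(f"{name}: not running")
```

Run on the execution host while a job is stuck in "transfer", this would show
whether execd has died and whether any shepherd was ever started for the job.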

So, here is a backtrace from gdb attached to the execd on the machine.

0x00007fbcfbd8d05d in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
(gdb) conti
Continuing.
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fbcfc4fa6f0 (LWP 18677)]
0x00007fbcfba5b5c5 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fbcfba5b5c5 in raise () from /lib64/libc.so.6
#1  0x00007fbcfba5cbb3 in abort () from /lib64/libc.so.6
#2  0x00007fbcfba541e9 in __assert_fail () from /lib64/libc.so.6
#3  0x00007fbcfad91613 in ber_flush2 () from /usr/lib64/liblber-2.4.so.2
#4  0x00007fbcfafbb34c in ldap_int_flush_request ()
   from /usr/lib64/libldap-2.4.so.2
#5  0x00007fbcfafbb75f in ldap_send_server_request ()
   from /usr/lib64/libldap-2.4.so.2
#6  0x00007fbcfafbba10 in ldap_send_initial_request ()
   from /usr/lib64/libldap-2.4.so.2
#7  0x00007fbcfafab360 in ldap_search () from /usr/lib64/libldap-2.4.so.2
#8  0x00007fbcfafab47a in ldap_search_st () from /usr/lib64/libldap-2.4.so.2
#9  0x00007fbcfb1e4703 in ?? () from /lib64/libnss_ldap.so.2
#10 0x00007fbcfb1e3a13 in ?? () from /lib64/libnss_ldap.so.2
#11 0x00007fbcfb1e44ce in ?? () from /lib64/libnss_ldap.so.2
#12 0x00007fbcfb1e4b5f in ?? () from /lib64/libnss_ldap.so.2
#13 0x00007fbcfb1e5197 in _nss_ldap_getpwnam_r () from /lib64/libnss_ldap.so.2
#14 0x00007fbcfb61814b in ?? () from /lib64/libnss_compat.so.2
#15 0x00007fbcfb618417 in _nss_compat_getpwnam_r ()
   from /lib64/libnss_compat.so.2
#16 0x00007fbcfbaca01d in getpwnam_r () from /lib64/libc.so.6
#17 0x000000000050a3cc in sge_getpwnam_r ()
#18 0x00000000004280de in sge_exec_job ()
#19 0x000000000042e60c in exec_job_or_task ()
#20 0x000000000042e160 in sge_start_jobs ()
#21 0x000000000042def0 in do_ck_to_do ()
#22 0x0000000000427835 in sge_execd_process_messages ()
#23 0x0000000000424b6d in main ()
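
The trace shows the abort happening inside the user lookup (getpwnam_r going
through nss_compat and nss_ldap into libldap), not in Grid Engine code itself.
One way to narrow this down might be to do the same kind of lookup outside of
execd. Below is a minimal sketch using Python's pwd module, which resolves
users through the same NSS configuration; the helper name lookup_user and the
default user "root" are only placeholders, so pass the name of the job owner
instead.

```python
import pwd
import sys


def lookup_user(name):
    """Resolve a user through NSS; return the passwd entry, or None."""
    try:
        return pwd.getpwnam(name)
    except KeyError:
        return None  # the user is not known to any configured NSS source


if __name__ == "__main__":
    # User names come from the command line; "root" is only a fallback.
    for name in sys.argv[1:] or ["root"]:
        entry = lookup_user(name)
        if entry is None:
            print(f"{name}: not found")
        else:
            print(f"{name}: uid={entry.pw_uid} gid={entry.pw_gid} "
                  f"home={entry.pw_dir}")
```

If this works reliably from a shell on the execution host, that would suggest
the LDAP setup itself is fine and that the assertion is specific to lookups
made from within the long-running execd process. If the script also dies with
SIGABRT, the problem is likely in the nss_ldap/OpenLDAP configuration on that
host rather than in Grid Engine.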

I didn't mention that we are running openSUSE 11 on this machine.

uname -a
Linux mahfouz 2.6.25.16-0.1-default #1 SMP 2008-08-21 00:34:25 +0200
x86_64 x86_64 x86_64 GNU/Linux

And the glibc version is 2.8, if I recall.

Any other ideas before I try to compile a debugging version with some
print statements?

Thanks,
Sean


> On Wed, Oct 1, 2008 at 8:34 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>> And a couple more lines of interest, all from qmaster:
>>
>> 10/01/2008 20:24:17| timer|shakespeare|W|failed to deliver job 3265.1
>> to queue "all.q at grass.nci.nih.gov"
>> 10/01/2008 20:24:17| timer|shakespeare|E|got max. unheard timeout for
>> target "execd" on host "grass.nci.nih.gov", can't deliver job "3265"
>>
>> The eight jobs before this one went into "run" status, one completed,
>> and the next one was job 3265; it remains in "transfer" status.
>>
>> Sean
>>
>>> Thanks, Rayson.  This looks suspicious.  I'm not sure what to do with
>>> this.  How does one end up with an unknown queue?  The timing was such
>>> that I had submitted several jobs for testing to one of the machines
>>> in question (i.e., qsub -q all.q at machine sleeper.sh).
>>>
>>> Sean
>>>
>>>>>
>>>>> Thanks,
>>>>> Sean
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
