[GE users] jobs in queue always going to "transfer" status

Sean Davis sdavis2 at mail.nih.gov
Fri Oct 3 21:01:59 BST 2008


On Thu, Oct 2, 2008 at 3:06 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> On Thu, Oct 2, 2008 at 2:59 PM, Rayson Ho <rayrayson at gmail.com> wrote:
>> On 10/2/08, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>> The machine is using openldap-2.4.9.  It looks like this bug was fixed
>>> some time ago (unless it has reemerged), or am I reading the bug
>>> report incorrectly?
>>
>> Actually, I simply googled the stack trace and found that bug... Other
>> OpenSUSE 11 users also reported the same problem:
>>
>> http://lists.opensuse.org/opensuse-bugs/2008-07/msg03377.html
>>
>> I think it was finally fixed in 2.4.9-1:
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=484802
>>
>> If you upgrade to OpenLDAP 2.4.9-1 and the problem is still there, you
>> may want to contact the OpenLDAP mailing list directly, as it seems to
>> be an OpenLDAP issue rather than an SGE issue.
>
> Thanks for doing all my homework for me.  I'll try to fix the openldap
> issue and hope that does it.

I upgraded OpenLDAP as you suggested and have been hammering the
affected machines for the past few hours with no issues.  Thanks,
Rayson.

Sean
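
For anyone who wants to repeat this kind of test without going through
SGE: the backtrace quoted below shows the abort happening inside a passwd
lookup (getpwnam_r -> nss_compat -> nss_ldap -> libldap), so a rough
approximation is to resolve a user name in a tight loop. The following is
only a minimal sketch, not what was actually run on these machines. The
function name hammer_lookups and its parameters are made up for
illustration. Python's pwd.getpwnam goes through libc's getpwnam family,
so it should use whatever sources nsswitch.conf configures. A running
nscd may answer from its cache and hide the problem.

```python
import pwd
import time


def hammer_lookups(username, iterations=1000, lookup=pwd.getpwnam):
    """Resolve `username` repeatedly via `lookup`.

    Returns (successes, failures, elapsed_seconds). `lookup` defaults to
    pwd.getpwnam; it is a parameter only so the loop can be exercised
    without a real passwd backend.
    """
    successes = 0
    failures = 0
    start = time.monotonic()
    for _ in range(iterations):
        try:
            lookup(username)
            successes += 1
        except KeyError:
            # pwd.getpwnam raises KeyError when no entry is found.
            failures += 1
    elapsed = time.monotonic() - start
    return successes, failures, elapsed
```

Called as, say, hammer_lookups("someuser", 100000) with a user name that
lives in LDAP ("someuser" is a placeholder). Note that an assertion
failure inside libldap like the one in the backtrace would abort the
whole interpreter with SIGABRT rather than raise a Python exception, so
a crash of the script is itself the signal to look for.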


>>> > On 10/2/08, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>> >> Program received signal SIGABRT, Aborted.
>>> >> [Switching to Thread 0x7fbcfc4fa6f0 (LWP 18677)]
>>> >> 0x00007fbcfba5b5c5 in raise () from /lib64/libc.so.6
>>> >> (gdb) bt
>>> >> #0  0x00007fbcfba5b5c5 in raise () from /lib64/libc.so.6
>>> >> #1  0x00007fbcfba5cbb3 in abort () from /lib64/libc.so.6
>>> >> #2  0x00007fbcfba541e9 in __assert_fail () from /lib64/libc.so.6
>>> >> #3  0x00007fbcfad91613 in ber_flush2 () from /usr/lib64/liblber-2.4.so.2
>>> >> #4  0x00007fbcfafbb34c in ldap_int_flush_request ()
>>> >>   from /usr/lib64/libldap-2.4.so.2
>>> >> #5  0x00007fbcfafbb75f in ldap_send_server_request ()
>>> >>   from /usr/lib64/libldap-2.4.so.2
>>> >> #6  0x00007fbcfafbba10 in ldap_send_initial_request ()
>>> >>   from /usr/lib64/libldap-2.4.so.2
>>> >> #7  0x00007fbcfafab360 in ldap_search () from /usr/lib64/libldap-2.4.so.2
>>> >> #8  0x00007fbcfafab47a in ldap_search_st () from /usr/lib64/libldap-2.4.so.2
>>> >> #9  0x00007fbcfb1e4703 in ?? () from /lib64/libnss_ldap.so.2
>>> >> #10 0x00007fbcfb1e3a13 in ?? () from /lib64/libnss_ldap.so.2
>>> >> #11 0x00007fbcfb1e44ce in ?? () from /lib64/libnss_ldap.so.2
>>> >> #12 0x00007fbcfb1e4b5f in ?? () from /lib64/libnss_ldap.so.2
>>> >> #13 0x00007fbcfb1e5197 in _nss_ldap_getpwnam_r () from /lib64/libnss_ldap.so.2
>>> >> #14 0x00007fbcfb61814b in ?? () from /lib64/libnss_compat.so.2
>>> >> #15 0x00007fbcfb618417 in _nss_compat_getpwnam_r ()
>>> >>   from /lib64/libnss_compat.so.2
>>> >> #16 0x00007fbcfbaca01d in getpwnam_r () from /lib64/libc.so.6
>>> >> #17 0x000000000050a3cc in sge_getpwnam_r ()
>>> >> #18 0x00000000004280de in sge_exec_job ()
>>> >> #19 0x000000000042e60c in exec_job_or_task ()
>>> >> #20 0x000000000042e160 in sge_start_jobs ()
>>> >> #21 0x000000000042def0 in do_ck_to_do ()
>>> >> #22 0x0000000000427835 in sge_execd_process_messages ()
>>> >> #23 0x0000000000424b6d in main ()
>>> >>
>>> >> I didn't mention that we are running openSUSE 11 on this machine.
>>> >>
>>> >> uname -a
>>> >> Linux mahfouz 2.6.25.16-0.1-default #1 SMP 2008-08-21 00:34:25 +0200
>>> >> x86_64 x86_64 x86_64 GNU/Linux
>>> >>
>>> >> And the libc major version is 2.8, if I recall.
>>> >>
>>> >> Any other ideas before I try to compile a debugging version with some
>>> >> print statements?
>>> >>
>>> >> Thanks,
>>> >> Sean
>>> >>
>>> >>
>>> >> > On Wed, Oct 1, 2008 at 8:34 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>> >> >> And a couple more lines of interest, all from qmaster:
>>> >> >>
>>> >> >> 10/01/2008 20:24:17| timer|shakespeare|W|failed to deliver job 3265.1
>>> >> >> to queue "all.q@grass.nci.nih.gov"
>>> >> >> 10/01/2008 20:24:17| timer|shakespeare|E|got max. unheard timeout for
>>> >> >> target "execd" on host "grass.nci.nih.gov", can't deliver job "3265"
>>> >> >>
>>> >> >> The eight jobs before this one went into "run" status, one completed,
>>> >> >> and the next one was job 3265; it remains in "transfer" status.
>>> >> >>
>>> >> >> Sean
>>> >> >>
>>> >> >>> Thanks, Rayson.  This looks suspicious.  I'm not sure what to do with
>>> >> >>> this.  How does one end up with an unknown queue?  The timing was such
>>> >> >>> that I had submitted several jobs for testing to one of the machines
>>> >> >>> in question (i.e., qsub -q all.q@machine sleeper.sh).
>>> >> >>>
>>> >> >>> Sean
>>> >> >>>
>>> >> >>>>>
>>> >> >>>>> Thanks,
>>> >> >>>>> Sean
>>> >> >>>>>
>>> >> >>>>> ---------------------------------------------------------------------
>>> >> >>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> >> >>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net