[GE users] exit status = 10 pe_start = 134

henk h.a.slim at durham.ac.uk
Thu May 27 14:52:10 BST 2010


Reuti,

> This SIGABRT was already on the list, but it seems not to be
> persistent. It even hit us some time ago. But with exactly one job, and
> resubmitting the job was successful. So we didn't pay much attention to
> it.

The problem now appears more often and with different jobs. Sometimes it
happens immediately, sometimes after a 72-slot job has run for ~24 hrs. On
which list was the SIGABRT reported? On what architecture and with what
versions of gridengine and MPI did you see this problem?

Thanks

Henk




> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: 11 May 2010 10:24
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] exit status = 10 pe_start = 134
>
> On 07.05.2010 at 18:40, l_heck wrote:
>
> > I got the following output in the .e file of a failing batch job:
> >
> > ======= Backtrace: =========
> > /lib64/libc.so.6[0x7f2261894118]
> > /lib64/libc.so.6(cfree+0x76)[0x7f2261895c76]
> > /lib64/libnsl.so.1[0x7f22610e0c89]
> > /lib64/libpthread.so.0(pthread_once+0x53)[0x7f2261b84ed3]
> > /lib64/libnsl.so.1(_nsl_default_nss+0x21)[0x7f22610e0df1]
> > /lib64/libnss_nis.so.2(_nss_nis_initgroups_dyn+0x6a)[0x7f22612f0cca]
> > /lib64/libc.so.6[0x7f22618be618]
> > /lib64/libc.so.6(initgroups+0x75)[0x7f22618be7f5]
> > sge_shepherd-558[0x52049d]
> > sge_shepherd-558(sge_set_uid_gid_addgrp+0x7c)[0x52038c]
> > sge_shepherd-558(son+0x4fc)[0x44fdfc]
> > sge_shepherd-558(__strtod_internal+0x10ea)[0x44c542]
> > sge_shepherd-558(main+0x622)[0x44bfe2]
> > /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f226183e586]
> > sge_shepherd-558(tcsetattr+0x92)[0x44b8ea]
> > ======= Memory map: ========
> > 00400000-0063e000 r-xp 00000000 3cf:a4500 87071869 /hpsfs/Cluster-Apps/sge/6.2u5/bin/lx24-amd64/sge_shepherd
> > 0073d000-00775000 rwxp 0023d000 3cf:a4500 87071869 /hpsfs/Cluster-Apps/sge/6.2u5/bin/lx24-amd64/sge_shepherd
> > 00775000-007a1000 rwxp 00775000 00:00 0 [heap]
> > 7f2260eb9000-7f2260ecf000 r-xp 00000000 08:03 306062 /lib64/libgcc_s.so.1
> > 7f2260ecf000-7f22610cf000 ---p 00016000 08:03 306062 /lib64/libgcc_s.so.1
> > 7f22610cf000-7f22610d0000 r-xp 00016000 08:03 306062 /lib64/libgcc_s.so.1
> > 7f22610d0000-7f22610d1000 rwxp 00017000 08:03 306062 /lib64/libgcc_s.so.1
> > 7f22610d1000-7f22610e6000 r-xp 00000000 08:03 305953 /lib64/libnsl-2.9.so
> > 7f22610e6000-7f22612e5000 ---p 00015000 08:03 305953 /lib64/libnsl-2.9.so
> > 7f22612e5000-7f22612e6000 r-xp 00014000 08:03 305953 /lib64/libnsl-2.9.so
> > 7f22612e6000-7f22612e7000 rwxp 00015000 08:03 305953 /lib64/libnsl-2.9.so
> > 7f22612e7000-7f22612e9000 rwxp 7f22612e7000 00:00 0
> > 7f22612e9000-7f22612f3000 r-xp 00000000 08:03 306054 /lib64/libnss_nis-2.9.so
> > 7f22612f3000-7f22614f2000 ---p 0000a000 08:03 306054 /lib64/libnss_nis-2.9.so
> > 7f22614f2000-7f22614f3000 r-xp 00009000 08:03 306054 /lib64/libnss_nis-2.9.so
> > 7f22614f3000-7f22614f4000 rwxp 0000a000 08:03 306054 /lib64/libnss_nis-2.9.so
> > 7f22614f4000-7f22614ff000 r-xp 00000000 08:03 306046 /lib64/libnss_files-2.9.so
> > 7f22614ff000-7f22616fe000 ---p 0000b000 08:03 306046 /lib64/libnss_files-2.9.so
> > 7f22616fe000-7f22616ff000 r-xp 0000a000 08:03 306046 /lib64/libnss_files-2.9.so
> > 7f22616ff000-7f2261700000 rwxp 0000b000 08:03 306046 /lib64/libnss_files-2.9.so
> > 7f2261700000-7f2261800000 rwxp 7f2261700000 00:00 0
> > 7f2261820000-7f226196f000 r-xp 00000000 08:03 306050 /lib64/libc-2.9.so
> > 7f226196f000-7f2261b6f000 ---p 0014f000 08:03 306050 /lib64/libc-2.9.so
> > 7f2261b6f000-7f2261b73000 r-xp 0014f000 08:03 306050 /lib64/libc-2.9.so
> > 7f2261b73000-7f2261b74000 rwxp 00153000 08:03 306050 /lib64/libc-2.9.so
> > 7f2261b74000-7f2261b79000 rwxp 7f2261b74000 00:00 0
> > 7f2261b79000-7f2261b8f000 r-xp 00000000 08:03 306034 /lib64/libpthread-2.9.so
> > 7f2261b8f000-7f2261d8f000 ---p 00016000 08:03 306034 /lib64/libpthread-2.9.so
> > 7f2261d8f000-7f2261d90000 r-xp 00016000 08:03 306034 /lib64/libpthread-2.9.so
> > 7f2261d90000-7f2261d91000 rwxp 00017000 08:03 306034 /lib64/libpthread-2.9.so
> > 7f2261d91000-7f2261d95000 rwxp 7f2261d91000 00:00 0
> > 7f2261d95000-7f2261dea000 r-xp 00000000 08:03 305951 /lib64/libm-2.9.so
> > 7f2261dea000-7f2261fe9000 ---p 00055000 08:03 305951 /lib64/libm-2.9.so
> > 7f2261fe9000-7f2261fea000 r-xp 00054000 08:03 305951 /lib64/libm-2.9.so
> > 7f2261fea000-7f2261feb000 rwxp 00055000 08:03 305951 /lib64/libm-2.9.so
> > 7f2261feb000-7f2261fed000 r-xp 00000000 08:03 305921 /lib64/libdl-2.9.so
> > 7f2261fed000-7f22621ed000 ---p 00002000 08:03 305921 /lib64/libdl-2.9.so
> > 7f22621ed000-7f22621ee000 r-xp 00002000 08:03 305921 /lib64/libdl-2.9.so
> > 7f22621ee000-7f22621ef000 rwxp 00003000 08:03 305921 /lib64/libdl-2.9.so
> > 7f22621ef000-7f226220d000 r-xp 00000000 08:03 306070 /lib64/ld-2.9.so
> > 7f22623d4000-7f22623d7000 rwxp 7f22623d4000 00:00 0
> > 7f2262404000-7f226240c000 rwxp 7f2262404000 00:00 0
> > 7f226240c000-7f226240d000 r-xp 0001d000 08:03 306070 /lib64/ld-2.9.so
> > 7f226240d000-7f226240e000 rwxp 0001e000 08:03 306070 /lib64/ld-2.9.so
> > 7fff6a3f1000-7fff6a40d000 rwxp 7ffffffe3000 00:00 0 [stack]
> > 7fff6a45a000-7fff6a45b000 r-xp 7fff6a45a000 00:00 0 [vdso]
> > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
> > *** glibc detected *** sge_shepherd-558: free(): invalid pointer: 0x00007f8def843100 ***
> >
> > --------------------------------------------------------------------------
>
> This SIGABRT was already on the list, but it seems not to be
> persistent. It even hit us some time ago. But with exactly one job, and
> resubmitting the job was successful. So we didn't pay much attention to
> it.
>
> -- Reuti
>
>
> > This might shed light on the issue
> >
> > On Fri, 7 May 2010, henk wrote:
> >
> >> These are the settings in the pe
> >>
> >> start_proc_args    /bin/true
> >> stop_proc_args     /bin/true
> >> allocation_rule    $fill_up
> >> control_slaves     TRUE
> >> job_is_first_task  FALSE
> >>
> >> I believe this indicates the absence of start or stop operations? I also
> >> use these in another cluster. I'll change the posix_compliant setting.
> >>
> >> Thanks
> >>
> >>> -----Original Message-----
> >>> From: reuti [mailto:reuti at staff.uni-marburg.de]
> >>> Sent: 07 May 2010 17:18
> >>> To: users at gridengine.sunsource.net
> >>> Subject: Re: [GE users] exit status = 10 pe_start = 134
> >>>
> >>> On 07.05.2010 at 18:12, henk wrote:
> >>>
> >>>> Hi Reuti
> >>>>
> >>>>> When you use Open MPI, there is no need for any start procedure of the
> >>>>> PE. Did you define one anyway as you use the same PE also for other
> >>>>> types of jobs?
> >>>>
> >>>> No, there is no start procedure. The queue is basically the default
> >>>> all.q with adjustment for the pe and the number of slots:
> >>>>
> >>>> shell_start_mode      posix_compliant
> >>>
> >>> Often unix_behavior is more appropriate as it honors the first line of
> >>> the script.
> >>>
> >>>
> >>>> starter_method        NONE
> >>>> suspend_method        NONE
> >>>> resume_method         NONE
> >>>> terminate_method      NONE
> >>>
> >>> No, in the PE:
> >>>
> >>> $ qconf -sp orte
> >>>
> >>> (or whatever you call it)
> >>>
> >>> -- Reuti
> >>>
> >>>
> >>>> I also tried the debug options with the dl utility but I don't think
> >>>> this gives more information for this kind of problem?
> >>>>
> >>>> Thanks
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
> >>>>> Sent: 07 May 2010 17:07
> >>>>> To: users at gridengine.sunsource.net
> >>>>> Subject: Re: [GE users] exit status = 10 pe_start = 134
> >>>>>
> >>>>> On 07.05.2010 at 17:00, henk wrote:
> >>>>>
> >>>>>> I installed gridengine 6.2u5 and almost all nodes work fine except a
> >>>>>> few where a job generates the following error message:
> >>>>>>
> >>>>>> failed in pestart:05/07/2010 15:14:29 [43532:15077]: exit_status of pe_start = 134
> >>>>>>
> >>>>>> (It's also in the qmaster message file)
> >>>>>>
> >>>>>> and the node message file has this entry
> >>>>>>
> >>>>>> 05/07/2010 15:14:30|  main|cn031|E|shepherd of job 556.1 exited with exit status = 10
> >>>>>
> >>>>> This just says that the PE start procedure failed.
> >>>>>
> >>>>>
> >>>>>> indicating the problem.
> >>>>>>
> >>>>>> I use openmpi-1.4.1. The job is put in the queue again and the queue is
> >>>>>> in the error state. Clearing the error repeats the problem.
> >>>>>>
> >>>>>> Does anyone know what the code 134 means?
> >>>>>
> >>>>> Codes greater than 128 are the sum of 128 plus the number of the
> >>>>> received signal: 134 - 128 = 6, which means SIGABRT in your case. Now
> >>>>> the question is where this signal comes from.
> >>>>>
> >>>>>
> >>>>>
> >>>>> When you use Open MPI, there is no need for any start procedure of the
> >>>>> PE. Did you define one anyway as you use the same PE also for other
> >>>>> types of jobs?
> >>>>>
> >>>>> -- Reuti
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> Henk
> >>>>>>
> >>>>>> ------------------------------------------------------
> >>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=256539
> >>>>>>
> >>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> >>>>>>
> >
> > ------------------------------------------
> > Dr E L  Heck
> >
> > University of Durham
> > Institute for Computational Cosmology
> > Ogden Centre
> > Department of Physics
> > South Road
> >
> > DURHAM, DH1 3LE
> > United Kingdom
> >
> > e-mail: lydia.heck at durham.ac.uk
> >
> > Tel.: + 44 191 - 334 3628
> > Fax.: + 44 191 - 334 3645
> > ___________________________________________
> >
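
As a side note on Reuti's explanation quoted above (exit codes greater than
128 are 128 plus the number of the received signal), the rule can be checked
with a few lines of Python. This is only a sketch: the function name
decode_exit_status is made up for illustration, and the signal names come
from the platform the script runs on, so it should be run on the same kind
of system as the execution host.

```python
import signal


def decode_exit_status(status):
    """Return the signal name if status looks like 128 + signal number.

    Returns None for statuses of 128 or below, or when the number does not
    map to a signal known on this platform.
    """
    if status <= 128:
        return None
    signum = status - 128
    try:
        return signal.Signals(signum).name
    except ValueError:
        # Not a valid signal number on this platform.
        return None


# 134 is the pe_start exit status from the error message in this thread.
print(decode_exit_status(134))
```

On Linux, where signal 6 is SIGABRT, this prints SIGABRT. The shepherd's
own "exit status = 10" is below 128, so the same rule does not apply to it
and the function returns None for it.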

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=259018

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


