[GE users] exit status = 10 pe_start = 134

reuti reuti at staff.uni-marburg.de
Thu May 27 18:37:39 BST 2010


Hi,

Am 27.05.2010 um 15:52 schrieb henk:

> Reuti,
>
>> This SIGABRT was already on the list, but it seems not to be
>> persistent. It even hit us some time ago, but with exactly one job,
>> and resubmitting the job was successful. So we didn't pay much
>> attention to it.
>
> The problem now appears more often and with different jobs. Sometimes
> immediately, sometimes after a 72-slot job has run for ~24 hrs. On what
> list is the SIGABRT? On what architecture and with what versions of
> gridengine and MPI did you see this problem?

I got it once with an Open MPI job (but with $pe_slots for our special case) - hence no "start_proc_args" was necessary - and it happened before the job started at all (it failed before the pe_prolog; exit status = 10 / error 134).

Our platform is AMD64 (Opterons) with OpenSUSE 11.2 and SGE 6.2u4.

But as your job ran for 24 hrs before failing, your issue seems to be different.

Was the job a one-time `mpiexec` run or is it issuing many `mpiexec`s during its lifetime?
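As a side note, the 134 can be decoded mechanically: exit codes above 128 mean the process died from signal (code - 128). A minimal sketch in plain sh (the name printed by `kill -l` may be formatted slightly differently between shells):

```shell
# Decode a shepherd/pe_start exit status: values above 128 mean the
# process was terminated by signal (status - 128).
status=134
if [ "$status" -gt 128 ]; then
  sig=$((status - 128))
  # 'kill -l <number>' prints the signal name; signal 6 is SIGABRT
  echo "terminated by signal $sig ($(kill -l "$sig"))"
else
  echo "normal exit with status $status"
fi
```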

-- Reuti


> Thanks
>
> Henk
>
>> -----Original Message-----
>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: 11 May 2010 10:24
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] exit status = 10 pe_start = 134
>>
>> Am 07.05.2010 um 18:40 schrieb l_heck:
>>
>>> I got the following output in the .e file of a batch failing
>>>
>>> ======= Backtrace: =========
>>> /lib64/libc.so.6[0x7f2261894118]
>>> /lib64/libc.so.6(cfree+0x76)[0x7f2261895c76]
>>> /lib64/libnsl.so.1[0x7f22610e0c89]
>>> /lib64/libpthread.so.0(pthread_once+0x53)[0x7f2261b84ed3]
>>> /lib64/libnsl.so.1(_nsl_default_nss+0x21)[0x7f22610e0df1]
>>> /lib64/libnss_nis.so.2(_nss_nis_initgroups_dyn+0x6a)[0x7f22612f0cca]
>>> /lib64/libc.so.6[0x7f22618be618]
>>> /lib64/libc.so.6(initgroups+0x75)[0x7f22618be7f5]
>>> sge_shepherd-558[0x52049d]
>>> sge_shepherd-558(sge_set_uid_gid_addgrp+0x7c)[0x52038c]
>>> sge_shepherd-558(son+0x4fc)[0x44fdfc]
>>> sge_shepherd-558(__strtod_internal+0x10ea)[0x44c542]
>>> sge_shepherd-558(main+0x622)[0x44bfe2]
>>> /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f226183e586]
>>> sge_shepherd-558(tcsetattr+0x92)[0x44b8ea]
>>> ======= Memory map: ========
>>> 00400000-0063e000 r-xp 00000000 3cf:a4500 87071869
>>> /hpsfs/Cluster-Apps/sge/6.2u5/bin/lx24-amd64/sge_shepherd
>>> 0073d000-00775000 rwxp 0023d000 3cf:a4500 87071869
>>> /hpsfs/Cluster-Apps/sge/6.2u5/bin/lx24-amd64/sge_shepherd
>>> 00775000-007a1000 rwxp 00775000 00:00 0 [heap]
>>> 7f2260eb9000-7f2260ecf000 r-xp 00000000 08:03 306062
>>> /lib64/libgcc_s.so.1
>>> 7f2260ecf000-7f22610cf000 ---p 00016000 08:03 306062
>>> /lib64/libgcc_s.so.1
>>> 7f22610cf000-7f22610d0000 r-xp 00016000 08:03 306062
>>> /lib64/libgcc_s.so.1
>>> 7f22610d0000-7f22610d1000 rwxp 00017000 08:03 306062
>>> /lib64/libgcc_s.so.1
>>> 7f22610d1000-7f22610e6000 r-xp 00000000 08:03 305953
>>> /lib64/libnsl-2.9.so
>>> 7f22610e6000-7f22612e5000 ---p 00015000 08:03 305953
>>> /lib64/libnsl-2.9.so
>>> 7f22612e5000-7f22612e6000 r-xp 00014000 08:03 305953
>>> /lib64/libnsl-2.9.so
>>> 7f22612e6000-7f22612e7000 rwxp 00015000 08:03 305953
>>> /lib64/libnsl-2.9.so
>>> 7f22612e7000-7f22612e9000 rwxp 7f22612e7000 00:00 0
>>> 7f22612e9000-7f22612f3000 r-xp 00000000 08:03 306054
>>> /lib64/libnss_nis-2.9.so
>>> 7f22612f3000-7f22614f2000 ---p 0000a000 08:03 306054
>>> /lib64/libnss_nis-2.9.so
>>> 7f22614f2000-7f22614f3000 r-xp 00009000 08:03 306054
>>> /lib64/libnss_nis-2.9.so
>>> 7f22614f3000-7f22614f4000 rwxp 0000a000 08:03 306054
>>> /lib64/libnss_nis-2.9.so
>>> 7f22614f4000-7f22614ff000 r-xp 00000000 08:03 306046
>>> /lib64/libnss_files-2.9.so
>>> 7f22614ff000-7f22616fe000 ---p 0000b000 08:03 306046
>>> /lib64/libnss_files-2.9.so
>>> 7f22616fe000-7f22616ff000 r-xp 0000a000 08:03 306046
>>> /lib64/libnss_files-2.9.so
>>> 7f22616ff000-7f2261700000 rwxp 0000b000 08:03 306046
>>> /lib64/libnss_files-2.9.so
>>> 7f2261700000-7f2261800000 rwxp 7f2261700000 00:00 0
>>> 7f2261820000-7f226196f000 r-xp 00000000 08:03 306050
>>> /lib64/libc-2.9.so
>>> 7f226196f000-7f2261b6f000 ---p 0014f000 08:03 306050
>>> /lib64/libc-2.9.so
>>> 7f2261b6f000-7f2261b73000 r-xp 0014f000 08:03 306050
>>> /lib64/libc-2.9.so
>>> 7f2261b73000-7f2261b74000 rwxp 00153000 08:03 306050
>>> /lib64/libc-2.9.so
>>> 7f2261b74000-7f2261b79000 rwxp 7f2261b74000 00:00 0
>>> 7f2261b79000-7f2261b8f000 r-xp 00000000 08:03 306034
>>> /lib64/libpthread-2.9.so
>>> 7f2261b8f000-7f2261d8f000 ---p 00016000 08:03 306034
>>> /lib64/libpthread-2.9.so
>>> 7f2261d8f000-7f2261d90000 r-xp 00016000 08:03 306034
>>> /lib64/libpthread-2.9.so
>>> 7f2261d90000-7f2261d91000 rwxp 00017000 08:03 306034
>>> /lib64/libpthread-2.9.so
>>> 7f2261d91000-7f2261d95000 rwxp 7f2261d91000 00:00 0
>>> 7f2261d95000-7f2261dea000 r-xp 00000000 08:03 305951
>>> /lib64/libm-2.9.so
>>> 7f2261dea000-7f2261fe9000 ---p 00055000 08:03 305951
>>> /lib64/libm-2.9.so
>>> 7f2261fe9000-7f2261fea000 r-xp 00054000 08:03 305951
>>> /lib64/libm-2.9.so
>>> 7f2261fea000-7f2261feb000 rwxp 00055000 08:03 305951
>>> /lib64/libm-2.9.so
>>> 7f2261feb000-7f2261fed000 r-xp 00000000 08:03 305921
>>> /lib64/libdl-2.9.so
>>> 7f2261fed000-7f22621ed000 ---p 00002000 08:03 305921
>>> /lib64/libdl-2.9.so
>>> 7f22621ed000-7f22621ee000 r-xp 00002000 08:03 305921
>>> /lib64/libdl-2.9.so
>>> 7f22621ee000-7f22621ef000 rwxp 00003000 08:03 305921
>>> /lib64/libdl-2.9.so
>>> 7f22621ef000-7f226220d000 r-xp 00000000 08:03 306070
>>> /lib64/ld-2.9.so
>>> 7f22623d4000-7f22623d7000 rwxp 7f22623d4000 00:00 0
>>> 7f2262404000-7f226240c000 rwxp 7f2262404000 00:00 0
>>> 7f226240c000-7f226240d000 r-xp 0001d000 08:03 306070
>>> /lib64/ld-2.9.so
>>> 7f226240d000-7f226240e000 rwxp 0001e000 08:03 306070
>>> /lib64/ld-2.9.so
>>> 7fff6a3f1000-7fff6a40d000 rwxp 7ffffffe3000 00:00 0 [stack]
>>> 7fff6a45a000-7fff6a45b000 r-xp 7fff6a45a000 00:00 0 [vdso]
>>> ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0
>>> [vsyscall]
>>> *** glibc detected *** sge_shepherd-558: free(): invalid pointer:
>>> 0x00007f8def843100 ***
>>>
>> ------------------------------------------------------------------------
>>
>> This SIGABRT was already on the list, but it seems not to be
>> persistent. It even hit us some time ago, but with exactly one job,
>> and resubmitting the job was successful. So we didn't pay much
>> attention to it.
>>
>> -- Reuti
>>
>>
>>> This might shed light on the issue
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, 7 May 2010, henk wrote:
>>>
>>>> These are the settings in the pe
>>>>
>>>> start_proc_args    /bin/true
>>>> stop_proc_args     /bin/true
>>>> allocation_rule    $fill_up
>>>> control_slaves     TRUE
>>>> job_is_first_task  FALSE
>>>>
>>>> I believe this indicates the absence of start or stop operations? I
>>>> also use these in another cluster. I'll change the posix_compliant
>>>> setting.
>>>>
>>>> Thanks
>>>>
>>>>> -----Original Message-----
>>>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>>>> Sent: 07 May 2010 17:18
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] exit status = 10 pe_start = 134
>>>>>
>>>>> Am 07.05.2010 um 18:12 schrieb henk:
>>>>>
>>>>>> Hi Reuti
>>>>>>
>>>>>>> When you use Open MPI, there is no need for any start procedure of
>>>>>>> the PE. Did you define one anyway as you use the same PE also for
>>>>>>> other types of jobs?
>>>>>>
>>>>>> No, there is no start procedure. The queue is basically the default
>>>>>> all.q with adjustments for the pe and the number of slots:
>>>>>>
>>>>>> shell_start_mode      posix_compliant
>>>>>
>>>>> Often unix_behavior is more appropriate, as it honors the first
>>>>> line of the script.
>>>>>
>>>>>
>>>>>> starter_method        NONE
>>>>>> suspend_method        NONE
>>>>>> resume_method         NONE
>>>>>> terminate_method      NONE
>>>>>
>>>>> No, in the PE:
>>>>>
>>>>> $ qconf -sp orte
>>>>>
>>>>> (or whatever you call it)
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> I also tried the debug options with the dl utility, but I don't
>>>>>> think this gives more information for this kind of problem?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>>>>>> Sent: 07 May 2010 17:07
>>>>>>> To: users at gridengine.sunsource.net
>>>>>>> Subject: Re: [GE users] exit status = 10 pe_start = 134
>>>>>>>
>>>>>>> Am 07.05.2010 um 17:00 schrieb henk:
>>>>>>>
>>>>>>>> I installed gridengine 6.2u5 and almost all nodes work fine except
>>>>>>>> a few where a job generates the following error message:
>>>>>>>>
>>>>>>>> failed in pestart:05/07/2010 15:14:29 [43532:15077]: exit_status
>>>>>>>> of pe_start = 134
>>>>>>>>
>>>>>>>> (It's also in the qmaster message file)
>>>>>>>>
>>>>>>>> and the node message file has this entry
>>>>>>>>
>>>>>>>> 05/07/2010 15:14:30|  main|cn031|E|shepherd of job 556.1 exited
>>>>>>>> with exit status = 10
>>>>>>>
>>>>>>> This just says that the PE start procedure failed.
>>>>>>>
>>>>>>>
>>>>>>>> indicating the problem.
>>>>>>>>
>>>>>>>> I use openmpi-1.4.1. The job is put in the queue again and the
>>>>>>>> queue is in the error state. Clearing the error repeats the
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> Does anyone know what the code 134 means?
>>>>>>>
>>>>>>> Codes greater than 128 are 128 plus the number of the received
>>>>>>> signal - SIGABRT (134 - 128 = 6) in your case. Now the question:
>>>>>>> where does this signal come from?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> When you use Open MPI, there is no need for any start procedure of
>>>>>>> the PE. Did you define one anyway as you use the same PE also for
>>>>>>> other types of jobs?
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Henk
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>> ------------------------------------------
>>> Dr E L  Heck
>>>
>>> University of Durham
>>> Institute for Computational Cosmology
>>> Ogden Centre
>>> Department of Physics
>>> South Road
>>>
>>> DURHAM, DH1 3LE
>>> United Kingdom
>>>
>>> e-mail: lydia.heck at durham.ac.uk
>>>
>>> Tel.: + 44 191 - 334 3628
>>> Fax.: + 44 191 - 334 3645
>>> ___________________________________________
>>>
>>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=259076

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list