[GE users] exit status = 10 pe_start = 134

reuti reuti at staff.uni-marburg.de
Tue May 11 10:24:03 BST 2010


Am 07.05.2010 um 18:40 schrieb l_heck:

> I got the following output in the .e file of a batch failing
>
> ======= Backtrace: =========
> /lib64/libc.so.6[0x7f2261894118]
> /lib64/libc.so.6(cfree+0x76)[0x7f2261895c76]
> /lib64/libnsl.so.1[0x7f22610e0c89]
> /lib64/libpthread.so.0(pthread_once+0x53)[0x7f2261b84ed3]
> /lib64/libnsl.so.1(_nsl_default_nss+0x21)[0x7f22610e0df1]
> /lib64/libnss_nis.so.2(_nss_nis_initgroups_dyn+0x6a)[0x7f22612f0cca]
> /lib64/libc.so.6[0x7f22618be618]
> /lib64/libc.so.6(initgroups+0x75)[0x7f22618be7f5]
> sge_shepherd-558[0x52049d]
> sge_shepherd-558(sge_set_uid_gid_addgrp+0x7c)[0x52038c]
> sge_shepherd-558(son+0x4fc)[0x44fdfc]
> sge_shepherd-558(__strtod_internal+0x10ea)[0x44c542]
> sge_shepherd-558(main+0x622)[0x44bfe2]
> /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f226183e586]
> sge_shepherd-558(tcsetattr+0x92)[0x44b8ea]
> ======= Memory map: ========
> 00400000-0063e000 r-xp 00000000 3cf:a4500 87071869
> /hpsfs/Cluster-Apps/sge/6.2u5/bin/lx24-amd64/sge_shepherd
> 0073d000-00775000 rwxp 0023d000 3cf:a4500 87071869
> /hpsfs/Cluster-Apps/sge/6.2u5/bin/lx24-amd64/sge_shepherd
> 00775000-007a1000 rwxp 00775000 00:00 0                                  [heap]
> 7f2260eb9000-7f2260ecf000 r-xp 00000000 08:03 306062
> /lib64/libgcc_s.so.1
> 7f2260ecf000-7f22610cf000 ---p 00016000 08:03 306062
> /lib64/libgcc_s.so.1
> 7f22610cf000-7f22610d0000 r-xp 00016000 08:03 306062
> /lib64/libgcc_s.so.1
> 7f22610d0000-7f22610d1000 rwxp 00017000 08:03 306062
> /lib64/libgcc_s.so.1
> 7f22610d1000-7f22610e6000 r-xp 00000000 08:03 305953
> /lib64/libnsl-2.9.so
> 7f22610e6000-7f22612e5000 ---p 00015000 08:03 305953
> /lib64/libnsl-2.9.so
> 7f22612e5000-7f22612e6000 r-xp 00014000 08:03 305953
> /lib64/libnsl-2.9.so
> 7f22612e6000-7f22612e7000 rwxp 00015000 08:03 305953
> /lib64/libnsl-2.9.so
> 7f22612e7000-7f22612e9000 rwxp 7f22612e7000 00:00 0
> 7f22612e9000-7f22612f3000 r-xp 00000000 08:03 306054
> /lib64/libnss_nis-2.9.so
> 7f22612f3000-7f22614f2000 ---p 0000a000 08:03 306054
> /lib64/libnss_nis-2.9.so
> 7f22614f2000-7f22614f3000 r-xp 00009000 08:03 306054
> /lib64/libnss_nis-2.9.so
> 7f22614f3000-7f22614f4000 rwxp 0000a000 08:03 306054
> /lib64/libnss_nis-2.9.so
> 7f22614f4000-7f22614ff000 r-xp 00000000 08:03 306046
> /lib64/libnss_files-2.9.so
> 7f22614ff000-7f22616fe000 ---p 0000b000 08:03 306046
> /lib64/libnss_files-2.9.so
> 7f22616fe000-7f22616ff000 r-xp 0000a000 08:03 306046
> /lib64/libnss_files-2.9.so
> 7f22616ff000-7f2261700000 rwxp 0000b000 08:03 306046
> /lib64/libnss_files-2.9.so
> 7f2261700000-7f2261800000 rwxp 7f2261700000 00:00 0
> 7f2261820000-7f226196f000 r-xp 00000000 08:03 306050
> /lib64/libc-2.9.so
> 7f226196f000-7f2261b6f000 ---p 0014f000 08:03 306050
> /lib64/libc-2.9.so
> 7f2261b6f000-7f2261b73000 r-xp 0014f000 08:03 306050
> /lib64/libc-2.9.so
> 7f2261b73000-7f2261b74000 rwxp 00153000 08:03 306050
> /lib64/libc-2.9.so
> 7f2261b74000-7f2261b79000 rwxp 7f2261b74000 00:00 0
> 7f2261b79000-7f2261b8f000 r-xp 00000000 08:03 306034
> /lib64/libpthread-2.9.so
> 7f2261b8f000-7f2261d8f000 ---p 00016000 08:03 306034
> /lib64/libpthread-2.9.so
> 7f2261d8f000-7f2261d90000 r-xp 00016000 08:03 306034
> /lib64/libpthread-2.9.so
> 7f2261d90000-7f2261d91000 rwxp 00017000 08:03 306034
> /lib64/libpthread-2.9.so
> 7f2261d91000-7f2261d95000 rwxp 7f2261d91000 00:00 0
> 7f2261d95000-7f2261dea000 r-xp 00000000 08:03 305951
> /lib64/libm-2.9.so
> 7f2261dea000-7f2261fe9000 ---p 00055000 08:03 305951
> /lib64/libm-2.9.so
> 7f2261fe9000-7f2261fea000 r-xp 00054000 08:03 305951
> /lib64/libm-2.9.so
> 7f2261fea000-7f2261feb000 rwxp 00055000 08:03 305951
> /lib64/libm-2.9.so
> 7f2261feb000-7f2261fed000 r-xp 00000000 08:03 305921
> /lib64/libdl-2.9.so
> 7f2261fed000-7f22621ed000 ---p 00002000 08:03 305921
> /lib64/libdl-2.9.so
> 7f22621ed000-7f22621ee000 r-xp 00002000 08:03 305921
> /lib64/libdl-2.9.so
> 7f22621ee000-7f22621ef000 rwxp 00003000 08:03 305921
> /lib64/libdl-2.9.so
> 7f22621ef000-7f226220d000 r-xp 00000000 08:03 306070
> /lib64/ld-2.9.so
> 7f22623d4000-7f22623d7000 rwxp 7f22623d4000 00:00 0
> 7f2262404000-7f226240c000 rwxp 7f2262404000 00:00 0
> 7f226240c000-7f226240d000 r-xp 0001d000 08:03 306070
> /lib64/ld-2.9.so
> 7f226240d000-7f226240e000 rwxp 0001e000 08:03 306070
> /lib64/ld-2.9.so
> 7fff6a3f1000-7fff6a40d000 rwxp 7ffffffe3000 00:00 0                      [stack]
> 7fff6a45a000-7fff6a45b000 r-xp 7fff6a45a000 00:00 0                      [vdso]
> ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0
> [vsyscall]
> *** glibc detected *** sge_shepherd-558: free(): invalid pointer:
> 0x00007f8def843100 ***
> --------------------------------------------------------------------------

This SIGABRT was already reported on the list, but it does not seem to be persistent. It hit us some time ago as well, but only for exactly one job, and resubmitting the job was successful, so we didn't pay much attention to it.

-- Reuti


> This might shed light on the issue
>
> On Fri, 7 May 2010, henk wrote:
>
>> These are the settings in the pe
>>
>> start_proc_args    /bin/true
>> stop_proc_args     /bin/true
>> allocation_rule    $fill_up
>> control_slaves     TRUE
>> job_is_first_task  FALSE
>>
>> I believe this indicates the absence of any start or stop operations? I also
>> use these settings in another cluster. I'll change the posix_compliant
>> setting.
>>
>> Thanks
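
[For what it's worth: with start_proc_args and stop_proc_args set to /bin/true the shepherd still runs a (trivial) pe_start/pe_stop step, whereas NONE skips the step entirely. A minimal sketch, assuming a PE named "mpi" (a placeholder, not the configuration discussed here), of how this could be inspected and changed:

    $ qconf -sp mpi        # show the current PE definition
    $ qconf -mp mpi        # edit it; setting
                           #   start_proc_args    NONE
                           #   stop_proc_args     NONE
                           # removes the pe_start/pe_stop step altogether]
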
>>
>>> -----Original Message-----
>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: 07 May 2010 17:18
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] exit status = 10 pe_start = 134
>>>
>>> Am 07.05.2010 um 18:12 schrieb henk:
>>>
>>>> Hi Reuti
>>>>
>>>>> When you use Open MPI, there is no need for any start procedure of the
>>>>> PE. Did you define one anyway as you use the same PE also for other
>>>>> types of jobs?
>>>>
>>>> No, there is no start procedure. The queue is basically the default
>>>> all.q with adjustment for the pe and the number of slots:
>>>>
>>>> shell_start_mode      posix_compliant
>>>
>>> Often unix_behavior is more appropriate as it honors the first line of
>>> the script.
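
[A small illustration of the difference (script contents and queue name are only examples): with shell_start_mode unix_behavior the interpreter is taken from the script's first line, while posix_compliant uses the shell configured for the queue instead.

    $ cat job.sh
    #!/bin/bash
    #$ -N mpitest
    #$ -pe orte 8
    #$ -cwd
    mpirun -np $NSLOTS ./a.out

    $ qconf -mq all.q      # change shell_start_mode there if desired]
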
>>>
>>>
>>>> starter_method        NONE
>>>> suspend_method        NONE
>>>> resume_method         NONE
>>>> terminate_method      NONE
>>>
>>> No, in the PE:
>>>
>>> $ qconf -sp orte
>>>
>>> (or whatever you call it)
>>>
>>> -- Reuti
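
[For reference, a sketch of what qconf -sp might print for a tight-integration Open MPI PE with no start/stop procedures; the name and slot count are only examples, not the setup discussed in this thread:

    pe_name            orte
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    NONE
    stop_proc_args     NONE
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE]
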
>>>
>>>
>>>> I also tried the debug options with the dl utility but I don't think
>>>> this gives more information for this kind of problem?
>>>>
>>>> Thanks
>>>>
>>>>> -----Original Message-----
>>>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>>>> Sent: 07 May 2010 17:07
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] exit status = 10 pe_start = 134
>>>>>
>>>>> Am 07.05.2010 um 17:00 schrieb henk:
>>>>>
>>>>>> I installed gridengine 6.2u5 and almost all nodes work fine except a
>>>>>> few where a job generates the following error message:
>>>>>>
>>>>>> failed in pestart:05/07/2010 15:14:29 [43532:15077]: exit_status of
>>>>>> pe_start = 134
>>>>>>
>>>>>> (It's also in the qmaster message file)
>>>>>>
>>>>>> and the node message file has this entry
>>>>>>
>>>>>> 05/07/2010 15:14:30|  main|cn031|E|shepherd of job 556.1 exited with
>>>>>> exit status = 10
>>>>>
>>>>> This just says that the PE start procedure failed.
>>>>>
>>>>>
>>>>>> indicating the problem.
>>>>>>
>>>>>> I use openmpi-1.4.1. The job is put in the queue again and the queue is
>>>>>> in the error state. Clearing the error repeats the problem.
>>>>>>
>>>>>> Does anyone know what the code 134 means?
>>>>>
>>>>> Codes greater than 128 are 128 plus the number of the received signal,
>>>>> which means SIGABRT in your case. Now the question is: where does this
>>>>> signal come from?
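
[On a typical Linux node the exit status can be translated back into a signal name directly; for the 134 from the message above:

    $ kill -l $((134 - 128))
    ABRT]
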
>>>>>
>>>>>
>>>>>
>>>>> When you use Open MPI, there is no need for any start procedure of the
>>>>> PE. Did you define one anyway as you use the same PE also for other
>>>>> types of jobs?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Henk
>>>>>>
> ------------------------------------------
> Dr E L  Heck
>
> University of Durham
> Institute for Computational Cosmology
> Ogden Centre
> Department of Physics
> South Road
>
> DURHAM, DH1 3LE
> United Kingdom
>
> e-mail: lydia.heck at durham.ac.uk
>
> Tel.: + 44 191 - 334 3628
> Fax.: + 44 191 - 334 3645
> ___________________________________________
>
