[GE users] exit status = 10 pe_start = 134

l_heck lydia.heck at durham.ac.uk
Fri May 7 17:40:24 BST 2010


I got the following output in the .e file of a failing batch job:

======= Backtrace: =========
/lib64/libc.so.6[0x7f2261894118]
/lib64/libc.so.6(cfree+0x76)[0x7f2261895c76]
/lib64/libnsl.so.1[0x7f22610e0c89]
/lib64/libpthread.so.0(pthread_once+0x53)[0x7f2261b84ed3]
/lib64/libnsl.so.1(_nsl_default_nss+0x21)[0x7f22610e0df1]
/lib64/libnss_nis.so.2(_nss_nis_initgroups_dyn+0x6a)[0x7f22612f0cca]
/lib64/libc.so.6[0x7f22618be618]
/lib64/libc.so.6(initgroups+0x75)[0x7f22618be7f5]
sge_shepherd-558[0x52049d]
sge_shepherd-558(sge_set_uid_gid_addgrp+0x7c)[0x52038c]
sge_shepherd-558(son+0x4fc)[0x44fdfc]
sge_shepherd-558(__strtod_internal+0x10ea)[0x44c542]
sge_shepherd-558(main+0x622)[0x44bfe2]
/lib64/libc.so.6(__libc_start_main+0xe6)[0x7f226183e586]
sge_shepherd-558(tcsetattr+0x92)[0x44b8ea]
======= Memory map: ========
00400000-0063e000 r-xp 00000000 3cf:a4500 87071869 
/hpsfs/Cluster-Apps/sge/6.2u5/bin/lx24-amd64/sge_shepherd
0073d000-00775000 rwxp 0023d000 3cf:a4500 87071869 
/hpsfs/Cluster-Apps/sge/6.2u5/bin/lx24-amd64/sge_shepherd
00775000-007a1000 rwxp 00775000 00:00 0                                  [heap]
7f2260eb9000-7f2260ecf000 r-xp 00000000 08:03 306062 
/lib64/libgcc_s.so.1
7f2260ecf000-7f22610cf000 ---p 00016000 08:03 306062 
/lib64/libgcc_s.so.1
7f22610cf000-7f22610d0000 r-xp 00016000 08:03 306062 
/lib64/libgcc_s.so.1
7f22610d0000-7f22610d1000 rwxp 00017000 08:03 306062 
/lib64/libgcc_s.so.1
7f22610d1000-7f22610e6000 r-xp 00000000 08:03 305953 
/lib64/libnsl-2.9.so
7f22610e6000-7f22612e5000 ---p 00015000 08:03 305953 
/lib64/libnsl-2.9.so
7f22612e5000-7f22612e6000 r-xp 00014000 08:03 305953 
/lib64/libnsl-2.9.so
7f22612e6000-7f22612e7000 rwxp 00015000 08:03 305953 
/lib64/libnsl-2.9.so
7f22612e7000-7f22612e9000 rwxp 7f22612e7000 00:00 0
7f22612e9000-7f22612f3000 r-xp 00000000 08:03 306054 
/lib64/libnss_nis-2.9.so
7f22612f3000-7f22614f2000 ---p 0000a000 08:03 306054 
/lib64/libnss_nis-2.9.so
7f22614f2000-7f22614f3000 r-xp 00009000 08:03 306054 
/lib64/libnss_nis-2.9.so
7f22614f3000-7f22614f4000 rwxp 0000a000 08:03 306054 
/lib64/libnss_nis-2.9.so
7f22614f4000-7f22614ff000 r-xp 00000000 08:03 306046 
/lib64/libnss_files-2.9.so
7f22614ff000-7f22616fe000 ---p 0000b000 08:03 306046 
/lib64/libnss_files-2.9.so
7f22616fe000-7f22616ff000 r-xp 0000a000 08:03 306046 
/lib64/libnss_files-2.9.so
7f22616ff000-7f2261700000 rwxp 0000b000 08:03 306046 
/lib64/libnss_files-2.9.so
7f2261700000-7f2261800000 rwxp 7f2261700000 00:00 0
7f2261820000-7f226196f000 r-xp 00000000 08:03 306050 
/lib64/libc-2.9.so
7f226196f000-7f2261b6f000 ---p 0014f000 08:03 306050 
/lib64/libc-2.9.so
7f2261b6f000-7f2261b73000 r-xp 0014f000 08:03 306050 
/lib64/libc-2.9.so
7f2261b73000-7f2261b74000 rwxp 00153000 08:03 306050 
/lib64/libc-2.9.so
7f2261b74000-7f2261b79000 rwxp 7f2261b74000 00:00 0
7f2261b79000-7f2261b8f000 r-xp 00000000 08:03 306034 
/lib64/libpthread-2.9.so
7f2261b8f000-7f2261d8f000 ---p 00016000 08:03 306034 
/lib64/libpthread-2.9.so
7f2261d8f000-7f2261d90000 r-xp 00016000 08:03 306034 
/lib64/libpthread-2.9.so
7f2261d90000-7f2261d91000 rwxp 00017000 08:03 306034 
/lib64/libpthread-2.9.so
7f2261d91000-7f2261d95000 rwxp 7f2261d91000 00:00 0
7f2261d95000-7f2261dea000 r-xp 00000000 08:03 305951 
/lib64/libm-2.9.so
7f2261dea000-7f2261fe9000 ---p 00055000 08:03 305951 
/lib64/libm-2.9.so
7f2261fe9000-7f2261fea000 r-xp 00054000 08:03 305951 
/lib64/libm-2.9.so
7f2261fea000-7f2261feb000 rwxp 00055000 08:03 305951 
/lib64/libm-2.9.so
7f2261feb000-7f2261fed000 r-xp 00000000 08:03 305921 
/lib64/libdl-2.9.so
7f2261fed000-7f22621ed000 ---p 00002000 08:03 305921 
/lib64/libdl-2.9.so
7f22621ed000-7f22621ee000 r-xp 00002000 08:03 305921 
/lib64/libdl-2.9.so
7f22621ee000-7f22621ef000 rwxp 00003000 08:03 305921 
/lib64/libdl-2.9.so
7f22621ef000-7f226220d000 r-xp 00000000 08:03 306070 
/lib64/ld-2.9.so
7f22623d4000-7f22623d7000 rwxp 7f22623d4000 00:00 0
7f2262404000-7f226240c000 rwxp 7f2262404000 00:00 0
7f226240c000-7f226240d000 r-xp 0001d000 08:03 306070 
/lib64/ld-2.9.so
7f226240d000-7f226240e000 rwxp 0001e000 08:03 306070 
/lib64/ld-2.9.so
7fff6a3f1000-7fff6a40d000 rwxp 7ffffffe3000 00:00 0                      [stack]
7fff6a45a000-7fff6a45b000 r-xp 7fff6a45a000 00:00 0                      [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 
[vsyscall]
*** glibc detected *** sge_shepherd-558: free(): invalid pointer: 0x00007f8def843100 ***
--------------------------------------------------------------------------

This might shed light on the issue.


On Fri, 7 May 2010, henk wrote:

> These are the settings in the pe
>
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
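
Since start_proc_args and stop_proc_args are just /bin/true here, they can
equally be disabled outright. A minimal sketch, assuming the PE is named
"orte" (the actual PE name on your cluster may differ):

```shell
# Show the current parallel environment settings (PE name assumed):
qconf -sp orte

# Edit the PE and replace the no-op procedures with NONE, i.e. set
#   start_proc_args    NONE
#   stop_proc_args     NONE
qconf -mp orte
```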
>
> I believe this indicates the absence of start or stop operations? I also
> use these settings in another cluster. I'll change the posix_compliant
> setting.
>
> Thanks
>
>> -----Original Message-----
>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: 07 May 2010 17:18
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] exit status = 10 pe_start = 134
>>
>> Am 07.05.2010 um 18:12 schrieb henk:
>>
>>> Hi Reuti
>>>
>>>> When you use Open MPI, there is no need for any start procedure of
>>>> the PE. Did you define one anyway as you use the same PE also for
>>>> other types of jobs?
>>>
>>> No, there is no start procedure. The queue is basically the default
>>> all.q with adjustment for the pe and the number of slots:
>>>
>>> shell_start_mode      posix_compliant
>>
>> Often unix_behavior is more appropriate as it honors the first line of
>> the script.
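
The practical difference can be demonstrated outside SGE: unix_behavior is
analogous to executing the job script directly, so the kernel honors its
#! line, while posix_compliant imposes the queue's configured shell on the
script. A rough sketch (the /tmp path and script content are made up for
illustration):

```shell
# Create a tiny "job script" whose first line names its interpreter.
cat > /tmp/job.sh <<'EOF'
#!/bin/sh
echo "interpreter honored"
EOF
chmod +x /tmp/job.sh

# unix_behavior is analogous to direct execution: the #! line decides
# which interpreter runs the script.
/tmp/job.sh

# posix_compliant is analogous to forcing a fixed shell onto the script,
# ignoring its #! line entirely.
sh /tmp/job.sh
```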
>>
>>
>>> starter_method        NONE
>>> suspend_method        NONE
>>> resume_method         NONE
>>> terminate_method      NONE
>>
>> No, in the PE:
>>
>> $ qconf -sp orte
>>
>> (or whatever you call it)
>>
>> -- Reuti
>>
>>
>>> I also tried the debug options with the dl utility but I don't think
>>> this gives more information for this kind of problem?
>>>
>>> Thanks
>>>
>>>> -----Original Message-----
>>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>>> Sent: 07 May 2010 17:07
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] exit status = 10 pe_start = 134
>>>>
>>>> Am 07.05.2010 um 17:00 schrieb henk:
>>>>
>>>>> I installed gridengine 6.2u5 and almost all nodes work fine except
>>>>> a few where a job generates the following error message:
>>>>>
>>>>> failed in pestart:05/07/2010 15:14:29 [43532:15077]: exit_status of
>>>>> pe_start = 134
>>>>>
>>>>> (It's also in the qmaster message file)
>>>>>
>>>>> and the node message file has this entry
>>>>>
>>>>> 05/07/2010 15:14:30|  main|cn031|E|shepherd of job 556.1 exited
>>>>> with exit status = 10
>>>>
>>>> This just says that the PE start procedure failed.
>>>>
>>>>
>>>>> indicating the problem.
>>>>>
>>>>> I use openmpi-1.4.1. The job is put in the queue again and the
>>>>> queue is in the error state. Clearing the error repeats the problem.
>>>>>
>>>>> Does anyone know what the code 134 means?
>>>>
>>>> Codes greater than 128 are 128 plus the number of the received
>>>> signal, which means SIGABRT in your case. Now the question is where
>>>> this signal comes from.
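
Reuti's arithmetic can be checked in any POSIX shell: a process killed by
SIGABRT (signal number 6) is reported with exit status 128 + 6 = 134:

```shell
# Run a subshell that sends itself SIGABRT (signal number 6).
sh -c 'kill -ABRT $$'
# The parent shell reports the exit status as 128 + 6 = 134.
echo $?
```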
>>>>
>>>>
>>>>
>>>> When you use Open MPI, there is no need for any start procedure of
>>>> the PE. Did you define one anyway as you use the same PE also for
>>>> other types of jobs?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Henk
>>>>>
>>>>> ------------------------------------------------------
>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=256539
>>>>>
>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>>

------------------------------------------
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.heck at durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___________________________________________
