[GE users] SGE-6.2u3: error reason 1: exit_status of pe_start = 134

andy andy.schwierskott at sun.com
Wed May 19 08:45:01 BST 2010


Erik,

On Linux and Solaris, an "exit status" of 134 means the process was killed
by signal "ABRT" (signal 6): the status is reported as 128 + the signal
number, and 128 + 6 = 134.
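
The arithmetic above can be checked with a short Python sketch. The helper
name signal_from_exit_status is made up for illustration, and it assumes the
usual shell convention that a status above 128 encodes a signal number:

```python
import signal


def signal_from_exit_status(status):
    """Return the signal number encoded in a shell-style exit status.

    Assumes the 128 + signal-number convention; returns None when the
    status does not fall into that range.
    """
    if 128 < status < 128 + signal.NSIG:
        return status - 128
    return None


sig = signal_from_exit_status(134)
# Look up the symbolic name of the decoded signal on this platform.
print(sig, signal.Signals(sig).name)
```

A status of 128 or below is treated as an ordinary exit code, so the helper
returns None for it.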

There is a known issue in SGE: the prolog, epilog and pe* methods are
started with the job limits applied. Could an accidentally small job limit,
e.g. "1k" instead of "1g", have caused the exec() of the shell or binary to
die immediately? (I think /bin/true is at least sometimes a shell script.)
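
One way to test this hypothesis outside of SGE is to start the command under
a deliberately tiny limit and look at how it ends. The following is only a
sketch: it assumes the limit in question behaves like an address-space cap
(RLIMIT_AS), which this thread does not confirm, and the helper name
run_with_address_space_limit is made up for illustration. How the command
fails under such a limit depends on the operating system.

```python
import resource
import subprocess
import sys


def run_with_address_space_limit(cmd, limit_bytes=None):
    """Run cmd, optionally under an RLIMIT_AS cap, and return its return code.

    A negative value -N means the child was killed by signal N.
    None means the command could not be started at all.
    """
    def apply_limit():
        # Runs in the child between fork() and exec().
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

    try:
        proc = subprocess.run(
            cmd,
            preexec_fn=apply_limit if limit_bytes is not None else None,
        )
    except (OSError, subprocess.SubprocessError) as exc:
        print(f"could not start {cmd}: {exc}", file=sys.stderr)
        return None
    return proc.returncode
```

Comparing run_with_address_space_limit(["/bin/true"]) with
run_with_address_space_limit(["/bin/true"], 1024) on the execution host
shows whether a "1k"-sized limit alone is enough to make the command fail
there.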

Why not simply set the pe* methods to "NONE"?

There are certainly other reasons why /bin/true can fail.

Andy

On Wed, 19 May 2010, soyez wrote:

> Good day,
>
> does anybody know why PEs can put queues into state "Error" even if
> the pe_start-file is "/bin/true"?  The usual suspects automounter,
> NFS, home directories, etc. all work very well.  Is it a known bug
> or only a misleading error message?
>
> Thanks, Erik Soyez.
>
> ------------------------------------------------------------------------
>  				:
>  				:
> parallel environment:  abaqus range: 4
> error reason    1:          05/18/2010 15:01:52 [0:22789]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:02:07 [0:21269]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:02:24 [0:5181]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:02:38 [0:6302]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:09:52 [0:22909]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:10:09 [0:5235]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:14:11 [0:22945]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:14:28 [0:5267]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:32:35 [0:21558]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:32:51 [0:6603]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:46:49 [0:21699]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:47:05 [0:6733]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:47:34 [0:21703]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:47:51 [0:5644]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:48:49 [0:21709]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:49:05 [0:6744]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:50:34 [0:21720]: exit_status of pe_start = 134
>                  1:          05/18/2010 15:50:51 [0:5661]: exit_status of pe_start = 134
>  				:
>  				:
> ------------------------------------------------------------------------
>
>
> ------------------------------------------------------------------------
> pe_name            abaqus
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> ------------------------------------------------------------------------
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=257830
