[GE users] sge_shepherd segfaults

templedf dan.templeton at sun.com
Tue Mar 9 00:09:15 GMT 2010


Well, it does say that the "job signaled: 11".
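
For what it's worth, the two numbers in question (11 and 139) can be compared by decoding them with the standard wait-status macros. The following is only a sketch, assuming the traditional Linux status encoding; the helper name `describe_wait_status` is made up for illustration and is not part of Grid Engine.

```python
import os
import signal


def describe_wait_status(status: int) -> str:
    """Decode a raw wait()/wait3() status word using the W* macros."""
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        # The core-dump flag is a separate bit (0x80) next to the signal number.
        core = " (core dumped)" if os.WCOREDUMP(status) else ""
        return f"killed by signal {sig} ({signal.Signals(sig).name}){core}"
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    return "stopped or continued"


# Raw status reported in the shepherd trace file.
print(describe_wait_status(11))
# Raw status with the core-dump bit set as well (0x80 | 11).
print(describe_wait_status(139))
# A normal exit with code 139, which is how a shell reports "128 + signal".
print(describe_wait_status(139 << 8))
```

Under that assumption, a raw status of 11 already means "terminated by signal 11", which matches WIFSIGNALED: 1 in the trace. 139 would show up either as a shell-style exit code or as the raw status when a core file was actually written.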

Daniel

On 03/08/10 16:05, reuti wrote:
> Am 09.03.2010 um 00:50 schrieb templedf:
>
>> I can see from the trace file that the segfault isn't coming from the
>> shepherd.  It's coming from your job after it's forked.  At that point,
>> the job is executing as the submitting user.  Where the core file lands
>> depends on your OS and the configuration.  The job's working directory
>> is a good first place to check.  By default that's the submitting user's
>> home directory, unless the user specified somewhere else.
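
Since the job runs as the submitting user, the core limit that matters is the one the job process itself sees, not the one set for sgeadmin. A rough sketch of how that could be checked from inside the job follows. It assumes the job can run a Python script; `core_dump_settings` is a hypothetical helper, and the `/proc` path is Linux-specific.

```python
import resource
from pathlib import Path


def core_dump_settings() -> dict:
    """Collect the settings that decide whether and where a core file is written."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # Linux only: the kernel's naming pattern for core files.
    pattern_file = Path("/proc/sys/kernel/core_pattern")
    try:
        pattern = pattern_file.read_text().strip()
    except OSError:
        pattern = None  # not Linux, or /proc is not readable
    return {
        "soft_limit": soft,
        "hard_limit": hard,
        "core_pattern": pattern,
        "cwd": str(Path.cwd()),
    }


def format_limit(limit: int) -> str:
    """Render an rlimit value, spelling out the 'unlimited' sentinel."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


settings = core_dump_settings()
print("core soft limit:", format_limit(settings["soft_limit"]))
print("core hard limit:", format_limit(settings["hard_limit"]))
print("core_pattern:", settings["core_pattern"])
print("working directory:", settings["cwd"])
```

A soft limit of 0 would mean no core file is written at all, whatever the working directory is.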
>
> I thought the same. The interesting thing is of course:
>
>>> <snip>
>>> 03/05/2010 15:54:19 [1000:22139]: wait3 returned 22145 (status: 11; WIFSIGNALED: 1, WIFEXITED: 0, WEXITSTATUS: 0)
>
> If it's a SIGSEGV, the status code should be 139 (128 + 11).
>
> Is this something other than Linux?
>
> -- Reuti
>
>
>>> 03/05/2010 15:54:19 [1000:22139]: job exited with exit status 0
>>> 03/05/2010 15:54:19 [1000:22139]: reaped "job" with pid 22145
>>> 03/05/2010 15:54:19 [1000:22139]: job exited due to signal
>>> 03/05/2010 15:54:19 [1000:22139]: job signaled: 11
>>> 03/05/2010 15:54:19 [1000:22139]: now sending signal KILL to pid -22145
>>> 03/05/2010 15:54:19 [1000:22139]: writing usage file to "usage"
>>> 03/05/2010 15:54:19 [1000:22139]: no tasker to notify
>>> 03/05/2010 15:54:19 [1000:22139]: no epilog script to start
>>>
>>>
>>> Re-running sge_shepherd in the job directory didn't show any problem.
>>> However, the original problem is 100% reproducible with that particular
>>> type of job.
>>>
>>> I would like to examine a dumped core. I enabled dumping cores for the
>>> user that runs the gridengine (sgeadmin), but I am not sure at what
>>> stage the core might be dumped, under which uid and in which directory.
>>> Are there any recommendations on obtaining a core dump from
>>> sge_shepherd?
>>>
>>> Thank you.
>>> Serge.
>>>
>>>
>>>
>>> On Fri, Mar 5, 2010 at 1:30 PM, Daniel Templeton <Dan.Templeton at sun.com> wrote:
>>>
>>>     You just run "sge_shepherd" in the job directory.  It will figure
>>>     everything else out from the files in that directory.  Like I said,
>>>     though, have a look in the error and trace files first.
>>>
>>>     Daniel
>>>
>>>
>>>     On 03/05/10 13:19, snosov wrote:
>>>
>>>         Thanks, Daniel,
>>>
>>>         I will try to do that. One question, though. How do I manually
>>>         run sge_shepherd and tell it which job to process? The man page
>>>         only talks about exit values and that the program should not be
>>>         run manually.
>>>
>>>         Thank you,
>>>         Serge.
>>>
>>>
>>>         On Fri, Mar 5, 2010 at 12:12 PM, Daniel Templeton
>>>         <Dan.Templeton at sun.com> wrote:
>>>
>>>             The way to debug the problem would be to set KEEP_ACTIVE to TRUE
>>>             in the execd_params and then run a job.  After the job fails, go
>>>             to the <execd_spool_dir>/active_jobs/<jobid>.1 directory and run
>>>             sge_shepherd in a debugger.  I guess before you do that you should
>>>             look in the error and trace files in that directory.
>>>
>>>             Daniel
>>>
>>>
>>>
>>>
>>>
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=247594
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>
