[GE users] sge_shepherd segfaults

reuti reuti at staff.uni-marburg.de
Tue Mar 9 00:05:25 GMT 2010


On 09.03.2010 at 00:50, templedf wrote:

> I can see from the trace file that the segfault isn't coming from the
> shepherd.  It's coming from your job after it's forked.  At that point,
> the job is executing as the submitting user.  Where the core file lands
> depends on your OS and the configuration.  The job's working directory
> is a good first place to check.  By default that's the submitting user's
> home directory, unless the user specified somewhere else.

I thought the same. The interesting thing, of course, is:

>> <snip>
>> 03/05/2010 15:54:19 [1000:22139]: wait3 returned 22145 (status: 11; WIFSIGNALED: 1, WIFEXITED: 0, WEXITSTATUS: 0)

If it were a SIGSEGV, the status code should be 139 (128 + 11).
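
A minimal sketch (assuming the trace line shows the raw wait3() status) of how the two numbers relate: a raw status of 11 decodes as "terminated by signal 11", while 139 would only show up as a normal exit code, e.g. a shell reporting a killed child via the 128 + signal convention:

/* Minimal sketch (assumption: the trace prints the raw wait3() status). */
#include <stdio.h>
#include <sys/wait.h>

static void decode(int status)
{
    if (WIFEXITED(status))
        printf("raw status %d: exited normally, exit code %d\n",
               status, WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        printf("raw status %d: terminated by signal %d\n",
               status, WTERMSIG(status));
}

int main(void)
{
    decode(11);        /* what the trace shows: the job itself got SIGSEGV */
    decode(139 << 8);  /* a shell exiting normally with $? = 139 (128 + 11) */
    return 0;
}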

Is it running on anything other than Linux?
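
If it is Linux, a small sketch (my assumption: the usual RLIMIT_CORE and /proc/sys/kernel/core_pattern interfaces) of the two settings that decide whether and where the forked job's core is written:

/* Sketch for Linux only (assumption): print the core size limit and the
 * kernel's core_pattern.  Together they decide whether a core is written
 * and where; a relative pattern is resolved against the working directory
 * of the crashing process, i.e. the job's cwd, not the shepherd's. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) == 0)
        printf("RLIMIT_CORE soft limit: %s\n",
               rl.rlim_cur == 0 ? "0 (no core will be written)" :
               rl.rlim_cur == RLIM_INFINITY ? "unlimited" : "finite");

    FILE *fp = fopen("/proc/sys/kernel/core_pattern", "r");
    if (fp) {
        char pattern[256];
        if (fgets(pattern, sizeof(pattern), fp))
            printf("core_pattern: %s", pattern);
        fclose(fp);
    }
    return 0;
}

Since the segfault happens in the job after the fork, the limit that matters is the one the job inherits as the submitting user, not the one set for sgeadmin.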

-- Reuti


>> 03/05/2010 15:54:19 [1000:22139]: job exited with exit status 0
>> 03/05/2010 15:54:19 [1000:22139]: reaped "job" with pid 22145
>> 03/05/2010 15:54:19 [1000:22139]: job exited due to signal
>> 03/05/2010 15:54:19 [1000:22139]: job signaled: 11
>> 03/05/2010 15:54:19 [1000:22139]: now sending signal KILL to pid -22145
>> 03/05/2010 15:54:19 [1000:22139]: writing usage file to "usage"
>> 03/05/2010 15:54:19 [1000:22139]: no tasker to notify
>> 03/05/2010 15:54:19 [1000:22139]: no epilog script to start
>>
>>
>> Re-running sge_shepherd in the job directory didn't show any problem.
>> However, the original problem is 100% reproducible with that particular
>> type of job.
>>
>> I would like to examine a dumped core. I enabled dumping cores for the
>> user that runs the gridengine (sgeadmin), but I am not sure at what
>> stage the core might be dumped, under which uid and in which directory.
>> Are there any recommendations on obtaining a core dump from
>> sge_shepherd?
>>
>> Thank you.
>> Serge.
>>
>>
>>
>> On Fri, Mar 5, 2010 at 1:30 PM, Daniel Templeton <Dan.Templeton at sun.com> wrote:
>>
>>    You just run "sge_shepherd" in the job directory.  It will figure
>>    everything else out from the files in that directory.  Like I said,
>>    though, have a look in the error and trace files first.
>>
>>    Daniel
>>
>>
>>    On 03/05/10 13:19, snosov wrote:
>>
>>        Thanks, Daniel,
>>
>>        I will try to do that. One question, though. How do I manually
>>        run sge_shepherd and tell it which job to process? The man page
>>        only talks about exit values and says that the program should not
>>        be run manually.
>>
>>        Thank you,
>>        Serge.
>>
>>
>>        On Fri, Mar 5, 2010 at 12:12 PM, Daniel Templeton
>>        <Dan.Templeton at sun.com> wrote:
>>
>>            The way to debug the problem would be to set KEEP_ACTIVE to TRUE
>>            in the execd_params and then run a job.  After the job fails, go
>>            to the <execd_spool_dir>/active_jobs/<jobid>.1 directory and run
>>            sge_shepherd in a debugger.  I guess before you do that you should
>>            look in the error and trace files in that directory.
>>
>>            Daniel
>>
>>
>>
>>
>>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=247599

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


