[GE users] sge_shepherd segfaults

templedf dan.templeton at sun.com
Mon Mar 8 23:50:44 GMT 2010


I can see from the trace file that the segfault isn't coming from the 
shepherd.  It's coming from your job after it's forked.  At that point, 
the job is executing as the submitting user.  Where the core file lands 
depends on your OS and the configuration.  The job's working directory 
is a good first place to check.  By default that's the submitting user's 
home directory, unless the user specified a different one.

Daniel

On 03/08/10 15:34, snosov wrote:
> Below is the trace file for one such job where sge_shepherd
> segfaulted. It appears that sge_shepherd segfaults when it tries to
> spawn a new shell for the job. However, it dies before the shell gets to
> .profile.
>
> 03/05/2010 15:54:19 [1000:22139]: shepherd called with uid = 0, euid = 1000
> 03/05/2010 15:54:19 [1000:22139]: starting up 6.2u5
> 03/05/2010 15:54:19 [1000:22139]: setpgid(22139, 22139) returned 0
> 03/05/2010 15:54:19 [1000:22139]: do_core_binding: "binding" parameter
> not found in config file
> 03/05/2010 15:54:19 [1000:22139]: parent: forked "prolog" with pid 22140
> 03/05/2010 15:54:19 [1000:22139]: using signal delivery delay of 120 seconds
> 03/05/2010 15:54:19 [1000:22139]: parent: prolog-pid: 22140
> 03/05/2010 15:54:19 [1000:22140]: child: starting son(prolog,
> /opt/prolog.sh, 0);
> 03/05/2010 15:54:19 [1000:22140]: pid=22140 pgrp=22140 sid=22140 old
> pgrp=22139 getlogin()=<no login set>
> 03/05/2010 15:54:19 [1000:22140]: reading passwd information for user 'user'
> 03/05/2010 15:54:19 [1000:22140]: setting limits
> 03/05/2010 15:54:19 [1000:22140]: setting environment
> 03/05/2010 15:54:19 [1000:22140]: Initializing error file
> 03/05/2010 15:54:19 [1000:22140]: switching to intermediate/target user
> 03/05/2010 15:54:19 [550:22140]: closing all filedescriptors
> 03/05/2010 15:54:19 [550:22140]: further messages are in "error" and "trace"
> 03/05/2010 15:54:19 [550:22140]: using "/bin/bash" as shell of user "user"
> 03/05/2010 15:54:19 [550:22140]: using stdout as stderr
> 03/05/2010 15:54:19 [550:22140]: now running with uid=550, euid=550
> 03/05/2010 15:54:19 [550:22140]: execvp(/opt/prolog.sh, "/opt/prolog.sh")
> 03/05/2010 15:54:19 [1000:22139]: wait3 returned 22140 (status: 0;
> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> 03/05/2010 15:54:19 [1000:22139]: prolog exited with exit status 0
> 03/05/2010 15:54:19 [1000:22139]: reaped "prolog" with pid 22140
> 03/05/2010 15:54:19 [1000:22139]: prolog exited not due to signal
> 03/05/2010 15:54:19 [1000:22139]: prolog exited with status 0
> 03/05/2010 15:54:19 [1000:22139]: parent: forked "job" with pid 22145
> 03/05/2010 15:54:19 [1000:22139]: parent: job-pid: 22145
> 03/05/2010 15:54:19 [1000:22145]: child: starting son(job,
> /home/user/worker.pl, 0);
> 03/05/2010 15:54:19 [1000:22145]: pid=22145 pgrp=22145 sid=22145 old
> pgrp=22139 getlogin()=<no login set>
> 03/05/2010 15:54:19 [1000:22145]: reading passwd information for user 'user'
> 03/05/2010 15:54:19 [1000:22145]: setosjobid: uid = 0, euid = 1000
> 03/05/2010 15:54:19 [1000:22145]: setting limits
> 03/05/2010 15:54:19 [1000:22145]: RLIMIT_CPU setting: (soft INFINITY
> hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 03/05/2010 15:54:19 [1000:22145]: RLIMIT_FSIZE setting: (soft INFINITY
> hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 03/05/2010 15:54:19 [1000:22145]: RLIMIT_DATA setting: (soft INFINITY
> hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 03/05/2010 15:54:19 [1000:22145]: RLIMIT_STACK setting: (soft INFINITY
> hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 03/05/2010 15:54:19 [1000:22145]: RLIMIT_CORE setting: (soft INFINITY
> hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 03/05/2010 15:54:19 [1000:22145]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
> INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 03/05/2010 15:54:19 [1000:22145]: RLIMIT_RSS setting: (soft INFINITY
> hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 03/05/2010 15:54:19 [1000:22145]: setting environment
> 03/05/2010 15:54:19 [1000:22145]: Initializing error file
> 03/05/2010 15:54:19 [1000:22145]: switching to intermediate/target user
> 03/05/2010 15:54:19 [550:22145]: closing all filedescriptors
> 03/05/2010 15:54:19 [550:22145]: further messages are in "error" and "trace"
> 03/05/2010 15:54:19 [550:22145]: using stdout as stderr
> 03/05/2010 15:54:19 [550:22145]: now running with uid=550, euid=550
> 03/05/2010 15:54:19 [1000:22139]: wait3 returned 22145 (status: 11;
> WIFSIGNALED: 1,  WIFEXITED: 0, WEXITSTATUS: 0)
> 03/05/2010 15:54:19 [1000:22139]: job exited with exit status 0
> 03/05/2010 15:54:19 [1000:22139]: reaped "job" with pid 22145
> 03/05/2010 15:54:19 [1000:22139]: job exited due to signal
> 03/05/2010 15:54:19 [1000:22139]: job signaled: 11
> 03/05/2010 15:54:19 [1000:22139]: now sending signal KILL to pid -22145
> 03/05/2010 15:54:19 [1000:22139]: writing usage file to "usage"
> 03/05/2010 15:54:19 [1000:22139]: no tasker to notify
> 03/05/2010 15:54:19 [1000:22139]: no epilog script to start
>
>
> Re-running sge_shepherd in the job directory didn't show any problem.
> However, the original problem is 100% reproducible with that particular
> type of job.
>
> I would like to examine a dumped core. I enabled core dumps for the
> user that runs Grid Engine (sgeadmin), but I am not sure at what
> stage the core might be dumped, under which uid, and in which directory.
> Are there any recommendations for obtaining a core dump from sge_shepherd?
>
> Thank you.
> Serge.
>
>
>
> On Fri, Mar 5, 2010 at 1:30 PM, Daniel Templeton
> <Dan.Templeton at sun.com> wrote:
>
>     You just run "sge_shepherd" in the job directory.  It will figure
>     everything else out from the files in that directory.  Like I said,
>     though, have a look in the error and trace files first.
>
>     Daniel
>
>
>     On 03/05/10 13:19, snosov wrote:
>
>         Thanks, Daniel,
>
>         I will try to do that. One question, though. How do I manually
>         run sge_shepherd and tell it which job to process? The man page
>         only talks about exit values and that the program should not be
>         run manually.
>
>         Thank you,
>         Serge.
>
>
>         On Fri, Mar 5, 2010 at 12:12 PM, Daniel Templeton
>         <Dan.Templeton at sun.com>
>         wrote:
>
>             The way to debug the problem would be to set KEEP_ACTIVE to TRUE
>             in the execd_params and then run a job.  After the job fails, go
>             to the <execd_spool_dir>/active_jobs/<jobid>.1 directory and run
>             sge_shepherd in a debugger.  Before you do that, though, you
>             should look in the error and trace files in that directory.
>
>             Daniel
>
>
>
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=247594
