[GE users] sge_shepherd segfaults

snosov serge.nosov2 at gmail.com
Mon Mar 8 23:34:17 GMT 2010



Below is the trace file for one such job where sge_shepherd segfaulted. It appears that sge_shepherd segfaults when it tries to spawn a new shell for the job, and it dies before the shell gets to .profile.

03/05/2010 15:54:19 [1000:22139]: shepherd called with uid = 0, euid = 1000
03/05/2010 15:54:19 [1000:22139]: starting up 6.2u5
03/05/2010 15:54:19 [1000:22139]: setpgid(22139, 22139) returned 0
03/05/2010 15:54:19 [1000:22139]: do_core_binding: "binding" parameter not found in config file
03/05/2010 15:54:19 [1000:22139]: parent: forked "prolog" with pid 22140
03/05/2010 15:54:19 [1000:22139]: using signal delivery delay of 120 seconds
03/05/2010 15:54:19 [1000:22139]: parent: prolog-pid: 22140
03/05/2010 15:54:19 [1000:22140]: child: starting son(prolog, /opt/prolog.sh, 0);
03/05/2010 15:54:19 [1000:22140]: pid=22140 pgrp=22140 sid=22140 old pgrp=22139 getlogin()=<no login set>
03/05/2010 15:54:19 [1000:22140]: reading passwd information for user 'user'
03/05/2010 15:54:19 [1000:22140]: setting limits
03/05/2010 15:54:19 [1000:22140]: setting environment
03/05/2010 15:54:19 [1000:22140]: Initializing error file
03/05/2010 15:54:19 [1000:22140]: switching to intermediate/target user
03/05/2010 15:54:19 [550:22140]: closing all filedescriptors
03/05/2010 15:54:19 [550:22140]: further messages are in "error" and "trace"
03/05/2010 15:54:19 [550:22140]: using "/bin/bash" as shell of user "user"
03/05/2010 15:54:19 [550:22140]: using stdout as stderr
03/05/2010 15:54:19 [550:22140]: now running with uid=550, euid=550
03/05/2010 15:54:19 [550:22140]: execvp(/opt/prolog.sh, "/opt/prolog.sh")
03/05/2010 15:54:19 [1000:22139]: wait3 returned 22140 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
03/05/2010 15:54:19 [1000:22139]: prolog exited with exit status 0
03/05/2010 15:54:19 [1000:22139]: reaped "prolog" with pid 22140
03/05/2010 15:54:19 [1000:22139]: prolog exited not due to signal
03/05/2010 15:54:19 [1000:22139]: prolog exited with status 0
03/05/2010 15:54:19 [1000:22139]: parent: forked "job" with pid 22145
03/05/2010 15:54:19 [1000:22139]: parent: job-pid: 22145
03/05/2010 15:54:19 [1000:22145]: child: starting son(job, /home/user/worker.pl, 0);
03/05/2010 15:54:19 [1000:22145]: pid=22145 pgrp=22145 sid=22145 old pgrp=22139 getlogin()=<no login set>
03/05/2010 15:54:19 [1000:22145]: reading passwd information for user 'user'
03/05/2010 15:54:19 [1000:22145]: setosjobid: uid = 0, euid = 1000
03/05/2010 15:54:19 [1000:22145]: setting limits
03/05/2010 15:54:19 [1000:22145]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
03/05/2010 15:54:19 [1000:22145]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
03/05/2010 15:54:19 [1000:22145]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
03/05/2010 15:54:19 [1000:22145]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
03/05/2010 15:54:19 [1000:22145]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
03/05/2010 15:54:19 [1000:22145]: RLIMIT_VMEM/RLIMIT_AS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
03/05/2010 15:54:19 [1000:22145]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
03/05/2010 15:54:19 [1000:22145]: setting environment
03/05/2010 15:54:19 [1000:22145]: Initializing error file
03/05/2010 15:54:19 [1000:22145]: switching to intermediate/target user
03/05/2010 15:54:19 [550:22145]: closing all filedescriptors
03/05/2010 15:54:19 [550:22145]: further messages are in "error" and "trace"
03/05/2010 15:54:19 [550:22145]: using stdout as stderr
03/05/2010 15:54:19 [550:22145]: now running with uid=550, euid=550
03/05/2010 15:54:19 [1000:22139]: wait3 returned 22145 (status: 11; WIFSIGNALED: 1,  WIFEXITED: 0, WEXITSTATUS: 0)
03/05/2010 15:54:19 [1000:22139]: job exited with exit status 0
03/05/2010 15:54:19 [1000:22139]: reaped "job" with pid 22145
03/05/2010 15:54:19 [1000:22139]: job exited due to signal
03/05/2010 15:54:19 [1000:22139]: job signaled: 11
03/05/2010 15:54:19 [1000:22139]: now sending signal KILL to pid -22145
03/05/2010 15:54:19 [1000:22139]: writing usage file to "usage"
03/05/2010 15:54:19 [1000:22139]: no tasker to notify
03/05/2010 15:54:19 [1000:22139]: no epilog script to start
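
As a reading aid for the trace above: the parent's second wait3 line reports a status of 11 for pid 22145. If that number is the raw wait status (the accompanying "WIFSIGNALED: 1" suggests it is), it decodes to "terminated by signal 11, core-dump flag not set". Below is a minimal Python sketch of that decoding. It assumes a Linux host, where signal 11 is SIGSEGV, and the trace line format shown above; `signaled_children` is a hypothetical helper name, not part of Grid Engine.

```python
import os
import re
import signal

# Matches the parent's "wait3 returned <pid> (status: <n>; ..." trace lines.
WAIT3_RE = re.compile(r"wait3 returned (\d+) \(status: (\d+);")


def signaled_children(trace_lines):
    """Yield (pid, signal_name, core_dumped) for children killed by a signal."""
    for line in trace_lines:
        match = WAIT3_RE.search(line)
        if not match:
            continue
        pid, status = int(match.group(1)), int(match.group(2))
        # Assumption: the logged status is the raw value returned by wait3().
        if os.WIFSIGNALED(status):
            signum = os.WTERMSIG(status)
            yield pid, signal.Signals(signum).name, os.WCOREDUMP(status)


# Two lines copied from the trace above: the prolog and the job.
trace = [
    "03/05/2010 15:54:19 [1000:22139]: wait3 returned 22140 "
    "(status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)",
    "03/05/2010 15:54:19 [1000:22139]: wait3 returned 22145 "
    "(status: 11; WIFSIGNALED: 1,  WIFEXITED: 0, WEXITSTATUS: 0)",
]

for pid, name, core in signaled_children(trace):
    print(pid, name, "core dumped" if core else "no core dumped")
```

Under those assumptions, only pid 22145 is reported, as SIGSEGV with no core dumped. That would mean the kernel did not write a core file for this crash, regardless of which directory one searches.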


Re-running sge_shepherd in the job directory didn't show any problem.
However, the original problem is 100% reproducible with that particular type of job.

I would like to examine a dumped core. I enabled core dumps for the user that runs Grid Engine (sgeadmin), but I am not sure at what stage the core might be dumped, under which uid, and in which directory. Are there any recommendations on obtaining a core dump from sge_shepherd?

Thank you.
Serge.
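
One way to narrow down the core dump question, sketched under the assumption that the execution host runs Linux (the thread does not say). In the trace, the process that receives signal 11 is the forked child (pid 22145) after it has switched from euid 1000 to uid 550, and its RLIMIT_CORE is already INFINITY. On Linux, a process that has changed credentials is additionally governed by the fs.suid_dumpable sysctl, and a relative kernel.core_pattern places the core in the crashing process's current working directory. The helper names below are hypothetical; the script only reads settings and changes nothing.

```python
import resource


def read_sysctl(path):
    """Return the stripped contents of a /proc/sys file, or None if unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        # Not Linux, or the file is not readable by this user.
        return None


def core_dump_settings():
    """Collect the settings that decide whether and where a core file is written."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    return {
        # A soft limit of 0 disables core files; RLIM_INFINITY means unlimited.
        "rlimit_core_soft": soft,
        "rlimit_core_hard": hard,
        # Linux-specific: file name pattern (or pipe handler) for core files.
        "core_pattern": read_sysctl("/proc/sys/kernel/core_pattern"),
        # Linux-specific: 0 suppresses cores for processes that changed uid/gid.
        "suid_dumpable": read_sysctl("/proc/sys/fs/suid_dumpable"),
    }


for key, value in core_dump_settings().items():
    print(f"{key}: {value}")
```

Running this on the execution host as the job user (uid 550 in the trace) rather than as sgeadmin would show the limits closest to those the crashing child sees, although the shepherd sets the job's limits itself, as the RLIMIT lines in the trace show.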



On Fri, Mar 5, 2010 at 1:30 PM, Daniel Templeton <Dan.Templeton at sun.com> wrote:
You just run "sge_shepherd" in the job directory.  It will figure everything else out from the files in that directory.  Like I said, though, have a look in the error and trace files first.

Daniel


On 03/05/10 13:19, snosov wrote:
Thanks, Daniel,

I will try to do that. One question, though. How do I manually run sge_shepherd and tell it which job to process? The man page only talks about exit values and that the program should not be run manually.

Thank you,
Serge.


On Fri, Mar 5, 2010 at 12:12 PM, Daniel Templeton <Dan.Templeton at sun.com> wrote:

   The way to debug the problem would be to set KEEP_ACTIVE to TRUE
   in the execd_params and then run a job.  After the job fails, go
   to the <execd_spool_dir>/active_jobs/<jobid>.1 directory and run
   sge_shepherd in a debugger.  I guess before you do that you should
   look in the error and trace files in that directory.

   Daniel
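
The "look in the error and trace files first" step can be sketched as a small Python helper. It builds the job directory from the layout described above, <execd_spool_dir>/active_jobs/<jobid>.1, and returns the last lines of the `error` and `trace` files if they exist. The function names, the spool path, and the job id are all hypothetical placeholders, not values from this thread.

```python
from pathlib import Path


def active_job_dir(execd_spool_dir, job_id, task_id=1):
    """Build <execd_spool_dir>/active_jobs/<jobid>.<taskid>."""
    return Path(execd_spool_dir) / "active_jobs" / f"{job_id}.{task_id}"


def tail_job_files(job_dir, names=("error", "trace"), lines=20):
    """Return {file name: last `lines` lines} for the files that exist."""
    result = {}
    for name in names:
        path = Path(job_dir) / name
        if path.is_file():
            text = path.read_text(errors="replace")
            result[name] = text.splitlines()[-lines:]
    return result


if __name__ == "__main__":
    # Hypothetical spool directory and job id; substitute the real ones.
    job_dir = active_job_dir("/path/to/execd_spool_dir", 12345)
    for name, tail in tail_job_files(job_dir).items():
        print(f"== {job_dir / name} ==")
        print("\n".join(tail))
```

The directory only survives after the job ends when KEEP_ACTIVE is set in execd_params, so the helper returns an empty dictionary otherwise.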







