[GE users] sge_shepherd problems perhaps connected to nfs problems

Rayson Ho rayrayson at gmail.com
Wed Jun 27 20:48:12 BST 2007


Is process 4294967295 still running??
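
For reference, 4294967295 is 2^32 - 1, which is what a pid argument of -1
looks like when it is printed as an unsigned 32-bit number. I am assuming
the strace build used here prints pids that way on this 64-bit host; the
trace itself does not confirm it. The sketch below reinterprets the two pid
arguments from the trace as signed 32-bit values. The helper name
to_signed32 is made up for illustration.

```shell
#!/bin/sh
# Reinterpret an unsigned 32-bit number (as printed in the trace)
# as a signed 32-bit integer. to_signed32 is a hypothetical helper.
to_signed32() {
    if [ "$1" -ge 2147483648 ]; then
        # Values at or above 2^31 wrap around to negative numbers.
        echo $(( $1 - 4294967296 ))
    else
        echo "$1"
    fi
}

to_signed32 4294967295   # pid argument of wait4() in the trace, prints -1
to_signed32 4294958680   # pid argument of kill() in the trace, prints -8616
```

If that reading is right, wait4(-1, ...) is waiting for any child rather
than for one numbered process, and kill(-8616, SIGKILL) is aimed at process
group 8616. That is one above the shepherd's pid 8615, so it is plausibly
the job's process group, though the trace alone does not prove that.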

Rayson



On 6/27/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
> An strace on the background shepherd process reports:
>
> ps -ef | grep sge
> sge       4191     1  0 Jun25 ?        00:05:32 /opt/gridengine/bin/lx26-amd64/sge_execd
> sge       8615  4191  0 Jun26 ?        00:00:00 sge_shepherd-543 -bg
> root     13055 12922  0 15:40 pts/1    00:00:00 grep sge
> [root@compute-0-1 ~]# /share/apps/strace/strace -p 8615
> Process 8615 attached - interrupt to quit
> wait4(4294967295, 0x7fbfff5cb8, 0, 0x7fbfff5e20) = ? ERESTARTSYS (To be restarted)
> --- SIGTSTP (Stopped) @ 0 (0) ---
> rt_sigreturn(0x14)                      = -1 EINTR (Interrupted system call)
> alarm(0)                                = 0
> fstat(3, {st_mode=S_IFREG|0644, st_size=54458, ...}) = 0
> geteuid()                               = 400
> getuid()                                = 0
> geteuid()                               = 400
> setresuid(-1, 0, -1)                    = 0
> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 50) = 50
> setresuid(-1, 400, -1)                  = 0
> fstat(3, {st_mode=S_IFREG|0644, st_size=54508, ...}) = 0
> geteuid()                               = 400
> getuid()                                = 0
> geteuid()                               = 400
> setresuid(-1, 0, -1)                    = 0
> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 66) = 66
> setresuid(-1, 400, -1)                  = 0
> fstat(3, {st_mode=S_IFREG|0644, st_size=54574, ...}) = 0
> geteuid()                               = 400
> getuid()                                = 0
> geteuid()                               = 400
> setresuid(-1, 0, -1)                    = 0
> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 51) = 51
> setresuid(-1, 400, -1)                  = 0
> fstat(3, {st_mode=S_IFREG|0644, st_size=54625, ...}) = 0
> geteuid()                               = 400
> getuid()                                = 0
> geteuid()                               = 400
> setresuid(-1, 0, -1)                    = 0
> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 50) = 50
> setresuid(-1, 400, -1)                  = 0
> fstat(3, {st_mode=S_IFREG|0644, st_size=54675, ...}) = 0
> geteuid()                               = 400
> getuid()                                = 0
> geteuid()                               = 400
> setresuid(-1, 0, -1)                    = 0
> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 69) = 69
> setresuid(-1, 400, -1)                  = 0
> getuid()                                = 0
> getgid()                                = 0
> getuid()                                = 0
> getegid()                               = 400
> setresgid(-1, 0, -1)                    = 0
> geteuid()                               = 400
> setresuid(-1, 0, -1)                    = 0
> kill(4294958680, SIGKILL)               = 0
> getuid()                                = 0
> getegid()                               = 0
> setresgid(-1, 400, -1)                  = 0
> geteuid()                               = 0
> setresuid(-1, 400, -1)                  = 0
> wait4(4294967295,
>
>
>
> On Jun 27, 2007, at 3:32 PM, Rayson Ho wrote:
>
> > Can you "strace" the shepherd process and see what it is doing??
> >
> > Rayson
> >
> >
> >
> > On 6/27/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
> >> I have been trying to find out why some jobs stop running, as seen
> >> from top, but still show as active in qstat -f.
> >>
> >> Symptoms, once again:
> >>
> >> The job is not in top.
> >> It shows in qstat -f as running.
> >> ps -ef | grep sge shows a shepherd -bg running for the "queued" job.
> >> The user cannot ssh into the node where the job is stuck, but other
> >> people can.
> >> No one can complete a df on the node with the problem.
> >>
> >> Did the home directory of the user who queued the job become
> >> unmounted from the compute node?
> >> If so, why? Some jobs run successfully for several days.
> >>
> >> I could not find any information in
> >> /opt/gridengine/default/spool/qmaster/messages for the "lost" job.
> >>
> >>
> >> qsub /script-s
> >> more  script-s
> >> #!/bin/bash
> >>
> >> # job name
> >> #$ -N C-256
> >>
> >> # send the standard output to your current working directory
> >> #$ -cwd
> >>
> >> # define the name of your output file
> >> #$ -o C-2e6.log
> >> # merge error and stdout into a single file
> >> #$ -j y
> >>
> >> # Put in a timestamp
> >> echo Starting execution at `date`
> >>
> >> # run your code, you need to specify the absolute path for your program in bash she
> >>
> >> /home/mad/user1/mad
> >>
> >> echo Finished at `date`
> >>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
>
