[GE users] sge_shepherd problems perhaps connected to nfs problems

Margaret Doll Margaret_Doll at brown.edu
Wed Jun 27 20:56:09 BST 2007


I do not know which process 4294967295 refers to.

The job's queue ID was 543.  While it was running yesterday, its
process was numbered something like 8645.
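
For reference, a pid of 4294967295 in the strace output is most likely
not a real process: it is what -1 looks like when a 32-bit pid_t is
printed as an unsigned number, and wait4(-1, ...) waits for any child.
The same reading turns the kill() target into a negative number, which
kill(2) treats as a process group.  A minimal sketch of the arithmetic
in bash (to_signed32 is a hypothetical helper name, and the 32-bit width
is an assumption about how this strace build prints pids):

```shell
#!/bin/bash
# Reinterpret an unsigned 32-bit value, as printed by strace, as a
# signed 32-bit pid_t.  The 32-bit width is an assumption.
to_signed32() {
    if [ "$1" -ge 2147483648 ]; then
        # Values at or above 2^31 wrap around to negative numbers.
        echo $(( $1 - 4294967296 ))
    else
        echo "$1"
    fi
}

to_signed32 4294967295   # first argument of wait4() in the trace: -1
to_signed32 4294958680   # first argument of kill() in the trace: -8616
```

If that reading is right, the shepherd sent SIGKILL to process group
8616 and is now waiting for any child to exit, so 8616 (rather than
4294967295) would be the number to look for in ps.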

This morning qstat -f showed the job in state "r", but it did not
appear in top.  The program's output file indicates that the job
stopped in the middle of a calculation.

I then ran qdel -f 543, but sge_shepherd-543 -bg is still there.
qstat and qmon no longer show the job.
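
Since the shepherd survives qdel -f and df was reported to hang on the
node, one thing worth checking is whether any process is stuck in
uninterruptible sleep (state D in ps).  Such a process cannot act on
SIGKILL until its I/O returns; a hung NFS mount is a common cause,
though that is only a guess here.  A rough sketch, assuming a
procps-style ps (is_uninterruptible is a hypothetical helper):

```shell
#!/bin/bash
# Hypothetical helper: succeed if a ps STAT field starts with D
# (uninterruptible sleep, usually blocked on I/O).
is_uninterruptible() {
    case "$1" in
        D*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Print every process currently in state D, with its process group,
# so it can be compared with the pids seen in the strace output.
ps -eo pid,pgid,stat,user,args | while read -r pid pgid stat user args; do
    if is_uninterruptible "$stat"; then
        echo "$pid $pgid $stat $user $args"
    fi
done
```

An empty result would suggest looking elsewhere; a job process listed
here would fit the picture of a SIGKILL that cannot be delivered yet.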


On Jun 27, 2007, at 3:48 PM, Rayson Ho wrote:

> Is process 4294967295 still running??
>
> Rayson
>
>
>
> On 6/27/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
>> An strace on the background shepherd process reports:
>>
>> ps -ef | grep sge
>> sge       4191     1  0 Jun25 ?        00:05:32 /opt/gridengine/bin/lx26-amd64/sge_execd
>> sge       8615  4191  0 Jun26 ?        00:00:00 sge_shepherd-543 -bg
>> root     13055 12922  0 15:40 pts/1    00:00:00 grep sge
>> [root@compute-0-1 ~]# /share/apps/strace/strace -p  8615
>> Process 8615 attached - interrupt to quit
>> wait4(4294967295, 0x7fbfff5cb8, 0, 0x7fbfff5e20) = ? ERESTARTSYS (To be restarted)
>> --- SIGTSTP (Stopped) @ 0 (0) ---
>> rt_sigreturn(0x14)                      = -1 EINTR (Interrupted system call)
>> alarm(0)                                = 0
>> fstat(3, {st_mode=S_IFREG|0644, st_size=54458, ...}) = 0
>> geteuid()                               = 400
>> getuid()                                = 0
>> geteuid()                               = 400
>> setresuid(-1, 0, -1)                    = 0
>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 50) = 50
>> setresuid(-1, 400, -1)                  = 0
>> fstat(3, {st_mode=S_IFREG|0644, st_size=54508, ...}) = 0
>> geteuid()                               = 400
>> getuid()                                = 0
>> geteuid()                               = 400
>> setresuid(-1, 0, -1)                    = 0
>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 66) = 66
>> setresuid(-1, 400, -1)                  = 0
>> fstat(3, {st_mode=S_IFREG|0644, st_size=54574, ...}) = 0
>> geteuid()                               = 400
>> getuid()                                = 0
>> geteuid()                               = 400
>> setresuid(-1, 0, -1)                    = 0
>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 51) = 51
>> setresuid(-1, 400, -1)                  = 0
>> fstat(3, {st_mode=S_IFREG|0644, st_size=54625, ...}) = 0
>> geteuid()                               = 400
>> getuid()                                = 0
>> geteuid()                               = 400
>> setresuid(-1, 0, -1)                    = 0
>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 50) = 50
>> setresuid(-1, 400, -1)                  = 0
>> fstat(3, {st_mode=S_IFREG|0644, st_size=54675, ...}) = 0
>> geteuid()                               = 400
>> getuid()                                = 0
>> geteuid()                               = 400
>> setresuid(-1, 0, -1)                    = 0
>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 69) = 69
>> setresuid(-1, 400, -1)                  = 0
>> getuid()                                = 0
>> getgid()                                = 0
>> getuid()                                = 0
>> getegid()                               = 400
>> setresgid(-1, 0, -1)                    = 0
>> geteuid()                               = 400
>> setresuid(-1, 0, -1)                    = 0
>> kill(4294958680, SIGKILL)               = 0
>> getuid()                                = 0
>> getegid()                               = 0
>> setresgid(-1, 400, -1)                  = 0
>> geteuid()                               = 0
>> setresuid(-1, 400, -1)                  = 0
>> wait4(4294967295,
>>
>>
>>
>> On Jun 27, 2007, at 3:32 PM, Rayson Ho wrote:
>>
>> > Can you "strace" the shepherd process and see what it is doing??
>> >
>> > Rayson
>> >
>> >
>> >
>> > On 6/27/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
>> >> I have been trying to find out why some jobs stop running, as seen
>> >> from top, but still show as active in qstat -f.
>> >>
>> >> Symptoms once again:
>> >>
>> >> not in top
>> >> shown in qstat -f as running
>> >> ps -ef | grep sge shows a shepherd -bg running for the "queued" job
>> >> The user cannot ssh into the node where the job is stuck, but other
>> >> people can.
>> >> No one can complete a df on the node with the problem.
>> >>
>> >> Did the home directory of the user who queued the job become
>> >> unmounted from the compute node?
>> >> If so, why?  Some jobs run successfully for several days.
>> >>
>> >> I could not find any information in
>> >> /opt/gridengine/default/spool/qmaster/messages for the "lost" job.
>> >>
>> >>
>> >> qsub /script-s
>> >> more  script-s
>> >> #!/bin/bash
>> >>
>> >> # job name
>> >> #$ -N C-256
>> >>
>> >> # run the job from your current working directory, so the output
>> >> # file is written there
>> >> #$ -cwd
>> >>
>> >> # define the name of your output file
>> >> #$ -o C-2e6.log
>> >> # merge error and stdout into a single file
>> >> #$ -j y
>> >>
>> >> # Put in a timestamp
>> >> echo Starting execution at `date`
>> >>
>> >> # run your code, you need to specify the absolute path for your
>> >> # program in bash she
>> >>
>> >> /home/mad/user1/mad
>> >>
>> >> echo Finished at `date`
>> >>
>> >>
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> > For additional commands, e-mail: users-help at gridengine.sunsource.net
>> >



