[GE users] sge_shepherd problems perhaps connected to nfs problems

Margaret Doll Margaret_Doll at brown.edu
Wed Jun 27 21:25:00 BST 2007


I put another job on the queue as a normal user and then ran qdel on
the job.  The sge_shepherd -bg process still hangs around.
An strace of the new sge_shepherd shows it waiting on the same
wait4() argument as the other sge_shepherd on another
compute node.

/share/apps/strace/strace -p 8085
Process 8085 attached - interrupt to quit
wait4(4294967295,
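
One possible reading of the large numbers in the strace output, offered
as an assumption rather than something confirmed in this thread: this
strace build may be printing the signed pid argument as an unsigned
32-bit value. Under that assumption, 4294967295 is -1 (wait4() on any
child) and the 4294958680 in the earlier kill() call is -8616 (signal
process group 8616). The helper name to_signed_pid below is made up for
illustration.

```shell
#!/bin/bash
# Sketch: reinterpret an unsigned 32-bit number printed by strace as a
# signed 32-bit pid. to_signed_pid is a hypothetical helper, not an SGE
# or strace tool.
to_signed_pid() {
    local value=$1
    if (( value >= 2147483648 )); then
        # Values with the top bit set are negative in two's complement.
        echo $(( value - 4294967296 ))
    else
        echo "$value"
    fi
}

to_signed_pid 4294967295   # wait4() argument from the trace; prints -1
to_signed_pid 4294958680   # kill() argument from the trace; prints -8616
```

If that reading is right, the shepherd is not waiting for a process
numbered 4294967295. It is blocked in wait4(-1, ...), which only blocks
while the caller still has a child that has not exited; with no
children left it would return ECHILD instead.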

On Jun 27, 2007, at 4:10 PM, Margaret Doll wrote:

> This same program runs fine if submitted as a job on the head
> node or directly on one of the compute nodes without being scheduled.
>
> On Jun 27, 2007, at 4:06 PM, Margaret Doll wrote:
>
>> But the largest process id on the compute node is
>>
>> root     13242 12922  0 16:05 pts/1    00:00:00 more
>>
>>
>> On Jun 27, 2007, at 4:00 PM, Rayson Ho wrote:
>>
>>> If you look at the first parameter of wait4(), that is the process that
>>> the shepherd is waiting for...
>>>
>>> Rayson
>>>
>>>
>>>
>>> On 6/27/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
>>>> I do not know which process 4294967295 refers to.
>>>>
>>>> The queue job id of the job was 543.  The job when it was running
>>>> yesterday was numbered something like 8645.
>>>>
>>>> This morning qstat -f showed the job in state "r", but it was not
>>>> present in top.  The output file from the program
>>>> indicates that the job stopped in the middle of a calculation.
>>>>
>>>> I then ran qdel -f 543, but sge_shepherd-543 -bg is still there.
>>>> qstat and qmon no longer show the job.
>>>>
>>>>
>>>>
>>>> On Jun 27, 2007, at 3:48 PM, Rayson Ho wrote:
>>>>
>>>> Is process 4294967295 still running??
>>>>
>>>> Rayson
>>>>
>>>>
>>>>
>>>> On 6/27/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
>>>> An strace on the back-ground shepherd  process reports
>>>>
>>>> ps -ef | grep sge
>>>> sge       4191     1  0 Jun25 ?        00:05:32 /opt/gridengine/bin/lx26-amd64/sge_execd
>>>> sge       8615  4191  0 Jun26 ?        00:00:00 sge_shepherd-543 -bg
>>>> root     13055 12922  0 15:40 pts/1    00:00:00 grep sge
>>>> [root at compute-0-1 ~]# /share/apps/strace/strace -p  8615
>>>> Process 8615 attached - interrupt to quit
>>>> wait4(4294967295, 0x7fbfff5cb8, 0, 0x7fbfff5e20) = ? ERESTARTSYS (To be restarted)
>>>> --- SIGTSTP (Stopped) @ 0 (0) ---
>>>> rt_sigreturn(0x14)                      = -1 EINTR (Interrupted system call)
>>>> alarm(0)                                = 0
>>>> fstat(3, {st_mode=S_IFREG|0644, st_size=54458, ...}) = 0
>>>> geteuid()                               = 400
>>>> getuid()                                = 0
>>>> geteuid()                               = 400
>>>> setresuid(-1, 0, -1)                    = 0
>>>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 50) = 50
>>>> setresuid(-1, 400, -1)                  = 0
>>>> fstat(3, {st_mode=S_IFREG|0644, st_size=54508, ...}) = 0
>>>> geteuid()                               = 400
>>>> getuid()                                = 0
>>>> geteuid()                               = 400
>>>> setresuid(-1, 0, -1)                    = 0
>>>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 66) = 66
>>>> setresuid(-1, 400, -1)                  = 0
>>>> fstat(3, {st_mode=S_IFREG|0644, st_size=54574, ...}) = 0
>>>> geteuid()                               = 400
>>>> getuid()                                = 0
>>>> geteuid()                               = 400
>>>> setresuid(-1, 0, -1)                    = 0
>>>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 51) = 51
>>>> setresuid(-1, 400, -1)                  = 0
>>>> fstat(3, {st_mode=S_IFREG|0644, st_size=54625, ...}) = 0
>>>> geteuid()                               = 400
>>>> getuid()                                = 0
>>>> geteuid()                               = 400
>>>> setresuid(-1, 0, -1)                    = 0
>>>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 50) = 50
>>>> setresuid(-1, 400, -1)                  = 0
>>>> fstat(3, {st_mode=S_IFREG|0644, st_size=54675, ...}) = 0
>>>> geteuid()                               = 400
>>>> getuid()                                = 0
>>>> geteuid()                               = 400
>>>> setresuid(-1, 0, -1)                    = 0
>>>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 69) = 69
>>>> setresuid(-1, 400, -1)                  = 0
>>>> getuid()                                = 0
>>>> getgid()                                = 0
>>>> getuid()                                = 0
>>>> getegid()                               = 400
>>>> setresgid(-1, 0, -1)                    = 0
>>>> geteuid()                               = 400
>>>> setresuid(-1, 0, -1)                    = 0
>>>> kill(4294958680, SIGKILL)               = 0
>>>> getuid()                                = 0
>>>> getegid()                               = 0
>>>> setresgid(-1, 400, -1)                  = 0
>>>> geteuid()                               = 0
>>>> setresuid(-1, 400, -1)                  = 0
>>>> wait4(4294967295,
>>>>
>>>>
>>>>
>>>> On Jun 27, 2007, at 3:32 PM, Rayson Ho wrote:
>>>>
>>>> > Can you "strace" the shepherd process and see what it is doing??
>>>> >
>>>> > Rayson
>>>> >
>>>> >
>>>> >
>>>> > On 6/27/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
>>>> >> I have been trying to find out why some jobs stop running, as seen
>>>> >> from top, but still show as active in qstat -f.
>>>> >>
>>>> >> Symptoms once again:
>>>> >>
>>>> >> not in top
>>>> >> shows in qstat -f as running
>>>> >> ps -ef | grep sge shows a shepherd -bg running for the "queued" job
>>>> >> The user cannot ssh into the node where the job is stuck, but
>>>> >> other people can.
>>>> >> No one can complete a df on the node with the problem.
>>>> >>
>>>> >> Did the home directory of the user that queued the job become
>>>> >> unmounted from the compute node?
>>>> >> If so, why?  Some jobs run successfully for several days.
>>>> >>
>>>> >> I could not find any information in
>>>> >> /opt/gridengine/default/spool/qmaster/messages for the "lost" job.
>>>> >>
>>>> >>
>>>> >> qsub /script-s
>>>> >> more  script-s
>>>> >> #!/bin/bash
>>>> >>
>>>> >> # job name
>>>> >> #$ -N C-256
>>>> >>
>>>> >> # send the standard output to your current working directory
>>>> >> #$ -cwd
>>>> >>
>>>> >> # define the name of your output file
>>>> >> #$ -o C-2e6.log
>>>> >> # merge error and stdout into a single file
>>>> >> #$ -j y
>>>> >>
>>>> >> # Put in a timestamp
>>>> >> echo Starting execution at `date`
>>>> >>
>>>> >> # run your code, you need to specify the absolute path for your
>>>> >> # program in bash she
>>>> >>
>>>> >> /home/mad/user1/mad
>>>> >>
>>>> >> echo Finished at `date`
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> > For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>> >