[GE users] sge_shepherd problems perhaps connected to nfs problems

Margaret Doll Margaret_Doll at brown.edu
Wed Jun 27 21:10:50 BST 2007


This same program runs fine if submitted as a job on the head
node or directly on one of the compute nodes without being scheduled.
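
A note on the large numbers in the strace output quoted below, offered as a
reading aid rather than a diagnosis: this strace build appears to print the
pid arguments as unsigned 32-bit values. Reinterpreted as a signed pid_t,
4294967295 is -1, and wait4(-1, ...) waits for any child rather than for one
particular process. Likewise 4294958680 is -8616, and a negative pid passed
to kill() addresses a process group, presumably the job's. The helper name
to_signed32 below is made up for illustration.

```shell
#!/bin/bash
# Reinterpret an unsigned 32-bit number (as printed in the strace output)
# as a signed 32-bit pid_t. Hypothetical helper, not part of SGE or strace.
to_signed32() {
    local v=$1
    if [ "$v" -ge 2147483648 ]; then
        # Values with the top bit set wrap around to negative numbers.
        echo $(( v - 4294967296 ))
    else
        echo "$v"
    fi
}

to_signed32 4294967295   # first argument of wait4() in the trace
to_signed32 4294958680   # first argument of kill() in the trace
```

If that reading is right, there is no process 4294967295 to look for in ps;
the shepherd is simply blocked waiting for any of its children to exit.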

On Jun 27, 2007, at 4:06 PM, Margaret Doll wrote:

> But the largest process id on the compute node is
>
> root     13242 12922  0 16:05 pts/1    00:00:00 more
>
>
> On Jun 27, 2007, at 4:00 PM, Rayson Ho wrote:
>
>> If you look at the first parameter of wait4(), that is the process that
>> the shepherd is waiting for...
>>
>> Rayson
>>
>>
>>
>> On 6/27/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
>>> I do not know which process 4294967295 refers to.
>>>
>>> The queue job id of the job was 543.  The job when it was running
>>> yesterday was numbered something like 8645.
>>>
>>> This morning I saw the job in state "r" using qstat -f, but it was not
>>> present in top.  The output file from the program
>>> indicates that the job stopped in the middle of a calculation.
>>>
>>> I then ran qdel -f 543, but sge_shepherd-543 -bg is still there.
>>> qstat and qmon no longer show the job.
>>>
>>>
>>>
>>> On Jun 27, 2007, at 3:48 PM, Rayson Ho wrote:
>>>
>>> Is process 4294967295 still running??
>>>
>>> Rayson
>>>
>>>
>>>
>>> On 6/27/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
>>> An strace on the back-ground shepherd  process reports
>>>
>>> ps -ef | grep sge
>>> sge       4191     1  0 Jun25 ?        00:05:32 /opt/gridengine/bin/lx26-amd64/sge_execd
>>> sge       8615  4191  0 Jun26 ?        00:00:00 sge_shepherd-543 -bg
>>> root     13055 12922  0 15:40 pts/1    00:00:00 grep sge
>>> [root@compute-0-1 ~]# /share/apps/strace/strace -p  8615
>>> Process 8615 attached - interrupt to quit
>>> wait4(4294967295, 0x7fbfff5cb8, 0, 0x7fbfff5e20) = ? ERESTARTSYS (To be restarted)
>>> --- SIGTSTP (Stopped) @ 0 (0) ---
>>> rt_sigreturn(0x14)                      = -1 EINTR (Interrupted system call)
>>> alarm(0)                                = 0
>>> fstat(3, {st_mode=S_IFREG|0644, st_size=54458, ...}) = 0
>>> geteuid()                               = 400
>>> getuid()                                = 0
>>> geteuid()                               = 400
>>> setresuid(-1, 0, -1)                    = 0
>>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 50) = 50
>>> setresuid(-1, 400, -1)                  = 0
>>> fstat(3, {st_mode=S_IFREG|0644, st_size=54508, ...}) = 0
>>> geteuid()                               = 400
>>> getuid()                                = 0
>>> geteuid()                               = 400
>>> setresuid(-1, 0, -1)                    = 0
>>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 66) = 66
>>> setresuid(-1, 400, -1)                  = 0
>>> fstat(3, {st_mode=S_IFREG|0644, st_size=54574, ...}) = 0
>>> geteuid()                               = 400
>>> getuid()                                = 0
>>> geteuid()                               = 400
>>> setresuid(-1, 0, -1)                    = 0
>>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 51) = 51
>>> setresuid(-1, 400, -1)                  = 0
>>> fstat(3, {st_mode=S_IFREG|0644, st_size=54625, ...}) = 0
>>> geteuid()                               = 400
>>> getuid()                                = 0
>>> geteuid()                               = 400
>>> setresuid(-1, 0, -1)                    = 0
>>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 50) = 50
>>> setresuid(-1, 400, -1)                  = 0
>>> fstat(3, {st_mode=S_IFREG|0644, st_size=54675, ...}) = 0
>>> geteuid()                               = 400
>>> getuid()                                = 0
>>> geteuid()                               = 400
>>> setresuid(-1, 0, -1)                    = 0
>>> write(3, "06/27/2007 15:41:11 [400:8615]: "..., 69) = 69
>>> setresuid(-1, 400, -1)                  = 0
>>> getuid()                                = 0
>>> getgid()                                = 0
>>> getuid()                                = 0
>>> getegid()                               = 400
>>> setresgid(-1, 0, -1)                    = 0
>>> geteuid()                               = 400
>>> setresuid(-1, 0, -1)                    = 0
>>> kill(4294958680, SIGKILL)               = 0
>>> getuid()                                = 0
>>> getegid()                               = 0
>>> setresgid(-1, 400, -1)                  = 0
>>> geteuid()                               = 0
>>> setresuid(-1, 400, -1)                  = 0
>>> wait4(4294967295,
>>>
>>>
>>>
>>> On Jun 27, 2007, at 3:32 PM, Rayson Ho wrote:
>>>
>>> > Can you "strace" the shepherd process and see what it is doing??
>>> >
>>> > Rayson
>>> >
>>> >
>>> >
>>> > On 6/27/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
>>> >> I have been trying to find out why some jobs stop running, as seen
>>> >> from top, but still show as active in qstat -f.
>>> >>
>>> >> Symptoms once again.
>>> >>
>>> >> not in top
>>> >> shows in qstat -f as running
>>> >> ps -ef | grep sge shows a shepherd -bg running for the "queued" job
>>> >> The user cannot ssh into the node where the job is stuck, but other
>>> >> people can.
>>> >> No one can complete a df on the node with the problem.
>>> >>
>>> >> Did the home directory of the user that queued the job become
>>> >> unmounted from the compute node?
>>> >> If so, why?  Some jobs run successfully for several days.
>>> >>
>>> >> I could not find any information in
>>> >> /opt/gridengine/default/spool/qmaster/messages for the "lost" job.
>>> >>
>>> >>
>>> >> qsub /script-s
>>> >> more  script-s
>>> >> #!/bin/bash
>>> >>
>>> >> # job name
>>> >> #$ -N C-256
>>> >>
>>> >> # run the job from the current working directory (output files are written there)
>>> >> #$ -cwd
>>> >>
>>> >> # define the name of your output file
>>> >> #$ -o C-2e6.log
>>> >> # merge error and stdout into a single file
>>> >> #$ -j y
>>> >>
>>> >> # Put in a timestamp
>>> >> echo Starting execution at `date`
>>> >>
>>> >> # run your code, you need to specify the absolute path for your program in
>>> >> # bash she
>>> >>
>>> >> /home/mad/user1/mad
>>> >>
>>> >> echo Finished at `date`
>>> >>
>>> >>
>>> >
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> > For additional commands, e-mail: users-help at gridengine.sunsource.net
>>> >

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



