[GE users] sge_shepherd problems perhaps connected to nfs problems

Margaret Doll Margaret_Doll at brown.edu
Wed Jun 27 20:41:46 BST 2007


An strace on the background shepherd process reports:

ps -ef | grep sge
sge       4191     1  0 Jun25 ?        00:05:32 /opt/gridengine/bin/lx26-amd64/sge_execd
sge       8615  4191  0 Jun26 ?        00:00:00 sge_shepherd-543 -bg
root     13055 12922  0 15:40 pts/1    00:00:00 grep sge
[root@compute-0-1 ~]# /share/apps/strace/strace -p 8615
Process 8615 attached - interrupt to quit
wait4(4294967295, 0x7fbfff5cb8, 0, 0x7fbfff5e20) = ? ERESTARTSYS (To be restarted)
--- SIGTSTP (Stopped) @ 0 (0) ---
rt_sigreturn(0x14)                      = -1 EINTR (Interrupted system call)
alarm(0)                                = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=54458, ...}) = 0
geteuid()                               = 400
getuid()                                = 0
geteuid()                               = 400
setresuid(-1, 0, -1)                    = 0
write(3, "06/27/2007 15:41:11 [400:8615]: "..., 50) = 50
setresuid(-1, 400, -1)                  = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=54508, ...}) = 0
geteuid()                               = 400
getuid()                                = 0
geteuid()                               = 400
setresuid(-1, 0, -1)                    = 0
write(3, "06/27/2007 15:41:11 [400:8615]: "..., 66) = 66
setresuid(-1, 400, -1)                  = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=54574, ...}) = 0
geteuid()                               = 400
getuid()                                = 0
geteuid()                               = 400
setresuid(-1, 0, -1)                    = 0
write(3, "06/27/2007 15:41:11 [400:8615]: "..., 51) = 51
setresuid(-1, 400, -1)                  = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=54625, ...}) = 0
geteuid()                               = 400
getuid()                                = 0
geteuid()                               = 400
setresuid(-1, 0, -1)                    = 0
write(3, "06/27/2007 15:41:11 [400:8615]: "..., 50) = 50
setresuid(-1, 400, -1)                  = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=54675, ...}) = 0
geteuid()                               = 400
getuid()                                = 0
geteuid()                               = 400
setresuid(-1, 0, -1)                    = 0
write(3, "06/27/2007 15:41:11 [400:8615]: "..., 69) = 69
setresuid(-1, 400, -1)                  = 0
getuid()                                = 0
getgid()                                = 0
getuid()                                = 0
getegid()                               = 400
setresgid(-1, 0, -1)                    = 0
geteuid()                               = 400
setresuid(-1, 0, -1)                    = 0
kill(4294958680, SIGKILL)               = 0
getuid()                                = 0
getegid()                               = 0
setresgid(-1, 400, -1)                  = 0
geteuid()                               = 0
setresuid(-1, 400, -1)                  = 0
wait4(4294967295,
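
A note on reading the trace above: this strace build appears to print pid
arguments as unsigned 32-bit numbers, so 4294967295 and 4294958680 are really
negative values. The sketch below converts them back. The helper name
to_signed32 is made up for illustration and is not part of strace or Grid
Engine.

```shell
#!/bin/bash
# Convert an unsigned 32-bit value, as printed in the trace above,
# back to the signed int that the program actually passed.
# to_signed32 is a hypothetical helper name, not an existing tool.
to_signed32() {
    local u=$1
    if [ "$u" -ge 2147483648 ]; then
        echo $(( u - 4294967296 ))
    else
        echo "$u"
    fi
}

to_signed32 4294967295   # first argument of wait4(): prints -1
to_signed32 4294958680   # first argument of kill():  prints -8616
```

wait4(-1, ...) waits for any child. kill(-8616, SIGKILL) signals the whole
process group 8616, which is presumably the job started by shepherd 8615,
although the trace itself does not confirm that. If the job's processes are
blocked in uninterruptible sleep on a dead NFS mount, they may not exit on
SIGKILL until the I/O returns, which would be consistent with the shepherd
going straight back into wait4().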



On Jun 27, 2007, at 3:32 PM, Rayson Ho wrote:

> Can you "strace" the shepherd process and see what it is doing??
>
> Rayson
>
>
>
> On 6/27/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
>> I have been trying to find out why some jobs stop running as seen
>> from top, but still show as active in qstat -f.
>>
>> Symptoms once again.
>>
>> not in top
>> show in qstat -f as running
>> ps -ef | grep sge shows a shepherd -bg running for the "queued" job
>> The user cannot ssh into the node where the job is stuck, but other
>> people can.
>> No one can complete a df on the node with the problem.
>>
>> Did the home directory of the user that queued the job become unmounted
>> from the compute node?
>> If so, why?  Some jobs run successfully for several days.
>>
>> I could not find any information in
>> /opt/gridengine/default/spool/qmaster/messages for the "lost" job.
>>
>>
>> qsub /script-s
>> more  script-s
>> #!/bin/bash
>>
>> # job name
>> #$ -N C-256
>>
>> # send the standard output to your current working directory
>> #$ -cwd
>>
>> # define the name of your output file
>> #$ -o C-2e6.log
>> # merge error and stdout into a single file
>> #$ -j y
>>
>> # Put in a timestamp
>> echo Starting execution at `date`
>>
>> # run your code, you need to specify the absolute path for your program in bash she
>>
>> /home/mad/user1/mad
>>
>> echo Finished at `date`
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
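
The symptoms quoted above (df never completing on the node, ssh failing only
for the one user) are what a hung NFS mount often looks like, though that is
a guess and not something the trace proves. A minimal sketch for probing
mount points without getting the shell stuck follows. It runs df in the
background and polls instead of waiting on it, because a process blocked on a
hard NFS mount may not respond to signals. The function name check_mount and
the 5-second default are assumptions chosen for illustration.

```shell
#!/bin/bash
# Probe a mount point with df, but give up after a few seconds instead of
# blocking forever. check_mount is a hypothetical helper name.
check_mount() {
    local dir=$1 secs=${2:-5} pid i=0
    df -P "$dir" > /dev/null 2>&1 &
    pid=$!
    while [ "$i" -lt "$secs" ]; do
        sleep 1
        i=$((i + 1))
        # kill -0 only tests whether the probe is still running.
        if ! kill -0 "$pid" 2> /dev/null; then
            if wait "$pid"; then
                echo "ok: $dir"
                return 0
            fi
            echo "failed: $dir"
            return 1
        fi
    done
    # The stuck df is left behind; it may be unkillable until NFS recovers.
    echo "hung: $dir"
    return 2
}

# Probe every NFS mount listed in /proc/mounts (Linux only).
if [ -r /proc/mounts ]; then
    while read -r _dev dir type _rest; do
        case $type in
            nfs*) check_mount "$dir" || true ;;
        esac
    done < /proc/mounts
fi
```

Running this on the affected compute node and on a healthy one should show
whether the user's home directory mount is the one that stopped responding.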
