[GE users] sge_shepherd problems perhaps connected to nfs problems

Rayson Ho rayrayson at gmail.com
Wed Jun 27 20:32:52 BST 2007


Can you "strace" the shepherd process and see what it is doing?
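
A rough sketch of how that could be done on the affected node, not a definitive recipe. It assumes a Linux node with procps "ps" and "strace" installed and root access; the helper name blocked_shepherds is made up for illustration. A state beginning with "D" (uninterruptible sleep) is what one would typically expect if the shepherd is waiting on an unresponsive NFS mount.

```shell
#!/bin/bash
# Hypothetical helper: reads "PID STAT COMMAND..." lines on stdin and prints
# the PIDs of shepherd processes whose state starts with D
# (uninterruptible sleep, commonly seen when a process waits on hung I/O).
blocked_shepherds() {
    awk '$2 ~ /^D/ && /shepherd/ { print $1 }'
}

# Usage on the stuck node (run as root):
#   ps -eo pid,stat,args | blocked_shepherds
#   strace -f -p <PID>     # attach to one shepherd; Ctrl-C detaches
```

If strace shows the process sitting in a single file-related system call that never returns, that would point toward the filesystem rather than Grid Engine itself.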

Rayson



On 6/27/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
> I have been trying to find out why some jobs stop running as seen
> from top, but still show as active in qstat -f.
>
> Symptoms once again.
>
> not in top
> shown in qstat -f as running
> ps -ef | grep sge shows a shepherd -bg running for the "queued" job
> The user cannot ssh into the node where the job is stuck, but other people
> can.
> No one can complete a df on the node with the problem.
>
> Did the home directory of the user that queued the job become unmounted
> from the compute node?
> If so, why? Some jobs run successfully for several days.
>
> I could not find any information in
> /opt/gridengine/default/spool/qmaster/messages for the
> "lost" job.
>
>
> qsub /script-s
> more  script-s
> #!/bin/bash
>
> # job name
> #$ -N C-256
>
> # send the standard output to your current working directory
> #$ -cwd
>
> # define the name of your output file
> #$ -o C-2e6.log
> # merge error and stdout into a single file
> #$ -j y
>
> # Put in a timestamp
> echo Starting execution at `date`
>
> # run your code, you need to specify the absolute path for your program in bash she
>
> /home/mad/user1/mad
>
> echo Finished at `date`
>
>
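
Since df hangs on the node, one way to narrow down which mount is at fault is to probe each NFS mount point separately with a time limit. This is a minimal sketch under stated assumptions: a Linux node with /proc/mounts and GNU coreutils "timeout"; the helper names nfs_mounts and check_mount are made up for illustration. On some kernels and mount options a process blocked on a hard NFS mount cannot be killed, so even this probe may hang.

```shell
#!/bin/bash
# Hypothetical helper: print the mount points of all NFS filesystems listed
# in a mounts table (defaults to /proc/mounts on Linux).
nfs_mounts() {
    awk '$3 ~ /^nfs4?$/ { print $2 }' "${1:-/proc/mounts}"
}

# Hypothetical helper: probe one mount point, giving up after 5 seconds
# (and sending KILL 2 seconds later) so the check is less likely to hang
# the way df does on a dead mount.
check_mount() {
    if timeout -k 2 5 stat -t "$1" > /dev/null 2>&1; then
        echo "ok $1"
    else
        echo "failed or timed out: $1"
        return 1
    fi
}

# Usage on the affected node:
#   for m in $(nfs_mounts); do check_mount "$m"; done
```

A mount that reports "failed or timed out" while the others answer would be the first place to look, for example the home directory of the user whose job is stuck.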
