[GE users] sge_shepherd problems perhaps connected to nfs problems
Margaret_Doll at brown.edu
Wed Jun 27 19:43:01 BST 2007
I have been trying to find out why some jobs stop running, as seen
from top, but still show as active in qstat -f.
Symptoms, once again:
- the job no longer appears in top
- qstat -f still shows the job as running
- ps -ef | grep sge shows a shepherd -bg process still running for the "queued" job
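One way to look at a stuck job from the node side is to check the state of the shepherd and its children. A process blocked on an unresponsive NFS mount usually sits in uninterruptible sleep (state D) and uses no CPU, so it may fall off top's default CPU-sorted display even though it still exists. This is a minimal sketch assuming bash and the Linux procps versions of ps and pgrep; the function names list_d_state and show_shepherds are hypothetical.

```shell
# List processes in uninterruptible sleep (STAT starting with "D"),
# keeping the header line. On Linux such processes are usually waiting
# on I/O, for example on an NFS server that has stopped answering.
list_d_state() {
    ps -eo pid,stat,wchan:32,user,args | awk 'NR == 1 || $2 ~ /^D/'
}

# For every sge_shepherd process, print the shepherd itself and its
# direct children (the job's top-level processes) with their state
# and the kernel function they are waiting in (WCHAN).
show_shepherds() {
    local pid
    for pid in $(pgrep -f sge_shepherd); do
        ps -o pid,stat,wchan:32,args -p "$pid"
        ps -o pid,stat,wchan:32,args --ppid "$pid"
    done
    return 0
}
```

If the job's processes show STAT D with a WCHAN pointing into NFS or RPC code, that would support the idea that the job is blocked on a mount rather than crashed.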
The user cannot ssh into the node where the job is stuck, but other
No one can complete a df on the node with the problem.
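Since df hangs, probing each mount on its own can show which filesystem has stopped answering. A minimal sketch, assuming bash and GNU stat; probe_fs is a hypothetical name. The probe runs in the background so the calling shell never blocks, and a probe still running after the limit is left behind rather than waited on, since a process stuck on a hard NFS mount may not respond to signals.

```shell
# Probe whether the filesystem holding PATH answers a statfs() call
# (the system call behind df) within LIMIT seconds (default 10).
probe_fs() {
    local path=$1 limit=${2:-10} pid i
    # run the probe in the background so this shell can give up on it
    stat -f "$path" >/dev/null 2>&1 &
    pid=$!
    for ((i = 0; i <= limit; i++)); do
        # kill -0 sends no signal; it only tests whether the process exists
        if ! kill -0 "$pid" 2>/dev/null; then
            if wait "$pid"; then
                echo "ok: $path"
                return 0
            fi
            echo "error: $path"
            return 1
        fi
        sleep 1
    done
    # the probe is still blocked: leave it behind and report the hang
    echo "no answer after ${limit}s: $path"
    return 1
}
```

For example, running probe_fs /home/username 10 as root on the affected node (the path is a placeholder) would report "no answer after 10s" if that user's home mount is hung.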
Did the home directory of the user who queued the job become
unmounted from the compute node?
If so, why? Some jobs run successfully for several days.
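Whether the home directory is still mounted can be checked from the kernel's mount table without touching the directory itself, which could hang. A sketch assuming Linux /proc/mounts and getent; home_mount is a hypothetical helper. If home directories are automounted with autofs, only the autofs entry may be listed when the user's directory is not currently mounted.

```shell
# Print "device mount-point fstype" for the filesystem that holds USER's
# home directory, read from /proc/mounts so the home directory itself is
# never accessed. Returns non-zero if the user is unknown.
home_mount() {
    local user=$1 home
    home=$(getent passwd "$user" | cut -d: -f6)
    [ -n "$home" ] || return 1
    awk -v h="$home" '
        BEGIN { best = -1 }
        {
            mp = $2
            # a mount point covers $home if it is "/" or a whole-component prefix
            if (mp == "/" || h == mp || index(h, mp "/") == 1) {
                # keep the longest (most specific) covering mount point
                if (length(mp) > best) { best = length(mp); line = $1 " " $2 " " $3 }
            }
        }
        END { if (best < 0) exit 1; print line }
    ' /proc/mounts
}
```

If the result for the job owner is the local root filesystem or an autofs entry instead of an nfs entry, the home directory is not currently NFS-mounted on that node.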
I could not find any information about the "lost" job in
/opt/gridengine/default/spool/qmaster/messages.
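The qmaster messages file is only one place to look; each execution daemon also writes its own messages file in its spool directory. A sketch of a search over several files at once, assuming bash and grep; find_job_messages is hypothetical, and the execd spool path and job ID 12345 in the usage line are assumptions (the execd spool location depends on the cluster's execd_spool_dir setting).

```shell
# Print every line mentioning JOBID in the given messages files,
# prefixed with the file name. Unreadable files are skipped.
find_job_messages() {
    local jobid=$1 f
    shift
    for f in "$@"; do
        [ -r "$f" ] || continue
        # -w matches whole words only, so job 12 does not match 123
        grep -Hw -- "$jobid" "$f"
    done
}
```

For example: find_job_messages 12345 /opt/gridengine/default/spool/qmaster/messages /opt/gridengine/default/spool/$(hostname -s)/messages, run where those files are readable.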
The job script:
# job name
#$ -N C-256
# send the standard output to your current working directory
# define the name of your output file
#$ -o C-2e6.log
# merge error and stdout into a single file
#$ -j y
# Put in a timestamp
echo Starting execution at `date`
# run your code; you need to specify the absolute path to your program in the bash shell
echo Finished at `date`