[GE users] sge_shepherd problems perhaps connected to nfs problems

Margaret Doll Margaret_Doll at brown.edu
Wed Jun 27 19:43:01 BST 2007


I have been trying to find out why some jobs stop running, as seen
from top, but still show as active in qstat -f.

Symptoms once again.

not in top
show in qstat -f as running
ps -ef | grep sge shows a shepherd -bg running for the "queued" job
The user cannot ssh into the node where the job is stuck, but other  
people can.
No one can complete a df on the node with the problem.
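
A job that has vanished from top while its shepherd is still alive, on a node where df never returns, is consistent with processes blocked in uninterruptible sleep (state "D" in ps), which is what a hung NFS mount typically produces. The following is only a rough sketch for listing such processes on the node; the function names are made up for illustration and the ps column selection assumes a Linux procps-style ps.

```shell
#!/bin/bash

# Keep the header line plus every process whose STAT column starts with "D"
# (uninterruptible sleep, usually a process waiting on I/O).
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List candidate stuck processes on the local node.
# Assumption: ps accepts -eo with these column names (Linux procps).
list_dstate() {
    ps -eo pid,stat,user,args | filter_dstate
}
```

Running list_dstate as root on the affected node should show whether the user's program, or the df commands that never finished, are sitting in state D.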

Did the home directory of the user who queued the job become
unmounted from the compute node?
If so, why?  Some jobs run successfully for several days.
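
One way to test the unmounted-or-hung home directory theory without getting stuck the way df does is to probe each NFS mount point under a time limit. This is a minimal sketch, not a tested procedure for this cluster: check_path and nfs_mounts are hypothetical helper names, it assumes the GNU coreutils timeout command is installed, and it reads the mount table from /proc/mounts. On kernels where a process blocked on a hard NFS mount cannot be killed, the probe itself may hang, so it may be safer to run it in the background.

```shell
#!/bin/bash

# Return 0 if PATH answers a stat call within the time limit,
# non-zero if it is missing or does not respond in time.
# Usage: check_path PATH [SECONDS]   (default limit: 5 seconds)
check_path() {
    local path=$1 limit=${2:-5}
    timeout "$limit" stat -- "$path" >/dev/null 2>&1
}

# Print the mount point of every NFS file system in a mount table.
# Usage: nfs_mounts [TABLE]   (default table: /proc/mounts)
nfs_mounts() {
    awk '$3 ~ /^nfs/ { print $2 }' "${1:-/proc/mounts}"
}
```

For example, on the compute node:

    for m in $(nfs_mounts); do check_path "$m" || echo "$m is not responding"; done

A home directory that is missing from the nfs_mounts output has been unmounted; one that is listed but fails check_path points to a hung mount instead.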

I could not find any information in
/opt/gridengine/default/spool/qmaster/messages for the "lost" job.
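
The qmaster log is not the only place a job can leave traces; the execution daemon on the compute node normally keeps its own messages file. The sketch below simply searches several log files for a job ID in one pass. grep_job is a hypothetical helper name, and the location of the execd messages file depends on how the cluster was installed, so any path other than the qmaster one quoted above is an assumption.

```shell
#!/bin/bash

# Print every line in the given log files that mentions the job ID
# as a whole word, prefixed with the name of the file it came from.
# Usage: grep_job JOB_ID FILE...
grep_job() {
    local job_id=$1
    shift
    grep -H -w -- "$job_id" "$@"
}
```

For example, with 1234 standing in for the real job ID and compute-0-0 for the real node name (both placeholders):

    grep_job 1234 /opt/gridengine/default/spool/qmaster/messages \
                  /opt/gridengine/default/spool/compute-0-0/messages

qstat -j with the job ID may also report scheduling or error information for a job that is still listed.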


qsub /script-s
more  script-s
#!/bin/bash

# job name
#$ -N C-256

# run the job from your current working directory (output is written there)
#$ -cwd

# define the name of your output file
#$ -o C-2e6.log
# merge error and stdout into a single file
#$ -j y

# Put in a timestamp
echo Starting execution at `date`

# run your code; you need to specify the absolute path for your
# program in the bash shell

/home/mad/user1/mad

echo Finished at `date`



