Wed Jan 12 20:38:46 GMT 2011


# df -k
Filesystem            kbytes    used   avail capacity  Mounted on
<snip>
vi64-x4150c-sca11:/opt/sge
                    135832780 8524108 120408756     7%    /opt/sge

# mount | grep vi64-x4150c-sca11
/opt/sge on vi64-x4150c-sca11:/opt/sge remote/read/write/setuid/devices/xattr/dev=534002d on Wed Jan  7 11:32:08 2009
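
(Since I suspect the NFS mount, the actual NFS mount options may be worth
recording as well.  On Solaris, something like

# nfsstat -m /opt/sge

should list the mount flags, including the attribute cache timeouts
(acregmin/acregmax); that output is not included here.)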


Now, the following steps were taken on the execd host:

1. Wait about 15 minutes
   I think the NFS mount has something to do with this issue,
   so I waited this long without any NFS activity on the SGE_ROOT.
   The SGE job spool directory is set to a local directory:
   execd_spool_dir=/var/spool/sge/taketwo

2. Submit a job
   # qsub -b y -o /dev/null -j y -w e -js 0 -t 1:100 sleep 36

Then, this job did not get dispatched at all until another job was submitted
or "qconf -tsm" was executed.

3. Submit another job.

Now the first 8 tasks of job 24 were dispatched, but no further tasks were
dispatched after those first tasks had completed.

# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@v4u-m3000a-sca11            BIP   0/0/8          0.00     sol-sparc64 
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    24 0.00000 sleep      root         qw    01/08/2009 01:12:30     1 9-100:1
    25 0.00000 sleep      root         qw    01/08/2009 02:15:12     1 1-100:1
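
(As a cross-check of the slot counts the scheduler sees, the cluster queue
summary could also be looked at, e.g.:

# qstat -g c

which prints the used/available/total slots per cluster queue; that output
is not shown here.)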


Now qstat -j 24 shows that the scheduler thinks the queue is full.

qstat -j 24
==============================================================
job_number:                 24
exec_file:                  job_scripts/24
submission_time:            Thu Jan  8 01:12:30 2009
<snip>
scheduling info:            queue instance "all.q@v4u-m3000a-sca11" dropped because it is full
                            All queues dropped because of overload or full
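
(Given that the job is only dispatched when another job is submitted or
"qconf -tsm" is run, the scheduler trigger settings may be relevant.  They
can be checked with, e.g.:

# qconf -ssconf | egrep 'schedule_interval|flush_submit_sec|flush_finish_sec'

where schedule_interval is the regular scheduling interval and the two
flush_* parameters control whether a scheduling run is triggered right after
a job is submitted or finishes.  I have not checked these values on this
cluster.)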


However, after this reproduction, I was not able to reproduce the issue again
by following the aforementioned steps.

Interestingly enough, at a later time another incident happened with JobID=28.
This time the job was submitted from the qmaster host, yet it did not get
scheduled for execution.

It is interesting that qstat reports a commlib error after the job is
submitted.
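
(To rule out a plain communication problem between qstat and the qmaster,
qping could be used, e.g.:

# qping -info <qmaster_host> 6444 qmaster 1

The hostname is a placeholder and 6444 is only the usual default qmaster
port; the real value would come from $SGE_QMASTER_PORT or the services
file.)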

# qstat -j 28
==============================================================
job_number:                 28
exec_file:                  job_scripts/28
submission_time:            Thu Jan  8 11:26:19 2009


