Dear SGE users,

Yesterday everything seemed fine, but SGE now fails to schedule qsh after a few attempts: the first few requests work, then one fails, and it does not seem to recover. The logs show nothing this time, and qstat -f shows no queues in an error state.

Qacct came up with this:

celaeno:~# qacct -j 2730
qname        all_8max.q
hostname     celaeno.sron.nl
group        eos
owner        pieterm
project      NONE
department   defaultdepartment
jobname      INTERACTIVE
jobnumber    2730
taskid       undefined
account      sge
priority     0
qsub_time    Fri Jun 26 09:38:46 2009
start_time   Fri Jun 26 09:38:46 2009
end_time     Fri Jun 26 09:38:46 2009
granted_pe   NONE
slots        1
failed       0
exit_status  1
ru_wallclock 0
ru_utime     0.000
ru_stime     0.004
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    810
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   8
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     24
ru_nivcsw    0
cpu          0.004
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined
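
A record like the one above can also be checked mechanically. The following is only a minimal Python sketch, assuming the two-column text layout that `qacct -j` printed here; the helper names `parse_qacct` and `exited_immediately` are hypothetical and not part of SGE. It flags jobs for which SGE reports no failure (`failed 0`) but which exited non-zero with zero wallclock time, as job 2730 did.

```python
def parse_qacct(text):
    """Parse two-column `qacct -j` text into a dict of field -> value strings."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split at the first run of whitespace only
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


def exited_immediately(record):
    """True if SGE reports no failure, yet the job exited non-zero with zero wallclock."""
    return (
        record.get("failed") == "0"
        and record.get("exit_status") not in (None, "0")
        and record.get("ru_wallclock") == "0"
    )


# Excerpt of the accounting record shown above (assumed layout).
sample = """\
jobname      INTERACTIVE
jobnumber    2730
failed       0
exit_status  1
ru_wallclock 0
"""

record = parse_qacct(sample)
print(record["jobnumber"], exited_immediately(record))
```

Running the same check over several job numbers might show whether every failed qsh attempt has this same signature.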

Qstat output:

pleione [~]% qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
all_8max.q@celaeno.sron.nl     BIP   0/0/8          0.00     lx26-amd64
all_8max.q@merope.sron.nl      BIP   0/3/8          3.01     lx26-amd64
all_8max.q@pleione.sron.nl     BIP   0/0/8          4.00     lx26-amd64
all_8max.q@taygeta.sron.nl     BIP   0/1/8          1.94     lx26-amd64

There seem to be plenty of free slots on the machines.
What could cause qsh to fail after a couple of attempts? Any help would be appreciated.

With kind regards,

Pieter van der Meer

