[GE users] sgemaster all jobs stuck in qw status even though plenty of slots are available
mhanby at uab.edu
Sat Oct 30 05:20:12 BST 2010
Grid Engine 6.2u5 on CentOS 5.5 x86_64. Currently there are plenty of slots available and none of the nodes with available slots are overloaded.
I have a user who has been submitting blocks of jobs, 50,000 and more at a time. The jobs run only a few minutes. This had been working until today when he submitted over 80,000 jobs in a single batch. Perhaps it's just coincidence, but the scheduler hasn't been starting jobs since this 80k load was submitted.
sgemaster is consuming 99.9% of the CPU, and both his jobs and other users' jobs are stuck in the 'qw' state.
I've tried restarting sgemaster AND restarting the server without success.
I can't find anything that helps in the sgemaster messages log file.
Running 'qalter -w p' on one of the jobs from the 80k batch gives this result:
$ qalter -w p 4001818
verification: found suitable queue(s)
And qalter gives this result for the other users' jobs:
$ qalter -w p 4081593
verification: found possible assignment with 1 slots
I'm at a loss for what to do to get sgemaster to start dispatching jobs again.
Any help would be appreciated,