[GE users] sgemaster all jobs stuck in qw status even though plenty of slots are available
laotsao at gmail.com
Sat Oct 30 14:54:51 BST 2010
What is your qmaster's CPU (cores, sockets, GHz) and memory?
I'm not sure why your user wants to submit 80k jobs at once; every job in the queue takes up memory on the qmaster.
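To see how many waiting jobs the qmaster is holding for each user, something along these lines might help. It is only a sketch: `pending_by_user` is a hypothetical helper, not part of Grid Engine, and it assumes the default `qstat` column layout (job-ID, prior, name, user, state, ...) with two header lines.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Grid Engine): read default `qstat`
# output on stdin and print "user count" for jobs whose state contains "qw"
# (qw, hqw, Eqw), largest count first.
# Assumption: column 4 is the user and column 5 is the state.
pending_by_user() {
    awk 'NR > 2 && $5 ~ /qw/ { count[$4]++ }
         END { for (u in count) print u, count[u] }' | sort -k2,2nr
}

# Only query the cluster when qstat is actually on the PATH.
if command -v qstat >/dev/null 2>&1; then
    qstat -u '*' -s p | pending_by_user
fi
```

Note that this counts lines of `qstat` output, so an array job with many pending tasks shows up as a single entry, and long user names may appear truncated in the listing.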
On 10/30/2010 12:20 AM, mhanby wrote:
Grid Engine 6.2u5 on CentOS 5.5 x86_64. Currently there are plenty of slots available and none of the nodes with available slots are overloaded.
I have a user who has been submitting blocks of jobs, 50,000 and more at a time. The jobs run only a few minutes. This had been working until today when he submitted over 80,000 jobs in a single batch. Perhaps it's just coincidence, but the scheduler hasn't been starting jobs since this 80k load was submitted.
sgemaster is gobbling up 99.9% of the CPU, and his jobs and other users' jobs are stuck in a 'qw' state.
I've tried restarting sgemaster AND restarting the server without success.
I can't find anything that helps in the sgemaster messages log file.
'qalter -w p' for the jobs submitted by the user with the 80k batch provides this result:
$ qalter -w p 4001818
verification: found suitable queue(s)
And qalter provides this result for the other users' jobs:
$ qalter -w p 4081593
verification: found possible assignment with 1 slots
I'm at a loss for what to do to get sgemaster to start dispatching jobs again.
Any help would be appreciated,