[GE users] sgemaster all jobs stuck in qw status even though plenty of slots are available
mhanby at uab.edu
Sat Oct 30 17:39:57 BST 2010
This is an 8-core Xeon E5450 server, 3 GHz, with 16GB of RAM.
I agree with you about the user submitting jobs in such large batches, but in the past this many jobs in the queue hasn't caused any problems (just annoyances like qstat slowness). I've had to change the qstat behavior back to the default of showing only the owner's jobs, which annoys my user base to no end :-)
Do other SGE sites have to do any special SGE conf changes to handle large queues?
I found out what was going on: the file system for $SGE_ROOT filled up because schedd_runlog hit 16GB. The odd thing is that the file system didn't report that it was low on space until I rebooted the server, at which time it reported 0% free.
I bzip'd the schedd_runlog to free up space and sgemaster started up and began running jobs.
Do I need to manually create the schedd_runlog? I notice it hasn't been recreated by SGE even though jobs have started and completed.
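For anyone hitting the same symptom, a quick check of the file system holding $SGE_ROOT and of the size of schedd_runlog can be sketched as below. This is only a sketch: the fallback values /opt/sge and default, the location of schedd_runlog under $SGE_ROOT/$SGE_CELL/common, and the helper names fs_use_percent and file_size_bytes are assumptions for illustration, so adjust them for your site.

```shell
#!/bin/sh
# Assumed defaults; override with your site's SGE_ROOT and SGE_CELL.
SGE_ROOT="${SGE_ROOT:-/opt/sge}"
SGE_CELL="${SGE_CELL:-default}"
# Assumed location of the scheduler run log.
RUNLOG="$SGE_ROOT/$SGE_CELL/common/schedd_runlog"

# Print the used-space percentage (digits only) of the file system holding $1.
fs_use_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print the size of file $1 in bytes, or 0 if it does not exist.
file_size_bytes() {
    if [ -f "$1" ]; then
        wc -c < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Only report when the assumed SGE_ROOT actually exists on this host.
if [ -d "$SGE_ROOT" ]; then
    echo "File system use for $SGE_ROOT: $(fs_use_percent "$SGE_ROOT")%"
    echo "schedd_runlog size: $(file_size_bytes "$RUNLOG") bytes"
fi
```

Running something like this from cron and alerting above a chosen threshold would have flagged the 16GB log before the file system filled.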
From: laotsao [laotsao at gmail.com]
Sent: Saturday, October 30, 2010 8:54 AM
Cc: Mike Hanby
Subject: Re: [GE users] sgemaster all jobs stuck in qw status even though plenty of slots are available
What is your qmaster CPU (cores, sockets, GHz) and memory?
Not sure why your user wants to submit 80k jobs; every job in the queue takes up qmaster memory.
On 10/30/2010 12:20 AM, mhanby wrote:
Grid Engine 6.2u5 on CentOS 5.5 x86_64. Currently there are plenty of slots available and none of the nodes with available slots are overloaded.
I have a user who has been submitting blocks of jobs, 50,000 and more at a time. The jobs run only a few minutes. This had been working until today when he submitted over 80,000 jobs in a single batch. Perhaps it's just coincidence, but the scheduler hasn't been starting jobs since this 80k load was submitted.
sgemaster is gobbling up 99.9% of the CPU, and his jobs and other users' jobs are stuck in a 'qw' state.
I've tried restarting sgemaster AND restarting the server without success.
I can't find anything that helps in the sgemaster messages log file.
'qalter -w p' for the jobs that the 80k user submitted provides this result:
$ qalter -w p 4001818
verification: found suitable queue(s)
And qalter provides this result for the other users' jobs:
$ qalter -w p 4081593
verification: found possible assignment with 1 slots
I'm at a loss for what to do to get sgemaster to start running jobs again.
Any help would be appreciated,