[GE users] sgemaster all jobs stuck in qw status even though plenty of slots are available

mhanby mhanby at uab.edu
Sat Oct 30 05:20:12 BST 2010


Howdy,

Grid Engine 6.2u5 on CentOS 5.5 x86_64. Currently there are plenty of slots available and none of the nodes with available slots are overloaded.

I have a user who has been submitting blocks of jobs, 50,000 or more at a time. Each job runs for only a few minutes. This had been working until today, when he submitted over 80,000 jobs in a single batch. Perhaps it's just coincidence, but the scheduler hasn't started any jobs since this 80k load was submitted.

sgemaster is gobbling up 99.9% of the CPU, and his jobs and other users' jobs are stuck in the 'qw' state.
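For anyone wanting to see how the pending backlog is split across users, here is a rough sketch that tallies 'qw' lines per user. It assumes the default qstat column layout (user in column 4, state in column 5); the function name count_pending_by_user is made up for illustration. It counts output lines, so a pending array job counts once, and held or error states such as 'hqw' or 'Eqw' are not included.

```shell
# Tally pending (qw) jobs per user from qstat-style text on stdin.
# Assumes the default qstat columns: job-ID prior name user state ...
# count_pending_by_user is a hypothetical helper name, not part of Grid Engine.
count_pending_by_user() {
    # Only lines whose first field is a numeric job ID are considered,
    # which skips the two header lines.
    awk '$1 ~ /^[0-9]+$/ && $5 == "qw" { n[$4]++ }
         END { for (u in n) print n[u], u }' | sort -rn
}
```

On a live cluster this would be fed from qstat, for example: qstat -u '*' -s p | count_pending_by_user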

I've tried restarting sgemaster AND restarting the server without success.

I can't find anything that helps in the sgemaster messages log file.

'qalter -w p' for jobs that the 80k user submitted provides this result:

$ qalter -w p 4001818
verification: found suitable queue(s)

And qalter provides this result for the other users' jobs:

$ qalter -w p 4081593
verification: found possible assignment with 1 slots

I'm at a loss for what to do to get sgemaster to start dispatching jobs again.

Any help would be appreciated,

Mike



