[GE users] Deleting jobs without qdel

iadzhubey iadzhubey at rics.bwh.harvard.edu
Wed Apr 28 16:17:08 BST 2010


Todd,

Try doing that from some other admin host which is neither a qmaster nor an 
execute host. I have a dedicated admin host that is completely outside the 
grid for that purpose.

That said, I've never seen an error like the one you posted, even in 
situations where our cluster was otherwise totally out of control due to some 
nasty user error. Could it be something else that's causing it? Network 
connection blocked? Bad network driver?

Best,
Ivan
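
A minimal sketch of the two-step sequence described in the quoted message
below (disable all queues, then delete the rogue jobs), run from an admin
host. The function names, the DRY_RUN switch and the user name argument are
hypothetical and do not come from this thread; 'qdel -u' is assumed here as
one way to remove a single user's jobs.

```shell
#!/bin/sh
# Sketch only: run(), cleanup_rogue_jobs() and DRY_RUN are illustrative
# names, not part of Grid Engine or of the original thread.
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

cleanup_rogue_jobs() {
    rogue_user="$1"
    # Step 1: disable every queue so nothing more gets dispatched.
    # The pattern is quoted so the shell does not expand it.
    run qmod -d '*' || return 1
    # Step 2: delete the jobs owned by the given user (assumption:
    # the rogue jobs all belong to one user).
    run qdel -u "$rogue_user" || return 1
}
```

With the functions loaded, 'cleanup_rogue_jobs someuser' only prints the two
commands; setting DRY_RUN=0 would actually run them. The queues can be
re-enabled afterwards with 'qmod -e' and the same pattern.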

On Wednesday 28 April 2010 11:00:20 am heywood wrote:
> Thanks, Ivan. Looks like the overload is too much for even that...
> 
> [root@bhmnode2 qmaster]# qmod -d \*
> failed receiving gdi request response for mid=1 (got syncron message
> receive timeout error).
> error: commlib error: got read error (closing "bhmnode2/qmaster/1")
> 
> 
> Todd
> 
> On 4/28/10 10:52 AM, "iadzhubey" <iadzhubey at rics.bwh.harvard.edu> wrote:
> > Hi Todd
> > 
> > On Wednesday 28 April 2010 10:38:27 am heywood wrote:
> >> Is there any shortcut to deleting jobs in the system without qdel? We
> >> had a user "accidentally" submit 500K very short running jobs. SGE goes
> >> unresponsive, i.e. all commands hang, even qdel. Qping shows the
> >> messages in the read buffer constantly growing. I have even tried
> >> shutting down the qmaster and restarting it.
> > 
> > Been there, done that. Except our users often submit arrays in the range
> > of 10 million tasks easily. If something goes wrong it may take quite an
> > effort to get rid of them. My strategy is first of all to disable all
> > queues on the system immediately. You can do this with the 'qmod -d \*'
> > command, which does not involve scanning queue contents and thus
> > executes fairly fast even on a heavily oversubscribed system. You can
> > then proceed with deleting the rogue jobs still sitting in the queue.
> > 
> > Best,
> > Ivan
> > 
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=255300
> > 
> > To unsubscribe from this discussion, e-mail:
> > [users-unsubscribe at gridengine.sunsource.net].
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=255301
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=255304



