[GE users] Deleting jobs without qdel

heywood heywood at cshl.edu
Wed Apr 28 16:29:59 BST 2010


Ivan,

No, I have logs showing the situation started when this job set came into
the system, and have confirmed with the user that he set a mistaken "batch
factor".

Basically, the message count in the qmaster's read buffer (as reported by
qping) just keeps climbing fast, so I suspect the qmod and qdel requests are
stuck far back in that backlog.
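
(For anyone following along, this is roughly what I am looking at, assuming
the default sge_qmaster port of 6444; substitute your own qmaster host and
port:

    qping -info bhmnode2 6444 qmaster 1

The -info output reports, among other things, the number of messages sitting
in the qmaster's read and write buffers.)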

I tried your suggestion, as well as taking the qmaster down on the main head
node and bringing it up on a second node (the one normally reserved for the
shadow master). Nothing has worked so far. I was hoping some SGE developers
might have a trick that clears jobs out of some database files :-).
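
(The only idea I have along those lines, and I have not dared to try it, is
that with classic spooling the pending jobs appear to live as flat files
under the qmaster spool directory, so with sge_qmaster stopped something
like the following might flush them. This is only a sketch: it assumes the
default paths and classic rather than Berkeley DB spooling, and I have no
idea how safe it is:

    # run only while sge_qmaster is down
    cd $SGE_ROOT/$SGE_CELL/spool/qmaster
    mv jobs jobs.bad && mkdir jobs
    mv job_scripts job_scripts.bad && mkdir job_scripts

I would much rather hear from someone who knows whether that is sane.)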

Todd


On 4/28/10 11:17 AM, "iadzhubey" <iadzhubey at rics.bwh.harvard.edu> wrote:

> Todd,
> 
> Try doing that from some other admin host which is neither a qmaster nor an
> execute host. I have a dedicated admin host that is completely outside the
> grid for that purpose.
> 
> That said, I've never seen an error like the one you posted, even in
> situations when our cluster was otherwise totally out of control due to
> some nasty user error. Could it be something else that's causing it? A
> blocked network connection? A bad network driver?
> 
> Best,
> Ivan
> 
> On Wednesday 28 April 2010 11:00:20 am heywood wrote:
>> Thanks, Ivan. Looks like the overload is too much for even that...
>> 
>> [root@bhmnode2 qmaster]# qmod -d \*
>> failed receiving gdi request response for mid=1 (got syncron message
>> receive timeout error).
>> error: commlib error: got read error (closing "bhmnode2/qmaster/1")
>> 
>> 
>> Todd
>> 
>> On 4/28/10 10:52 AM, "iadzhubey" <iadzhubey at rics.bwh.harvard.edu> wrote:
>>> Hi Todd
>>> 
>>> On Wednesday 28 April 2010 10:38:27 am heywood wrote:
>>>> Is there any shortcut to deleting jobs in the system without qdel? We
>>>> had a user "accidentally" submit 500K very short-running jobs. SGE goes
>>>> unresponsive, i.e. all commands hang, even qdel. Qping shows the
>>>> messages in the read buffer constantly growing. I have even tried
>>>> shutting down the qmaster and restarting it.
>>> 
>>> Been there, done that. Except our users often submit arrays on the order
>>> of 10 million tasks. If something goes wrong, it can take quite an effort
>>> to get rid of them. My strategy is first of all to immediately disable
>>> all queues on the system. You can do this with the 'qmod -d \*' command,
>>> which does not involve scanning queue contents and thus executes fairly
>>> fast even on a heavily oversubscribed system. You can then proceed with
>>> deleting the rogue jobs still sitting in the queue.
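>>> 
>>> For example (the user name below is just a placeholder):
>>> 
>>>     qmod -d \*          # disable every queue instance first
>>>     qdel -u baduser     # then delete all of that user's jobs
>>> 
>>> qdel also accepts explicit job ID lists if only part of a submission
>>> needs to go.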
>>> 
>>> Best,
>>> Ivan
>>> 
>> 
> 
