[GE users] many "killing job that was not supposed to be there" messages in logs

reuti reuti at staff.uni-marburg.de
Fri May 21 11:28:36 BST 2010


Hi,

On 20.05.2010 at 19:10, engel_sanchez wrote:

> Hello. My qmaster messages file has many messages like the following:
> 
> 05/20/2010 11:32:14|worker|head|E|execd at node024 reports running job (37267.1/2.node024) in queue "all.q at node024" that was not supposed to be there - killing
> 
> 
> This job (37267) was initially running on my node020, but got stuck a while ago. Ever since, node020 gets stuck the same way (jobs remain in dr state after being deleted) and the qmaster logs these error messages. The qmaster had to be restarted twice around that time, so I imagine that something in the spooling db might be corrupted. Any pointers on how to debug this further or fix it would be really appreciated. Thanks in advance!

Are there leftover entries in:

$SGE_ROOT/spool/qmaster/<exechost>/00/...

For jobs that are already gone, it should be possible to remove these entries. To be on the safe side, you can drain the node first and then remove everything inside that directory.
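
To see what is actually left in such a directory before deleting anything, a small read-only listing can help. The following is only a minimal sketch, not a Grid Engine tool: the function name list_spool_entries is hypothetical, the directory to inspect is whatever spool path applies on your installation, and nothing is assumed about how Grid Engine names the files inside it. It just walks the directory and returns every file with its modification time, oldest first, so stale entries from long-gone jobs stand out.

```python
import os


def list_spool_entries(spool_dir):
    """Return (relative_path, mtime) for every file below spool_dir, oldest first.

    Read-only helper (hypothetical name): it never deletes or modifies anything.
    """
    entries = []
    for root, _dirs, files in os.walk(spool_dir):
        for name in files:
            path = os.path.join(root, name)
            # Store the path relative to spool_dir to keep the output short.
            rel = os.path.relpath(path, spool_dir)
            entries.append((rel, os.path.getmtime(path)))
    # Oldest files first: leftovers from old jobs should appear at the top.
    entries.sort(key=lambda entry: entry[1])
    return entries


# Example use (the path is an assumption, substitute your own spool directory):
#   for rel, mtime in list_spool_entries("/path/to/spool/dir"):
#       print(mtime, rel)
```

Comparing the listed entries against the jobs that are still known to the cluster is left to the administrator; the sketch deliberately does not try to decide what is safe to remove.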

-- Reuti

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=258072
