[GE users] Job state 65536

Javier Lopez Cacheiro jlopez at cesga.es
Fri Nov 14 10:42:32 GMT 2008


Hi all,

Yesterday we suffered a serious problem in our cluster and we do 
not understand what could have caused it. The symptoms were that at 
14:57 all running jobs suddenly appeared in state 65536 for the qmaster, 
and a few seconds later the qmaster killed them all.

These are the messages that appear in the qmaster logs:

11/13/2008 14:56:57|qmaster|cn142|E|execd cn068.null reports running 
state for job (691876.1/1.cn068) in queue "medium_queue at cn068.null" 
while job is in state 65536

11/13/2008 14:58:07|qmaster|cn142|E|execd at cn035.null reports running job 
(691876.1/1.cn035) in queue "medium_queue at cn035.null" that was not 
supposed to be there - killing

These two messages are repeated for every running job.
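
In case it helps anyone checking similar logs, here is a minimal Python sketch for tallying which jobs were reported in an unexpected state and which were killed. It assumes only the pipe-delimited layout visible in the two lines above (timestamp|daemon|host|level|message); the names parse_line and summarize are hypothetical helpers, not part of Grid Engine, and the sample lines are the two messages above with the mail line-wrapping undone.

```python
import re
from collections import Counter

# The two messages quoted above, re-joined onto single lines.
SAMPLE = [
    '11/13/2008 14:56:57|qmaster|cn142|E|execd cn068.null reports running '
    'state for job (691876.1/1.cn068) in queue "medium_queue at cn068.null" '
    'while job is in state 65536',
    '11/13/2008 14:58:07|qmaster|cn142|E|execd at cn035.null reports running job '
    '(691876.1/1.cn035) in queue "medium_queue at cn035.null" that was not '
    'supposed to be there - killing',
]

# Assumed patterns, derived only from the two sample messages.
JOB_RE = re.compile(r"\((\d+)\.(\d+)/")            # "(691876.1/" -> job id, task id
STATE_RE = re.compile(r"while job is in state (\d+)")


def parse_line(line):
    """Split one pipe-delimited log line into a dict; None if it does not fit."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, daemon, host, level, message = parts
    job = JOB_RE.search(message)
    state = STATE_RE.search(message)
    return {
        "timestamp": timestamp,
        "level": level,
        "job_id": int(job.group(1)) if job else None,
        "state": int(state.group(1)) if state else None,
        "killed": message.endswith("- killing"),
    }


def summarize(lines):
    """Count reported states and collect the ids of jobs that were killed."""
    records = [r for r in (parse_line(l) for l in lines) if r]
    states = Counter(r["state"] for r in records if r["state"] is not None)
    killed = sorted({r["job_id"] for r in records
                     if r["killed"] and r["job_id"] is not None})
    return states, killed


if __name__ == "__main__":
    states, killed = summarize(SAMPLE)
    for state, count in states.items():
        # Printing the hex form as well: 65536 is 2**16, i.e. a single bit.
        print(f"state {state} ({hex(state)}): {count} message(s)")
    print("killed jobs:", killed)
```

To run it against a real log, the SAMPLE list could be replaced with the lines of the qmaster messages file. Note that 65536 is 0x10000, a single bit, which might suggest the state is stored as a bit flag, but that is only a guess on my side.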

I have no idea what could cause this type of message, or how a job 
could end up in state 65536.

Any help would be much appreciated!

Thanks,
Javier

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88740
