[GE users] Job state 65536

jlopez jlopez at cesga.es
Mon Nov 17 16:54:19 GMT 2008


Hi all,

I still have no clue about what caused this problem. The only other issue 
we found is that, ten minutes before it happened, one job was unable to 
write to the HOME directory because the quota had been exceeded.

I do not know if these two problems could be related.

Looking at the code I have seen that the job state 65536 corresponds to 
SUSPENDED_ON_THRESHOLD or FINISHED (I do not know why these two states 
share the same number):

./libs/sgeobj/sge_jobL.h
#define JSUSPENDED_ON_THRESHOLD              0x00010000
#define JFINISHED                            0x00010000
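
To double-check the mapping, here is a minimal sketch in C (my own
illustration, not Grid Engine code). It reuses only the two #define values
quoted above; the helper names job_state_has and job_state_hex are
hypothetical. It shows that decimal 65536 is exactly 0x00010000, and that a
plain bit test cannot tell the two states apart because they share that bit:

```c
#include <stdint.h>
#include <stdio.h>

/* Values copied from libs/sgeobj/sge_jobL.h as quoted in this message. */
#define JSUSPENDED_ON_THRESHOLD 0x00010000
#define JFINISHED               0x00010000

/* Hypothetical helper (not part of Grid Engine): returns 1 if every bit
 * of `flag` is set in `state`, 0 otherwise. */
int job_state_has(uint32_t state, uint32_t flag)
{
    return (state & flag) == flag;
}

/* Hypothetical helper: writes `state` into `buf` as zero-padded hex so it
 * can be compared by eye with the #define values in the header. */
const char *job_state_hex(uint32_t state, char *buf, size_t len)
{
    snprintf(buf, len, "0x%08lx", (unsigned long)state);
    return buf;
}
```

With these helpers, job_state_has(65536, JFINISHED) and
job_state_has(65536, JSUSPENDED_ON_THRESHOLD) both return 1, so which of
the two meanings applies presumably depends on where the value is read.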

Has anyone seen this type of message before?

execd cn068.null reports running 
state for job (691876.1/1.cn068) in queue "medium_queue@cn068.null" 
while job is in state 65536

This is printed by MSG_JOB_REPORTRUNQ_SUUSSU but I do not understand in 
which situations it appears.

Do you think this problem could be due to a bug in GE6.1?

Any help would be much appreciated!

Thanks,
Javier

Javier Lopez Cacheiro wrote:
> Hi all,
>
> Yesterday we suffered a serious problem in our cluster and we do 
> not understand what could have caused it. The symptoms were that at 
> 14:57 all running jobs suddenly appeared in state 65536 for the qmaster, 
> and a few seconds later the qmaster killed them all.
>
> These are the messages that appear in the qmaster logs:
>
> 11/13/2008 14:56:57|qmaster|cn142|E|execd cn068.null reports running 
> state for job (691876.1/1.cn068) in queue "medium_queue@cn068.null" 
> while job is in state 65536
>
> 11/13/2008 14:58:07|qmaster|cn142|E|execd@cn035.null reports running job 
> (691876.1/1.cn035) in queue "medium_queue@cn035.null" that was not 
> supposed to be there - killing
>
> These two messages are repeated for every running job.
>
> I am completely unaware of what the reason could be for this type of 
> message and how a given job could end up in state 65536.
>
> Any help would be much appreciated!
>
> Thanks,
> Javier
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88740
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88894
