[GE users] JB_ja_tasks not found in element

jlopez jlopez at cesga.es
Fri Feb 27 10:20:06 GMT 2009


We have experienced another strange problem with GE6.2.

The NFS directory where GE is installed went down and was recovered 
after several hours without any apparent problem.

After the recovery, running jobs continued to run and it was possible to 
enqueue new jobs, but no new jobs entered execution.

This was the message that appeared in the qmaster logs:

02/26/2009 21:23:14|  main|cn142|I|read job database with 934 entries in 2 seconds
02/26/2009 21:23:14|  main|cn142|C|!!!!!!!!!! JB_ja_tasks not found in element !!!!!!!!!!
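
For anyone who wants to check their own qmaster messages file for similar
entries, here is a minimal Python sketch. It assumes the pipe-separated
layout seen in the two lines above (timestamp, thread, host, level, message)
and that the level letter "C" marks a critical entry. The names LogEntry,
parse_line and critical_entries are only illustrative, not part of GE.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one log line into its five fields, or return None if it does not fit."""
    # Assumed layout: timestamp|thread|host|level|message
    # maxsplit=4 keeps any further '|' characters inside the message field.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return LogEntry(*(p.strip() for p in parts))


def critical_entries(lines: Iterable[str]) -> List[LogEntry]:
    """Return the entries whose level field is 'C' (assumed to mean critical)."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e.level == "C"]


sample = [
    "02/26/2009 21:23:14|  main|cn142|I|read job database with 934 entries in 2 seconds",
    "02/26/2009 21:23:14|  main|cn142|C|!!!!!!!!!! JB_ja_tasks not found in element !!!!!!!!!!",
]

for entry in critical_entries(sample):
    print(entry.timestamp, entry.host, entry.message)
```

To scan a real file, the same function can be fed an open file object
instead of the sample list, since it only needs an iterable of lines.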

The cause of the problem seems to be a queued MPI job that was blocking 
the system. After moving the spool directory corresponding to this job, 
the queued jobs started executing.

The job that caused the problem (I will call it job 2) had these 
interesting features:

Job 2 was an MPI job using 16 slots, and it was on hold waiting for job 3, 
which was in execution at the time of the NFS failure.
Job 3 was on hold waiting for job 2.
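
The mutual hold described above (job 2 waiting for job 3, and job 3 waiting
for job 2) is a cycle in the hold dependencies. As a rough illustration of
how such a cycle could be spotted, here is a small Python sketch. The
dictionary of holds is hypothetical input typed in by hand; this does not
query GE in any way, and find_hold_cycle is just an illustrative name.

```python
from typing import Dict, List, Optional


def find_hold_cycle(holds: Dict[int, List[int]]) -> Optional[List[int]]:
    """Return one cycle in the hold graph as a list of job ids, or None.

    holds maps a job id to the job ids it is waiting for (hypothetical input).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state: Dict[int, int] = {}
    path: List[int] = []

    def visit(job: int) -> Optional[List[int]]:
        state[job] = GREY
        path.append(job)
        for dep in holds.get(job, ()):
            dep_state = state.get(dep, WHITE)
            if dep_state == GREY:
                # dep is already on the current path, so we have closed a loop
                return path[path.index(dep):] + [dep]
            if dep_state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[job] = BLACK
        return None

    for job in list(holds):
        if state.get(job, WHITE) == WHITE:
            found = visit(job)
            if found:
                return found
    return None


# The situation from this report: job 2 waits for job 3, job 3 waits for job 2.
print(find_hold_cycle({2: [3], 3: [2]}))
```

This is a plain depth-first search; it only shows that the two holds as
described cannot both be released, not why the qmaster reacted as it did.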

Just when the NFS filesystem was recovered, job 2 finished execution 
(probably it had stalled while trying to write its final output).

It is not clear why this happened. Has anyone experienced something similar?


