[GE users] JB_ja_tasks not found in element
jlopez at cesga.es
Fri Feb 27 10:20:06 GMT 2009
We have experienced another strange problem with GE6.2.
The NFS directory where GE is installed went down and was recovered
after several hours without any apparent problem.
After the recovery, running jobs continued to run and it was possible to
enqueue new jobs, but no new jobs entered execution.
This was the message that appeared in the qmaster logs:
02/26/2009 21:23:14| main|cn142|I|read job database with 934 entries in
02/26/2009 21:23:14| main|cn142|C|!!!!!!!!!! JB_ja_tasks not found in
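For anyone who wants to check their own qmaster messages file for similar entries, here is a minimal sketch that splits lines of the shape shown above and keeps only those with level C. The field names (timestamp, thread, host, level, message) and the reading of C as "critical" are assumptions inferred from the two sample lines, not taken from GE documentation.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    # Field names are assumptions inferred from the sample lines.
    timestamp: str
    thread: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one pipe-delimited log line into its five fields."""
    # maxsplit=4 keeps any '|' inside the message text intact.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the expected shape, skip it
    return LogEntry(*(p.strip() for p in parts))


def critical_entries(lines: Iterable[str]) -> List[LogEntry]:
    """Return only the entries whose level field is 'C' (assumed: critical)."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e.level == "C"]


# The two (truncated) lines quoted in this message.
sample = [
    "02/26/2009 21:23:14| main|cn142|I|read job database with 934 entries in",
    "02/26/2009 21:23:14| main|cn142|C|!!!!!!!!!! JB_ja_tasks not found in",
]

for entry in critical_entries(sample):
    print(entry.timestamp, entry.host, entry.message)
```

In practice the lines would be read from the qmaster messages file instead of the `sample` list.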
The cause of the problem seems to be a queued MPI job that was blocking
the system. After moving the spool directory corresponding to this job,
the queued jobs started executing.
The job that caused the problem (I will call it job 2) had this
Job 2 was an MPI job using 16 slots and it was on hold waiting for job 3
that was in execution at the time of the NFS failure.
Job 3 was on hold waiting for job 2.
Just when the NFS filesystem was recovered, job 2 finished execution
(it was probably stalled while trying to write its final output).
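As described, jobs 2 and 3 each wait on the other. A rough sketch of how such mutual holds could be detected is below, assuming the dependencies have already been collected into a plain dictionary mapping each job id to the job ids it waits for. The `holds` mapping and the function name are hypothetical and for illustration only; nothing here reads from GE itself.

```python
from typing import Dict, List, Optional


def find_hold_cycle(waits_on: Dict[int, List[int]]) -> Optional[List[int]]:
    """Return the job ids along one dependency cycle, or None if there is none."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[int, int] = {}
    path: List[int] = []

    def visit(job: int) -> Optional[List[int]]:
        state[job] = IN_PROGRESS
        path.append(job)
        for dep in waits_on.get(job, ()):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path: a cycle is closed.
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[job] = DONE
        return None

    for job in list(waits_on):
        if state.get(job, UNSEEN) == UNSEEN:
            cycle = visit(job)
            if cycle:
                return cycle
    return None


# Hypothetical mapping for the situation described: 2 waits on 3, 3 waits on 2.
holds = {2: [3], 3: [2]}
print(find_hold_cycle(holds))  # prints [2, 3, 2]
```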
It is not very clear why this happened. Has anyone experienced something
similar?