[GE users] mpi jobs failing

Steven Ruby steven.ruby at wni.com
Thu May 5 16:07:51 BST 2005


./qmaster/messages:05/05/2005 10:37:41|qmaster|wems|W|job 9052.1 failed
on host wems44.grid.wni.com in recognising job because: execd doesn't
know this job

./qmaster/messages:05/05/2005
10:38:02|qmaster|wems|E|execd@wems44.grid.wni.com reports running job
(9052.1/master) in queue "Comp.q@wems44.grid.wni.com" that was not
supposed to be there - killing

./wems44/messages:05/05/2005 10:37:32|execd|wems44|E|acknowledge for
unknown job 9052.1/master

./wems44/messages:05/05/2005 10:37:32|execd|wems44|E|can't find active
jobs directory "active_jobs/9052.1" for reaping job 9052

./wems44/messages:05/05/2005 10:37:32|execd|wems44|E|ERROR: unlinking
"jobs/00/0000/9052.1": No such file or directory

./wems44/messages:05/05/2005 10:37:32|execd|wems44|E|can not remove file
job spool file: jobs/00/0000/9052.1

./wems44/messages:05/05/2005 10:37:32|execd|wems44|E|can't remove
directory "active_jobs/9052.1": opendir(active_jobs/9052.1) failed: No
such file or directory

./wems44/messages:05/05/2005 10:37:32|execd|wems44|E|ja-task "9052.1" is
unknown - reporting it to qmaster

./wems44/messages:05/05/2005 10:38:02|execd|wems44|E|acknowledge for
unknown job 9052.1/master

./wems44/messages:05/05/2005 10:38:02|execd|wems44|E|can't find active
jobs directory "active_jobs/9052.1" for reaping job 9052

./wems44/messages:05/05/2005 10:38:02|execd|wems44|E|ERROR: unlinking
"jobs/00/0000/9052.1": No such file or directory

./wems44/messages:05/05/2005 10:38:02|execd|wems44|E|can not remove file
job spool file: jobs/00/0000/9052.1

./wems44/messages:05/05/2005 10:38:02|execd|wems44|E|can't remove
directory "active_jobs/9052.1": opendir(active_jobs/9052.1) failed: No
such file or directory

./wems44/messages:05/05/2005 10:38:02|execd|wems44|E|ja-task "9052.1" is
unknown - reporting it to qmaster

./wems44/messages:05/05/2005 10:38:42|execd|wems44|E|acknowledge for
unknown job 9052.1/master

./wems44/messages:05/05/2005 10:38:42|execd|wems44|E|can't find active
jobs directory "active_jobs/9052.1" for reaping job 9052

./wems44/messages:05/05/2005 10:38:42|execd|wems44|E|ERROR: unlinking
"jobs/00/0000/9052.1": No such file or directory

./wems44/messages:05/05/2005 10:38:42|execd|wems44|E|can not remove file
job spool file: jobs/00/0000/9052.1

./wems44/messages:05/05/2005 10:38:42|execd|wems44|E|can't remove
directory "active_jobs/9052.1": opendir(active_jobs/9052.1) failed: No
such file or directory

./wems44/messages:05/05/2005 14:43:00|execd|wems44|E|removing
unreferenced job 9052.1 without job report from ptf

I have seen some traffic on the list about this error. Does anyone know
what is causing it? It seems to be random: when the failed jobs are
resubmitted to the same hosts, they run.
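
To line up the qmaster and execd entries for one job, the excerpts above can be merged into a single timeline. The following is only a minimal sketch: it assumes each record sits on one line in the form date time|daemon|host|level|message (as in the unwrapped messages files), that the date is MM/DD/YYYY, and the function names parse_line and timeline are made up for illustration.

```python
import re
from datetime import datetime

# Assumed record layout, inferred from the excerpts in this post:
#   MM/DD/YYYY HH:MM:SS|daemon|host|level|message
# search() is used so a grep-style "./path/messages:" prefix is tolerated.
LINE_RE = re.compile(
    r"(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})"
    r"\|(?P<daemon>[^|]+)\|(?P<host>[^|]+)\|(?P<level>[A-Z])\|(?P<msg>.*)"
)


def parse_line(line):
    """Return a dict for one log record, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Assumption: month comes first in the date field.
    rec["stamp"] = datetime.strptime(rec["stamp"], "%m/%d/%Y %H:%M:%S")
    return rec


def timeline(lines, job_id):
    """Collect records whose message mentions job_id, oldest first."""
    # Digit guards keep "9052.1" from matching "19052.1" or "9052.10".
    wanted = re.compile(r"(?<!\d)" + re.escape(job_id) + r"(?!\d)")
    events = []
    for line in lines:
        rec = parse_line(line)
        if rec is not None and wanted.search(rec["msg"]):
            events.append(
                (rec["stamp"], rec["daemon"], rec["host"], rec["level"], rec["msg"])
            )
    return sorted(events, key=lambda e: e[0])
```

Feeding it the lines from both messages files with job_id "9052.1" gives one list ordered by time, which makes it easier to see whether the execd entries come before or after the qmaster entries for the same job.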

sr

--------

"Give me an army of West Point graduates and i'll win a battle. Give me
a handful of Texas Aggies and i'll win the war."

        -- Gen. George S. Patton
