[GE users] Unlinking-problems on job-start

Thomas Neumann neumann at exasol.com
Tue Nov 15 10:14:13 GMT 2005


Hello !

Currently I have problems with jobs running under SGE-6.0u4: After 
starting a job using more than 16 machines, the job immediately reports 
an error of the following type:
ERROR: unlinking "jobs/00/0001/3484.1": No such file or directory

(I filtered out the messages for the job from all message-files in the 
spool directories, see below)
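
For anyone who wants to do the same kind of filtering, here is a minimal 
sketch in Python. It is not the exact method used for the excerpt below. 
It assumes that each spool directory contains a plain-text file named 
"messages" and that the job can be recognised by simple substrings such 
as "13484.1" or "0001/3484.1"; the function names are made up for this 
example.

```python
import os


def matching_lines(lines, needles):
    """Return the lines that contain at least one of the given substrings."""
    return [line.rstrip("\n") for line in lines
            if any(needle in line for needle in needles)]


def collect_job_messages(spool_root, needles, filename="messages"):
    """Walk spool_root and gather matching lines from every file called
    `filename` (assumed here to be "messages"). Returns (path, line) pairs."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(spool_root):
        if filename in filenames:
            path = os.path.join(dirpath, filename)
            # errors="replace" keeps going if a log has odd bytes in it
            with open(path, errors="replace") as fh:
                hits.extend((path, line) for line in matching_lines(fh, needles))
    return hits
```

Calling collect_job_messages() with the spool root and the list 
["13484.1", "0001/3484.1"] would then return every matching line together 
with the file it came from.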

Here is what I figured out up to this point:
* The machine reporting the error is never the machine where the 
"master job" is running, and it is a different machine each time 
(both physically and in its 'position' in the pe_hostfile).
* Reading the messages of several errors, it seems that the problem 
always appears seven seconds after the job has been started.
* The job already created logfiles, but they do not contain any obvious 
errors.
* I was unable to 'reproduce' the problem in an interactive job.
* Reading the logs, all commands run with 'qrsh -inherit' seem to have 
failed.
* The job did not terminate after the problem, but it seems that after 
the first appearance of this error no further 'qrsh -inherit' commands 
in the job were run successfully.

Has anybody got an idea what could be the reason for the problem?

Thanks,
    Thomas



(Messages for one of the crashed jobs using 32 machines):
Sun Nov 13 16:02:07 2005        execd           cn14            ERROR: 
unlinking "jobs/00/0001/3484.1": No such file or directory

[13 times exactly the same message]

Sun Nov 13 16:02:07 2005        execd           cn14            ERROR: 
unlinking "jobs/00/0001/3484.1": No such file or directory
Sun Nov 13 16:02:07 2005        execd           cn14            
acknowledge for unknown job 13484.1/master

[13 times exactly the same message]

Sun Nov 13 16:02:07 2005        execd           cn14            
acknowledge for unknown job 13484.1/master
Sun Nov 13 16:02:07 2005        execd           cn14            can not 
remove file job spool file: jobs/00/0001/3484.1

[13 times exactly the same message]

Sun Nov 13 16:02:07 2005        execd           cn14            can not 
remove file job spool file: jobs/00/0001/3484.1
Sun Nov 13 16:02:07 2005        execd           cn14            can't 
find active jobs directory "active_jobs/13484.1" for reaping job 13484

[13 times exactly the same message]

Sun Nov 13 16:02:07 2005        execd           cn14            can't 
find active jobs directory "active_jobs/13484.1" for reaping job 13484
Sun Nov 13 16:02:07 2005        execd           cn14            can't 
remove directory "active_jobs/13484.1": opendir(active_jobs/13484.1) 
failed: No such file or directory

[13 times exactly the same message]

Sun Nov 13 16:02:07 2005        execd           cn14            can't 
remove directory "active_jobs/13484.1": opendir(active_jobs/13484.1) 
failed: No such file or directory
Sun Nov 13 16:02:07 2005        execd           cn14            ja-task 
"13484.1" is unknown - reporting it to qmaster

[13 times exactly the same message]

Sun Nov 13 16:02:07 2005        execd           cn14            ja-task 
"13484.1" is unknown - reporting it to qmaster

[From this point there are messages from different execds and the qmaster]



If it helps, I can send the complete messages separately, as I do not 
want to attach too much text ...
