[GE users] Intermittent problem with job submission

John Hearns john.hearns at streamline-computing.com
Wed Dec 14 18:49:44 GMT 2005


On Wed, 2005-12-14 at 17:05 +0000, Petra Kogel wrote:
> 
> Hi!
> 
> We are (still) running sge6.0u1, and all was fine until a few days ago. 
> Since then, we see intermittent job submission failures such as this:
> 
> Job 3663358 caused action: Queue "serial at bee-ge07" set to ERROR
>   User        = rdx
>   Queue       = serial at bee-ge07
>   Host        = bee-ge07
>   Start Time  = <unknown>
>   End Time    = <unknown>
> failed assumedly before job:can't write script file 
> "job_scripts/3663358" wrote only -1 of 234272 bytes: Bad address
> 
> When looking at the job_scripts directories,
> - the one on the sgemaster contains the complete job script
> - the one on the execution host is always exactly 4096 bytes long and empty.
> 
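The log line reads like the result of a `write()` call that returned -1 and set errno to EFAULT, whose standard text is "Bad address". As a rough illustration only (this is not Grid Engine's source, and `write_all` is a hypothetical helper), a loop that writes a whole job script and reports a failure in the same shape might look like this:

```python
import errno
import os


def write_all(fd: int, data: bytes) -> int:
    """Write all of `data` to `fd`, looping over short writes.

    Hypothetical helper, not SGE code. On failure it raises an OSError
    whose text mimics the execd log line, with -1 standing in for the
    value a C-level write() returns on error.
    """
    written = 0
    while written < len(data):
        try:
            n = os.write(fd, data[written:])
        except OSError as exc:
            # In C, write() would have returned -1 and set errno;
            # errno.EFAULT maps to the text "Bad address".
            raise OSError(
                exc.errno,
                f"wrote only -1 of {len(data)} bytes: {os.strerror(exc.errno)}",
            ) from exc
        written += n  # a short write is not an error; keep going
    return written
```

EFAULT normally means the buffer pointer handed to `write()` was invalid in the writing process, so it points at the daemon side rather than at the spool file system being full.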
Petra,
   I'm sorry that this is not much help,
but I had a similar problem recently.
Running a job would switch a master queue into ERROR state, with the
error 'could not create directory active_jobs/nnnn.1', where nnnn was
the job ID. If you cleared the E state, the job would then submit OK.
I tracked it down to one host that had a problem with its NIS setup
and did not have the correct SGE user.
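One way to look for the kind of NIS problem described above is to check, on every host, that the SGE admin user resolves through the name service and maps to the same uid/gid everywhere. A minimal sketch, assuming Python is available on the hosts; `check_user` is a hypothetical helper and "sgeadmin" is only a placeholder for whatever admin user was chosen at install time:

```python
import pwd
import socket
import sys


def check_user(name: str) -> str:
    """Report how `name` resolves on this host via NSS (files, NIS, ...)."""
    host = socket.gethostname()
    try:
        pw = pwd.getpwnam(name)
    except KeyError:
        # Missing entry: the user is unknown to this host's name service.
        return f"{host}: user {name!r} NOT found (check NIS / nsswitch.conf)"
    return f"{host}: {pw.pw_name} uid={pw.pw_uid} gid={pw.pw_gid} home={pw.pw_dir}"


if __name__ == "__main__":
    # "sgeadmin" is a placeholder; pass the real admin user as an argument.
    for user in sys.argv[1:] or ["sgeadmin"]:
        print(check_user(user))
```

Running it on the master and on each execution host and comparing the output lines should show any host where the user is missing or mapped to a different uid.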

More information about the gridengine-users mailing list