[GE users] Intermittent problem with job submission

Petra Kogel Petra.Kogel at ecmwf.int
Wed Dec 14 17:05:27 GMT 2005


We are (still) running sge6.0u1, and all was fine until a few days ago.
Since then, we see intermittent job submission failures such as this:

Job 3663358 caused action: Queue "serial at bee-ge07" set to ERROR
  User        = rdx
  Queue       = serial at bee-ge07
  Host        = bee-ge07
  Start Time  = <unknown>
  End Time    = <unknown>
failed assumedly before job:can't write script file 
"job_scripts/3663358" wrote only -1 of 234272 bytes: Bad address

When looking at the job_scripts directories,
- the one on the sgemaster contains the complete job script
- the one on the execution host is always exactly 4096 bytes long and empty.
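
The check described above could be scripted roughly as follows. This is only a sketch under assumptions: the function name is made up for illustration, the directory to scan has to be supplied by the caller (the spool location differs per installation), and "empty" is taken to mean that the file holds nothing but NUL bytes or whitespace.

```python
import os
import sys


def find_truncated_scripts(job_scripts_dir, suspect_size=4096):
    """Return names of files that are exactly suspect_size bytes long
    and contain only NUL bytes or whitespace (hypothetical helper)."""
    suspects = []
    for name in sorted(os.listdir(job_scripts_dir)):
        path = os.path.join(job_scripts_dir, name)
        # Skip subdirectories and anything that is not a regular file.
        if not os.path.isfile(path):
            continue
        # Only files of exactly the suspicious size are of interest.
        if os.path.getsize(path) != suspect_size:
            continue
        with open(path, "rb") as fh:
            data = fh.read()
        # Assumption: "empty" means nothing but NULs/whitespace remains.
        if not data.strip(b"\x00 \t\r\n"):
            suspects.append(name)
    return suspects


if __name__ == "__main__":
    # Usage: python check_job_scripts.py <path to a job_scripts directory>
    for job_id in find_truncated_scripts(sys.argv[1]):
        print(job_id)
```

Run against the job_scripts directory on an execution host, it prints the job IDs whose script files match the 4096-byte, empty pattern, which can then be compared with the complete copies on the master.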

The problem appears to hit whichever execution host is used (we have 26 
of them). And since we have queues spanning several execution hosts,
the job keeps switching off more and more queue instances until it gets
killed with qdel.

It therefore appears unlikely that the problem is on the execution 
hosts. So yesterday, I stopped the qmaster, cleaned out the job_scripts
on the master, and restarted qmaster. Things were quiet for a day after
that, but now the problem is back.

Any clues would be most welcome!

Many thanks and kind regards,


Petra Kogel, Senior Systems Analyst, Servers & Desktops Section
European Centre for Medium-Range Weather Forecasts (ECMWF)
Shinfield Park, Reading, Berkshire, RG2 9AX, UK (http://www.ecmwf.int)
Email: pkogel at ecmwf.int Telephone: (++44) 118 9499364
