[GE users] very large messages file

cgull matt.mcnally at virgin.net
Fri Jan 29 09:36:22 GMT 2010

Last night on a cluster running SGE, a job appeared to run OK, but at the end of the run the following error was written to the log for all of the nodes:
error: commlib error: got read error (closing "nodename/shepherd_ijs/1")

I can then see the following error messages on one of the nodes:
main|nuvo|E|slave shepherd of job 1094.1 exited with exit status = 11
main|nuvo|E|abnormal termination of shepherd for job 1094.1 task 10.nuvo: "exit_status" file is empty

The next job that attempted to run on these machines was then unable to start because the directory was full.

The hard disk appears to have filled up on a few nodes because the messages file in the directory /opt/sge6-2/ge6.2u3/default/spool/"nodename" grew very large.

I am unsure whether the full directory caused the job not to exit cleanly, or whether the unclean exit produced the very large messages files.
A few of the nodes had the following warning repeated many times, which made the messages files grow to over 20G and filled the disks:
main|"nodename"|W|get exit ack for pe task 1."nodename" but task is not in state exiting
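
To confirm which message dominates such a file, something like the following could be used. This is only a minimal sketch: the function name top_messages is hypothetical, and it assumes each log line is a set of "|"-separated fields with the message text last, as in the lines quoted above. It reads the file line by line, so it should cope with a multi-gigabyte file as long as the number of distinct messages is small.

```python
from collections import Counter


def top_messages(lines, n=5):
    """Count how often each message text occurs and return the n most common.

    Assumption: fields are separated by '|' and the message text is the
    last field (e.g. 'main|nuvo|W|some text').
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Keep only the text after the last '|'; lines without '|' count whole.
        text = line.rsplit("|", 1)[-1]
        counts[text] += 1
    return counts.most_common(n)
```

It can be fed an open file directly, for example with open(path, errors="replace") as fh: print(top_messages(fh)), where path points at one node's messages file.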

Any ideas as to what the actual problem was and how to fix this so that it does not happen again?
For now I have removed the very large messages files and restarted the SGE daemons, and jobs are launching and exiting OK.
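
Until the cause is known, a periodic check for oversized messages files might at least catch the problem before a disk fills. The following is a rough sketch under stated assumptions: find_large_messages is a hypothetical helper, the spool path is the one mentioned above, and the 1 GiB threshold is an arbitrary choice. It only reports files and does not delete or rotate anything.

```python
import os


def find_large_messages(spool_root, limit_bytes):
    """Return (path, size) pairs for files named 'messages' above limit_bytes,
    largest first."""
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(spool_root):
        if "messages" in filenames:
            path = os.path.join(dirpath, "messages")
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size > limit_bytes:
                oversized.append((path, size))
    return sorted(oversized, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Spool path as given in the post; the 1 GiB limit is an assumption.
    spool = "/opt/sge6-2/ge6.2u3/default/spool"
    for path, size in find_large_messages(spool, 1024 ** 3):
        print(f"{size / 1024 ** 3:.1f} GiB  {path}")
```

Run from cron on each node (or against a shared spool area), this would list any messages file over the chosen limit.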

Thanks for your time in advance.

