[GE users] very large messages file

reuti reuti at staff.uni-marburg.de
Fri Jan 29 15:25:52 GMT 2010


On 29.01.2010 at 10:36, cgull wrote:

> Hi,
> Last night, on a cluster running SGE, a job appeared to run OK, but
> at the end of the run the following error was written to the log
> for all of the nodes:
> error: commlib error: got read error (closing "nodename/sheperd_ijs/1")
> I can then see the following error messages on one of the nodes:
> main|nuvo|E|slave sheperd of job 1094.1 exited with exit status = 11
> main|nuvo|E|abnormal termination of shepherd for job 1094.1 task 10.nuvo: "exit_status" file is empty
> The next job that attempted to run on these machines was then
> unable to start because the directory was full.
> The hard disk appeared to fill up on a few nodes because of the
> messages file in the directory /opt/sge6-2/ge6.2u3/default/spool/"nodename".
> I am unsure whether the full directory caused the job not to exit
> cleanly, or whether the unclean exit produced the very large
> messages file.
> A few of the nodes had the warning
> main|"nodename"|W|get exit ack for pe task 1."nodename" but task is not in state exiting
> repeated many times, making the files over 20G, which filled
> the disks.
> Any ideas as to what the actual problem was and how to fix it so
> that it does not happen again?
> For now I have removed the very large messages files and
> restarted the SGE daemons, and jobs are launching and exiting OK.

There is a prepared script to rotate SGE's logfiles:
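
As a rough illustration of what such a rotation involves (this is not the script referred to above, only a sketch of the general idea), a size-based rotation of a messages file could look like the following Python. The function name rotate_if_large, the 100 MB threshold, the number of archives kept and the copy-then-truncate approach are all assumptions made for this example; none of it is part of SGE.

```python
import gzip
import os
import shutil
import time


def rotate_if_large(path, max_bytes=100 * 1024 * 1024, keep=5):
    """Compress and truncate `path` once it grows beyond `max_bytes`.

    Returns the path of the new archive, or None if nothing was rotated.
    Hypothetical helper for illustration; not part of SGE.
    """
    if not os.path.isfile(path) or os.path.getsize(path) <= max_bytes:
        return None

    # Copy the current contents into a timestamped, gzip-compressed archive.
    archive = "%s.%s.gz" % (path, time.strftime("%Y%m%d-%H%M%S"))
    with open(path, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Truncate in place instead of renaming, so the file the daemon
    # writes to keeps the same name and inode (assumed to be safest).
    with open(path, "r+b") as handle:
        handle.truncate(0)

    # Keep only the newest `keep` archives; the timestamp in the name
    # makes lexicographic order match chronological order.
    directory = os.path.dirname(path) or "."
    prefix = os.path.basename(path) + "."
    archives = sorted(
        name for name in os.listdir(directory)
        if name.startswith(prefix) and name.endswith(".gz")
    )
    for old in archives[:max(len(archives) - keep, 0)]:
        os.remove(os.path.join(directory, old))

    return archive
```

Something like this could be run from cron on each execution host against the messages file in that host's spool directory. Two caveats: the archive needs free space while it is written, so on a disk that is already full it would have to go to another file system, and any lines logged between the copy and the truncate are lost.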


-- Reuti

> Thanks for your time in advance.
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241692
> To unsubscribe from this discussion, e-mail:
> [users-unsubscribe at gridengine.sunsource.net].

