[GE users] very large messages file

templedf dan.templeton at sun.com
Fri Jan 29 14:52:37 GMT 2010


The large messages files caused the jobs not to exit cleanly, not the 
other way around.  Deleting the files and restarting the execds is 
exactly the right response.  As for why the messages file filled up with 
the PE task ack messages, that I don't know.  If you can reproduce the 
problem, we should have a look into it.
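
To catch a runaway messages file before it fills the disk again, a small check along these lines could be run periodically on the spool area. This is only a sketch: the function name `find_large_messages` and the size threshold are illustrative assumptions, not part of Grid Engine, and the layout it expects (one directory per node under the spool root, each holding a `messages` file) is taken from the path quoted in the report below.

```python
import os

def find_large_messages(spool_root, threshold_bytes):
    """Return (path, size) for every <spool_root>/<node>/messages file
    whose size is at or above threshold_bytes.

    Hypothetical helper; it assumes the per-node spool layout described
    in the original report and does not call any Grid Engine tool.
    """
    oversized = []
    for node in sorted(os.listdir(spool_root)):
        path = os.path.join(spool_root, node, "messages")
        # Skip entries that are not node spool directories.
        if os.path.isfile(path):
            size = os.path.getsize(path)
            if size >= threshold_bytes:
                oversized.append((path, size))
    return oversized
```

Calling `find_large_messages("/opt/sge6-2/ge6.2u3/default/spool", 1 << 30)` would list any messages file of 1 GiB or more under the spool path from the report; both the path and the threshold would need adjusting for another installation.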

Daniel

On 01/29/10 01:36, cgull wrote:
> Hi,
> Last night on a cluster running SGE, a job appeared to run OK, but at the end of the run the following error was written to the log for all of the nodes: "error: commlib error: got read error (closing "nodename/sheperd_i js/1")".
>
> I can then see the following error messages on one of the nodes:
> "main|nuvo|E|slave shepherd of job 1094.1 exited with exit status = 11"
> "main|nuvo|E|abnormal termination of shepherd for job 1094.1 task 10.nuvo: "exit_status" file is empty"
>
> The next job that attempted to run on these machines was then unable to start because the directory was full.
>
> The hard disk appeared to have filled up on a few nodes because of the messages file in the directory /opt/sge6-2/ge6.2u3/default/spool/"nodename".
>
> I am unsure whether the full directory caused the jobs not to exit cleanly, or whether the unclean exit produced the very large messages files.
>
> A few of the nodes had the following warning repeated many times: main|"nodename"|W|get exit ack for pe task 1."nodename" but task is not in state exiting
>
> This made the files over 20G, which filled the disks.
>
> Any ideas as to what the actual problem was, and how to fix it so that it does not happen again?
> For now I have removed the very large messages files and restarted the SGE daemons, and jobs are launching and exiting OK.
>
> Thanks for your time in advance.
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241692
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241740