[GE users] scalability problems

Sean Dilda agrajag at dragaera.net
Fri Apr 2 15:30:55 BST 2004


On Fri, 2004-04-02 at 00:09, Ron Chen wrote:
> Hi,
> 
> I read your issue 942, and I think this is one of the
> great findings, and thx :)
> 
> Also, did you use strace to find out what's going on
> when you run the other test case?

Do you mean the test case in which I was able to make SGE drop nodes? 
Yes, I did run strace, but didn't see anything drastically different in
the output.

I haven't read enough of the code to understand the inner workings of how
SGE times out nodes.  From what I've seen, though, my guess is that when
SGE drops nodes, qmaster is blocked on NFS writes and can't get through
everything it has to process, so nodes end up timing out even though
they're still communicating.
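
A toy model of that guess, for illustration only. This is not SGE code and
makes no claim about how qmaster actually tracks nodes: the single-threaded
event loop, the heartbeat interval, the timeout, and the write costs are all
made-up assumptions. It only shows how a master that blocks on slow writes
could declare nodes dead even though every heartbeat was sent on time.

```python
def simulate(write_cost, n_nodes=4, n_writes=20,
             interval=10.0, timeout=30.0, horizon=120.0):
    """Return the set of nodes a hypothetical single-threaded master drops.

    All parameters are invented for the sketch; none come from SGE.
    """
    events = []  # (arrival_time, kind, node)

    # Every node sends a heartbeat on schedule, every `interval` seconds.
    t = 0.0
    while t <= horizon:
        for node in range(n_nodes):
            events.append((t, "heartbeat", node))
        t += interval

    # A burst of spool writes arrives at t=5 (e.g. a mass job delete).
    for _ in range(n_writes):
        events.append((5.0, "write", None))

    # Process strictly in arrival order; the sort is stable for ties.
    events.sort(key=lambda e: e[0])

    clock = 0.0
    last_seen = {node: 0.0 for node in range(n_nodes)}
    dropped = set()
    for arrival, kind, node in events:
        # The master cannot handle an event before it has arrived.
        clock = max(clock, arrival)
        if kind == "write":
            # The master is blocked for the duration of the write.
            clock += write_cost
        else:
            # The heartbeat is only *noticed* now, however early it was sent.
            if clock - last_seen[node] > timeout:
                dropped.add(node)
            last_seen[node] = clock
    return dropped


if __name__ == "__main__":
    print("fast writes:", sorted(simulate(write_cost=0.1)))
    print("slow writes:", sorted(simulate(write_cost=5.0)))
```

With these made-up numbers, fast writes (0.1 s each) drop nothing, while slow
writes (5 s each) keep the master busy long enough that all four nodes exceed
the timeout before their queued heartbeats are looked at.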

After fixing my spool files, I was again able to make SGE drop nodes,
but it was a lot harder.  I have a program that does nothing but write a
bunch of data to an NFS file as fast as it can.  With 54 copies running,
pushing over 20 MB/s through my NFS server's NIC, everything continued
to run fine.  I then upped it to 78 copies, which pushed well over
30 MB/s and a load average above 50 on the NFS server.  Everything still
seemed fine, but when I ran 'qdel -u <userid>' to remove all of those
jobs, SGE started dropping nodes.  I suspect the extra I/O from the
massive delete was enough to tie up the qmaster process and cause the
communications to "time out".
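
The write-load program itself was not posted. A minimal sketch of that kind
of generator might look like the following; the function name `hammer`, the
chunk size, and the command-line arguments are all hypothetical, and the
original program may have worked quite differently.

```python
import os
import sys
import time


def hammer(path, total_bytes, chunk_size=1 << 20):
    """Write total_bytes of zeros to path as fast as possible.

    Returns (bytes_written, elapsed_seconds). Hypothetical helper; not the
    program used in the test described above.
    """
    chunk = b"\0" * chunk_size
    written = 0
    start = time.monotonic()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            f.write(chunk[:n])
            written += n
        # Push the data out of local buffers so the server really sees it.
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.monotonic() - start
    return written, elapsed


# Usage (assumed): python hammer.py /path/on/nfs/file 1073741824
if __name__ == "__main__" and len(sys.argv) == 3:
    nbytes, secs = hammer(sys.argv[1], int(sys.argv[2]))
    rate = nbytes / secs / 1e6 if secs > 0 else float("inf")
    print(f"wrote {nbytes} bytes in {secs:.2f} s ({rate:.1f} MB/s)")
```

Submitting many copies of something like this as jobs, each pointed at its
own file on the NFS mount, would approximate the load described above.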

