[GE users] Desperate need for CPU clock cycles!

Neil Baker neil at futurity.co.uk
Mon Jan 21 20:07:58 GMT 2008



Hi Richard,

We're actually using version 5.7, as our old Linux qmaster was version 5.6
and we needed to get a new system up and running very quickly.  I
unfortunately missed Andreas' recent posting about the memory leak.
Hopefully it affects just version 5.7 on Solaris, as that would explain why
we see the problem on Solaris and not on Linux.

We didn't have "top" installed on the Solaris machine when this problem hit
us and the machine ground to a halt, so I can't actually say whether it's
memory related at this point.  I have to say it does seem like a memory leak.
Also, Richard needed to get the grid back up and running immediately because
of staff deadlines, so he had to migrate it to a different Linux box running
openSUSE.  This will of course make recreating the problem quite tricky
until their deadlines have been met.

Does anyone still have Andreas' posting about the memory leak?

Regards

Neil

-----Original Message-----
From: Richard Ems [mailto:Richard.Ems at cape-horn-eng.com] 
Sent: 21 January 2008 17:53
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Desperate need for CPU clock cycles!

Are you sure that it's a CPU bottleneck and not a memory one?
And, which version of SGE are you talking about?

We also had a problem with the load getting very high (on Linux,
openSUSE 10.3), but this was because SGE was eating all our memory and
the system started swapping.
There was a memory leak, which Andreas found just a few days ago, but it
is probably only triggered on a configuration with several queues and PEs
and with jobs submitted using wildcards ... Andreas?

just my $0.01 ...

Richard


Neil Baker wrote:
> I'm Neil, Richard's (the original poster's) colleague.
> 
>  
> 
> Has anyone else had a similar experience when using Solaris 10 for the 
> qmaster? 
> 
>  
> 
> We actually migrated the qmaster from a RedHat Linux box to a Solaris 10
> box to try to gain extra stability, as the RedHat box kept crashing (due
> to hardware, not Grid Engine).  I assumed that, as Grid Engine was
> initially written by Sun, it would be more compatible and more
> stable on Sun kit running Solaris.  I've also heard from people who say
> that other scheduling software runs quite happily on similarly specified
> hardware.
> 
>  
> 
> Our execution hosts currently run openSUSE 10 (these haven't changed)
> and we have approximately 28 machines, each running up to 4 jobs at a
> time (so a max of 112 jobs running at a time).  We use the grid a lot and
> the number of queued jobs can be as high as 500 to 1000
> during peak usage.  We are also likely to double the number of execution
> hosts in the near future.
> 
>  
> 
> The Sungrid binaries are also being shared via NFS from the same slow
> Solaris Grid Engine machine.  The Solaris box is configured using software
> RAID mirroring.  Could the disk performance be causing a
> bottleneck, as the mirroring uses the CPU?  Is there an easy way for us
> to tell if the disk is the bottleneck?  We do have a separate super
> fast NetApp NAS device and I'm wondering how much of a benefit it would
> be if we moved the shared binaries / SGE directory over to that NAS
> device?
> 
>  
> 
> In the past the qmaster ran on a 1.8GHz box, also with 512MB of
> RAM.  Although that is approx 5 times faster than the 350MHz Sun Netra
> T1 105 we are experiencing these problems on, I didn't expect the
> qmaster to be so demanding on CPU resource.
> 
>  
> 
> Any suggestions would be gratefully received.


-- 
Richard Ems       mail: Richard.Ems at Cape-Horn-Eng.com

Cape Horn Engineering S.L.
C/ Dr. J.J. Dómine 1, 5º piso
46011 Valencia
Tel : +34 96 3242923 / Fax 924

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net


______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________




