[GE users] Desperate need for CPU clock cycles!

Roland Dittel Roland.Dittel at Sun.COM
Tue Jan 22 09:53:23 GMT 2008



Neil,

Neil Baker wrote:
> Hi Richard,
> 
> We're actually using version 5.7 as our old Linux qmaster was version 5.6
> and we needed to get a new system up and running very quickly.  

There is no Grid Engine version "5.7", so I assume you mean 
version 5.3u7. This version is extremely old and I highly recommend 
upgrading to the current version.

> I
> unfortunately missed Andreas' recent posting about the memory leak.
> Hopefully it affects just version 5.7 on Solaris, as that would explain why
> the problem appears on Solaris and not Linux.

The memory leak mentioned in Andreas' recent posting only affects 6.0 and 
6.1.

> We didn't have "top" installed on the Solaris machine when this problem hit
> us and the machine ground to a halt, so I can't actually say if it's memory
> related at this point.  I have to say it does seem like a memory leak.

Instead of "top" you can use the Solaris command prstat. The output is 
similar.
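
For example, a one-off snapshot sorted by resident memory can be taken with 
"prstat -s rss -n 10 1 1". If you want to collect such snapshots from a 
script, here is a minimal sketch, assuming Python 3.7 or later is available 
on the Solaris host. The helper names prstat_command and snapshot are 
hypothetical, not part of Grid Engine or Solaris:

```python
import shutil
import subprocess


def prstat_command(sort_key="rss", top_n=10, interval=1, count=1):
    """Build a prstat command line that prints `count` reports and exits.

    sort_key is passed to prstat -s (for example "cpu", "rss", "size", "time").
    """
    return ["prstat", "-s", sort_key, "-n", str(top_n), str(interval), str(count)]


def snapshot(sort_key="rss", top_n=10):
    """Return prstat output as text, or None if prstat is not installed here."""
    if shutil.which("prstat") is None:
        # Not a Solaris host, or prstat is not on PATH.
        return None
    result = subprocess.run(
        prstat_command(sort_key, top_n),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling snapshot("rss") at regular intervals and saving the output should 
show whether a process such as sge_qmaster keeps growing, which would point 
to a memory problem rather than a CPU one.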

Normally, if the machine halts, there was a critical issue 
related to memory or CPU faults and a kernel core dump is created. You 
can try to analyze the core dump to figure out what happened.

> Also, Richard needed to get it back up and running immediately due to the
> unfortunate staff deadlines.  As a result he had to migrate the grid to a
> different Linux box running OpenSuse.  This is of course going to make
> recreating the problem quite tricky until their deadlines have been met.
> 
> Does anyone still have Andreas' posting about the memory leak?

All postings on this alias are archived and can be searched at:
http://gridengine.sunsource.net/maillist.html

Regards
Roland

> 
> Regards
> 
> Neil
> 
> -----Original Message-----
> From: Richard Ems [mailto:Richard.Ems at cape-horn-eng.com] 
> Sent: 21 January 2008 17:53
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Desperate need for CPU clock cycles!
> 
> Are you sure that it's a CPU bottleneck and not a memory one?
> And, which version of SGE are you talking about?
> 
> We had a problem with the load getting also very high (on Linux, 
> openSUSE 10.3) but this was because SGE was eating all our memory and 
> the system started swapping.
> There was a memory leak which has been found by Andreas just some days 
> ago, but this probably only triggers on a configuration with several 
> queues and PEs and submitting jobs using wildcards ... Andreas?
> 
> just my $0.01 ...
> 
> Richard
> 
> 
> Neil Baker wrote:
>> I'm Neil, Richard's (the original poster's) colleague. 
>>
>>  
>>
>> Has anyone else had a similar experience when using Solaris 10 for the 
>> qmaster? 
>>
>>  
>>
>> We actually migrated the qmaster from a RedHat Linux box to Solaris 10 
>> box to try and gain extra stability as the RedHat box kept crashing (due 
>> to hardware not Grid Engine).  I assumed that as Grid Engine was 
>> initially written by SUN, that it should be more compatible and more 
>> stable on SUN kit running Solaris.  I've also heard from people who say 
>> that other scheduling software runs quite happily on similar specified 
>> hardware.
>>
>>  
>>
>> Our execution hosts currently run OpenSuse 10 (these haven't changed) 
>> and we have approx 28 machines each running up to 4 jobs at a time (so a 
>> max of 112 jobs running at a time).  We do use the grid a lot and there 
>> is the possibility that the queued jobs can be as high as 500 to 1000 
>> during peak usage.  We are also likely to double the number of execution 
>> hosts in the near future.
>>
>>  
>>
>> The Sun Grid Engine binaries are also being shared via NFS from the same 
>> slow Solaris Grid Engine machine.  The Solaris box is configured using 
>> soft RAID mirroring.  Could it be that the disk performance is causing a 
>> bottleneck as the mirroring uses the CPU?  Is there an easy way for us 
>> to tell if the disk is the bottleneck?  We do have a separate super 
>> fast NetApp NAS device and I'm wondering how much of a benefit it would 
>> be if we moved the shared binaries / SGE directory over to that NAS
>> device?
>>  
>>
>> In the past this system used to be a 1.8GHz box, again with 512MB of 
>> RAM.  Although this is approx 5 times faster than the 350MHz Sun Netra 
>> T1 105 we are experiencing these problems on, I didn't expect the 
>> qmaster to be so demanding on CPU resource.
>>
>>  
>>
>> Any suggestions would be gratefully received.
> 
> 


-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Roland Dittel               Tel: +49 (0)941 3075-275 (x60275)
Software Engineering        Fax: +49 (0)941 3075-222 (x60222)
Sun Microsystems GmbH
Dr.-Leo-Ritter-Str. 7       mailto:roland.dittel at sun.com
D-93049 Regensburg          http://www.sun.com/gridware
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Registered Office / Sitz der Gesellschaft:
   Sun Microsystems GmbH
   Sonnenallee 1
   D-85551 Kirchheim-Heimstetten
   Germany
Commercial register of the Local Court of Munich /
Handelsregistereintrag Amtsgericht Muenchen:
   HRB 161028
Managing Directors / Geschaeftsfuehrer:
   Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Chairman of the Supervisory Board / Vorsitzender des Aufsichtsrates
   Martin Haering

