[GE users] Desperate need for CPU clock cycles!

Neil Baker neil at futurity.co.uk
Mon Jan 21 17:22:13 GMT 2008


I'm Neil, Richard's (the original poster's) colleague.

Has anyone else had a similar experience when using Solaris 10 for the
qmaster?

We actually migrated the qmaster from a Red Hat Linux box to a Solaris 10
box to try to gain extra stability, as the Red Hat box kept crashing (due
to hardware, not Grid Engine). I assumed that, as Grid Engine was
originally written by Sun, it would be more compatible and more stable on
Sun kit running Solaris. I've also heard from people who say that other
scheduling software runs quite happily on similarly specified hardware.

Our execution hosts currently run openSUSE 10 (these haven't changed) and
we have approximately 28 machines, each running up to 4 jobs at a time (so
a maximum of 112 jobs running at once). We do use the grid a lot, and the
number of queued jobs can be as high as 500 to 1000 during peak usage. We
are also likely to double the number of execution hosts in the near
future.

The Sun Grid Engine binaries are also being shared via NFS from the same
slow Solaris Grid Engine machine. The Solaris box is configured with
software RAID mirroring; could the disk performance be causing a
bottleneck, since the mirroring uses the CPU? Is there an easy way for us
to tell whether the disk is the bottleneck? We do have a separate,
super-fast NetApp NAS device, and I'm wondering how much of a benefit it
would be if we moved the shared binaries / SGE directory over to that NAS
device.
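
One quick check, assuming the stock Solaris 10 tools: iostat shows whether
the mirrored disks are saturated, and mpstat shows whether the CPU time is
going to the kernel (e.g. the mirroring) rather than to the qmaster
itself:

    # disk view: watch %b (percent busy) and wsvc_t/asvc_t (wait/active
    # service time in ms); a disk pegged near 100 %b is the bottleneck
    iostat -xn 5

    # CPU view: high sys time with little usr time points at the kernel
    # (e.g. the software mirroring) rather than at sge_qmaster
    mpstat 5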

 

In the past, the qmaster ran on a 1.8GHz box, also with 512MB of RAM.
Although that is roughly five times faster than the 350MHz Sun Netra T1
105 we are experiencing these problems on, I didn't expect the qmaster to
be so demanding on CPU resources.

 

Any suggestions would be gratefully received.

 

Regards

 

Neil

 

From: tmac [mailto:tmacmd at gmail.com] 
Sent: 21 January 2008 15:50
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Desperate need for CPU clock cycles!

 

I noticed some similarities on Solaris 9 and Solaris 10 with regard to
what you are seeing.

How about: Solaris just stinks as a master server?
I am sure someone could probably tweak lots of system variables
to make it work nicely.

Linux does not need the same hand-holding that Solaris does.

I moved from a Sun 6500 (20 CPUs and 20GB of RAM) to an
Intel-based system (2 hyperthreaded CPUs, 533MHz FSB, and 4GB of RAM)
and have not had any problems since...

--tmac

On Jan 21, 2008 8:36 AM, <Andreas.Haas at sun.com> wrote:

Hi Richard,

increasing the load report interval may help, but without knowing the
reason it's just guesswork.
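
For reference, that interval is the load_report_time parameter in the
global cluster configuration; a minimal sketch of changing it, assuming a
standard SGE installation:

    # edit the global cluster configuration (opens an editor)
    qconf -mconf global

    # inside, raise load_report_time from its 00:00:40 default, e.g.:
    load_report_time                  00:02:00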

You should try running the DTrace master monitor

    http://wiki.gridengine.info/wiki/index.php/Dtrace

to gain some understanding about the root cause of this situation.
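
If running the full monitor is not convenient, a rough one-liner, assuming
stock Solaris 10 DTrace and the default sge_qmaster process name, can at
least show which system calls the qmaster is spending its time in:

    # count system calls made by sge_qmaster; run during a busy period,
    # then stop with Ctrl-C to print the per-syscall counts
    dtrace -n 'syscall:::entry /execname == "sge_qmaster"/ { @[probefunc] = count(); }'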

I assume memory shortage can be ruled out, since that would be easy to
diagnose?
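
For what it's worth, a quick check on Solaris, assuming stock tools: a
sustained non-zero scan rate means the page scanner is running, i.e.
genuine memory pressure:

    # watch the free column and the sr (scan rate) column; sr staying
    # above zero for long stretches indicates a memory shortage
    vmstat 5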

Regards,
Andreas



On Mon, 21 Jan 2008, Richard Hobbs wrote:

> Hello,
>
> We have just moved our qmaster to a 330MHz/512MB RAM Sun Netra T1 105
> running Solaris 10 11/06.
>
> We also have around 38 Linux-based dual-CPU exec hosts, each with 4
> running queues.
>
> Last week, we noticed that periodically (while people were submitting
> jobs [in the order of 60-200 in the queue at any one time]) the load 
> average on the qmaster machine was through the roof (we noticed it at
> 164 at one point) and as a result, the qmaster grinds to a halt and
> becomes completely unresponsive for 20-40 minutes, during which time job 
> submissions basically fail! It does recover afterwards, however.
>
> Obviously this is unacceptable, so we need a solution! :-)
>
> I realise that a 330MHz SPARC with 512MB RAM isn't the best spec, but 
> this is only a job scheduler after all. Surely that should be plenty to
> run a qmaster on a grid of this size, right?
>
> Anyway, regardless of how this spec fits (or doesn't fit) the
> requirements of the qmaster, is there any way we can claw back some
> clock cycles to use during this process? We want our qmaster to be as
> efficient as possible, and ideally to continue running on this box!
>
> Are there any options we can turn on to make it quicker? Perhaps reduce
> the polling rate of the exec hosts (if such polling occurs)?
>
> Any advice is appreciated...
>
> Thanks in advance,
> Richard.
>
> --
> Richard Hobbs (Systems Administrator) 
> Toshiba Research Europe Ltd. - Cambridge Research Laboratory
> Email: richard.hobbs at crl.toshiba.co.uk
> Web: http://www.toshiba-europe.com/research/
> Tel: +44 1223 436999        Mobile: +44 7811 803377
>
>

http://gridengine.info/







-- 
--tmac

RedHat Certified Engineer #804006984323821 (RHEL4)
RedHat Certified Engineer #805007643429572 (RHEL5) 

Principal Consultant, RABA Technologies




