[GE users] Desperate need for CPU clock cycles!

Andreas.Haas at Sun.COM Andreas.Haas at Sun.COM
Tue Jan 22 09:10:25 GMT 2008


Hi Neil,

Distributing binaries via NFS is in any case harmful for overall performance, 
but I cannot estimate whether that is the only problem, or even whether it is 
the pivotal one. That is especially true when my questions are answered with 
silence.

Have you seen the questions from Richard Ems? Have you seen my last reply?

Note that my plea to run the master monitor is aimed precisely at gaining an 
understanding of the questions you raise below. Since you run the master on 
Solaris 10, running the monitor should cost you no more than three commands:

> su
Password: 
# cd $SGE_ROOT/dtrace
# ./monitor.sh 
dtrace: script './monitor.d' matched 72 probes
CPU     ID                    FUNCTION:NAME
   1      1                           :BEGIN                 Time |   #wrt  wrt/ms |#rep #gdi #ack|   #dsp  dsp/ms    #sad|   #snd    #rcv|  #in++ #in--  #out++  #out--|  #lck0  #ulck0   #lck1  #ulck1
   1   1732                      :tick-15sec 2008 Jan 22 10:00:20 |       0       0|   0   16    8|      8       0       0|      8       8|     24 24      32      32|     57      57     200     201
   1   1732                      :tick-15sec 2008 Jan 22 10:00:35 |       0       0|   0   14    7|      7       0       0|      7       7|     21 21      28      28|     51      51     193     193
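
If you want to compare several ticks, the lines above can be split into named counters. What follows is only a minimal Python sketch, not something shipped with Grid Engine: the helper name parse_monitor_line is hypothetical, the column names are copied verbatim from the header line above, and it assumes every counter prints as an integer, as in this sample.

```python
# Hypothetical helper, not part of Grid Engine. Column names are copied
# verbatim from the header line that monitor.sh printed above.
HEADER = ("Time |   #wrt  wrt/ms |#rep #gdi #ack|   #dsp  dsp/ms    #sad|"
          "   #snd    #rcv|  #in++ #in--  #out++  #out--|"
          "  #lck0  #ulck0   #lck1  #ulck1")


def parse_monitor_line(line, header=HEADER):
    """Map one tick line of monitor output to {column name: value}."""
    # Column groups are separated by '|'; the first group is the timestamp.
    name_groups = [group.split() for group in header.split("|")[1:]]
    fields = line.split("|")
    # Keep the last four tokens (year, month, day, time) and drop the
    # dtrace CPU / ID / probe-name prefix in front of them.
    row = {"Time": " ".join(fields[0].split()[-4:])}
    for names, values in zip(name_groups, fields[1:]):
        for name, value in zip(names, values.split()):
            # Assumption: counters are integers, as in the sample output.
            row[name] = int(value)
    return row


sample = ("   1   1732                      :tick-15sec 2008 Jan 22 10:00:20 |"
          "       0       0|   0   16    8|      8       0       0|"
          "      8       8|     24 24      32      32|"
          "     57      57     200     201")
row = parse_monitor_line(sample)
print(row["Time"], row["#gdi"], row["#lck1"], row["#ulck1"])
```

Feeding each tick line through the helper gives one dictionary per 15-second interval, which makes it easier to see which counters move when the master becomes unresponsive.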

Best regards,
Andreas

On Mon, 21 Jan 2008, Neil Baker wrote:

> I'm Neil, Richard's (the original poster's) colleague.
>
> Has anyone else had a similar experience when using Solaris 10 for the
> qmaster?
>
> We actually migrated the qmaster from a RedHat Linux box to a Solaris 10 box
> to try to gain extra stability, as the RedHat box kept crashing (due to
> hardware, not Grid Engine).  I assumed that, as Grid Engine was initially
> written by Sun, it should be more compatible and more stable on Sun kit
> running Solaris.  I've also heard from people who say that other scheduling
> software runs quite happily on similarly specified hardware.
>
> Our execution hosts currently run OpenSuse 10 (these haven't changed) and we
> have approx 28 machines, each running up to 4 jobs at a time (so a max of 112
> jobs running at a time).  We do use the grid a lot, and the number of queued
> jobs can be as high as 500 to 1000 during peak usage.  We are also likely to
> double the number of execution hosts in the near future.
>
> The Sun Grid Engine binaries are also being shared via NFS from the same
> slow Solaris Grid Engine machine.  The Solaris box is configured with
> software RAID mirroring; could disk performance be causing a bottleneck,
> since the mirroring uses the CPU?  Is there an easy way for us to tell if
> the disk is the bottleneck?  We do have a separate super fast NetApp NAS
> device, and I'm wondering how much of a benefit it would be if we moved the
> shared binaries / SGE directory over to that NAS device.
>
> In the past the qmaster ran on a 1.8GHz box, again with 512MB of RAM.
> Although that is approx 5 times faster than the 350MHz Sun Netra T1 105 we
> are experiencing these problems on, I didn't expect the qmaster to be so
> demanding on CPU resources.
>
> Any suggestions would be gratefully received.
>
> Regards
>
> Neil
>
> From: tmac [mailto:tmacmd at gmail.com]
> Sent: 21 January 2008 15:50
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Desperate need for CPU clock cycles!
>
> I noticed some similarities on Solaris 9 and Solaris 10 with regard to what
> you are seeing.
>
> How about: Solaris just stinks as a master server?
> I am sure someone could probably tweak lots of system variables
> to make it work nicely.
>
> Linux does not need the same hand-holding that Solaris does.
>
> I moved from a Sun 6500 with 20 CPUs and 20 Gig of RAM to an
> Intel-based system (2 hyperthreaded CPUs, 533FSB, and 4 Gig of RAM)
> and have not had any problems since...
>
> --tmac
>
> On Jan 21, 2008 8:36 AM, <Andreas.Haas at sun.com> wrote:
>
> Hi Richard,
>
> increasing the load report interval may help, but without knowing the
> reason it's just guesswork.
>
> You should try running the Dtrace master monitor
>
>    http://wiki.gridengine.info/wiki/index.php/Dtrace
>
> to gain some understanding about the root cause of this situation.
>
> I assume memory shortage can be ruled out, since that would be easy to
> diagnose?
>
> Regards,
> Andreas
>
>
>
> On Mon, 21 Jan 2008, Richard Hobbs wrote:
>
>> Hello,
>>
>> We have just moved our qmaster to a 330MHz/512MB RAM Sun Netra t1 105
>> running Solaris 10 11/06.
>>
>> We also have around 38 Linux-based dual-CPU exec hosts, each with 4
>> running queues.
>>
>> Last week, we noticed that periodically (while people were submitting
>> jobs [in the order of 60-200 in the queue at any one time]) the load
>> average on the qmaster machine was through the roof (we noticed it at
>> 164 at one point) and as a result, the qmaster grinds to a halt and
>> becomes completely unresponsive for 20-40 minutes, during which time job
>> submissions basically fail! It does recover afterwards, however.
>>
>> Obviously this is unacceptable, so we need a solution! :-)
>>
>> I realise that a 330MHz SPARC with 512MB RAM isn't the best spec, but
>> this is only a job scheduler after all. Surely that should be plenty to
>> run a qmaster on a grid of this size, right?
>>
>> Anyway, regardless of how this spec fits (or doesn't fit) the
>> requirements of the qmaster, is there any way we can claw back some
>> clock cycles to use during this process? We want our qmaster to be as
>> efficient as possible, and ideally to continue running on this box!
>>
>> Are there any options we can turn on to make it quicker? Perhaps reduce
>> the polling rate to the exec hosts (if such an event occurs)?
>>
>> Any advice is appreciated...
>>
>> Thanks in advance,
>> Richard.
>>
>> --
>> Richard Hobbs (Systems Administrator)
>> Toshiba Research Europe Ltd. - Cambridge Research Laboratory
>> Email: richard.hobbs at crl.toshiba.co.uk
>> Web: http://www.toshiba-europe.com/research/
>> Tel: +44 1223 436999        Mobile: +44 7811 803377
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
>
> http://gridengine.info/
>
> Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551
> Kirchheim-Heimstetten
> Amtsgericht Muenchen: HRB 161028
> Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
> Vorsitzender des Aufsichtsrates: Martin Haering
>
> -- 
> --tmac
>
> RedHat Certified Engineer #804006984323821 (RHEL4)
> RedHat Certified Engineer #805007643429572 (RHEL5)
>
> Principal Consultant, RABA Technologies
>
> ______________________________________________________________________
> This email has been scanned by the MessageLabs Email Security System.
> For more information please visit http://www.messagelabs.com/email
> ______________________________________________________________________
>
>
