[GE users] Desperate need for CPU clock cycles!

tmac tmacmd at gmail.com
Mon Jan 21 15:50:06 GMT 2008


I have seen behavior similar to what you describe on both Solaris 9 and
Solaris 10.

How about: Solaris just stinks as a master server?
I am sure someone could tweak lots of system variables
to make it work nicely.

Linux does not need the same hand-holding that Solaris does.

I moved from a Sun 6500 with 20 CPUs and 20 GB of RAM to an
Intel-based system (2 hyperthreaded CPUs, 533 MHz FSB, and 4 GB of RAM)
and have not had any problems since.

--tmac

On Jan 21, 2008 8:36 AM, <Andreas.Haas at sun.com> wrote:

> Hi Richard,
>
> Increasing the load report interval may help, but without knowing the
> root cause it's just guesswork.
>
> You should try running the DTrace master monitor
>
>    http://wiki.gridengine.info/wiki/index.php/Dtrace
>
> to gain some understanding about the root cause of this situation.
>
> I assume a memory shortage can be ruled out, since that would be easy to
> diagnose?
>
> Regards,
> Andreas
>
>
> On Mon, 21 Jan 2008, Richard Hobbs wrote:
>
> > Hello,
> >
> > We have just moved our qmaster to a 330MHz/512MB RAM Sun Netra t1 105
> > running Solaris 10 11/06.
> >
> > We also have around 38 Linux-based dual-CPU exec hosts, each with 4
> > running queues.
> >
> > Last week, we noticed that periodically (while people were submitting
> > jobs [in the order of 60-200 in the queue at any one time]) the load
> > average on the qmaster machine went through the roof (we saw it at
> > 164 at one point). As a result, the qmaster ground to a halt and
> > became completely unresponsive for 20-40 minutes, during which time job
> > submissions basically failed! It did recover afterwards, however.
> >
> > Obviously this is unacceptable, so we need a solution! :-)
> >
> > I realise that a 330MHz SPARC with 512MB RAM isn't the best spec, but
> > this is only a job scheduler after all. Surely that should be plenty to
> > run a qmaster on a grid of this size, right?
> >
> > Anyway, regardless of how this spec fits (or doesn't fit) the
> > requirements of the qmaster, is there any way we can claw back some
> > clock cycles to use during this process? We want our qmaster to be as
> > efficient as possible, and ideally to continue running on this box!
> >
> > Are there any options we can turn on to make it quicker? Perhaps reduce
> > the polling rate to the exec hosts (if such polling occurs)?
> >
> > Any advice is appreciated...
> >
> > Thanks in advance,
> > Richard.
> >
> > --
> > Richard Hobbs (Systems Administrator)
> > Toshiba Research Europe Ltd. - Cambridge Research Laboratory
> > Email: richard.hobbs at crl.toshiba.co.uk
> > Web: http://www.toshiba-europe.com/research/
> > Tel: +44 1223 436999        Mobile: +44 7811 803377
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> >
>
> http://gridengine.info/
>
> Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551
> Kirchheim-Heimstetten
> Amtsgericht Muenchen: HRB 161028
> Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
> Vorsitzender des Aufsichtsrates: Martin Haering
>
>
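
To put the load report interval suggestion quoted above into rough numbers,
here is a small back-of-the-envelope sketch. It assumes each exec host sends
one load report per interval, and the intervals shown are purely illustrative
(they are not taken from the thread). In Grid Engine the interval should be
the load_report_time setting in the cluster configuration, but please check
sge_conf(5) for your version before changing anything. Fewer reports mean less
work for the qmaster, though as noted above this is guesswork until the real
cause is known.

```python
# Rough estimate of how often exec-host load reports reach the qmaster.
# Assumptions (not from the thread): one load report per host per interval;
# the candidate intervals below are hypothetical examples.

def reports_per_minute(exec_hosts: int, interval_seconds: float) -> float:
    """Return the number of load reports arriving at the qmaster per minute."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return exec_hosts * 60.0 / interval_seconds


EXEC_HOSTS = 38  # "around 38" exec hosts, as described in the original post

if __name__ == "__main__":
    for interval in (40, 120, 300):  # hypothetical intervals, in seconds
        rate = reports_per_minute(EXEC_HOSTS, interval)
        print(f"interval {interval:>3}s -> {rate:.1f} reports/minute")
```

This only counts load reports; job submissions, scheduling runs and qstat
queries also load the qmaster and are not modelled here.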


-- 
--tmac

RedHat Certified Engineer #804006984323821 (RHEL4)
RedHat Certified Engineer #805007643429572 (RHEL5)

Principal Consultant, RABA Technologies


