[SGE-discuss] [gridengine users] Beware Univa FUD

William Hay w.hay at ucl.ac.uk
Wed Nov 16 09:24:47 GMT 2011

On 16 November 2011 00:10, Dave Love <d.love at liverpool.ac.uk> wrote:
> William Hay <w.hay at ucl.ac.uk> writes:
>> On 10 November 2011 03:46, Ron Chen <ron_chen_123 at yahoo.com> wrote:
>>> 4) Fritz was telling customers (including William Hay) that open source Grid Engine is "buggy, unstable, hard to debug", and to use SGE in production customers need to buy support from Univa.
>> I should point out this was in the context of me baiting him with the
>> assertion that we didn't need a support contract because his team had
>> produced such a robust product.  Also I believe his remarks were
>> directed at the common Grid engine code base not specifically the open
>> source variants.
> I'd say the code is relatively buggy and intractable by the standards of
> (different sorts of) projects I'm used to.  [I say that neutrally, and I
> haven't worked on a more-or-less equivalent system, say SLURM, to
> compare.]  I don't know when most regressions in the 6.2 series
> occurred, and they're not all easy to spot in change logs, but possibly
> the version in use at UCL was in something of a sweet spot.  I'd expect
> our usage to be similar to UCL's as far as showing them up.  As it
> happens, I've recently been fighting a spooling regression (and cocked
> up pushing the patch -- thanks Florian).

My characterisation of Grid Engine as robust was in comparison to
Torque and Moab.  Torque in particular seemed to be rather fragile, and
the combination seemed to have issues scaling to the number of jobs we
needed (array jobs didn't appear to work properly, and somewhere
slightly north of 50000 jobs in the queue the two of them timed out
when talking to each other).  Possibly later versions would have
resolved some of these issues, but at the time Cluster Resources
(Adaptive Computing as it is now) wouldn't promise more than 50000
jobs, and we'd seen fairly glaring bugs get past their regression
testing.  While Grid Engine isn't bug-free, it scales to our workload,
and on the few occasions when it has thrown a wobbly the log files
were very helpful in identifying the source of the problem.  We did
try 6.2u5 but encountered some issues that, although we were able to
work around them, bore an uncomfortable resemblance to known bugs in
that version, so we decided to switch back to 6.2u3, as that was the
preferred version of our cluster integrator.  The main issue we
currently have with SGE is the time a scheduling cycle takes.  We're
currently trying to tweak the configuration to minimise the work SGE
has to do while still implementing our policy.

