[GE users] LSF vs SGE
cpiodumpv at gmail.com
Tue Apr 12 03:52:42 BST 2005
thanx all for your helpful responses. Looks like I have a lot of
research to do.
On Apr 11, 2005 6:26 AM, Chris Dagdigian <dag at sonsorol.org> wrote:
> I am a *huge* grid engine fan, daily user and I promote/recommend it
> just about constantly to people I meet and converse with. In the last
> year I've worked on or deployed dozens and dozens of SGE clusters and
> less than 3 clusters running Platform LSF.
> I have to defend LSF when this issue comes up occasionally. I'm not trying
> to convert anyone (heck I converted to SGE myself for most work).
> Just trying to list some points that will show that the LSF vs SGE
> decision can be a bit more complicated for some people depending on
> their particular needs.
> My #1 recommendation to everyone:
> YOU MUST DO THE DUE DILIGENCE RESEARCH YOURSELF!
> The fact of the matter is that for many people who just want a queueing
> system w/ scheduler, LSF and SGE are functionally the same and SGE
> is often the best choice. Not just because of the price but also because
> the community behind SGE causes it to improve at an amazing rate.
> However, in a lot of areas LSF simply leaves SGE in the dust. Grid Engine
> can't currently come close to what LSF has to offer there.
> It's early AM here and I have lots to do today so I'm just going to list
> a few reasons/points. This is not going to be a complete discourse.
> Please forgive my mistakes as well, as I have not been seriously
> knee-deep in LSF for quite a while.
> 1. For people who care about high availability, LSF's failover mechanism
> is better than anything in SGE 5 or SGE 6. In an LSF cluster nodes will
> hold an election to promote a new master batch daemon. You can lose all
> of your nodes but 1 and still have a 100% fully functioning cluster.
> Compare this to shadow masters (which you have to set up and name in
> advance) and the fact that failover is "broken" currently in SGE 6
> unless you trust the folks that are using berkeleydb over NFSv4 right
> now. The fact that you can only have a single berkeley RPC spooling
> server just moves a single point of failure from the SGE qmaster node to
> the RPC spooling server node. Everyone knows that this will be fixed
> quickly but it is still an issue for the HA-concerned users.
> 2. LSF ships with full API hooks (and webservices!) allowing software
> developers to write seriously integrated cluster-aware code. This is
> invaluable for special purpose pipelines and workflows. Compare this to
> the fact that the only "official" way to do this currently in SGE6 is
> with limited-functionality DRMAA 1.0 or by doing what most of us do
> which is wrap calls to qsub/qrsh in perl scripts. Webservices in
> particular is compelling for some people and not for others.
> 3. LSF ships (by default) with a tomcat/apache app server and web
> interface that provides full web functionality for both cluster
> administrators *and* cluster users who want to run jobs or check up on
> running jobs. These web interfaces are superior to any of the free web
> frontends to SGE and are superior to what Sun has done internally with
> their N1 software integration.
> 4. LSF has both midrange and enterprise (read "expensive!") monitoring,
> reporting and accounting tools that simply are an order of magnitude
> better than what Sun ships with the commercial N1GE product. There are
> no reporting/accounting tools in SGE currently and most folks doing such
> reporting are just sucking their parsed accounting logs into MySQL for
> the moment and this does not get the "derived" data that most people
> actually need (N1GE ARCO does this; so do LSF tools). Not a big deal if
> you don't care about reporting but a big deal if you have to justify to
> management ("with pretty reports") that your expensive 7 figure cluster
> has been 80% utilized over the last quarter etc.
> 5. The Platform documentation seems to be more comprehensive although
> this gap is closing quickly. Sun did a really really good job on the
> Admin/User/Install guides when SGE6 came out.
> 6. You can use LSF to build real multi-site and WAN-spanning grids. Try
> to do this with SGE and people tell you "just use globus!". The number
> of people yelling "just use globus!" must be far larger than the number
> of people who have actually tried to do this in a *production* setting
> across multiple firewalls, heterogeneous systems and administrative
> domains. Transfer queues + Globus just does not do the trick. Globus
> people tend to get all excited about features but when you pin them
> down you quickly learn that "oh, that feature is currently not in the
> GT-toolkit, it will be out sometime in 2005 we think..."
> 7. This one is subjective: For seriously complicated resource allocation
> policies it seems to be "less work" to configure LSF than SGE. To do
> serious policy work in SGE without being an expert/experienced admin is
> quite a daunting task. Advanced policy config stuff is not well covered
> in existing manuals - the best source of info I have found is this
> mailing list. If the cost of purchasing LSF licenses is less than the
> cost of hiring additional staff to run your cluster then what is the
> more practical choice? For sites where salary costs are far higher than
> hardware costs choosing products that require "less work" is a big, big
> 8. In my experience Platform Support has been consistently good and a
> good support mechanism is worth paying for in some cluster environments.
> When I had a problem deploying LSF on one of the first Apple (G4) Xserve
> clusters I was speaking directly to the LSF product engineers within a
> few hours of contacting the support desk. A few hours after that, they
> sent me a new binary that fixed the issue.
> 9. Have you seen the price list for Sun N1GE? This is your primary
> option if you want enterprise support or the accounting/reporting
> module. I admit I have not seen it since N1GE was launched but the list
> I saw was prohibitively expensive for the types of clusters I usually
> work on (many small "bioclusters" of less than 50 nodes). Telling people
> who need support to purchase N1GE is not cool unless you know what
> it's going to cost. heh.
> 10. LSF sometimes has special pricing for non-profit and other types of
> research institutions in the US. For those that qualify, this can be
> more attractive than a free product because (in my mind at least) the
> cost you pay to Platform is justified by the reduced administrative
> burden, extra features you get *and most importantly* full-on product
> technical support.
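
For anyone wondering what "wrap calls to qsub/qrsh" from point 2 looks like in practice, here is a minimal sketch, in Python rather than Perl. The function names (build_qsub_command, parse_job_id, submit) are made up for illustration and are not part of SGE. It assumes qsub prints its usual 'Your job NNN ("name") has been submitted' confirmation line and that the -N, -cwd, -b y and -q options behave as described in the qsub man page for your release.

```python
import re
import subprocess

# Assumption: qsub confirms a submission with a line such as
#   Your job 12345 ("align") has been submitted
# (or "Your job-array 12345.1-10:1 ..." for array jobs).
JOB_ID_RE = re.compile(r"Your job(?:-array)? (\d+)")


def build_qsub_command(command, job_name, queue=None):
    """Return the argv list that submits `command` (a list of strings)."""
    argv = ["qsub", "-N", job_name, "-cwd", "-b", "y"]
    if queue is not None:
        argv += ["-q", queue]
    return argv + list(command)


def parse_job_id(qsub_output):
    """Extract the numeric job id from qsub's confirmation message."""
    match = JOB_ID_RE.search(qsub_output)
    if match is None:
        raise ValueError("no job id found in qsub output: %r" % qsub_output)
    return int(match.group(1))


def submit(command, job_name, queue=None):
    """Run qsub and return the job id; raises if qsub exits non-zero."""
    result = subprocess.run(
        build_qsub_command(command, job_name, queue),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_job_id(result.stdout)
```

Something like submit(["sleep", "60"], "nap") would then hand back a job id to poll with qstat. Scraping a human-readable message is fragile, which is the limitation point 2 is describing.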
> Ok. Time to get some coffee. Like I said above I'm not trying to promote
> LSF as much as I'm trying to make the point that LSF vs SGE is not a
> simplistic choice and most people are better off if they do the research
> themselves to see what fits best for their use cases and environments.
> SGE is a best-fit for lots of people, but not all people in all settings.
> Adam Lowry wrote:
> > Why would anyone use a DRM that costs $ such
> > as LSF when grid SGE is free?
> > -thanx
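
On the reporting gap in point 4, the do-it-yourself route of parsing the accounting log could be sketched roughly as below. This is only an illustration under stated assumptions: the field positions (owner, start_time, end_time) are my reading of the colon-separated accounting file and should be checked against accounting(5) for your release, every job is treated as using one slot, and the function names are hypothetical.

```python
from collections import defaultdict

# Assumed 0-based positions of colon-separated fields in the accounting
# file; verify against accounting(5) before trusting any numbers.
OWNER, START_TIME, END_TIME = 3, 9, 10


def wallclock_by_owner(lines):
    """Sum wallclock seconds (end_time - start_time) per job owner."""
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":")
        if len(fields) <= END_TIME:
            continue  # record too short to contain the fields we need
        start, end = int(fields[START_TIME]), int(fields[END_TIME])
        if start <= 0 or end < start:
            continue  # job never started, or the record is inconsistent
        totals[fields[OWNER]] += end - start
    return dict(totals)


def utilization(totals, total_slots, period_seconds):
    """Fraction of available slot-seconds used, assuming one slot per job."""
    return sum(totals.values()) / (total_slots * period_seconds)
```

With totals = wallclock_by_owner(open(path_to_accounting_file)) you get per-user figures, and utilization(totals, total_slots, period_seconds) gives one crude number for the period. It ignores parallel jobs and jobs that straddle the reporting window, which is part of the "derived" data the proper tools work out for you.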