[GE users] LSF vs SGE
dag at sonsorol.org
Mon Apr 11 12:26:18 BST 2005
I am a *huge* grid engine fan, daily user and I promote/recommend it
just about constantly to people I meet and converse with. In the last
year I've worked on or deployed dozens and dozens of SGE clusters and
less than 3 clusters running Platform LSF.
I have to defend LSF when this issue comes up occasionally. I'm not trying
to convert anyone (heck I converted to SGE myself for most work).
Just trying to list some points that will show that the LSF vs SGE
decision can be a bit more complicated for some people depending on
their particular needs.
My #1 recommendation to everyone:
YOU MUST DO THE DUE DILIGENCE RESEARCH YOURSELF!
The fact of the matter is that for many people who just want a queueing
system w/ scheduler then LSF and SGE are functionally the same and SGE
is often the best choice. Not just because of the price but also because
the community behind SGE causes it to improve at an amazing rate.
However, in a lot of areas LSF simply leaves SGE in the dust. In those
areas Grid Engine can't currently come close to what LSF has to offer.
It's early AM here and I have lots to do today so I'm just going to list
a few reasons/points. This is not going to be a complete discourse.
Please forgive my mistakes as well as I have not been seriously
knee-deep in LSF for quite a while.
1. For people who care about high availability, LSF's failover mechanism
is better than anything in SGE 5 or SGE 6. In an LSF cluster nodes will
hold an election to promote a new master batch daemon. You can lose all
of your nodes but 1 and still have a 100% fully functioning cluster.
Compare this to shadow masters (which you have to set up and name in
advance) and the fact that failover is "broken" currently in SGE 6
unless you trust the folks that are using Berkeley DB over NFSv4 right
now. The fact that you can only have a single Berkeley DB RPC spooling
server just moves a single point of failure from the SGE qmaster node to
the RPC spooling server node. Everyone knows that this will be fixed
quickly but it is still an issue for the HA-concerned users.
2. LSF ships with full API hooks (and webservices!) allowing software
developers to write seriously integrated cluster-aware code. This is
invaluable for special purpose pipelines and workflows. Compare this to
the fact that the only "official" way to do this currently in SGE6 is
with limited-functionality DRMAA 1.0 or by doing what most of us do
which is to wrap calls to qsub/qrsh in perl scripts. Webservices in
particular are compelling for some people and not for others.
3. LSF ships (by default) with a tomcat/apache app server and web
interface that provides full web functionality for both cluster
administrators *and* cluster users who want to run jobs or check up on
running jobs. These web interfaces are superior to any of the free web
frontends to SGE and are superior to what Sun has done internally with
their N1 software integration.
4. LSF has both midrange and enterprise (read "expensive!") monitoring,
reporting and accounting tools that simply are an order of magnitude
better than what Sun ships with the commercial N1GE product. There are
no reporting/accounting tools in SGE currently and most folks doing such
reporting are just sucking their parsed accounting logs into MySQL for
the moment and this does not get the "derived" data that most people
actually need (N1GE ARCO does this; so do LSF tools). Not a big deal if
you don't care about reporting but a big deal if you have to justify to
management ("with pretty reports") that your expensive 7 figure cluster
has been 80% utilized over the last quarter etc.
5. The Platform documentation seems to be more comprehensive although
this gap is closing quickly. Sun did a really really good job on the
Admin/User/Install guides when SGE6 came out.
6. You can use LSF to build real multi-site and WAN-spanning grids. Try
to do this with SGE and people tell you "just use globus!". The number
of people yelling "just use globus!" must be far larger than the number
of people who have actually tried to do this in a *production* setting
across multiple firewalls, heterogeneous systems and administrative
domains. Transfer queues + Globus just does not do the trick. Globus
people tend to get all excited about features but when you pin them
down you quickly learn that "oh, that feature is currently not in the
GT-toolkit, it will be out sometime in 2005 we think..."
7. This one is subjective: For seriously complicated resource allocation
policies it seems to be "less work" to configure LSF than SGE. To do
serious policy work in SGE without being an expert/experienced admin is
quite a daunting task. Advanced policy config stuff is not well covered
in existing manuals - the best source of info I have found is this
mailing list. If the cost of purchasing LSF licenses is less than the
cost of hiring additional staff to run your cluster then what is the
more practical choice? For sites where salary costs are far higher than
hardware costs choosing products that require "less work" is a big, big
8. In my experience Platform Support has been consistently good and a
good support mechanism is worth paying for in some cluster environments.
When I had a problem deploying LSF on one of the first Apple (G4) Xserve
clusters I was speaking directly to the LSF product engineers within a
few hours of contacting the support desk. A few hours after that, they
sent me a new binary that fixed the issue.
9. Have you seen the price list for Sun N1GE? This is your primary
option if you want enterprise support or the accounting/reporting
module. I admit I have not seen it since N1GE was launched but the list
I saw was prohibitively expensive for the types of clusters I usually
work on (many small "bioclusters" of less than 50 nodes). Telling people
who need support to purchase N1GE is not cool unless you know what
it's going to cost. heh.
10. LSF sometimes has special pricing for non-profit and other types of
research institutions in the US. For those that qualify, this can be
more attractive than a free product because (in my mind at least) the
cost you pay to Platform is justified by the reduced administrative
burden, extra features you get *and most importantly* full on product
Ok. Time to get some coffee. Like I said above I'm not trying to promote
LSF as much as I'm trying to make the point that LSF vs SGE is not a
simplistic choice and most people are better off if they do the research
themselves to see what fits best for their use cases and environments.
SGE is a best-fit for lots of people, but not all people in all settings.
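
For anyone wondering what "wrap calls to qsub" from point 2 looks like in
practice, here is a minimal sketch, in Python rather than perl. The helper
names (build_qsub_command, parse_job_id, submit) are made up for
illustration and are not part of SGE. The parsing assumes qsub prints its
usual 'Your job N ("name") has been submitted' line, so check that against
your own installation before relying on it.

```python
import re
import subprocess

# Assumption: qsub confirms a submission with a line such as
#   Your job 12345 ("myjob.sh") has been submitted
_JOB_ID_RE = re.compile(r"Your job (\d+)")


def build_qsub_command(script, name=None, queue=None):
    """Build the qsub argument list (-N sets the job name, -q the queue)."""
    cmd = ["qsub"]
    if name:
        cmd += ["-N", name]
    if queue:
        cmd += ["-q", queue]
    cmd.append(script)
    return cmd


def parse_job_id(qsub_output):
    """Pull the numeric job ID out of qsub's confirmation message."""
    match = _JOB_ID_RE.search(qsub_output)
    if match is None:
        raise ValueError("could not find a job ID in: %r" % qsub_output)
    return int(match.group(1))


def submit(script, name=None, queue=None):
    """Run qsub and return the job ID. Needs qsub on the PATH."""
    result = subprocess.run(
        build_qsub_command(script, name, queue),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if qsub exits non-zero
    )
    return parse_job_id(result.stdout)
```

Scraping a human-readable message like this is exactly the kind of fragile
glue that a real API avoids, which is the point being made about LSF above.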
Adam Lowry wrote:
> Why would anyone use a DRM that costs $ such
> as LSF when grid SGE is free?