[GE users] LSF vs SGE

Chris Dagdigian dag at sonsorol.org
Mon Apr 11 12:26:18 BST 2005


I am a *huge* grid engine fan, daily user and I promote/recommend it 
just about constantly to people I meet and converse with. In the last 
year I've worked on or deployed dozens and dozens of SGE clusters and 
less than 3 clusters running Platform LSF.

Still...

I have to defend LSF when this issue comes up occasionally. I'm not trying 
to convert anyone (heck, I converted to SGE myself for most work).

Just trying to list some points that will show that the LSF vs SGE 
decision can be a bit more complicated for some people depending on 
their particular needs.

My #1 recommendation to everyone:
YOU MUST DO THE DUE DILIGENCE RESEARCH YOURSELF!

The fact of the matter is that for many people who just want a queueing 
system w/ scheduler, LSF and SGE are functionally the same and SGE 
is often the best choice. Not just because of the price but also because 
the community behind SGE causes it to improve at an amazing rate.

But...

In a lot of areas LSF simply leaves SGE in the dust. Grid Engine can't 
currently come close to what LSF has to offer.

It's early AM here and I have lots to do today so I'm just going to list 
a few reasons/points. This is not going to be a complete discourse.

Please forgive my mistakes as well, as I have not been seriously 
knee-deep in LSF for quite a while.

1. For people who care about high availability, LSF's failover mechanism 
is better than anything in SGE 5 or SGE 6. In an LSF cluster, nodes will 
hold an election to promote a new master batch daemon. You can lose all 
of your nodes but 1 and still have a 100% fully functioning cluster. 
Compare this to shadow masters (which you have to set up and name in 
advance) and the fact that failover is "broken" currently in SGE 6 
unless you trust the folks that are using Berkeley DB over NFSv4 right 
now. The fact that you can only have a single Berkeley DB RPC spooling 
server just moves the single point of failure from the SGE qmaster node 
to the RPC spooling server node. Everyone knows that this will be fixed 
quickly but it is still an issue for the HA-concerned users.

2. LSF ships with full API hooks (and webservices!) allowing software 
developers to write seriously integrated cluster-aware code. This is 
invaluable for special purpose pipelines and workflows. Compare this to 
the fact that the only "official" way to do this currently in SGE6 is 
with limited-functionality DRMAA 1.0 or by doing what most of us do, 
which is wrap calls to qsub/qrsh in Perl scripts. Webservices in 
particular are compelling for some people and not for others.

3. LSF ships (by default) with a Tomcat/Apache app server and web 
interface that provides full web functionality for both cluster 
administrators *and* cluster users who want to run jobs or check up on 
running jobs. These web interfaces are superior to any of the free web 
frontends to SGE and are superior to what Sun has done internally with 
their N1 software integration.

4. LSF has both midrange and enterprise (read "expensive!") monitoring, 
reporting and accounting tools that simply are an order of magnitude 
better than what Sun ships with the commercial N1GE product. There are 
no reporting/accounting tools in SGE currently and most folks doing such 
reporting are just sucking their parsed accounting logs into MySQL for 
the moment, and this does not get the "derived" data that most people 
actually need (N1GE ARCO does this; so do LSF tools). Not a big deal if 
you don't care about reporting but a big deal if you have to justify to 
management ("with pretty reports") that your expensive 7 figure cluster 
has been 80% utilized over the last quarter etc.

5. The Platform documentation seems to be more comprehensive although 
this gap is closing quickly. Sun did a really really good job on the 
Admin/User/Install guides when SGE6 came out.

6. You can use LSF to build real multi-site and WAN-spanning grids. Try 
to do this with SGE and people tell you "just use globus!". The number 
of people yelling "just use globus!" must be far larger than the number 
of people who have actually tried to do this in a *production* setting 
across multiple firewalls, heterogeneous systems and administrative 
domains. Transfer queues + Globus just does not do the trick. Globus 
people tend to get all excited about features but when you pin them 
down you quickly learn that "oh, that feature is currently not in the 
GT-toolkit, it will be out sometime in 2005 we think..."


7. This one is subjective: For seriously complicated resource allocation 
policies it seems to be "less work" to configure LSF than SGE. To do 
serious policy work in SGE without being an expert/experienced admin is 
quite a daunting task. Advanced policy config stuff is not well covered 
in existing manuals - the best source of info I have found is this 
mailing list. If the cost of purchasing LSF licenses is less than the 
cost of hiring additional staff to run your cluster then what is the 
more practical choice? For sites where salary costs are far higher than 
hardware costs choosing products that require "less work" is a big, big 
deal.


8. In my experience Platform Support has been consistently good and a 
good support mechanism is worth paying for in some cluster environments. 
When I had a problem deploying LSF on one of the first Apple (G4) Xserve 
clusters I was speaking directly to the LSF product engineers within a 
few hours of contacting the support desk. A few hours after that, they 
sent me a new binary that fixed the issue.


9. Have you seen the price list for Sun N1GE? This is your primary 
option if you want enterprise support or the accounting/reporting 
module. I admit I have not seen it since N1GE was launched but the list 
I saw was prohibitively expensive for the types of clusters I usually 
work on (many small "bioclusters" of less than 50 nodes). Telling people 
who need support to purchase N1GE is not cool unless you know what 
it's going to cost. heh.

10. LSF sometimes has special pricing for non-profit and other types of 
research institutions in the US. For those that qualify, this can be 
more attractive than a free product because (in my mind at least) the 
cost you pay to Platform is justified by the reduced administrative 
burden, the extra features you get *and most importantly* full-on 
product technical support.
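
To make point 2 concrete, here is roughly what the "wrap qsub in a 
script" approach looks like. This is only a minimal sketch in Python 
rather than Perl. The helper names (parse_job_id, submit) are made up 
for illustration, and the confirmation message format in the comment is 
an assumption you should check against what your own qsub prints.

```python
import re
import subprocess

# Matches the confirmation line qsub prints on success, e.g.
#   Your job 4242 ("align.sh") has been submitted
# Assumed format; array jobs are assumed to print
#   Your job-array 4243.1-10:1 ("align.sh") has been submitted
_SUBMITTED = re.compile(
    r'Your job(?:-array)? (\d+)(?:\.\S+)? \("(.*)"\) has been submitted'
)


def parse_job_id(qsub_output):
    """Return the numeric job id found in qsub's output, or None."""
    match = _SUBMITTED.search(qsub_output)
    return int(match.group(1)) if match else None


def submit(script, *qsub_args):
    """Run qsub on a job script and return the job id (hypothetical helper)."""
    result = subprocess.run(
        ["qsub", *qsub_args, script],
        capture_output=True, text=True, check=True,
    )
    job_id = parse_job_id(result.stdout)
    if job_id is None:
        raise RuntimeError("could not find a job id in: " + result.stdout)
    return job_id
```

The fragile part is obvious: the pipeline depends on scraping 
human-readable output, which is exactly why a real API matters to some 
people.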

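On point 4, the "derived" data is the kind of thing sketched below: 
cluster utilization over a reporting window, computed from per-job 
records. This is a rough Python sketch, not how ARCO or the LSF tools 
do it. The Job record and its fields (start, end, slots) are 
assumptions; you would have to fill them in from your own parsed 
accounting data.

```python
from collections import namedtuple

# One finished job (hypothetical record layout): start/end as epoch
# seconds, slots = number of slots the job held while running.
Job = namedtuple("Job", "start end slots")


def utilization(jobs, total_slots, window_start, window_end):
    """Fraction of available slot-seconds used by jobs inside the window."""
    available = total_slots * (window_end - window_start)
    if available <= 0:
        raise ValueError("window and total_slots must be positive")
    used = 0
    for job in jobs:
        # Count only the part of the job that overlaps the window.
        overlap = min(job.end, window_end) - max(job.start, window_start)
        if overlap > 0:
            used += overlap * job.slots
    return used / available
```

A result of 0.8 for last quarter's window is the "80% utilized" number 
management wants to see. Raw accounting rows dumped into a database 
don't give you that until somebody writes this sort of query on top.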

Ok. Time to get some coffee. Like I said above I'm not trying to promote 
LSF as much as I'm trying to make the point that LSF vs SGE is not a 
simplistic choice and most people are better off if they do the research 
themselves to see what fits best for their use cases and environments. 
SGE is a best-fit for lots of people, but not all people in all settings.


-Chris



Adam Lowry wrote:
> Why would anyone use a DRM that costs $ such
> as LSF when grid SGE is free?
> 
> -thanx

