[GE users] Documentation about SGE

Chris Dagdigian dag at sonsorol.org
Fri Jul 23 17:31:05 BST 2004

Yeah. Didn't mean to be unfair to SGE. I also should have explicitly 
ended my last email with the strong statement that SGE is by far the 
best choice for people who are looking purely at the zero-cost (i.e. open 
source) solutions.

SGE also trumps LSF in a bunch of different scenarios, especially for 
people who just want a solid DRM and don't envision the need for the 
hardcore add-on abilities that Platform can sell to LSF users.

Helping people with the "what product do I use and how do I configure it 
to be useful to me?" problem is one of the most enjoyable parts of my 
day job.

{more comments below}

Rayson Ho wrote:

>>o LSF *by far* has fault tolerance and resiliency features that 
>> blow away all the competition. The SGE shadow master does not 
>> come close to LSF's ability to keep "electing" a new master node 
>> as systems fail or drop offline one by one.
> Mostly agree with your other points, but the point above is not too fair to
> SGE: LSF can do that because the "sbatchd" is a fat daemon with the
> functionality of SGE's execd (the job management part, not the load
> reporting part) and shadowd.

> SGE can also keep on electing a new master node, but it is not done this
> way likely because the SGE designers think that the cluster admin should
> configure it when it is needed rather than setting every batch node to be
> the fail-over master by default.

There is a difference between "what the developers think the cluster 
admin should do" and "what the cluster admin actually ends up doing" in 
the real world :)

Lots of the hardcore cluster folks assume all clusters are like the big 
ones they are running. They also assume a high level of skill and actual 
cluster management interest. Then there are the grid/cluster sales and 
marketing folks who only want to target, sell to, or talk about the 
"sexy" cluster opportunities with CPU counts in the hundreds or thousands.

What is often ignored is the very large number of small 2-, 4-, 8-, or 
10-node systems being set up by ordinary people who have no desire to be 
cluster gurus but *do* have some large problems that they need to solve.

I deal every day with biologists who build clusters to solve problems of 
interest. They build and run these systems because they have to, not 
because they want to be a best-practices cluster admin. Most of these 
clusters could fit under someone's desk.

For these people, having the (LSF failover) ability essentially be 
automatic cluster-wide because the sbatchd daemon runs everywhere is an 
*excellent* feature, mostly because they don't have to do anything or 
really care about the internal guts of scheduler failover and 
fault-tolerance issues. For them, the cluster "just runs" if the primary 
master batch daemon host decides to crash or call it a day.
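
To make the "keep electing a new master" idea concrete, here is a minimal 
Python sketch. It is only an illustration of the general technique 
(promote the first surviving host from an ordered candidate list), not a 
description of how LSF or SGE actually implement failover. The function 
name elect_master and the host names are made up for the example.

```python
from typing import Iterable, Optional, Sequence


def elect_master(candidates: Sequence[str], alive: Iterable[str]) -> Optional[str]:
    """Return the first candidate host that is still alive, or None.

    Assumption: every host agrees on the same ordered candidate list,
    so each host arrives at the same answer independently.
    """
    alive_set = set(alive)
    for host in candidates:
        if host in alive_set:
            return host
    return None


# Hypothetical four-node cluster; every node is a failover candidate.
candidates = ["node1", "node2", "node3", "node4"]
alive = set(candidates)

# Hosts drop offline one by one; record who becomes master each time.
history = []
for failed in ["node1", "node2", "node3"]:
    alive.discard(failed)
    history.append(elect_master(candidates, alive))

print(history)
```

The point for the small-cluster admin is that nothing has to be configured 
per host: as long as any candidate is still up, there is a master.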

In my long-winded way I'm trying to argue that sometimes, the end-user 
is not an expert, and may not even have an interest in becoming an 
expert at all. The cluster is just a tool, nothing more, nothing less. 
For these cases it is sometimes good to have a few of the powerful 
save-the-user-from-themselves type technologies active and running 
behind the scenes.

I'll stop now that I've gone 100% off topic. heh

