[GE users] Using multiple grid systems

Brett_W_Grant at raytheon.com
Fri Mar 9 18:43:36 GMT 2007


First off, let me say that I am not a computer scientist, so most of my 
computer knowledge is self-taught.  Here at work, we have a simulation 
that takes a number of inputs and gives the state of a system at a set 
of x,y points for those inputs through Monte Carlo simulation, i.e., 
each x,y point is run multiple times to get a statistical answer.

Three years ago, we had one computer cluster of 22 Opteron boxes and we 
ran all of our jobs through that.  It was not a big deal: it took a 
little while to figure out how to run gridware, but we eventually got 
there and were good to go.

Two years ago, the company provided a second cluster of 35 old Xeon boxes 
that had been sitting around, and we got gridware installed on it.  These 
two systems cannot and never will be connected.  I wrote some Perl and 
awk scripts to figure out what was running where and what the status of 
the output files was.  Any data computed on one cluster was burned to a 
CD and uploaded to the other.

Not much later, my boss bought 7 Mac G5 towers to see if we could use 
Macs.  This brought the number of independent SGE grids to three.  The 
Macs can see the Xeon boxes, but due to company policies, I cannot add 
the Xeons to the Mac grid or the Macs to the Xeon grid.  I'm still keeping 
track of jobs by hand.  Due to time constraints and computer shutdowns, 
the same jobs sometimes get submitted to one or both of the other systems 
as well.  Many headaches ensue.

About a year ago, my boss decided he liked Macs and convinced his boss to 
pony up the money for 50+ Mac Xserves.  Fights with the company's IT 
department broke out over these computers, which led to them being put on 
yet another network, giving me four grids to worry about.  This network 
can connect to the other Macs or the Xeons, but I must still keep each 
grid separate.  I now have some fairly fancy scripts that keep track of 
everything, as long as I was the one who submitted it.

A few months ago, management suddenly realized that we needed more 
computers to finish our jobs.  They purchased 30 more Macs.  IT provided 
yet another computer cluster (officially we are testing it) of 60 
computers, but it too must be kept on a separate grid.  Independently, IT 
also informed us that they have purchased GridMP and that if we can 
compile our simulation to run on Windows boxes, we can have 1000 more 
computers to run things on.

All of these computers have been added to support multiple contracts, 
and different people are in charge of each one.  Essentially, there are 
too many jobs for me to manage, so others will now be doing that as well.  
I would really like to find a way to make job submission and metric 
tracking less difficult.

I looked at the Globus website, which I don't really understand, but I 
don't think that it is very realistic for this project.  The concept 
sounds good, but I know that I do not have the knowledge to implement it, 
I don't think that my boss is willing to hire someone to figure it out, 
and I don't think that I will be allowed the time to figure it out 
myself.

Lately, I have been playing with MySQL, Perl, and cron jobs to automate 
some of this.  I can submit my inputs to the database.  If I tell it which 
grid to run on, the scripts can submit the jobs to the individual grids 
and keep track of them.  What I need to figure out is how to dispatch the 
jobs to the various computer networks.  I have some real schedule 
deadlines and I can't afford to have the Xeons with 10,000 jobs in the 
queue while the Macs sit idle.  What has always caused the most 
headaches, though, is balancing the requested jobs across all of the 
computer systems and then reassembling the data.
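
One way to frame the dispatch step is as a greedy assignment: each 
pending job goes to whichever grid currently has the fewest queued jobs 
per slot.  The following is only a minimal sketch under stated 
assumptions, written in Python rather than the Perl mentioned above.  The 
grid names, the one-slot-per-box counts, and the dispatch() function are 
all hypothetical, and in practice the queue depths would have to be 
collected from each grid separately (for example from the tracking 
database).

```python
import heapq

def dispatch(pending_jobs, grids):
    """Greedily assign jobs to the grid with the fewest queued jobs per slot.

    grids maps a (hypothetical) grid name to {"slots": int, "queued": int}.
    Returns a dict mapping each grid name to the list of jobs assigned to it.
    """
    # Heap entries are (queued jobs per slot, grid name); smallest load first.
    heap = [(g["queued"] / g["slots"], name) for name, g in grids.items()]
    heapq.heapify(heap)
    queued = {name: g["queued"] for name, g in grids.items()}
    plan = {name: [] for name in grids}
    for job in pending_jobs:
        _, name = heapq.heappop(heap)      # least-loaded grid right now
        plan[name].append(job)
        queued[name] += 1                  # account for the job just assigned
        heapq.heappush(heap, (queued[name] / grids[name]["slots"], name))
    return plan

# Assumed snapshot: one slot per box, Xeons already holding a deep queue.
grids = {
    "opteron": {"slots": 22, "queued": 0},
    "xeon": {"slots": 35, "queued": 10000},
    "g5": {"slots": 7, "queued": 0},
}
plan = dispatch(range(100), grids)
for name, jobs in plan.items():
    print(name, len(jobs))
```

With this snapshot the backlogged Xeon grid receives none of the new 
jobs, and the rest are split between the idle grids roughly in proportion 
to their slot counts.  The sketch ignores differences in per-box speed, 
which a real dispatcher would probably need to weight for.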

Perhaps I have lost sight of the big picture while trying to put out the 
tiny fires.  Has anyone had this problem before and overcome it?  I am 
really nervous about trying to use a different style of grid software 
(GridMP), as I haven't had time to figure out how it works.  Has anyone 
kept track of jobs using MySQL?

Any input is appreciated.

Thanks,
Brett Grant



