[GE users] Using multiple grid systems
Brett_W_Grant at raytheon.com
Fri Mar 9 18:43:36 GMT 2007
First off, let me say that I am not a computer scientist, so most of my
computer knowledge is self-taught. Here at work, we have a simulation
that takes in a number of inputs, but essentially gives the state of a
system at a bunch of x,y points for the given inputs through Monte Carlo
simulation, i.e., each x,y point is run multiple times to get a statistical
Three years ago, we had one computer cluster of 22 Opteron boxes and we
ran all of our jobs through that. Not a big deal, took a little bit to
figure out how to run gridware, but we eventually figured that out and
were good to go.
Two years ago, the company provided a second cluster of 35 old Xeon boxes
that was sitting around and we got gridware installed on there. Now these
two systems cannot and will never be connected. I wrote some Perl and
awk scripts to figure out what was running where, and what the status of
the output files was. Any data computed on the one was burned to a CD
and uploaded to the other.
Not much later, my boss bought 7 Mac G5 towers to see if we could use
Macs. This brought the number of independent SGE grids to three. The
Macs can see the Xeon boxes, but due to company policies, I cannot add
the Xeons to the MacGrid or the Macs to the Xeon Grid. I'm still keeping
track of jobs by hand. Due to time constraints and computer shutdowns,
sometimes the same jobs are submitted to one or both of the other systems.
Many headaches ensue.
About a year ago, Boss likes Macs, so he convinces his boss to pony up
money to buy 50+ Mac Xserves. Fights with company's IT department break
out over these computers, which lead to these being put on yet another
network, giving me 4 grids to worry about. This network can connect to
the other Macs or Xeons, but I must still keep each grid separate. Now I
have some real fancy scripts to keep track of everything as long as I
A few months ago, management suddenly realizes that we need more computers
to finish our jobs. They purchase 30 more Macs. IT provides yet another
computer cluster (officially we are testing it) of 60 computers, but it
too must be kept on a separate grid. Independently IT also informs us
that they have purchased GridMP and if we can compile our simulation to
run on Windows boxes, we can have 1000 more computers to run things on.
Now all of these computers have been added to support multiple contracts
and different people are in charge of each one. Essentially there are too
many jobs for me to manage and now others will be doing that. I would
really like to find a way so that job submission and metric tracking will
not be so difficult.
I looked at the Globus website, which I don't really understand, but I
don't think that it is very realistic for this project. The concept
sounds good, but I know that I do not have the knowledge to implement it,
and I don't think that my Boss is willing to hire someone to figure it
out, plus I don't think that they will allow for the time to figure it
out.
Lately, I have been playing with MySQL and Perl and cron jobs to automate
some stuff. I can submit my inputs to the database. If I tell it what
computer to run on, the scripts can submit the jobs to the individual
grids and keep track of them. What I need to figure out is how to
dispatch the jobs to the various computer networks. I have some real
schedule deadlines and I can't afford to have the Xeons with 10000 jobs in
the queue and the Macs idle. However, what always caused the most headaches
was trying to balance the requested jobs on all of the computer systems
and then reassembling the data.
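
The balancing problem described above can be sketched in a few lines. This
is only an illustration, not something that is running here: it assumes
every job costs about the same, that each box provides one job slot, and
that the current queue length of each grid is already known (for example
from the tracking database). The function name balance_jobs and the
(slots, queued) layout are made up for the example.

```python
import heapq


def balance_jobs(num_jobs, grids):
    """Greedily spread num_jobs equal-cost jobs across independent grids.

    grids maps a grid name to (slots, queued): slots is how many jobs the
    grid can run at once, queued is how many jobs it already holds.
    Each new job goes to the grid whose backlog per slot would be smallest
    after accepting it. Returns {grid name: number of new jobs assigned}.
    """
    assigned = {name: 0 for name in grids}
    # Heap key: backlog per slot if this grid took one more job.
    heap = [((queued + 1) / slots, name)
            for name, (slots, queued) in grids.items() if slots > 0]
    if not heap:
        raise ValueError("no grid with usable slots")
    heapq.heapify(heap)
    for _ in range(num_jobs):
        _, name = heapq.heappop(heap)
        assigned[name] += 1
        slots, queued = grids[name]
        # Re-insert the grid with its updated "one more job" backlog.
        heapq.heappush(heap, ((queued + assigned[name] + 1) / slots, name))
    return assigned


if __name__ == "__main__":
    # Slot counts are illustrative: one slot per box on the first three grids.
    print(balance_jobs(64, {"opteron": (22, 0), "xeon": (35, 0), "g5": (7, 0)}))
    # A grid with a deep queue gets nothing while another grid sits idle.
    print(balance_jobs(100, {"xeon": (35, 10000), "g5": (7, 0)}))
```

With empty queues the jobs end up split in proportion to the slot counts,
and a grid that already has 10000 jobs queued receives no new work until
the idle one has caught up. Jobs of very different lengths or boxes of
very different speeds would need a weight per job or per grid.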
Perhaps I have lost sight of the big picture while trying to put out the
tiny fires. Has anyone had this problem before and overcome it? I am
really nervous about trying to use a different style of grid software
(GridMP) as I haven't had time to figure out how it works. Has anyone
kept track of jobs using MySQL?
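
For concreteness, the kind of tracking meant here looks roughly like the
sketch below. It uses Python's built-in sqlite3 module as a stand-in for
MySQL so that it runs without a server. The table layout, the column names
and the helper functions are all assumptions made for the example. The
UNIQUE constraint is there to stop the same x,y point and run from being
requested twice, which is how duplicate submissions to different grids
could be caught.

```python
import sqlite3


def open_tracker(path=":memory:"):
    """Open (or create) the job-tracking database."""
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS jobs (
            id     INTEGER PRIMARY KEY,
            x      REAL    NOT NULL,
            y      REAL    NOT NULL,
            run    INTEGER NOT NULL,
            grid   TEXT,
            status TEXT    NOT NULL DEFAULT 'pending',
            UNIQUE (x, y, run)
        )""")
    return db


def request_job(db, x, y, run):
    """Record a job request; return False if it was already tracked."""
    cur = db.execute(
        "INSERT OR IGNORE INTO jobs (x, y, run) VALUES (?, ?, ?)",
        (x, y, run))
    db.commit()
    return cur.rowcount == 1


def mark(db, x, y, run, grid, status):
    """Record which grid a job went to and its current status."""
    db.execute(
        "UPDATE jobs SET grid = ?, status = ? "
        "WHERE x = ? AND y = ? AND run = ?",
        (grid, status, x, y, run))
    db.commit()


def summary(db):
    """Count jobs per (grid, status) for a quick progress report."""
    rows = db.execute(
        "SELECT COALESCE(grid, 'unassigned'), status, COUNT(*) "
        "FROM jobs GROUP BY grid, status")
    return {(grid, status): n for grid, status, n in rows}


if __name__ == "__main__":
    db = open_tracker()
    print(request_job(db, 1.0, 2.0, 0))   # first request is recorded
    print(request_job(db, 1.0, 2.0, 0))   # duplicate request is ignored
    mark(db, 1.0, 2.0, 0, "xeon", "running")
    print(summary(db))
```

In MySQL the duplicate-skipping insert would be written as INSERT IGNORE
rather than INSERT OR IGNORE, and a cron job could call something like
summary() to produce the metrics.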
Any input is appreciated.