[GE users] Cluster Queue errors w/ PVM
raychan at ucdavis.edu
Thu Jan 5 00:17:32 GMT 2006
I've been experimenting w/ tight integration of PVM by following Reuti's
HowTo. My main concern was using qdel to kill PVM jobs, and seeing the
processes killed on all nodes. This seems to work cleanly sometimes, but
other times one of the nodes in the cluster queue I submit the job to is put
into an error state. I then have to clear the error flag either in qmon or
with a qmod -cq <queue_name>.
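For reference, the manual clearing step looks like this (the queue and host
names pvm.q and node01 are examples only, not from my actual setup):

```
# Instances in error state show an E in the states column of qstat -f.
qstat -f
# Clear the error flag on one queue instance ...
qmod -cq pvm.q@node01
# ... or on every instance of the cluster queue at once.
qmod -cq pvm.q
```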
I thought I followed the tight integration instructions correctly, as it
does work sometimes, but I need it to work 100% of the time, as this system
will be hit hard w/ a bunch of PVM jobs when it goes into production. If
there's no solution for my PVM tight integration problem, is there a
way to have SGE auto-clear an error state in a cluster-queue whenever an
error flag is set on a node, or never set the error flag at all? I'm
going to set up a dedicated cluster queue just for this type of PVM job, so
this kind of patchwork would be acceptable as long as it works.
Hope someone has an idea that can point me in the right direction.
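On the auto-clear idea: as a stopgap, a cron job could periodically clear the
E flag on the PVM queue. A minimal sketch, assuming the queue is named pvm.q
(adjust to your setup) and that it parses plain `qstat -f` output:

```shell
#!/bin/sh
# Stopgap sketch: periodically clear error flags on a dedicated PVM queue.
# The queue name "pvm.q" is an assumption; run from cron, e.g. every minute.
QUEUE=${QUEUE:-pvm.q}

# Print queue instances whose states column (6th field) contains the E flag.
# Reads "qstat -f"-style output on stdin.
error_instances() {
    awk 'NF == 6 && $1 ~ /@/ && $6 ~ /E/ { print $1 }'
}

# Clear each flagged instance (guarded so the sketch is harmless without SGE).
if command -v qstat >/dev/null 2>&1; then
    qstat -f -q "$QUEUE" | error_instances | while read -r inst; do
        qmod -cq "$inst"
    done
fi
```

This only papers over whatever puts the queue into the error state, so the
underlying tight-integration problem would still be worth finding.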
P.S. One last observation from my PVM integration setup: even when I
submitted a PVM job requesting, say, 6 processors onto 3 dual-CPU nodes
(e.g. -pe PVM 6), the job only ran on one processor on each node for the
whole job. Therefore, I tried submitting two
jobs w/ 3 processors each (-pe PVM 3) to try to utilize all CPUs, but for
some reason it halved it again to where only one CPU was used on each node
again, but this time each job took 50% of that one processor. I even
hardcoded "sp 2000" into the PVM hostfile built in startpvm.sh, next to the
ep= entry, to indicate dual CPUs. If anyone has any
ideas on this, that would be great as well. Thanks again!
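One note on that last attempt: PVM hostfile options take an equals sign
(sp=2000, not "sp 2000"), and the pvmd documentation describes sp as a
relative speed rating used as a scheduling hint, not as a CPU/slot count. A
sketch of a hostfile with these options (node names and paths are
hypothetical):

```
# Hypothetical PVM hostfile; node names and paths are examples only.
# ep= executable search path, sp= relative speed rating (a hint, not slots)
node01 ep=$HOME/pvm3/bin/$PVM_ARCH sp=2000
node02 ep=$HOME/pvm3/bin/$PVM_ARCH sp=2000
node03 ep=$HOME/pvm3/bin/$PVM_ARCH sp=2000
```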