[GE users] Cluster Queue errors w/ PVM
Stephan Grell - Sun Germany - SSG - Software Engineer
stephan.grell at sun.com
Thu Jan 5 08:57:22 GMT 2006
Could you post the queue error message, together with the relevant
entries from the qmaster messages file and from the messages file of
the execd on which the error was generated?
What you are doing should work. Deleting a job should not put a queue
into an error state, so it sounds to me as if you have found a bug.
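In case it helps, here is a rough sketch of one way to gather that
information. It assumes the default spool layout under
$SGE_ROOT/$SGE_CELL/spool (a site using local spool directories will have
the execd messages file elsewhere) and that your qstat supports the
-explain E option. The function names and the /opt/sge fallback are only
illustrative.

```shell
#!/bin/sh
# Sketch only: locate the qmaster and execd messages files and show the
# reason a queue instance is in error state.
# Assumption: default spool layout under $SGE_ROOT/$SGE_CELL/spool.

SGE_ROOT="${SGE_ROOT:-/opt/sge}"    # hypothetical fallback install path
SGE_CELL="${SGE_CELL:-default}"

# Print the path of the qmaster messages file.
qmaster_messages() {
    printf '%s/%s/spool/qmaster/messages\n' "$SGE_ROOT" "$SGE_CELL"
}

# Print the path of the execd messages file for host $1.
execd_messages() {
    printf '%s/%s/spool/%s/messages\n' "$SGE_ROOT" "$SGE_CELL" "$1"
}

# Print the last $2 lines (default 50) of file $1, if it is readable.
show_tail() {
    if [ -r "$1" ]; then
        tail -n "${2:-50}" "$1"
    else
        echo "cannot read $1" >&2
    fi
}

# Collect everything for the node $1 whose queue went into error state.
collect_sge_logs() {
    if command -v qstat >/dev/null 2>&1; then
        qstat -f -explain E          # prints the reason for each E state
    fi
    show_tail "$(qmaster_messages)"
    show_tail "$(execd_messages "$1")"
}
```

After sourcing the file, something like `collect_sge_logs node01` (with
the affected host name in place of node01) should print the pieces worth
posting.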
Raymond Chan wrote:
> I've been experimenting w/ tight integration of PVM by following
> Reuti's HowTo. My main concern was using qdel to kill PVM jobs, and
> seeing the processes killed on all nodes. This seems to work cleanly
> sometimes, but other times one of the nodes in the cluster queue I
> submit the job to is put into an error state. I then have to either go
> into qmon to clear the error flag or do a qmod -cq <queue_name>.
> I thought I followed the tight integration instructions correctly, as
> it does work sometimes, but I need it to work 100% of the time, as
> this system will be hit hard w/ a bunch of PVM jobs when it goes into
> production. If there's no solution for my PVM tight integration
> problem, is there a way to have SGE auto-clear an error state in a
> cluster queue whenever an error flag is set on a node, or never set
> the error flag at all? I'm primarily going to set up a specific
> cluster queue for this type of PVM job, so this kind of patchwork
> would be acceptable as long as it works.
> Hope someone has an idea that can point me in the right direction.
> Thank you,
> Ray C.
> P.S. One last observation about my PVM integration setup: even when I
> submitted a PVM job w/ say 6 processors onto 3 dual-CPU nodes (e.g.
> -pe PVM 6), the job only used one processor on each node. Therefore, I
> tried submitting two jobs w/ 3 processors each (-pe PVM 3) to try to
> utilize all CPUs, but for some reason it halved again: only one CPU
> was used on each node, and this time each job took 50% of that one
> processor. I even hardcoded a "sp 2000" to indicate dual CPUs for a
> PVM hostfile in the startpvm.sh, next to the -ep flag in the script.
> If anyone has any ideas on this, that would be great as well.
> Thanks again!
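On the auto-clear question: purely as a stopgap until the cause is
found, the manual qmod -cq step could be scripted along these lines.
This is a sketch, not a tested recipe. It assumes that qstat accepts
-qs E to list only queue instances in error state, and that the
queue@host name is the first column and the state flags the last column
of the qstat -f output; please check both against your installation. The
function names are made up for illustration.

```shell
#!/bin/sh
# Stopgap sketch: clear queue instances that are in error state.

# Read `qstat -f` style output on stdin and print the queue instances
# (first column, of the form queue@host) whose last column contains the
# E state flag. The column layout is an assumption about the output.
error_queue_instances() {
    awk '$1 ~ /@/ && NF >= 6 && $NF ~ /E/ { print $1 }'
}

# Clear the error state of every affected instance of cluster queue $1.
clear_error_queues() {
    qstat -f -q "$1" -qs E | error_queue_instances | while read -r qi; do
        echo "clearing error state on $qi"
        qmod -cq "$qi"
    done
}
```

Calling `clear_error_queues <queue_name>` from cron would then take the
place of the manual step. Note that this hides the error rather than
fixing it, so the messages files are still worth looking at.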