[GE users] Yet another MPICH tight-integration problem

David S. dgs at gs.washington.edu
Thu Sep 9 00:46:26 BST 2004


(This is mostly for the mailing list archives, for anybody else
who may come upon this problem.)

I think I've fixed this by modifying the
'$SGE_ROOT/mpi/startmpi.sh' script to write fully
qualified domain names (FQDNs) into the '$TMPDIR/machines'
file.  Why 'qrsh', invoked through the '$SGE_ROOT/mpi/rsh'
wrapper, should need FQDNs in this situation is a
question for someone who knows the program better than
I do.
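For anyone reproducing the fix: the change amounts to replacing each
short host name in the machines file with its FQDN. The sketch below
shows that transformation in Python; it is not the actual edit to
'startmpi.sh' (which is a shell script). The helper names
'qualify_hosts' and 'rewrite_machines_file' are hypothetical, and the
optional 'host:slots' line form is assumed from the MPICH machines-file
convention. Name resolution uses Python's 'socket.getfqdn', so results
depend on your resolver (DNS or /etc/hosts) being configured to return
the qualified names.

```python
import os
import socket
from typing import Callable, Iterable, List


def qualify_hosts(lines: Iterable[str],
                  resolve: Callable[[str], str] = socket.getfqdn) -> List[str]:
    """Return machines-file lines with each host replaced by its FQDN.

    Accepts plain 'host' lines and (assumed) 'host:slots' lines;
    blank lines are dropped. 'resolve' is injectable for testing.
    """
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split off an optional ':slots' suffix and keep it unchanged.
        host, sep, rest = line.partition(":")
        out.append(resolve(host) + sep + rest)
    return out


def rewrite_machines_file(path: str,
                          resolve: Callable[[str], str] = socket.getfqdn) -> None:
    """Rewrite the machines file at 'path' in place with FQDNs (hypothetical helper)."""
    with open(path) as f:
        qualified = qualify_hosts(f, resolve)
    with open(path, "w") as f:
        f.write("".join(entry + "\n" for entry in qualified))


if __name__ == "__main__":
    # Inside an SGE parallel job, the machines file lives in $TMPDIR.
    machines = os.path.join(os.environ.get("TMPDIR", "."), "machines")
    if os.path.exists(machines):
        rewrite_machines_file(machines)
```

Injecting the resolver keeps the transformation testable without a
real cluster; in a job it would run after 'startmpi.sh' has written
the file and before 'mpirun' reads it.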

David S.

> >
> > Changing the value of 'job_is_first_task' makes no difference.
> > In either case, the grid engine appears to start the master
> > process and one slave process on a node, walks through
> > '$TMPDIR/machines' starting slaves on the nodes listed there,
> > then tries and fails to start a second slave on the node
> > running the master.  At that point the job aborts.  All that's
> > in the 'messages' file in the spool directory of the master's
> > node is a message like
> >
> > 	09/07/2004 20:48:25|execd|eee006|E|no free queue for job 75 of user dgs at eee008.grid.gs.washington.edu (localhost = eee006.grid.gs.washington.edu)
> >
