[GE users] mpich2 tight integration not working

Patterson, Ron (NIH/NLM/NCBI) [C] patterso at ncbi.nlm.nih.gov
Thu Dec 4 18:49:55 GMT 2008


A bit of background: I had mpich2 (with tight integration) working on a
previous SGE cluster (6.1.x). That cluster was using very fast,
clustered NAS storage (Panasas) as its spooling dir for the master (no
NFS), with local spooling for all of the sge_execd hosts.

I have a new 6.2 cluster which has the master spooling on a NetApp NFS
(v3) share and sge_execd hosts using local spooling. I followed Reuti's
latest posts about mpd-based mpich2 on SGE 6.2 and I'm having problems.
I verified that mpich2 itself is working (I can manually mpdboot a ring
of hosts and run test jobs on it). Below are the errors I see from the
test jobs as well as the master task's exec host messages file. It looks
like the master task's host (in this case sge079) isn't "expecting" the
task. I tried adding some delays in the startmpich2.sh script, but it
didn't help. I'm using mpich2 v. 1.0.8.
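
For reference, a minimal sketch of the kind of delay/retry wrapper meant
here. It is illustrative only and not the actual change made to the
start script; the function name retry_with_delay and the example
arguments are hypothetical.

```shell
#!/bin/sh
# retry_with_delay TRIES DELAY CMD...
# Run CMD up to TRIES times, sleeping DELAY seconds after each failed
# attempt except the last. Returns 0 on the first success, 1 otherwise.
# (Hypothetical helper, not part of SGE or mpich2.)
retry_with_delay() {
    tries=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$tries" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $tries failed: $*" >&2
        if [ "$attempt" -lt "$tries" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Hypothetical usage: poll for the local mpd for up to ~10 seconds.
# retry_with_delay 10 1 /netmnt/sge62/mpich2/bin/mpdtrace -l
```

A wrapper like this only helps if the failure is a timing problem; it
does nothing if sge_execd rejects the task outright.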

I'm wondering if this could be an NFS caching issue on the exec server
side, or if this is some other known issue. Thanks for any help.

Ron


$ cat mpich2_mpd.sh.po1413882
-catch_rsh /var/sge/ncbi/spool/sge079/active_jobs/1413882.1/pe_hostfile
/netmnt/sge62/mpich2
sge079:1
sge092:1
sge005:1
sge004:1
sge007:1
sge006:1
sge028:1
sge011:1
startmpich2.sh: check for local mpd daemon (1 of 10)
/netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge079
/netmnt/sge62/mpich2/bin/mpd
error: executing task of job 1413882 failed: execution daemon on host
"sge079" didn't accept task
startmpich2.sh: check for local mpd daemon (2 of 10)
startmpich2.sh: check for local mpd daemon (3 of 10)
startmpich2.sh: check for local mpd daemon (4 of 10)
startmpich2.sh: check for local mpd daemon (5 of 10)
startmpich2.sh: check for local mpd daemon (6 of 10)
startmpich2.sh: check for local mpd daemon (7 of 10)
startmpich2.sh: check for local mpd daemon (8 of 10)
startmpich2.sh: check for local mpd daemon (9 of 10)
startmpich2.sh: check for local mpd daemon (10 of 10)
startmpich2.sh: check for mpd daemons (1 of 10)
/netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge004
/netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
/netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge007
/netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
/netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge092
/netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
/netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge005
/netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
/netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge028
/netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
/netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge006
/netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
/netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge011
/netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
startmpich2.sh: got all 8 of 8 nodes
-catch_rsh /netmnt/sge62/mpich2
mpdallexit: cannot connect to local mpd
(/tmp/mpd2.console_patterso_sge_1413882.); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
sge005_53033: conn error in connect_lhs: Connection refused
sge005_53033 (connect_lhs 900): failed to connect to lhs at sge079
1413882
sge005_53033 (enter_ring 855): lhs connect failed
sge005_53033 (run 252): failed to enter ring
sge028_39127: conn error in connect_lhs: Connection refused
sge028_39127 (connect_lhs 900): failed to connect to lhs at sge079
1413882
sge028_39127 (enter_ring 855): lhs connect failed
sge028_39127 (run 252): failed to enter ring
sge004_53802: conn error in connect_lhs: Connection refused
sge004_53802 (connect_lhs 900): failed to connect to lhs at sge079
1413882
sge004_53802 (enter_ring 855): lhs connect failed
sge004_53802 (run 252): failed to enter ring
sge006_48625: conn error in connect_lhs: Connection refused
sge006_48625 (connect_lhs 900): failed to connect to lhs at sge079
1413882
sge006_48625 (enter_ring 855): lhs connect failed
sge006_48625 (run 252): failed to enter ring
sge007_44937: conn error in connect_lhs: Connection refused
sge007_44937 (connect_lhs 900): failed to connect to lhs at sge079
1413882
sge007_44937 (enter_ring 855): lhs connect failed
sge007_44937 (run 252): failed to enter ring
sge092_38739: conn error in connect_lhs: Connection refused
sge092_38739 (connect_lhs 900): failed to connect to lhs at sge079
1413882
sge092_38739 (enter_ring 855): lhs connect failed
sge092_38739 (run 252): failed to enter ring
sge011_39350: conn error in connect_lhs: Connection refused
sge011_39350 (connect_lhs 900): failed to connect to lhs at sge079
1413882
sge011_39350 (enter_ring 855): lhs connect failed
sge011_39350 (run 252): failed to enter ring
patterso at cfengine1:~/mpi/mpich2>

And in the master exec host's messages file:

12/04/2008 13:32:05|  main|sge079|E|no free queue for job 1413882 of
user patterso at sge079.be-md.ncbi.nlm.nih.gov (localhost =
sge079.be-md.ncbi.nlm.nih.gov)
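
To make output like the above easier to scan, here is a minimal sketch
that pulls the affected hosts out of a saved job output file. It assumes
the message formats are exactly as shown in this post; the function
names failed_ring_hosts and rejected_task_hosts are hypothetical.

```shell
#!/bin/sh
# failed_ring_hosts FILE
# Print each host whose mpd logged "failed to enter ring", one per line.
# Matches lines of the form: sge005_53033 (run 252): failed to enter ring
failed_ring_hosts() {
    sed -n 's/^\([A-Za-z0-9.-]*\)_[0-9][0-9]* (run [0-9][0-9]*): failed to enter ring$/\1/p' "$1" | sort -u
}

# rejected_task_hosts FILE
# Print each host named in a qrsh "didn't accept task" error.
# The quoted host name and the message are on the same line whether or
# not the error text was wrapped, so a single-line match is enough.
rejected_task_hosts() {
    sed -n 's/.*"\([^"]*\)" didn.t accept task.*/\1/p' "$1" | sort -u
}

# Hypothetical usage against the job output file shown earlier:
# failed_ring_hosts mpich2_mpd.sh.po1413882
# rejected_task_hosts mpich2_mpd.sh.po1413882
```

For this job, the first function should list the seven slave hosts and
the second only sge079. That fits the picture that the ring never formed
because the mpd on the master task's host was never started.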

-----------------------------------
Ron Patterson
UNIX Systems Administrator
NCBI/NLM/NIH contractor
301.435.5956

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91197
