[GE users] mpich2 tight integration not working

reuti reuti at staff.uni-marburg.de
Thu Dec 4 19:10:47 GMT 2008


Hi,

On 04.12.2008 at 19:49, Patterson, Ron (NIH/NLM/NCBI) [C] wrote:

> A bit of background: I had mpich2 (with tight integration) working on a
> previous SGE cluster (6.1.x). That cluster was using very fast, clustered
> NAS storage (Panasas) as its spooling dir for the master (no NFS), with
> local spooling for all of the sge_execd hosts.
>
> I have a new 6.2 cluster which has the master spooling on a NetApp NFS
> (v3) share and the sge_execd hosts using local spooling. I followed
> Reuti's latest posts about mpd-based mpich2 on SGE 6.2 and I'm having
> problems. I verified that mpich2 itself is working (I can manually
> mpdboot a ring of hosts and run test jobs on it). Below are the errors
> I see from the test jobs as well as the master task's exec host
> messages file. It looks like the master task's host (in this case
> sge079) isn't "expecting" the task. I tried adding some delays in the
> startmpich2.sh script, but it didn't help. I'm using mpich2 v. 1.0.8.
>
> I'm wondering if this could be an NFS caching issue on the exec server
> side, or if this is some other known issue. Thanks for any help.
>
> Ron
>
>
> $ cat mpich2_mpd.sh.po1413882
> -catch_rsh /var/sge/ncbi/spool/sge079/active_jobs/1413882.1/pe_hostfile
> /netmnt/sge62/mpich2
> sge079:1
> sge092:1
> sge005:1
> sge004:1
> sge007:1
> sge006:1
> sge028:1
> sge011:1
> startmpich2.sh: check for local mpd daemon (1 of 10)
> /netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge079
> /netmnt/sge62/mpich2/bin/mpd
> error: executing task of job 1413882 failed: execution daemon on host
> "sge079" didn't accept task

Did you set "job_is_first_task FALSE" in the PE?
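
For comparison, a PE for the mpd startup method would look roughly like
this (the PE name and script locations are placeholders here, the
important entries are the last four):

   pe_name            mpich2_mpd
   slots              999
   user_lists         NONE
   xuser_lists        NONE
   start_proc_args    /path/to/startmpich2.sh -catch_rsh $pe_hostfile /netmnt/sge62/mpich2
   stop_proc_args     /path/to/stopmpich2.sh -catch_rsh /netmnt/sge62/mpich2
   allocation_rule    $round_robin
   control_slaves     TRUE
   job_is_first_task  FALSE
   urgency_slots      min

With "job_is_first_task TRUE" the job script itself counts as the first
task on the master node, so the execd there has no slot left for the
additional qrsh -inherit that should start the local mpd - which would
match both the "didn't accept task" error above and the "no free queue"
entry in the messages file below. You can check the current setting
with "qconf -sp <pe_name>".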

> startmpich2.sh: check for local mpd daemon (2 of 10)
> startmpich2.sh: check for local mpd daemon (3 of 10)
> startmpich2.sh: check for local mpd daemon (4 of 10)
> startmpich2.sh: check for local mpd daemon (5 of 10)
> startmpich2.sh: check for local mpd daemon (6 of 10)
> startmpich2.sh: check for local mpd daemon (7 of 10)
> startmpich2.sh: check for local mpd daemon (8 of 10)
> startmpich2.sh: check for local mpd daemon (9 of 10)
> startmpich2.sh: check for local mpd daemon (10 of 10)

Of course, it should stop here when the master mpd isn't running,
instead of going on to start the ring. I'll check it.
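
Something along these lines in the wait loop would do it (just a sketch
with placeholder names, not the shipped startmpich2.sh; $MPD_BIN stands
for the MPICH2 directory the script was given):

   # Start the local mpd in the background via qrsh -inherit and remember
   # the PID of the qrsh, so the loop can tell "not up yet" apart from
   # "startup already failed".
   $SGE_ROOT/bin/$ARC/qrsh -inherit -V `hostname` $MPD_BIN/bin/mpd &
   qrsh_pid=$!

   i=1
   while [ $i -le 10 ]; do
       # If the qrsh itself has exited ("didn't accept task"), the mpd
       # will never appear - abort at once instead of polling 10 times.
       if ! kill -0 $qrsh_pid 2>/dev/null; then
           echo "startmpich2.sh: local mpd startup failed" >&2
           exit 1
       fi
       # mpdtrace succeeds as soon as the local mpd is answering.
       $MPD_BIN/bin/mpdtrace >/dev/null 2>&1 && break
       echo "startmpich2.sh: check for local mpd daemon ($i of 10)"
       sleep 1
       i=`expr $i + 1`
   done

A non-zero exit of start_proc_args makes the failure visible to SGE
right away, instead of the job hanging through the full round of retries.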

-- Reuti

> startmpich2.sh: check for mpd daemons (1 of 10)
> /netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge004
> /netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
> /netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge007
> /netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
> /netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge092
> /netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
> /netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge005
> /netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
> /netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge028
> /netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
> /netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge006
> /netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
> /netmnt/sge62/bin/lx24-amd64/qrsh -inherit -V sge011
> /netmnt/sge62/mpich2/bin/mpd -h sge079 -p 1413882 -n
> startmpich2.sh: got all 8 of 8 nodes
> -catch_rsh /netmnt/sge62/mpich2
> mpdallexit: cannot connect to local mpd
> (/tmp/mpd2.console_patterso_sge_1413882.); possible causes:
>   1. no mpd is running on this host
>   2. an mpd is running but was started without a "console" (-n option)
> In case 1, you can start an mpd on this host with:
>     mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MPICH2 Installation Guide.
> sge005_53033: conn error in connect_lhs: Connection refused
> sge005_53033 (connect_lhs 900): failed to connect to lhs at sge079
> 1413882
> sge005_53033 (enter_ring 855): lhs connect failed
> sge005_53033 (run 252): failed to enter ring
> sge028_39127: conn error in connect_lhs: Connection refused
> sge028_39127 (connect_lhs 900): failed to connect to lhs at sge079
> 1413882
> sge028_39127 (enter_ring 855): lhs connect failed
> sge028_39127 (run 252): failed to enter ring
> sge004_53802: conn error in connect_lhs: Connection refused
> sge004_53802 (connect_lhs 900): failed to connect to lhs at sge079
> 1413882
> sge004_53802 (enter_ring 855): lhs connect failed
> sge004_53802 (run 252): failed to enter ring
> sge006_48625: conn error in connect_lhs: Connection refused
> sge006_48625 (connect_lhs 900): failed to connect to lhs at sge079
> 1413882
> sge006_48625 (enter_ring 855): lhs connect failed
> sge006_48625 (run 252): failed to enter ring
> sge007_44937: conn error in connect_lhs: Connection refused
> sge007_44937 (connect_lhs 900): failed to connect to lhs at sge079
> 1413882
> sge007_44937 (enter_ring 855): lhs connect failed
> sge007_44937 (run 252): failed to enter ring
> sge092_38739: conn error in connect_lhs: Connection refused
> sge092_38739 (connect_lhs 900): failed to connect to lhs at sge079
> 1413882
> sge092_38739 (enter_ring 855): lhs connect failed
> sge092_38739 (run 252): failed to enter ring
> sge011_39350: conn error in connect_lhs: Connection refused
> sge011_39350 (connect_lhs 900): failed to connect to lhs at sge079
> 1413882
> sge011_39350 (enter_ring 855): lhs connect failed
> sge011_39350 (run 252): failed to enter ring
> patterso@cfengine1:~/mpi/mpich2>
>
> And in the master exec host's messages file:
>
> 12/04/2008 13:32:05|  main|sge079|E|no free queue for job 1413882 of
> user patterso@sge079.be-md.ncbi.nlm.nih.gov (localhost =
> sge079.be-md.ncbi.nlm.nih.gov)
>
> -----------------------------------
> Ron Patterson
> UNIX Systems Administrator
> NCBI/NLM/NIH contractor
> 301.435.5956
>
