[GE users] MPICH2 tight integration

reuti reuti at staff.uni-marburg.de
Thu Aug 13 16:17:01 BST 2009


On 13.08.2009 at 16:22, skylar2 wrote:

> reuti wrote:
>> Hi,
>>
>> On 13.08.2009 at 01:44, skylar2 wrote:
>>
>>> Has anyone gotten the MPICH2 tight integration working? I'm following
>>> the stuff in the wiki:
>>>
>>> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>>>
>>> I installed the mpich2_mpd package in $SGE_ROOT and set up my parallel
>>> environment like so:
>>>
>>> pe_name           mpich2_mpd
>>> slots             999
>>> user_lists        NONE
>>> xuser_lists       NONE
>>> start_proc_args   /net/gs/vol3/software/sge/mpich2_mpd/startmpich2.sh \
>>>                   -catch_rsh $pe_hostfile \
>>>                   /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64
>>> stop_proc_args    /net/gs/vol3/software/sge/mpich2_mpd/stopmpich2.sh \
>>>                   -catch_rsh \
>>>                   /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64
>>> allocation_rule   $round_robin
>>> control_slaves    TRUE
>>> job_is_first_task FALSE
>>> urgency_slots     min
>>>
>>> When I try to run a job it fails during the mpd boot up:
>>>
>>> -catch_rsh
>>> /net/gs/vol3/software/sge/sage/spool/sage011/active_jobs/828151.1/pe_hostfile
>>> /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64
>>
>> Was your first test already with such a high number of nodes? Does it
>> work with only 4/8/16... slots on 1/2/4 nodes?
>>
>> a) Are all nodes the same, or could some of them have a firewall?
>
> They are all the same, and on a private network so no firewall. I also
> verified that I could mpdboot and mpirun outside SGE, and that worked fine.
>
>> b) It may be that the startup sequence just takes too long and you
>> have to increase the "sleep" time in startmpich2.sh. But then the
>> startup would be really long.
>
> I doubled the SLEEPTIME and it still didn't work even with only 24
> slots, so I don't think that was the issue.

What confuses me is that in your last email I saw:

startmpich2.sh: got only 8 of 23 nodes, aborting

It looks like it's working on some nodes. Did you get slots from
different queues, i.e. is the PE bound to more than one queue?
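
One way to check the queue question for a given job is to look at the pe_hostfile that SGE hands to the start script. The following is only a minimal sketch in Python, assuming the usual four-column pe_hostfile layout (host, slots, queue instance written as queue@host, processor range). The function name, the queue names all.q and long.q, and the hosts other than sage011 are made up for illustration and are not taken from the failing job.

```python
from collections import defaultdict


def queues_in_pe_hostfile(text):
    """Map each cluster queue name to the hosts that granted slots from it.

    Assumes each line looks like: <host> <slots> <queue>@<host> <processor range>
    """
    hosts_by_queue = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        host, queue_instance = fields[0], fields[2]
        # "all.q@sage011" -> "all.q"
        queue = queue_instance.split("@", 1)[0]
        hosts_by_queue[queue].append(host)
    return dict(hosts_by_queue)


# Hypothetical sample content, not the real hostfile of job 828151.
sample = """\
sage011 8 all.q@sage011 UNDEFINED
sage012 8 all.q@sage012 UNDEFINED
sage013 8 long.q@sage013 UNDEFINED
"""

result = queues_in_pe_hostfile(sample)
for queue, hosts in sorted(result.items()):
    print(queue, len(hosts), "host(s):", ", ".join(hosts))
```

Inside a running job the file's path is available in $PE_HOSTFILE, so its contents could be read and passed to the function. More than one key in the result would mean the slots were granted from more than one queue.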

-- Reuti

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=212145
