[GE users] MPICH2 tight integration

skylar2 skylar2 at u.washington.edu
Thu Aug 13 15:22:25 BST 2009


reuti wrote:
> Hi,
> 
> On 13.08.2009 at 01:44, skylar2 wrote:
> 
>> Has anyone gotten the MPICH2 tight integration working? I'm following
>> the stuff in the wiki:
>>
>> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>>
>> I installed the mpich2_mpd package in $SGE_ROOT and set up my parallel
>> environment like so:
>>
>> pe_name           mpich2_mpd
>> slots             999
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /net/gs/vol3/software/sge/mpich2_mpd/startmpich2.sh \
>>                   -catch_rsh $pe_hostfile \
>>                   /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64
>> stop_proc_args    /net/gs/vol3/software/sge/mpich2_mpd/stopmpich2.sh \
>>                   -catch_rsh \
>>                   /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64
>> allocation_rule   $round_robin
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min
>>
>> When I try to run a job it fails during the mpd boot up:
>>
>> -catch_rsh
>> /net/gs/vol3/software/sge/sage/spool/sage011/active_jobs/828151.1/pe_hostfile
>> /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64
> 
> Was your first test already with such a high number of nodes? Does it
> work with only 4/8/16... slots on 1/2/4 nodes?
> 
> a) Are all nodes the same, or could some of them have a firewall?

They are all the same, and on a private network so no firewall. I also
verified that I could mpdboot and mpirun outside SGE, and that worked fine.
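
For anyone who wants to repeat that check against exactly the hosts SGE
granted, here is a minimal sketch that summarizes a pe_hostfile. It assumes
the usual four-column layout (host, slots, queue, processor range); the
function name and the sample host and queue names are made up for
illustration and are not taken from my cluster.

```python
from collections import OrderedDict


def summarize_pe_hostfile(text):
    """Return an ordered mapping of host -> total slots from pe_hostfile text.

    Assumes each non-empty line looks like:
        <host> <slots> <queue> <processor range>
    """
    slots = OrderedDict()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, count = fields[0], int(fields[1])
        # A host may appear more than once (one line per queue instance).
        slots[host] = slots.get(host, 0) + count
    return slots


# Hypothetical sample; a real file sits in the job's active_jobs spool directory.
sample = """\
node01 8 all.q@node01 UNDEFINED
node02 8 all.q@node02 UNDEFINED
node03 8 all.q@node03 UNDEFINED
"""

summary = summarize_pe_hostfile(sample)
print(len(summary), "hosts,", sum(summary.values()), "slots")
```

The host names in the result are what one would hand to mpdboot outside SGE
to reproduce the same ring by hand.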

> b) It may be that the startup sequence just takes too long and you
> have to increase the "sleep" time in startmpich2.sh. But then the
> startup would be really long.

I doubled the SLEEPTIME and it still didn't work even with only 24
slots, so I don't think that was the issue.
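
As an aside on the fixed sleep: one alternative is to poll until the ring
reports the expected number of daemons, with a timeout as the upper bound.
The sketch below is only an illustration of that idea, not what
startmpich2.sh does. The function names are hypothetical, and the helper
assumes that `mpdtrace` prints one ring member per line.

```python
import subprocess
import time


def wait_for_ring(count_members, expected, timeout=60.0, interval=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll count_members() until it reports at least `expected` daemons.

    Returns True once the ring is complete, False if `timeout` seconds
    pass first. `sleep` and `clock` are injectable to ease testing.
    """
    deadline = clock() + timeout
    while True:
        if count_members() >= expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def mpdtrace_count():
    """Count ring members via mpdtrace (assumption: one host per line)."""
    try:
        result = subprocess.run(["mpdtrace"], capture_output=True, text=True)
    except FileNotFoundError:
        return 0  # mpdtrace is not on PATH
    if result.returncode != 0:
        return 0  # no ring is running yet
    return len(result.stdout.split())
```

A call such as wait_for_ring(mpdtrace_count, expected=3, timeout=120) would
then return as soon as three daemons have joined, so a longer limit only
costs time when the ring really fails to form.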

> c) What about using a daemonless smpd startup?

I'm partial to MPIs with daemons, but if it works I won't complain. I'll
see if I can get it wired up.

-- 
-- Skylar Thompson (skylar2 at u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S048, (206)-685-7354
-- University of Washington School of Medicine

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=212137

