[GE users] MPICH process groups

Jeroen M. Kleijer jeroen.m.kleijer at philips.com
Tue Dec 13 08:31:22 GMT 2005


Ok, it seems I've spoken too soon again. 
MPICH's mpirun does accept the hostfile format "<node>:<cpus>" as 
well as the machinefile format "<node> <node> <node>". (I had made a typo 
in the function PeHostFile2HostFile which caused it to omit the ":", so 
MPICH then treated the number of cpus as an extra host.) 
But the hanging semaphores remain, which means a user still has to run 
mpiclean on every node every time he issues a qdel command.
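For reference, a minimal sketch of what such a conversion function could 
look like, assuming the usual SGE $pe_hostfile layout of one 
"host slots queue processors" line per node (the actual function in our 
setup is not shown here):

```shell
# Sketch of a PeHostFile2HostFile-style conversion: turn an SGE
# $pe_hostfile (one "host slots queue processors" line per node) into
# the "<node>:<cpus>" machinefile format that MPICH's mpirun expects.
# Note the ":" between host and slot count -- omitting it makes MPICH
# parse the slot count as an extra host name.
PeHostFile2HostFile()
{
    # $1: path to the SGE pe_hostfile
    awk '{ print $1 ":" $2 }' "$1"
}
```

Called from the PE start script as, e.g., 
`PeHostFile2HostFile "$PE_HOSTFILE" > "$TMPDIR/machines"`.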

Met vriendelijke groeten / Kind regards

Jeroen Kleijer
Unix Systeembeheer
Philips Applied Technologies


"Jeroen M. Kleijer" <jeroen.m.kleijer+FromInternet at philips.com> 
2005-12-13 09:21 AM
Please respond to
users at gridengine.sunsource.net


To
users at gridengine.sunsource.net
cc

Subject
Re: [GE users] MPICH process groups
Classification


Ok, I somehow got things working yesterday evening and now all processes 
on all nodes get killed when a qdel command is issued. (The main problem 
is that I'm not quite sure what it was that I did...) 
To clarify, MSC.Marc is still driven by one basic _huge_ "run_marc" script 
that runs from top to bottom with no functions or procedures defined, 
making it extremely hard to debug. I've noticed at least four different 
programming styles when it comes to indentation and 'if .. then .. else' 
structures, which makes it even harder. What it basically does is parse 
the input line and create a second script, which then gets submitted 
through qsub. This second script contains the mpirun line. 
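To illustrate (this is not the actual run_marc output): a hypothetical, 
heavily reduced sketch of the kind of second script it generates. Only 
the MARC_DIR path comes from the process trees below; the option set and 
file names are assumptions.

```shell
# Hypothetical sketch of how run_marc could generate the second script
# that gets submitted via qsub. The generated script just runs mpirun
# against a machinefile; all options beyond -np/-nprocd are omitted.
generate_job_script()
{
    # $1 = output path for the generated script, $2 = number of processes
    cat > "$1" <<EOF
#!/bin/sh
#\$ -cwd
MARC_DIR=/cadappl/marc/2005r3b.sge/marc2005r3
\$MARC_DIR/mpich/bin/mpirun -np $2 \\
    -machinefile \$TMPDIR/machines \\
    \$MARC_DIR/bin/marc -nprocd $2
EOF
}
```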

I thought it was using the p4 device instead of shared memory because I 
only used 1 cpu on each node. 
Here are the process trees: 

master node 
 2317     0     0 root         1 [lockd] 
 2364 12707  5555 sge       2364 /home/sge/bin/lx24-amd64/sge_execd 
17364 12707  5555 sge      17364  \_ sge_shepherd-4335 -bg 
17389 12628  5555 nly00281 17389      \_ /bin/sh 
/home/sge/default/spool/nlcftcs14/job_scripts/4335 
17402 12628  5555 nly00281 17389          \_ /bin/sh 
/cadappl/marc/2005r3b.sge/marc2005r3/mpich/bin/mpirun -np 2 -machinefile 
/volumes/scratch/4335.1.batch.q/machines 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc -jid 
simplenewnonlinassstrain_job1 -dirjid /home/nly00281/tests/marc/2cpu/tmp/. 
-nprocd 2 -maxnum 1000000 -itree 0 -nthread 0 -dirjob 
/home/nly00281/tests/marc/2cpu/tmp -mhost 
/home/nly00281/tests/marc/2cpu/tmp/hostfile -ci yes -cr yes -dirscr 
/volumes/scratch 
17496 12628  5555 nly00281 17389              \_ 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc -jid 
simplenewnonlinassstrain_job1 -dirjid /home/nly00281/tests/marc/2cpu/tmp/. 
-nprocd 2 -maxnum 1000000 -itree 0 -nthread 0 -dirjob 
/home/nly00281/tests/marc/2cpu/tmp -mhost 
/home/nly00281/tests/marc/2cpu/tmp/hostfile -ci yes -cr yes -dirscr 
/volumes/scratch -p4pg /home/nly00281/tests/marc/2cpu/tmp/PI17402 -p4wd 
/home/nly00281/tests/marc/2cpu/tmp 
17497 12628  5555 nly00281 17389                  \_ 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc -jid 
simplenewnonlinassstrain_job1 -dirjid /home/nly00281/tests/marc/2cpu/tmp/. 
-nprocd 2 -maxnum 1000000 -itree 0 -nthread 0 -dirjob 
/home/nly00281/tests/marc/2cpu/tmp -mhost 
/home/nly00281/tests/marc/2cpu/tmp/hostfile -ci yes -cr yes -dirscr 
/volumes/scratch -p4pg /home/nly00281/tests/marc/2cpu/tmp/PI17402 -p4wd 
/home/nly00281/tests/marc/2cpu/tmp 
17498 12628  5555 nly00281 17389                  \_ 
/home/sge/bin/lx24-amd64/qrsh -inherit -nostdin nlcftcs13 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc nlcftcs14 37117 \-p4amslave 
\-p4yourname nlcftcs13 \-p4rmrank 1 
17506 12628  5555 nly00281 17389                      \_ 
/home/sge/utilbin/lx24-amd64/rsh -n -p 36756 nlcftcs13 exec 
'/home/sge/utilbin/lx24-amd64/qrsh_starter' 
'/home/sge/default/spool/nlcftcs13/active_jobs/4335.1/1.nlcftcs13' 

slave node 
 2353     0     0 root         1 [lockd] 
 2400 12707  5555 sge       2400 /home/sge/bin/lx24-amd64/sge_execd 
32419 12707  5555 sge      32419  \_ sge_shepherd-4335 -bg 
32420     0     0 root     32420      \_ /home/sge/utilbin/lx24-amd64/rshd 
-l 
32423 12628  5555 nly00281 32423          \_ 
/home/sge/utilbin/lx24-amd64/qrsh_starter 
/home/sge/default/spool/nlcftcs13/active_jobs/4335.1/1.nlcftcs13 
32424 12628  5555 nly00281 32424              \_ ksh -c 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc nlcftcs14 37117 \-p4amslave 
\-p4yourname nlcftcs13 \-p4rmrank 1 
32425 12628  5555 nly00281 32424                  \_ 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc nlcftcs14 37117 -p4amslave 
-p4yourname nlcftcs13 -p4rmrank 1 
32426 12628  5555 nly00281 32424                      \_ 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc nlcftcs14 37117 -p4amslave 
-p4yourname nlcftcs13 -p4rmrank 1 

As far as I can tell MPICH only uses the machinefile format, and if you 
feed it the hostfile format it breaks. 
Does anybody have an idea how to either prevent these open semaphores or 
clean them up after a run automagically? 


Met vriendelijke groeten / Kind regards

Jeroen Kleijer
Unix Systeembeheer
Philips Applied Technologies 







John Hearns <john.hearns at streamline-computing.com> 
2005-12-12 07:18 PM 

Please respond to
users at gridengine.sunsource.net


To
users at gridengine.sunsource.net 
cc

Subject
Re: [GE users] MPICH process groups 
Classification


On Mon, 2005-12-12 at 18:50 +0100, Reuti wrote:

> 
> I forgot: there is the procedure "cleanipcs" in MPICH (but this might 
> kill another process of the same user on the same node), but they 
> state "mpiclean" for Msc.Marc - which platform are you on, and is it 
> really MPICH?
> 
This 'mpiclean' is likely to be a rename of 'cleanipcs'.
I agree with Reuti though - send us a ps -efj --forest output


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
