[GE users] MPICH process groups

Jeroen M. Kleijer jeroen.m.kleijer at philips.com
Tue Dec 13 08:21:11 GMT 2005


OK, I somehow got things working yesterday evening, and now all processes 
on all nodes are killed when a qdel command is issued. (The main problem is 
that I'm not quite sure what it was that I did...)
To clarify: Msc.Marc is still driven by the basic _huge_ "run_marc" script, 
which runs from top to bottom with no functions or procedures defined, 
making it extremely hard to debug. I've noticed at least four different 
programming styles when it comes to indentation and 'if .. then .. else' 
structures, which makes it even harder. What it basically does is parse the 
command line and create a second script, which then gets submitted through 
qsub. That second script contains the mpirun line.
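The two-stage pattern could be sketched like this (a minimal sketch only; 
the PE name, paths and marc options are illustrative, not the real 
run_marc contents):

```shell
#!/bin/sh
# Hedged sketch of the two-stage submission described above: run_marc
# builds a second script containing the mpirun line, then hands that
# script to qsub. All names/options here are illustrative assumptions.
jobscript=$(mktemp)
cat > "$jobscript" <<'EOF'
#!/bin/sh
mpirun -np 2 -machinefile "$TMPDIR/machines" marc -jid job1 -nprocd 2
EOF
chmod +x "$jobscript"
# qsub -pe mpich 2 "$jobscript"   # the actual submission step
cat "$jobscript"
```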

I thought it was using the p4 device instead of shared memory because I 
only used one CPU on each node.
Here are the process trees:

master node
 2317     0     0 root         1 [lockd]
 2364 12707  5555 sge       2364 /home/sge/bin/lx24-amd64/sge_execd
17364 12707  5555 sge      17364  \_ sge_shepherd-4335 -bg
17389 12628  5555 nly00281 17389      \_ /bin/sh 
/home/sge/default/spool/nlcftcs14/job_scripts/4335
17402 12628  5555 nly00281 17389          \_ /bin/sh 
/cadappl/marc/2005r3b.sge/marc2005r3/mpich/bin/mpirun -np 2 -machinefile 
/volumes/scratch/4335.1.batch.q/machines 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc -jid 
simplenewnonlinassstrain_job1 -dirjid /home/nly00281/tests/marc/2cpu/tmp/. 
-nprocd 2 -maxnum 1000000 -itree 0 -nthread 0 -dirjob 
/home/nly00281/tests/marc/2cpu/tmp -mhost 
/home/nly00281/tests/marc/2cpu/tmp/hostfile -ci yes -cr yes -dirscr 
/volumes/scratch
17496 12628  5555 nly00281 17389              \_ 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc -jid 
simplenewnonlinassstrain_job1 -dirjid /home/nly00281/tests/marc/2cpu/tmp/. 
-nprocd 2 -maxnum 1000000 -itree 0 -nthread 0 -dirjob 
/home/nly00281/tests/marc/2cpu/tmp -mhost 
/home/nly00281/tests/marc/2cpu/tmp/hostfile -ci yes -cr yes -dirscr 
/volumes/scratch -p4pg /home/nly00281/tests/marc/2cpu/tmp/PI17402 -p4wd 
/home/nly00281/tests/marc/2cpu/tmp
17497 12628  5555 nly00281 17389                  \_ 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc -jid 
simplenewnonlinassstrain_job1 -dirjid /home/nly00281/tests/marc/2cpu/tmp/. 
-nprocd 2 -maxnum 1000000 -itree 0 -nthread 0 -dirjob 
/home/nly00281/tests/marc/2cpu/tmp -mhost 
/home/nly00281/tests/marc/2cpu/tmp/hostfile -ci yes -cr yes -dirscr 
/volumes/scratch -p4pg /home/nly00281/tests/marc/2cpu/tmp/PI17402 -p4wd 
/home/nly00281/tests/marc/2cpu/tmp
17498 12628  5555 nly00281 17389                  \_ 
/home/sge/bin/lx24-amd64/qrsh -inherit -nostdin nlcftcs13 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc nlcftcs14 37117 \-p4amslave 
\-p4yourname nlcftcs13 \-p4rmrank 1
17506 12628  5555 nly00281 17389                      \_ 
/home/sge/utilbin/lx24-amd64/rsh -n -p 36756 nlcftcs13 exec 
'/home/sge/utilbin/lx24-amd64/qrsh_starter' 
'/home/sge/default/spool/nlcftcs13/active_jobs/4335.1/1.nlcftcs13'

slave node
 2353     0     0 root         1 [lockd]
 2400 12707  5555 sge       2400 /home/sge/bin/lx24-amd64/sge_execd
32419 12707  5555 sge      32419  \_ sge_shepherd-4335 -bg
32420     0     0 root     32420      \_ /home/sge/utilbin/lx24-amd64/rshd 
-l
32423 12628  5555 nly00281 32423          \_ 
/home/sge/utilbin/lx24-amd64/qrsh_starter 
/home/sge/default/spool/nlcftcs13/active_jobs/4335.1/1.nlcftcs13
32424 12628  5555 nly00281 32424              \_ ksh -c 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc nlcftcs14 37117 \-p4amslave 
\-p4yourname nlcftcs13 \-p4rmrank 1
32425 12628  5555 nly00281 32424                  \_ 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc nlcftcs14 37117 -p4amslave 
-p4yourname nlcftcs13 -p4rmrank 1
32426 12628  5555 nly00281 32424                      \_ 
/cadappl/marc/2005r3b.sge/marc2005r3/bin/marc nlcftcs14 37117 -p4amslave 
-p4yourname nlcftcs13 -p4rmrank 1

As far as I can tell, MPICH only understands the machinefile format; if 
you feed it a file in hostfile format, it breaks.
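For reference, a minimal sketch of the two formats (hostnames made up; the 
SGE columns are as I understand sge_pe(5), so check against your version):

```
# MPICH p4 machinefile: host[:ncpus], one per line
nodea
nodeb:2

# SGE $pe_hostfile: host  slots  queue  processor-range
nodea 1 batch.q@nodea UNDEFINED
nodeb 1 batch.q@nodeb UNDEFINED
```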
Does anybody have an idea how to either prevent these leftover semaphores 
or clean them up automagically after a run?
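Something like the following might do as a per-node cleanup (a sketch 
only, assuming GNU/Linux ipcs/ipcrm from util-linux; MPICH's cleanipcs 
does essentially this, which is why it can also hit other jobs of the 
same user on the node):

```shell
#!/bin/sh
# Sketch: remove leftover SysV semaphores owned by the current user,
# e.g. from an SGE epilog. Assumes util-linux ipcs/ipcrm on Linux.
me=$(id -un)
# In `ipcs -s` output, column 2 is the semaphore id, column 3 the owner.
for semid in $(ipcs -s | awk -v u="$me" '$3 == u { print $2 }'); do
    ipcrm -s "$semid"
done
```

The caveat Reuti mentioned applies: this keys on the owner, not on the 
job, so two concurrent jobs of one user on the same node would step on 
each other.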


Met vriendelijke groeten / Kind regards

Jeroen Kleijer
Unix Systeembeheer
Philips Applied Technologies


John Hearns <john.hearns at streamline-computing.com> 
2005-12-12 07:18 PM
Please respond to: users at gridengine.sunsource.net
To: users at gridengine.sunsource.net
Subject: Re: [GE users] MPICH process groups

On Mon, 2005-12-12 at 18:50 +0100, Reuti wrote:

> 
> I forgot: there is the procedure "cleanipcs" in MPICH (but this might 
> kill another process of the same user on the same node), but they 
> state "mpiclean" for Msc.Marc - which platform are you on, and is it 
> really MPICH?
> 
This 'mpiclean' is likely a rename of 'cleanipcs'.
I agree with Reuti, though - send us the output of ps -efj --forest.


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
