[GE users] Loose and Tight Integration Execution Times

Reuti reuti at staff.uni-marburg.de
Mon Jun 23 17:39:21 BST 2008


Am 23.06.2008 um 14:38 schrieb Azhar Ali Shah:

> <snip>
> Though in this case the tight integration job runs on a uni-processor
> node as MASTER and the loose integration job runs on a dual-processor
> node as MASTER, the results are the same for the converse as well.
> Can't figure out why?
>
> regards
> Azhar
>
>
> From: Reuti <reuti at staff.uni-marburg.de>
> Subject: Re: [GE users] Loose and Tight Integration Execution Times
> To: users at gridengine.sunsource.net
> Date: Monday, June 23, 2008, 1:22 PM
>
> Am 23.06.2008 um 13:23 schrieb Azhar Ali Shah:
>
> > --- On Mon, 6/23/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> > > What does your "mpirun" command look like? But everything seems to
> > > be in order in your outputs.
> >
> > The mpiexec command for the SMPD daemon-based tight integration can
> > be seen in this process tree:
> > PID PPID PGRP COMMAND
> > 14292 1 14292 /usr/SGE6/bin/lx24-x86/sge_execd
> > 30942 14292 30942 \_ sge_shepherd-138 -bg
> > 31127 30942 31127 | \_ bash /usr/SGE6/default/spool/comp6/job_scripts/138
> > 31132 31127 31127 | \_ mpiexec -n 9 -machinefile /tmp/138.1.all.q/machines -port 20138 /home/aas/par_procks
> > 31084 14292 31084 \_ sge_shepherd-138 -bg
> > 31086 31084 31086 \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
> > 31095 31086 31095 \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/default/spool/aragorn/active_jobs/
> > 31098 31095 31098 \_ /home/aas/local/mpich2_smpd/bin/smpd -port 20138 -d 0
> > 31133 31098 31098 \_ /home/aas/local/mpich2_smpd/bin/smpd -port 20138 -d 0
> > 31134 31133 31098 \_ /home/aas/par_procksi_Alone
> > 307 31134 31098 \_ ./fast /home/aas/workspace/AzharPerlProject/SS4-S11-250/1AOEB-1.pdb
> > 30932 1 30931 smpd -s
> > 31008 1 30947 /usr/SGE6/bin/lx24-x86/qrsh -inherit comp6 /home/aas/local/mpich2_smpd/bin/smpd -port 20138 -d
> > 31094 31008 30947 \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 32946 comp6 exec '/usr/SGE6/utilbin/lx24-x
> > 31096 31094 30947 \_ [rsh] <defunct>
> > 31010 1 30947 /usr/SGE6/bin/lx24-x86/qrsh -inherit comp4 /home/aas/local/mpich2_smpd/bin/smpd -port 20138 -d
> > 31089 31010 30947 \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 34572 comp4 exec '/usr/SGE6/utilbin/lx24-x
> > 31091 31089 30947 \_ [rsh] <defunct>
> > 31012 1 30947 /usr/SGE6/bin/lx24-x86/qrsh -inherit comp3 /home/aas/local/mpich2_smpd/bin/smpd -port 20138 -d 0
> > 31087 31012 30947 \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 34267 comp3 exec '/usr/SGE6/utilbin/lx24-x86/
> > 31088 31087 30947 \_ [rsh] <defunct>
> > 31014 1 30947 /usr/SGE6/bin/lx24-x86/qrsh -inherit comp1 /home/aas/local/mpich2_smpd/bin/smpd -port 20138 -d
> > 31083 31014 30947 \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 53966 comp1 exec '/usr/SGE6/utilbin/lx24-x
> > 31085 31083 30947 \_ [rsh] <defunct>
> > 31021 1 30947 /usr/SGE6/bin/lx24-x86/qrsh -inherit comp2 /home/aas/local/mpich2_smpd/bin/smpd -port 20138 -d 0
> > 31092 31021 30947 \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 35644 comp2 exec '/usr/SGE6/utilbin/lx24-x86
> > 31093 31092 30947 \_ [rsh] <defunct>
> > 31034 1 30947 /usr/SGE6/bin/lx24-x86/qrsh -inherit comp5 /home/aas/local/mpich2_smpd/bin/smpd -port 20138 -d
> > 31097 31034 30947 \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 46549 comp5 exec '/usr/SGE6/utilbin/lx24-x
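
(Side note: a process listing like the one above, with the PID, PPID,
PGRP and COMMAND columns and the "\_" tree markers, can be produced with
something along the lines of

    ps -e --forest -o pid,ppid,pgrp,args

on the execution host; the exact invocation wasn't given in the original
mail, so treat this as an assumption.)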

The startup seems to be okay. Did you define any nice values, i.e.
priority, in your queue definition? The difference is too big to be a
result of SGE starting the daemons. Is the load on all slave nodes at
100%?
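
A quick way to check both, assuming the queue is named all.q as the
machinefile path in the listing suggests:

    qconf -sq all.q | grep priority   # nice value applied to the jobs
    qhost                             # current load reported per host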

In any case: your cluster is too heterogeneous (some nodes have double
the performance of others) to get a clear path to an explanation. Do you
have a smaller test case that you can run on just one node with 2 slots
(and limit it to always the same node in both cases)? Don't expect a
parallel application to be 100% parallel. There might be cases where the
speed of the parallel job's master node alone determines the overall
timing.
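
Such a constrained run could be submitted along these lines, where
mpich2-smpd is only a placeholder for the name of your parallel
environment and testcase.sh for your job script:

    qsub -pe mpich2-smpd 2 -l hostname=comp6 testcase.sh

This way both the tight and the loose variant always get 2 slots on the
same node.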

You uploaded the timings for the daemonless startup. Can you please also
post the timing sheet for the loose integration, with the used CPU time
for user and system?
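
If accounting is enabled, something like

    qacct -j 138 | egrep 'ru_wallclock|ru_utime|ru_stime'

(138 being the job id from your listing) should give the wallclock, user
and system CPU times for both setups.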

-- Reuti

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



