[GE users] jobs never die on nodes with mpich

Michel Cuendet michel.cuendet at epfl.ch
Tue Aug 3 19:28:19 BST 2004



Hi everyone,

Sorry for not getting back to you earlier, but we had a major server 
crash here, and I had to run to the store to get a new disk...

Reuti wrote:

>can you provide an output of top, showing which of these processes
>take up CPU time, and also a process tree of a running job on a slave node?
>
Here you are:
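
The listings below are a top snapshot and a ps forest view; something along
the lines of the following should reproduce the same columns, though the
exact flags I used may have differed:

    # BSD long format (F, UID, PID, PPID, ...) plus the ASCII process tree
    ps axfl

    # per-process CPU and memory snapshot
    top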

CRASHED JOB on a slave node:

top:
25409 mitch     22   0  217M 217M  3144 R    99.9 10.7 855:30   1 /home/mitch/QMMM_38/cpmd.x input /home/mitch/PP

ps:
F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY        TIME COMMAND
4   200 26943  2434   5 -10  1856  696 wait4  S<   ?          0:00  \_ sge_shepherd-4911 -bg
4     0 26944 26943  19   0  1864  596 schedu S    ?          0:00      \_ /opt/sge/utilbin/glinux/rshd -l
4   503 26945 26944  15   0     0    0 funct> Z    ?          0:00          \_ [qrsh_starter <defunct>]
0   503 27051     1  25   0 93144 25040 -     R    ?         13:08 /home/mitch/QMMM_38/cpmd.x input /home/mitch/PP

RUNNING JOB on the master node (node17):

5   200  2434     1   5 -10  5664 2292 schedu S<   ?         50:17 /opt/sge/bin/glinux/sge_execd
0   200 30558  2434  11 -10  1856  664 wait4  S<   ?          0:00  \_ sge_shepherd-4912 -bg
4   503 30581 30558  23   0  2056  960 wait4  S    ?          0:00  |   \_ bash /opt/sge/default/spool/node17/job_scripts/4912
0   503 30584 30581  25   0  4300 2728 wait4  S    ?          0:00  |       \_ perl -S -w /opt/mpich/1.2.5..10/ia32/ic71/bin/mpirun.ch_gm.pl -np 2 -machinefile /tmp/4912.1.node17.q/machi
1   503 30613 30584  23   0  4364 2788 schedu S    ?          0:00  |           \_ perl -S -w /opt/mpich/1.2.5..10/ia32/ic71/bin/mpirun.ch_gm.pl -np 2 -machinefile /tmp/4912.1.node17.q/m
0   503 30614 30584  25   0  2480 1096 wait4  S    ?          0:00  |           \_ /opt/sge/bin/glinux/qrsh -inherit node17 cd /home/mitch/Azurin/test
4   503 30630 30614  15   0  1552  584 schedu S    ?          0:00  |           |   \_ /opt/sge/utilbin/glinux/rsh -p 53878 node17.cluster exec '/opt/sge/utilbin/glinux/qrsh_starter' '/i
1   503 30633 30630  24   0     0    0 do_exi Z    ?          0:00  |           |       \_ [rsh <defunct>]
0   503 30615 30584  25   0  2480 1096 wait4  S    ?          0:00  |           \_ /opt/sge/bin/glinux/qrsh -inherit -nostdin node30 cd /home/mitch/Azurin/test ; env GMPI_MASTER=node17.c
4   503 30631 30615  24   0  1544  580 schedu S    ?          0:00  |               \_ /opt/sge/utilbin/glinux/rsh -n -p 53134 node30.cluster exec '/opt/sge/utilbin/glinux/qrsh_starter'
0   200 30628  2434  10 -10  1860  672 wait4  S<   ?          0:00  \_ sge_shepherd-4912 -bg
4     0 30629 30628  21   0  1868  600 schedu S    ?          0:00      \_ /opt/sge/utilbin/glinux/rshd -l
4   503 30632 30629  25   0  1632  420 wait4  S    ?          0:00          \_ /opt/sge/utilbin/glinux/qrsh_starter /imports/sge/default/spool/node17/active_jobs/4912.1/1.node17
0   503 30687 30632  25   0  2072  996 wait4  S    ?          0:00              \_ bash -c cd /home/mitch/Azurin/test
0   503 30739 30687  25   0 611212 431620 -   R    ?         13:18                  \_ /home/mitch/QMMM_38/cpmd.x input /home/mitch/PP

RUNNING JOB on the slave node (node30):

1     0  2429     1  15   0  3848  920 schedu S    ?          2:13 /opt/sge/bin/glinux/sge_commd
5   200  2431     1   5 -10  5648 2272 schedu S<   ?         41:18 /opt/sge/bin/glinux/sge_execd
0   200 25213  2431  10 -10  1860  672 wait4  S<   ?          0:00  \_ sge_shepherd-4912 -bg
4     0 25214 25213  21   0  1868  596 schedu S    ?          0:00      \_ /opt/sge/utilbin/glinux/rshd -l
4   503 25215 25214  25   0  1632  420 wait4  S    ?          0:00          \_ /opt/sge/utilbin/glinux/qrsh_starter /imports/sge/default/spool/node30/active_jobs/4912.1/1.node30
0   503 25269 25215  25   0  2064  992 wait4  S    ?          0:00              \_ bash -c cd /home/mitch/Azurin/test
0   503 25321 25269  25   0 582600 400012 -   R    ?         17:48                  \_ /home/mitch/QMMM_38/cpmd.x input /home/mitch/PP

>But when you modify the $PATH in your shell script, and put anything in front 
>of it, you may get a different behavior. 
>
My submit script sets up the Intel compiler environment and does indeed
prepend something to $PATH. I thought I had found the solution there, but
modifying that doesn't change the behaviour at all: ghost jobs remain on
the slave nodes.
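
For reference, the relevant part of the submit script looks roughly like the
sketch below; the Intel directories, the PE name and the mpirun call are
illustrative guesses rather than the exact lines from our cluster:

    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd
    #$ -pe mpich 2

    # Intel compiler environment; this is the part that prepends to $PATH
    # (directories are hypothetical examples):
    export PATH=/opt/intel/compiler70/ia32/bin:$PATH
    export LD_LIBRARY_PATH=/opt/intel/compiler70/ia32/lib:$LD_LIBRARY_PATH

    # MPICH-GM mpirun; SGE's parallel environment supplies $NSLOTS and the
    # machines file under $TMPDIR:
    /opt/mpich/1.2.5..10/ia32/ic71/bin/mpirun -np $NSLOTS \
        -machinefile $TMPDIR/machines \
        /home/mitch/QMMM_38/cpmd.x input /home/mitch/PP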

Thanks for your support,

Bye,

Michel

-- 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Michel Cuendet, Ph.D. student
Laboratory of Computational Biochemistry and Chemistry
Swiss Federal Institute of Technology in Lausanne (EPFL)
CH-1015 Lausanne						
Switzerland                         	Phone : +41 1 693 0324
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






