[GE users] Questions about log file: $SGE_ROOT/default/spool/qmaster

Viktor Oudovenko udo at physics.rutgers.edu
Tue May 31 01:22:30 BST 2005


Hi, Reuti,

Thanks a lot for the answer. I am observing something quite strange,
but first let me answer your questions.
 
> Viktor,
> 
> when you get the second form of activation of the rsh call in your
> listing, you are getting a so called Tight Integration of your parallel
> jobs.


Just for information: with all the versions of SGE I have used (6.0u1, u3
and u4) I usually got the first scenario, i.e. no tight integration. I solved
the problem with killing jobs as described on the SGE website (I modified
mpich-1.2.5 at that time); as I remember, only one line had to be changed.
But Myrinet had, and still has, a different MPI version, and there I solved
the problem simply by setting SIGTERM instead of NONE, which worked very
nicely: the jobs got killed immediately.
Then I updated the cluster to SUSE 9.0 and SUSE 9.2, and SGE from u1 to u3,
and started to observe many duplicated processes (the scenario that is true
tight integration). But recently (today or yesterday, when I changed some
/etc/hosts files to get the route command working properly) I started to get
only the first scenario again, except on one node.
The only difference between those nodes is that on one (the first one) I
upgraded the operating system from SUSE 8.1 to SUSE 9.0, while the second
one I installed from scratch. I used exactly the same job on those 2 nodes!
Please have a look at the file info.txt. It means that whether I get tight
integration or not depends on some system setting; I could not work out
which one.

Add-ons: 

A) I presented results for myrinet nodes.

B) I used only 2 CPUs, just to see the behavior clearly and to have
everything in one box.


> Viktor Oudovenko wrote:
> <snip>
> 
> >>>finishes I get error messages like the last two lines.
> >>>Is it normal?
> >>>The only trick I do it is in the "qmon; queues; execution method" I put
> >>>"Terminate Method" SIGTERM.
> >>
> >>the built-in default is the SIGKILL. Wasn't it working? There was a bug
> >>which should be fixed in u4 for this error messages (your version?).
> 
> The second version you posted in your file is the one to go for. Did you
> patched already the perl script, or is the "exec" now already also
> included in 1.2.4..8?

No, I did not touch the Myrinet MPICH. Which perl script do you mean, and
which exec?

> When you use a Tight Integration, SGE can kill the processes for sure.
> In the first form (using rsh), SGE isn't aware at all of child tasks on
> the slaves, and so nothing will be killed there (whether you set
> terminate_method to NONE or a signal).
> 
> > It was working with SIGKILL for parallel queues but not for myrinet!
> > I mean for myrinet it failed time after time.
> 
> When you get the second form, it should also work for Myrinet.
> Otherwise: can you please post the output of the master or a slave of a
> job:
> 
> ps -e f -o pid,ppid,pgrp,command --cols=500
> 
> As long as all processes are in the same process group for each qrsh 
> call, they should be killed.
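To make the process-group check above easier to read on a busy node, the
ps output can be reduced to one line per process group. This is only a
sketch: count_by_pgrp is a helper name made up for illustration, and it
assumes the column order pid,ppid,pgrp,command requested in the ps call
above, with the header line included.

```shell
# Count how many processes share each process group (PGRP).
# Reads ps-style output on stdin: a header line, then PID PPID PGRP COMMAND.
count_by_pgrp() {
    # Skip the header, tally column 3 (PGRP), print "pgrp count" sorted by pgrp.
    awk 'NR > 1 { n[$3]++ } END { for (g in n) print g, n[g] }' | sort -n
}

# Usage on a live node (not executed here):
#   ps -e -o pid,ppid,pgrp,command --cols=500 | count_by_pgrp
```

If the tasks started by one qrsh call all show up under a single PGRP value,
they match the condition described above.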

Please see the attached info.txt file.


> > Myrinet has different version: I use mpich-1.2.4..8a .
> > 
> > For ordinary parallel queues I use 1.2.6 .
> 
> For MPICH 1.2.6 it was working out of the box?
> 
> > I have wallclock limit but it is equal to one week but this message 
> > appears only when jobs finishes.
> 
> Which version of SGE are you using? Maybe this is the bug, which is 
> removed in 6.0u4.
> 
> >>loglevel to log_info, you
> >>might see the reason for the kill by SGE in the messages file.
> 
> qconf -mconf
> 
> and then you can change the default entry "loglevel log_warning" to 
> "loglevel log_info" using the opened vi editor.
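To confirm the change without opening the editor again, the value can be
read back from the configuration dump. A minimal sketch, assuming the usual
"name value" layout of qconf output; get_setting is a hypothetical helper
name, not part of SGE.

```shell
# Print the value of one setting from "name value" style configuration text.
# Reads the configuration on stdin; the setting name is the first argument.
get_setting() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Usage (not executed here):
#   qconf -sconf | get_setting loglevel
# The same helper can read a queue's terminate_method from "qconf -sq":
#   qconf -sq <queue> | get_setting terminate_method
```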

I've done it, and now I also get info-level entries in the messages file.
 
> > If you could tell me how I could do I'd appreciate. I can look into 
> > qmon options.
> >  
> > I have one more questions. Before while running jobs on myrinet: 
> > usually I had processes looking like:
> > 
> > See file in attachment. In the second group of processes each process
> > looks like a duplicated one.
> 
> The second is the correct one.
> 
> > Could you tell me what is normal the first case or the second one.
> > 
> > The first one I get in ordinary parallel queue (not myrinet)
> 
> How are you starting your parallel jobs? It looks also in the first case
> like a Myrinet job.

Usually I use the command: qsub my_script.sh (see the example of my script
in the attachments, as well as the prolog file).

Best regards,
v



> Cheers - Reuti
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 


Attachments (not reproduced in the archive):

    info.txt            (~169 lines)
    myri.sh.txt         (~22 lines)
    epilog_myri.sh.txt  (~10 lines)
    prolog_myri.sh.txt  (~42 lines)
