[GE users] Questions about log file: $SGE_ROOT/default/spool/qmaster

Reuti reuti at staff.uni-marburg.de
Mon May 30 10:46:31 BST 2005


Viktor,

when you get the second form of the rsh invocation in your listing, you 
have a so-called Tight Integration of your parallel jobs.

Viktor Oudovenko wrote:
<snip>

>>> finishes I get error messages like the last two lines.
>>> Is it normal?
>>> The only trick I do it is in the "qmon; queues; execution method" I put
>>> "Terminate Method" SIGTERM.
>>
>> the built-in default is the SIGKILL. Wasn't it working? There was a bug
>> which should be fixed in u4 for this error messages (your version?).

The second version you posted in your file is the one to go for. Did you 
already patch the perl script, or is the "exec" now also included in 
1.2.4..8?

When you use a Tight Integration, SGE can kill the processes for sure. 
In the first form (using rsh), SGE isn't aware at all of child tasks on 
the slaves, and so nothing will be killed there (whether you set 
terminate_method to NONE or a signal).

> It was working  with SIGKILL for parallel queues but not for myrinet! I mean
> for myrinet it failed time after time.

When you get the second form, it should also work for Myrinet. 
Otherwise: can you please post the output of the following command, run on 
the master or a slave node of a job:

ps -e f -o pid,ppid,pgrp,command --cols=500

As long as all processes are in the same process group for each qrsh 
call, they should be killed.
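
To check this at a glance, the listing can be condensed to one line per 
process group. This is only a minimal sketch: it assumes the column order 
pid, ppid, pgrp, command from the ps call above, and the function name 
pgrp_summary is made up for illustration.

```shell
# Summarize a `ps -o pid,ppid,pgrp,command` listing read from stdin:
# print "<pgrp> <number of processes>" for every process group found.
# Assumes the third column holds the process group ID and that the
# first line is the ps header, which is skipped.
pgrp_summary() {
    awk 'NR > 1 { count[$3]++ }
         END { for (g in count) print g, count[g] }' | sort -n
}

# Usage on a node (same ps call as above):
#   ps -e f -o pid,ppid,pgrp,command --cols=500 | pgrp_summary
```

The output covers every process on the node, so only the groups belonging 
to the job's qrsh calls are of interest. If the tasks started by one qrsh 
call show up under several group IDs, that would be worth posting.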

> Myrinet has different version: I use mpich-1.2.4..8a .
> 
> For ordinary parallel queues I use 1.2.6 .

For MPICH 1.2.6 it was working out of the box?

> I have wallclock limit but it is equal to one week but this message appears
> only when jobs finishes.

Which version of SGE are you using? Maybe this is the bug which was 
fixed in 6.0u4.

>> loglevel to log_info, you might see the reason for the kill by SGE in
>> the messages file.

qconf -mconf

and then, in the vi editor that opens, you can change the default entry 
"loglevel log_warning" to "loglevel log_info".
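
To confirm afterwards that the change was saved, the configuration can be 
read back without opening the editor again. A small sketch, assuming that 
"qconf -sconf" prints the global configuration as "name value" lines; the 
helper name current_loglevel is hypothetical.

```shell
# Print the value of the "loglevel" entry from configuration text on stdin.
# Assumes one "name value" pair per line, as in the editor opened by
# qconf -mconf.
current_loglevel() {
    awk '$1 == "loglevel" { print $2 }'
}

# Usage on a host with SGE admin tools:
#   qconf -sconf | current_loglevel
```

Once this prints log_info, new entries should show up in the messages file 
in the qmaster spool directory named in the subject line.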

> If you could tell me how I could do I'd appreciate. I can look into qmon
> options.
>  
> I have one more questions. Before while running jobs on myrinet: usually I
> had processes looking like:
> 
> See file in attachment . In the second group of processes each process looks
> like a duplicated one.

The second is the correct one.

> Could you tell me what is normal the first case or the second one.
> 
> The first one I get in ordinary parallel queue (not myrinet)

How are you starting your parallel jobs? The first case also looks like 
a Myrinet job.

Cheers - Reuti