[GE users] Questions about log file: $SGE_ROOT/default/spool/qmaster
reuti at staff.uni-marburg.de
Mon May 30 10:46:31 BST 2005
When you get the second form of activation of the rsh call in your
listing, you are getting a so-called Tight Integration of your parallel
Viktor Oudovenko wrote:
>>finishes I get
>>>error messages like the last two lines.
>>>Is it normal?
>>>The only trick I do it is in the "qmon; queues; execution
>>method" I put
>>>"Terminate Method" SIGTERM.
>>the built-in default is the SIGKILL. Wasn't it working? There
>>was a bug which
>>should be fixed in u4 for this error messages (your
The second version you posted in your file is the one to go for. Did you
already patch the Perl script, or is the "exec" now also included in
1.2.4..8?
When you use a Tight Integration, SGE can reliably kill the processes.
In the first form (using plain rsh), SGE isn't aware of the child tasks
on the slaves at all, so nothing will be killed there (whether you set
terminate_method to NONE or to a signal).
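
One way to check whether a parallel environment is set up for Tight Integration is to look at its control_slaves attribute, which must be TRUE for slave tasks to be started under SGE's control. A minimal sketch, assuming the PE definition is obtained with `qconf -sp`; the helper name `pe_is_tight` and the PE name "mpich" are only illustrative:

```shell
#!/bin/sh
# Report whether a parallel environment definition, read from stdin in
# the format printed by `qconf -sp <pe_name>`, has control_slaves set
# to TRUE. Exit status 0 means TRUE, 1 means anything else.
pe_is_tight() {
    awk '$1 == "control_slaves" { v = toupper($2) }
         END { exit (v == "TRUE") ? 0 : 1 }'
}

# Typical use on a host with SGE installed (not run here;
# "mpich" is a hypothetical PE name):
#   qconf -sp mpich | pe_is_tight && echo "slave tasks are under SGE control"
```

This only checks the PE setting; the rsh wrapper still has to be the one actually used by the job.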
> It was working with SIGKILL for parallel queues but not for myrinet! I mean
> for myrinet it failed time after time.
When you get the second form, it should also work for Myrinet.
Otherwise: can you please post the output of the following command, run
on the master or a slave node of a job:
ps -e f -o pid,ppid,pgrp,command --cols=500
As long as all processes are in the same process group for each qrsh
call, they should be killed.
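
To see which processes share a process group, and what signalling a whole group looks like, here is a minimal sketch. It assumes a `ps` that accepts `-o pid= -o pgid=`; the function names are only illustrative, and this is not how SGE itself is implemented:

```shell
#!/bin/sh
# Print the PID of every process that belongs to the given process
# group, one per line.
pids_in_group() {
    # $1 = process group ID to look for
    ps -e -o pid= -o pgid= | awk -v g="$1" '$2 == g { print $1 }'
}

# Send SIGTERM to every process in a process group. The negative
# number after "--" addresses the whole group instead of a single PID.
term_group() {
    # $1 = process group ID to signal
    kill -s TERM -- "-$1"
}
```

A process that has moved itself into a different process group would not show up in `pids_in_group` and would not receive the signal from `term_group`.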
> Myrinet has different version: I use mpich-1.2.4..8a .
> For ordinary parallel queues I use 1.2.6 .
Was it working out of the box for MPICH 1.2.6?
> I have wallclock limit but it is equal to one week but this message appears
> only when jobs finishes.
Which version of SGE are you using? Maybe this is the bug that was
fixed in 6.0u4.
>>loglevel to log_info, you
>>might see the reason for the kill by SGE in the messages file.
and then you can change the default entry "loglevel log_warning" to
"loglevel log_info" using the opened vi editor.
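
A minimal sketch of checking and changing this setting, assuming the global cluster configuration is what is being edited and that the standard `qconf -sconf` and `qconf -mconf` commands are used; the helper name `current_loglevel` is only illustrative:

```shell
#!/bin/sh
# Print the loglevel value found in cluster configuration text read
# from stdin (the format printed by `qconf -sconf`).
current_loglevel() {
    awk '$1 == "loglevel" { print $2 }'
}

# Typical use on a host with SGE installed (not run here):
#   qconf -sconf | current_loglevel   # show the current setting
#   qconf -mconf                      # opens the editor; set "loglevel log_info"
#   tail "$SGE_ROOT/default/spool/qmaster/messages"
```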
> If you could tell me how I could do I'd appreciate. I can look into qmon
> I have one more questions. Before while running jobs on myrinet: usually I
> had processes looking like:
> See file in attachment . In the second group of processes each process looks
> like a duplicated one.
The second is the correct one.
> Could you tell me what is normal the first case or the second one.
> The first one I get in ordinary parallel queue (not myrinet)
How are you starting your parallel jobs? The first case also looks
like a Myrinet job.
Cheers - Reuti