[GE users] Questions about log file: $SGE_ROOT/default/spool/qmaster

Viktor Oudovenko udo at physics.rutgers.edu
Mon May 30 07:45:37 BST 2005


Hi, Reuti,

Thanks a lot for the answer.

> Viktor,
> 
> Quoting Viktor Oudovenko <udo at physics.rutgers.edu>:
> 
> > Hi,
> > 
> > Is this kind of log output normal or not?
> > 
> > 05/25/2005 13:34:53|qmaster|rupc-cs04b|E|orders user/project version (2366) is not uptodate (2367) for user/project "cfennie"
> > 
> > 05/25/2005 13:34:53|qmaster|rupc-cs04b|E|orders user/project version (955) is not uptodate (956) for user/project "karenjoh"
> > 
> > 05/25/2005 14:05:08|qmaster|rupc-cs04b|E|tightly integrated parallel task 21840.1 task 3.sub04n68 failed - killing job
> > 
> > 05/25/2005 14:08:30|qmaster|rupc-cs04b|E|tightly integrated parallel task 21858.1 task 4.sub04n61 failed - killing job
> > 
> > 
> > 
> > Actually, two questions:
> > 
> > 1) When I modify the policy configuration, I get messages like the
> > first two messages above. How can I get rid of them?
> > 
> > 2) Each time a parallel job on the parallel or myrinet queue finishes,
> > I get error messages like the last two messages above.
> > Is that normal?
> > The only trick I use is that in "qmon; queues; execution method" I
> > set the "Terminate Method" to SIGTERM.
> 
> The built-in default is SIGKILL. Wasn't it working? There was a bug
> causing these error messages, which should be fixed in u4 (which
> version are you running?).

It was working with SIGKILL for the parallel queues, but not for myrinet!
I mean, for myrinet it failed time after time.

> Then try a Tight Integration according to the $SGE_ROOT/mpi
> instructions and the Howtos. Which Myrinet version are you using? For
> 1.2.5..xx you need a slight modification of a script; in 1.2.6..xx I
> heard it's working out of the box, i.e. the patch is already in.

Myrinet has a different version: there I use mpich-1.2.4..8a.

For the ordinary parallel queues I use 1.2.6.

> > It is very helpful for getting rid of the whole job on all slaves,
> > especially on the myrinet cluster.
> > 
> > 3) The most important question:
> > One of my users runs a Perl script that calls an MPI command a few
> > times within the SGE script. Occasionally the following lines show up
> > in the messages file, after which the job gets terminated. Any idea
> > what it could be and how to avoid it?
> > 
> > 05/25/2005 08:58:23|qmaster|rupc-cs04b|E|tightly integrated parallel task 21823.1 task 5.sub04n88 failed - killing job
> > 
> > 05/25/2005 09:00:12|qmaster|rupc-cs04b|W|job 21823.1 failed on host sub04n86 assumedly after job because: job 21823.1 died through signal TERM (15)
> 
> Is there any wallclock or other limit? 

I have a wallclock limit, but it is set to one week, and this message
appears only when a job finishes.

> If you set loglevel to log_info, you might see the reason for the
> kill by SGE in the messages file.


If you could tell me how to do that, I'd appreciate it. I can look through
the qmon options.
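
A minimal sketch for pulling every entry for one job out of the messages
file, so that the lines around a kill can be read together. It assumes
only the pipe-separated layout visible in the lines quoted above
(timestamp|daemon|host|level|text). The names LogEntry, parse_line and
entries_for_job are made up for illustration and are not part of SGE.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    date: str
    time: str
    daemon: str
    host: str
    level: str   # single letter as seen in the log, e.g. "E" or "W"
    text: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one messages line; return None if it lacks the five fields."""
    # Split at most four times so any "|" inside the text is preserved.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, daemon, host, level, text = parts
    date, _, time = stamp.partition(" ")
    return LogEntry(date, time, daemon, host, level, text)


def entries_for_job(lines: Iterable[str], job_id: str) -> List[LogEntry]:
    """Return parsed entries whose text mentions job_id (e.g. "21823.1")."""
    # Guard against matching a longer number such as "121823.1" or "21823.10".
    pattern = re.compile(r"(?<![\d.])" + re.escape(job_id) + r"(?!\d)")
    result = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and pattern.search(entry.text):
            result.append(entry)
    return result


# Sample lines copied from the messages quoted in this thread.
SAMPLE = [
    '05/25/2005 13:34:53|qmaster|rupc-cs04b|E|orders user/project version '
    '(2366) is not uptodate (2367) for user/project "cfennie"',
    '05/25/2005 08:58:23|qmaster|rupc-cs04b|E|tightly integrated parallel '
    'task 21823.1 task 5.sub04n88 failed - killing job',
    '05/25/2005 09:00:12|qmaster|rupc-cs04b|W|job 21823.1 failed on host '
    'sub04n86 assumedly after job because: job 21823.1 died through signal '
    'TERM (15)',
]
```

For a real file, the same function can be fed an open file object, e.g.
entries_for_job(open(path), "21823.1"), where path points at the qmaster
messages file.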
 
I have one more question. Previously, while running jobs on myrinet, I
usually had processes looking like:

See the attached file. In the second group of processes, each process
looks like it is duplicated.
Could you tell me which is normal: the first case or the second one?

The first one is what I get in the ordinary parallel queue (not myrinet).

Thank you very much in advance.

Best regards,
v


    [ Part 2, Text/PLAIN (Name: "processes.txt") ~169 lines. ]
    [ Unable to print this part. ]


    [ Part 3: "Attached Text" ]
