[GE users] Questions about log file: $SGE_ROOT/default/spool/qmaster

Reuti reuti at staff.uni-marburg.de
Thu May 26 10:47:29 BST 2005


Viktor,

Quoting Viktor Oudovenko <udo at physics.rutgers.edu>:

> Hi,
> 
> Is it normal such kind of log output or not?
> 
> 05/25/2005 13:34:53|qmaster|rupc-cs04b|E|orders user/project version (2366)
> is not uptodate (2367) for user/project "cfennie"
> 
> 05/25/2005 13:34:53|qmaster|rupc-cs04b|E|orders user/project version (955)
> is not uptodate (956) for user/project "karenjoh"
> 
> 05/25/2005 14:05:08|qmaster|rupc-cs04b|E|tightly integrated parallel task
> 21840.1 task 3.sub04n68 failed - killing job
> 
> 05/25/2005 14:08:30|qmaster|rupc-cs04b|E|tightly integrated parallel task
> 21858.1 task 4.sub04n61 failed - killing job
> 
> 
> 
> Actually 2 questions:
> 
> 1) when I modify policy configuration I get messages like in the first 2
> lines.
> How can I get rid of them?
> 
> 2) each time parallel job on parallel or myrinet  queue finishes I get
> error
> messages like the last two lines.
> Is it normal? 
> The  only trick I do it is in the "qmon; queues;  execution method" I put
> "Terminate Method" SIGTERM.

The built-in default is SIGKILL. Wasn't that working? There was a bug behind 
these error messages which should be fixed in u4 (which version are you 
running?). Then try a Tight Integration according to the instructions in 
$SGE_ROOT/mpi and the Howtos. Which Myrinet version are you using? For 1.2.5..xx 
you need a slight modification of a script; for 1.2.6..xx I have heard that it 
works out of the box, i.e. the patch is already in.

> It is very helpful to get rid of whole job on all slaves. Especially on
> myrinet cluster.
> 
> 3) the most important question:
> One of my users runs perl script calling mpi command a few times in the SGE
> script. On occasionally one gets in messages the following lines after
> which
> jobs gets terminated. Any idea what could it be and how to avoid it?
> 
> 05/25/2005 08:58:23|qmaster|rupc-cs04b|E|tightly integrated parallel task
> 21823.1 task 5.sub04n88 failed - killing job
> 
> 05/25/2005 09:00:12|qmaster|rupc-cs04b|W|job 21823.1 failed on host
> sub04n86
> assumedly after job because: job 21823.1 died through signal TERM (15)

Is there a wallclock or other limit set? If you set loglevel to log_info, you 
might see in the messages file why SGE killed the job.
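
To collect everything the messages file says about one job, a small script can 
help. The following is only a minimal sketch: it assumes each entry is a single 
line with five "|"-separated fields (timestamp, daemon, host, level, message), 
as in the lines quoted above once the mail wrapping is undone. The names 
parse_line and entries_for_job are made up for this illustration and are not 
part of SGE.

```python
from collections import namedtuple

# Field layout assumed from the quoted log lines:
# timestamp|daemon|host|level|message
Entry = namedtuple("Entry", "timestamp daemon host level message")


def parse_line(line):
    """Split one messages-file line into its five '|'-separated fields."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the assumed format; skip it
    return Entry(*parts)


def entries_for_job(lines, job_id):
    """Return the parsed entries whose message text mentions job_id."""
    found = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and job_id in entry.message:
            found.append(entry)
    return found


# Sample lines taken from this thread, re-joined onto one line each.
SAMPLE = [
    "05/25/2005 08:58:23|qmaster|rupc-cs04b|E|tightly integrated parallel task "
    "21823.1 task 5.sub04n88 failed - killing job",
    "05/25/2005 09:00:12|qmaster|rupc-cs04b|W|job 21823.1 failed on host sub04n86 "
    "assumedly after job because: job 21823.1 died through signal TERM (15)",
    "05/25/2005 14:05:08|qmaster|rupc-cs04b|E|tightly integrated parallel task "
    "21840.1 task 3.sub04n68 failed - killing job",
]

if __name__ == "__main__":
    for e in entries_for_job(SAMPLE, "21823.1"):
        print(e.timestamp, e.level, e.message)
```

To run it against the real file, pass an open file object for 
$SGE_ROOT/default/spool/qmaster/messages instead of SAMPLE. With loglevel set to 
log_info, any additional entries for the job would show up in the same listing.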

Cheers - Reuti
