[GE users] Questions about log file: $SGE_ROOT/default/spool/qmaster
reuti at staff.uni-marburg.de
Thu May 26 10:47:29 BST 2005
Quoting Viktor Oudovenko <udo at physics.rutgers.edu>:
> Is this kind of log output normal or not?
> 05/25/2005 13:34:53|qmaster|rupc-cs04b|E|orders user/project version (2366)
> is not uptodate (2367) for user/project "cfennie"
> 05/25/2005 13:34:53|qmaster|rupc-cs04b|E|orders user/project version (955)
> is not uptodate (956) for user/project "karenjoh"
> 05/25/2005 14:05:08|qmaster|rupc-cs04b|E|tightly integrated parallel task
> 21840.1 task 3.sub04n68 failed - killing job
> 05/25/2005 14:08:30|qmaster|rupc-cs04b|E|tightly integrated parallel task
> 21858.1 task 4.sub04n61 failed - killing job
> Actually 2 questions:
> 1) When I modify the policy configuration I get messages like the first two
> lines above. How can I get rid of them?
> 2) Each time a parallel job on the parallel or myrinet queue finishes, I get
> messages like the last two lines.
> Is that normal?
> The only trick I use is that in "qmon; queues; execution method" I set
> "Terminate Method" to SIGTERM.
The built-in default is SIGKILL. Wasn't that working? There was a bug causing
these error messages, which should be fixed in u4 (which version are you
running?). Then try a Tight Integration according to the instructions in
$SGE_ROOT/mpi and the Howtos. Which Myrinet version are you using? For 1.2.5..xx
you need a slight modification of a script; in 1.2.6..xx I heard it works out
of the box, i.e. the patch is
> It is very helpful for getting rid of the whole job on all slaves,
> especially on the myrinet cluster.
> 3) the most important question:
> One of my users runs a perl script that calls an mpi command a few times in
> the SGE script. Occasionally the following lines appear in the messages file
> after the job gets terminated. Any idea what it could be and how to avoid it?
> 05/25/2005 08:58:23|qmaster|rupc-cs04b|E|tightly integrated parallel task
> 21823.1 task 5.sub04n88 failed - killing job
> 05/25/2005 09:00:12|qmaster|rupc-cs04b|W|job 21823.1 failed on host
> assumedly after job because: job 21823.1 died through signal TERM (15)
Is there a wallclock or other limit in place? If you set loglevel to log_info,
you might see the reason for the kill by SGE in the messages file.
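
Once more detail is being logged, it can help to pull out everything the messages file says about one job. Below is a minimal sketch in Python, assuming the layout seen in the lines quoted above (timestamp|component|host|level|message, one entry per line in the real file). The names parse_messages and entries_for_job are made up for this illustration and are not part of SGE, and the sample lines are the ones from this thread, un-wrapped.

```python
import re
from collections import namedtuple

# Field layout inferred from the log lines quoted in this thread (assumption):
# timestamp|component|host|level|message
LogEntry = namedtuple("LogEntry", "timestamp component host level message")

# Sample lines taken from the post, joined back into single lines.
SAMPLE = [
    "05/25/2005 08:58:23|qmaster|rupc-cs04b|E|tightly integrated parallel task "
    "21823.1 task 5.sub04n88 failed - killing job",
    "05/25/2005 09:00:12|qmaster|rupc-cs04b|W|job 21823.1 failed on host "
    "assumedly after job because: job 21823.1 died through signal TERM (15)",
    "05/25/2005 14:05:08|qmaster|rupc-cs04b|E|tightly integrated parallel task "
    "21840.1 task 3.sub04n68 failed - killing job",
    "this line does not follow the expected layout",
]


def parse_messages(lines):
    """Turn pipe-delimited log lines into LogEntry tuples.

    Lines that do not have five fields are skipped rather than guessed at.
    """
    entries = []
    for raw in lines:
        # Split on the first four pipes only, so pipes inside the message survive.
        parts = raw.rstrip("\n").split("|", 4)
        if len(parts) == 5:
            entries.append(LogEntry(*parts))
    return entries


def entries_for_job(entries, job_id, levels=("E", "W")):
    """Return entries at the given levels whose message mentions job_id.

    The lookarounds keep "1823.1" from matching inside "21823.1".
    """
    pattern = re.compile(r"(?<![\d.])" + re.escape(job_id) + r"(?!\d)")
    return [e for e in entries if e.level in levels and pattern.search(e.message)]


if __name__ == "__main__":
    for entry in entries_for_job(parse_messages(SAMPLE), "21823.1"):
        print(entry.timestamp, entry.level, entry.message)
```

To run it against the real file, pass an open file handle for the messages file in the spool directory instead of SAMPLE. After raising the loglevel you would probably also want to add the info-level code to levels; I believe that is "I", but check it against your own file.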
Cheers - Reuti