[GE users] Jobs killed because of bogus h_cpu values

Göran Uddeborg uddeborg at carmen.se
Thu Jan 12 15:15:24 GMT 2006


Recently SGE has started to kill jobs, incorrectly claiming that they
have exceeded their h_cpu limit.

As an example, job 2280213 was submitted earlier today.  It ran for
one second, between 13:17:07 and 13:17:08 (see the attached qacct
output).  According to the log file of the execution machine,
tiptonville, the job was killed because it exceeded the h_cpu limit,
having used 7767 seconds while the limit is 660.  The limit is
correct, but the usage is obviously wrong.  I attach tiptonville's
messages file and the configuration of the short queue.
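
To make the mismatch concrete, here is a minimal sketch of the sanity
check implied above: a job cannot accumulate more CPU time than its
wallclock time multiplied by the number of CPUs it ran on.  The
function names and the slots parameter are hypothetical, and the check
assumes the job uses no more CPUs than it has slots.  It is not how
SGE itself computes usage.

```python
from datetime import datetime


def wallclock_seconds(start: str, end: str) -> float:
    """Elapsed seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


def cpu_usage_is_plausible(cpu_seconds: float, wallclock: float, slots: int = 1) -> bool:
    """True if the reported CPU time fits within wallclock * slots.

    Assumption: the job uses at most `slots` CPUs at any moment.
    """
    return cpu_seconds <= wallclock * slots


# Values quoted in this message: start and end times of job 2280213,
# and the CPU usage reported in tiptonville's messages file.
wall = wallclock_seconds("13:17:07", "13:17:08")
print(wall)                                # 1.0
print(cpu_usage_is_plausible(7767, wall))  # False
```

With one second of wallclock time, a reported usage of 7767 seconds
fails the check by a wide margin, whatever the h_cpu limit is.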

If you look further in the log, there are several jobs that have used
almost, but not exactly, the same amount of time.  There are even more
in the messages files from previous days.  The few samples I checked
also ran for just a second or so.  Essentially, they are killed
immediately.

As I mentioned, this started recently.  More precisely, it seems to
have started after we upgraded to U7 on 21 December.  We are not sure
the upgrade is the cause, but we strongly suspect it.

Has anybody seen anything like this?  Does anybody have a clue what
the reason for this could be?



    [ Part 2, "qacct -j 2280213"  Text/PLAIN 43 lines. ]
    [ Unable to print this part. ]


    [ Part 3, "messages from tiptonville"  Text/PLAIN 145 lines. ]
    [ Unable to print this part. ]


    [ Part 4, "qconf -sq short"  Text/PLAIN 53 lines. ]
    [ Unable to print this part. ]


    [ Part 5: "Attached Text" ]
