[GE users] how to distinguish job termination due cpu/mem limit

jank jkoellin at cebitec.uni-bielefeld.de
Tue Dec 1 11:40:47 GMT 2009


Hi, I'm using SGE through the DRMAA interface. Right now I want to do some error handling/logging if a submitted job fails or gets stopped by the queue because it hit a limit.

Maybe I'm missing some basics here, but I can't think of a way to determine whether a job was stopped because it hit the memory limit or the CPU limit.

The signals are the same in both cases (SIGXCPU for the soft limits and SIGKILL for the hard limits). There is no information in the job's stderr. The only place where the information is written is the messages file on the exec host, but my program probably has no access to this file. Because the job is submitted by a program, I don't want to depend on an email notification either.

I still have the option of calling "qacct -j id" to get the queue for the job and then comparing the queue limits (qconf -sq) with the resource usage of the job. But this won't work through DRMAA, and the information is derived quite indirectly.
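
To make the indirect approach concrete, here is a rough Python sketch of the comparison. It assumes the output of both commands is plain "key value" lines, that the accounting record carries "maxvmem" and "cpu" fields, and that the queue configuration carries "s_vmem", "h_vmem", "s_cpu" and "h_cpu". The helper names (parse_kv, parse_mem, parse_time, exceeded_limits, fetch) are made up for illustration, and the result is only a guess, since accounting values need not match what the exec host measured when it sent the signal.

```python
import re
import subprocess

# SGE memory suffixes: lowercase = powers of 1000, uppercase = powers of 1024.
_MULT = {"k": 1000, "K": 1024, "m": 1000**2, "M": 1024**2,
         "g": 1000**3, "G": 1024**3}


def parse_kv(text):
    """Turn 'key   value' lines into a dict; lines without a value are skipped."""
    pairs = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            pairs[parts[0]] = parts[1].strip()
    return pairs


def parse_mem(value):
    """Convert a memory value such as '4.116M' or 'INFINITY' to bytes."""
    if value.upper() == "INFINITY":
        return float("inf")
    match = re.fullmatch(r"([0-9.]+)([kKmMgG]?)", value)
    if match is None:
        raise ValueError("unrecognised memory value: %r" % value)
    return float(match.group(1)) * _MULT.get(match.group(2), 1)


def parse_time(value):
    """Convert plain seconds or 'h:m:s' (or 'INFINITY') to seconds."""
    if value.upper() == "INFINITY":
        return float("inf")
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def exceeded_limits(usage, queue):
    """Return the names of queue limits that the recorded usage reached."""
    hits = []
    vmem = parse_mem(usage["maxvmem"])
    cpu = parse_time(usage["cpu"])
    for name in ("s_vmem", "h_vmem"):
        if name in queue and vmem >= parse_mem(queue[name]):
            hits.append(name)
    for name in ("s_cpu", "h_cpu"):
        if name in queue and cpu >= parse_time(queue[name]):
            hits.append(name)
    return hits


def fetch(job_id, queue_name):
    """Run qacct and qconf (both must be on PATH) and parse their output."""
    usage = subprocess.check_output(["qacct", "-j", str(job_id)], text=True)
    queue = subprocess.check_output(["qconf", "-sq", queue_name], text=True)
    return parse_kv(usage), parse_kv(queue)
```

Usage would be along the lines of exceeded_limits(*fetch(53, "all.q")). An empty list means the accounting data gives no hint, and array jobs (several records per job id) are not handled here.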

Example of the information in the messages file on the exec host:
11/30/2009 13:18:11|  main|myexechost|W|job 53 exceeds job soft limit "s_vmem" of queue "all.q@myexechost" (4116480.00000 > limit:1000000.00000) - sending SIGXCPU
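
If the messages file were readable (for example over a shared filesystem), a line like this could be picked apart with a regular expression. The following Python sketch is based only on the single sample line above. That hard-limit lines use the same wording with "hard" in place of "soft" is an assumption, and parse_limit_line is a made-up name.

```python
import re

# Pattern derived from one sample line; other message variants may differ.
LIMIT_RE = re.compile(
    r'job (?P<job>\S+) exceeds job (?P<kind>soft|hard) limit "(?P<limit>[^"]+)" '
    r'of queue "(?P<queue>[^"]+)" '
    r'\((?P<usage>[\d.]+) > limit:(?P<max>[\d.]+)\) - sending (?P<signal>\w+)'
)

SAMPLE = ('11/30/2009 13:18:11|  main|myexechost|W|job 53 exceeds job soft limit '
          '"s_vmem" of queue "all.q@myexechost" '
          '(4116480.00000 > limit:1000000.00000) - sending SIGXCPU')


def parse_limit_line(line):
    """Return a dict describing the exceeded limit, or None if no match."""
    match = LIMIT_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the two numeric fields; everything else stays a string.
    info["usage"] = float(info["usage"])
    info["max"] = float(info["max"])
    return info


if __name__ == "__main__":
    print(parse_limit_line(SAMPLE))
```

The "limit" field ("s_vmem" here) is what separates a memory limit from a CPU limit, which the signal alone cannot do.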

Is there an (easy) solution to this problem?

Thanks
-Jan

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=230680
