[GE users] how to distinguish job termination due cpu/mem limit

reuti reuti at staff.uni-marburg.de
Tue Dec 1 12:45:36 GMT 2009


On 01.12.2009 at 12:40, jank wrote:

> Hi, I'm using the sge through the drmaa interface. Right now I want  
> to do some error handling/logging if a submitted job fails or gets  
> stopped by the queue due to limits.
> Maybe I'm missing out on some basics here but I can't think of a  
> method to determine if a job was stopped because it hit the mem OR  
> the cpu limit.
> The signals are the same for both cases (SIGXCPU for the soft  
> limits and SIGKILL for the hard limits). There is no information in  
> the std error of the job. The only place where the information is  
> written is in the messages file on the exec host but my program has  
> probably no access to this file. Because the job is submitted with  
> a program I don't want to depend on an email notification either.
> I still have the option to call "qacct -j id" to get the queue for  
> the job and then compare the queue limits (qconf -sq) with the  
> resource usage of the job. But this won't work with drmaa and the  
> info is derived quite indirectly.
> Example for Info in messages on exec:
> 11/30/2009 13:18:11|  main|myexechost|W|job 53 exceeds job soft limit "s_vmem" of queue "all.q@myexechost" (4116480.00000 > limit: 1000000.00000) - sending SIGXCPU

As long as it's a serial program (hence only one possible node where  
the violation occurred), you can grep for it. We do this in a mail-wrapper,  
but in your case I think a job epilog would be more suitable. There  
you can trigger anything to let your workflow know about the cause.  
With host-based ssh you can also trigger something on the master node.  
As a template, our mail-wrapper:


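#!/bin/sh
# $2 is the mail subject and $3 the recipient (as used in the mail calls below).

# Extract the job id and the reported condition from the subject line.
case `echo "$2" | cut -d " " -f 1` in

       Job) JOB_ID=`echo "$2" | cut -d " " -f 2`
            CONDITION=`echo "$2" | cut -d " " -f 4` ;;

 Job-array) JOB_ID=`echo "$2" | cut -d " " -f 3`
            CONDITION=`echo "$2" | cut -d " " -f 5` ;;

         *) ;;
esac

# Look up the first "exceed" entry for this job (or array task) on this host.
appendix=`grep -E "[|]job $JOB_ID([.][[:digit:]]+)? exceed" /var/spool/sge/$(hostname)/messages | head -n 1`
if [ -z "$appendix" ]; then
     appendix="Unknown, no entry found in messages file on the master node of the job."
fi

# Append the reason to aborted-job mails; pass all other mails through unchanged.
if [ "$CONDITION" = "Aborted" ]; then
     (cat; echo; echo "Reason for job abort:"; echo "$appendix") | mail -s "$2" "$3"
else
     mail -s "$2" "$3"
fi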
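
For the epilog variant, the same grep can be wrapped in a small helper. This is only a rough sketch, not something we run here: the function name limit_reason, its "unknown" fallback and the default spool path are assumptions for illustration, and it relies on the messages line having the format shown in the example above.

```shell
#!/bin/sh
# Hypothetical epilog helper (names and default path are assumptions).
# limit_reason JOB_ID [MESSAGES_FILE]
# Prints the name of the first limit the job exceeded (e.g. s_vmem),
# or "unknown" if the messages file has no matching entry.
limit_reason() {
    job_id=$1
    messages=${2:-/var/spool/sge/$(hostname)/messages}

    # Same pattern as in the mail-wrapper: matches "job 53" and "job 53.7".
    line=$(grep -E "[|]job ${job_id}([.][[:digit:]]+)? exceed" "$messages" 2>/dev/null | head -n 1)

    if [ -z "$line" ]; then
        echo "unknown"
        return 0
    fi

    # The limit name is the first double-quoted field of the entry,
    # e.g. ... job soft limit "s_vmem" of queue "all.q@myexechost" ...
    echo "$line" | sed -e 's/^[^"]*"\([^"]*\)".*$/\1/'
}
```

In an epilog you could then call something like limit_reason "$JOB_ID" (assuming JOB_ID is set in the epilog's environment on your installation) and write the result wherever your workflow can pick it up, e.g. a file next to the job's output.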

HTH - Reuti

> Is there an (easy) solution to this problem?
> Thanks
> -Jan
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=230680
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

