[GE users] Job does not lock on exited with 100 error code when submitted using drmaa

templedf dan.templeton at sun.com
Mon Sep 21 15:27:28 BST 2009


I can confirm the behavior.  What's more, looking at the qacct output 
for such a DRMAA job, I see:

 > qacct -j 132
==============================================================
qname        test.q             
hostname     ultra20            
...
qsub_time    Mon Sep 21 07:16:36 2009
start_time   Mon Sep 21 07:16:49 2009
end_time     Mon Sep 21 07:16:49 2009
granted_pe   NONE               
slots        1                  
failed       30  : rescheduling on application error
exit_status  100
...

So, it looks like the exit 100 is recognized as an application error, but 
the job isn't treated that way afterwards.  (failed = 30 means that the job 
died by exiting 100.)  This could actually be a DRMAA-specific feature.  In 
DRMAA, there's no way to deal with a job that's in error state, so we 
went out of our way to try to prevent jobs from entering the error 
state.  I don't recall there being anything in there about blocking exit 
100 for DRMAA jobs, and I don't see anything like that in the source 
code, but maybe one of the other engineers knows something that I don't 
(or don't remember).  Otherwise, it's a really odd bug.
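
For anyone who wants to scan accounting records for this case, here is a 
minimal Python sketch.  It assumes the qacct -j output has the key/value 
layout quoted above.  The names parse_qacct and 
exited_with_application_error are hypothetical helpers, not part of Grid 
Engine or DRMAA, and the sample record only reuses the fields shown above.

```python
def parse_qacct(text):
    """Parse the 'key   value' lines of one qacct -j record into a dict."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the ===== separator, and elided "..." lines.
        if not line or line.startswith("=") or line.startswith("..."):
            continue
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        record[key] = value
    return record


def exited_with_application_error(record):
    """True if the record shows failed code 30 together with exit_status 100."""
    failed = record.get("failed", "0").split()
    # The failed field looks like "30  : rescheduling on application error";
    # only the leading numeric code is compared.
    failed_code = failed[0] if failed else "0"
    return failed_code == "30" and record.get("exit_status") == "100"


# Sample record built from the fields quoted in this message.
SAMPLE = """\
==============================================================
qname        test.q
hostname     ultra20
qsub_time    Mon Sep 21 07:16:36 2009
start_time   Mon Sep 21 07:16:49 2009
end_time     Mon Sep 21 07:16:49 2009
granted_pe   NONE
slots        1
failed       30  : rescheduling on application error
exit_status  100
"""

if __name__ == "__main__":
    rec = parse_qacct(SAMPLE)
    print(exited_with_application_error(rec))
```

This only inspects accounting output after the fact; it does not change 
whether the job is put into the error state.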

Daniel

levsha wrote:
> Hi!
> I use 6.2u2_1 on FreeBSD. When I submit a job using qsub:
>
> # qsub -b y -shell n sh -c 'exit 100'
>
> everything works properly: after it runs I receive the mail "GE 6.2u2_1:
> Job 2 failed" and the job locks in the error state:
>
> #qstat
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
> -----------------------------------------------------------------------------------------------------------------
>       2 0.55500 sh         levsha       Eqw   09/17/2009 17:14:31                                    1        
>
> And when I submit the job using DRMAA (from a C program), I receive the
> same mail "GE 6.2u2_1: Job 3 failed" (I compared the messages without
> timestamps using diff: only the job id, pid and times differ), but there
> is no job in the queue in the error state.
>
> Locking jobs in the error state is very important for me: I submit jobs
> jail and do not want the next job to start when the previous one failed.
>
> The C program source code and a file with the error email messages are attached.
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=218202
