[GE users] job exit codes and tasks pending via hold_jid arguments

cjf001 john.foley at motorola.com
Tue Dec 22 20:17:58 GMT 2009


Reuti, Chris -

Just thought I'd comment on this, as I've done some of it. I
think Reuti hit the subtle point when he said "avoid the necessity
to make all exechosts also submit hosts (or to use rsh/ssh to the
headnode)". This is the problem I ran into when I tried to use the
epilog script to submit a followup job: the epilog runs on the
compute (execute) host, as the user, so it can't submit anything
unless that host is also a submit host. In my case, I didn't want
all the execute hosts to be submit hosts either, so I was stuck.

What I ended up doing was letting the epilog script write a "flag"
file to a special common location (visible to all, writable by all),
which is read periodically (every 10 seconds) by a perl daemon
process I have running on the SGE master to handle various things
that SGE doesn't do natively. When this daemon sees a flag file,
it spawns a child process (also perl) to do the dirty work of
figuring out if the first job finished OK, and if so, submitting
the followup job. Of course, it removes the flag file (actually
renames it) once it has acted on it.
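
For anyone who wants to try the same thing, here is a minimal sketch of
one polling pass of such a daemon, in Python rather than perl. The flag
format (one line, "<job_id> <exit_status>"), the "*.flag" naming, and
the followup_<job_id>.sh script name are all assumptions made up for
the illustration, not what the daemon described above actually uses.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, List, Optional, Tuple


def parse_flag(path: Path) -> Optional[Tuple[str, int]]:
    """Read a flag file; assumed format is one line: '<job_id> <exit_status>'."""
    try:
        job_id, status = path.read_text().split()
        return job_id, int(status)
    except (OSError, ValueError):
        return None  # unreadable or half-written flag


def process_flags(flag_dir: Path, submit: Callable[[str], None]) -> List[str]:
    """One polling pass. Returns the job ids a follow-up was submitted for."""
    submitted = []
    for flag in sorted(flag_dir.glob("*.flag")):
        parsed = parse_flag(flag)
        if parsed is None:
            continue  # leave the flag in place and retry on the next pass
        job_id, status = parsed
        if status == 0:  # first job finished OK, so submit the follow-up
            submit(job_id)
            submitted.append(job_id)
        # Rename (rather than delete) so the flag is not acted on twice.
        flag.rename(flag.with_suffix(".done"))
    return submitted


def submit_followup(job_id: str) -> None:
    """Hypothetical submit step; the script name is invented for this sketch."""
    subprocess.run(["qsub", f"followup_{job_id}.sh"], check=True)


def run_daemon(flag_dir: Path, interval: float = 10.0) -> None:
    """Poll forever; meant to run on a submit host such as the SGE master."""
    while True:
        process_flags(flag_dir, submit_followup)
        time.sleep(interval)
```

Passing the submit step in as a function keeps the polling logic
testable on a machine without SGE; run_daemon(Path("/some/shared/dir"))
would be the long-running entry point.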

A fair amount of work, but doable in a day or two !  :)

     John


reuti wrote:
> Hi,
>
> Am 22.12.2009 um 19:33 schrieb craffi:
>
>> Given this page:
>>
>> http://wikis.sun.com/display/GridEngine/Troubleshooting+and+Error+Messages
>>
>> ... it seems to show tables that say that any exit code other than 0,
>> 100 or 99 from a job will indicate a successful job execution to SGE.
>>
>> Exit 100 seems nice but from memory I recall that it will put the job
>> into Eqw state or something similar that requires human interaction to
>> manually remove.
>>
>> The reason this came up is a simple workflow that uses job
>> dependencies. There are times when the first job encounters a
>> specific error case and exits with a special workflow-meaningful
>> code of 255. It looks like SGE does not see this as an actual
>> failure and thus allows the dependent jobs to go on for dispatch
>> and execution.
>
> Correct. The job dependency only checks whether the job has left the
> system. If the exit code is 100 or 99 it would wait, of course. But
> with any other error it will start the follow-up jobs.
>
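
The rule discussed here can be written down as a small lookup. This is
only a sketch of the behaviour as described in this thread (99 requeues
the job, 100 puts it into an error state, anything else lets the job
leave the system); the Outcome names and function names are invented
for the illustration.

```python
from enum import Enum


class Outcome(Enum):
    FINISHED = "job leaves the system; -hold_jid dependents are released"
    REQUEUED = "job is rescheduled; dependents keep waiting"
    ERROR_STATE = "job is held in an error state; dependents keep waiting"


def classify_exit(code: int) -> Outcome:
    """Map a job's exit code to how SGE treats it, per the discussion above."""
    if code == 99:
        return Outcome.REQUEUED
    if code == 100:
        return Outcome.ERROR_STATE
    # Every other code, including 255, counts as "the job is done".
    return Outcome.FINISHED


def releases_dependents(code: int) -> bool:
    """True if jobs waiting via -hold_jid would be allowed to start."""
    return classify_exit(code) is Outcome.FINISHED
```

So releases_dependents(255) is True, which is exactly the problem in
the workflow described: a workflow-level failure still releases the
dependent jobs.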
>
>> Looking for the proper way to exit on error in a way that does not
>> make the job linger and also does not allow any jobs with -hold_jid
>> set to execute when the upstream task leaves the system.
>>
>> Epilog script? Qalter or qdel from within the first job? Something
>> else?
>
> Well, this looks like a use case for an idea I put on the list a
> couple of days ago:
>
> http://gridengine.sunsource.net/ds/viewMessage.do?dsMessageId=233910&dsForumId=38
>
> You could just use an intermediate job. I mean, the follow-up job of
> the real job is just a fake one whose only purpose is to make
> decisions (it could even run in a special queue which is always
> available). In my pseudo-code this is the qdecide, and it can use
> qdel, maybe by name or so (with its own name as a prefix). As this
> fake job won't use resources, it can also run on the headnode, which
> avoids the necessity to make all exechosts also submit hosts (or to
> use rsh/ssh to the headnode).
>
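
A rough sketch of what such a decision job could do, under several
assumptions: that the real job's exit status can be read from
"qacct -j <job_id>" by the time the decision job runs (accounting
records may lag), that the downstream jobs share a name prefix which
the decision job itself does not match, and that qdel is given that
prefix as a wildcard pattern. The function names and the prefix are
hypothetical; this is not the qdecide from the linked post.

```python
import subprocess
from typing import List, Optional


def parse_exit_status(qacct_output: str) -> Optional[int]:
    """Find the 'exit_status <n>' line in qacct -j output, if present."""
    for line in qacct_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "exit_status":
            return int(fields[1])
    return None  # no accounting record found (yet)


def cleanup_command(exit_status: Optional[int], name_prefix: str) -> Optional[List[str]]:
    """Return the qdel command for the downstream jobs, or None to let them run."""
    if exit_status == 0:
        return None
    # Assumption: a missing record is treated like a failure.
    return ["qdel", f"{name_prefix}*"]


def decide(real_job_id: str, name_prefix: str) -> None:
    """Body of the decision job; must run on a host allowed to call qdel."""
    result = subprocess.run(
        ["qacct", "-j", real_job_id], capture_output=True, text=True
    )
    cmd = cleanup_command(parse_exit_status(result.stdout), name_prefix)
    if cmd is not None:
        subprocess.run(cmd, check=False)
```

The downstream jobs would be submitted with -hold_jid pointing at the
decision job rather than at the real job, so they are either deleted
or released once the decision job leaves the system.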
> I don't know when I will have time to implement this in a complete
> way. For now I'm not sure about the syntax for such
> workflow-in-job-dependencies.
>
> You can also check:
>
> http://wildfire.bii.a-star.edu.sg/
>
> but AFAIR it will submit the follow-up job (or not) only when the
> first one has finished. And the controller has to run all the time.
> Unfortunately the project seems to be dead, as I didn't get any reply.
>
> -- Reuti
>
>
>> Regards,
>> Chris
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=234638
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=234645
