[GE users] Determining the failure states of completed jobs in SGE 5.3

Fred Youhanaie fly at anydata.co.uk
Thu Jun 7 11:32:42 BST 2007


Dennis Williams wrote:
> Hello,
>  
> My team are in the process of building an application that submits jobs to node clusters running SGE 5.3. One of the requirements is to monitor the status of a job (throughout its lifecycle) that has been submitted to the SGE.
>  
> Using the "qstat" command it is possible to determine if a job is currently waiting in a queue or running on a node, but once the job has completed I would like to be able to determine if the job has completed successfully or with errors. I understand that once a job has completed two files are written on the compute node containing the stdout and stderr, but our application will not have access to these nodes as they are on private networks.
>  
> So my question is:
>  
> 1) Does SGE 5.3 provide commands (or techniques) that would enable clients to determine if a job has completed with or without errors?

'qstat -s z' will list all the recently terminated jobs.

'qacct -j <jobid>' will give you detailed stats about completed jobs, but
watch out for multiple entries for tightly integrated parallel jobs.

> 2) Does SGE 5.3 provide commands (or techniques) that would enable clients to access the stdout and stderr files for jobs that have completed?

Not directly. You could use the epilog facility to transfer the output
files back to the head node, etc.

> The documentation suggests that the "qacct" command provides the client with information about jobs that have completed. However one of our cluster administrators has explained that this command can only be run on the "head  node" which is not an acceptable option for us. 

I believe qacct can be run on a submit host, similar to qsub.


You could also use DRMAA to submit your jobs, see http://drmaa.org, but 
I'm not sure if it is available for 5.3.

HTH

Cheers
f.
