[GE users] checking job return status in epilog script

pollinger harald.pollinger at sun.com
Fri Jun 12 13:01:35 BST 2009


madpower wrote:
> Dear John,
> 
>> here's a couple of thoughts/ideas for you -
> thank you very much. I had almost the same idea, but I was too busy
> with other work at the time, so I did not test it in detail.
> 
>> first, when I was experimenting with this yesterday, I put the
>> following line in my epilog script:
> Anyhow, there is one big problem with this approach. Once a job
> terminates, everything is fine in our cluster. The problem is that the
> job never terminates, so I cannot automatically copy everything from
> $SGE_JOB_SPOOL_DIR.
> Nevertheless, I am going to do something like this for jobs that
> terminate cleanly, just to see exactly which files are in this
> directory and whether the information written there is worth
> investigating further.
> 
>> Note that I extract the exit
>> status from the "usage" file.
> This is one of the problems I have. I cannot find any file named usage
> (or similar) anywhere on the execution host while jobs are running. So
> I fear that this file is only created when a job terminates.

Right, it's written by the sge_shepherd after the job itself has 
terminated. The file should be located in the directory 
$EXECD_SPOOLDIR/$HOST/active_jobs/$JOB_ID.
$EXECD_SPOOLDIR is defined in the host or global configuration (see 
"qconf -sconf $HOST" or "qconf -sconf" output), $HOST is the name of the 
execution host.
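
A minimal sketch of how an epilog script might pick the exit status out
of that file. It assumes the usage file holds one key=value pair per
line and that the key is called exit_status; both the format and the
helper name read_exit_status are assumptions for illustration, not
something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the exit_status value from a shepherd
# "usage" file. Assumes one key=value pair per line.
read_exit_status() {
    usage_file=$1
    # Fail early if the file is missing or unreadable (e.g. job still running).
    [ -r "$usage_file" ] || return 1
    # Keep the last exit_status line; return non-zero if the key is absent.
    awk -F= '
        $1 == "exit_status" { value = $2; found = 1 }
        END { if (!found) exit 1; print value }
    ' "$usage_file"
}
```

In an epilog this could be called as
read_exit_status "$SGE_JOB_SPOOL_DIR/usage", since that variable points
at the job's spool directory. For a job that never terminates the call
simply fails, because the file does not exist yet.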

But it will be there only for a fraction of a second unless you
specify "execd_params keep_active=true" in the configuration, which
prevents the execution daemon from deleting the job directory after the
job has finished.
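
To check whether that setting is already active, something like the
following could be used. It is only a sketch: the helper name
has_keep_active is made up, it expects the text of "qconf -sconf" on
stdin, and it assumes execd_params appears on a single line with the
parameter written as keep_active=true (compared case-insensitively, in
case it is written in upper case).

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the configuration text on
# stdin has an execd_params line containing keep_active=true.
# Wrapped or continued lines are not handled.
has_keep_active() {
    awk '
        tolower($1) == "execd_params" && tolower($0) ~ /keep_active=true/ { ok = 1 }
        END { exit ok ? 0 : 1 }
    '
}
```

Typical use would be: qconf -sconf "$HOST" | has_keep_active && echo "job
directories are kept". If the host has no local setting, the global
configuration ("qconf -sconf") would have to be checked the same way.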


> However, maybe there is other useful information in the spool directory.
> 
>> PS - oh, one other thing I noticed in your post --
>> you mentioned that your problem jobs are in state "S", which
>> you called "sleeping" -- from the way I understand the qstat
>> output, capital s ("S") means that the queue is suspended
>> (as opposed to a small s ("s") which means that the job is
>> suspended). Not sure if that's just the term you use for this
>> or not, but I thought I'd point it out - it could be that
>> your problem is that the queue is getting suspended for some
>> reason....
> Well, thanks for pointing this out. Maybe I was a little imprecise on
> this topic. In my case the "S" comes from the Unix command "top" (or
> "ps faux") on the console of the execution host. So the jobs are listed
> as "running" in qstat as usual, but they are not actually running on
> the execution host.

So the job (the process) still runs on the execution host, but SGE
doesn't list it as a running job any more? This can happen, but then
"qacct -j <job_id>" should provide the accounting data of the job up to
the point where it detached itself from SGE.
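
As a rough sketch, the two accounting fields that matter most here
could be pulled out of the qacct output like this. The helper name
summarize_qacct is made up, and it assumes the output consists of
"name value" lines that include fields named failed and exit_status.

```shell
#!/bin/sh
# Hypothetical helper: read "qacct -j <job_id>" style output on stdin
# and print the failed and exit_status fields on one line.
# If several records are present, the values of the last one win.
summarize_qacct() {
    awk '
        $1 == "failed"      { failed = $2 }
        $1 == "exit_status" { status = $2 }
        END {
            # No accounting record found at all: report failure.
            if (failed == "" && status == "") exit 1
            printf "failed=%s exit_status=%s\n", failed, status
        }
    '
}
```

It would be used as: qacct -j <job_id> | summarize_qacct. A non-zero
return means no accounting record was found for the job.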

What and how exactly did you submit, and what does the "pstree" output 
of your job and of the sge_execd look like?

Regards,
Harald



> So I hope to find some useful information in the log files. The system
> logs do not show any problems.
> 
> Anyway, we are now upgrading to the new kernel and Debian 5.0 with its
> included SGE packages, so we hope these problems will soon be history.
> 
> Thanks again,
> Matthias
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=201625
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


-- 
Sun Microsystems GmbH         Harald Pollinger
Dr.-Leo-Ritter-Str. 7         Sun Grid Engine Engineering
D-93049 Regensburg            Phone: +49 (0)941 3075-209  (x60209)
Germany                       Fax: +49 (0)941 3075-222  (x60222)
http://www.sun.com/gridware
mailto:harald.pollinger at sun.com
Registered office:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Munich District Court (Amtsgericht Muenchen): HRB 161028
Managing directors: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Chairman of the supervisory board: Martin Haering

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=201647
