[GE users] qstat/qacct

yacc143 andreas at kostyrka.org
Thu Feb 12 19:01:35 GMT 2009


On Thu, 12 Feb 2009 13:25:59 -0500,
craffi <dag at sonsorol.org> wrote:

> Other suggestions of varying worth:
> 
> - use "qstat" and "qacct" with "-xml" to ease parsing tasks at least
> 
> - Have you looked at 'qevent'? I think this is an external binary
> you can compile into SGE to register for job related events

Nope, where can I find it?
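
On the "-xml" suggestion quoted above, here is a minimal sketch of how the
output might be parsed. The element names (job_list, JB_job_number, JB_name,
state) and the sample document are assumptions about what "qstat -xml"
emits, not something confirmed in this thread; check them against real
output from your installation. The helper name job_states is made up for
illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample. The element names are an ASSUMPTION about the
# "qstat -xml" output format and must be verified against real output.
SAMPLE = """<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>101</JB_job_number>
      <JB_name>align</JB_name>
      <state>r</state>
    </job_list>
  </queue_info>
  <job_info>
    <job_list state="pending">
      <JB_job_number>102</JB_job_number>
      <JB_name>merge</JB_name>
      <state>qw</state>
    </job_list>
  </job_info>
</job_info>
"""


def job_states(xml_text):
    """Map job number -> (job name, state code) for every job_list element."""
    root = ET.fromstring(xml_text)
    jobs = {}
    # iter() walks the whole tree, so the nesting depth does not matter.
    for job in root.iter("job_list"):
        number = job.findtext("JB_job_number")
        if number is None:
            continue  # skip entries without a job number
        jobs[int(number)] = (job.findtext("JB_name"), job.findtext("state"))
    return jobs


if __name__ == "__main__":
    for number, (name, state) in sorted(job_states(SAMPLE).items()):
        print(number, name, state)
```

One call then covers all jobs at once, instead of one "qstat -j" per job.
Note that tasks of an array job share a job number, so this simple mapping
keeps only the last task seen.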

> 
> - If your "real time" need can be dialed back to "every 30 seconds
> or so ..." then you may want to just do a large scale qstat at a
> periodic interval with that output redirected to a file that you can
> process without hammering the qmaster all the time, something like
> "qstat -F -u '*' -xml > /opt/sge-status-cache.xml"

Well, my problem is that I need to detect problems as fast as possible,
as postprocessing has to happen (in practice the postprocessing is
mostly copying the results that can be a couple of GBs out of the
cluster).
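
For reference, the periodic cache idea quoted above could look roughly like
the sketch below. It is only an illustration under stated assumptions: the
function names, the 30 second default, and the temp-file handling are made
up here, and the qstat command line is the one from the quoted suggestion.
The output is written to a temporary file and then renamed, so a reader of
the cache never sees a half-written file.

```python
import os
import subprocess
import tempfile
import time

# Command line taken from the quoted suggestion.
QSTAT_CMD = ["qstat", "-F", "-u", "*", "-xml"]


def refresh_cache(cache_path, cmd=QSTAT_CMD):
    """Run cmd once and atomically replace cache_path with its output."""
    output = subprocess.run(cmd, check=True, capture_output=True).stdout
    directory = os.path.dirname(os.path.abspath(cache_path))
    # The temp file lives in the same directory so the rename stays on one
    # filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(output)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp_path, cache_path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def poll_forever(cache_path, interval=30.0):
    """Refresh the cache every `interval` seconds (hypothetical default)."""
    while True:
        refresh_cache(cache_path)
        time.sleep(interval)
```

A frontend would then read the cache file instead of calling qstat itself.
The price is that status information can be up to one interval old, which
is the latency concern raised in the reply above.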


> 
> - If you use classic spooling you may be able to probe the
> filesystem directly to gain the info you need without having to go
> through the qmaster all the time. If you rsync or replicate this
> spool elsewhere you can hammer that replicated copy without
> clobbering SGE.
> 
> - Have you looked at DRMAA if you are building automated systems for  
> workflows or job submission? It's the API for job submission and  
> control for SGE and other DRMAA compliant DRM systems

Hmm, what's the scope of DRMAA? Does it cover copying data to/from the
grid? (The Wikipedia article on it is somewhat lacking in details :( )

Thanks,

Andreas
> 
> 
> 
> 
> On Feb 12, 2009, at 1:15 PM, Andreas Kostyrka wrote:
> 
> > My problem is not post-hoc analysis; it's more a GUI frontend for the
> > jobs being submitted. For this I need realtime (more or less)
> > information about the job status.
> >
> > -) I need to know that the job is pending. (Currently using my
> > logging inside the job proper, I'm just assuming that the job will
> > eventually run.)
> > -) I'm strongly interested when the job starts and ends.
> > -) I'm strongly interested in the exit status of my job.
> >
> > The only way I found, and even that was imperfect, was running
> > qstat and qacct with -j JOBNAME and parsing the output. That worked
> > fine in the beginning, but as always, things grow. At around 1000
> > jobs, running these commands periodically started to fail (running
> > them and processing the output simply took too long), so I
> > implemented the workaround of logging the start and exit of jobs.
> > It works fine, although it sounds like something that should be
> > part of SGE proper, and it fails on the first point: if some job
> > fails before it starts to run, bad things happen.
> >
> > Anyway, your answer suggests that my idea of logging the start/exit
> > myself is a sound decision.
> >
> > Thanks,
> >
> > Andreas
> >
> > On Thu, 12 Feb 2009 11:44:00 -0500,
> > craffi <dag at sonsorol.org> wrote:
> >
> >> You can directly process the accounting file yourself for bulk
> >> analysis of completed jobs or slurp the file into a simple SQL
> >> database. The SGE ARCo system can do this on a larger scale. I've
> >> personally written perl scripts in the past that took the
> >> accounting file and stuffed it into a simple mysql database.
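
As an illustration of the "slurp the accounting file into a simple SQL
database" idea, here is a minimal sketch using SQLite instead of the MySQL
setup mentioned above. The field order is my understanding of the first 13
colon-separated fields of the SGE accounting file; treat it as an assumption
and verify it against accounting(5) for your version. The table and function
names are made up for this example.

```python
import sqlite3

# ASSUMED order of the first 13 colon-separated accounting fields.
# "group" is spelled "grp" because GROUP is an SQL keyword.
FIELDS = ["qname", "hostname", "grp", "owner", "job_name", "job_number",
          "account", "priority", "submission_time", "start_time",
          "end_time", "failed", "exit_status"]


def load_accounting(lines, conn):
    """Insert one row per accounting record; return the number of rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS jobs (%s)" % ", ".join(FIELDS))
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split(":")
        if len(parts) < len(FIELDS):
            continue  # skip truncated records
        rows.append(parts[:len(FIELDS)])
    placeholders = ", ".join("?" * len(FIELDS))
    conn.executemany("INSERT INTO jobs VALUES (%s)" % placeholders, rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    sample = ["all.q:node1:users:andreas:myjob:42:sge:0:"
              "1234400000:1234400010:1234400100:0:3:90\n"]
    conn = sqlite3.connect(":memory:")
    load_accounting(sample, conn)
    print(conn.execute("SELECT job_name, exit_status FROM jobs").fetchall())
```

All values are stored as text here. A naive split on ":" also assumes that
no field (for example the job name) contains a colon itself.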
> >>
> >> One of the best "full SGE life cycle" implementations I've ever
> >> seen did this:
> >>
> >> - Global prolog scripts perform an SQL insert on a central database
> >> for all newly dispatched jobs, capturing significant info about the
> >> job environment
> >> - Global epilog script also does an SQL update to log exit code and
> >> resource consumption data
> >>
> >> Using the prolog/epilog hooks let this group build a custom system
> >> that tracked the full life cycle of each job. More importantly
> >> though, enough data was captured that the group could resubmit and
> >> repeat any job if needed in *exactly* the same way it was
> >> run/submitted previously.
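
The prolog/epilog pattern described above might be sketched as follows.
This is not that group's implementation: SQLite stands in for their central
database, the table layout and function names are invented, and how a real
epilog obtains the job's exit status is left open (it is passed in as a
plain parameter here). JOB_ID and JOB_NAME are the environment variables SGE
sets for a job.

```python
import sqlite3
import time


def init_db(conn):
    """Create the (hypothetical) tracking table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_log ("
        "job_id TEXT PRIMARY KEY, job_name TEXT, "
        "started REAL, ended REAL, exit_status INTEGER)")
    conn.commit()


def prolog(conn, env):
    """Record a newly dispatched job; env would be os.environ in practice."""
    conn.execute(
        "INSERT OR REPLACE INTO job_log (job_id, job_name, started) "
        "VALUES (?, ?, ?)",
        (env["JOB_ID"], env.get("JOB_NAME"), time.time()))
    conn.commit()


def epilog(conn, env, exit_status):
    """Record the end time and exit status of a finished job."""
    conn.execute(
        "UPDATE job_log SET ended = ?, exit_status = ? WHERE job_id = ?",
        (time.time(), exit_status, env["JOB_ID"]))
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    env = {"JOB_ID": "42", "JOB_NAME": "myjob"}
    prolog(conn, env)
    epilog(conn, env, 0)
    print(conn.execute("SELECT job_id, exit_status FROM job_log").fetchall())
```

A row with a start time but no end time then means "running", which gives a
frontend the start, end, and exit status without polling the qmaster. It
does not cover jobs that fail before the prolog ever runs.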
> >>
> >> -Chris
> >>
> >>
> >>
> >> On Feb 12, 2009, at 11:16 AM, yacc143 wrote:
> >>
> >>> I wondered if there is some way to query the status of submitted
> >>> jobs?
> >>>
> >>> I've been doing a qstat -j JOBID (that would yield the job if it's
> >>> running or pending or failed to start), and a qacct -j JOBNAME to
> >>> figure out the exit status of the jobs.
> >>>
> >>> The above has proven to be too slow (it works fine for hundreds of
> >>> jobs but breaks miserably when scaled to 20000 qsubs :( ).
> >>>
> >>> Now I'm using a trick: the submitted job logs its own start
> >>> time and end time/exit status, but that's quite dirty.
> >>>
> >>> So what's the correct way to track a job from submission till
> >>> "exit status", for a potentially quite large collection of jobs?
> >>>
> >>> TIA,
> >>>
> >>> Andreas
> >>
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=104206
> 
> To unsubscribe from this discussion, e-mail:
> [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=104229
