[GE users] qstat/qacct

reuti reuti at staff.uni-marburg.de
Thu Feb 12 19:47:26 GMT 2009


On 12.02.2009 at 20:01, yacc143 wrote:

> On Thu, 12 Feb 2009 13:25:59 -0500
> craffi <dag at sonsorol.org> wrote:
>
>> Other suggestions of varying worth:
>>
>> - use "qstat" and "qacct" with "-xml" to ease parsing tasks at least
>>
>> - Have you looked at 'qevent'? I think this is an external binary
>> you can compile into SGE to register for job related events
>
> Nope, where can I find it?
>
>>
>> - If your "real time" need can be dialed back to "every 30 seconds
>> or so ..." then you may want to just do a large scale qstat at a
>> periodic interval with that output redirected to a file that you can
>> process without hammering the qmaster all the time, something like
>> "qstat -F -u '*' -xml > /opt/sge-status-cache.xml"
>
> Well, my problem is that I need to detect problems as fast as
> possible, as postprocessing has to happen (in practice the
> postprocessing is mostly copying the results, which can be a couple
> of GBs, out of the cluster).

I would do this in a queue epilog, so that it is part of the job. Your
application could either attach some meta-information to the job with
qalter (qalter -ac / -sc / -dc) or write to a result file in the
job's directory. The epilog can then process this information to
copy the files back and/or send emails in addition to the ones that
SGE sends to the user.
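
A minimal sketch of the result-file variant, written in Python. It rests on several assumptions: the job writes a JSON file named job_result.json (a made-up name, not an SGE convention) into its working directory, listing its status and output files; the job was submitted with -cwd so that SGE_O_WORKDIR points at that directory; and RESULT_DEST is a hypothetical variable naming the copy target. Adapt all three to your setup.

```python
import json
import os
import shutil
from pathlib import Path

RESULT_FILE = "job_result.json"  # hypothetical name chosen for this sketch


def collect_results(workdir, dest_root, job_id):
    """Copy the files named in the job's result file to dest_root/job_id."""
    result_path = Path(workdir) / RESULT_FILE
    if not result_path.exists():
        # The job never wrote a result file, e.g. it failed early.
        return {"job_id": str(job_id), "status": "no-result-file", "copied": []}

    info = json.loads(result_path.read_text())
    dest = Path(dest_root) / str(job_id)
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for name in info.get("outputs", []):
        src = Path(workdir) / name
        if src.is_file():  # skip outputs the job promised but did not produce
            shutil.copy2(src, dest / src.name)
            copied.append(src.name)
    return {
        "job_id": str(job_id),
        "status": info.get("status", "unknown"),
        "copied": copied,
    }


# Only act when started by SGE, which sets JOB_ID in the job environment.
if __name__ == "__main__" and "JOB_ID" in os.environ:
    summary = collect_results(
        os.environ.get("SGE_O_WORKDIR", os.getcwd()),
        os.environ.get("RESULT_DEST", "/tmp/sge-results"),  # hypothetical
        os.environ["JOB_ID"],
    )
    print(json.dumps(summary))
```

The returned summary could also be used as the body of an extra notification mail; that part is left out here.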

-- Reuti


>> - If you use classic spooling you may be able to probe the
>> filesystem directly to gain the info you need without having to go
>> through the qmaster all the time. If you rsync or replicate this
>> spool elsewhere you can hammer that replicated copy without
>> clobbering SGE.
>>
>> - Have you looked at DRMAA if you are building automated systems for
>> workflows or job submission? It's the API for job submission and
>> control for SGE and other DRMAA compliant DRM systems
>
> Hmm, what's the scope of DRMAA? Does it cover copying data to/from the
> grid? (The Wikipedia article on it is somewhat lacking in details :( )
>
> Thanks,
>
> Andreas
>>
>> On Feb 12, 2009, at 1:15 PM, Andreas Kostyrka wrote:
>>
>>> My problem is not post-hoc analysis; it's more a GUI frontend for the
>>> jobs being submitted. For this I need (more or less) real-time
>>> information about the job status.
>>>
>>> -) I need to know that the job is pending. (Currently using my
>>> logging inside the job proper, I'm just assuming that the job will
>>> eventually run.)
>>> -) I'm strongly interested when the job starts and ends.
>>> -) I'm strongly interested in the exit status of my job.
>>>
>>> The only way I found, and even that was imperfect, was running
>>> qstat and qacct with -j JOBNAME and parsing the output. That worked
>>> fine in the beginning, but as always, things grow. At around 1000
>>> jobs, running these commands periodically started to fail (running
>>> them and processing the output simply took too long), so I've
>>> implemented the workaround of logging the start and exit of jobs.
>>> It works fine, although it sounds like something that should be
>>> part of SGE proper, and it fails on the first point, i.e. if some
>>> job fails before it starts to run, bad things happen.
>>>
>>> Anyway, your answer suggests that my idea of logging the start/exit
>>> myself is a sound decision.
>>>
>>> Thanks,
>>>
>>> Andreas
>>>
>>> On Thu, 12 Feb 2009 11:44:00 -0500
>>> craffi <dag at sonsorol.org> wrote:
>>>
>>>> You can directly process the accounting file yourself for bulk
>>>> analysis of completed jobs or slurp the file into a simple SQL
>>>> database. The SGE ARCo system can do this on a larger scale. I've
>>>> personally written Perl scripts in the past that took the
>>>> accounting file and stuffed it into a simple MySQL database.
>>>>
>>>> One of the best "full SGE life cycle" implementations I've ever
>>>> seen did this:
>>>>
>>>> - Global prolog scripts perform an SQL insert on a central database
>>>> for all newly dispatched jobs, capturing significant info about the
>>>> job environment
>>>> - Global epilog script also does an SQL update to log exit code and
>>>> resource consumption data
>>>>
>>>> Using the prolog/epilog hooks let this group build a custom system
>>>> that tracked the full life cycle of each job. More importantly
>>>> though, enough data was captured that the group could resubmit and
>>>> repeat any job if needed in *exactly* the same way it was
>>>> run/submitted previously.
>>>>
>>>> -Chris
>>>>
>>>> On Feb 12, 2009, at 11:16 AM, yacc143 wrote:
>>>>
>>>>> I wondered if there is some way to query the status of submitted
>>>>> jobs?
>>>>>
>>>>> I've been doing a qstat -j JOBID (that would yield the job if it's
>>>>> running or pending or failed to start), and a qacct -j JOBNAME to
>>>>> figure out the exit status of the jobs.
>>>>>
>>>>> The above has proven to be too slow (it works fine for hundreds of
>>>>> jobs but breaks miserably when scaled to 20000 qsubs :( ).
>>>>>
>>>>> Now I'm using a trick in making the submitted job log its start
>>>>> time and end time/exit status, but that's quite dirty.
>>>>>
>>>>> So what's the correct way to track a job from submission till
>>>>> "exit status", for a potentially quite large collection of jobs?
>>>>>
>>>>> TIA,
>>>>>
>>>>> Andreas
>>>>
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=104206
>>
>> To unsubscribe from this discussion, e-mail:
>> [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=104249

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


