[GE users] qstat/qacct

craffi dag at sonsorol.org
Thu Feb 12 18:25:59 GMT 2009


Other suggestions of varying worth:

- use "qstat" and "qacct" with "-xml" to ease parsing tasks at least

- Have you looked at 'qevent'? I think this is an external binary you  
can compile into SGE to register for job related events

- If your "real time" need can be dialed back to "every 30 seconds or  
so ..." then you may want to just run a large-scale qstat at a periodic  
interval, with the output redirected to a file that you can process  
without hammering the qmaster all the time, something like:
"qstat -F -u '*' -xml > /opt/sge-status-cache.xml"

- If you use classic spooling you may be able to probe the filesystem  
directly to gain the info you need without having to go through the  
qmaster all the time. If you rsync or replicate this spool elsewhere  
you can hammer that replicated copy without clobbering SGE.

- Have you looked at DRMAA if you are building automated systems for  
workflows or job submission? It's the API for job submission and  
control for SGE and other DRMAA compliant DRM systems
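
For the periodic qstat cache suggested above, here is a minimal Python
sketch of the processing side. It assumes the XML wraps each job in a
<job_list> element containing <JB_job_number> and <state> children;
check that against the output of your own SGE version, since the exact
nesting differs between plain "qstat -xml" and "qstat -F -xml". The
function names and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def job_states(root):
    """Map job number -> state code for every <job_list> element.

    Assumes each job is a <job_list> element with <JB_job_number> and
    <state> children. root.iter() searches at any depth, so the nesting
    around <job_list> does not matter here.
    Array job tasks share one job number; the last one seen wins.
    """
    states = {}
    for job in root.iter("job_list"):
        number = job.findtext("JB_job_number")
        if number is not None:
            states[number.strip()] = (job.findtext("state") or "?").strip()
    return states


def load_cache(path="/opt/sge-status-cache.xml"):
    """Parse the cache file written by the periodic qstat run."""
    return job_states(ET.parse(path).getroot())


# Hand-written sample in the assumed layout, not real qstat output.
SAMPLE = """\
<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>101</JB_job_number>
      <JB_name>align</JB_name>
      <state>r</state>
    </job_list>
  </queue_info>
  <job_info>
    <job_list state="pending">
      <JB_job_number>102</JB_job_number>
      <JB_name>merge</JB_name>
      <state>qw</state>
    </job_list>
  </job_info>
</job_info>
"""

if __name__ == "__main__":
    print(job_states(ET.fromstring(SAMPLE)))
```

A job ID that is missing from the map has left the queue, so its exit
status has to come from the accounting data instead. If the GUI reads
the file while qstat is still writing it, the parse will fail; writing
to a temporary file and renaming it into place avoids that.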




On Feb 12, 2009, at 1:15 PM, Andreas Kostyrka wrote:

> My problem is not post-mortem analysis, it's more a GUI frontend for the
> jobs being submitted. For this I need realtime (more or less)
> information about the job status.
>
> -) I need to know that the job is pending. (Currently using my
> logging inside the job proper, I'm just assuming that the job will
> eventually run.)
> -) I'm strongly interested when the job starts and ends.
> -) I'm strongly interested in the exit status of my job.
>
> The only way I found, and even that was imperfect, was running
> qstat and qacct with -j JOBNAME, and parsing the output. That worked
> fine in the beginning, but as always, things grow. At around 1000 jobs,
> running these commands periodically started to fail (actually running
> them and processing the output simply took too long), so I've
> implemented the workaround of logging the start and exit of jobs.
> Works fine, although it sounds like something that should be part of
> SGE proper, and it fails on the first point, e.g. if some job fails
> before it starts to run, bad things happen.
>
> Anyway, your answer suggests that my idea of logging the start/exit
> myself is a sound decision.
>
> Thanks,
>
> Andreas
>
> On Thu, 12 Feb 2009 11:44:00 -0500,
> craffi <dag at sonsorol.org> wrote:
>
>> You can directly process the accounting file yourself for bulk
>> analysis of completed jobs or slurp the file into a simple SQL
>> database. The SGE ARCo system can do this on a larger scale. I've
>> personally written perl scripts in the past that took the accounting
>> file and stuffed it into a simple mysql database.
>>
>> One of the best "full SGE life cycle" implementations I've ever seen
>> did this:
>>
>> - Global prolog scripts perform an SQL insert on a central database
>> for all newly dispatched jobs, capturing significant info about the
>> job environment
>> - Global epilog script also does an SQL update to log exit code and
>> resource consumption data
>>
>> Using the prolog/epilog hooks let this group build a custom system
>> that tracked the full life cycle of each job. More importantly
>> though, enough data was captured that the group could resubmit and
>> repeat any job if needed in *exactly* the same way it was
>> run/submitted previously.
>>
>> -Chris
>>
>>
>>
>> On Feb 12, 2009, at 11:16 AM, yacc143 wrote:
>>
>>> I wondered if there is some way to query the status of submitted
>>> jobs?
>>>
>>> I've been doing a qstat -j JOBID (that would yield the job if it's
>>> running or pending or failed to start), and a qacct -j JOBNAME to
>>> figure out the exit status of the jobs.
>>>
>>> The above has proven to be too slow (it works fine for hundreds of
>>> jobs but breaks miserably when scaled to 20000 qsubs :( ).
>>>
>>> Now I'm using a trick in making the submitted job log its start
>>> time and end time/exit status, but that's quite dirty.
>>>
>>> So what's the correct way to track a job from submission till "exit
>>> status", for a potentially quite large collection of jobs?
>>>
>>> TIA,
>>>
>>> Andreas
>>
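
For the accounting-file approach mentioned in the quoted message above,
a minimal Python sketch of pulling the exit status out of one record.
It assumes the colon-separated layout described in the accounting(5)
man page, with job_name, job_number, submission_time, start_time,
end_time, failed and exit_status at zero-based positions 4, 5, 8, 9,
10, 11 and 12; verify the positions against the man page for your SGE
version. The function name and the sample record are invented for
illustration.

```python
def parse_accounting_line(line):
    """Return a dict for one accounting record, or None for comments
    and lines too short to be a record.

    Field positions are an assumption based on accounting(5).
    Job names that themselves contain ':' are not handled.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split(":")
    if len(fields) < 13:
        return None
    return {
        "job_name": fields[4],
        "job_number": fields[5],
        "submission_time": int(fields[8]),
        "start_time": int(fields[9]),
        "end_time": int(fields[10]),
        "failed": int(fields[11]),
        "exit_status": int(fields[12]),
    }


# Invented sample record, truncated after the ru_wallclock field.
SAMPLE_LINE = (
    "all.q:node01:users:andreas:align:101:sge:0:"
    "1234450000:1234450010:1234450070:0:0:60"
)

if __name__ == "__main__":
    print(parse_accounting_line(SAMPLE_LINE))
```

Reading the file once and keeping the results keyed by job number
avoids running one qacct process per job, which is the part that stops
scaling.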

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=104206
