[GE users] 'qstat -s z' Not Reporting Actual Slots Used

Reuti reuti at staff.uni-marburg.de
Fri Dec 30 11:14:15 GMT 2005


Hi,

qstat discards some of a job's information once the job has finished.

http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=9180

So the reliable way would be to use qacct (or maybe ARCo, which has
just been made freely available), and qacct will also tell you the
queue that was used. "qstat -s z" shows only the last n jobs, as
specified in the SGE configuration, which might not be the complete
report you are looking for.
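
As a rough sketch of the qacct route: the helper name qacct_summary is
made up for illustration, and it assumes that "qacct -j" prints one
"key value" pair per line (as the "slots" lines quoted below suggest),
that the queue is reported under the key "qname", and that each record
starts with a line of "=" characters. It prints the queue and the
granted slot count for every record it reads.

```shell
# Hypothetical helper: reduce `qacct -j <jobid>` output to "queue slots".
# Assumptions: one "key value" pair per line, the queue is listed under
# "qname", and each accounting record begins with a line of "=" signs.
qacct_summary() {
    awk '
        # A separator line closes the previous record, if there was one.
        /^=+$/        { if (seen) print q, s; q = s = ""; seen = 0; next }
        $1 == "qname" { q = $2; seen = 1 }
        $1 == "slots" { s = $2; seen = 1 }
        # Flush the last record at end of input.
        END           { if (seen) print q, s }
    '
}

# Intended use on a host where SGE is installed (not executed here):
#   qacct -j 38194 | qacct_summary
```

Reading from standard input keeps the helper usable on saved qacct
output as well as on a live pipe.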

Cheers - Reuti


On 30.12.2005 at 03:09, Steve Waltner wrote:

> Sent a second time after changing my subscription e-mail address.  
> My original e-mail is apparently stuck in limbo because my  
> subscription address and the From address didn't match. After  
> changing my subscription address, that should no longer be an issue...
>
> We are using SGE to submit software builds to a group of systems,  
> which has been working extremely well. Over the last six months,  
> deploying SGE and installing an additional build system at all our  
> locations has taken me from receiving daily calls/e-mails  
> complaining about performance to a month since my last complaint.
>
> I'm starting to look at getting some monitoring functions so I can  
> show pretty pictures to management and also try to predict when we  
> are getting close to needing more systems (jobs get queued for a  
> "while" before starting) and have noticed an oddity in the output  
> of "qstat -s z" when it reports the number of slots that a job had  
> used. Even when someone requested more than a single slot and it  
> was granted, it will report that the job only used a single slot.
>
> ======
> ra:~> qconf -sp make
> pe_name           make
> slots             999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   NONE
> stop_proc_args    NONE
> allocation_rule   $pe_slots
> control_slaves    FALSE
> job_is_first_task TRUE
> urgency_slots     min
> ra:~> qhost -q
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> blaze                   sol-sparc64     8  0.01   16.0G    3.1G   29.9G    1.5G
>    all.q                BIP   0/8
> hyperion                sol-sparc64     8  0.16   14.0G    9.3G   16.0G    6.0M
>    all.q                BIP   0/8
> ictgrid001              sol-sparc64     8  0.02   16.0G    1.1G   20.0G    2.0M
>    all.q                BIP   0/8
> ra                      sol-sparc64     8  2.02   16.0G   11.8G   16.0G    1.4G
>    all.q                BIP   0/8
> ra:~>
> ======
>
> As you can see, the configuration is pretty simple. I have a "make"
> parallel environment set up with $pe_slots for the build process.
> This is because our builds are currently being done with the stock  
> GNU Make (instead of qmake). The makefile looks for $NSLOTS and  
> automatically runs the build with a "-j $NSLOTS". We're only using  
> the default all.q queues on our hosts.
>
> Users wanted to see the output of the build in progress, so we are
> using qrsh instead of qsub. Users will typically submit their jobs
> with commands like
>
> qrsh -pe make 1-4 -cwd gmake
> qrsh -pe make 4 -cwd -now n gmake
>
> When running "qstat -s z", it seemed like very few people had  
> started using the -pe option, since almost all the slots reported  
> by "qstat -s z" were reported as 1. Looking closer at specific  
> jobs, I noticed that jobs would execute using more than 1 slot, but  
> then qstat -s z would show only a single slot. I finally tracked
> this (bug?) down to users specifying a range of slots for the -pe
> option. "qstat -s z" always reports the low end of the range given
> by the user, whereas "qacct -j <jobid>" reports the actual number
> of slots that were assigned to the job.
>
> =========
> ra:~> qrsh -pe make 1-4 "hostname; sleep 4"
> ictgrid001
> ra:~> qrsh -pe make 2-4 "hostname; sleep 4"
> ictgrid001
> ra:~> qrsh -pe make 4 "hostname; sleep 4"
> ictgrid001
> ra:~> qrsh -pe make 4-40 "hostname; sleep 4"
> ictgrid001
> ra:~>
> =========
>
> While the various jobs were running, I got the following info from  
> qstat...
>
>   38194 0.55500 hostname;  swaltner     r     12/28/2005 11:20:37 all.q@ictgrid001.ks.lsil.com       4
>   38195 0.55500 hostname;  swaltner     r     12/28/2005 11:21:05 all.q@ictgrid001.ks.lsil.com       4
>   38196 0.55500 hostname;  swaltner     r     12/28/2005 11:21:25 all.q@ictgrid001.ks.lsil.com       4
>   38197 0.55500 hostname;  swaltner     r     12/28/2005 11:22:23 all.q@ictgrid001.ks.lsil.com       8
>
> The following shows the errant output from "qstat -s z" as well as  
> the fact that qacct keeps track of the actual slot usage.
>
> ======
> ra:~> qstat -s z | grep swaltner
>   38194 0.00000 hostname;  swaltner     qw    12/28/2005 11:20:36                                    1
>   38195 0.00000 hostname;  swaltner     qw    12/28/2005 11:21:03                                    2
>   38196 0.00000 hostname;  swaltner     qw    12/28/2005 11:21:21                                    4
>   38197 0.00000 hostname;  swaltner     qw    12/28/2005 11:22:20                                    4
> ra:~> qacct -j 38194 | grep slots
> slots        4
> ra:~> qacct -j 38195 | grep slots
> slots        4
> ra:~> qacct -j 38196 | grep slots
> slots        4
> ra:~> qacct -j 38197 | grep slots
> slots        8
> ra:~>
> ======
>
> Why is there a discrepancy between qstat and qacct?
>
> Also, is there a reason that "qstat -s z" doesn't show the queue  
> that the job was assigned to?
>
> I can probably use qacct to get the information I need for my  
> graphs, but I wanted to find out why qstat was giving incorrect  
> data since that presents the information in a nice format for  
> running from the shell interactively.
>
> Steve
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

