[GE users] 'qstat -s z' Not Reporting Actual Slots Used

Steve Waltner steve.waltner at engenio.com
Fri Dec 30 02:09:19 GMT 2005


Sent a second time after changing my subscription e-mail address. My  
original e-mail is apparently stuck in limbo because my subscription  
address and the From address didn't match. After changing my  
subscription address, that should no longer be an issue...

We are using SGE to submit software builds to a group of systems,  
which has been working extremely well. Over the last six months,  
deploying SGE and installing an additional build system at all our  
locations has taken me from receiving daily calls/e-mails complaining  
about performance to a month since my last complaint.

I'm starting to look at getting some monitoring functions so I can  
show pretty pictures to management and also try to predict when we  
are getting close to needing more systems (jobs get queued for a  
"while" before starting) and have noticed an oddity in the output of  
"qstat -s z" when it reports the number of slots that a job had used.  
Even when someone requested more than a single slot and it was  
granted, it will report that the job only used a single slot.

======
ra:~> qconf -sp make
pe_name           make
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   NONE
stop_proc_args    NONE
allocation_rule   $pe_slots
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min
ra:~> qhost -q
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
blaze                   sol-sparc64     8  0.01   16.0G    3.1G   29.9G    1.5G
    all.q                BIP   0/8
hyperion                sol-sparc64     8  0.16   14.0G    9.3G   16.0G    6.0M
    all.q                BIP   0/8
ictgrid001              sol-sparc64     8  0.02   16.0G    1.1G   20.0G    2.0M
    all.q                BIP   0/8
ra                      sol-sparc64     8  2.02   16.0G   11.8G   16.0G    1.4G
    all.q                BIP   0/8
ra:~>
======

As you can see, the configuration is pretty simple. I have a "make"
parallel environment set up with $pe_slots for the build process. This
is because our builds are currently being done with the stock GNU  
Make (instead of qmake). The makefile looks for $NSLOTS and  
automatically runs the build with a "-j $NSLOTS". We're only using  
the default all.q queues on our hosts.
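
For anyone wanting to reproduce the setup, here is a minimal sketch of the same idea as a shell function. Our real logic lives in the makefile, so the function name build_cmd and the fallback to a single job when NSLOTS is unset are assumptions made for illustration only.

```shell
# Hypothetical helper (not our actual makefile logic): build the gmake
# command line from the slot count that SGE exports as NSLOTS for jobs
# started in a parallel environment.
# Falls back to 1 job when NSLOTS is unset or empty, e.g. outside SGE.
build_cmd() {
    echo "gmake -j ${NSLOTS:-1}"
}
```

Inside a job started with "qrsh -pe make 4", where NSLOTS is 4, calling build_cmd would print "gmake -j 4".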

Users wanted to see the output of the build in process, so we are  
using qrsh instead of qsub. Users will typically submit their jobs  
with a command like

qrsh -pe make 1-4 -cwd gmake
qrsh -pe make 4 -cwd -now n gmake

When running "qstat -s z", it seemed like very few people had started
using the -pe option, since almost every finished job was reported
with 1 slot. Looking closer at specific jobs, I noticed that jobs
would execute using more than 1 slot, but then "qstat -s z" would
show only a single slot. I finally tracked this (bug?) down to users
specifying a range of slots for the -pe option. The reported value is
always the low end of the range given by the user, whereas
"qacct -j <jobid>" reports the actual number of slots that were
assigned to the job.
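
To make the observed pattern concrete, the value shown by "qstat -s z" behaves as if it were computed like the sketch below. This only mirrors what I see in the output; it is not taken from the SGE source, and the helper name range_low is made up for illustration.

```shell
# Hypothetical helper: given a -pe slot request such as "1-4" or "4",
# print the low end of the range. This matches what "qstat -s z"
# appeared to report in my tests; it is not SGE's actual code.
range_low() {
    # Strip everything from the first "-" onward; a plain number is unchanged.
    echo "${1%%-*}"
}
```

For example, "range_low 2-4" prints 2 and "range_low 4" prints 4.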

=========
ra:~> qrsh -pe make 1-4 "hostname; sleep 4"
ictgrid001
ra:~> qrsh -pe make 2-4 "hostname; sleep 4"
ictgrid001
ra:~> qrsh -pe make 4 "hostname; sleep 4"
ictgrid001
ra:~> qrsh -pe make 4-40 "hostname; sleep 4"
ictgrid001
ra:~>
=========

While the various jobs were running, I got the following info from  
qstat...

   38194 0.55500 hostname;  swaltner     r     12/28/2005 11:20:37 all.q@ictgrid001.ks.lsil.com       4
   38195 0.55500 hostname;  swaltner     r     12/28/2005 11:21:05 all.q@ictgrid001.ks.lsil.com       4
   38196 0.55500 hostname;  swaltner     r     12/28/2005 11:21:25 all.q@ictgrid001.ks.lsil.com       4
   38197 0.55500 hostname;  swaltner     r     12/28/2005 11:22:23 all.q@ictgrid001.ks.lsil.com       8

The following shows the errant output from "qstat -s z" as well as  
the fact that qacct keeps track of the actual slot usage.

======
ra:~> qstat -s z | grep swaltner
   38194 0.00000 hostname;  swaltner     qw    12/28/2005 11:20:36                                    1
   38195 0.00000 hostname;  swaltner     qw    12/28/2005 11:21:03                                    2
   38196 0.00000 hostname;  swaltner     qw    12/28/2005 11:21:21                                    4
   38197 0.00000 hostname;  swaltner     qw    12/28/2005 11:22:20                                    4
ra:~> qacct -j 38194 | grep slots
slots        4
ra:~> qacct -j 38195 | grep slots
slots        4
ra:~> qacct -j 38196 | grep slots
slots        4
ra:~> qacct -j 38197 | grep slots
slots        8
ra:~>
======

Why is there a discrepancy between qstat and qacct?

Also, is there a reason that "qstat -s z" doesn't show the queue that  
the job was assigned to?

I can probably use qacct to get the information I need for my graphs,  
but I wanted to find out why qstat was giving incorrect data since  
that presents the information in a nice format for running from the  
shell interactively.
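
In the meantime, something along these lines would probably do for pulling the slot count out of qacct. It is only a sketch: the function name slots_from_qacct is made up, and it assumes the "slots" line looks the way it does in the output above.

```shell
# Hypothetical helper: read "qacct -j <jobid>" output on stdin and print
# the value of the "slots" field. If the input contains several
# accounting records, one value is printed per record.
slots_from_qacct() {
    awk '$1 == "slots" { print $2 }'
}

# Usage (needs a working SGE installation):
#   qacct -j 38197 | slots_from_qacct
```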

Steve
