[GE users] Fwd: Queue problems

Alan Barclay barclay at rtda.com
Tue Jun 12 20:34:35 BST 2007


One fine point that may relate to this is that if these
are Linux hosts, jobs in the 'D' state (uninterruptible
sleep, usually I/O) are counted in the load average.

If your hosts have a high load average, but low CPU
utilization, check the ps output for processes stuck in
that state.  You may be able to use lsof, strace, and
pstack to figure out why.
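
As a rough sketch of that check, something like the following
could list the processes currently in the 'D' state.  The function
names filter_d_state and show_d_state are made up for illustration,
and the ps format string assumes the procps version of ps found on
most Linux distributions.

```shell
# Keep the header line plus any process whose STAT column starts
# with D (uninterruptible sleep).  Reads ps-style lines on stdin,
# with the state in the second column.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Hypothetical wrapper for a live host.  The WCHAN column shows the
# kernel function the process is sleeping in, which often hints at
# the device or filesystem involved.
show_d_state() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state
}
```

Running show_d_state a few times in a row helps separate processes
that are genuinely stuck from ones that only pass through the 'D'
state briefly during normal I/O.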

Regards,
  Alan Barclay

Daniel Templeton wrote:
> 
> According to the qstat -j output, you're not able to schedule jobs to
> those hosts because they're overloaded.  If you do a qstat -f, those
> queues should be reported in (a)larm state.  That's happening because
> the normalized load average for the machines is at or above 1.75.  If you
> don't care about how overloaded the machines are, set the
> load_thresholds to none for your queues:
> 
> qconf -rattr queue load_thresholds NONE all.q
> qconf -rattr queue load_thresholds NONE LongJobs
> 
> Otherwise, go figure out why the machines are reporting such high load
> averages.  Odds are it's related to why you're having trouble reaching them.
> 
> Daniel
> 
> Margaret Doll wrote:
> >
> >
> > Begin forwarded message:
> >
> >> *From: *Margaret Doll <Margaret_Doll at brown.edu>
> >> *Date: *June 12, 2007 3:14:38 PM EDT
> >> *To: *Grid Engine <users at gridengine.sunsource.net>
> >>
> >> Up until today I was able to submit jobs to test.q, which has 8 slots
> >> (8 processors).  Today the jobs all go into a pending state.  test.q
> >> contains only compute-0-3.local.  Today I cannot ssh into that
> >> compute node, although it answers a ping.
> >>
> >> I notice that although I can ssh into the other three compute nodes,
> >> the queues tell me I do not have available slots on them either.
> >>
> >> How do I fix this situation without killing the jobs that are on
> >> LongJobs and all.q?
> >>
> >> qstat -j 433
> >> ==============================================================
> >> job_number:                 433
> >> exec_file:                  job_scripts/433
> >> submission_time:            Tue Jun 12 14:53:44 2007
> >> owner:                      mad
> >> uid:                        500
> >> group:                      users
> >> gid:                        100
> >> account:                    sge
> >> cwd:                        /root
> >> path_aliases:               /tmp_mnt/ * * /
> >> mail_options:               n
> >> mail_list:                  mad at ted.chem.brown.edu
> >> notify:                     FALSE
> >> job_name:                   shell
> >> jobshare:                   0
> >> hard_queue_list:            LongJobs,test.q
> >>
> >> script_file:                ./shell
> >> version:                    1
> >> scheduling info:            queue instance "all.q at compute-0-1.local" dropped because it is overloaded: np_load_avg=1.750000 (no load adjustment) >= 1.75
> >>                             queue instance "all.q at compute-0-0.local" dropped because it is overloaded: np_load_avg=1.875000 (no load adjustment) >= 1.75
> >>                             queue instance "all.q at compute-0-3.local" dropped because it is overloaded: np_load_avg=3.000000 (no load adjustment) >= 1.75
> >>                             queue instance "all.q at compute-0-2.local" dropped because it is overloaded: np_load_avg=1.753750 (no load adjustment) >= 1.75
> >>                             queue instance "LongJobs at compute-0-1.local" dropped because it is overloaded: np_load_avg=1.750000 (no load adjustment) >= 1.75
> >>                             queue instance "LongJobs at compute-0-0.local" dropped because it is overloaded: np_load_avg=1.875000 (no load adjustment) >= 1.75
> >>                             queue instance "LongJobs at compute-0-2.local" dropped because it is overloaded: np_load_avg=1.753750 (no load adjustment) >= 1.75
> >>                             queue instance "test.q at compute-0-3.local" dropped because it is overloaded: np_load_avg=3.000000 (no load adjustment) >= 1.75
> >>                             All queues dropped because of overload or full
> >>
> >>
> >> ssh compute-0-3
> >> ssh_exchange_identification: Connection closed by remote host
> >> [mad at ted moldy]$ ping compute-0-3
> >> PING compute-0-3.local (10.255.255.249) 56(84) bytes of data.
> >> 64 bytes from compute-0-3.local (10.255.255.249): icmp_seq=0 ttl=64 time=0.096 ms
> >> 64 bytes from compute-0-3.local (10.255.255.249): icmp_seq=1 ttl=64 time=0.196 ms
> >>
> >>
> >> from qmon:
> >>
> >>
> >>
> >> test.q   - hostlist is @tempo
> >>
> >> @ temp Members - compute-0-3.local
> >>
> >> ClusterQueue  Used  Avail  Total
> >> LongJobs         5      0     24
> >> all.q           11      0     32
> >> test.q           0      0     86
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

-- 
--Alan Barclay--  barclay at rtda.com
  www.rtda.com   (408) 492-0940 main
