[GE users] Fwd: Queue problems

Daniel Templeton Dan.Templeton at Sun.COM
Tue Jun 12 20:23:33 BST 2007


According to the qstat -j output, jobs can't be scheduled to those hosts
because they're overloaded.  If you run qstat -f, those queues should be
reported in (a)larm state.  That's happening because the normalized load
average (np_load_avg) for the machines is at or above the 1.75
threshold.  If you don't care how overloaded the machines are, set
load_thresholds to NONE for your queues:

qconf -rattr queue load_thresholds NONE all.q
qconf -rattr queue load_thresholds NONE LongJobs
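
If there are more queues to change, the same command can be looped.  A
minimal sketch follows; the function name set_load_thresholds and the RUN
variable are made up for illustration and are not Grid Engine features.
By default it only prints the qconf commands it would run, so they can
be reviewed first:

```shell
#!/bin/sh
# Hypothetical wrapper, not part of Grid Engine.
# usage: set_load_thresholds VALUE QUEUE [QUEUE...]
# Prints one qconf command per queue; executes it only when RUN=1.
set_load_thresholds() {
    value=$1
    shift
    for q in "$@"; do
        cmd="qconf -rattr queue load_thresholds $value $q"
        echo "$cmd"
        if [ "${RUN:-0}" = 1 ]; then
            # Stop at the first queue that qconf refuses to modify.
            $cmd || return 1
        fi
    done
}

# Dry run (prints the two commands shown above):
#   set_load_thresholds NONE all.q LongJobs
# Apply for real:
#   RUN=1 set_load_thresholds NONE all.q LongJobs
```

Afterwards, qconf -sq all.q should show the new load_thresholds value.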

Otherwise, go figure out why the machines are reporting such high load
averages.  Odds are it's related to why you're having trouble reaching
compute-0-3.
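
As a rough aid for that investigation: np_load_avg is the host's load
average divided by its processor count, so with the 1.75 threshold seen
in the qstat -j output, an 8-processor node is treated as overloaded once
its load average reaches 14.  A minimal sketch of that comparison,
assuming awk is available; the helper name is_overloaded is hypothetical
and not part of Grid Engine:

```shell
#!/bin/sh
# Hypothetical helper, not part of Grid Engine: mimics the comparison of
# np_load_avg against a load threshold.
# usage: is_overloaded LOAD_AVG NUM_PROC [THRESHOLD]   (THRESHOLD defaults to 1.75)
# Prints the normalized load; exit status 0 means "overloaded".
is_overloaded() {
    awk -v load="$1" -v nproc="$2" -v thr="${3:-1.75}" 'BEGIN {
        np = load / nproc               # normalized (per-processor) load
        printf "np_load_avg=%.6f\n", np
        exit !(np >= thr)               # 0 when at or above the threshold
    }'
}
```

For example, is_overloaded 24 8 prints np_load_avg=3.000000 and succeeds,
which would match the value reported for compute-0-3 if that node really
has 8 processors.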

Daniel

Margaret Doll wrote:
>
>
> Begin forwarded message:
>
>> From: Margaret Doll <Margaret_Doll at brown.edu>
>> Date: June 12, 2007 3:14:38 PM EDT
>> To: Grid Engine <users at gridengine.sunsource.net>
>>
>> Up until today I was able to submit jobs to test.q, which has 8 slots
>> (8 processors) and contains only compute-0-3.local.  Today the jobs
>> all go into a pending state.  I also cannot ssh into that compute node
>> today, although it answers a ping.
>>
>> I notice that although I can ssh into the other three compute nodes,
>> the queues tell me I do not have available slots on them either.
>>
>> How do I fix this situation without killing the jobs that are on 
>> LongJobs and all.q?
>>
>> qstat -j 433
>> ==============================================================
>> job_number:                 433
>> exec_file:                  job_scripts/433
>> submission_time:            Tue Jun 12 14:53:44 2007
>> owner:                      mad
>> uid:                        500
>> group:                      users
>> gid:                        100
>> account:                    sge
>> cwd:                        /root
>> path_aliases:               /tmp_mnt/ * * /
>> mail_options:               n
>> mail_list:                  mad at ted.chem.brown.edu
>> notify:                     FALSE
>> job_name:                   shell
>> jobshare:                   0
>> hard_queue_list:            LongJobs,test.q
>>
>> script_file:                ./shell
>> version:                    1
>> scheduling info:            queue instance "all.q@compute-0-1.local" dropped because it is overloaded: np_load_avg=1.750000 (no load adjustment) >= 1.75
>>                             queue instance "all.q@compute-0-0.local" dropped because it is overloaded: np_load_avg=1.875000 (no load adjustment) >= 1.75
>>                             queue instance "all.q@compute-0-3.local" dropped because it is overloaded: np_load_avg=3.000000 (no load adjustment) >= 1.75
>>                             queue instance "all.q@compute-0-2.local" dropped because it is overloaded: np_load_avg=1.753750 (no load adjustment) >= 1.75
>>                             queue instance "LongJobs@compute-0-1.local" dropped because it is overloaded: np_load_avg=1.750000 (no load adjustment) >= 1.75
>>                             queue instance "LongJobs@compute-0-0.local" dropped because it is overloaded: np_load_avg=1.875000 (no load adjustment) >= 1.75
>>                             queue instance "LongJobs@compute-0-2.local" dropped because it is overloaded: np_load_avg=1.753750 (no load adjustment) >= 1.75
>>                             queue instance "test.q@compute-0-3.local" dropped because it is overloaded: np_load_avg=3.000000 (no load adjustment) >= 1.75
>>                             All queues dropped because of overload or full
>>
>>
>> ssh compute-0-3
>> ssh_exchange_identification: Connection closed by remote host
>> [mad@ted moldy]$ ping compute-0-3
>> PING compute-0-3.local (10.255.255.249) 56(84) bytes of data.
>> 64 bytes from compute-0-3.local (10.255.255.249): icmp_seq=0 ttl=64 time=0.096 ms
>> 64 bytes from compute-0-3.local (10.255.255.249): icmp_seq=1 ttl=64 time=0.196 ms
>>
>>
>> from qmon:
>>
>>
>>
>> test.q   - hostlist is @tempo
>>
>> @ temp Members - compute-0-3.local
>>
>> ClusterQueue  Used  Avail  Total
>> LongJobs         5      0     24
>> all.q           11      0     32
>> test.q           0      0     86
>
