[GE users] Fwd: Queue problems

Margaret Doll Margaret_Doll at brown.edu
Tue Jun 12 20:31:03 BST 2007


Thanks, Daniel.  That solved the problem of getting the jobs to run.

However, I still cannot run on test.q or ssh into compute-0-3.
Any ideas for that problem?

qconf -rattr queue load_thresholds NONE  test.q
ssh compute-0-3
ssh_exchange_identification: Connection closed by remote host
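
From what I can tell, "ssh_exchange_identification: Connection closed by
remote host" means sshd on the node accepts the TCP connection and then
drops it, rather than the network being down (ping still answers).  Once I
can get to compute-0-3's console I plan to check TCP wrappers and the sshd
logs, roughly like this (the log paths are my guess for a stock Linux
compute node):

grep sshd /etc/hosts.deny /etc/hosts.allow   # is TCP wrappers rejecting the head node?
tail /var/log/secure                         # recent sshd messages (or /var/log/messages)
/etc/init.d/sshd status                      # is sshd still running at all?
uptime                                       # or is the box just too overloaded to respond?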


Also, how do I figure out why the machines are reporting such high  
load averages?
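
I have only looked from the head node so far.  I assume the standard SGE
client commands will show what each execd is reporting, something along
these lines:

qhost                                        # one line of load per execution host
qstat -f                                     # queue instances, with (a)larm states flagged
qconf -sq all.q | grep load_thresholds       # confirm the threshold change took effect
qconf -se compute-0-3 | grep load_values     # raw load values the node last reported

and on the three nodes I can still reach, uptime and top should show
whether something is really running there or whether the reported load
is just stale.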


On Jun 12, 2007, at 3:23 PM, Daniel Templeton wrote:

> According to the qstat -j output, you're not able to schedule jobs  
> to those hosts because they're overloaded.  If you do a qstat -f,  
> those queues should be reported in (a)larm state.  That's happening  
> because the normalized load average for the machines is above  
> 1.75.  If you don't care about how overloaded the machines are, set  
> the load_thresholds to none for your queues:
>
> qconf -rattr queue load_thresholds NONE all.q
> qconf -rattr queue load_thresholds NONE LongJobs
>
> Otherwise, go figure out why the machines are reporting such high  
> load averages.  Odds are it's related to why you're having trouble  
> reaching them.
>
> Daniel
>
> Margaret Doll wrote:
>>
>>
>> Begin forwarded message:
>>
>>> From: Margaret Doll <Margaret_Doll at brown.edu>
>>> Date: June 12, 2007 3:14:38 PM EDT
>>> To: Grid Engine <users at gridengine.sunsource.net>
>>>
>>> Up until today I was able to submit jobs to test.q, which has
>>> 8 slots (8 processors).  Today the jobs all go into a pending
>>> state.  test.q contains only compute-0-3.local.  Today I cannot
>>> ssh into the compute node, although it answers a ping.
>>>
>>> I notice that although I can ssh into the other three compute  
>>> nodes, the queues tell me I do
>>> not have available slots on them either.
>>>
>>> How do I fix this situation without killing the jobs that are on  
>>> LongJobs and all.q?
>>>
>>> qstat -j 433
>>> ==============================================================
>>> job_number:                 433
>>> exec_file:                  job_scripts/433
>>> submission_time:            Tue Jun 12 14:53:44 2007
>>> owner:                      mad
>>> uid:                        500
>>> group:                      users
>>> gid:                        100
>>> account:                    sge
>>> cwd:                        /root
>>> path_aliases:               /tmp_mnt/ * * /
>>> mail_options:               n
>>> mail_list:                  mad at ted.chem.brown.edu
>>> notify:                     FALSE
>>> job_name:                   shell
>>> jobshare:                   0
>>> hard_queue_list:            LongJobs,test.q
>>>
>>> script_file:                ./shell
>>> version:                    1
>>> scheduling info:            queue instance "all.q at compute-0-1.local" dropped
>>>                             because it is overloaded: np_load_avg=1.750000 (no load adjustment) >= 1.75
>>>                             queue instance "all.q at compute-0-0.local" dropped
>>>                             because it is overloaded: np_load_avg=1.875000 (no load adjustment) >= 1.75
>>>                             queue instance "all.q at compute-0-3.local" dropped
>>>                             because it is overloaded: np_load_avg=3.000000 (no load adjustment) >= 1.75
>>>                             queue instance "all.q at compute-0-2.local" dropped
>>>                             because it is overloaded: np_load_avg=1.753750 (no load adjustment) >= 1.75
>>>                             queue instance "LongJobs at compute-0-1.local" dropped
>>>                             because it is overloaded: np_load_avg=1.750000 (no load adjustment) >= 1.75
>>>                             queue instance "LongJobs at compute-0-0.local" dropped
>>>                             because it is overloaded: np_load_avg=1.875000 (no load adjustment) >= 1.75
>>>                             queue instance "LongJobs at compute-0-2.local" dropped
>>>                             because it is overloaded: np_load_avg=1.753750 (no load adjustment) >= 1.75
>>>                             queue instance "test.q at compute-0-3.local" dropped
>>>                             because it is overloaded: np_load_avg=3.000000 (no load adjustment) >= 1.75
>>>                             All queues dropped because of overload or full
>>>
>>>
>>> ssh compute-0-3
>>> ssh_exchange_identification: Connection closed by remote host
>>> [mad at ted moldy]$ ping compute-0-3
>>> PING compute-0-3.local (10.255.255.249) 56(84) bytes of data.
>>> 64 bytes from compute-0-3.local (10.255.255.249): icmp_seq=0  
>>> ttl=64 time=0.096 ms
>>> 64 bytes from compute-0-3.local (10.255.255.249): icmp_seq=1  
>>> ttl=64 time=0.196 ms
>>>
>>>
>>> From qmon:
>>>
>>> test.q - hostlist is @tempo
>>>
>>> @tempo Members - compute-0-3.local
>>>
>>> ClusterQueue  Used  Avail  Total
>>> LongJobs         5      0     24
>>> all.q           11      0     32
>>> test.q           0      0     86
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



