[GE users] Fwd: Queue problems

Daniel Templeton Dan.Templeton at Sun.COM
Tue Jun 12 20:36:29 BST 2007



No idea why your machine went dark.  It's most likely something 
unrelated to Grid Engine.  Can you walk up to it and physically log in?

The load average is a measure of how many processes are waiting to run 
at any given time.  A high load average means that someone is running a 
ton of processes on your machines, again, possibly unrelated to Grid 
Engine.  On most OSes, 'top' will show you the list of the top 
CPU-consuming processes that are running.  On Solaris, prstat is the 
Sun-endorsed option.
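
For example, to see what is actually driving the load (the exact flags 
and output vary by OS, so treat this as a sketch rather than a recipe):

uptime                      # current 1/5/15-minute load averages
top -b -n 1 | head -20      # one-shot snapshot of the busiest processes (Linux)
prstat -s cpu -n 10 1 1     # top ten CPU consumers (Solaris)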

BTW, when you get the issue resolved, you'd be best served to restore 
the load_thresholds.  It doesn't make much sense to schedule jobs to 
machines that are completely overloaded.
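
For example, to put the np_load_avg threshold of 1.75 back on the 
queues you changed (1.75 is the value the scheduler was comparing 
against in your qstat -j output; pick whatever limit suits you):

qconf -rattr queue load_thresholds np_load_avg=1.75 test.q
qconf -rattr queue load_thresholds np_load_avg=1.75 all.q
qconf -rattr queue load_thresholds np_load_avg=1.75 LongJobs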

Daniel

Margaret Doll wrote:
> Thanks, Daniel.  That solved the problem of getting the jobs to run.
>
> However, I still cannot run on test.q or ssh into compute-0-3.  Any 
> ideas for that problem?
>
> qconf -rattr queue load_thresholds NONE  test.q
> ssh compute-0-3
> ssh_exchange_identification: Connection closed by remote host
>
>
> Also, how do I figure out why the machines are reporting such high 
> load averages?
>
>
> On Jun 12, 2007, at 3:23 PM, Daniel Templeton wrote:
>
>> According to the qstat -j output, you're not able to schedule jobs to 
>> those hosts because they're overloaded.  If you do a qstat -f, those 
>> queues should be reported in (a)larm state.  That's happening because 
>> the normalized load average for the machines is above 1.75.  If you 
>> don't care about how overloaded the machines are, set the 
>> load_thresholds to none for your queues:
>>
>> qconf -rattr queue load_thresholds NONE all.q
>> qconf -rattr queue load_thresholds NONE LongJobs
>>
>> Otherwise, go figure out why the machines are reporting such high 
>> load averages.  Odds are it's related to why you're having trouble 
>> reaching them.
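>>
>> For example, qhost (the standard Grid Engine host-listing command) 
>> will show the load Grid Engine itself sees on each execution host, 
>> which is a quick sanity check before logging in to the nodes:
>>
>> qhost                        # load and memory for every exec host
>> qhost -h compute-0-3.local   # just the suspect node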
>>
>> Daniel
>>
>> Margaret Doll wrote:
>>>
>>>
>>> Begin forwarded message:
>>>
>>>> From: Margaret Doll <Margaret_Doll at brown.edu>
>>>> Date: June 12, 2007 3:14:38 PM EDT
>>>> To: Grid Engine <users at gridengine.sunsource.net>
>>>>
>>>> Up until today I was able to submit jobs to test.q, which has 8 
>>>> slots (8 processors).  Today the jobs all go into a pending state.  
>>>> test.q also contains only compute-0-3.local.  Today I cannot ssh 
>>>> into the compute node, although it answers a ping.
>>>>
>>>> I notice that although I can ssh into the other three compute 
>>>> nodes, the queues tell me I do not have available slots on them 
>>>> either.
>>>>
>>>> How do I fix this situation without killing the jobs that are on 
>>>> LongJobs and all.q?
>>>>
>>>> qstat -j 433
>>>> ==============================================================
>>>> job_number:                 433
>>>> exec_file:                  job_scripts/433
>>>> submission_time:            Tue Jun 12 14:53:44 2007
>>>> owner:                      mad
>>>> uid:                        500
>>>> group:                      users
>>>> gid:                        100
>>>> account:                    sge
>>>> cwd:                        /root
>>>> path_aliases:               /tmp_mnt/ * * /
>>>> mail_options:               n
>>>> mail_list:                  mad at ted.chem.brown.edu
>>>> notify:                     FALSE
>>>> job_name:                   shell
>>>> jobshare:                   0
>>>> hard_queue_list:            LongJobs,test.q
>>>>
>>>> script_file:                ./shell
>>>> version:                    1
>>>> scheduling info:            queue instance "all.q at compute-0-1.local" dropped because it is overloaded: np_load_avg=1.750000 (no load adjustment) >= 1.75
>>>>                             queue instance "all.q at compute-0-0.local" dropped because it is overloaded: np_load_avg=1.875000 (no load adjustment) >= 1.75
>>>>                             queue instance "all.q at compute-0-3.local" dropped because it is overloaded: np_load_avg=3.000000 (no load adjustment) >= 1.75
>>>>                             queue instance "all.q at compute-0-2.local" dropped because it is overloaded: np_load_avg=1.753750 (no load adjustment) >= 1.75
>>>>                             queue instance "LongJobs at compute-0-1.local" dropped because it is overloaded: np_load_avg=1.750000 (no load adjustment) >= 1.75
>>>>                             queue instance "LongJobs at compute-0-0.local" dropped because it is overloaded: np_load_avg=1.875000 (no load adjustment) >= 1.75
>>>>                             queue instance "LongJobs at compute-0-2.local" dropped because it is overloaded: np_load_avg=1.753750 (no load adjustment) >= 1.75
>>>>                             queue instance "test.q at compute-0-3.local" dropped because it is overloaded: np_load_avg=3.000000 (no load adjustment) >= 1.75
>>>>                             All queues dropped because of overload or full
>>>>
>>>>
>>>> ssh compute-0-3
>>>> ssh_exchange_identification: Connection closed by remote host
>>>> [mad at ted moldy]$ ping compute-0-3
>>>> PING compute-0-3.local (10.255.255.249) 56(84) bytes of data.
>>>> 64 bytes from compute-0-3.local (10.255.255.249): icmp_seq=0 ttl=64 
>>>> time=0.096 ms
>>>> 64 bytes from compute-0-3.local (10.255.255.249): icmp_seq=1 ttl=64 
>>>> time=0.196 ms
>>>>
>>>>
>>>> from qmon:
>>>>
>>>>
>>>>
>>>> test.q   - hostlist is @tempo
>>>>
>>>> @tempo Members - compute-0-3.local
>>>>
>>>> ClusterQueue   Used   Avail   Total
>>>> LongJobs          5       0      24
>>>> all.q            11       0      32
>>>> test.q            0       0      86
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



