[GE users] Fwd: Queue problems

Margaret Doll Margaret_Doll at brown.edu
Tue Jun 12 21:32:34 BST 2007


Thanks, Daniel and Alan,

	I looked at compute-0-3 and could not find anything running on it. I rebooted the system, and I can now ssh into it again.

	Although "qstat -f" said that I had

----------------------------------------------------------------------------
LongJobs at compute-0-0.local     BIP   0/8       15.01    lx26-amd64
----------------------------------------------------------------------------
LongJobs at compute-0-1.local     BIP   3/8       14.00    lx26-amd64
     286 0.55500 ref3       nanguyen     r     06/06/2007 16:43:08     1
     288 0.55500 ref3-12000 nanguyen     r     06/06/2007 16:49:33     1
     297 0.55500 p3-3740    nanguyen     r     06/06/2007 17:24:05     1
----------------------------------------------------------------------------
LongJobs at compute-0-2.local     BIP   2/8       14.16    lx26-amd64
     290 0.55500 PE-1.3b    nanguyen     r     06/06/2007 16:54:27     1
     293 0.55500 ref3-4477  nanguyen     r     06/06/2007 17:07:32     1
----------------------------------------------------------------------------
all.q at compute-0-0.local        BIP   4/8       15.01    lx26-amd64
     150 0.55500 P473       cwang        r     05/21/2007 11:38:08     1
     151 0.55500 P493       cwang        r     05/21/2007 11:39:09     1
     159 0.55500 P453       cwang        r     05/21/2007 11:46:18     1
     163 0.55500 P400       cwang        r     05/21/2007 11:51:43     1
----------------------------------------------------------------------------
all.q at compute-0-1.local        BIP   4/8       14.00    lx26-amd64
     148 0.55500 P471       cwang        r     05/21/2007 11:36:52     1
     154 0.55500 P490       cwang        r     05/21/2007 11:41:28     1
     156 0.55500 P451       cwang        r     05/21/2007 11:43:16     1
     160 0.55500 P403       cwang        r     05/21/2007 11:49:07     1
----------------------------------------------------------------------------
all.q at compute-0-2.local        BIP   3/8       14.16    lx26-amd64
     149 0.55500 P472       cwang        r     05/21/2007 11:37:23     1
     152 0.55500 P492       cwang        r     05/21/2007 11:39:40     1
     157 0.55500 P452       cwang        r     05/21/2007 11:45:02     1
----------------------------------------------------------------------------
all.q at compute-0-3.local        BIP   0/8       24.00    lx26-amd64
----------------------------------------------------------------------------
test.q at compute-0-3.local       BIP   0/8       24.00    lx26-amd64


However, when I ssh'd into compute-0-1 and compute-0-0, the jobs owned by nanguyen were taking up no CPU time:

[root at compute-0-0 ~]# ps -ef | grep nan
nanguyen 23235 23234  0 May23 ?        00:00:00 [csh]
nanguyen 23309 23235 11 May23 ?        2-06:32:06 ./Sfn2.x
nanguyen 23311 23310  0 May23 ?        00:00:00 [csh]
nanguyen 23383 23311 11 May23 ?        2-06:31:51 ./Sfn5.x
nanguyen 28554 28552  0 May25 ?        00:00:00 sge_shepherd-186 -bg
nanguyen  7803  7802  0 Jun06 ?        00:00:00 sge_shepherd-289 -bg
root     16764 16631  0 15:47 pts/4    00:00:00 grep nan
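
Following up on Alan's note below about processes stuck in the 'D'
(uninterruptible sleep) state inflating the load average: if this happens
again, I assume something along these lines would show whether any of a
user's processes are stuck in that state (the username is just the example
from above; the wchan column hints at what a stuck process is blocked on):

   ps -u nanguyen -o pid,stat,pcpu,time,wchan,comm | awk 'NR==1 || $2 ~ /D/'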

I deleted all the jobs submitted by nanguyen, and the queues are working again.
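
For reference, rather than deleting job IDs one at a time, I believe all of
one user's jobs can be removed in a single command, something like:

   qdel -u nanguyen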

The error logs (/var/log/messages) on compute-0-3 and the other compute nodes contain only errors about the ntpdate server:

Jun 11 10:01:02 compute-0-1 ntpdate[24621]: no server suitable for synchronization found

I have the head node listed in /etc/ntp/ntpservers for the compute nodes. The head node reports the same error.
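
To narrow down the ntpdate problem, I suppose something like the following
could be run on a compute node to query the head node's time server without
touching the clock (with <head-node> standing in for our head node's
internal name):

   ntpdate -q <head-node>

and on the head node itself, "service ntpd status" and "ntpq -p" should
show whether ntpd is running and has actually synchronized with anything.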

Is there an error file that I should examine to see what went wrong?
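
I am guessing that, assuming the default cell name, the Grid Engine daemon
"messages" files under the spool directory are the place to look, e.g.:

   $SGE_ROOT/default/spool/<hostname>/messages    (execd log on each node)
   $SGE_ROOT/default/spool/qmaster/messages       (qmaster log on the head node)

but the exact paths depend on how spooling was configured at install time,
so I may be looking in the wrong place.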

I have restored the limits on np_load_avg.
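
(Put back using the same form Daniel suggested below, i.e. something along
the lines of:

   qconf -rattr queue load_thresholds np_load_avg=1.75 all.q
   qconf -rattr queue load_thresholds np_load_avg=1.75 LongJobs

with 1.75 being the threshold the scheduler was complaining about earlier.)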

On Jun 12, 2007, at 3:34 PM, Alan Barclay wrote:

> One fine point that may relate to this is that if these
> are Linux hosts, jobs in the 'D' state (uninterruptible
> sleep, usually I/O) are counted in the load average.
>
> If your hosts have a high load average but low CPU
> utilization, check the ps output for processes stuck in
> that state.  You may be able to use lsof, strace, and
> pstack to figure out why.
>
> Regards,
>   Alan Barclay
>
> Daniel Templeton wrote:
>>
>> According to the qstat -j output, you're not able to schedule jobs to
>> those hosts because they're overloaded.  If you do a qstat -f, those
>> queues should be reported in (a)larm state.  That's happening because
>> the normalized load average for the machines is above 1.75.  If you
>> don't care about how overloaded the machines are, set the
>> load_thresholds to none for your queues:
>>
>> qconf -rattr queue load_thresholds NONE all.q
>> qconf -rattr queue load_thresholds NONE LongJobs
>>
>> Otherwise, go figure out why the machines are reporting such high load
>> averages.  Odds are it's related to why you're having trouble reaching them.
>>
>> Daniel
>>
>> Margaret Doll wrote:
>>>
>>>
>>> Begin forwarded message:
>>>
>>>> From: Margaret Doll <Margaret_Doll at brown.edu>
>>>> Date: June 12, 2007 3:14:38 PM EDT
>>>> To: Grid Engine <users at gridengine.sunsource.net>
>>>>
>>>> Up until today I was able to submit jobs to test.q, which has 8 slots
>>>> (8 processors).  Today the jobs all go into a pending state.  test.q
>>>> also contains only compute-0-3.local.  Today I cannot ssh into the
>>>> compute node, although it answers a ping.
>>>>
>>>> I notice that although I can ssh into the other three compute nodes,
>>>> the queues tell me I do not have available slots on them either.
>>>>
>>>> How do I fix this situation without killing the jobs that are on
>>>> LongJobs and all.q?
>>>>
>>>> qstat -j 433
>>>> ==============================================================
>>>> job_number:                 433
>>>> exec_file:                  job_scripts/433
>>>> submission_time:            Tue Jun 12 14:53:44 2007
>>>> owner:                      mad
>>>> uid:                        500
>>>> group:                      users
>>>> gid:                        100
>>>> account:                    sge
>>>> cwd:                        /root
>>>> path_aliases:               /tmp_mnt/ * * /
>>>> mail_options:               n
>>>> mail_list:                  mad at ted.chem.brown.edu
>>>> notify:                     FALSE
>>>> job_name:                   shell
>>>> jobshare:                   0
>>>> hard_queue_list:            LongJobs,test.q
>>>>
>>>> script_file:                ./shell
>>>> version:                    1
>>>> scheduling info:            queue instance "all.q at compute-0-1.local" dropped because it is overloaded: np_load_avg=1.750000 (no load adjustment) >= 1.75
>>>>                             queue instance "all.q at compute-0-0.local" dropped because it is overloaded: np_load_avg=1.875000 (no load adjustment) >= 1.75
>>>>                             queue instance "all.q at compute-0-3.local" dropped because it is overloaded: np_load_avg=3.000000 (no load adjustment) >= 1.75
>>>>                             queue instance "all.q at compute-0-2.local" dropped because it is overloaded: np_load_avg=1.753750 (no load adjustment) >= 1.75
>>>>                             queue instance "LongJobs at compute-0-1.local" dropped because it is overloaded: np_load_avg=1.750000 (no load adjustment) >= 1.75
>>>>                             queue instance "LongJobs at compute-0-0.local" dropped because it is overloaded: np_load_avg=1.875000 (no load adjustment) >= 1.75
>>>>                             queue instance "LongJobs at compute-0-2.local" dropped because it is overloaded: np_load_avg=1.753750 (no load adjustment) >= 1.75
>>>>                             queue instance "test.q at compute-0-3.local" dropped because it is overloaded: np_load_avg=3.000000 (no load adjustment) >= 1.75
>>>>                             All queues dropped because of overload or full
>>>>
>>>>
>>>> ssh compute-0-3
>>>> ssh_exchange_identification: Connection closed by remote host
>>>> [mad at ted moldy]$ ping compute-0-3
>>>> PING compute-0-3.local (10.255.255.249) 56(84) bytes of data.
>>>> 64 bytes from compute-0-3.local (10.255.255.249): icmp_seq=0 ttl=64
>>>> time=0.096 ms
>>>> 64 bytes from compute-0-3.local (10.255.255.249): icmp_seq=1 ttl=64
>>>> time=0.196 ms
>>>>
>>>>
>>>> from qmon:
>>>>
>>>>
>>>>
>>>> test.q   - hostlist is @tempo
>>>>
>>>> @ temp Members - compute-0-3.local
>>>>
>>>> ClusterQueue   Used  Avail  Total
>>>> LongJobs          5      0     24
>>>> all.q            11      0     32
>>>> test.q            0      0     86
>>>
>>
>
> -- 
> --Alan Barclay--  barclay at rtda.com
>   www.rtda.com   (408) 492-0940 main
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



