[GE users] Fwd: Queue problems

Margaret Doll Margaret_Doll at brown.edu
Tue Jun 12 20:15:04 BST 2007



Begin forwarded message:

> From: Margaret Doll <Margaret_Doll at brown.edu>
> Date: June 12, 2007 3:14:38 PM EDT
> To: Grid Engine <users at gridengine.sunsource.net>
>
> Up until today I was able to submit jobs to test.q, which has 8
> slots (8 processors).  Today the jobs all go into a pending state.
> test.q contains only compute-0-3.local.  Today I cannot ssh into
> that compute node, although it answers a ping.
>
> I notice that although I can ssh into the other three compute
> nodes, the queues tell me I do not have available slots on them
> either.
>
> How do I fix this situation without killing the jobs that are on
> LongJobs and all.q?
>
> qstat -j 433
> ==============================================================
> job_number:                 433
> exec_file:                  job_scripts/433
> submission_time:            Tue Jun 12 14:53:44 2007
> owner:                      mad
> uid:                        500
> group:                      users
> gid:                        100
> account:                    sge
> cwd:                        /root
> path_aliases:               /tmp_mnt/ * * /
> mail_options:               n
> mail_list:                  mad at ted.chem.brown.edu
> notify:                     FALSE
> job_name:                   shell
> jobshare:                   0
> hard_queue_list:            LongJobs,test.q
>
> script_file:                ./shell
> version:                    1
> scheduling info:            queue instance "all.q@compute-0-1.local" dropped because it is overloaded: np_load_avg=1.750000 (no load adjustment) >= 1.75
>                             queue instance "all.q@compute-0-0.local" dropped because it is overloaded: np_load_avg=1.875000 (no load adjustment) >= 1.75
>                             queue instance "all.q@compute-0-3.local" dropped because it is overloaded: np_load_avg=3.000000 (no load adjustment) >= 1.75
>                             queue instance "all.q@compute-0-2.local" dropped because it is overloaded: np_load_avg=1.753750 (no load adjustment) >= 1.75
>                             queue instance "LongJobs@compute-0-1.local" dropped because it is overloaded: np_load_avg=1.750000 (no load adjustment) >= 1.75
>                             queue instance "LongJobs@compute-0-0.local" dropped because it is overloaded: np_load_avg=1.875000 (no load adjustment) >= 1.75
>                             queue instance "LongJobs@compute-0-2.local" dropped because it is overloaded: np_load_avg=1.753750 (no load adjustment) >= 1.75
>                             queue instance "test.q@compute-0-3.local" dropped because it is overloaded: np_load_avg=3.000000 (no load adjustment) >= 1.75
>                             All queues dropped because of overload or full
>
>
> ssh compute-0-3
> ssh_exchange_identification: Connection closed by remote host
> [mad@ted moldy]$ ping compute-0-3
> PING compute-0-3.local (10.255.255.249) 56(84) bytes of data.
> 64 bytes from compute-0-3.local (10.255.255.249): icmp_seq=0 ttl=64 time=0.096 ms
> 64 bytes from compute-0-3.local (10.255.255.249): icmp_seq=1 ttl=64 time=0.196 ms
>
>
> from qmon:
>
>
>
> test.q   - hostlist is @tempo
>
> @ temp		Members - compute-0-3.local
>
> ClusterQueue    Used    Avail   Total
> LongJobs        5       0       24
> all.q           11      0       32
> test.q          0       0       86
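
The scheduling info in the qstat -j output above reports, for each queue instance, a measured np_load_avg and the threshold it was compared against. As a rough aid for reading that output, here is a minimal Python sketch that pulls those pairs out of the pasted text. The function name parse_overloaded and the regular expression are illustrative assumptions based only on the message format shown in this post, not part of Grid Engine. The pattern accepts both "@" and the " at " spelling that mail archives sometimes substitute, and it tolerates "> " quote markers and wrapped lines.

```python
import re

# Pattern for one "dropped because it is overloaded" entry, as it appears
# in the output quoted in this post (an assumption, not an official format).
ENTRY = re.compile(
    r'queue instance\s+"(?P<queue>[^"\s@]+)(?:@| at )(?P<host>[^"\s]+)"\s+'
    r'dropped because it is overloaded:\s+\w+=(?P<value>[\d.]+)\s+'
    r'\(no load adjustment\)\s+>=\s+(?P<limit>[\d.]+)'
)


def parse_overloaded(text):
    """Return (queue, host, load, threshold) tuples found in pasted output."""
    # Drop mail quote markers, then collapse wrapped lines into one string.
    lines = [re.sub(r'^\s*>\s?', '', line) for line in text.splitlines()]
    flat = ' '.join(' '.join(lines).split())
    return [
        (m['queue'], m['host'], float(m['value']), float(m['limit']))
        for m in ENTRY.finditer(flat)
    ]


# Two entries copied from the output in this post.
sample = '''\
> scheduling info:            queue instance
> "all.q at compute-0-1.local" dropped because it is overloaded:
> np_load_avg=1.750000 (no load adjustment) >= 1.75
>                             queue instance
> "test.q at compute-0-3.local" dropped because it is overloaded:
> np_load_avg=3.000000 (no load adjustment) >= 1.75
'''

for queue, host, load, limit in parse_overloaded(sample):
    print(f"{queue}@{host}: load {load:.2f}, threshold {limit:.2f}")
```

Each tuple pairs a queue instance with the load and threshold that caused it to be dropped, which makes it easier to see that every instance listed is at or above the 1.75 threshold.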



