[GE users] How to change "S" and "aS" state on queue instances

Hugo R. Hernandez-Mora hugo.hernandez at loni.ucla.edu
Mon Jul 16 20:45:45 BST 2007



Hi Daniel,
thanks for your answer.  I figured out the problem with the 'S' state 
after looking at our cluster configuration in detail.  We are now 
reviewing our current queue configurations.
Regards,
- Hugo
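
For anyone reading this thread later: the subordinate behaviour Daniel
describes in the quoted reply below can be sketched as a small toy model.
This is only an illustration, not Grid Engine code. The function name and
the queue names (short.q, medium.q, long.q) are hypothetical, and the
model assumes the "queue=threshold" form of subordinate_list, where a
missing threshold means "suspend when the superior queue instance is full".

```python
# Toy model of subordinate suspension on a single host (not SGE code).
# Assumption: subordinate_list maps each subordinate queue name to a slot
# threshold, or to None when no threshold is configured.

def suspended_queues(subordinate_list, used_slots, total_slots):
    """Return the subordinate queues that would show state 'S' on this host."""
    suspended = []
    for queue, threshold in subordinate_list.items():
        # No threshold: suspend only once the superior instance is full.
        limit = total_slots if threshold is None else threshold
        if used_slots >= limit:
            suspended.append(queue)
    return sorted(suspended)


# Hypothetical subordinate list of the special queue, on a two-slot node.
subs = {"short.q": None, "medium.q": None, "long.q": None}

print(suspended_queues(subs, used_slots=2, total_slots=2))  # all three queues
print(suspended_queues(subs, used_slots=1, total_slots=2))  # none
```

With an explicit threshold of 1 in this model, a single job in the
superior queue instance is already enough to suspend the subordinate on
that host, which is worth keeping in mind on nodes with only two slots.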


Daniel Templeton wrote:
> Hugo,
>
> The 'S' state means that the queue is suspended on subordinate.  
> Depending on how you configured the subordinate list, when, for 
> example, an instance of your special queue is full on a given host, 
> the other three queues will be suspended on that host.  It sounds like 
> you may need to rethink your queue configurations.
>
> Daniel
>
> Hugo R. Hernandez-Mora wrote:
>> Hello there,
>> we have a cluster of ~300 nodes with two slots each, running Solaris 
>> 10 u06/6 and SGE 6.06.   We have configured the qmaster with a shadow 
>> server to take over services if it fails.  We have also configured 
>> four different queues, depending on the kind of job to be submitted:
>>    special queue: uses 100% of the cluster's resources,
>>    short queue: uses all the nodes but only one slot per node.  For 
>> jobs of less than 2 CPU hours,
>>    medium queue: uses ~30% of the resources (only one slot per 
>> node).  For jobs of less than 12 CPU hours,
>>    long queue: uses ~10% of the resources (only one slot per 
>> node).  For jobs of unlimited time.
>>
>> The queues are subordinated in order of resources: the special 
>> queue has the other three queues as subordinates, and so on.
>>
>> For the past two weeks we have been experiencing a problem with the 
>> cluster.  Most of the nodes are going into the "aS" state.  We have 
>> checked whether these nodes are too busy in terms of resource usage.  
>> They are in good shape but flagged with that state, which prevents 
>> jobs from running on them.  The only solution last time was to restart 
>> both qmasters, after which the system returned to a good state.  Now 
>> we are experiencing the same situation, with no useful information in 
>> the log files or messages files about the problem.  We followed the 
>> same procedure as last time and restarted the two qmasters, but now 
>> the nodes are marked with the "S" state.  We have tried to force the 
>> nodes to unsuspend, without any success.  No jobs can run on the 
>> marked nodes, even though their resources are completely available.
>>
>> Can somebody help me with this problem?  I would appreciate it!
>> Regards,
>> - Hugo
>>
>
>

-- 
Hugo R. Hernandez-Mora
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Fax: 310.206.5518
hugo.hernandez at loni.ucla.edu
--

"If your efforts were met with indifference, do not be discouraged, 
for the sun puts on a marvelous show every morning 
while most people are still asleep" 



