[GE users] SGE jobs in "qw" state

Iwona Sakrejda isakrejda at lbl.gov
Tue Jun 6 21:40:44 BST 2006


Hi,

My installation every now and then suffers from these losses, but
I never complained, as I remember reading that SGE should be installed
on a local filesystem. I have it on GPFS, so I thought I had brought it
upon myself. Only I don't have much choice, and it's never very bad.

Here is my story.

I run SGE 6.0u4 on SL3.0.2 (Scientific Linux, distributed by Fermilab,
sort of a clone of RHEL3...).

What I observe is that every now and then a well-established user
suddenly changes: it acquires a time limit and a project of NONE.
I then do qconf -muser and fix it, and then everything is OK for a while.
It does not happen very often; I monitor for it and have just learnt to
live with it.
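
In case it is useful, the sort of check I have in mind is just a small
shell loop like this (a rough sketch, assuming the corrupted records show
up with default_project set to NONE in the qconf -suser output; adjust the
pattern to whatever your broken entries look like):

    # list all known users and flag any whose project got reset to NONE
    for u in `qconf -suserl`; do
        qconf -suser $u | grep -q 'default_project NONE' && echo "check user: $u"
    done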

It mostly happens to the users. I think I lost group conf files once or twice.

I think I correlated it with times when GPFS has long waiters...

Hope this adds to the picture,

Iwona

Chris Dagdigian wrote:

> 
> A well behaved Grid Engine should never lose its configuration this way
> -- in classic spooling there will be a human-readable text file in
> $SGE_ROOT/<cell>/common/schedd_configuration that contains the data. In
> binary Berkeley spooling the config would be in the spooldb files.
> I've never seen a partially broken configuration unless it was my own
> human error -- the times I've seen the SGE configuration totally hosed,
> either all the files were messed up or the Berkeley DB database was
> totally unrecoverable. In the cases of total configuration loss we've
> usually tracked the problems down to bad behavior by SAN client
> software or some other OS or file-server related issue that was
> external to grid engine.
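> 
> If it helps anyone rule out the storage layer, the relevant bits are easy
> to eyeball by hand (a rough sketch; file names and spool locations can
> differ between versions and installs, so treat the paths as examples):
> 
> # how this cell was set up to spool (classic vs. berkeleydb)
> grep spooling $SGE_ROOT/$SGE_CELL/common/bootstrap
> # classic spooling: the human-readable scheduler config mentioned above
> cat $SGE_ROOT/$SGE_CELL/common/schedd_configuration
> # Berkeley DB spooling: check that the database files named in
> # spooling_params are present and readable on the qmaster host
> ls -l $SGE_ROOT/$SGE_CELL/spool/spooldb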
> 
> In case this is a new problem, it would be helpful if you could reply
> with the following:
> 
> o what version of SGE this happened on
> o what OS/architecture
> o what spooling method was in use
> o any other files missing or corrupt?
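> 
> (Off the top of my head, and assuming a standard install with SGE_ROOT
> and SGE_CELL set in the environment, the first three can be pulled with
> something like the following -- untested here, so double-check the
> output on your own boxes:
> 
> qconf -help 2>&1 | head -1        # should print the GE version string
> uname -mrs                        # OS and architecture
> grep spooling_method $SGE_ROOT/$SGE_CELL/common/bootstrap
> )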
> 
> .. this way we can see if others report the same thing.  Spontaneous
> loss of scheduler configuration would be a big deal if reproducible in
> SGE 6.x.
> 
> Regards,
> Chris
> 
> 
> 
> On Jun 6, 2006, at 4:03 PM, Adam Brust wrote:
> 
>> Hi.
>>
>> I just spent an entire day troubleshooting what seems to be a very
>> similar problem. I finally found a resolution, and perhaps it can help
>> someone else.  Basically, somehow my scheduler config file got blown
>> away (maybe after I rebooted the system?).  The output of "qconf
>> -sconf" displayed nothing.  I re-created this config file and I am
>> able to run jobs again.
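>>
>> In case it saves someone else a day: one thing I will do going forward
>> is keep plain-text dumps of the configs, so a blown-away file can be
>> put back without retyping it. A rough sketch (the backup path is just
>> a placeholder, and I believe the -Mconf/-Msconf "load from file" forms
>> exist in 6.x, but check qconf -help on your version):
>>
>> # dump the global and scheduler configurations to text files
>> qconf -sconf  > /some/backup/dir/global_conf
>> qconf -ssconf > /some/backup/dir/sched_conf
>> # and load them back after a loss
>> qconf -Mconf  /some/backup/dir/global_conf
>> qconf -Msconf /some/backup/dir/sched_conf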
>>
>> The symptoms in this thread were nearly identical to mine, most
>> notably the 'got max. unheard timeout for target "execd" on host...'
>> in the qmaster message log, which led me to believe there were some
>> communication problems from the qmaster to the sgeexecd on the
>> nodes.  Unfortunately the error logs in this instance weren't very
>> helpful.
>>
>> Hope this helps.
>>
>> -adam
>>
>> Joe Landman wrote:
>>
>>> Chris Dagdigian wrote:
>>>
>>>>
>>>> Sensible error messages at least.
>>>>
>>>> (1) Are sge_qmaster and sge_schedd daemons running OK on the master?
>>>>
>>>> (2) Are there any firewalls blocking TCP port 536? Grid Engine
>>>> requires 2 TCP ports, one used by sge_qmaster and the other used
>>>> for sge_execd communication.
>>>>
>>>> (3) I've seen qrsh errors similar to this when the $SGE_ROOT was
>>>> being shared cluster-wide via NFS yet with extremely locked down
>>>> export permissions that forbid suid operations or remapped the root
>>>> user UID to a different, non-privileged user account.  Grid Engine
>>>> has some setuid binaries that should not be blocked or remapped, and
>>>> odd permissions will certainly break qrsh commands and sometimes
>>>> other things as well. You may want to look at file permissions and
>>>> how they appear from the head (qmaster) node versus how they look
>>>> when you login to a compute node (a quick check is sketched below).
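>>>>
>>>> Something along these lines (rough sketch only; in the installs I
>>>> have seen, the setuid helpers live under $SGE_ROOT/utilbin/<arch>,
>>>> but verify the layout on your own cluster):
>>>>
>>>> # run on a compute node: the rsh/rlogin helpers should show up setuid root
>>>> ls -l $SGE_ROOT/utilbin/`$SGE_ROOT/util/arch`/rsh \
>>>>       $SGE_ROOT/utilbin/`$SGE_ROOT/util/arch`/rlogin
>>>> # and check the NFS mount options for $SGE_ROOT (e.g. nosuid would break these)
>>>> mount | grep "$SGE_ROOT"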
>>>>
>>>> I'm not familiar with recent ROCKS so I can't say for sure how the
>>>> SGE rocks-roll is deployed or even if it uses a shared NFS
>>>> $SGE_ROOT by default. Sorry about that.
>>>>
>>>> { Just noticed Joe replying, he knows ROCKS far far better than  I !! }
>>>
>>>
>>>
>>> Hi Chris :)
>>>
>>>   Usually I see name service issues, but more often than not, I see
>>> iptables get in the way.
>>>
>>>   If you look on the head node with lsof (lsof is one of your many
>>> friends):
>>>
>>> [root at minicc ~]# lsof -i | grep -i sge
>>> sge_qmast  3072     sge    3u  IPv4   6914       TCP *:536 (LISTEN)
>>> sge_qmast  3072     sge    4u  IPv4   6934       TCP minicc.scalableinformatics.com:536->minicc.scalableinformatics.com:32781 (ESTABLISHED)
>>> sge_qmast  3072     sge    5u  IPv4 497728       TCP minicc.scalableinformatics.com:536->compute-0-0.local:33254 (ESTABLISHED)
>>> sge_sched  3091     sge    3u  IPv4   6933       TCP minicc.scalableinformatics.com:32781->minicc.scalableinformatics.com:536 (ESTABLISHED)
>>>
>>>
>>> You will see that it happily talks on port 536.  This is good, we  
>>> will play with this in a second.
>>>
>>> On the compute node, you will see something like this
>>>
>>> [root at compute-0-0 ~]# lsof  -i | grep -i sge
>>> sge_execd  3034     sge    3u  IPv4   6255       TCP *:537 (LISTEN)
>>> sge_execd  3034     sge    4u  IPv4  96002       TCP compute-0-0.local:33254->minicc.scalableinformatics.com:536 (ESTABLISHED)
>>>
>>> where the execd is in listen mode on port 537.  Now to check  
>>> connectivity.
>>>
>>> [root at compute-0-0 ~]# telnet minicc.local 536
>>> Trying 10.1.0.1...
>>> Connected to minicc.local (10.1.0.1).
>>> Escape character is '^]'.
>>>
>>> Yup, we can get through from the compute node to the head node.
>>> This means that the compute node is not being blocked by iptables on
>>> either node.  Let's try the other way:
>>>
>>> [root at minicc ~]# telnet c0-0 537
>>> Trying 10.1.255.254...
>>> Connected to compute-0-0.local (10.1.255.254).
>>> Escape character is '^]'.
>>>
>>> That also worked.  They should both work.  If they don't, this is a
>>> problem.
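>>>
>>> If either direction fails, the first thing I look at on Rocks nodes
>>> is iptables. Something like this (a rough sketch; chain names and
>>> rules will differ per site) tells you quickly whether a firewall is
>>> in play:
>>>
>>> # is iptables even running?
>>> service iptables status
>>> # if so, look for anything touching the SGE ports
>>> iptables -L -n --line-numbers | egrep '536|537'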
>>>
>>> As for qrsh working, the default install of Rocks 4.1 does not  have 
>>> a working qrsh.  I usually install my own SGE if I want a  working 
>>> qrsh (which I usually do).
>>>
>>> [landman at minicc ~]$ qrsh uname -a
>>> poll: protocol failure in circuit setup
>>>
>>> You should be able to run the following job like this:
>>>
>>> [landman at minicc ~]$ cat > e
>>> #!/bin/tcsh
>>> #$ -S /bin/tcsh
>>> uname -a
>>> date
>>> cat /proc/cpuinfo
>>> [landman at minicc ~]$ chmod +x e
>>> [landman at minicc ~]$ qsub e
>>> Your job 4 ("e") has been submitted.
>>> [landman at minicc ~]$ qstat
>>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
>>> -----------------------------------------------------------------------------------------------------------------
>>>       4 0.00000 e          landman      qw    05/22/2006 12:16:29                                    1
>>> [landman at minicc ~]$ qstat
>>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
>>> -----------------------------------------------------------------------------------------------------------------
>>>       4 0.00000 e          landman      qw    05/22/2006 12:16:29                                    1
>>> [landman at minicc ~]$ qstat
>>> [landman at minicc ~]$
>>> [landman at minicc ~]$ cat e.o4
>>> Warning: no access to tty (Bad file descriptor).
>>> Thus no job control in this shell.
>>> Linux compute-0-0.local 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36  BST 
>>> 2005 x86_64 x86_64 x86_64 GNU/Linux
>>> Mon May 22 12:16:37 EDT 2006
>>> processor       : 0
>>> vendor_id       : AuthenticAMD
>>> cpu family      : 15
>>> model           : 37
>>> model name      : AMD Opteron(tm) Processor 252
>>> stepping        : 1
>>> cpu MHz         : 2592.694
>>> cache size      : 1024 KB
>>> fpu             : yes
>>> fpu_exception   : yes
>>> cpuid level     : 1
>>> wp              : yes
>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr  
>>> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext  
>>> lm 3dnowext 3dnow pni ts
>>> bogomips        : 5095.42
>>> TLB size        : 1088 4K pages
>>> clflush size    : 64
>>> cache_alignment : 64
>>> address sizes   : 40 bits physical, 48 bits virtual
>>> power management: ts fid vid ttp
>>>
>>> processor       : 1
>>> vendor_id       : AuthenticAMD
>>> cpu family      : 15
>>> model           : 37
>>> model name      : AMD Opteron(tm) Processor 252
>>> stepping        : 1
>>> cpu MHz         : 2592.694
>>> cache size      : 1024 KB
>>> fpu             : yes
>>> fpu_exception   : yes
>>> cpuid level     : 1
>>> wp              : yes
>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr  
>>> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext  
>>> lm 3dnowext 3dnow pni ts
>>> bogomips        : 5177.34
>>> TLB size        : 1088 4K pages
>>> clflush size    : 64
>>> cache_alignment : 64
>>> address sizes   : 40 bits physical, 48 bits virtual
>>> power management: ts fid vid ttp
>>>
>>> Joe
>>>
>>>
>>>>
>>>>
>>>> -Chris
>>>>
>>>>
>>>>
>>>>
>>>> On May 22, 2006, at 4:52 PM, Mark_Johnson at URSCorp.com wrote:
>>>>
>>>>> Kickstarted 16:21 27-Mar-2006
>>>>> [urs1 at medusa ~]$ qrsh hostname
>>>>> error: error waiting on socket for client to connect: Interrupted system call
>>>>> error: unable to contact qmaster using port 536 on host "medusa.ursdcmetro.com"
>>>>> [urs1 at medusa ~]$
>>>>>
>>>>> Mark A. Johnson
>>>>> URS Network Administrator
>>>>> Gaithersburg, MD
>>>>> Ph:  301-721-2231
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
> 
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



