[GE users] SGE jobs in "qw" state

Craig Tierney ctierney at hypermall.net
Tue Jun 6 21:42:51 BST 2006



Chris Dagdigian wrote:
> 
> A well-behaved Grid Engine should never lose its configuration this way 
> -- with classic spooling there will be a human-readable text file in 
> $SGE_ROOT/<cell>/common/schedd_configuration that contains the data; with 
> binary Berkeley spooling the config would be in the spooldb files.
> I've never seen a partially broken configuration unless it was my own 
> human error -- the times I've seen the SGE configuration totally hosed, 
> either all the files were messed up or the Berkeley DB database was 
> totally unrecoverable. In cases of total configuration loss we've 
> usually tracked the problem down to bad behavior by SAN client 
> software or some other OS or file-server related issue that was external 
> to Grid Engine.
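
For what it's worth, a quick way to sanity-check that configuration under
classic spooling is something like this (a minimal sketch, assuming
$SGE_ROOT and $SGE_CELL are set in the environment):

   # what is on disk should match what the running scheduler reports
   ls -l $SGE_ROOT/$SGE_CELL/common/schedd_configuration
   qconf -ssconf | head     # live scheduler configuration as seen by qconf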

I have had times where the ASCII files in classic spooling would get
corrupted on the local disk.  They had not been changed in a long time
(long enough to have been flushed to disk).  I don't remember if it was
from an unclean shutdown or something else.  The files would contain
garbage.  When this happened (two or three times over a few years), the
qmaster would not restart.
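
A rough way to spot that kind of damage before a restart (just a sketch,
assuming classic spooling with plain-text files under the common directory):

   # flag anything in the common dir that no longer looks like text
   for f in $SGE_ROOT/$SGE_CELL/common/*; do
       [ -f "$f" ] && file "$f" | grep -qv text && echo "suspect: $f"
   done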

Craig


> 
> In case this is a new problem, it would be helpful if you could reply 
> with the following:
> 
> o what version of SGE did this happen on
> o what OS/architecture
> o what spooling method was in use
> o any other files missing or corrupt ?
> 
> ... this way we can see if others report the same thing.  Spontaneous 
> loss of scheduler configuration would be a big deal if it were 
> reproducible in SGE 6.x.
> 
> Regards,
> Chris
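
Something like this collects the details Chris asks for above in one shot
(a sketch, assuming an SGE 6.x installation with $SGE_ROOT and $SGE_CELL set):

   qstat -help | head -1                                        # SGE version string
   uname -a                                                     # OS and kernel of the qmaster host
   $SGE_ROOT/util/arch                                          # Grid Engine architecture string
   grep spooling_method $SGE_ROOT/$SGE_CELL/common/bootstrap    # classic vs. berkeleydb spooling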
> 
> 
> 
> On Jun 6, 2006, at 4:03 PM, Adam Brust wrote:
> 
>> Hi.
>>
>> I just spent an entire day troubleshooting what seems to be a very 
>> similar problem. I finally found a resolution, and perhaps it can help 
>> someone else.  Basically, somehow my scheduler config file got blown 
>> away (maybe after I rebooted the system?).  The output of "qconf 
>> -sconf" displayed nothing.  I re-created this config file and I am 
>> able to run jobs again.
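
One way to guard against a repeat (a sketch; "sched_conf.backup" is just an
example filename):

   qconf -ssconf > sched_conf.backup    # dump the working scheduler configuration to a file
   qconf -Msconf sched_conf.backup      # reload it later if the live config ever disappears again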
>>
>> The symptoms in this thread were nearly identical to mine, most 
>> notably the 'got max. unheard timeout for target "execd" on host...' 
>> in the qmaster message log, which led me to believe there were 
>> communication problems from the qmaster to the sgeexecd on the 
>> nodes.  Unfortunately the error logs in this instance weren't very 
>> helpful.
>>
>> Hope this helps.
>>
>> -adam
>>
>> Joe Landman wrote:
>>
>>> Chris Dagdigian wrote:
>>>
>>>>
>>>> Sensible error messages at least.
>>>>
>>>> (1) Are sge_qmaster and sge_schedd daemons running OK on the master?
>>>>
>>>> (2) Are there any firewalls blocking TCP port 536? Grid Engine 
>>>> requires two TCP ports, one used by sge_qmaster (536 here) and the 
>>>> other used for sge_execd communication (537 here).
>>>>
>>>> (3) I've seen qrsh errors similar to this when the $SGE_ROOT was 
>>>> being shared cluster-wide via NFS but with extremely locked-down 
>>>> export permissions that forbid suid operations or remapped the root 
>>>> user UID to a different, non-privileged user account.  Grid Engine 
>>>> has some setuid binaries that should not be blocked or remapped, and 
>>>> odd permissions will certainly break qrsh commands and sometimes 
>>>> other things as well. You may want to look at file permissions and 
>>>> how they appear from the head (qmaster) node versus how they look 
>>>> when you log in to a compute node.
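
A quick check for that case (a sketch; the exact set of setuid helpers can
vary by release, so treat the filenames as examples):

   # run on the qmaster host and again on a compute node; the output should match
   ARCH=`$SGE_ROOT/util/arch`
   ls -l $SGE_ROOT/utilbin/$ARCH/rsh $SGE_ROOT/utilbin/$ARCH/rlogin 2>/dev/null
   mount | grep $SGE_ROOT               # a nosuid mount option here would break the setuid binaries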
>>>>
>>>> I'm not familiar with recent ROCKS so I can't say for sure how the 
>>>> SGE rocks-roll is deployed or even if it uses a shared NFS $SGE_ROOT 
>>>> by default. Sorry about that.
>>>>
>>>> { Just noticed Joe replying, he knows ROCKS far far better than I !! }
>>>
>>>
>>> Hi Chris :)
>>>
>>>   Usually I see name service issues, but more often than not, I see 
>>> iptables get in the way.
>>>
>>>   If you look on the head node with lsof (lsof is one of your many 
>>> friends), you will see something like this:
>>>
>>> [root at minicc ~]# lsof -i | grep -i sge
>>> sge_qmast  3072     sge    3u  IPv4   6914       TCP *:536 (LISTEN)
>>> sge_qmast  3072     sge    4u  IPv4   6934       TCP minicc.scalableinformatics.com:536->minicc.scalableinformatics.com:32781 (ESTABLISHED)
>>> sge_qmast  3072     sge    5u  IPv4 497728       TCP minicc.scalableinformatics.com:536->compute-0-0.local:33254 (ESTABLISHED)
>>> sge_sched  3091     sge    3u  IPv4   6933       TCP minicc.scalableinformatics.com:32781->minicc.scalableinformatics.com:536 (ESTABLISHED)
>>>
>>>
>>> You will see that it happily talks on port 536.  This is good; we 
>>> will play with this in a second.
>>>
>>> On the compute node, you will see something like this
>>>
>>> [root at compute-0-0 ~]# lsof  -i | grep -i sge
>>> sge_execd  3034     sge    3u  IPv4   6255       TCP *:537 (LISTEN)
>>> sge_execd  3034     sge    4u  IPv4  96002       TCP compute-0-0.local:33254->minicc.scalableinformatics.com:536 (ESTABLISHED)
>>>
>>> where the execd is in listen mode on port 537.  Now to check 
>>> connectivity.
>>>
>>> [root at compute-0-0 ~]# telnet minicc.local 536
>>> Trying 10.1.0.1...
>>> Connected to minicc.local (10.1.0.1).
>>> Escape character is '^]'.
>>>
>>> Yup, we can get through from the compute node to the head node.  This 
>>> means that the connection is not being blocked by iptables on 
>>> either node.  Let's try the other way:
>>>
>>> [root at minicc ~]# telnet c0-0 537
>>> Trying 10.1.255.254...
>>> Connected to compute-0-0.local (10.1.255.254).
>>> Escape character is '^]'.
>>>
>>> That also worked.  They should both work.  If they don't, this is a 
>>> problem.
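
If either direction had failed, the firewall rules would be the first thing
to look at (a sketch, run as root on whichever node refuses the connection):

   iptables -L -n | egrep '536|537'     # any DROP/REJECT rules touching the SGE ports?
   service iptables status              # RHEL/Rocks-style check of whether the firewall is running at all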
>>>
>>> As for qrsh working, the default install of Rocks 4.1 does not have a 
>>> working qrsh.  I usually install my own SGE if I want a working qrsh 
>>> (which I usually do).
>>>
>>> [landman at minicc ~]$ qrsh uname -a
>>> poll: protocol failure in circuit setup
>>>
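
The qrsh transport is controlled by the rsh/rlogin settings in the global
configuration; a sketch of where to look (the parameter names come from the
sge_conf settings):

   qconf -sconf | egrep 'rsh_|rlogin_|qlogin_'    # which daemons/wrappers qrsh is wired to use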
>>> Even without a working qrsh, you should be able to run a batch job 
>>> like this:
>>>
>>> [landman at minicc ~]$ cat > e
>>> #!/bin/tcsh
>>> #$ -S /bin/tcsh
>>> uname -a
>>> date
>>> cat /proc/cpuinfo
>>> [landman at minicc ~]$ chmod +x e
>>> [landman at minicc ~]$ qsub e
>>> Your job 4 ("e") has been submitted.
>>> [landman at minicc ~]$ qstat
>>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
>>> -----------------------------------------------------------------------------------------------------------------
>>>       4 0.00000 e          landman      qw    05/22/2006 12:16:29                               1
>>> [landman at minicc ~]$ qstat
>>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
>>> -----------------------------------------------------------------------------------------------------------------
>>>       4 0.00000 e          landman      qw    05/22/2006 12:16:29                               1
>>> [landman at minicc ~]$ qstat
>>> [landman at minicc ~]$
>>> [landman at minicc ~]$ cat e.o4
>>> Warning: no access to tty (Bad file descriptor).
>>> Thus no job control in this shell.
>>> Linux compute-0-0.local 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36 BST 2005 x86_64 x86_64 x86_64 GNU/Linux
>>> Mon May 22 12:16:37 EDT 2006
>>> processor       : 0
>>> vendor_id       : AuthenticAMD
>>> cpu family      : 15
>>> model           : 37
>>> model name      : AMD Opteron(tm) Processor 252
>>> stepping        : 1
>>> cpu MHz         : 2592.694
>>> cache size      : 1024 KB
>>> fpu             : yes
>>> fpu_exception   : yes
>>> cpuid level     : 1
>>> wp              : yes
>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow pni ts
>>> bogomips        : 5095.42
>>> TLB size        : 1088 4K pages
>>> clflush size    : 64
>>> cache_alignment : 64
>>> address sizes   : 40 bits physical, 48 bits virtual
>>> power management: ts fid vid ttp
>>>
>>> processor       : 1
>>> vendor_id       : AuthenticAMD
>>> cpu family      : 15
>>> model           : 37
>>> model name      : AMD Opteron(tm) Processor 252
>>> stepping        : 1
>>> cpu MHz         : 2592.694
>>> cache size      : 1024 KB
>>> fpu             : yes
>>> fpu_exception   : yes
>>> cpuid level     : 1
>>> wp              : yes
>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow pni ts
>>> bogomips        : 5177.34
>>> TLB size        : 1088 4K pages
>>> clflush size    : 64
>>> cache_alignment : 64
>>> address sizes   : 40 bits physical, 48 bits virtual
>>> power management: ts fid vid ttp
>>>
>>> Joe
>>>
>>>
>>>>
>>>>
>>>> -Chris
>>>>
>>>>
>>>>
>>>>
>>>> On May 22, 2006, at 4:52 PM, Mark_Johnson at URSCorp.com wrote:
>>>>
>>>>> Kickstarted 16:21 27-Mar-2006
>>>>> [urs1 at medusa ~]$ qrsh hostname
>>>>> error: error waiting on socket for client to connect: Interrupted system call
>>>>> error: unable to contact qmaster using port 536 on host "medusa.ursdcmetro.com"
>>>>> [urs1 at medusa ~]$
>>>>>
>>>>> Mark A. Johnson
>>>>> URS Network Administrator
>>>>> Gaithersburg, MD
>>>>> Ph:  301-721-2231
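
For the error quoted above, it is worth confirming that the client and the
qmaster agree on the host and port (a sketch, assuming the standard
settings.sh environment has been sourced):

   echo $SGE_QMASTER_PORT                        # if empty, the port is taken from /etc/services
   getent services sge_qmaster sge_execd         # should list 536 and 537 if the services entries exist
   cat $SGE_ROOT/$SGE_CELL/common/act_qmaster    # the qmaster host name clients will try to reach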
>>>>
>>>>
>>>
>>>
>>>
>>
> 
> 
