[GE users] possible qstat problem with 6.0u7?

Marco Donauer - SUN Microsystems Marco.Donauer at Sun.COM
Tue Dec 20 08:45:29 GMT 2005



Sebastian,

please find my comments inline in your text.

Sebastian Stark wrote:

>Hi,
>
>sorry for answering late, I was on holidays.
>
>On Friday 16 December 2005 10:33, Marco Donauer - SUN Microsystems wrote:
>  
>
>>Hello Sebastian,
>>
>>do you have a special setup or any other hint for me?
>>I can see that you didn't use the csp feature.
>>    
>>
>
>Uhm, what exactly is the csp feature?
>  
>
csp is an installation mode of sge with increased security. During the
installation, certificates are created, which are used for authentication.
The execd, for example, authenticates with its certificate at the qmaster.
The install guide shows how to install a csp system.

>  
>
>>Did you notice any other errors?
>>    
>>
>
>Not yet. For the last few days our gridengine installation was successfully 
>dealing with thousands of jobs. So far I (or any user) did not notice a 
>problem.
>  
>
OK, so I will set up a cluster and write a script that loops a qstat.
Perhaps the error will appear on our side, too.
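
A minimal sketch of such a loop, for anyone who wants to try the same thing. The function name `loop_command`, its parameters and the defaults are made up for illustration and are not part of sge; the command to run is passed in as a list, so the exact qstat options stay up to the caller.

```python
import subprocess
import time


def loop_command(cmd, iterations=100, delay=1.0, timeout=60.0):
    """Run cmd repeatedly and return a list of the runs that failed.

    Each entry is (iteration, return code or None on timeout, stderr text).
    The name and the defaults are illustrative assumptions, not sge features.
    """
    failures = []
    for i in range(iterations):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            # The client hung; record it and keep looping.
            failures.append((i, None, "timed out"))
        else:
            if result.returncode != 0:
                failures.append((i, result.returncode, result.stderr.strip()))
        # Pause between runs, but not after the last one.
        if delay and i < iterations - 1:
            time.sleep(delay)
    return failures
```

Calling `loop_command(["qstat"])` on a submit host would run a plain qstat 100 times, one second apart, and return only the failing runs, so an intermittent error shows up together with its exit code and stderr.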

>  
>
>>Does the messages file contain any errors?
>>    
>>
>
>Lots of them...  but it always does in a busy environment, doesn't it? I did 
>not find anything directly connected to the problem I have. Anyway, I put the 
>qmaster/messages file on our ftp server so you can see if it contains useful 
>information. I deleted everything up to the first lines after the upgrade to u7. 
>You can find it under:
>
>  ftp://ftp.tue.mpg.de/pub/kyb/stark/messages.gz
>
>One thing I noticed right now after I restarted sgemaster was this entry:
>
>  12/19/2005 10:42:20|qmaster|neckar|E|commlib error: got read error (closing "neckar.kyb.local/qstat/3")
>  
>
I talked to our communication guru, and this is not a problem. This message
appears when a client is stopped with Ctrl-C or a kill.

>
>My qmaster and schedd run on a 32-bit amd machine while some of my clients are 
>amd64 and some are x86 machines. I use the courtesy binaries throughout the 
>whole cluster on an nfs mounted share.
>
>I have two custom loadsensors running, one for global license consumables and 
>one on every exechost to report the network load and disk usage.
>
>We have mostly batch jobs, some of them array jobs with up to 10000 tasks. We 
>also have a lot of qlogins over ssh. There is one parallel job running at the 
>moment. I would really like to stop it to see if that changes anything but 
>the owner of the job is going to cut my throat then... I have a very slight 
>suspicion that the problem appeared after I allowed parallel jobs.
>
>The problem was definitely not existent (well, at least it did not bother me) 
>before I changed my installation from u4 to u7.
>
>  
>
>>Currently I'm not able to reproduce either error. The qstat -F .....
>>-q .... is working
>>and the qstat -j is working too.
>>The debug output shows that a function fails due to a provided null
>>pointer, but I can't see what
>>the reason for this is.
>>    
>>
>
>Could it be faulty memory? It's somehow hard to believe because the machine 
>had ~150 days uptime recently without any problem. It's constantly under 
>heavy load though. The only thing is I noticed APIC errors so I rebooted 
>linux with "noapic". Could _that_ be a problem for sge?
>
>Could it be a corrupted database? I am not exactly sure if I use BDB 
>scheduling or not, how can I find out?
>
>What else can I do to help you?
>
>
>  
>
Hm, I don't know. I don't think that faulty memory is the reason.
You're talking about a high load.
Is this load on the nfs also? In this case the connection to the master
host could be lost.

One other question: did you upgrade from u4 to u7, or is this a
completely new installation with u7?
In case of an upgrade, are you really sure that all binaries and libs
were upgraded, e.g. local binaries or something else?

To answer your BDB question: you can find out by looking into the
bootstrap file (default/common/bootstrap).
It contains an entry called spooling_method (berkeley_db=BDB,
classic=classic spooling).
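
The lookup can also be scripted. The following is only a sketch: it assumes the bootstrap file holds one whitespace-separated "name value" pair per line with `#` comment lines, and the helper names `read_spooling_method` and `uses_berkeley_db` are invented for this example. The exact spelling of the Berkeley DB value may differ between versions, so the check only looks for the substring "berkeley".

```python
from pathlib import Path


def read_spooling_method(bootstrap_path):
    """Return the spooling_method value from a bootstrap file, or None.

    Assumes one whitespace-separated "name value" pair per line and
    '#' for comment lines (an assumption about the file layout).
    """
    for line in Path(bootstrap_path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "spooling_method":
            return parts[1].strip()
    return None


def uses_berkeley_db(method):
    """True if the value names Berkeley DB spooling (loose substring match)."""
    return method is not None and "berkeley" in method.lower()
```

With the cell directory as the working directory, `read_spooling_method("default/common/bootstrap")` returns the configured value, and `uses_berkeley_db(...)` on that value tells BDB spooling apart from classic spooling.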

Thanks for your help. I will contact you again if I need more info or if
I have a solution.

Regards,
Marco

>-Sebastian
>
>  
>
