[GE users] possible qstat problem with 6.0u7?

Sebastian Stark stark at tuebingen.mpg.de
Mon Dec 19 10:20:41 GMT 2005




Hi,

Sorry for the late reply, I was on holiday.

On Friday 16 December 2005 10:33, Marco Donauer - SUN Microsystems wrote:
> Hello Sebastian,
>
> do you have a special setup or any other hint for me?
> I can see that you didn't use the csp feature.

Uhm, what exactly is the csp feature?

> Did you notice any other errors?

Not yet. For the last few days our gridengine installation has been
successfully handling thousands of jobs. So far neither I nor any user has
noticed a problem.

> Does the messages file contain any errors?

Lots of them... but it always does in a busy environment, doesn't it? I did
not find anything directly connected to the problem I have. Anyway, I put the
qmaster/messages file on our FTP server so you can see if it contains useful
information. I deleted everything before the first lines logged after the
upgrade to u7. You can find it under:

  ftp://ftp.tue.mpg.de/pub/kyb/stark/messages.gz

One thing I noticed right now after I restarted sgemaster was this entry:

  12/19/2005 10:42:20|qmaster|neckar|E|commlib error: got read error (closing "neckar.kyb.local/qstat/3")
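
In case it helps when going through the messages file: below is a minimal
Python sketch for counting the error entries. It assumes every line follows
the same date time|daemon|host|level|message layout as the entry above; that
layout is inferred from this one line only, and the function names are made
up for illustration.

```python
import gzip
import re
from collections import Counter

# Assumed layout, inferred from the single entry quoted above:
#   MM/DD/YYYY HH:MM:SS|daemon|host|level|message
LINE_RE = re.compile(
    r"^\s*(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\|"
    r"(?P<daemon>[^|]*)\|(?P<host>[^|]*)\|(?P<level>[A-Z])\|(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one messages line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def count_errors(lines):
    """Count how often each message with level 'E' occurs in an iterable of lines."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["level"] == "E":
            counts[entry["message"]] += 1
    return counts


def count_errors_in_file(path):
    """Same as count_errors, reading a plain or gzip-compressed file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as handle:
        return count_errors(handle)


# The entry from above, joined into one line.
sample = (
    "12/19/2005 10:42:20|qmaster|neckar|E|commlib error: got read error "
    '(closing "neckar.kyb.local/qstat/3")'
)
entry = parse_line(sample)
print(entry["level"], entry["message"])
```

Calling count_errors_in_file("messages.gz").most_common(10) on a downloaded
copy would list the ten most frequent error messages. Lines that do not match
the assumed layout are skipped.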


My qmaster and schedd run on a 32-bit AMD machine, while some of my clients
are amd64 and some are x86 machines. I use the courtesy binaries throughout
the whole cluster on an NFS-mounted share.

I have two custom load sensors running: one for global license consumables
and one on every exechost that reports the network load and disk usage.

We have mostly batch jobs, some of them array jobs with up to 10000 tasks. We
also have a lot of qlogins over ssh. There is one parallel job running at the
moment. I would really like to stop it to see if that changes anything, but
the owner of the job would cut my throat... I have a very slight suspicion
that the problem appeared after I allowed parallel jobs.

The problem definitely did not exist (well, at least it did not bother me)
before I upgraded my installation from u4 to u7.

> Currently I'm not able to reproduce either of the errors. The qstat -F .....
> -q .... is working
> and the qstat -j is working too.
> The debug output shows that a function fails due to a provided null
> pointer, but I can't see what
> the reason for this is.

Could it be faulty memory? That is somewhat hard to believe because the
machine recently had ~150 days of uptime without any problem. It is
constantly under heavy load, though. The only thing is that I noticed APIC
errors, so I rebooted Linux with "noapic". Could _that_ be a problem for SGE?

Could it be a corrupted database? I am not exactly sure if I use BDB
spooling or not. How can I find out?

What else can I do to help you?


-Sebastian

-- 
Sebastian Stark -- http://www.kyb.tuebingen.mpg.de/~stark
Max Planck Institute for Biological Cybernetics
