[GE users] possible qstat problem with 6.0u7?

Sebastian Stark stark at tuebingen.mpg.de
Thu Dec 22 13:26:19 GMT 2005



On Tuesday 20 December 2005 09:45, Marco Donauer - SUN Microsystems wrote:
> I talked to our communication guru, and this is not a problem. This message
> appears if a client is stopped with Ctrl-C or a kill.

Yes, it sometimes happens when people run "qlogin -now no" and get tired 
of waiting for a slot after an hour or so. They hit Ctrl-C and the qlogin 
process goes away. Unfortunately, the job does not disappear from the qstat 
listing, but that's another problem.

> >>Currently I'm not able to reproduce either of the two errors. The qstat -F .....
> >>-q .... is working

The "qstat -F ... -q all.q@node1" problem is a bit clearer now:

If I add node1 to the /etc/hosts file, it works. It does not work, however, if 
DNS is the only way to resolve the host's name. Most notably, this means the 
problem is not necessarily related to the u4->u7 transition, because I also 
switched from /etc/hosts to DNS after the upgrade.
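To narrow down which resolver path is failing, one can compare what /etc/hosts says about a name with what the system resolver returns. A minimal sketch in Python, assuming standard hosts-file syntax (an address followed by a canonical name and optional aliases, with `#` starting a comment); `hosts_file_lookup` and `resolver_lookup` are hypothetical helper names, not part of Grid Engine.

```python
import socket
from typing import Optional


def hosts_file_lookup(hosts_text: str, name: str) -> Optional[str]:
    """Return the first address that /etc/hosts-style text maps `name` to, or None."""
    wanted = name.lower()
    for raw in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed entry: an address without any name
        address, names = fields[0], fields[1:]
        # Hostnames compare case-insensitively.
        if wanted in (n.lower() for n in names):
            return address
    return None


def resolver_lookup(name: str) -> Optional[str]:
    """Ask the system resolver (hosts file, DNS, ... as configured) for `name`."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
```

For example, `hosts_file_lookup(open("/etc/hosts").read(), "node1")` versus `resolver_lookup("node1")` shows whether the name is only resolvable through the hosts file, only through DNS, or both, and whether the addresses agree.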

> >>and the qstat -j is working too.

The problem with qstat -j still exists, even if I add all hosts to /etc/hosts. So 
the two problems do not seem to be related.

> Hm, I don't know. I don't think faulty memory is the reason.
> You're talking about a high load.
> Is the NFS also under this load? In that case the connection to the master
> host could be lost.

The NFS load is high sometimes, yes; that's why we use a Solaris 
fileserver :) I never noticed any NFS connection problems. If there were any, I'm 
sure the users would complain, since their home directories would stop working.

> One other question: did you upgrade from u4 to u7, or is this a
> completely new installation with u7?

I upgraded exactly as the upgrade manual describes, including the backup and 
BDB upgrade steps.

> In case of an upgrade, are you really sure that all binaries and libs
> were upgraded, e.g. local binaries or something else?

All binaries and shared libraries in the lib/, bin/ and utilbin/ directories 
carry the date "Dec  9 13:41", so they were definitely updated. I also checked 
the uptimes of all nodes; there's no way they could still be running an old 
execd or anything like it, since they all share the same SGE installation via NFS.

>
> To answer your BDB question, you can find out by looking at the
> bootstrap file (default/common/bootstrap).
> It contains a spooling_method entry (berkeley_db = BDB,
> classic = classic spooling).

Hmm:

neckar ~ % cat /usr/local/sge/default/common/bootstrap
# Version: 6.0u2
#
admin_user             sge
default_domain          none
ignore_fqdn             true
spooling_method         berkeleydb
spooling_lib            libspoolb
spooling_params         /usr/local/sge/default/spool/spooldb
binary_path             /usr/local/sge/bin
qmaster_spool_dir       /usr/local/sge/default/spool/qmaster
security_mode           none

I'm really concerned about this "Version: 6.0u2" line. Also, the 
"default_domain" setting might solve the problem I have with DNS.
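To check these settings without eyeballing the file, the bootstrap file can be read as whitespace-separated key/value pairs. A minimal Python sketch, assuming the format seen in the listing above (one `key value` pair per line, `#` lines as comments, with the `# Version:` header picked out separately); `parse_bootstrap` is a hypothetical helper, not part of Grid Engine.

```python
from typing import Dict, Optional, Tuple


def parse_bootstrap(text: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Return (version from the '# Version:' header, {key: value}) for bootstrap-style text."""
    version: Optional[str] = None
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Comment line; the header records the version that wrote the file.
            body = line.lstrip("#").strip()
            if body.lower().startswith("version:"):
                version = body.split(":", 1)[1].strip()
            continue
        # Key and value are separated by any run of spaces or tabs.
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return version, settings
```

Applied to the file shown above, `parse_bootstrap(open("/usr/local/sge/default/common/bootstrap").read())` returns the version `6.0u2` together with `spooling_method` set to `berkeleydb` and `default_domain` set to `none`.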


Thank you.


-Sebastian

-- 
Sebastian Stark -- http://www.kyb.tuebingen.mpg.de/~stark
Max Planck Institute for Biological Cybernetics

