[GE users] qmaster SEGVs

ckoe christof.koehler at bccms.uni-bremen.de
Sat Mar 13 12:09:32 GMT 2010



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello everybody.

fx wrote:
> There have been a few reports of unexplained SEGVs starting to appear
> from recent versions of qmaster.  Has anyone got to the bottom of them?
> It's not obvious from the issue tracker.
>
> It's just started happening here without any obvious change in
> configuration or type of jobs running.  Obviously I'll try to debug it
> next week, but I wonder if anyone has any clues in the meantime.
>

We have been observing qmaster crashes with 6.2u4 and 6.2u5 since
December. At one point 6.2u4 crashed about every 30 minutes. At the
moment 6.2u5 crashes about every 8 hours on average. We have a shadow
qmaster setup (on hosts winter01a and neuro01a), so it is not
user-visible most of the time.

The most annoying thing is that it breaks part of the spooling database
every time, e.g. users are simply _gone_. Excerpt from qmaster messages:


03/12/2010 21:42:52|   jvm|neuro01a|W|could not read keystore path fopen("/usr/local/opt/sge_6.2u5/bccms_fr/common/jmx/management.properties") failed: No such file or directory
03/12/2010 21:42:52|schedu|neuro01a|E|element "winter36a.bccms.uni-bremen.de" does not exist
03/12/2010 21:42:52|schedu|neuro01a|E|callback function for event "1. EVENT MOD EXECHOST winter36a.bccms.uni-bremen.de" failed
03/12/2010 21:42:52|schedu|neuro01a|E|can't find cluster queue smp_1 for update in function qinstance_update_cqueue_list
03/12/2010 21:42:52|schedu|neuro01a|E|callback function for event "2. EVENT MOD QUEUE INSTANCE smp_1 at neuro35a.bccms.uni-bremen.de" failed
03/12/2010 21:42:52|schedu|neuro01a|E|can't find cluster queue smp_2 for update in function qinstance_update_cqueue_list
03/12/2010 21:42:52|schedu|neuro01a|E|callback function for event "3. EVENT MOD QUEUE INSTANCE smp_2 at neuro35a.bccms.uni-bremen.de" failed
03/12/2010 21:42:52|schedu|neuro01a|E|element "neuro35a.bccms.uni-bremen.de" does not exist
03/12/2010 21:42:52|schedu|neuro01a|E|callback function for event "4. EVENT MOD EXECHOST neuro35a.bccms.uni-bremen.de" failed
03/13/2010 04:31:08|  main|winter01a|I|read job database with 89 entries in 1 seconds
03/13/2010 04:31:08|  main|winter01a|E|error parsing double value from string "SCcCCSCCCC"
03/13/2010 04:31:08|  main|winter01a|E|unrecognized characters after the attribute values in line 9: "0.000000"
03/13/2010 04:31:08|  main|winter01a|E|error reading file: "/usr/local/opt/sge_6.2u5/bccms_fr/spool/qmaster/users/svea"
03/13/2010 04:31:08|  main|winter01a|E|unrecognized characters after the attribute values in line 9: "mem"
03/13/2010 04:31:08|  main|winter01a|E|error reading file: "/usr/local/opt/sge_6.2u5/bccms_fr/spool/qmaster/users/toelle"
03/13/2010 04:31:08|  main|winter01a|E|unrecognized characters after the attribute values in line 9: "mem"
03/13/2010 04:31:08|  main|winter01a|E|error reading file: "/usr/local/opt/sge_6.2u5/bccms_fr/spool/qmaster/users/raina"
03/13/2010 04:31:08|  main|winter01a|E|unrecognized characters after the attribute values in line 9: "mem"
03/13/2010 04:31:08|  main|winter01a|E|error reading file: "/usr/local/opt/sge_6.2u5/bccms_fr/spool/qmaster/users/ckoe"
03/13/2010 04:31:08|  main|winter01a|E|unrecognized characters after the attribute values in line 9: "mem"
03/13/2010 04:31:08|  main|winter01a|E|error reading file: "/usr/local/opt/sge_6.2u5/bccms_fr/spool/qmaster/users/niehaus"
03/13/2010 04:31:08|  main|winter01a|E|error opening file "/usr/local/opt/sge_6.2u5/bccms_fr/spool/qmaster/./sharetree" for reading: No such file or directory
03/13/2010 04:31:08|  main|winter01a|I|qmaster hard descriptor limit is set to 8192
03/13/2010 04:31:08|  main|winter01a|I|qmaster soft descriptor limit is set to 8192
03/13/2010 04:31:08|  main|winter01a|I|qmaster will use max. 8172 file descriptors for communication
03/13/2010 04:31:08|  main|winter01a|I|qmaster will accept max. 99 dynamic event clients
03/13/2010 04:31:08|  main|winter01a|I|starting up GE 6.2u5 (lx24-amd64)

qmaster# qconf -muser ckoe
ckoe is not known as user

This happens although the user has running jobs: after failing to read
the user's spool file at startup (see log excerpt above), SGE deleted
the user. I have everything on auto subscribe to hide this.

The "EVENT MOD ..." messages are sporadic; they do not appear after
every restart.
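
To see at a glance which host and thread produce the errors in an
excerpt like the one above, something along the following lines might
help. It is only a minimal sketch: it assumes the pipe-separated layout
visible in the excerpt (date time|thread|host|level|message), treats any
non-empty line that does not start with a timestamp as the continuation
of a wrapped entry, and the function names parse_messages and
count_errors are made up for illustration, not part of SGE.

```python
import re
from collections import Counter

# Assumption: every entry starts with "MM/DD/YYYY HH:MM:SS|", as in the excerpt.
ENTRY_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\|")


def parse_messages(lines):
    """Yield (timestamp, thread, host, level, message) for each log entry."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        if ENTRY_START.match(line):
            entries.append(line)
        elif line.strip() and entries:
            # Continuation of a wrapped entry: glue it back onto the previous one.
            entries[-1] += " " + line.strip()
    for entry in entries:
        # maxsplit=4 keeps any "|" inside the message text intact.
        timestamp, thread, host, level, message = entry.split("|", 4)
        yield timestamp, thread.strip(), host, level, message


def count_errors(lines):
    """Count entries with level "E" per (host, thread) pair."""
    return Counter(
        (host, thread)
        for _, thread, host, level, _ in parse_messages(lines)
        if level == "E"
    )
```

It could be fed the messages file directly, e.g.
count_errors(open(path)) with path pointing at the qmaster messages
file of the installation in question.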

I also observe log messages like this:
03/12/2010 17:51:46|  main|neuro01a|E|not enough memory for unpacking pe_task "jobs/00/0004/5670/1-4096/1/.common"

After manually deleting the (old) job directory the message disappeared.
The .common file actually had a length of zero bytes.
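
Since the file in question turned out to be empty, it may be worth
looking for other zero-length files under the spool directory. The
following is a rough sketch only: the helper name find_empty_files is
invented here, it merely lists candidates and deletes nothing, and
whether every empty file is actually a problem is not established.

```python
import os


def find_empty_files(spool_dir):
    """Return the sorted paths of zero-byte regular files below spool_dir."""
    empty = []
    for root, _dirs, files in os.walk(spool_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.isfile(path) and os.path.getsize(path) == 0:
                    empty.append(path)
            except OSError:
                # File vanished or is unreadable (e.g. an NFS hiccup); skip it.
                continue
    return sorted(empty)
```

Calling find_empty_files() on the qmaster spool directory would then
print nothing harmful and just return the list of suspicious paths to
inspect by hand.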

Attaching gdb to a running qmaster shows that the crashes happen in
different subroutines, for example cull_hash_free_descr or
lCopySwitchPack (do the courtesy binaries not contain complete debug
info?).


The OS is Ubuntu 9.10 amd64, with courtesy binaries, classic spooling on
NFSv3, schedd_job_info false, and heavy use of wildcards for PEs.

I did not write to the list about this earlier because I will be on
vacation for the next few weeks, so it seemed better to wait until
after the vacation to help with debugging.



Best Regards

Christof Köhler


- --
Dr. rer. nat. Christof Köhler       email: c.koehler at bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-2486
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-4764
28359 Bremen

PGP:
http://www.bccms.uni-bremen.de/fileadmin/BCCMS/pgp_keys/ChristofKoehler_UniBremen.asc
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFLm4B8RtHb9dSZpXwRAlnkAKDCca7OtXHYxGEmzSG7skorC+8LygCeNzJr
JzFDlH0kgNyAtnz5YE97rF0=
=u7MM
-----END PGP SIGNATURE-----

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=248326
