[GE users] Random major ssl issue

jlb jlb at salilab.org
Thu May 14 19:47:00 BST 2009

I've been running SGE 6.1u3 with CSP for over 18 months now.  The first 
round of cert renewals went fine last October.  This morning, all of a 
sudden, everything and everyone is getting this error:

error: commlib error: ssl connect error (SSL handshake error)
error: commlib error: ssl error (the used certificate is expired)

The qmaster messages file shows:

05/14/2009 08:24:02|qmaster|$SERVER|E|commlib error: ssl accept error (ssl accept error for client "$CLIENT")
05/14/2009 08:24:02|qmaster|$SERVER|E|commlib error: ssl error (the used certificate is expired or invalid)

Absolutely nothing changed -- nobody was at work yet when this started and 
the logs show no remote access.  Every cert that I can find is valid for 
months yet:

$ openssl x509 -dates -in common/sgeCA/cacert.pem -noout
notBefore=Oct 29 18:59:27 2008 GMT
notAfter=Oct 29 18:59:27 2009 GMT
$ openssl x509 -dates -in common/sgeCA/certs/cert.pem -noout
notBefore=Oct 30 17:17:21 2008 GMT
notAfter=Oct 30 17:17:21 2009 GMT

and so on for all the various userkeys in /var/sgeCA.  Is there somewhere 
else I should be looking for certs?
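
A rough way to sweep every PEM file under the CA directories, rather than checking certs one at a time, might look like the sketch below. The function name check_certs is made up for illustration, and it assumes the certificates are PEM files whose names end in .pem. It relies on the -checkend option of "openssl x509", which exits non-zero when a cert expires within the given number of seconds.

```shell
#!/bin/sh
# Sketch: list every certificate under a directory tree that has expired
# or will expire within a given number of seconds.
# Assumption: certificates are PEM files named *.pem. Files that are not
# X.509 certificates (for example private keys) are skipped.
check_certs() {
    dir=$1
    window=${2:-0}   # seconds from now; 0 means "already expired"
    find "$dir" -type f -name '*.pem' | while IFS= read -r pem; do
        # Skip anything openssl cannot parse as a certificate.
        openssl x509 -in "$pem" -noout 2>/dev/null || continue
        # -checkend exits non-zero if the cert expires within $window seconds.
        if ! openssl x509 -in "$pem" -noout -checkend "$window" >/dev/null 2>&1; then
            printf 'EXPIRING %s %s\n' "$pem" \
                "$(openssl x509 -in "$pem" -noout -enddate)"
        fi
    done
}
```

For example, "check_certs /var/sgeCA 0" would list certs that are already expired according to the local clock, and "check_certs common/sgeCA 2592000" would list those expiring within 30 days. Because the check uses the local clock, it is worth running on both the qmaster and a client.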

I will note that there are a few "different" errors buried in the messages 
file.  They look like this:

05/12/2009 14:05:43|qmaster|$SERVER|E|commlib error: ssl error ([ID=336216132] in module "SSL routines": "internal error")
05/12/2009 14:05:43|qmaster|$SERVER|E|commlib error: got read error (closing "opt115/qdel/60509")
05/14/2009 10:10:30|qmaster|$SERVER|E|commlib error: ssl error ([ID=336216132] in module "SSL routines": "internal error")
05/14/2009 10:10:30|qmaster|$SERVER|E|commlib error: ssl accept error (ssl accept error for client "opt5")

There are currently 10 such errors in the messages file.  The first 4 
(like the first example above) are associated with read errors for either 
qhost or qdel, and occurred a couple of days ago.  The other 6 occurred 
today (although the first of them came a couple of hours after the SSL 
issues arose) and are associated with accept errors.
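
For anyone wanting to decode that ID, a minimal sketch follows. It assumes (unconfirmed) that commlib logs the raw OpenSSL error code in decimal, which "openssl errstr" expects in hex. The helper names err_hex and decode_ssl_error are made up for illustration.

```shell
#!/bin/sh
# Sketch: turn the decimal ID from the qmaster messages file into the hex
# form that "openssl errstr" expects.
# Assumption: the logged ID is the unmodified OpenSSL error code.
err_hex() {
    printf '%X\n' "$1"
}

# Ask the local openssl binary to describe the error code.
decode_ssl_error() {
    openssl errstr "$(err_hex "$1")"
}
```

Here "err_hex 336216132" prints 140A4044, and "decode_ssl_error 336216132" passes that value to openssl errstr, which names the library, function and reason behind the code as far as the installed OpenSSL version knows them.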

Does anybody have any ideas on this?  Obviously, I'm at a standstill here 
and the hordes are getting restless.  Thanks.

Joshua Baker-LePain
QB3 Shared Cluster Sysadmin

