[GE users] Stale finished jobs and nodes in error state (solved)
norbert.crettol at idiap.ch
Wed Apr 30 10:13:36 BST 2008
This is a report of sorts.
Some time ago I sent a mail to this mailing list
about stale jobs, sometimes not running and sometimes
running but remaining in the queue. When this happened,
the nodes lost the connection with the master and I had
to kill and restart sge_execd to get them back.
Following the hints given by the people on this list
(thank you once more), I set the spool to local on the
nodes, and this lowered the error rate. But I still had
a stale-job rate of about 0.08%, and every occurrence of
a stale job came with a disconnected node.
After some problems with LDAP due to overload, one of
my colleagues noticed that the cluster was making a huge
number of LDAP requests. I installed nscd (to cache the
LDAP requests) on all nodes, and we added the following to the
nssldap settings (/etc/libnss-ldap.conf) so that nssldap
retries the connection when it fails:
# Connection parameters
# ... connection establishment
# ... connection persistency
# ... request timeout
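To check on a node whether user lookups are actually being answered quickly once nscd is in place, something like the following rough sketch might help. It is only an illustration and not part of our setup: the function name time_user_lookups is made up, and it simply times repeated passwd lookups through the system's name service (whatever is configured there: files, LDAP, nscd). With a working cache, the repeated lookups would typically be fast, but the actual numbers depend entirely on the node.

```python
import os
import pwd
import time


def time_user_lookups(uid, repeats=5):
    """Return the duration in seconds of each repeated passwd lookup for uid."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            # Resolved through the system name service (files, LDAP, nscd, ...)
            pwd.getpwuid(uid)
        except KeyError:
            # Unknown uid: the lookup was still performed, so keep the timing
            pass
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    for i, seconds in enumerate(time_user_lookups(os.getuid()), start=1):
        print(f"lookup {i}: {seconds * 1000:.2f} ms")
```

Running this on a node before and after starting nscd gives a crude comparison; it does not replace looking at the load on the LDAP server itself.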
Since then, not a single stale job, not a single lost node,
not a single failure. I've run 30,000 jobs, and our researchers
keep stressing the cluster day and night with thousands
of jobs, they read and write terabytes of data, create,
open and close myriads of files, all works like a charm.
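For a rough sense of what this means: at the earlier rate of about 0.08%, 30,000 jobs would have produced about 24 stale jobs. The small sketch below does that arithmetic, and also estimates how likely zero stale jobs would be by pure luck. It treats the 0.08% as a per-job rate and assumes jobs fail independently of each other, which is a simplification and not something I have verified.

```python
stale_rate = 0.0008   # about 0.08% of jobs went stale before the change
jobs_run = 30_000     # jobs run since the change

# Number of stale jobs to expect if the old rate still applied
expected_stale = stale_rate * jobs_run

# Chance of seeing zero stale jobs at the old rate,
# assuming (as a simplification) independent failures
p_zero = (1.0 - stale_rate) ** jobs_run

print(f"expected stale jobs at the old rate: {expected_stale:.0f}")
print(f"chance of zero stale jobs by luck:   {p_zero:.1e}")
```

Under those assumptions the chance of a clean run by luck is vanishingly small, so the change really does seem to have fixed it.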
I can leave for my 5-day vacation with my mind at peace ;-)
Thank you for your help.