[GE users] can't get password entry for user XXXX ...
txema.heredia at upf.edu
Fri Sep 25 11:41:01 BST 2009
I have a problem with my sge6.1u4 cluster:
Twice this week, two of my hosts have started putting every job submitted to them into Error state, reporting this:
error reason 1: 09/25/2009 11:07:45 [0:30915]: can't get password entry for user "XXXXXXXXXX". Either the user does not exist or NIS error!
After that, the queue instance went into Error state.
I have searched for this problem, and the only answers I found were "You have a problem with your users/NIS/LDAP" or "restart sgeexecd on the host".
The error message is not accurate. I ssh'd to that host as that user and everything worked (user, password, home, ...), so I tried the other option: I stopped and restarted sgeexecd on that host. Now the jobs no longer go into Error state, but they terminate unexpectedly for no apparent reason.
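A single successful ssh login does not rule out an intermittent lookup failure. One way to check is to repeat the passwd lookup on the affected host and count how often it fails. The sketch below is only an illustration: it assumes the daemon resolves users through the standard system passwd lookup (which is what Python's `pwd.getpwnam` wraps), and the function name `count_lookup_failures` and its parameters are made up for this example.

```python
import pwd
import time


def count_lookup_failures(user, attempts=5, delay=1.0, lookup=pwd.getpwnam):
    """Look up `user` several times and return how many lookups failed.

    `lookup` defaults to pwd.getpwnam, which goes through the system's
    configured name services (files, NIS, LDAP, ...). It is a parameter
    only so that the function can be exercised with a stand-in.
    """
    failures = 0
    for i in range(attempts):
        try:
            lookup(user)
        except KeyError:
            # pwd.getpwnam raises KeyError when no passwd entry is found.
            failures += 1
        # Pause between attempts, but not after the last one.
        if delay and i < attempts - 1:
            time.sleep(delay)
    return failures
```

Run on the compute node itself, for example `count_lookup_failures("XXXX", attempts=60, delay=1.0)`, a non-zero result would point at the name service rather than at Grid Engine.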
This is the qacct output for one of those jobs:
failed 100 : assumedly after job
If I submit them with the "-m a" option, I get a mail like this:
Job 740641 (med-19) Aborted
Exit Status = 134
Signal = ABRT
User = XXXXXXXXX
Queue = test2-med at compute-0-4.local
Host = compute-0-4.local
Start Time = 09/25/2009 11:27:31
End Time = 09/25/2009 11:27:31
CPU = NA
Max vmem = NA
failed assumedly after job because:
job 740641.1 died through signal ABRT (6)
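The exit status and the signal in that mail agree with each other under the usual convention that a job killed by a signal is reported as 128 plus the signal number: 134 - 128 = 6, which is ABRT. A minimal sketch of that decoding, assuming the convention applies here (the helper name `describe_exit_status` is hypothetical):

```python
import signal


def describe_exit_status(status):
    """Return the signal name if `status` looks like 128 + signal number.

    Returns None for statuses at or below 128, or when the remainder
    is not a signal number known on this platform.
    """
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            return None
    return None


print(describe_exit_status(134))
```

So the job is being aborted (for example by an abort() call in the job or in a library it uses) almost immediately after it starts, which matches the identical start and end times.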
This same "can't get password" problem has happened several times in our cluster, but most of the time it resolved itself "magically" after a while. The last time it happened (last Tuesday), however, I had to remove the host from the host groups it belonged to and reinstall it (it is a Rocks cluster 5.0 host; just restarting the daemon didn't work) before it worked again.
Thanks in advance,