[GE users] can't get password entry for user XXXX ...

txema_heredia txema.heredia at upf.edu
Fri Sep 25 11:41:01 BST 2009

Hi folks!

I have a problem with my sge6.1u4 cluster:

Twice this week, two of my hosts have started putting every job submitted to them into Error state, reporting this:

error reason    1:          09/25/2009 11:07:45 [0:30915]: can't get password entry for user "XXXXXXXXXX". Either the user does not exist or NIS error!

After that, the queue instance went into Error state.

I have searched for this problem, and the only answers I found were "You have a problem with your users/NIS/LDAP" or "restart sgeexecd in the host".

The error message does not seem to be accurate: I ssh'd to that host as that user and everything worked (user, password, home, ...), so I tried the other option. I stopped and restarted sgeexecd on that host, and now the jobs no longer go into Error state, but they terminate unexpectedly for no apparent reason.
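A successful ssh login only shows that the account resolves at that moment in a login session, while the error message refers to the password-entry lookup itself. Below is a minimal sketch for repeating that lookup on the affected host to catch intermittent failures. It assumes Python is available on the node; the function name `check_passwd_lookup` and its `attempts`/`delay` parameters are hypothetical, and the script is not a reproduction of what sgeexecd does internally.

```python
import pwd
import time


def check_passwd_lookup(username, attempts=5, delay=1.0):
    """Look up `username` in the password database `attempts` times.

    Returns the number of lookups that failed. The lookup goes through
    the system's configured name services (local files, NIS, LDAP, ...).
    """
    failures = 0
    for i in range(attempts):
        try:
            pwd.getpwnam(username)  # raises KeyError if no entry is found
        except KeyError:
            failures += 1
        if i < attempts - 1:
            time.sleep(delay)  # pause between attempts, not after the last
    return failures
```

Running it as root on the compute node with the affected username and a larger `attempts` value would show whether the lookup fails only occasionally, which a single ssh test cannot reveal.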

This is the qacct output:

failed       100 : assumedly after job

and if I submit them with the "-m a" option, I get a mail like this:

Job 740641 (med-19) Aborted
 Exit Status      = 134
 Signal           = ABRT
 User             = XXXXXXXXX
 Queue            = test2-med@compute-0-4.local
 Host             = compute-0-4.local
 Start Time       = 09/25/2009 11:27:31
 End Time         = 09/25/2009 11:27:31
 CPU              = NA
 Max vmem         = NA
failed assumedly after job because:
job 740641.1 died through signal ABRT (6)
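The exit status and the signal in this mail are consistent with each other under the common shell convention that a process killed by signal N is reported with status 128 + N: 134 - 128 = 6, which is ABRT. A minimal sketch of that decoding follows; the helper name `describe_exit_status` is hypothetical, and the 128 + N rule is an assumption about how the status was reported here.

```python
import signal


def describe_exit_status(status):
    """Return the signal name for a shell-style exit status, else None.

    Statuses above 128 are interpreted as 128 + signal number.
    """
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            # No signal with that number on this platform.
            return None
    return None
```

Calling `describe_exit_status(134)` on the node decodes the status from the mail above.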

This same "can't get password entry" error has happened several times on our cluster, but most of the time it resolved itself "magically" after a short while. The last time it happened (last Tuesday), however, I had to remove the host from the host groups it belonged to and reinstall it (it is a Rocks cluster 5.0 host; just restarting the daemon didn't work) before it worked again.

Any suggestions?

Thanks in advance,


