[GE users] here is a strange one: submitting to a PE reliably takes out sge_schedd

elauzier elauzier2 at perlstar.com
Mon Jan 18 13:24:24 GMT 2010


We also saw this problem on our cluster in late 2009.  Chris was involved and witnessed the events.  Here is what we did to investigate:

1.  Submitted the job so that it would not dispatch until about 30 minutes later, to make sure the problem was not in dispatching.
This showed that the failure does occur during submission.

2.  Inspected the user's environment, had him clean up his env init scripts, and then had him log out completely and log back in.
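
For anyone who wants to repeat step 1, here is a rough sketch of a delayed submission. It relies on qsub's -a option, which takes a start time in [[CC]YY]MMDDhhmm[.SS] form. The helper name, the script name "job.sh", the PE name "mpi" and the slot count are placeholders of mine, not what we actually ran.

```python
from datetime import datetime, timedelta


def delayed_qsub_command(script, pe_name, slots, delay_minutes=30, now=None):
    """Build a qsub argv whose job is not eligible to run for `delay_minutes`.

    Hypothetical helper: it only builds the argument list, it does not submit.
    """
    now = now or datetime.now()
    start = now + timedelta(minutes=delay_minutes)
    # qsub -a expects [[CC]YY]MMDDhhmm[.SS]
    stamp = start.strftime("%Y%m%d%H%M.%S")
    return ["qsub", "-a", stamp, "-pe", pe_name, str(slots), script]


if __name__ == "__main__":
    # Placeholder arguments; substitute your own script and PE.
    print(" ".join(delayed_qsub_command("job.sh", "mpi", 8)))
```

If sge_schedd still goes down right after a submission like this, well before the start time, dispatching can be ruled out.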

The reason for (2) is that we have seen this in the LSF world as well, where it can be a parsing issue.  I suspected that something in the env caused the scheduler's parser to croak.
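
We never pinned down which variable was responsible, so the following is only a sketch of how one might scan an environment for values that could plausibly upset a parser. The checks (embedded newlines, control characters, very long values, bash-style exported functions) and the 4096-character threshold are my guesses, not a confirmed cause. It is mainly relevant if the job is submitted with qsub -V, which exports the whole submit environment.

```python
import os

MAX_VALUE_LEN = 4096  # arbitrary threshold, chosen only for illustration


def suspicious_env_vars(env=None, max_len=MAX_VALUE_LEN):
    """Return {name: [reasons]} for variables that look risky to parse.

    Hypothetical helper: the heuristics below are assumptions.
    """
    env = os.environ if env is None else env
    findings = {}
    for name, value in env.items():
        reasons = []
        if "\n" in value or "\r" in value:
            reasons.append("embedded newline")
        # Any other control character besides tab / newline / carriage return
        if any(ord(ch) < 32 and ch not in "\t\n\r" for ch in value):
            reasons.append("control character")
        if len(value) > max_len:
            reasons.append("very long value")
        # Some bash versions export functions as values starting with "()"
        if value.lstrip().startswith("()"):
            reasons.append("exported shell function")
        if reasons:
            findings[name] = reasons
    return findings


if __name__ == "__main__":
    for name, reasons in sorted(suspicious_env_vars().items()):
        print(f"{name}: {', '.join(reasons)}")
```

Running this in the affected user's login shell before and after cleaning up the init scripts would at least show what changed.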

After we cleaned up the env and had the user log out and back in again, the problem went away.

If the problem shows up again, we will request an instrumented binary for further debugging.

Ed Lauzier

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239522
