[GE users] occasional job failure - can't find user's home directory

cjf001 john.foley at motorola.com
Wed Oct 27 17:51:23 BST 2010


Hi everyone - I've been having a low-level issue for some time, and was
just discussing it with the users, so I thought I'd post it here to see
if anyone else has seen this:

Occasionally (according to my metrics, about 0.1% of the time!) a job
will be dispatched by SGE to the execution host(s) and will fail
immediately. The error in the qmaster messages file (and also emailed
to the administrator, me) is:

failed changing into working directory because:
         10/23/2010 10:07:47 [937:23846]: error: can't chdir to
         /users/cgtb87: No such file or directory

In other words, it couldn't change to the user's home directory. Well,
this is bogus, because the user's home directory is always available
via the automounter, so I'm guessing that there must be some kind of
timing issue: the sge_execd on the execution host goes to start the
process and, because it can't *immediately* find the user's home
directory, fails the job. The host it happens on and the user it
happens to are fairly random (that is, not always the same few).

The execution hosts (all the hosts, actually) are RHEL5.2. SGE is
version 6.2u5, running since mid June of this year.

So, a few questions for the group:

1) Has anyone else ever seen this? If so, did you ever track it down?

2) For those of you running a RHEL5 environment, do you use any
    special mount options for the automounter? As far as I can
    tell, we're using all the defaults here.

3) I doubt that there's any way to tell the sge process on the
    execution hosts to give the system a little more time to set up,
    but if anyone knows of something, I'm listening :)

4) Any thoughts on how to zero in on where, exactly, in the startup
    process the failure occurs?
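
On questions 3 and 4, one possible way to test the timing theory is a
small probe that polls for the home directory and reports how many
attempts and how much time it took to appear. The following is only a
sketch under assumptions: Python being available on the execution
hosts, the function name wait_for_directory, and the timeout and
interval values are all made up for illustration and are not part of
SGE. Whether such a probe could run early enough (for example from a
queue prolog) to matter for the failing chdir is something the post
does not establish.

```python
import os
import time


def wait_for_directory(path, timeout=10.0, interval=0.5):
    """Poll until `path` is a directory or `timeout` seconds elapse.

    Returns a tuple (found, attempts, elapsed_seconds).
    The name and default values are hypothetical, for illustration only.
    """
    start = time.monotonic()
    attempts = 0
    while True:
        attempts += 1
        # os.path.isdir() stats the path; on an automounted filesystem
        # that lookup is what would normally trigger the mount.
        if os.path.isdir(path):
            return True, attempts, time.monotonic() - start
        if time.monotonic() - start >= timeout:
            return False, attempts, time.monotonic() - start
        time.sleep(interval)


if __name__ == "__main__":
    # Probe the invoking user's home directory and print one line that
    # can be collected per host and per user.
    home = os.path.expanduser("~")
    found, attempts, elapsed = wait_for_directory(home)
    print("%s: found=%s attempts=%d elapsed=%.3fs"
          % (home, found, attempts, elapsed))
```

If the directory regularly shows up only on the second or later
attempt, that would support the idea of a mount delay; if it is always
there on the first attempt, the cause is probably elsewhere.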

    Thanks,

       John



-- 
###########################################################################
# John Foley                          # Location:  IL93-E1-21S            #
# IT & Systems Administration         # Maildrop:  IL93-E1-35O            #
# LV Simulation Cluster Support       #    Email: john.foley at motorola.com #
# Motorola, Inc. -  Mobile Devices    #    Phone: (847) 523-8719          #
# 600 North US Highway 45             #      Fax: (847) 523-5767          #
# Libertyville, IL. 60048  (USA)      #     Cell: (847) 460-8719          #
###########################################################################
               (this email sent using SeaMonkey on Windows)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=290485
