[GE users] occasional job failure - can't find user's home directory

russray rray at semtech.com
Wed Oct 27 19:38:19 BST 2010


One thing I have done at times is to stick a
ls $SGE_O_WORKDIR > /dev/null
at the beginning of a script to force the mount.  The output just goes to /dev/null, but that way the automounter has done the mount before I try to cd to it or do anything else.
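
If a single ls turns out not to be enough, the same idea can be wrapped in a small retry loop. This is only a rough sketch, not something I have tested on your setup: wait_for_dir is a made-up helper name, and the 5 tries and 2 second delay are arbitrary guesses. It can also only help once the script is actually running.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): poke a directory until the
# automounter has mounted it, or give up after a few tries.
# Usage: wait_for_dir DIR [TRIES] [DELAY_SECONDS]
wait_for_dir() {
    dir=$1
    tries=${2:-5}     # assumed default, tune for your site
    delay=${3:-2}     # assumed default, in seconds
    i=1
    while [ "$i" -le "$tries" ]; do
        # Listing the directory is what triggers the automount;
        # the output itself is thrown away.
        if ls "$dir/" > /dev/null 2>&1; then
            return 0
        fi
        # Only sleep if another attempt is coming.
        if [ "$i" -lt "$tries" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example use at the top of a job script:
#   wait_for_dir "$SGE_O_WORKDIR" || { echo "workdir not mounted" >&2; exit 1; }
#   cd "$SGE_O_WORKDIR"
```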

You mention that you automount two different NetApps onto the same mount point. It could be that you see the problem when fileserver A is mounted and the automounter then has to unmount it and mount fileserver B.  That extra delay might be enough to cause the issue. Maybe not, just a thought.
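
One way to check that guess would be to log which fileserver is backing the path when a job starts, and compare that against the jobs that fail. A minimal sketch, assuming df -P is available on the execution hosts; mount_source is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the device or server:/export that currently
# backs a path, taken from the first column of POSIX-format df output.
mount_source() {
    # -P keeps each filesystem on one line, so line 2 is the entry we want.
    df -P "$1" 2>/dev/null | awk 'NR == 2 { print $1 }'
}

# Example use in a job script (goes to the job's stderr file):
#   echo "$(date) $(hostname): $HOME is served by $(mount_source "$HOME")" >&2
```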

cjf001 <john.foley at motorola.com> wrote on 10/27/2010 12:51:23 PM:

> Hi everyone - been having a low-level issue for some time, and was just
> discussing it with the users, so I thought I'd post it here to see if
> anyone else has seen this -
>
> Occasionally (according to my metrics, about 0.1% of the time !) a job
> will be dispatched by SGE to the execution host(s) and will fail
> immediately. The error in the qmaster messages file (and also emailed
> to the administrator, me) is
>
> failed changing into working directory because:
>          10/23/2010 10:07:47 [937:23846]: error: can't chdir to
>          /users/cgtb87: No such file or directory
>
> In other words, it couldn't change to the user's home directory. Well,
> this is bogus, because the user's home directory is always available,
> via the automounter, so I'm guessing that there must be some
> kind of timing issue, where the sge_execd on the execution host
> goes to start the process, and because it can't *immediately* find
> the user's home directory, fails the job. The host it happens on,
> and the user it happens to, is fairly random (not always just one
> of a few, that is).
>
> The execution hosts (all the hosts, actually) are RHEL5.2. SGE is
> version 6.2u5, running since mid June of this year.
>
> So, a couple of questions for the group :
>
> 1) anyone else ever see this ?  If so, ever track it down ?
>
> 2) for those of you running a RHEL5 environment, do you use any
>     special mount options for the automounter ? As far as I can
>     tell, we're using all the defaults here.
>
> 3) I doubt that there's any way to tell the sge process on the
>     execution hosts to give the system a little more time to setup,
>     but if anyone knows of something I'm listening :)
>
> 4) any thoughts on how to zero in on where, exactly, in the startup
>     process the failure occurs ?
>
>     Thanks,
>
>        John
>
>
>
> --
> ###########################################################################
> # John Foley                          # Location:  IL93-E1-21S            #
> # IT & Systems Administration         # Maildrop:  IL93-E1-35O            #
> # LV Simulation Cluster Support       #    Email: john.foley at motorola.com #
> # Motorola, Inc. -  Mobile Devices    #    Phone: (847) 523-8719          #
> # 600 North US Highway 45             #      Fax: (847) 523-5767          #
> # Libertyville, IL. 60048  (USA)      #     Cell: (847) 460-8719          #
> ###########################################################################
>                (this email sent using SeaMonkey on Windows)


