[GE users] occasional job failure - can't find user's home directory

cjf001 john.foley at motorola.com
Wed Oct 27 19:43:45 BST 2010


OK, thanks. Which script did you put that in: the prolog, or
just the script that actually runs the app?

    John


russray wrote:
>
> One thing I have done at times is to stick a
> ls $SGE_O_WORKDIR > /dev/null
> at the beginning of a script to force the mount. Output just goes to
> /dev/null but that way the automounter has done the mount before I try
> to cd to it or anything else.
>
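The "ls to force the mount" idea above could be wrapped in a small retry
loop, so a slow mount gets a few chances before the script gives up. This
is only a sketch: the function name wait_for_dir, the retry count, and the
delay are illustrative assumptions, not anything built into SGE or the
automounter.

```shell
#!/bin/sh
# wait_for_dir DIR [TRIES] [DELAY]
# Hypothetical helper: list DIR until it is accessible or TRIES run out.
# Returns 0 once the directory can be listed, 1 otherwise.
wait_for_dir() {
    dir=$1
    tries=${2:-5}    # assumed default: 5 attempts
    delay=${3:-1}    # assumed default: 1 second between attempts
    i=1
    while [ "$i" -le "$tries" ]; do
        # Listing the directory makes the automounter attempt the mount.
        if ls "$dir" > /dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Possible use at the top of a job script:
# wait_for_dir "$SGE_O_WORKDIR" || { echo "cannot access $SGE_O_WORKDIR" >&2; exit 1; }
```

Whether this runs early enough to help depends on which script it goes in,
which is the open question in this thread.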
> You mention that you automount two different NetApps onto the same mount
> point. It could be that you see the problem when fileserver A is mounted
> and the automounter has to unmount it and then mount fileserver B. That
> extra delay might be enough to cause the issue. Maybe not, just a thought.
>
> cjf001 <john.foley at motorola.com> wrote on 10/27/2010 12:51:23 PM:
>
>  > Hi everyone - been having a low-level issue for some time, and was just
>  > discussing it with the users, so I thought I'd post it here to see if
>  > anyone else has seen this -
>  >
>  > Occasionally (according to my metrics, about 0.1% of the time!) a job
>  > will be dispatched by SGE to the execution host(s) and will fail
>  > immediately. The error in the qmaster messages file (and also emailed
>  > to the administrator, me) is
>  >
>  > failed changing into working directory because:
>  > 10/23/2010 10:07:47 [937:23846]: error: can't chdir to /users/cgtb87: No such file or directory
>  >
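To see which directories fail and how often, the error text quoted above
can be tallied from a messages file. This is a rough sketch: it assumes
each error sits on one line containing "can't chdir to DIR:" as in the
excerpt, and the function name count_chdir_failures is made up for
illustration. The path to the messages file is site-specific and is
passed in as an argument.

```shell
#!/bin/sh
# count_chdir_failures FILE
# Hypothetical helper: tally "can't chdir to DIR" errors per directory,
# most frequent first. Assumes one error per line, as in the excerpt.
count_chdir_failures() {
    grep "can't chdir to" "$1" |
        sed "s/.*can't chdir to \([^:]*\):.*/\1/" |
        sort | uniq -c | sort -rn
}

# Possible use (the path here is a placeholder, not a real location):
# count_chdir_failures /path/to/qmaster/messages
```

Each output line is a count followed by the directory that could not be
entered, which may help show whether one fileserver or one set of users
is over-represented.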
>  > In other words, it couldn't change to the user's home directory. Well,
>  > this is bogus, because the user's home directory is always available
>  > via the automounter. So I'm guessing that there must be some kind of
>  > timing issue: the sge_execd on the execution host goes to start the
>  > process and, because it can't *immediately* find the user's home
>  > directory, fails the job. The host it happens on and the user it
>  > happens to are fairly random (that is, not always the same few).
>  >
>  > The execution hosts (all the hosts, actually) are RHEL5.2. SGE is
>  > version 6.2u5, running since mid June of this year.
>  >
>  > So, a few questions for the group:
>  >
>  > 1) Has anyone else ever seen this? If so, did you ever track it down?
>  >
>  > 2) For those of you running a RHEL5 environment, do you use any
>  > special mount options for the automounter? As far as I can
>  > tell, we're using all the defaults here.
>  >
>  > 3) I doubt that there's any way to tell the sge process on the
>  > execution hosts to give the system a little more time to set up,
>  > but if anyone knows of something, I'm listening :)
>  >
>  > 4) Any thoughts on how to zero in on where, exactly, in the startup
>  > process the failure occurs?
>  >
>  > Thanks,
>  >
>  > John
>  >
>  > ------------------------------------------------------
>  > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=290485



-- 
###########################################################################
# John Foley                          # Location:  IL93-E1-21S            #
# IT & Systems Administration         # Maildrop:  IL93-E1-35O            #
# LV Simulation Cluster Support       #    Email: john.foley at motorola.com #
# Motorola, Inc. -  Mobile Devices    #    Phone: (847) 523-8719          #
# 600 North US Highway 45             #      Fax: (847) 523-5767          #
# Libertyville, IL. 60048  (USA)      #     Cell: (847) 460-8719          #
###########################################################################
               (this email sent using SeaMonkey on Windows)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=290536

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


