[GE users] specific host, specific user

Paolo Supino paolo.supino at gmail.com
Thu Oct 30 15:07:29 GMT 2008


Hi Chris

  Thank you for the reply ...
I forgot to mention that this user was able to submit jobs to all nodes 
in the past. The issue popped up about a week ago. As for the points 
you raised:
1. All compute nodes were installed using a single kickstart environment 
and are all the same.
2. Though users are defined locally (the external network is problematic 
and unreliable), all of them were created by a script run from the master 
and hence all UID/GIDs are the same.
3. The user exists on all the hosts (checked).
4. All other users run successfully on the specific node and the user 
with the problem runs fine on all other nodes too.
5. $HOME is on a central storage system (Netapp filer). He is able to 
read and write to it.
6. Nothing appears in /tmp ...
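
Point 2 (identical UID/GIDs everywhere) can be double-checked mechanically 
rather than by eye. Below is a minimal sketch in Python, assuming the output 
of `getent passwd <user>` has already been collected from each host by some 
means (over ssh, for example). The host names, the user "jdoe" and the sample 
lines are made up for illustration and are not from my cluster:

```python
def parse_passwd_line(line):
    """Split one passwd(5)-format line into the fields we care about."""
    fields = line.strip().split(":")
    if len(fields) < 7:
        raise ValueError(f"not a passwd entry: {line!r}")
    return {
        "name": fields[0],
        "uid": int(fields[2]),
        "gid": int(fields[3]),
        "home": fields[5],
        "shell": fields[6],
    }


def find_mismatches(entries_by_host, reference_host):
    """Return {host: {field: (reference_value, host_value)}} for hosts that differ."""
    reference = parse_passwd_line(entries_by_host[reference_host])
    mismatches = {}
    for host, line in entries_by_host.items():
        if host == reference_host:
            continue
        entry = parse_passwd_line(line)
        diff = {
            key: (reference[key], entry[key])
            for key in reference
            if reference[key] != entry[key]
        }
        if diff:
            mismatches[host] = diff
    return mismatches


# Hypothetical sample data: one line per host, as printed by `getent passwd jdoe`.
sample = {
    "master": "jdoe:x:1005:100:John Doe:/home/jdoe:/bin/bash",
    "node1": "jdoe:x:1005:100:John Doe:/home/jdoe:/bin/bash",
    "node2": "jdoe:x:1005:100:John Doe:/home/jdoe:/bin/tcsh",
}

if __name__ == "__main__":
    for host, diff in find_mismatches(sample, "master").items():
        print(host, diff)
```

With the made-up data above, only node2 is reported, because its login shell 
differs from the master's entry. An empty result would support point 2.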


  A couple of things I noticed since I sent the email:
1. The sge_execd daemon was consuming a lot of resources. I don't know 
whether that was because of the stuck job from the specific user or 
something else. I restarted it and it seems OK now.
2. The specific user was having problems accessing the compute nodes with 
ssh public key authentication. I fixed that, but it didn't solve the problem.

  I still have to check whether /var/log contains anything 
interesting ...
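
For that check, here is a small sketch that prints the lines mentioning a 
given user from a set of log files. The file list and the user name "jdoe" 
are placeholders and not taken from my system:

```python
from pathlib import Path


def lines_mentioning(path, needle):
    """Yield (line_number, text) for each line of `path` that contains `needle`."""
    with open(path, "r", errors="replace") as handle:
        for number, text in enumerate(handle, start=1):
            if needle in text:
                yield number, text.rstrip("\n")


def scan_logs(paths, needle):
    """Collect matches per file; files that cannot be opened are skipped."""
    results = {}
    for path in paths:
        try:
            matches = list(lines_mentioning(path, needle))
        except OSError:
            continue
        if matches:
            results[str(path)] = matches
    return results


if __name__ == "__main__":
    # Placeholder values: adjust the file list and the user name.
    candidates = [Path("/var/log/messages")]
    for name, matches in scan_logs(candidates, "jdoe").items():
        for number, text in matches:
            print(f"{name}:{number}: {text}")
```

The same function could be pointed at the execd messages file in the host's 
spool directory, whose location depends on the installation.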






--
TIA
Paolo






Chris Dagdigian wrote:
>
> If the job runs successfully on all other hosts I'd look first at the 
> host where the job fails to see "what is different" about it.
>
> Common reasons could be:
>
> - different file system permissions on that host
> - different UID/GID mappings (NIS,LDAP issue or /etc/passwd|group are 
> out of date, setuid or squash bits set on filesystems), SELINUX, etc.
> - user does not exist on that host (that would trigger a queue 
> instance E state though)
> - missing application dependencies (libraries, modules)
> - can't read or write to the location where the input and output files 
> are meant to go
>
> The best debug information is the standard output and standard error 
> from the job itself. If the job produces no output then look in the 
> execd messages file in the spool directory for that particular host to 
> see what may be wrong. May also help to check /var/log/messages or 
> equiv and especially check in "/tmp" as that is the SGE panic log 
> location of last resort.
>
> -Chris
>
>
>
> On Oct 30, 2008, at 10:26 AM, Paolo Supino wrote:
>
>> Hi
>>
>> I'm running a small grid (1 master + 2 compute nodes) with SGE 6.2 and 
>> I'm experiencing the following issue: when one specific user submits 
>> a job and it is sent to a specific host (the same host every time), 
>> the job hangs and never finishes running. What do I need to look for 
>> in order to find where the problem is and resolve it?
>>
>>
>>
>>
>>
>> -- 
>> TIA
>> Paolo
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>






More information about the gridengine-users mailing list