[GE users] specific host, specific user

Paolo Supino paolo.supino at gmail.com
Thu Oct 30 15:11:30 GMT 2008


Hi Chris

 Here is a snippet from SGE's messages file:

10/05/2008 13:04:31|  main|mc04n012|W|reaping job "755233" ptf complains: Job does not exist
10/09/2008 09:15:13|  main|mc04n012|W|reaping job "773202" ptf complains: Job does not exist
10/30/2008 16:23:21|  main|mc04n012|E|shepherd of job 878119.1 died through signal = 15
10/30/2008 16:23:21|  main|mc04n012|E|abnormal termination of shepherd for job 878119.1: "exit_status" file is empty
10/30/2008 16:23:21|  main|mc04n012|E|can't open usage file "active_jobs/878119.1/usage" for job 878119.1: No such file or directory
10/30/2008 16:23:21|  main|mc04n012|E|shepherd exited with exit status 19: before writing exit_status
10/30/2008 16:23:21|  main|mc04n012|I|controlled shutdown 6.2
10/30/2008 16:23:36|  main|mc04n012|I|starting up SGE 6.2 (lx24-amd64)
10/30/2008 16:23:36|  main|mc04n012|E|abnormal termination of shepherd for job 878119.1: "exit_status" file is empty
10/30/2008 16:23:36|  main|mc04n012|E|can't open usage file "active_jobs/878119.1/usage" for job 878119.1: No such file or directory
10/30/2008 16:23:36|  main|mc04n012|E|shepherd exited with exit status 19: before writing exit_status

The entries from line 3 onward are from when I shut down SGE's execd. What
is it complaining about?
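
For anyone sifting through a longer messages file, here is a minimal Python
sketch that groups warning and error entries by job ID. It assumes the field
layout seen in the snippet above (timestamp|component|host|level|message) and
that each entry sits on a single line in the file itself; the function names
and the job-ID pattern are my own illustration, not part of SGE.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the snippet: timestamp|component|host|level|message
# Hypothetical pattern: matches 'job 878119.1' as well as 'job "755233"'.
JOB_RE = re.compile(r'job "?(\d+(?:\.\d+)?)"?')


def parse_line(line):
    """Split one messages-file line into named fields, or return None."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the assumed five-field format
    timestamp, component, host, level, message = (p.strip() for p in parts)
    return {
        "timestamp": timestamp,
        "component": component,
        "host": host,
        "level": level,
        "message": message,
    }


def problems_by_job(lines, levels=("E", "W")):
    """Group entries at the given levels by the job ID found in the message."""
    grouped = defaultdict(list)
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry["level"] not in levels:
            continue
        match = JOB_RE.search(entry["message"])
        key = match.group(1) if match else "(no job id)"
        grouped[key].append(entry)
    return dict(grouped)


SAMPLE = """\
10/09/2008 09:15:13|  main|mc04n012|W|reaping job "773202" ptf complains: Job does not exist
10/30/2008 16:23:21|  main|mc04n012|E|shepherd of job 878119.1 died through signal = 15
10/30/2008 16:23:21|  main|mc04n012|I|controlled shutdown 6.2
"""

if __name__ == "__main__":
    for job_id, entries in problems_by_job(SAMPLE.splitlines()).items():
        for entry in entries:
            print(job_id, entry["timestamp"], entry["level"], entry["message"])
```

To run it against a real file, pass an open file object to problems_by_job()
instead of SAMPLE.splitlines().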



--
TIA
Paolo



Chris Dagdigian wrote:
>
> If the job runs successfully on all other hosts I'd look first at the 
> host where the job fails to see "what is different" about it.
>
> Common reasons could be:
>
> - different file system permissions on that host
> - different UID/GID mappings (NIS, LDAP issue or /etc/passwd|group are 
> out of date, setuid or squash bits set on filesystems), SELinux, etc.
> - user does not exist on that host (that would trigger a queue 
> instance E state though)
> - missing application dependencies (libraries, modules)
> - can't read or write to the location where the input and output files 
> are meant to go
>
> The best debug information is the standard output and standard error 
> from the job itself. If the job produces no output then look in the 
> execd messages file in the spool directory for that particular host to 
> see what may be wrong. May also help to check /var/log/messages or 
> equiv and especially check in "/tmp" as that is the SGE panic log 
> location of last resort.
>
> -Chris
>
>
>
> On Oct 30, 2008, at 10:26 AM, Paolo Supino wrote:
>
>> Hi
>>
>> I'm running a small grid (1 master + 2 compute nodes) with SGE 6.2 and 
>> I'm experiencing the following issue: whenever one specific user 
>> submits a job and it is sent to one specific host (the same host every 
>> time), the job hangs and never finishes running. What do I need to look 
>> for in order to find where the problem is and resolve it?
>>
>>
>>
>>
>>
>> -- 
>> TIA
>> Paolo
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
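
Several items on the checklist quoted above (UID/GID mappings, a user missing
on one host, an unreadable or unwritable output location) can be compared
across hosts with a small script. The following is only a sketch using
Python's standard pwd, grp and os modules; the function names are made up for
illustration and nothing here is SGE-specific. Note that os.access() tests
the permissions of whoever runs the script, so it should be run as the
affected user.

```python
import grp
import os
import pwd


def describe_user(username):
    """Return the account details this host resolves for username, or None."""
    try:
        entry = pwd.getpwnam(username)
    except KeyError:
        return None  # the user does not exist on this host
    # Primary group plus any supplementary groups that list the user as a member.
    group_ids = {entry.pw_gid}
    group_ids.update(g.gr_gid for g in grp.getgrall() if username in g.gr_mem)
    return {
        "uid": entry.pw_uid,
        "gid": entry.pw_gid,
        "groups": sorted(group_ids),
        "home": entry.pw_dir,
        "shell": entry.pw_shell,
    }


def check_directory(path):
    """Report whether the current user can read and write in path."""
    return {
        "exists": os.path.isdir(path),
        "readable": os.access(path, os.R_OK | os.X_OK),
        "writable": os.access(path, os.W_OK | os.X_OK),
    }


def host_report(username, path):
    """Bundle both checks so the output of two hosts can be diffed."""
    return {
        "host": os.uname().nodename,
        "user": describe_user(username),
        "directory": check_directory(path),
    }
```

Printing host_report(user, output_dir) on the failing node and on a working
node and comparing the two results should show quickly whether the account or
the output location differs between them.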

