[GE users] Job causes queues to go into error state

reuti reuti at staff.uni-marburg.de
Sat Dec 6 11:48:09 GMT 2008


Hi,

On 06.12.2008 at 01:02, Iwona Sakrejda wrote:

> We believe that we understand the problem. A user was using the -V
> option with qsub, so his environment was copied wholesale. It was a
> huge environment, so it took us a while to notice that it included
> LD_ASSUME_KERNEL=2.2.5. That variable was responsible for the prolog
> and/or the shepherd failing.
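
A quick way to spot this kind of variable before submitting with -V is to filter the environment for dynamic-loader settings. This is only a sketch: the helper name `find_loader_vars` is made up for illustration, and it assumes the troublesome variables all start with `LD_`.

```shell
#!/bin/sh
# Hypothetical helper: read NAME=value lines on stdin and print only the
# dynamic-loader variables (LD_*), which can break programs on the exec host.
find_loader_vars() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E '^LD_[A-Za-z0-9_]*=' || true
}

# Typical use in the submitting shell, before running qsub -V:
env | find_loader_vars
```

If something like LD_ASSUME_KERNEL shows up, it may be safer to drop -V and pass only the variables the job needs with `qsub -v NAME`.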
>
> So I think that now I understand the messages I was receiving,
>
> 12/03/2008 14:27:51 [171:13726]: prolog exited not due to signal
> 12/03/2008 14:27:51 [171:13726]: prolog exited with status 127
> 12/03/2008 14:27:51 [171:13726]: exit_status of prolog = 127
>
> What does "prolog exited not due to signal" mean?

it's the interpretation of the return code. When a process is ended
by a signal, the code is usually 128 + <number of the signal>. So a
137 means SIGKILL (9), 139 means SIGSEGV (11), and so on. If it's
lower than 128, it's just the exit code of the program.
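
As a rough illustration of that rule, here is a small sketch in sh. The function name `describe_status` is made up (it is not something SGE provides), and the decoding of the raw wait3 status assumes the usual Linux layout: exit code in the high byte, signal number in the low 7 bits.

```shell
#!/bin/sh
# Hypothetical helper: interpret an exit code as a shell reports it in $?.
describe_status() {
    code=$1
    if [ "$code" -gt 128 ]; then
        # By shell convention, 128 + N means "terminated by signal N".
        echo "killed by signal $((code - 128))"
    else
        echo "exited with status $code"
    fi
}

# The raw status from the shepherd log (32512) decodes to exit code 127
# and signal 0, which matches "WIFSIGNALED: 0, WEXITSTATUS: 127".
raw=32512
echo "exit code: $(( (raw >> 8) & 255 )), signal: $(( raw & 127 ))"

describe_status 137   # killed by signal 9
describe_status 127   # exited with status 127
```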

> Does it mean it crashed?

Not necessarily, it just exited with status 127. For bash this means
"command not found":

http://tldp.org/LDP/abs/html/exitcodes.html

(or see "EXIT STATUS" in `man bash`)
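
You can reproduce the 127 by hand. A minimal sketch, assuming `no-such-command-xyz` (a made-up name) does not exist anywhere on the PATH:

```shell
#!/bin/sh
# Run a command that does not exist in a child shell and report the
# exit status that the child shell returns.
status_of_missing_command() {
    sh -c 'no-such-command-xyz' >/dev/null 2>&1
    echo $?
}

status_of_missing_command   # prints 127
```

As far as I know, the dynamic loader also exits with 127 when a program's shared libraries cannot be loaded, which would fit the LD_ASSUME_KERNEL explanation.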


-- Reuti


> In principle my problem is solved, but I have to say that all those
> messages were very cryptic and not much help in sorting things out....
>
> Iwona
>
>
>
>
>
> On 12/4/08 8:28 AM, reuti wrote:
>>
>> On 04.12.2008 at 17:09, Iwona Sakrejda wrote:
>>>
>>> Hi,
>>>
>>> As what user does the prolog script run? Just want to double-check...
>>> The user running the job, sgeadmin or root?
>>
>> Hi,
>>
>> by default it will run as the user running the job. You can change
>> this by prefixing the script path with "user@".
>>
>> -- Reuti
>>>
>>> I am confusing myself...
>>>
>>> Iwona
>>>
>>> On 12/4/08 7:59 AM, Iwona Sakrejda wrote:
>>>>
>>>> Hi
>>>>
>>>> The user is known on the host and the prolog is quite simple:
>>>>
>>>> #!/bin/sh
>>>> GPFSLOGFILE="/var/adm/ras/mmfs.log.latest"
>>>> # Check whether scratch is there
>>>> if [ ! -d "/chos/local/scratch" ] ; then
>>>>   echo "Missing scratch on `hostname`" |/bin/mail -s "queue in error state on `hostname`" abc at abc.com
>>>>   exit 2
>>>> else
>>>>   perm=`ls -ld /chos/local/scratch|awk '{print $1}'`
>>>>   if [ "$perm" != "drwxrwxrwt" ] ; then
>>>>     echo "Permissions wrong on scratch on `hostname`" |/bin/mail -s "queue in error state on `hostname`" abc at abc.com
>>>>     exit 2
>>>>   fi
>>>>   mop=`cat /proc/mounts | /bin/grep /chos/local/scratch|awk '{print $4}'`
>>>>   if [ "$mop" != "rw" ] ; then
>>>>     echo "Scratch is mounted read only on `hostname`" |/bin/mail -s "queue in error state on `hostname`" abc at abc.com
>>>>     exit 2
>>>>   fi
>>>> fi
>>>>
>>>> And I am not getting any e-mails from the prolog. In the past
>>>> and when I tested now by unmounting scratch on a node I get
>>>> those e-mails. Actually this morning another user joined the
>>>> crowd and I see this problem with his account. Anyway seems to
>>>> me that there is a different kind of problem for prolog other
>>>> than exit on error...
>>>>
>>>> Thank you...
>>>> Iwona
>>>>
>>>> On 12/4/08 3:12 AM, reuti wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 04.12.2008 at 00:05, Iwona Sakrejda wrote:
>>>>>>
>>>>>> I have this one user whose jobs are flushing through hosts and
>>>>>> pushing queues into error state. I cannot figure it out. Here
>>>>>> is a snippet from an e-mail generated by such a job.
>>>>>>
>>>>>> 12/03/2008 14:27:51 [171:13726]: wait3 returned 13727 (status: 32512; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 127)
>>>>>> 12/03/2008 14:27:51 [171:13726]: prolog exited with exit status 127
>>>>>> 12/03/2008 14:27:51 [171:13726]: reaped "prolog" with pid 13727
>>>>>> 12/03/2008 14:27:51 [171:13726]: prolog exited not due to signal
>>>>>> 12/03/2008 14:27:51 [171:13726]: prolog exited with status 127
>>>>>> 12/03/2008 14:27:51 [171:13726]: exit_status of prolog = 127
>>>>>> 12/03/2008 14:27:51 [171:13726]: no epilog script to start
>>>>>>
>>>>>> Other jobs are running happily on those nodes. Could you suggest
>>>>>> where to start looking for the cause?
>>>>> is the user known on the hosts? What's the prolog doing?
>>>>>
>>>>> -- Reuti
>>>>>>
>>>>>> Thanks a lot,
>>>>>> iwona
>>>>>>
>>>>>> ------------------------------------------------------
>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=90973
>>>>>>
>>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>> ------------------------------------------------------
>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91090
>>>>>
>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91168
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91518

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


