[GE users] Job causes queues to go into error state

Iwona Sakrejda isakrejda at lbl.gov
Sat Dec 6 00:02:06 GMT 2008


We believe that we understand the problem. A user was using the -V
option with qsub, so his environment was copied wholesale. It was a huge
environment, so it took us a while to notice that among the variables was
LD_ASSUME_KERNEL=2.2.5. It was responsible for the prolog and/or the
shepherd failing.
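
A minimal sketch of a check that could be run before submitting with
qsub -V, assuming a POSIX shell. The function name check_submit_env and
the list of variable names are hypothetical and only for illustration;
LD_ASSUME_KERNEL is the only variable actually implicated in this thread.

```shell
#!/bin/sh
# Hypothetical list of variables that are risky to forward with qsub -V.
# Only LD_ASSUME_KERNEL comes from this thread; LD_PRELOAD is an assumption.
risky_vars="LD_ASSUME_KERNEL LD_PRELOAD"

# Print a warning for every listed variable that is set in the current
# environment. Returns 0 if none is set, 1 otherwise.
check_submit_env() {
    found=0
    for name in $risky_vars; do
        # eval is acceptable here because $name comes from the fixed list above
        eval "isset=\${$name+yes}"
        eval "value=\${$name-}"
        if [ -n "$isset" ]; then
            echo "warning: $name=$value would be copied to the job by qsub -V" >&2
            found=1
        fi
    done
    return $found
}
```

It could be used as "check_submit_env && qsub -V job.sh" (job.sh being a
placeholder), or the variable can be dropped for one submission with
"( unset LD_ASSUME_KERNEL; qsub -V job.sh )".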

So I think that I now understand the messages I was receiving:

12/03/2008 14:27:51 [171:13726]: prolog exited not due to signal
12/03/2008 14:27:51 [171:13726]: prolog exited with status 127
12/03/2008 14:27:51 [171:13726]: exit_status of prolog = 127

What does "prolog exited not due to signal" mean?
Does it mean it crashed?
In principle my problem is solved, but I have to say that all those
messages were very cryptic and not much help in sorting things out...
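
For illustration, the raw wait status from the shepherd trace quoted
further down (status: 32512; WIFSIGNALED: 0, WIFEXITED: 1,
WEXITSTATUS: 127) can be decoded by hand. The sketch below assumes the
conventional Linux encoding of the wait status (exit code in bits 8-15,
terminating signal in the low 7 bits); decode_wait_status is a
hypothetical helper name. "Exited not due to signal" then just means the
process was not killed by a signal but ended on its own with an exit
status. Status 127 is the value shells conventionally use for "command
not found", and it is typically also what is returned when a program
cannot be loaded at all.

```shell
#!/bin/sh
# Decode a raw wait status roughly the way WIFSIGNALED/WEXITSTATUS do.
# Assumption: conventional Linux layout of the status word.
# Stopped/continued states are deliberately not handled in this sketch.
decode_wait_status() {
    status=$1
    sig=$(( status & 127 ))           # low 7 bits: terminating signal, 0 if none
    code=$(( (status >> 8) & 255 ))   # bits 8-15: exit status
    if [ "$sig" -eq 0 ]; then
        echo "exited normally, exit status $code"
    else
        echo "killed by signal $sig"
    fi
}

decode_wait_status 32512
```

For 32512 (127 * 256) the signal bits are zero and the exit status is
127, which matches the WEXITSTATUS value in the trace.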

Iwona




 
On 12/4/08 8:28 AM, reuti wrote:
> On 04.12.2008 at 17:09, Iwona Sakrejda wrote:
>
>   
>> Hi,
>>
>> As what user does the prolog script run? Just want to double-check...
>> The user running the job, sgeadmin or root?
>>     
>
> Hi,
>
> By default it will run under the user running the job. You can change
> this by prefixing the script path with "user@".
>
> -- Reuti
>
>
>   
>> I  am confusing myself...
>>
>> Iwona
>>
>>
>> On 12/4/08 7:59 AM, Iwona Sakrejda wrote:
>>     
>>> Hi
>>>
>>> The user is known on the host and the prolog is quite simple:
>>> #!/bin/sh
>>> GPFSLOGFILE="/var/adm/ras/mmfs.log.latest"
>>> # Check whether scratch is there
>>>
>>> if [ ! -d "/chos/local/scratch" ] ; then
>>>    echo "Missing scratch on `hostname`" | /bin/mail -s "queue in error state on `hostname`" abc@abc.com
>>>    exit 2
>>>  else
>>>    perm=`ls -ld /chos/local/scratch|awk '{print $1}'`
>>>
>>>    if [ "$perm" != "drwxrwxrwt" ] ; then
>>>       echo "Permissions wrong on scratch on `hostname`" | /bin/mail -s "queue in error state on `hostname`" abc@abc.com
>>>       exit 2
>>>    fi
>>>
>>>    mop=`cat /proc/mounts | /bin/grep /chos/local/scratch | awk '{print $4}'`
>>>    if [ "$mop" != "rw" ] ; then
>>>       echo "Scratch is mounted read only on `hostname`" | /bin/mail -s "queue in error state on `hostname`" abc@abc.com
>>>       exit 2
>>>    fi
>>>  fi
>>>
>>> And I am not getting any e-mails from the prolog. In the past, and
>>> when I tested just now by unmounting scratch on a node, I did get
>>> those e-mails. Actually, this morning another user joined the
>>> crowd and I see this problem with his account too.
>>>
>>> Anyway, it seems to me that the prolog has a different kind of
>>> problem, something other than exiting on error...
>>>
>>> Thank you...
>>>
>>> Iwona
>>>
>>> On 12/4/08 3:12 AM, reuti wrote:
>>>       
>>>> Hi,
>>>>
>>>> On 04.12.2008 at 00:05, Iwona Sakrejda wrote:
>>>>         
>>>>> I have this one user whose jobs are flushing through hosts and
>>>>> pushing queues into error state. I cannot figure it out. Here is
>>>>> a snippet from an e-mail generated by such a job.
>>>>>
>>>>> 12/03/2008 14:27:51 [171:13726]: wait3 returned 13727 (status: 32512; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 127)
>>>>> 12/03/2008 14:27:51 [171:13726]: prolog exited with exit status 127
>>>>> 12/03/2008 14:27:51 [171:13726]: reaped "prolog" with pid 13727
>>>>> 12/03/2008 14:27:51 [171:13726]: prolog exited not due to signal
>>>>> 12/03/2008 14:27:51 [171:13726]: prolog exited with status 127
>>>>> 12/03/2008 14:27:51 [171:13726]: exit_status of prolog = 127
>>>>> 12/03/2008 14:27:51 [171:13726]: no epilog script to start
>>>>>
>>>>> Other jobs are running happily on those nodes. Could you suggest
>>>>> where to start looking for the cause?
>>>>>           
>>>> Is the user known on the hosts? What's the prolog doing?
>>>>
>>>> -- Reuti
>>>>         
>>>>> Thanks a lot,
>>>>> iwona
>>>>>
>>>>> ------------------------------------------------------
>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=90973
>>>>>
>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>>           
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91090
>>>>
>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>         
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91168
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91454

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


