[GE users] Job causes queues to go into error state

reuti reuti at staff.uni-marburg.de
Thu Dec 4 16:28:48 GMT 2008


On 04.12.2008 at 17:09, Iwona Sakrejda wrote:

> Hi,
>
> As what user does the prolog script run? Just want to double-check...
> The user running the job, sgeadmin or root?

Hi,

By default it runs as the user who owns the job. You can change
this by prefixing the script path in the prolog setting with "user@".
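
One way to double-check on a node is to let the prolog record its own
identity. The following is only a minimal sketch: PROLOG_LOG and
log_prolog_user are illustrative names (not SGE settings), and the
assumption that JOB_ID is present in the prolog's environment should be
verified on your installation.

```shell
#!/bin/sh
# Hypothetical debugging helper for a prolog script.
# PROLOG_LOG is an illustrative variable, not something SGE defines.
PROLOG_LOG="${PROLOG_LOG:-/tmp/prolog_user.log}"

log_prolog_user() {
    # id -un prints the effective user name of the running process;
    # uname -n prints the node name.
    echo "prolog on $(uname -n) running as $(id -un), JOB_ID=${JOB_ID:-unknown}" >> "$PROLOG_LOG"
}

log_prolog_user
```

With a prolog setting of the form "root@/path/to/prolog.sh" (the path is
a placeholder) the logged name should change from the job owner to root.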

-- Reuti
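
For reference, the "status: 32512" in the shepherd trace quoted further
down is a raw wait status, and the exit code is carried in its high
byte. The sketch below decodes it, assuming the usual Linux encoding;
decode_status is an illustrative helper, not part of SGE.

```shell
#!/bin/sh
# Decode a raw wait status as printed in the shepherd trace.
# Assumes the common Linux layout: exit code in bits 8-15,
# terminating signal in bits 0-6.
decode_status() {
    status=$1
    echo "exit=$(( ($status >> 8) & 255 )) signal=$(( $status & 127 ))"
}

decode_status 32512
```

This prints "exit=127 signal=0", matching the WEXITSTATUS in the trace.
The quoted prolog only ever exits with 2, and POSIX shells use 127 for
"command not found", so one possible reading (a guess, not a diagnosis)
is that the prolog or its interpreter could not be started in that
user's environment, which would also fit with no e-mails being sent.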


> I am confusing myself...
>
> Iwona
>
>
> On 12/4/08 7:59 AM, Iwona Sakrejda wrote:
>>
>> Hi
>>
>> The user is known on the host and the prolog is quite simple:
>> #!/bin/sh
>> GPFSLOGFILE="/var/adm/ras/mmfs.log.latest"
>> # Check whether scratch is there
>>
>> if [ ! -d "/chos/local/scratch" ] ; then
>>    echo "Missing scratch on `hostname`" |/bin/mail -s "queue in error state on `hostname`" abc at abc.com
>>    exit 2
>>  else
>>    perm=`ls -ld /chos/local/scratch|awk '{print $1}'`
>>
>>    if [ "$perm" != "drwxrwxrwt" ] ; then
>>       echo "Permissions wrong on scratch on `hostname`" |/bin/mail -s "queue in error state on `hostname`" abc at abc.com
>>       exit 2
>>    fi
>>
>>    mop=`cat /proc/mounts | /bin/grep /chos/local/scratch|awk '{print $4}'`
>>    if [ "$mop" != "rw" ] ; then
>>       echo "Scratch is mounted read only on `hostname`" |/bin/mail -s "queue in error state on `hostname`" abc at abc.com
>>       exit 2
>>    fi
>>  fi
>>
>> And I am not getting any e-mails from the prolog. In the past, and
>> when I tested just now by unmounting scratch on a node, I did get
>> those e-mails. This morning another user joined the crowd, and I see
>> the same problem with his account.
>>
>> Anyway, it seems to me that the prolog is failing in some other way
>> than through one of its own error exits...
>>
>> Thank you...
>>
>> Iwona
>>
>> On 12/4/08 3:12 AM, reuti wrote:
>>>
>>> Hi,
>>>
>>> On 04.12.2008 at 00:05, Iwona Sakrejda wrote:
>>>>
>>>> I have this one user whose jobs are flushing through hosts and
>>>> pushing queues into error state. I cannot figure it out. Here is
>>>> a snippet from an e-mail generated by such a job.
>>>>
>>>> 12/03/2008 14:27:51 [171:13726]: wait3 returned 13727 (status: 32512; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 127)
>>>> 12/03/2008 14:27:51 [171:13726]: prolog exited with exit status 127
>>>> 12/03/2008 14:27:51 [171:13726]: reaped "prolog" with pid 13727
>>>> 12/03/2008 14:27:51 [171:13726]: prolog exited not due to signal
>>>> 12/03/2008 14:27:51 [171:13726]: prolog exited with status 127
>>>> 12/03/2008 14:27:51 [171:13726]: exit_status of prolog = 127
>>>> 12/03/2008 14:27:51 [171:13726]: no epilog script to start
>>>>
>>>> Other jobs are running happily on those nodes. Could you suggest
>>>> where to start looking for the cause?
>>>
>>> Is the user known on the hosts? What's the prolog doing?
>>>
>>> -- Reuti
>>>
>>>> Thanks a lot, iwona
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=90973
>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91168

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
