[GE users] Job causes queues to go into error state

Iwona Sakrejda isakrejda at lbl.gov
Thu Dec 4 16:09:45 GMT 2008


Hi,

As what user does the prolog script run? Just want to double-check...
The user running the job, sgeadmin or root?

I am confusing myself...

Iwona
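
One way to settle the question empirically is to have the prolog record its own identity. The sge_conf(5) man page describes an optional user@ prefix on the prolog path that selects the user the script is started as; without that prefix the prolog should run as the job owner. The lines below are only a sketch of a diagnostic that could sit at the top of the prolog: the function name log_prolog_identity, the PROLOG_ID_LOG variable and the /tmp/prolog_identity.log path are assumptions for illustration, not part of the original script.

```shell
#!/bin/sh
# Hypothetical diagnostic helper: append who is running this script, and with
# what PATH, to the file named in $1.
log_prolog_identity() {
    {
        echo "date: $(date)"
        echo "id:   $(id)"
        echo "PATH: $PATH"
    } >> "$1" 2>&1
}

# Assumed log location; any directory writable by all job owners would do.
PROLOG_ID_LOG="${PROLOG_ID_LOG:-/tmp/prolog_identity.log}"
log_prolog_identity "$PROLOG_ID_LOG"
```

Comparing the entries written for a failing job with those written for a working one would show whether the two run under different accounts or with a different PATH.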


On 12/4/08 7:59 AM, Iwona Sakrejda wrote:
> Hi
>
> The user is known on the host and the prolog is quite simple:
> #!/bin/sh
> GPFSLOGFILE="/var/adm/ras/mmfs.log.latest"
> # Check whether scratch is there
>
> if [ ! -d "/chos/local/scratch" ] ; then
>    echo "Missing scratch on `hostname`" |/bin/mail -s "queue in error state on `hostname`" abc at abc.com
>    exit 2
>  else
>    perm=`ls -ld /chos/local/scratch|awk '{print $1}'`
>
>    if [ "$perm" != "drwxrwxrwt" ] ; then
>       echo "Permissions wrong on scratch on `hostname`" |/bin/mail -s "queue in error state on `hostname`" abc at abc.com
>       exit 2
>    fi
>
>    mop=`cat /proc/mounts | /bin/grep /chos/local/scratch|awk '{print $4}'`
>    if [ "$mop" != "rw" ] ; then
>       echo "Scratch is mounted read only on `hostname`" |/bin/mail -s "queue in error state on `hostname`" abc at abc.com
>       exit 2
>    fi
>  fi
>
> And I am not getting any e-mails from the prolog. In the past, and again
> when I tested just now by unmounting scratch on a node, I did get those
> e-mails. Actually, this morning another user joined the crowd and I see
> the same problem with his account.
>
> Anyway, it seems to me that the prolog is failing in some way other
> than through one of its own error exits...
>
> Thank you...
>
> Iwona
>
> On 12/4/08 3:12 AM, reuti wrote:
>> Hi,
>>
>> On 04.12.2008 at 00:05, Iwona Sakrejda wrote:
>>
>>   
>>> I have this one user whose jobs are flushing through hosts and
>>> pushing queues into error state.
>>> I cannot figure it out. Here is a snippet from an e-mail generated
>>> by such a job.
>>>
>>> 12/03/2008 14:27:51 [171:13726]: wait3 returned 13727 (status: 32512; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 127)
>>> 12/03/2008 14:27:51 [171:13726]: prolog exited with exit status 127
>>> 12/03/2008 14:27:51 [171:13726]: reaped "prolog" with pid 13727
>>> 12/03/2008 14:27:51 [171:13726]: prolog exited not due to signal
>>> 12/03/2008 14:27:51 [171:13726]: prolog exited with status 127
>>> 12/03/2008 14:27:51 [171:13726]: exit_status of prolog = 127
>>> 12/03/2008 14:27:51 [171:13726]: no epilog script to start
>>>
>>> Other jobs are running happily on those nodes.
>>> Could you suggest where to start looking for the cause?
>>>     
>>
>> Is the user known on the hosts? What's the prolog doing?
>>
>> -- Reuti
>>
>>   
>>> Thanks a lot,
>>>
>>> iwona
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=90973
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>     
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91090
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>
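
The shepherd trace quoted above reports a raw wait status of 32512 together with WEXITSTATUS 127, while the prolog itself only ever exits with 2. A small sketch of how the two numbers relate, assuming the usual Linux encoding (exit code in bits 8-15, terminating signal in the low 7 bits); the function names and the nonexistent command name are made up for illustration:

```shell
#!/bin/sh
# Decode a raw wait status the way the WEXITSTATUS/WTERMSIG macros do on
# Linux: exit code in bits 8-15, terminating signal in the low 7 bits.
decode_exit_status() {
    echo $(( ($1 >> 8) & 255 ))
}

decode_term_signal() {
    echo $(( $1 & 127 ))
}

echo "exit status: $(decode_exit_status 32512)"
echo "signal:      $(decode_term_signal 32512)"

# Shell convention: 127 is returned when a command cannot be found.
# no_such_command_xyz is a deliberately nonexistent, hypothetical name.
rc=0
sh -c 'no_such_command_xyz' 2>/dev/null || rc=$?
echo "missing command: $rc"
```

Since 127 is the shell's "command not found" status and none of the prolog's own checks exit with it, this would be consistent with the script, its interpreter, or a command it calls not being found for the affected users, which would also fit with no e-mail being sent. That reading is only a guess from the exit code, not something the trace confirms.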

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91161

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


