[GE users] Job causes queues to go into error state

Iwona Sakrejda isakrejda at lbl.gov
Thu Dec 4 15:59:29 GMT 2008


Hi

The user is known on the host and the prolog is quite simple:
#!/bin/sh
GPFSLOGFILE="/var/adm/ras/mmfs.log.latest"
# Check whether scratch is there

if [ ! -d "/chos/local/scratch" ] ; then
   echo "Missing scratch on `hostname`" |/bin/mail -s "queue in error state on `hostname`" abc at abc.com
   exit 2
 else
   perm=`ls -ld /chos/local/scratch|awk '{print $1}'`

   if [ "$perm" != "drwxrwxrwt" ] ; then
      echo "Permissions wrong on scratch on `hostname`" |/bin/mail -s "queue in error state on `hostname`" abc at abc.com
      exit 2
   fi

   mop=`cat /proc/mounts | /bin/grep /chos/local/scratch|awk '{print $4}'`
   if [ "$mop" != "rw" ] ; then
      echo "Scratch is mounted read only on `hostname`" |/bin/mail -s "queue in error state on `hostname`" abc at abc.com
      exit 2
   fi
 fi
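
A possible refinement of the read-only check, offered only as a sketch: the fourth field of /proc/mounts is a comma-separated option list, so comparing the whole field against "rw" only works when no other options are present. The helper name first_mount_option and the sample mount line are hypothetical, and the sketch assumes that Linux lists the rw/ro flag first in that field.

```shell
#!/bin/sh
# first_mount_option (hypothetical helper): read a mounts table in
# /proc/mounts format on stdin and print the first mount option
# (normally "rw" or "ro") for the mount point given as $1.
first_mount_option() {
    awk -v mp="$1" '$2 == mp { split($4, opts, ","); print opts[1]; exit }'
}

# Made-up sample line, used here instead of the real /proc/mounts.
sample='/dev/sda5 /chos/local/scratch ext3 rw,noatime 0 0'
mop=$(printf '%s\n' "$sample" | first_mount_option /chos/local/scratch)
echo "$mop"
```

In the prolog itself the helper would read the real table, for example `first_mount_option /chos/local/scratch < /proc/mounts`. Matching the mount point exactly in the second field also avoids picking up other mounts whose path merely contains the same string.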

However, I am not getting any e-mails from the prolog. In the past, and 
again when I tested just now by unmounting scratch on a node, I did get 
those e-mails. This morning another user joined the crowd, and I see the 
same problem with his account.

So it seems to me that the prolog is failing for some other reason than 
one of its own exits on error...
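
For reference, the exit status 127 reported in the quoted log below is the value a POSIX shell returns when a command cannot be found, which differs from the "exit 2" used by every error branch of the prolog. A minimal sketch that reproduces that status, assuming /nonexistent/prolog.sh is a made-up path that does not exist on the machine:

```shell
#!/bin/sh
# Ask a shell to run a command that does not exist and capture the status.
# /nonexistent/prolog.sh is a hypothetical path used only for illustration.
sh -c '/nonexistent/prolog.sh' 2>/dev/null
status=$?

# Compare with the value the prolog itself uses when a check fails.
sh -c 'exit 2'
prolog_style_status=$?

echo "missing command: $status, prolog error branch: $prolog_style_status"
```

A status of 127 together with the absence of any e-mail would be consistent with the script, its interpreter, or one of its commands not being found for the affected jobs, rather than with one of the scratch checks failing. That reading is an assumption and not something the log confirms.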

Thank you...

Iwona

On 12/4/08 3:12 AM, reuti wrote:
> Hi,
>
> Am 04.12.2008 um 00:05 schrieb Iwona Sakrejda:
>
>   
>> I have this one user whose jobs are flushing through hosts and  
>> pushing queues into error state.
>> I cannot figure it out. Here is a snippet from an e-mail generated  
>> by such a job.
>>
>> 12/03/2008 14:27:51 [171:13726]: wait3 returned 13727 (status:  
>> 32512; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 127)
>> 12/03/2008 14:27:51 [171:13726]: prolog exited with exit status 127
>> 12/03/2008 14:27:51 [171:13726]: reaped "prolog" with pid 13727
>> 12/03/2008 14:27:51 [171:13726]: prolog exited not due to signal
>> 12/03/2008 14:27:51 [171:13726]: prolog exited with status 127
>> 12/03/2008 14:27:51 [171:13726]: exit_status of prolog = 127
>> 12/03/2008 14:27:51 [171:13726]: no epilog script to start
>>
>> Other jobs are running happily on those nodes.
>> Could you suggest where to start looking for the cause?
>>     
>
> is the user known on the hosts? What's the prolog doing?
>
> -- Reuti
>
>   
>> Thanks a lot,
>>
>> iwona
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=90973
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>     
>
>



