[GE users] Migration Issue + Conditional "failed receiving gdi request" error

Jonathan Pierce jonathan.pierce at loni.ucla.edu
Sat Aug 11 03:08:12 BST 2007


We have two machines set up for qmaster responsibilities: cerebro-rmn1
and cerebro-rmn2.  Right now, cerebro-rmn1 is master.  If I attempt to
run qstat from cerebro-rmn2, I get:

<hdezmora at cerebro-rmn2.data> qstat
error: commlib error: got read error (closing "cerebro-rmn1.data/ 
error: failed sending gdi request

and the following appears in /usr/sge/loni/spool/qmaster/messages  
(loni is the cell):

08/10/2007 18:44:21|qmaster|cerebro-rmn1|E|commlib error: got read error (closing "cerebro-rsn2.data/qstat/6")
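
As a first check, it may help to confirm which host the cell currently records as master. Grid Engine keeps that hostname in the act_qmaster file under the cell's common directory. The sketch below reads it; the helper name current_master is hypothetical, and SGE_ROOT=/usr/sge is only an assumption inferred from the spool path above.

```shell
#!/bin/sh
# Print the host a cell currently records as qmaster.
# current_master is a hypothetical helper, not part of Grid Engine.
current_master() {
    sge_root=$1    # assumed to be /usr/sge in this setup
    sge_cell=$2    # the cell name, e.g. loni
    file="$sge_root/$sge_cell/common/act_qmaster"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    # The file holds the master's hostname on its first line.
    head -n 1 "$file"
}

# Example call (path is an assumption): current_master /usr/sge loni
```

If this prints something other than cerebro-rmn1 on either machine, the two hosts disagree about who is master, which would be worth ruling out first.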

Now, this may be related to another issue we're experiencing.
Originally, when we tried to migrate services, the migration shut down
the qmaster on the first machine but failed to start it on the second,
leaving Grid Engine unusable until qmaster was manually started on one
or the other.  The main issue was that the lock file was not being
deleted.  We hacked the script as follows:

        while [ $lock_file_read_count -lt $lock_file_read_retries ]; do
           if [ -f $qmaster_spool_dir/lock ]; then
               # added line: remove the lock file if it is still there
               rm $qmaster_spool_dir/lock
           fi
           sleep 5
           lock_file_read_count=`expr $lock_file_read_count + 1`
        done

where the defaults are lock_file_read_retries=10 and sleep 3; the "rm
$qmaster_spool_dir/lock" line was added.  I would assume that migrate
should already work on its own, but we added this as a (hopefully)
temporary fix.  I've included this information in case it helps
anybody figure out what's wrong.  Any assistance would be greatly
appreciated.
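
For illustration, a less aggressive form of this workaround would wait out the retries and remove the lock only if it is still present at the end. This is a minimal sketch and not the stock migrate logic; the function name wait_for_lock_release and its three arguments are hypothetical.

```shell
#!/bin/sh
# Wait for a lock file to disappear; remove it only as a last resort.
# wait_for_lock_release is a hypothetical helper for illustration.
wait_for_lock_release() {
    lock_file=$1    # e.g. "$qmaster_spool_dir/lock"
    retries=$2      # e.g. 10
    delay=$3        # seconds between checks, e.g. 3

    count=0
    while [ "$count" -lt "$retries" ]; do
        if [ ! -f "$lock_file" ]; then
            return 0            # lock was released normally
        fi
        sleep "$delay"
        count=$((count + 1))
    done

    # Still locked after all retries: treat the file as stale.
    echo "removing stale lock file $lock_file" >&2
    rm -f "$lock_file"
    return 1
}
```

A non-zero return tells the caller that the lock had to be removed by force, so the event can be logged rather than passing silently.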

Thank you very much,

Jonathan Pierce
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Cell: 310.487.8365
Fax: 310.206.5518
jonathan.pierce at loni.ucla.edu
