[GE users] FAQ Problem With A New Twist

Langston, Chris Chris.Langston at aa.com
Fri Mar 16 18:14:23 GMT 2007



We are having an intermittent issue with jobs submitted with qrsh. The
jobs report the error message: "error: error reading returncode of
remote command".


Here's the twist:

*	The job runs to completion.
*	The problem is intermittent. Some runs succeed. Some fail with the
error message.
*	Only one type of job using qrsh fails. None of the others have
any problems.
*	The job succeeds when rerun. No error message.
*	The solutions in the FAQ are already in place.

	*	-r-s--x--x   1 root     root       18376 May  8  2006
	*	-r-s--x--x   1 root     root       31896 May  8  2006
	*	-rwxr-xr-x   1 root     root      363344 May  8  2006
	*	SGE_ROOT is on an suid NFS file system
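
For anyone wanting to re-verify the permission bits listed above, a minimal sketch of a check follows. The function name check_suid is hypothetical, and the file paths are deliberately left as arguments because they are not given in this post. It only tests the setuid bit, not root ownership.

```shell
#!/bin/sh
# check_suid FILE...
# Hypothetical helper: prints "ok FILE" when the setuid bit is set on FILE
# and "missing-suid FILE" otherwise. Returns non-zero if any file lacks it.
# Note: this does not check that the owner is root.
check_suid() {
    suid_rc=0
    for f in "$@"; do
        if [ -u "$f" ]; then      # -u: file exists and has its set-user-ID bit set
            echo "ok $f"
        else
            echo "missing-suid $f"
            suid_rc=1
        fi
    done
    return $suid_rc
}
```

Since NFS mount options are set on each client, it may be worth running this on every execution host rather than only on the submit host.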


Because a task can be a C/C++, Java, Perl, or Python program or a
shell script, all tasks are submitted using a single wrapper script.
Here's how a job task gets submitted.


Both wrapper scripts are ksh scripts. 


 An application needs to run a job inside the grid. It calls a single
entry point shell script.


 WrapperScript '<wrapper script args>' command command_args

         # After setting up the qrsh_args based on the job requirements,
         # it calls qrsh.

         # qrsh submits another wrapper script that will execute the
         # command with args inside the grid, so that the WrapperScript
         # is *always* submitting a shell script.

         qrsh <qrsh args> GridWrapperScript command command_args


   # GridWrapperScript executes command command_args and gets the return
   # code

        $*; rc=$?


        exit $rc
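
One way to narrow this down might be to have the grid-side wrapper record the exit status itself, so it can be compared with what qrsh reports when the error appears. The following is only a sketch under assumptions: run_and_record and GRID_RC_LOG are hypothetical names, not part of SGE or of our actual scripts, and "$@" is used instead of $* so that arguments containing spaces are passed through intact. JOB_ID is the variable SGE sets in the job environment.

```shell
#!/bin/sh
# Hypothetical variant of the GridWrapperScript body, written as a function.
# GRID_RC_LOG is an assumed variable naming a log file; if it is unset,
# nothing is recorded and the function only relays the exit status.
run_and_record() {
    "$@"                          # run the command with its arguments preserved
    cmd_rc=$?
    if [ -n "${GRID_RC_LOG:-}" ]; then
        # One line per run: timestamp, host, SGE job id (if set), exit status.
        printf '%s host=%s job=%s rc=%d cmd=%s\n' \
            "$(date '+%Y-%m-%dT%H:%M:%S')" "$(uname -n)" \
            "${JOB_ID:-unknown}" "$cmd_rc" "$*" >> "$GRID_RC_LOG"
    fi
    return $cmd_rc
}
```

In the wrapper this would be called as run_and_record "$@" followed by exit $?. If the log shows rc=0 for a run where qrsh printed the error, the command itself finished cleanly and the problem lies in relaying the status back.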


We use 3 Sun (Opteron) servers to run the jobs through SGE 6.u8 on an
NFS-mounted (suid) file system.


Any help in resolving this will be immensely appreciated.



Chris Langston
