[GE users] FAQ Problem With A New Twist
Chris.Langston at aa.com
Fri Mar 16 18:14:23 GMT 2007
We are having an intermittent issue with jobs submitted with qrsh. Some
of them end with the error message: "error: error reading returncode of
remote command".
Here's the twist:
* The job runs to completion.
* The problem is intermittent. Some succeed. Some fail with the
* Only one type of job using qrsh fails. All the others don't have
* The job succeeds when rerun. No error message.
* The solutions in the faq are already in place.
* -r-s--x--x 1 root root 18376 May 8 2006
* -r-s--x--x 1 root root 31896 May 8 2006
* -rwxr-xr-x 1 root root 363344 May 8 2006
* SGE_ROOT is on an suid NFS file system
Because a task can be a C/C++ or Java program, or a Perl, Python or
shell script, all tasks are submitted using a single wrapper script.
Here's how a job task gets submitted.
Both wrapper scripts are ksh scripts.
An application needs to run a job inside the grid. It calls a single
entry point shell script.
WrapperScript '<wrapper script args>' command command_args
# After setting up the qrsh_args based on the job requirements,
# it calls qrsh.
# qrsh submits another wrapper script that will execute the
# command with args inside the grid, so that the WrapperScript
# is *always* submitting a shell script.
qrsh <qrsh args> GridWrapperScript command command_args
# GridWrapperScript executes command command_args and gets the return
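
To make the flow above concrete, here is a rough sketch of the two pieces in plain sh (it should also run under ksh). This is not our actual code: the function names run_and_report and submit_to_grid, and the variables QRSH, QRSH_ARGS and GRID_WRAPPER, are made up for illustration. It only shows the general pattern of running the command inside the grid, capturing its exit status, and handing that status back to the caller.

```shell
#!/bin/sh
# Hypothetical sketch only. All names here are illustrative assumptions,
# not taken from the real WrapperScript or GridWrapperScript.

# Grid side: run the command with its arguments, remember its exit
# status, log it to stderr, and return it to the caller.
run_and_report() {
    rc=0
    "$@" || rc=$?
    # Logging the status here leaves a record on the execution host
    # even if the submitting side never receives the return code.
    echo "grid_wrapper: command exited with status $rc" >&2
    return "$rc"
}

# Submit side: hand the command to qrsh behind the grid-side wrapper.
# QRSH, QRSH_ARGS and GRID_WRAPPER are assumed environment variables.
submit_to_grid() {
    # QRSH_ARGS is left unquoted on purpose so that it splits into
    # separate options.
    ${QRSH:-qrsh} ${QRSH_ARGS:-} "${GRID_WRAPPER:-GridWrapperScript}" "$@"
}
```

In a real grid-side script, the last lines would presumably be a call to run_and_report "$@" followed by exit $?, so that the script's own exit status is the command's exit status.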
We use 3 Sun (Opteron) servers to run the jobs through SGE 6.u8 on an
NFS (suid) mounted file system.
Any help in resolving this will be immensely appreciated.