[GE users] LAM & SGE

Orion Poplawski orion at cora.nwra.com
Fri Aug 27 19:23:27 BST 2004


> I've not had any luck with this under SGE 6.0.  I end up with a runaway
> qrsh process using 100% cpu and the job fails and puts the queue in an
> error state.  Both SGE and lam are configured to use ssh.
>

My setup had a number of problems when I wrote my last post.  I have fixed
those, but still no luck: the qrsh process still hangs.

To get it to work I have to comment out either:

# close STDIN to avoid stdio race conditions and tty issues
#close(STDIN);

or, if running in debug mode:

  debug_print("QRSH LOCAL CONFIG: @myargs");
#  if($debug){ close(SGEDEBUG); }

So it looks like qrsh hangs when it is started with closed filehandles.

More precisely, it looks like qrsh hangs when it tries to contact the
qmaster over file descriptor 0, which its socket gets once STDIN has been
closed.  Perhaps it's making some bad assumptions?  In the failing case
select returns with fd 0 writable, but qrsh never sends any data and just
keeps calling select, which would explain the 100% cpu.  In the working
case select returns with fd 3 writable and qrsh then sends its data.
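
To show why the socket ends up on fd 0 at all, here is a minimal sketch in Python (not the wrapper's Perl and not qrsh's own code). It assumes a POSIX system, and the function name is made up for illustration. It closes fd 0 the way close(STDIN) does and then asks the kernel for a socket, which receives the lowest free descriptor.

```python
import os
import socket


def socket_fd_after_closing_stdin():
    """Close fd 0, open a TCP socket, and return the descriptor it received."""
    try:
        saved = os.dup(0)      # keep a copy so stdin can be restored afterwards
    except OSError:
        saved = None           # fd 0 was already closed by the caller
    if saved is not None:
        os.close(0)            # same effect as close(STDIN) in the wrapper
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        fd = s.fileno()        # the kernel hands out the lowest free descriptor
        s.close()
    finally:
        if saved is not None:
            os.dup2(saved, 0)  # put the original stdin back on fd 0
            os.close(saved)
    return fd


if __name__ == "__main__":
    print(socket_fd_after_closing_stdin())
```

With stdin open the same socket() call would normally return 3, as in the successful trace, so any code that treats fd 0 specially could plausibly go wrong only in the closed-STDIN case.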

failed qrsh:
17203 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 0
17203 setsockopt(0, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
17203 fcntl64(0, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
17203 gettimeofday({1093630168, 54081}, NULL) = 0
17203 gettimeofday({1093630168, 54118}, NULL) = 0
17203 gettimeofday({1093630168, 54150}, NULL) = 0
17203 connect(0, {sa_family=AF_INET, sin_port=htons(1071), sin_addr=inet_addr("65.171.192.8")}, 16) = -1 EINPROGRESS (Operation now in progress)
17203 select(1, NULL, [0], NULL, {0, 0}) = 1 (out [0], left {0, 0})
17203 getsockopt(0, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
17203 setsockopt(0, SOL_TCP, TCP_NODELAY, [1], 4) = 0
17203 gettimeofday({1093630168, 54681}, NULL) = 0
17203 gettimeofday({1093630168, 54714}, NULL) = 0
17203 gettimeofday({1093630168, 54746}, NULL) = 0
17203 gettimeofday({1093630168, 54779}, NULL) = 0
17203 select(1, [0], [0], NULL, {1, 0}) = 1 (out [0], left {1, 0})
17203 gettimeofday({1093630168, 55006}, NULL) = 0
17203 gettimeofday({1093630168, 55038}, NULL) = 0
17203 gettimeofday({1093630168, 55070}, NULL) = 0
17203 gettimeofday({1093630168, 55108}, NULL) = 0
17203 select(1, [0], [0], NULL, {1, 0}) = 1 (out [0], left {1, 0})
17203 gettimeofday({1093630168, 55313}, NULL) = 0

successful qrsh:
17323 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
17323 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
17323 fcntl64(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
17323 gettimeofday({1093630392, 367719}, NULL) = 0
17323 gettimeofday({1093630392, 367756}, NULL) = 0
17323 gettimeofday({1093630392, 367788}, NULL) = 0
17323 connect(3, {sa_family=AF_INET, sin_port=htons(1071), sin_addr=inet_addr("65.171.192.8")}, 16) = -1 EINPROGRESS (Operation now in progress)
17323 select(4, NULL, [3], NULL, {0, 0}) = 0 (Timeout)
17323 gettimeofday({1093630392, 368023}, NULL) = 0
17323 gettimeofday({1093630392, 368056}, NULL) = 0
17323 gettimeofday({1093630392, 368089}, NULL) = 0
17323 gettimeofday({1093630392, 368197}, NULL) = 0
17323 select(4, [], [3], NULL, {1, 0})  = 1 (out [3], left {1, 0})
17323 gettimeofday({1093630392, 368410}, NULL) = 0
17323 connect(3, {sa_family=AF_INET, sin_port=htons(1071), sin_addr=inet_addr("65.171.192.8")}, 16) = 0
17323 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
17323 gettimeofday({1093630392, 368533}, NULL) = 0
17323 gettimeofday({1093630392, 368591}, NULL) = 0
17323 select(4, NULL, [3], NULL, {1, 0}) = 1 (out [3], left {1, 0})
17323 write(3, "<gmsh><dl>290</dl></gmsh><cm ver"..., 315) = 315
17323 gettimeofday({1093630392, 368813}, NULL) = 0
17323 gettimeofday({1093630392, 368853}, NULL) = 0
17323 select(4, [3], NULL, NULL, {1, 0}) = 1 (in [3], left {0, 990000})
17323 read(3, "<gmsh><dl>270</dl></gm", 22) = 22
17323 gettimeofday({1093630392, 371120}, NULL) = 0
17323 select(4, [3], NULL, NULL, {1, 0}) = 1 (in [3], left {1, 0})
17323 read(3, "s", 1)                   = 1
17323 gettimeofday({1093630392, 371292}, NULL) = 0
17323 select(4, [3], NULL, NULL, {1, 0}) = 1 (in [3], left {1, 0})
17323 read(3, "h", 1)                   = 1
17323 gettimeofday({1093630392, 371462}, NULL) = 0
17323 select(4, [3], NULL, NULL, {1, 0}) = 1 (in [3], left {1, 0})
17323 read(3, ">", 1)                   = 1
17323 gettimeofday({1093630392, 371655}, NULL) = 0
17323 select(4, [3], NULL, NULL, {1, 0}) = 1 (in [3], left {1, 0})
17323 read(3, "<crm version=\"0.1\"><cs condition"..., 270) = 270
17323 gettimeofday({1093630392, 371910}, NULL) = 0
17323 gettimeofday({1093630392, 371962}, NULL) = 0
17323 gettimeofday({1093630392, 371995}, NULL) = 0
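
For anyone reading the traces, both follow the usual non-blocking connect pattern: set O_NONBLOCK, call connect and get EINPROGRESS, wait in select for the socket to become writable, then check the result. Here is a rough sketch of that pattern in Python. It is only an illustration, not qrsh's actual code; the function name and the timeout default are my own, and it checks the result with SO_ERROR as the failed trace does rather than calling connect a second time.

```python
import errno
import os
import select
import socket


def nonblocking_connect(host, port, timeout=1.0):
    """Connect without blocking and return the connected socket."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.setblocking(False)              # the fcntl64(fd, F_SETFL, O_NONBLOCK) step
    rc = s.connect_ex((host, port))   # normally returns EINPROGRESS right away
    if rc not in (0, errno.EINPROGRESS):
        s.close()
        raise OSError(rc, os.strerror(rc))
    # Wait until the socket is writable, which means the handshake finished
    # (successfully or not).
    _, writable, _ = select.select([], [s], [], timeout)
    if not writable:
        s.close()
        raise TimeoutError("connect timed out")
    # Writable alone does not mean success, so read the pending error.
    err = s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err != 0:
        s.close()
        raise OSError(err, os.strerror(err))
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return s
```

The step missing from the failed trace is whatever comes after this function returns: the successful run goes on to write() its request, while the failed run only loops on select.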


