Opened 7 years ago

Closed 6 years ago

#1437 closed defect (fixed)

builtin qlogin client fails on 8.1.1 and 8.1.2

Reported by: d.munro@…
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 8.1.2
Severity: major
Keywords:
Cc:

Description

The builtin qlogin client fails to start a shell on versions after 8.1.0 (8.1.1 and 8.1.2). This is what I get when running on 8.1.2:

% qlogin
Your job 120 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 120 has been successfully scheduled.
Establishing builtin session to host dingly.ph.surrey.ac.uk ...
error: commlib error: got read error (closing "dingly.ph.surrey.ac.uk/shepherd_ijs/2")
%

If I enable debugging to capture a trace and debug logs as described elsewhere, this is what I get in the terminal window; the captured logs are in the attached files.

% qlogin
Your job 124 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 124 has been successfully scheduled.
Establishing builtin session to host dingly.ph.surrey.ac.uk ...
09/26/2012 10:38:13 [0:15613]: child: closing parents end of the pipe
09/26/2012 10:38:13 [0:15613]: child: trying to read from parent through the pipe
09/26/2012 10:38:13 [0:15613]: child: error communicating with parent: 0, Success
09/26/2012 10:38:13 [0:15613]: no epilog script to start
09/26/2012 10:38:13 [0:15613]: writing exit status to qrsh: 0
09/26/2012 10:38:13 [0:15613]: sending UNREGISTER_CTRL_MSG with exit_status = "0"
09/26/2012 10:38:13 [0:15613]: sending to host: <null>
09/26/2012 10:38:13 [0:15613]: comm_write_message returned: can't find handle
09/26/2012 10:38:13 [0:15613]: close_parent_loop: comm_write_message() returned 0 instead of 1!!!
09/26/2012 10:38:13 [0:15613]: waiting for UNREGISTER_RESPONSE_CTRL_MSG
09/26/2012 10:38:13 [0:15613]: No connection or problem while waiting for message: 1
09/26/2012 10:38:13 [0:15613]: parent: cl_com_ignore_timeouts
09/26/2012 10:38:13 [0:15613]: parent: error in comm_cleanup_lib(): 3
09/26/2012 10:38:13 [0:15613]: parent: leaving closinge_parent_loop()

Attachments (4)

strace.shepherd.1 (3.4 KB) - added by d.munro@… 7 years ago.
strace.shepherd.2.gz (50.0 KB) - added by d.munro@… 7 years ago.
qrsh (1.7 KB) - added by dlove 6 years ago.
qrsh wrapper
qlogin (674 bytes) - added by dlove 6 years ago.
qlogin wrapper


Change History (14)

Changed 7 years ago by d.munro@…

Changed 7 years ago by d.munro@…

comment:1 Changed 7 years ago by d.munro@…

I can't quite see where it is going wrong, but it appears that the child process tries to read an initial message from the parent before the parent has written it to the pipe. The read returns 0 bytes, so the child assumes an error and shuts down.
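
For anyone reading the trace: on POSIX systems a `read()` that returns 0 bytes from a blocking pipe normally signals end-of-file (every write end has been closed) rather than a failure, which would be consistent with the `0, Success` in the shepherd log above. The following is a minimal Python sketch of that behaviour, not the shepherd code; `read_initial_message` is a hypothetical helper named only for illustration.

```python
import os

def read_initial_message(fd, size=64):
    """Classify one read from a pipe as either data or end-of-file."""
    data = os.read(fd, size)
    if data == b"":
        # A 0-byte read means every write end of the pipe is closed.
        return ("eof", data)
    return ("data", data)

# Case 1: the writer sends a message before closing its end.
r, w = os.pipe()
os.write(w, b"start")
os.close(w)
first = read_initial_message(r)
# Buffered data is delivered first; the next read reports end-of-file.
second = read_initial_message(r)
os.close(r)

# Case 2: the write end is closed without anything having been written.
r2, w2 = os.pipe()
os.close(w2)
empty = read_initial_message(r2)
os.close(r2)

print(first, second, empty)
```

If the child is hitting something like the second case, it may be worth checking whether the write end of the pipe was closed before the initial message was sent.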

There are also several cases of the parent trying to close invalid file descriptors when starting the child, as in:

[pid 15611] close(4294967295) = -1 EBADF (Bad file descriptor)
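
The odd-looking descriptor number is -1 viewed as an unsigned 32-bit value (2^32 - 1 = 4294967295), so these calls amount to `close(-1)`. A small Python sketch, assuming a POSIX system, shows the conversion and the resulting `EBADF`; it is an illustration only and says nothing about where in the source the calls originate.

```python
import ctypes
import errno
import os

# -1 reinterpreted as an unsigned 32-bit integer.
as_unsigned = ctypes.c_uint32(-1).value

# Closing descriptor -1 fails with EBADF and has no other effect.
try:
    os.close(-1)
    close_errno = None
except OSError as exc:
    close_errno = exc.errno

print(as_unsigned, errno.errorcode.get(close_errno))
```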

comment:4 Changed 7 years ago by dlove

Thanks particularly for the diagnostics. Of course it works fine here,
though it happens I may have run into the same problem recently on a
system I don't admin and so couldn't investigate. What OS are you
using? (I wonder if that makes a difference to it showing up for some
reason.) Do you happen to know whether it's a regression in 8.1.1?

Anyway, I'll see if I can guess the problem on the basis of that info
when I have a chance.

comment:5 Changed 7 years ago by d.munro@…

I'm running on Ubuntu Precise x86_64. I built SGE from source, starting with 8.1.0 and updating to 8.1.2. I didn't have to make any changes to the source tree to get it to build; I just had to get all the various dev packages in place. The original nodes with 8.1.0 work; any with newer builds fail. If I try to enable debugging/strace on 8.1.0, it also fails, so I couldn't compare a working trace with a non-working one.

It's also less of an issue now as I have finally got ssh working properly without all those auth warnings on my test cluster - I just need to package the changes needed to apply to the real cluster.

If you would like me to test anything, just let me know.

comment:6 Changed 7 years ago by dlove

> I'm running on Ubuntu Precise x86_64.

Right, I can reproduce that, though I'm sure I've had it working on the
same box, and it's fine on the other GNU/Linux versions I have. I've a
horrible suspicion it's related to racing threads.

> It's also less of an issue now as I have finally got ssh working properly
> without all those auth warnings on my test cluster - I just need to
> package the changes needed to apply to the real cluster.

Are there any SGE changes that should be made? [I assume you know about
the PAM module to have control and accounting working.]

> If you would like me to test anything, just let me know.

Thanks.

comment:7 follow-up: Changed 7 years ago by dlove

This definitely looks like a problem with threading. Compiling with extra diagnostics
made it work, and the OGE 6.2u6 fix list has

  • 6910082 make new interactive job support not use threads

comment:8 in reply to: ↑ 7 Changed 7 years ago by heavenevil

Replying to dlove:

> This definitely looks like a problem with threading. Compiling with extra diagnostics
> made it work, and the OGE 6.2u6 fix list has
>
>   • 6910082 make new interactive job support not use threads

I met the same problem, though I run Red Hat. Did you find a solution? Thanks.

comment:9 Changed 7 years ago by dlove

SGE <sge-bugs@…> writes:

> I met the same problem, though I run Red Hat.

Which version, and how was it built, exactly? (It even seems to be
sensitive to optimization.)

> Did you find a solution?

No. I guess it can be kludged with a delay somewhere, but it may be
that the right solution is to use the facility demonstrated by
libs/comm/examples.
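
To illustrate why a delay would only be a kludge: a sleep narrows the window of a race but does not close it, whereas an explicit readiness signal removes the dependence on timing. The sketch below is generic Python, not SGE code and not the facility in libs/comm/examples; `Connection`, `setup_connection` and `send` are hypothetical names.

```python
import threading

class Connection:
    """Hypothetical stand-in for a handle that another thread creates."""

    def __init__(self):
        self.handle = None
        self.ready = threading.Event()

    def setup_connection(self):
        # Runs in a worker thread; the handle exists only after this point.
        self.handle = "handle-1"
        self.ready.set()

    def send(self, message, timeout=5.0):
        # Wait for the explicit signal instead of sleeping and hoping.
        if not self.ready.wait(timeout):
            raise TimeoutError("connection was never set up")
        return f"{self.handle}: {message}"

conn = Connection()
worker = threading.Thread(target=conn.setup_connection)
worker.start()
result = conn.send("UNREGISTER")
worker.join()
print(result)
```

With the event in place, `send` cannot run ahead of the setup however the threads are scheduled, which is the property a fixed delay cannot guarantee.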

Of course, the workaround is to run ssh for the remote startup (using
the PAM module to provide job tracking), but I'm not sure whether that's
only necessary for qlogin, or also for qrsh.

By the way, in case anyone else looks at it, the occurrences of
close(-1) apparently aren't errors; see [4420].

Changed 6 years ago by dlove

qrsh wrapper

comment:10 Changed 6 years ago by dlove

Attached workaround wrappers following suggestion by Mark Dixon.

Changed 6 years ago by dlove

qlogin wrapper

comment:11 Changed 6 years ago by omula

I am also having the same problem on Debian Squeeze.

Kernel version: 3.2.46-1~bpo60+1
Libc6 version: 2.11.3-4

comment:12 Changed 6 years ago by dlove

  • Resolution set to fixed
  • Severity changed from minor to major
  • Status changed from new to closed

Assuming this is properly fixed by [4547] (#1467), which omitted the fix tag.
Heaven knows why it wasn't obvious in testing/debugging...
