Opened 7 years ago

#1364 new defect

intermittent qrsh failure

Reported by: dlove
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 8.0.0c
Severity: minor
Keywords:
Cc:

Description

This isn't (consistently?) reproducible. Job error report, including the shepherd trace:

Job 118323 caused action: All Queues on host "node091" set to ERROR
 User        = ***
 Queue       = serial@node091
 Start Time  = <unknown>
 End Time    = <unknown>
failed before prolog: shepherd exited with exit status 7: before prolog
Shepherd trace:
11/11/2011 09:40:53 [340:29368]: shepherd called with uid = 0, euid = 340
11/11/2011 09:40:53 [340:29368]: rlogin_daemon = builtin
11/11/2011 09:40:53 [340:29368]: starting up 8.0.0c
11/11/2011 09:40:53 [340:29368]: setpgid(29368, 29368) returned 0
11/11/2011 09:40:53 [340:29368]: do_core_binding: explicit
11/11/2011 09:40:53 [340:29368]: do_core_binding: explicit: binding done
11/11/2011 09:40:53 [340:29368]: do_core_binding: finishing
11/11/2011 09:40:53 [340:29368]: no prolog script to start
11/11/2011 09:40:53 [340:29368]: pipe to child uses fds 5 and 6
11/11/2011 09:40:53 [340:29368]: calling fork_pty()
11/11/2011 09:40:53 [340:29368]: parent: forked "job" with pid 29369
11/11/2011 09:40:53 [340:29369]: child: closing parents end of the pipe
11/11/2011 09:40:53 [340:29369]: child: trying to read from parent through the pipe
11/11/2011 09:40:53 [340:29368]: parent: job-pid: 29369
11/11/2011 09:40:53 [340:29368]: parent: closing childs end of the pipe
11/11/2011 09:40:53 [340:29368]: csp = 0
11/11/2011 09:40:53 [340:29368]: parent: starting parent loop with remote_host = head, remote_port = 54885, job_owner = ***, fd_pty_master = 7, fd_pipe_in = -1, fd_pipe_out = -1, fd_pipe_err = -1, fd_pipe_to_child = 6
11/11/2011 09:40:53 [340:29368]: parent: opening connection to qrsh/qlogin client
11/11/2011 09:40:53 [340:29368]: parent: sending REGISTER_CTRL_MSG to qrsh/qlogin client
11/11/2011 09:40:53 [340:29368]: parent: creating pty_to_commlib thread
11/11/2011 09:40:53 [340:29368]: parent: creating commlib_to_pty thread
11/11/2011 09:40:53 [340:29368]: commlib_to_pty: received window size message, changing window size
11/11/2011 09:40:53 [340:29368]: pty_to_commlib: closing pipe to child
11/11/2011 09:40:53 [340:29369]: child: error communicating with parent: 1, Operation not permitted
11/11/2011 09:40:53 [340:29369]: no epilog script to start
11/11/2011 09:40:53 [0:29369]: could not write exit_status file

11/11/2011 09:40:53 [340:29369]: writing exit status to qrsh: 0
11/11/2011 09:40:53 [340:29369]: sending UNREGISTER_CTRL_MSG with exit_status = "0"
11/11/2011 09:40:53 [340:29369]: sending to host: <null>
11/11/2011 09:40:53 [340:29369]: comm_write_message returned: can't find handle
11/11/2011 09:40:53 [340:29369]: close_parent_loop: comm_write_message() returned 0 instead of 1!!!
11/11/2011 09:40:53 [340:29369]: waiting for UNREGISTER_RESPONSE_CTRL_MSG
11/11/2011 09:40:53 [340:29369]: No connection or problem while waiting for message: 1
11/11/2011 09:40:53 [340:29369]: parent: cl_com_ignore_timeouts
11/11/2011 09:40:53 [340:29369]: parent: error in comm_cleanup_lib(): 3
11/11/2011 09:40:53 [340:29369]: parent: leaving closinge_parent_loop()
11/11/2011 09:40:53 [340:29368]: commlib_to_pty: received settings message
11/11/2011 09:40:53 [340:29368]: commlib_to_pty: writing to child 11 bytes: noshell = 0
11/11/2011 09:40:53 [340:29368]: commlib_to_pty: error in communicating with child -> exiting
11/11/2011 09:40:53 [340:29368]: now sending signal KILL to pid 29369
11/11/2011 09:40:53 [0:29368]: pty_to_commlib: STDOUT was closed. Our child seems to have exited -> exiting
11/11/2011 09:40:53 [0:29368]: pty_to_commlib: STDERR was closed. Our child seems to have exited -> exiting
11/11/2011 09:40:53 [0:29368]: parent: created both worker threads, now waiting for jobs end
11/11/2011 09:40:53 [0:29368]: wait3 returned 29369 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
11/11/2011 09:40:53 [0:29368]: job exited with exit status 0
11/11/2011 09:40:53 [0:29368]: parent: wait_my_child returned exit_status = 0
11/11/2011 09:40:53 [0:29368]: parent:            rusage.ru_stime.tv_sec  = 0
11/11/2011 09:40:53 [0:29368]: parent:            rusage.ru_stime.tv_usec = 2999
11/11/2011 09:40:53 [0:29368]: parent:            rusage.ru_utime.tv_sec  = 0
11/11/2011 09:40:53 [0:29368]: parent:            rusage.ru_utime.tv_usec = 0
11/11/2011 09:40:53 [0:29368]: parent: received event 1000, g_raised_event = 2
11/11/2011 09:40:53 [0:29368]: parent: shutting down pty_to_commlib thread
11/11/2011 09:40:53 [0:29368]: parent: shutting down commlib_to_pty thread
11/11/2011 09:40:53 [0:29368]: parent: thread_cleanup_lib()
11/11/2011 09:40:53 [0:29368]: parent: leaving main loop. From here on, only the main thread is running.
11/11/2011 09:40:53 [340:29368]: reaped "job" with pid 29369
11/11/2011 09:40:53 [340:29368]: job exited not due to signal
11/11/2011 09:40:53 [340:29368]: job exited with status 0
11/11/2011 09:40:53 [340:29368]: ignored signal KILL to pid -29369
11/11/2011 09:40:53 [340:29368]: writing usage file to "usage"
11/11/2011 09:40:53 [340:29368]: no tasker to notify
11/11/2011 09:40:53 [340:29368]: no epilog script to start
11/11/2011 09:40:53 [0:29368]: could not write exit_status file

11/11/2011 09:40:53 [340:29368]: writing exit status to qrsh: 0
11/11/2011 09:40:53 [340:29368]: sending UNREGISTER_CTRL_MSG with exit_status = "0"
11/11/2011 09:40:53 [340:29368]: sending to host: head
11/11/2011 09:40:53 [340:29368]: waiting for UNREGISTER_RESPONSE_CTRL_MSG
11/11/2011 09:40:53 [340:29368]: Received UNREGISTER_RESPONSE_CTRL_MSG
11/11/2011 09:40:53 [340:29368]: parent: cl_com_ignore_timeouts
11/11/2011 09:40:54 [340:29368]: parent: leaving closinge_parent_loop()

Shepherd pe_hostfile:
node091 1 serial@node091 UNDEFINED
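
The parent (pid 29368) and the forked child (pid 29369) write to the same trace, so their lines interleave. A minimal sketch for splitting such a trace per process follows. It assumes the bracketed prefix is `[euid:pid]`, which is inferred from the "uid = 0, euid = 340" line and not confirmed here; `TRACE_RE` and `group_by_pid` are hypothetical helper names, not part of SGE.

```python
import re
from collections import defaultdict

# Matches trace lines such as:
#   11/11/2011 09:40:53 [340:29368]: starting up 8.0.0c
# Assumption: the bracketed pair is [euid:pid].
TRACE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<euid>\d+):(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def group_by_pid(lines):
    """Return {pid: [(time, euid, message), ...]} in original order.

    Lines that do not look like trace lines (blank lines, headers)
    are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        match = TRACE_RE.match(line.strip())
        if match is None:
            continue
        groups[int(match.group("pid"))].append(
            (match.group("time"), int(match.group("euid")), match.group("msg"))
        )
    return dict(groups)


# A few lines copied from the trace above.
sample = [
    "11/11/2011 09:40:53 [340:29368]: calling fork_pty()",
    "11/11/2011 09:40:53 [340:29369]: child: error communicating with parent: 1, Operation not permitted",
    "11/11/2011 09:40:53 [0:29369]: could not write exit_status file",
    "",
    "11/11/2011 09:40:53 [340:29369]: sending to host: <null>",
]

for pid, entries in group_by_pid(sample).items():
    print(pid)
    for time, euid, msg in entries:
        print(f"  {time} euid={euid} {msg}")
```

Read per pid, the trace above shows 29369 logging `sending to host: <null>` after its pipe error, while 29368 later logs `sending to host: head` and receives the UNREGISTER_RESPONSE_CTRL_MSG.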

