Opened 9 years ago
#1364 new defect
intermittent qrsh failure
Reported by: | dlove | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 8.0.0c |
Severity: | minor | Keywords: | |
Cc: |
Description
This isn't (consistently?) reproducible. Shepherd trace:
Job 118323 caused action: All Queues on host "node091" set to ERROR
User = ***
Queue = serial@node091
Start Time = <unknown>
End Time = <unknown>
failed before prolog: shepherd exited with exit status 7: before prolog
Shepherd trace:
11/11/2011 09:40:53 [340:29368]: shepherd called with uid = 0, euid = 340
11/11/2011 09:40:53 [340:29368]: rlogin_daemon = builtin
11/11/2011 09:40:53 [340:29368]: starting up 8.0.0c
11/11/2011 09:40:53 [340:29368]: setpgid(29368, 29368) returned 0
11/11/2011 09:40:53 [340:29368]: do_core_binding: explicit
11/11/2011 09:40:53 [340:29368]: do_core_binding: explicit: binding done
11/11/2011 09:40:53 [340:29368]: do_core_binding: finishing
11/11/2011 09:40:53 [340:29368]: no prolog script to start
11/11/2011 09:40:53 [340:29368]: pipe to child uses fds 5 and 6
11/11/2011 09:40:53 [340:29368]: calling fork_pty()
11/11/2011 09:40:53 [340:29368]: parent: forked "job" with pid 29369
11/11/2011 09:40:53 [340:29369]: child: closing parents end of the pipe
11/11/2011 09:40:53 [340:29369]: child: trying to read from parent through the pipe
11/11/2011 09:40:53 [340:29368]: parent: job-pid: 29369
11/11/2011 09:40:53 [340:29368]: parent: closing childs end of the pipe
11/11/2011 09:40:53 [340:29368]: csp = 0
11/11/2011 09:40:53 [340:29368]: parent: starting parent loop with remote_host = head, remote_port = 54885, job_owner = ***, fd_pty_master = 7, fd_pipe_in = -1, fd_pipe_out = -1, fd_pipe_err = -1, fd_pipe_to_child = 6
11/11/2011 09:40:53 [340:29368]: parent: opening connection to qrsh/qlogin client
11/11/2011 09:40:53 [340:29368]: parent: sending REGISTER_CTRL_MSG to qrsh/qlogin client
11/11/2011 09:40:53 [340:29368]: parent: creating pty_to_commlib thread
11/11/2011 09:40:53 [340:29368]: parent: creating commlib_to_pty thread
11/11/2011 09:40:53 [340:29368]: commlib_to_pty: received window size message, changing window size
11/11/2011 09:40:53 [340:29368]: pty_to_commlib: closing pipe to child
11/11/2011 09:40:53 [340:29369]: child: error communicating with parent: 1, Operation not permitted
11/11/2011 09:40:53 [340:29369]: no epilog script to start
11/11/2011 09:40:53 [0:29369]: could not write exit_status file
11/11/2011 09:40:53 [340:29369]: writing exit status to qrsh: 0
11/11/2011 09:40:53 [340:29369]: sending UNREGISTER_CTRL_MSG with exit_status = "0"
11/11/2011 09:40:53 [340:29369]: sending to host: <null>
11/11/2011 09:40:53 [340:29369]: comm_write_message returned: can't find handle
11/11/2011 09:40:53 [340:29369]: close_parent_loop: comm_write_message() returned 0 instead of 1!!!
11/11/2011 09:40:53 [340:29369]: waiting for UNREGISTER_RESPONSE_CTRL_MSG
11/11/2011 09:40:53 [340:29369]: No connection or problem while waiting for message: 1
11/11/2011 09:40:53 [340:29369]: parent: cl_com_ignore_timeouts
11/11/2011 09:40:53 [340:29369]: parent: error in comm_cleanup_lib(): 3
11/11/2011 09:40:53 [340:29369]: parent: leaving closinge_parent_loop()
11/11/2011 09:40:53 [340:29368]: commlib_to_pty: received settings message
11/11/2011 09:40:53 [340:29368]: commlib_to_pty: writing to child 11 bytes: noshell = 0
11/11/2011 09:40:53 [340:29368]: commlib_to_pty: error in communicating with child -> exiting
11/11/2011 09:40:53 [340:29368]: now sending signal KILL to pid 29369
11/11/2011 09:40:53 [0:29368]: pty_to_commlib: STDOUT was closed. Our child seems to have exited -> exiting
11/11/2011 09:40:53 [0:29368]: pty_to_commlib: STDERR was closed. Our child seems to have exited -> exiting
11/11/2011 09:40:53 [0:29368]: parent: created both worker threads, now waiting for jobs end
11/11/2011 09:40:53 [0:29368]: wait3 returned 29369 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
11/11/2011 09:40:53 [0:29368]: job exited with exit status 0
11/11/2011 09:40:53 [0:29368]: parent: wait_my_child returned exit_status = 0
11/11/2011 09:40:53 [0:29368]: parent: rusage.ru_stime.tv_sec = 0
11/11/2011 09:40:53 [0:29368]: parent: rusage.ru_stime.tv_usec = 2999
11/11/2011 09:40:53 [0:29368]: parent: rusage.ru_utime.tv_sec = 0
11/11/2011 09:40:53 [0:29368]: parent: rusage.ru_utime.tv_usec = 0
11/11/2011 09:40:53 [0:29368]: parent: received event 1000, g_raised_event = 2
11/11/2011 09:40:53 [0:29368]: parent: shutting down pty_to_commlib thread
11/11/2011 09:40:53 [0:29368]: parent: shutting down commlib_to_pty thread
11/11/2011 09:40:53 [0:29368]: parent: thread_cleanup_lib()
11/11/2011 09:40:53 [0:29368]: parent: leaving main loop. From here on, only the main thread is running.
11/11/2011 09:40:53 [340:29368]: reaped "job" with pid 29369
11/11/2011 09:40:53 [340:29368]: job exited not due to signal
11/11/2011 09:40:53 [340:29368]: job exited with status 0
11/11/2011 09:40:53 [340:29368]: ignored signal KILL to pid -29369
11/11/2011 09:40:53 [340:29368]: writing usage file to "usage"
11/11/2011 09:40:53 [340:29368]: no tasker to notify
11/11/2011 09:40:53 [340:29368]: no epilog script to start
11/11/2011 09:40:53 [0:29368]: could not write exit_status file
11/11/2011 09:40:53 [340:29368]: writing exit status to qrsh: 0
11/11/2011 09:40:53 [340:29368]: sending UNREGISTER_CTRL_MSG with exit_status = "0"
11/11/2011 09:40:53 [340:29368]: sending to host: head
11/11/2011 09:40:53 [340:29368]: waiting for UNREGISTER_RESPONSE_CTRL_MSG
11/11/2011 09:40:53 [340:29368]: Received UNREGISTER_RESPONSE_CTRL_MSG
11/11/2011 09:40:53 [340:29368]: parent: cl_com_ignore_timeouts
11/11/2011 09:40:54 [340:29368]: parent: leaving closinge_parent_loop()
Shepherd pe_hostfile:
node091 1 serial@node091 UNDEFINED
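The parent (pid 29368) and the forked child (pid 29369) write interleaved entries to the trace, which makes the ordering around the pipe failure hard to follow. The following is a minimal Python sketch for separating them, assuming every entry has the form `MM/DD/YYYY HH:MM:SS [uid:pid]: message` as seen in this trace. The names `split_trace` and `by_pid` are illustrative only and are not part of SGE.

```python
import re
from collections import defaultdict

# Assumed entry layout, inferred from the trace in this ticket:
#   MM/DD/YYYY HH:MM:SS [uid:pid]: message
ENTRY = re.compile(
    r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: "
)

def split_trace(text):
    """Split a (possibly flattened) trace into (timestamp, uid, pid, message) tuples."""
    matches = list(ENTRY.finditer(text))
    entries = []
    for i, m in enumerate(matches):
        # A message runs until the next timestamp prefix or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Collapse stray line breaks and repeated spaces inside the message.
        message = " ".join(text[m.end():end].split())
        entries.append((m.group(1), int(m.group(2)), int(m.group(3)), message))
    return entries

def by_pid(entries):
    """Group messages by the pid that wrote them, preserving their order."""
    grouped = defaultdict(list)
    for _timestamp, _uid, pid, message in entries:
        grouped[pid].append(message)
    return dict(grouped)

# Three entries copied from the trace, including the mid-message line break.
sample = (
    "11/11/2011 09:40:53 [340:29368]: pty_to_commlib: closing pipe to child "
    "11/11/2011 09:40:53 [340:29369]: child: error \n"
    "communicating with parent: 1, Operation not permitted "
    "11/11/2011 09:40:53 [340:29368]: commlib_to_pty: writing to child 11 bytes: noshell = 0"
)

entries = split_trace(sample)
grouped = by_pid(entries)
for pid, messages in grouped.items():
    print(pid)
    for message in messages:
        print("   ", message)
```

Any text that follows the final entry (such as the `Shepherd pe_hostfile:` section) ends up attached to the last message, so it is simplest to pass only the trace portion to `split_trace`.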