[GE users] switched to 6.1 and many problems

Yifan Zhang yifanz at softsound.com
Thu Aug 9 11:38:19 BST 2007



Hi,

Thanks for confirming that the job is being run through qrsh even though 
it was submitted with qsub. I checked qsub; it is not linked or aliased 
to qrsh.
Here is what is in my sge_request:
-cwd -e . -o . -V -S /bin/sh

Is there any way, from inside the script, to check whether my job was 
started by qsub or by qrsh?
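
One possible way to answer the question above is to look at the job's environment. The sketch below is only a guess at a workable check, not something confirmed in this thread: it assumes that qrsh jobs carry a QRSH_PORT variable of the form host:port and that batch jobs have ENVIRONMENT=BATCH. Both assumptions should be verified on your own cluster by dumping the environment from a test job. The function name submission_kind is made up for illustration.

```python
import os


def submission_kind(env=None):
    """Guess how the current job was started, based on its environment.

    Assumptions to verify locally: qrsh jobs have QRSH_PORT set
    (host:port), and qsub batch jobs have ENVIRONMENT=BATCH.
    """
    env = os.environ if env is None else env
    # Check QRSH_PORT first, in case ENVIRONMENT is also set for qrsh jobs.
    if env.get("QRSH_PORT"):
        return "qrsh"
    if env.get("ENVIRONMENT") == "BATCH":
        return "qsub"
    return "unknown"


if __name__ == "__main__":
    # Called from inside a job script, this prints the guess for that job.
    print(submission_kind())
```

Note that with -V in sge_request the submitting shell's environment is exported to the job, so a job submitted from inside another job could inherit misleading values.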

Thanks

Harald Pollinger wrote:

> Hi,
>
> this is the trace file of a qrsh job. Are you sure there is no 
> link/alias/wrapper and the "qrsh" binary was not renamed or copied to 
> "qsub"?
> "qsub -help" prints what it really is.
>
>
> > 08/08/2007 16:45:40 [0:15106]: error connecting stream socket: No
> > route to host
>
> This is the reason why the job fails; all later error messages are 
> only follow-on errors. "No route to host" is a system error returned 
> by the connect(3SOCKET) call (it's errno 148): the execution host 
> really can't connect to "munge.softsound.com".
>
> Regards,
> Harald
>
>
> Yifan Zhang wrote:
>
>> Hi,
>>
>> Thanks for helping. ignore_fqdn was set to true. The funny thing 
>> is that the failure "can't open file /tmp/13851.1.real.q/pid: No such 
>> file" does not actually stop the job from finishing. I don't really 
>> know the details of GE: does every qsub job start as a qlogin job?
>> Also, we have Linux systems in different flavours; could that cause a 
>> problem? Say the submit host is FC and the execution host is SuSE.
>>
>> Thanks
>>
>> Chansup Byun wrote:
>>
>>> Reuti wrote:
>>>
>>>> Am 08.08.2007 um 17:59 schrieb Yifan Zhang:
>>>>
>>>>> We were using SGE 5.3, and it has been doing its job for years. Now 
>>>>> we have decided to switch to GE 6.1. At the beginning it worked, and 
>>>>> then, for no apparent reason, every host gets set to error 
>>>>> automatically. The job finishes fine. It is a qsub job, not qrsh. 
>>>>> Please help if you know what is causing this. A typical error 
>>>>> message is as follows:
>>>>>
>>>>> Job 13851 caused action: Queue "real.q@hal00.softsound.com" set to 
>>>>> ERROR
>>>>> User        = ajr
>>>>> Queue       = real.q@hal00.softsound.com
>>>>> Host        = hal00.softsound.com
>>>>> Start Time  = <unknown>
>>>>> End Time    = <unknown>
>>>>> failed before job:08/08/2007 16:45:40 [0:15106]: can't open file 
>>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>>> Shepherd trace:
>>>>> 08/08/2007 16:45:37 [16384:15105]: shepherd called with uid = 0, 
>>>>> euid = 16384
>>>>> 08/08/2007 16:45:37 [16384:15105]: setpgid(15105, 15105) returned 0
>>>>> 08/08/2007 16:45:37 [16384:15105]: no prolog script to start
>>>>> 08/08/2007 16:45:37 [16384:15105]: forked "job" with pid 15106
>>>>> 08/08/2007 16:45:37 [16384:15106]: processing qlogin job
>>>>> 08/08/2007 16:45:37 [16384:15106]: pid=15106 pgrp=15106 sid=15106 
>>>>> old pgrp=15105 getlogin()=<no login set>
>>>>> 08/08/2007 16:45:37 [16384:15105]: child: job - pid: 15106
>>>>> 08/08/2007 16:45:37 [16384:15106]: reading passwd information for 
>>>>> user 'root'
>>>>> 08/08/2007 16:45:37 [16384:15106]: setting limits
>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_CPU setting: (soft 
>>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>>> 18446744073709551615 hard 18446744073709551615)
>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_FSIZE setting: (soft 
>>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>>> 18446744073709551615 hard 18446744073709551615)
>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_DATA setting: (soft 
>>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>>> 18446744073709551615 hard 18446744073709551615)
>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_STACK setting: (soft 
>>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>>> 18446744073709551615 hard 18446744073709551615)
>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_CORE setting: (soft 
>>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>>> 18446744073709551615 hard 18446744073709551615)
>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_VMEM/RLIMIT_AS setting: 
>>>>> (soft 18446744073709551615 hard 18446744073709551615) resulting: 
>>>>> (soft 18446744073709551615 hard 18446744073709551615)
>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_RSS setting: (soft 
>>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>>> 18446744073709551615 hard 18446744073709551615)
>>>>> 08/08/2007 16:45:37 [16384:15106]: setting environment
>>>>> 08/08/2007 16:45:37 [16384:15106]: Initializing error file
>>>>> 08/08/2007 16:45:37 [16384:15106]: switching to 
>>>>> intermediate/target user
>>>>> 08/08/2007 16:45:37 [16519:15106]: closing all filedescriptors
>>>>> 08/08/2007 16:45:37 [16519:15106]: further messages are in "error" 
>>>>> and "trace"
>>>>> 08/08/2007 16:45:37 [0:15106]: now running with uid=0, euid=0
>>>>> 08/08/2007 16:45:37 [0:15106]: start qlogin
>>>>> 08/08/2007 16:45:37 [0:15106]: calling 
>>>>> qlogin_starter(/home/sound0/softsound/GridEngine/ssqueue/spool/hal00/active_jobs/13851.1, 
>>>>> /usr/local/softsound/GridEngine/utilbin/lx24-amd64/rshd -l);
>>>>> 08/08/2007 16:45:37 [0:15106]: uid = 0, euid = 0, gid = 0, egid = 0
>>>>> 08/08/2007 16:45:37 [0:15106]: using sfd 1
>>>>> 08/08/2007 16:45:37 [0:15106]: bound to port 52903
>>>>> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - data = 
>>>>> 0:52903:/usr/local/softsound/GridEngine/utilbin/lx24-amd64:/home/sound0/softsound/GridEngine/ssqueue/spool/hal00/active_jobs/13851.1:hal00.softsound.com 
>>>>>
>>>>> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - address = 
>>>>> munge.softsound.com:33046
>>>>> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - host = 
>>>>> munge.softsound.com, port = 33046
>>>>
>>>>
>>>>
>>>> Your qmaster is hal00 and you want to access munge.softsound.com? 
>>>> Something special with munge, is it showing up in qhost? Has this 
>>>> node more than one name and/or network card?
>>>
>>>
>>>
>>> hal00 should be an execution host.
>>> munge.softsound.com should be a submit host.
>>>
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> 08/08/2007 16:45:40 [0:15106]: error connecting stream socket: No 
>>>>> route to host
>>>>> 08/08/2007 16:45:40 [0:15106]: communication with qrsh failed
>>>>> 08/08/2007 16:45:40 [0:15106]: forked "job" with pid 0
>>>>> 08/08/2007 16:45:40 [0:15106]: child: job - pid: 0
>>>>> 08/08/2007 16:45:40 [0:15106]: wait3 returned -1
>>>>> 08/08/2007 16:45:40 [0:15106]: can't open file 
>>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>>> 08/08/2007 16:45:40 [0:15106]: write_to_qrsh - data = 1:can't open 
>>>>> file /tmp/13851.1.real.q/pid: No such file or directory
>>>>> 08/08/2007 16:45:40 [0:15106]: write_to_qrsh - address = 
>>>>> munge.softsound.com
>>>>> 08/08/2007 16:45:40 [0:15106]: illegal value for 
>>>>> qrsh_control_port: "munge.softsound.com". Should be host:port
>>>>
>>>>
>>>
>>> How is the ignore_fqdn set?
>>> The FQDN hostname may have confused SGE shepherd?
>>>
>>> Regards,
>>>
>>> - Chansup
>>>
>>>
>>>>> 08/08/2007 16:45:40 [16384:15105]: wait3 returned 15106 (status: 
>>>>> 2816; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)
>>>>> 08/08/2007 16:45:40 [16384:15105]: job exited with exit status 11
>>>>> 08/08/2007 16:45:40 [16384:15105]: reaped "job" with pid 15106
>>>>> 08/08/2007 16:45:40 [16384:15105]: job exited not due to signal
>>>>> 08/08/2007 16:45:40 [16384:15105]: job exited with status 11
>>>>> 08/08/2007 16:45:40 [0:15105]: can't open file 
>>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>>> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - data = 1:can't open 
>>>>> file /tmp/13851.1.real.q/pid: No such file or directory
>>>>> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - address = 
>>>>> munge.softsound.com:33046
>>>>> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - host = 
>>>>> munge.softsound.com, port = 33046
>>>>> 08/08/2007 16:45:43 [0:15105]: error connecting stream socket: No 
>>>>> route to host
>>>>>
>>>>> Shepherd error:
>>>>> 08/08/2007 16:45:40 [0:15106]: can't open file 
>>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>>> 08/08/2007 16:45:40 [0:15105]: can't open file 
>>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>>>
>>>>>
>>>>> Thank you very much!
>>>>>
>>>>> -- 
>>>>> /*
>>>>> * Yifan Zhang
>>>>> *
>>>>> * Softsound
>>>>> * +44 (0)1223 448 097
>>>>> */
>>>>>
>>>>> "Autonomy's search technology is becoming a de facto standard for 
>>>>> companies" - FT, 4th July 07
>>>>>
>>>>> Cream is a configuration of the Vim text editor that consists of a 
>>>>> set of scripts which can be run within Vim to make it behave more 
>>>>> like an editor now common to most personal computers which conform 
>>>>> to the Common User Access standards of interface and operability.
>>>>>
>>>>> The information contained in this message is for the intended 
>>>>> addressee only and may contain confidential and/or privileged  
>>>>> information.  If you are not the intended addressee, please delete 
>>>>> this message and notify the sender, and do not copy or distribute 
>>>>> this message or disclose its contents to anyone.  Any views or 
>>>>> opinions expressed in this message are those of the author and do 
>>>>> not necessarily represent those of Autonomy Systems Limited or of 
>>>>> any of its associated companies.  No reliance may be placed on 
>>>>> this message without written confirmation from an authorised 
>>>>> representative of the company.  Autonomy Systems Limited, 
>>>>> Registered Office:  Cambridge Business Park, Cowley Road, 
>>>>> Cambridge CB3 0WZ, Registered Number 03063054.
>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
>
>
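
Harald's diagnosis quoted above (the execution host cannot connect back to the submit host) can be checked independently of Grid Engine. The following is a minimal sketch, assuming Python is available on the execution host. The function name check_connect is made up, and port 22 in the usage line is only an assumption (any port known to be listening on the submit host will do). The port 33046 seen in the trace was presumably open only while that qrsh client was running. The sketch reports the symbolic errno name because the numeric values differ between operating systems.

```python
import errno
import socket


def check_connect(host, port, timeout=3.0):
    """Attempt a TCP connection; return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Name resolution failures and timeouts are OSError subclasses too;
        # anything without a known errno is reported as "unknown".
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, "%s: %s" % (name, exc)


if __name__ == "__main__":
    # Run on the execution host (hal00); port 22 is an assumed open port.
    print(check_connect("munge.softsound.com", 22))
```

If this reports EHOSTUNREACH while the host name resolves correctly, routing or a packet filter between the two hosts is a likely place to look.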


-- 
/*
 * Yifan Zhang
 *
 * Softsound
 * +44 (0)1223 448 097
 */




