From: Nazgul
Subject: Re: GNU Parallel Bug Reports Feature request: if a SSH node goes down, retry on other nodes
Date: Tue, 15 Dec 2015 11:23:38 +1100

On Sun, Dec 13, 2015 at 2:48 PM, Nazgul <address@hidden> wrote:
> On 14 December 2015 at 00:20, Ole Tange <address@hidden> wrote:
>>
>> On Thu, Dec 10, 2015 at 11:17 PM, Nazgul <address@hidden> wrote:
>>
>> > I am using GNU Parallel with --sshlogin on unreliable nodes - that is,
>> > some
>> > of them become unreachable after an unpredictable amount of time.
:
>> > It would be nice to have a feature so that, instead, remaining threads
>> > are
>> > sent to the machines that are still available.
>>
>> Did you try --retries and --filterhosts?
:
> It seems --filter-hosts is a good candidate. However I have two doubts:
>
> Is this a check performed before the distributed executions or is this a
> policy active throughout the whole life-time of the Parallel process? This
> makes a difference if the node fails after the check.
> If a node fails while executing a command, is that command re-executed on a
> still active node?

From the man page:

    --retries n
        If a job fails, retry it on another computer on
        which it has not failed. Do this n times. If
        there are fewer than n computers in --sshlogin
        GNU parallel will re-use all the computers.
        This is useful if some jobs fail for no
        apparent reason (such as network failure).

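A rough sketch of how that option might be used. The host names, the
file name nodes.txt, the function name run_jobs and the job command
are made up for illustration; only --retries and --sshloginfile come
from the discussion above.

```shell
# Hypothetical node list, one sshlogin per line.
printf '%s\n' 'user@node1' 'user@node2' 'user@node3' > nodes.txt

# Wrapped in a function so nothing is sent to the nodes until you call it.
run_jobs() {
    # Retry a failed job on other hosts, as described for --retries above.
    # The job itself (hostname + echo) is only a placeholder.
    parallel --retries 3 --sshloginfile nodes.txt 'hostname; echo {}' ::: "$@"
}

# Example call (needs GNU parallel and reachable nodes):
#   run_jobs a b c
```
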
It is fairly expensive to filter hosts, so it is only done when the
file given to --sshloginfile changes. You can force a new filtering
run by touching that file whenever you want one.
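
One way this could look, as a sketch only: the file name nodes.txt,
the helper names and the 60-second interval are assumptions, not
anything GNU parallel prescribes.

```shell
# Hypothetical file name; use whatever you pass to --sshloginfile.
SLF=nodes.txt

# Bump the file's modification time without changing its contents.
refresh_hosts() {
    touch "$SLF"
}

# Assumed usage: refresh once a minute while parallel is running, e.g.
#   keep_refreshing &
#   parallel --filter-hosts --retries 3 --sshloginfile "$SLF" ...
keep_refreshing() {
    while sleep 60; do
        refresh_hosts
    done
}

# Single refresh; creates the file if it does not exist yet.
refresh_hosts
```
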
/Ole