bug-gawk


From: Ed Morton
Subject: Re: gawk -i inplace is an order of magnitude faster when also redirecting stdout
Date: Thu, 29 Feb 2024 08:48:05 -0600
User-agent: Mozilla Thunderbird

No problem. Trying again to post the strace output, as it got mangled in transit last time:

The SE answer I linked, https://unix.stackexchange.com/a/771263/133219, shows strace being used on gawk with a 10-line input file. There are 10 writes (the same as the number of input lines) when it is run without redirection (see the "calls" column below):

   $ strace -e trace=write -c gawk -i inplace 1 somefile
   % time     seconds  usecs/call     calls    errors syscall
   ------ ----------- ----------- --------- --------- ----------------
   100.00    0.000098           9        10           write
   ------ ----------- ----------- --------- --------- ----------------
   100.00    0.000098           9        10           total

vs 1 write when used with redirection:

   $ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
   % time     seconds  usecs/call     calls    errors syscall
   ------ ----------- ----------- --------- --------- ----------------
   100.00    0.000020          20         1           write
   ------ ----------- ----------- --------- --------- ----------------
   100.00    0.000020          20         1           total
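
To illustrate why buffering alone could account for the 10-vs-1 write counts above, here is a minimal Python sketch. It is a model only, not gawk's code: `CountingRaw` and `count_writes` are hypothetical names made up for this example, and the assumption (which is what the thread suspects, not something confirmed here) is that one run flushes after every line while the other fills a block buffer first.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for a file descriptor; counts write() calls."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def writable(self):
        return True

    def write(self, data):
        # Each call here corresponds to one write(2) in the strace output.
        self.calls += 1
        return len(data)


def count_writes(num_lines, line_buffered):
    """Write num_lines short lines and return how many raw writes occurred."""
    raw = CountingRaw()
    stream = io.TextIOWrapper(
        io.BufferedWriter(raw, buffer_size=8192),
        encoding="ascii",
        line_buffering=line_buffered,  # True: flush after every newline
    )
    for i in range(1, num_lines + 1):
        stream.write(f"{i}\n")
    stream.flush()  # push out anything still sitting in the buffers
    return raw.calls


if __name__ == "__main__":
    print("line-buffered: ", count_writes(10, True))
    print("fully buffered:", count_writes(10, False))
```

With 10 short lines this should report 10 raw writes in the line-buffered case and 1 in the fully buffered case, matching the "calls" column in the two strace summaries.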



On 2/29/2024 8:47 AM, david kerns wrote:
sorry for doubting your due diligence

On Thu, Feb 29, 2024 at 7:44 AM Ed Morton <mortoneccc@comcast.net> wrote:

    Yes, I tried the same with `sed` and there was no performance
    difference between:

    No redirection:

        $ time { sed -i -n 'p' file; }

        real    0m0.027s
        user    0m0.000s
        sys     0m0.000s

    Redirection:

        $ time { sed -i -n 'p' file >/dev/null; }

        real    0m0.023s
        user    0m0.000s
        sys     0m0.000s

    The SE answer I linked,
    https://unix.stackexchange.com/a/771263/133219, shows strace being
    used on gawk with a 10-line input file and there being 10 writes
    (same as number of input lines) when used without redirection
    (look at the "calls" column below):

        $ strace -e trace=write -c gawk -i inplace 1 somefile
        % time     seconds  usecs/call     calls    errors syscall
        ------ ----------- ----------- --------- --------- ----------------
        100.00    0.000098           9        10           write
        ------ ----------- ----------- --------- --------- ----------------
        100.00    0.000098           9        10           total

    vs 1 write when used with redirection:

        $ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
        % time     seconds  usecs/call     calls    errors syscall
        ------ ----------- ----------- --------- --------- ----------------
        100.00    0.000020          20         1           write
        ------ ----------- ----------- --------- --------- ----------------
        100.00    0.000020          20         1           total

    so buffering does seem likely to be the source of the time difference.

    Regards,

        Ed.

    On 2/29/2024 8:32 AM, david kerns wrote:
    glad you checked that...
    have you tried other commands? ... perhaps the closing of stdout by the
    shell before the fork/exec is causing it.

    On Thu, Feb 29, 2024 at 6:57 AM Ed Morton <mortoneccc@comcast.net> wrote:

    David - that was 3rd-run timing to ensure caching wasn't the issue.

         Ed.

    On 2/29/2024 7:35 AM, david kerns wrote:

    Swap the order (do the redirected one first). I suspect the input file was
    still cached for the 2nd run.


    On Thu, Feb 29, 2024 at 5:52 AM Ed Morton <mortoneccc@comcast.net> wrote:


    Someone on StackExchange was asking about their gawk script being slow,
    and someone else (https://unix.stackexchange.com/a/771263/133219)
    pointed out that using `-i inplace` is an order of magnitude slower if
    you don't also redirect stdout, which seems unintuitive at best.

    For example given a 1 million line input file created by:

         $ seq 1000000 > file1m

    and using:

         $ awk --version
         GNU Awk 5.3.0, API 4.0, PMA Avon 8-g1, (GNU MPFR 4.2.1, GNU MP 6.3.0)

    If we just reproduce it as-is using `-i inplace` the timing is:

         $ time { awk -i inplace '1' file1m; }

         real    0m2.544s
         user    0m0.265s
         sys     0m1.843s

    whereas if we redirect stdout even though there is no stdout produced:

         $ time { awk -i inplace '1' file1m >/dev/null; }

         real    0m0.236s
         user    0m0.187s
         sys     0m0.000s

    As you can see, the second execution, with stdout redirected, ran an order
    of magnitude faster. The person who investigated thinks this is because the
    first execution is considered "interactive", since stdout isn't
    technically being redirected, and so uses line buffering, whereas the second
    execution is considered "non-interactive", since stdout is redirected, and so
    uses a larger buffer.

    If that is the case, could gawk be updated to consider "inplace" editing
    as non-interactive? If not, I think it'd be worth a statement in the
    manual about this difference in performance between the two.

          Ed.








