Re: FPAT is not working as expected

bug-gawk

From:	Arthur Schwarz
Subject:	Re: FPAT is not working as expected
Date:	Mon, 14 Dec 2020 10:39:50 -0800
User-agent:	Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.5.1

Thanks Jannick;

Well no. This isn't a homework assignment. Not by a long, long shot (time). It is a newbie issue though. In my most demented moment I decided to learn a little bit more about gawk than is humanly advised. I have a website (https:/slipbits.com) with many pages that share the same format, well, almost the same. So I built a web page generator in gawk. For me, a hefty load since, being retired, there is no way to have an effective dialog with colleagues other than by email. It worked. And then the LibreOffice csv output generation changed, which brought me to the current issues. I decided to look through the Gnu Gawk manual to see what would be the best way to proceed. And in Section 4.7 and Section 4.7.1 I came across FPAT which seemed ideal until it wasn't. Rather than create a new, more perfect world, I decided to copy the program in the manual as exactly as I could. What you see is basically what was written.


And now to your code:

1:    Removing  $i = substr($i, 2, len - 2); seems to have fixed the issue
        of not correctly identifying the http:... URL. I don't understand
        why this should happen, but all the URLs are correct.

2:    In all cases in which there is a quoted string, an extra (empty) field
        is found. This is consistent with previous results. I think that
        this only applies to the last quoted string, not the first.

3:    Embedded quotes ("") and commas (,) are correctly handled.
        This is consistent with my last email.

4:    Split does not work correctly (thanks for the narray=...).
        It looks like a space, " ", is treated as a delimiter.

5:    Substituting "patsplit" for "split" yields uniformly incorrect results.

        No line with a quoted string is output correctly. "" and , are
            treated as delimiters. All lines without "" and , are output
            as a single field. Lines with "" or , are output as two fields.
        All lines without a quoted string are output correctly.
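
A possible reason for item 1, going by the lines the diff removes: the removed code took len from length($1) rather than length($i), so every quoted field was cut to the length of the first field. The sketch below shows only that effect. The strip_demo name and the '|' separator are stand-ins chosen so the example runs in any awk without FPAT; they are not part of the original program.

```shell
# Hypothetical helper, not from the thread: compares the two ways of
# computing the substring length when stripping the surrounding quotes.
strip_demo() {
    awk -F'|' '{
        for (i = 1; i <= NF; i++) {
            wrong = substr($i, 2, length($1) - 2)   # length taken from field 1
            right = substr($i, 2, length($i) - 2)   # length taken from field i
            printf("%d: len from $1 -> <%s>, len from $i -> <%s>\n", i, wrong, right)
        }
    }'
}

printf '%s\n' '"line 1"|"http://file.a/A%20Guide%20"' | strip_demo
```

For the second field, the first form yields `http:/` and the second yields the whole URL.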

In summary (thanks again for your narray). Your changes fixed all
the line processing output issues except for the addition of a
null extra field and that damnable formatting issue on the last
output ("> <5:"). split() and patsplit() do not work as expected.

What I don't understand:

My environment: Win 7-64, cygcheck (cygwin) 3.1.7

Your output does not show an extra field when the last field in
the input is quoted. My output does show an extra field.

Your output of the last field in record splitting shows correct
output; my version shows that damnable "> <5:".
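
One hypothesis that would account for both differences, untested here and not confirmed anywhere in this thread: the CSV file may have CRLF line endings. With the default RS each record would then end in a carriage return, which the [^,"]* branch of FPAT can match. After a quoted last field that could plausibly surface as one extra field, and printing a carriage return between < and > would make a terminal draw the closing > at the start of the line. The sketch below removes a trailing carriage return from each record; the strip_cr name is made up for the illustration.

```shell
# Hypothetical helper, not from the thread: drops one trailing
# carriage return from each record before printing it.
strip_cr() {
    awk '{ sub(/\r$/, ""); print }'
}

printf '%s\r\n' 'line 5,,,https://www.whitgt.pdf' | strip_cr
```

If the hypothesis holds, placing the same sub(/\r$/, "") as the first statement of the main rule would be one way to test it, since changing $0 makes awk split the fields again.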

Both your output of split() and my output of split() are the same
which indicates a complete lack of understanding on my part or
that split() does not work as advertised.
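
A possible explanation for the split() and patsplit() results, based on the documented awk rule that changing any field makes awk rebuild $0 from the fields joined by OFS (a single space by default): once the loop has stripped the quotes from $i, $0 no longer contains the original commas, so split($0, array) and patsplit($0, array) work on a different string from the one that was read. Lines without quoted fields would not be modified by the loop, which fits their correct output. Separately, the gawk manual documents that split() without a third argument uses FS, while patsplit() without one uses FPAT. The sketch below shows only the rebuild. The rebuild_demo name and the '|' separator are stand-ins chosen so the example runs in any awk; they are not part of the original program.

```shell
# Hypothetical helper, not from the thread: shows $0 before and after
# the quotes are stripped from the individual fields.
rebuild_demo() {
    awk -F'|' '{
        before = $0                  # copy of the record as it was read
        for (i = 1; i <= NF; i++)
            gsub(/^"|"$/, "", $i)    # a change to $i rebuilds $0 using OFS
        printf("before: %s\nafter:  %s\n", before, $0)
    }'
}

printf '%s\n' '"line 3, and xyz"|||"http://www.c/main.pdf"' | rebuild_demo
```

The "after" line has spaces where the separators were. If this is what happens in the original program, saving $0 in a variable before the loop and splitting that copy would be one way to check.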

Thanks. I assure you that this is not a class project and (incidentally)
I am not a student.

art

On 12/14/2020 9:18 AM, Jannick wrote:

On Mon, 14 Dec 2020 08:40:52 -0800, Arthur Schwarz wrote:

It does a lot better than the previous version but there are still issues.

1:    "line 1",,,"http://file.a/A%20Guide%20";
          <4: http>       http is wrong

2:    "line 2",,,"https://www.whitgt.pdf";
          same issue as 1:

3:    "line 3, and xyz",,,"http://www.c/main.pdf";
          same issue as 1: but note that the embedded ',' is treated correctly

4:    "line 4 "" and abc",,,http://file.a/A%20Guide%20
          embedded "" treated correctly and http: recognized correctly

5:    line 5,,,https://www.whitgt.pdf
          all recognized correctly

6:    line 6,,,http://file.a/A%20Guide%20
          all recognized correctly

errata:
1:    All lines with a quoted string recognize an extra field
2:    The last output of all lines is incorrectly formatted:
          ">  <5:" instead of "<5: >"  this may be a programming
          error but I can't seem to locate it.
5:    split($0, array) is uniformly incorrect.
          From the Gnu Awk manual FPAT is used as the regular expression
          and there are words to the effect that the resultant split will be
          the same as in normal input processing. This seems not to be
          the case.

These changes against your original version work for me - or am I missing 
something? My output far below.

diff --git a/code.awk b/code.awk
--- a/code.awk
+++ b/code.awk
@@ -2,7 +2,7 @@

BEGIN { # program constants

          FS           = "~"
-        FPAT         = "([^,]*)|(\"([^\"]|\"\")\")" # CSV field separator #       FPAT          = /([^,]*)|("([^"]|"")")/      # CSV field separator
+        FPAT         = "(\"([^\"]|\"\")+\"|[^,\"]*)" # CSV field separator #       FPAT          = /([^,]*)|("([^"]|"")")/      # CSV field separator
          print "FPAT = ", FPAT;
  } # BEGIN
  {
@@ -10,17 +10,15 @@ BEGIN {                                         # program constants
         print $0;
         printf("%3d:   \n", NF);
         for (i = 1; i <= NF; i++) {
-          if (substr($i, 1, 1) == "\"") {
-             len = length($1)
-             $i = substr($i, 2, len - 2);
-          }
+          gsub(/(^"|"$)/,"",$i) # feasible given the knowledge of tokens matching FPAT
+          gsub(/""/,"\"",$i) # same
            printf("      <%d: %s>\n", i, $i);
         }
         print " ";
         print  "------------------- split ----------------------\n"
-       split($0, array);
+       narray=split($0, array); # cosmetic change
         printf(" NF ndx            array\n");
-       for (ndx = 1; ndx <= length(array); ndx++) {
+       for (ndx = 1; ndx <= narray; ndx++) {
            printf("%3d %3d %27s\n", NF, ndx, array[ndx]);
         }

Hoping this is not kind of homework. For a newbie to gawk not bad at all. ;)


HTH.



OUTPUT:

FPAT =  ("([^"]|"")+"|[^,"]*)
------------------------------------------------

"PDQ",,,
   4:
       <1: PDQ>
       <2: >
       <3: >
       <4: >