On Mon, 14 Dec 2020 08:40:52 -0800, Arthur Schwarz wrote:
It does a lot better than the previous version but there are still issues.
1: "line 1",,,"http://file.a/A%20Guide%20"
<4: http> http is wrong
2: "line 2",,,"https://www.whitgt.pdf"
same issue as 1:
3: "line 3, and xyz",,,"http://www.c/main.pdf"
same issue as 1: but note that the embedded ',' is treated correctly
4: "line 4 "" and abc",,,http://file.a/A%20Guide%20
embedded "" treated correctly and http: recognized correctly
5: line 5,,,https://www.whitgt.pdf
all recognized correctly
6: line 6,,,http://file.a/A%20Guide%20
all recognized correctly
errata:
1: All lines with a quoted string recognize an extra field
2: The last output of all lines is incorrectly formatted:
"> <5:" instead of "<5: >". This may be a programming
error, but I can't seem to locate it.
5: split($0, array) is uniformly incorrect.
According to the GNU Awk manual, FPAT is used as the regular
expression, and there are words to the effect that the resulting
split will be the same as in normal input processing. This seems
not to be the case.
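These changes against your original version work for me, or am I missing
something? The truncated <4: http> appears to come from len = length($1),
which should have been length($i); the gsub() calls avoid the length
arithmetic altogether. The FPAT change adds the missing + to the quoted
alternative and keeps unquoted fields from containing a quote. My output
is below.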
diff --git a/code.awk b/code.awk
--- a/code.awk
+++ b/code.awk
@@ -2,7 +2,7 @@
BEGIN { # program constants
FS = "~"
- FPAT = "([^,]*)|(\"([^\"]|\"\")\")" # CSV field separator # FPAT = /([^,]*)|("([^"]|"")")/ # CSV field separator
+ FPAT = "(\"([^\"]|\"\")+\"|[^,\"]*)" # CSV field separator # FPAT = /([^,]*)|("([^"]|"")")/ # CSV field separator
print "FPAT = ", FPAT;
} # BEGIN
{
@@ -10,17 +10,15 @@ BEGIN { # program constants
print $0;
printf("%3d: \n", NF);
for (i = 1; i <= NF; i++) {
- if (substr($i, 1, 1) == "\"") {
- len = length($1)
- $i = substr($i, 2, len - 2);
- }
+ gsub(/(^"|"$)/,"",$i) # feasible given the knowledge of tokens matching FPAT
+ gsub(/""/,"\"",$i) # same
printf(" <%d: %s>\n", i, $i);
}
print " ";
print "------------------- split ----------------------\n"
- split($0, array);
+ narray=split($0, array); # cosmetic change
printf(" NF ndx array\n");
- for (ndx = 1; ndx <= length(array); ndx++) {
+ for (ndx = 1; ndx <= narray; ndx++) {
printf("%3d %3d %27s\n", NF, ndx, array[ndx]);
}
I hope this is not some kind of homework. For a newbie to gawk, not bad at all. ;)
HTH.
OUTPUT:
FPAT = ("([^"]|"")+"|[^,"]*)
------------------------------------------------
"PDQ",,,
4:
<1: PDQ>
<2: >
<3: >
<4: >
------------------- split ----------------------
NF ndx array
4 1 PDQ
------------------------------------------------
"line 1",,,"http://file.a/A%20Guide%20"
4:
<1: line 1>
<2: >
<3: >
<4: http://file.a/A%20Guide%20>
------------------- split ----------------------
NF ndx array
4 1 line 1 http://file.a/A%20Guide%20
------------------------------------------------
"line 2",,,"https://www.whitgt.pdf"
4:
<1: line 2>
<2: >
<3: >
<4: https://www.whitgt.pdf>
------------------- split ----------------------
NF ndx array
4 1 line 2 https://www.whitgt.pdf
------------------------------------------------
"line 3, and xyz",,,"http://www.c/main.pdf"
4:
<1: line 3, and xyz>
<2: >
<3: >
<4: http://www.c/main.pdf>
------------------- split ----------------------
NF ndx array
4 1 line 3, and xyz http://www.c/main.pdf
------------------------------------------------
"line 4 "" and abc",,,http://file.a/A%20Guide%20 line
5,,,https://www.whitgt.pdf line 6,,,http://file.a/A%20Guide%20
10:
<1: line 4 " and abc>
<2: >
<3: >
<4: http://file.a/A%20Guide%20 line 5>
<5: >
<6: >
<7: https://www.whitgt.pdf line 6>
<8: >
<9: >
<10: http://file.a/A%20Guide%20>
------------------- split ----------------------
NF ndx array
10 1 line 4 " and abc http://file.a/A%20Guide%20 line 5 https://www.whitgt.pdf line 6 http://file.a/A%20Guide%20