[gawk-diffs] [SCM] gawk branch, master, updated. 3c09996d7efa635947c357e
From: Arnold Robbins
Subject: [gawk-diffs] [SCM] gawk branch, master, updated. 3c09996d7efa635947c357efb3ccc5ed05b1ea31
Date: Fri, 24 Aug 2012 11:45:27 +0000
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "gawk".
The branch, master has been updated
via 3c09996d7efa635947c357efb3ccc5ed05b1ea31 (commit)
from a8ffb47faf32e7f065bfca5ffeee20cca85f6195 (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
- Log -----------------------------------------------------------------
http://git.sv.gnu.org/cgit/gawk.git/commit/?id=3c09996d7efa635947c357efb3ccc5ed05b1ea31
commit 3c09996d7efa635947c357efb3ccc5ed05b1ea31
Author: Arnold D. Robbins <address@hidden>
Date: Fri Aug 24 14:45:12 2012 +0300
Rearrange chapters in gawk doc some.
diff --git a/doc/ChangeLog b/doc/ChangeLog
index 5372f7d..b112c72 100644
--- a/doc/ChangeLog
+++ b/doc/ChangeLog
@@ -2,6 +2,8 @@
* gawk.texi: Emphasize more that floating point behavior is
not a language issue. Add a pointer to POSIX bc.
+ Move arithmetic chapter to later in the book, before chapter
+ on dynamic extensions.
2012-08-17 Arnold D. Robbins <address@hidden>
diff --git a/doc/gawk.info b/doc/gawk.info
index 18e455c..4c10ab1 100644
--- a/doc/gawk.info
+++ b/doc/gawk.info
@@ -875,6 +875,9 @@ real problems.
*note Debugger::, describes the `awk' debugger.
+ *note Arbitrary Precision Arithmetic::, describes advanced
+arithmetic facilities provided by `gawk'.
+
*note Language History::, describes how the `awk' language has
evolved since its first release to present. It also describes how
`gawk' has acquired features over time.
@@ -13757,7733 +13760,7733 @@ writing, the latest version of GNU `gettext' is version 0.18.1
usage messages, warnings, and fatal errors in the local language.
-File: gawk.info, Node: Arbitrary Precision Arithmetic, Next: Advanced Features, Prev: Internationalization, Up: Top
+File: gawk.info, Node: Advanced Features, Next: Library Functions, Prev: Arbitrary Precision Arithmetic, Up: Top
-11 Arithmetic and Arbitrary Precision Arithmetic with `gawk'
-************************************************************
+11 Advanced Features of `gawk'
+******************************
- There's a credibility gap: We don't know how much of the
- computer's answers to believe. Novice computer users solve this
- problem by implicitly trusting in the computer as an infallible
- authority; they tend to believe that all digits of a printed
- answer are significant. Disillusioned computer users have just the
- opposite approach; they are constantly afraid that their answers
- are almost meaningless.
- Donald Knuth(1)
+ Write documentation as if whoever reads it is a violent psychopath
+ who knows where you live.
+ Steve English, as quoted by Peter Langston
- This major node discusses issues that you may encounter when
-performing arithmetic. It begins by discussing some of the general
-atributes of computer arithmetic, along with how this can influence
-what you see when running `awk' programs. This discussion applies to
-all versions of `awk'.
+ This major node discusses advanced features in `gawk'. It's a bit
+of a "grab bag" of items that are otherwise unrelated to each other.
+First, a command-line option allows `gawk' to recognize nondecimal
+numbers in input data, not just in `awk' programs. Then, `gawk''s
+special features for sorting arrays are presented. Next, two-way I/O,
+discussed briefly in earlier parts of this Info file, is described in
+full detail, along with the basics of TCP/IP networking. Finally,
+`gawk' can "profile" an `awk' program, making it possible to tune it
+for performance.
- Then the discussion moves on to "arbitrary precsion arithmetic", a
-feature which is specific to `gawk'.
+ *note Dynamic Extensions::, discusses the ability to dynamically add
+new built-in functions to `gawk'. As this feature is still immature
+and likely to change, its description is relegated to an appendix.
* Menu:
-* General Arithmetic:: An introduction to computer arithmetic.
-* Floating-point Programming:: Effective Floating-point Programming.
-* Gawk and MPFR:: How `gawk' provides
- aribitrary-precision arithmetic.
-* Arbitrary Precision Floats:: Arbitrary Precision Floating-point Arithmetic
- with `gawk'.
-* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic with
- `gawk'.
+* Nondecimal Data:: Allowing nondecimal input data.
+* Array Sorting:: Facilities for controlling array traversal and
+ sorting arrays.
+* Two-way I/O:: Two-way communications with another process.
+* TCP/IP Networking:: Using `gawk' for network programming.
+* Profiling:: Profiling your `awk' programs.
- ---------- Footnotes ----------
+
+File: gawk.info, Node: Nondecimal Data, Next: Array Sorting, Up: Advanced Features
- (1) Donald E. Knuth. `The Art of Computer Programming'. Volume 2,
-`Seminumerical Algorithms', third edition, 1998, ISBN 0-201-89683-4, p.
-229.
+11.1 Allowing Nondecimal Input Data
+===================================
-
-File: gawk.info, Node: General Arithmetic, Next: Floating-point Programming, Up: Arbitrary Precision Arithmetic
+If you run `gawk' with the `--non-decimal-data' option, you can have
+nondecimal constants in your input data:
-11.1 A General Description of Computer Arithmetic
-=================================================
+ $ echo 0123 123 0x123 |
+ > gawk --non-decimal-data '{ printf "%d, %d, %d\n",
+ > $1, $2, $3 }'
+ -| 83, 123, 291
-Within computers, there are two kinds of numeric values: "integers" and
-"floating-point". In school, integer values were referred to as
-"whole" numbers--that is, numbers without any fractional part, such as
-1, 42, or -17. The advantage to integer numbers is that they represent
-values exactly. The disadvantage is that their range is limited. On
-most systems, this range is -2,147,483,648 to 2,147,483,647. However,
-many systems now support a range from -9,223,372,036,854,775,808 to
-9,223,372,036,854,775,807.
+ For this feature to work, write your program so that `gawk' treats
+your data as numeric:
- Integer values come in two flavors: "signed" and "unsigned". Signed
-values may be negative or positive, with the range of values just
-described. Unsigned values are always positive. On most systems, the
-range is from 0 to 4,294,967,295. However, many systems now support a
-range from 0 to 18,446,744,073,709,551,615.
+ $ echo 0123 123 0x123 | gawk '{ print $1, $2, $3 }'
+ -| 0123 123 0x123
- Floating-point numbers represent what are called "real" numbers;
-i.e., those that do have a fractional part, such as 3.1415927. The
-advantage to floating-point numbers is that they can represent a much
-larger range of values. The disadvantage is that there are numbers
-that they cannot represent exactly. `awk' uses "double precision"
-floating-point numbers, which can hold more digits than "single
-precision" floating-point numbers.
+The `print' statement treats its expressions as strings. Although the
+fields can act as numbers when necessary, they are still strings, so
+`print' does not try to treat them numerically. You may need to add
+zero to a field to force it to be treated as a number. For example:
- There a several important issues to be aware of, described next.
+ $ echo 0123 123 0x123 | gawk --non-decimal-data '
+ > { print $1, $2, $3
+ > print $1 + 0, $2 + 0, $3 + 0 }'
+ -| 0123 123 0x123
+ -| 83 123 291
-* Menu:
+ Because it is common to have decimal data with leading zeros, and
+because using this facility could lead to surprising results, the
+default is to leave it disabled. If you want it, you must explicitly
+request it.
-* Floating Point Issues:: Stuff to know about floating-point numbers.
-* Integer Programming:: Effective integer programming.
+ CAUTION: _Use of this option is not recommended._ It can break old
+ programs very badly. Instead, use the `strtonum()' function to
+ convert your data (*note Nondecimal-numbers::). This makes your
+ programs easier to write and easier to read, and leads to less
+ surprising results.
-File: gawk.info, Node: Floating Point Issues, Next: Integer Programming, Up: General Arithmetic
+File: gawk.info, Node: Array Sorting, Next: Two-way I/O, Prev: Nondecimal Data, Up: Advanced Features
-11.1.1 Floating-Point Number Caveats
-------------------------------------
+11.2 Controlling Array Traversal and Array Sorting
+==================================================
-As mentioned earlier, floating-point numbers represent what are called
-"real" numbers, i.e., those that have a fractional part. `awk' uses
-double precision floating-point numbers to represent all numeric
-values. This minor node describes some of the issues involved in using
-floating-point numbers.
+`gawk' lets you control the order in which a `for (i in array)' loop
+traverses an array.
- There is a very nice paper on floating-point arithmetic
-(http://www.validlab.com/goldberg/paper.pdf) by David Goldberg, "What
-Every Computer Scientist Should Know About Floating-point Arithmetic,"
-`ACM Computing Surveys' *23*, 1 (1991-03), 5-48. This is worth reading
-if you are interested in the details, but it does require a background
-in computer science.
+ In addition, two built-in functions, `asort()' and `asorti()', let
+you sort arrays based on the array values and indices, respectively.
+These two functions also provide control over the sorting criteria used
+to order the elements during sorting.
* Menu:
-* String Conversion Precision:: The String Value Can Lie.
-* Unexpected Results:: Floating Point Numbers Are Not Abstract
- Numbers.
-* POSIX Floating Point Problems:: Standards Versus Existing Practice.
+* Controlling Array Traversal:: How to use PROCINFO["sorted_in"].
+* Array Sorting Functions:: How to use `asort()' and `asorti()'.
-File: gawk.info, Node: String Conversion Precision, Next: Unexpected Results, Up: Floating Point Issues
+File: gawk.info, Node: Controlling Array Traversal, Next: Array Sorting Functions, Up: Array Sorting
-11.1.1.1 The String Value Can Lie
-.................................
+11.2.1 Controlling Array Traversal
+----------------------------------
-Internally, `awk' keeps both the numeric value (double precision
-floating-point) and the string value for a variable. Separately, `awk'
-keeps track of what type the variable has (*note Typing and
-Comparison::), which plays a role in how variables are used in
-comparisons.
+By default, the order in which a `for (i in array)' loop scans an array
+is not defined; it is generally based upon the internal implementation
+of arrays inside `awk'.
- It is important to note that the string value for a number may not
-reflect the full value (all the digits) that the numeric value actually
-contains. The following program (`values.awk') illustrates this:
+ Often, though, it is desirable to be able to loop over the elements
+in a particular order that you, the programmer, choose. `gawk' lets
+you do this.
+
+ *note Controlling Scanning::, describes how you can assign special,
+pre-defined values to `PROCINFO["sorted_in"]' in order to control the
+order in which `gawk' will traverse an array during a `for' loop.
+ In addition, the value of `PROCINFO["sorted_in"]' can be a function
+name. This lets you traverse an array based on any custom criterion.
+The array elements are ordered according to the return value of this
+function. The comparison function should be defined with at least four
+arguments:
+
+ function comp_func(i1, v1, i2, v2)
{
- sum = $1 + $2
- # see it for what it is
- printf("sum = %.12g\n", sum)
- # use CONVFMT
- a = "<" sum ">"
- print "a =", a
- # use OFMT
- print "sum =", sum
+ COMPARE ELEMENTS 1 AND 2 IN SOME FASHION
+ RETURN < 0; 0; OR > 0
}
-This program shows the full value of the sum of `$1' and `$2' using
-`printf', and then prints the string values obtained from both
-automatic conversion (via `CONVFMT') and from printing (via `OFMT').
+ Here, I1 and I2 are the indices, and V1 and V2 are the corresponding
+values of the two elements being compared. Either V1 or V2, or both,
+can be arrays if the array being traversed contains subarrays as values.
+(*Note Arrays of Arrays::, for more information about subarrays.) The
+three possible return values are interpreted as follows:
- Here is what happens when the program is run:
+`comp_func(i1, v1, i2, v2) < 0'
+ Index I1 comes before index I2 during loop traversal.
- $ echo 3.654321 1.2345678 | awk -f values.awk
- -| sum = 4.8888888
- -| a = <4.88889>
- -| sum = 4.88889
+`comp_func(i1, v1, i2, v2) == 0'
+ Indices I1 and I2 come together but the relative order with
+ respect to each other is undefined.
- This makes it clear that the full numeric value is different from
-what the default string representations show.
+`comp_func(i1, v1, i2, v2) > 0'
+ Index I1 comes after index I2 during loop traversal.
- `CONVFMT''s default value is `"%.6g"', which yields a value with at
-least six significant digits. For some applications, you might want to
-change it to specify more precision. On most modern machines, most of
-the time, 17 digits is enough to capture a floating-point number's
-value exactly.(1)
+ Our first comparison function can be used to scan an array in
+numerical order of the indices:
- ---------- Footnotes ----------
+ function cmp_num_idx(i1, v1, i2, v2)
+ {
+ # numerical index comparison, ascending order
+ return (i1 - i2)
+ }
- (1) Pathological cases can require up to 752 digits (!), but we
-doubt that you need to worry about this.
+ Our second function traverses an array based on the string order of
+the element values rather than by indices:
-
-File: gawk.info, Node: Unexpected Results, Next: POSIX Floating Point Problems, Prev: String Conversion Precision, Up: Floating Point Issues
+ function cmp_str_val(i1, v1, i2, v2)
+ {
+ # string value comparison, ascending order
+ v1 = v1 ""
+ v2 = v2 ""
+ if (v1 < v2)
+ return -1
+ return (v1 != v2)
+ }
-11.1.1.2 Floating Point Numbers Are Not Abstract Numbers
-........................................................
+ The third comparison function makes all numbers, and numeric strings
+without any leading or trailing spaces, come out first during loop
+traversal:
-Unlike numbers in the abstract sense (such as what you studied in high
-school or college arithmetic), numbers stored in computers are limited
-in certain ways. They cannot represent an infinite number of digits,
-nor can they always represent things exactly. In particular,
-floating-point numbers cannot always represent values exactly. Here is
-an example:
+ function cmp_num_str_val(i1, v1, i2, v2, n1, n2)
+ {
+ # numbers before string value comparison, ascending order
+ n1 = v1 + 0
+ n2 = v2 + 0
+ if (n1 == v1)
+ return (n2 == v2) ? (n1 - n2) : -1
+ else if (n2 == v2)
+ return 1
+ return (v1 < v2) ? -1 : (v1 != v2)
+ }
- $ awk '{ printf("%010d\n", $1 * 100) }'
- 515.79
- -| 0000051579
- 515.80
- -| 0000051579
- 515.81
- -| 0000051580
- 515.82
- -| 0000051582
- Ctrl-d
+ Here is a main program to demonstrate how `gawk' behaves using each
+of the previous functions:
-This shows that some values can be represented exactly, whereas others
-are only approximated. This is not a "bug" in `awk', but simply an
-artifact of how computers represent numbers.
+ BEGIN {
+ data["one"] = 10
+ data["two"] = 20
+ data[10] = "one"
+ data[100] = 100
+ data[20] = "two"
- NOTE: It cannot be emphasized enough that the behavior just
- described is fundamental to modern computers. You will see this
- kind of thing happen in _any_ programming language using hardware
- floating-point numbers. It is _not_ a bug in `gawk', nor is it
- something that can be "just fixed."
+ f[1] = "cmp_num_idx"
+ f[2] = "cmp_str_val"
+ f[3] = "cmp_num_str_val"
+ for (i = 1; i <= 3; i++) {
+ printf("Sort function: %s\n", f[i])
+ PROCINFO["sorted_in"] = f[i]
+ for (j in data)
+ printf("\tdata[%s] = %s\n", j, data[j])
+ print ""
+ }
+ }
- Another peculiarity of floating-point numbers on modern systems is
-that they often have more than one representation for the number zero!
-In particular, it is possible to represent "minus zero" as well as
-regular, or "positive" zero.
+ Here are the results when the program is run:
- This example shows that negative and positive zero are distinct
-values when stored internally, but that they are in fact equal to each
-other, as well as to "regular" zero:
+ $ gawk -f compdemo.awk
+ -| Sort function: cmp_num_idx Sort by numeric index
+ -| data[two] = 20
+ -| data[one] = 10 Both strings are numerically zero
+ -| data[10] = one
+ -| data[20] = two
+ -| data[100] = 100
+ -|
+ -| Sort function: cmp_str_val Sort by element values as strings
+ -| data[one] = 10
+ -| data[100] = 100 String 100 is less than string 20
+ -| data[two] = 20
+ -| data[10] = one
+ -| data[20] = two
+ -|
+ -| Sort function: cmp_num_str_val Sort all numeric values before all strings
+ -| data[one] = 10
+ -| data[two] = 20
+ -| data[100] = 100
+ -| data[10] = one
+ -| data[20] = two
- $ gawk 'BEGIN { mz = -0 ; pz = 0
- > printf "-0 = %g, +0 = %g, (-0 == +0) -> %d\n", mz, pz, mz == pz
- > printf "mz == 0 -> %d, pz == 0 -> %d\n", mz == 0, pz == 0
- > }'
- -| -0 = -0, +0 = 0, (-0 == +0) -> 1
- -| mz == 0 -> 1, pz == 0 -> 1
+ Consider sorting the entries of a GNU/Linux system password file
+according to login name. The following program sorts records by a
+specific field position and can be used for this purpose:
- It helps to keep this in mind should you process numeric data that
-contains negative zero values; the fact that the zero is negative is
-noted and can affect comparisons.
+ # sort.awk --- simple program to sort by field position
+ # field position is specified by the global variable POS
-
-File: gawk.info, Node: POSIX Floating Point Problems, Prev: Unexpected Results, Up: Floating Point Issues
+ function cmp_field(i1, v1, i2, v2)
+ {
+ # comparison by value, as string, and ascending order
+ return v1[POS] < v2[POS] ? -1 : (v1[POS] != v2[POS])
+ }
-11.1.1.3 Standards Versus Existing Practice
-...........................................
+ {
+ for (i = 1; i <= NF; i++)
+ a[NR][i] = $i
+ }
-Historically, `awk' has converted any non-numeric looking string to the
-numeric value zero, when required. Furthermore, the original
-definition of the language and the original POSIX standards specified
-that `awk' only understands decimal numbers (base 10), and not octal
-(base 8) or hexadecimal numbers (base 16).
+ END {
+ PROCINFO["sorted_in"] = "cmp_field"
+ if (POS < 1 || POS > NF)
+ POS = 1
+ for (i in a) {
+ for (j = 1; j <= NF; j++)
+ printf("%s%c", a[i][j], j < NF ? ":" : "")
+ print ""
+ }
+ }
- Changes in the language of the 2001 and 2004 POSIX standards can be
-interpreted to imply that `awk' should support additional features.
-These features are:
+ The first field in each entry of the password file is the user's
+login name, and the fields are separated by colons. Each record
+defines a subarray, with each field as an element in the subarray.
+Running the program produces the following output:
- * Interpretation of floating point data values specified in
- hexadecimal notation (`0xDEADBEEF'). (Note: data values, _not_
- source code constants.)
+ $ gawk -vPOS=1 -F: -f sort.awk /etc/passwd
+ -| adm:x:3:4:adm:/var/adm:/sbin/nologin
+ -| apache:x:48:48:Apache:/var/www:/sbin/nologin
+ -| avahi:x:70:70:Avahi daemon:/:/sbin/nologin
+ ...
- * Support for the special IEEE 754 floating point values "Not A
- Number" (NaN), positive Infinity ("inf") and negative Infinity
- ("-inf"). In particular, the format for these values is as
- specified by the ISO 1999 C standard, which ignores case and can
- allow machine-dependent additional characters after the `nan' and
- allow either `inf' or `infinity'.
+ The comparison should normally always return the same value when
+given a specific pair of array elements as its arguments. If
+inconsistent results are returned then the order is undefined. This
+behavior can be exploited to introduce random order into otherwise
+seemingly ordered data:
- The first problem is that both of these are clear changes to
-historical practice:
+ function cmp_randomize(i1, v1, i2, v2)
+ {
+ # random order
+ return (2 - 4 * rand())
+ }
- * The `gawk' maintainer feels that supporting hexadecimal floating
- point values, in particular, is ugly, and was never intended by the
- original designers to be part of the language.
+ As mentioned above, the order of the indices is arbitrary if two
+elements compare equal. This is usually not a problem, but letting the
+tied elements come out in arbitrary order can be an issue, especially
+when comparing item values. The partial ordering of the equal elements
+may change during the next loop traversal, if other elements are added
+or removed from the array. One way to resolve ties when comparing
+elements with otherwise equal values is to include the indices in the
+comparison rules. Note that doing this may make the loop traversal
+less efficient, so consider it only if necessary. The following
+comparison functions force a deterministic order, and are based on the
+fact that the indices of two elements are never equal:
- * Allowing completely alphabetic strings to have valid numeric
- values is also a very severe departure from historical practice.
+ function cmp_numeric(i1, v1, i2, v2)
+ {
+ # numerical value (and index) comparison, descending order
+ return (v1 != v2) ? (v2 - v1) : (i2 - i1)
+ }
- The second problem is that the `gawk' maintainer feels that this
-interpretation of the standard, which requires a certain amount of
-"language lawyering" to arrive at in the first place, was not even
-intended by the standard developers. In other words, "we see how you
-got where you are, but we don't think that that's where you want to be."
+ function cmp_string(i1, v1, i2, v2)
+ {
+ # string value (and index) comparison, descending order
+ v1 = v1 i1
+ v2 = v2 i2
+ return (v1 > v2) ? -1 : (v1 != v2)
+ }
- Recognizing the above issues, but attempting to provide compatibility
-with the earlier versions of the standard, the 2008 POSIX standard
-added explicit wording to allow, but not require, that `awk' support
-hexadecimal floating point values and special values for "Not A Number"
-and infinity.
+ A custom comparison function can often simplify ordered loop
+traversal, and the sky is really the limit when it comes to designing
+such a function.
- Although the `gawk' maintainer continues to feel that providing
-those features is inadvisable, nevertheless, on systems that support
-IEEE floating point, it seems reasonable to provide _some_ way to
-support NaN and Infinity values. The solution implemented in `gawk' is
-as follows:
+ When string comparisons are made during a sort, either for element
+values where one or both aren't numbers, or for element indices handled
+as strings, the value of `IGNORECASE' (*note Built-in Variables::)
+controls whether the comparisons treat corresponding uppercase and
+lowercase letters as equivalent or distinct.
- * With the `--posix' command-line option, `gawk' becomes "hands
- off." String values are passed directly to the system library's
- `strtod()' function, and if it successfully returns a numeric
- value, that is what's used.(1) By definition, the results are not
- portable across different systems. They are also a little
- surprising:
+ Another point to keep in mind is that in the case of subarrays the
+element values can themselves be arrays; a production comparison
+function should use the `isarray()' function (*note Type Functions::),
+to check for this, and choose a defined sorting order for subarrays.
- $ echo nanny | gawk --posix '{ print $1 + 0 }'
- -| nan
- $ echo 0xDeadBeef | gawk --posix '{ print $1 + 0 }'
- -| 3735928559
+ All sorting based on `PROCINFO["sorted_in"]' is disabled in POSIX
+mode, since the `PROCINFO' array is not special in that case.
- * Without `--posix', `gawk' interprets the four strings `+inf',
- `-inf', `+nan', and `-nan' specially, producing the corresponding
- special numeric values. The leading sign acts a signal to `gawk'
- (and the user) that the value is really numeric. Hexadecimal
- floating point is not supported (unless you also use
- `--non-decimal-data', which is _not_ recommended). For example:
+ As a side note, sorting the array indices before traversing the
+array has been reported to add 15% to 20% overhead to the execution
+time of `awk' programs. For this reason, sorted array traversal is not
+the default.
- $ echo nanny | gawk '{ print $1 + 0 }'
- -| 0
- $ echo +nan | gawk '{ print $1 + 0 }'
- -| nan
- $ echo 0xDeadBeef | gawk '{ print $1 + 0 }'
- -| 0
+
+File: gawk.info, Node: Array Sorting Functions, Prev: Controlling Array Traversal, Up: Array Sorting
- `gawk' does ignore case in the four special values. Thus `+nan'
- and `+NaN' are the same.
+11.2.2 Sorting Array Values and Indices with `gawk'
+---------------------------------------------------
- ---------- Footnotes ----------
+In most `awk' implementations, sorting an array requires writing a
+`sort()' function. While this can be educational for exploring
+different sorting algorithms, usually that's not the point of the
+program. `gawk' provides the built-in `asort()' and `asorti()'
+functions (*note String Functions::) for sorting arrays. For example:
- (1) You asked for it, you got it.
+ POPULATE THE ARRAY data
+ n = asort(data)
+ for (i = 1; i <= n; i++)
+ DO SOMETHING WITH data[i]
-
-File: gawk.info, Node: Integer Programming, Prev: Floating Point Issues, Up: General Arithmetic
+ After the call to `asort()', the array `data' is indexed from 1 to
+some number N, the total number of elements in `data'. (This count is
+`asort()''s return value.) `data[1]' <= `data[2]' <= `data[3]', and so
+on. The comparison is based on the type of the elements (*note Typing
+and Comparison::). All numeric values come before all string values,
+which in turn come before all subarrays.
-11.1.2 Mixing Integers And Floating-point
------------------------------------------
+ An important side effect of calling `asort()' is that _the array's
+original indices are irrevocably lost_. As this isn't always
+desirable, `asort()' accepts a second argument:
-As has been mentioned already, `gawk' ordinarily uses hardware double
-precision with 64-bit IEEE binary floating-point representation for
-numbers on most systems. A large integer like 9007199254740997 has a
-binary representation that, although finite, is more than 53 bits long;
-it must also be rounded to 53 bits. The biggest integer that can be
-stored in a C `double' is usually the same as the largest possible
-value of a `double'. If your system `double' is an IEEE 64-bit
-`double', this largest possible value is an integer and can be
-represented precisely. What more should one know about integers?
+ POPULATE THE ARRAY source
+ n = asort(source, dest)
+ for (i = 1; i <= n; i++)
+ DO SOMETHING WITH dest[i]
- If you want to know what is the largest integer, such that it and
-all smaller integers can be stored in 64-bit doubles without losing
-precision, then the answer is 2^53. The next representable number is
-the even number 2^53 + 2, meaning it is unlikely that you will be able
-to make `gawk' print 2^53 + 1 in integer format. The range of integers
-exactly representable by a 64-bit double is [-2^53, 2^53]. If you ever
-see an integer outside this range in `gawk' using 64-bit doubles, you
-have reason to be very suspicious about the accuracy of the output.
-Here is a simple program with erroneous output:
+ In this case, `gawk' copies the `source' array into the `dest' array
+and then sorts `dest', destroying its indices. However, the `source'
+array is not affected.
- $ gawk 'BEGIN { i = 2^53 - 1; for (j = 0; j < 4; j++) print i + j }'
- -| 9007199254740991
- -| 9007199254740992
- -| 9007199254740992
- -| 9007199254740994
+ `asort()' accepts a third string argument to control comparison of
+array elements. As with `PROCINFO["sorted_in"]', this argument may be
+one of the predefined names that `gawk' provides (*note Controlling
+Scanning::), or the name of a user-defined function (*note Controlling
+Array Traversal::).
- The lesson is to not assume that any large integer printed by `gawk'
-represents an exact result from your computation, especially if it wraps
-around on your screen.
+ NOTE: In all cases, the sorted element values consist of the
+ original array's element values. The ability to control
+ comparison merely affects the way in which they are sorted.
-
-File: gawk.info, Node: Floating-point Programming, Next: Gawk and MPFR, Prev: General Arithmetic, Up: Arbitrary Precision Arithmetic
+ Often, what's needed is to sort on the values of the _indices_
+instead of the values of the elements. To do that, use the `asorti()'
+function. The interface is identical to that of `asort()', except that
+the index values are used for sorting, and become the values of the
+result array:
-11.2 Understanding Floating-point Programming
-=============================================
+ { source[$0] = some_func($0) }
-Numerical programming is an extensive area; if you need to develop
-sophisticated numerical algorithms then `gawk' may not be the ideal
-tool, and this documentation may not be sufficient. It might require
-digesting a book or two to really internalize how to compute with ideal
-accuracy and precision and the result often depends on the particular
-application.
+ END {
+ n = asorti(source, dest)
+ for (i = 1; i <= n; i++) {
+ Work with sorted indices directly:
+ DO SOMETHING WITH dest[i]
+ ...
+ Access original array via sorted indices:
+ DO SOMETHING WITH source[dest[i]]
+ }
+ }
- NOTE: A floating-point calculation's "accuracy" is how close it
- comes to the real value. This is as opposed to the "precision",
- which usually refers to the number of bits used to represent the
- number (see the Wikipedia article
- (http://en.wikipedia.org/wiki/Accuracy_and_precision) for more
- information).
+ Similar to `asort()', in all cases, the sorted element values
+consist of the original array's indices. The ability to control
+comparison merely affects the way in which they are sorted.
- There are two options for doing floating-point calculations:
-hardware floating-point (as used by standard `awk' and the default for
-`gawk'), and "arbitrary-precision" floating-point, which is software
-based. This major node aims to provide enough information to
-understand both, and then will focus on `gawk''s facilities for the
-latter.(1)
+ Sorting the array by replacing the indices provides maximal
+flexibility. To traverse the elements in decreasing order, use a loop
+that goes from N down to 1, either over the elements or over the
+indices.(1)
- Binary floating-point representations and arithmetic are inexact.
-Simple values like 0.1 cannot be precisely represented using binary
-floating-point numbers, and the limited precision of floating-point
-numbers means that slight changes in the order of operations or the
-precision of intermediate storage can change the result. To make
-matters worse, with arbitrary precision floating-point, you can set the
-precision before starting a computation, but then you cannot be sure of
-the number of significant decimal places in the final result.
+ Copying array indices and elements isn't expensive in terms of
+memory. Internally, `gawk' maintains "reference counts" to data. For
+example, when `asort()' copies the first array to the second one, there
+is only one copy of the original array elements' data, even though both
+arrays use the values.
- Sometimes, before you start to write any code, you should think more
-about what you really want and what's really happening. Consider the
-two numbers in the following example:
+ Because `IGNORECASE' affects string comparisons, the value of
+`IGNORECASE' also affects sorting for both `asort()' and `asorti()'.
+Note also that the locale's sorting order does _not_ come into play;
+comparisons are based on character values only.(2) Caveat Emptor.
- x = 0.875 # 1/2 + 1/4 + 1/8
- y = 0.425
+ ---------- Footnotes ----------
- Unlike the number in `y', the number stored in `x' is exactly
-representable in binary since it can be written as a finite sum of one
-or more fractions whose denominators are all powers of two. When
-`gawk' reads a floating-point number from program source, it
-automatically rounds that number to whatever precision your machine
-supports. If you try to print the numeric content of a variable using
-an output format string of `"%.17g"', it may not produce the same
-number as you assigned to it:
+ (1) You may also use one of the predefined sorting names that sorts
+in decreasing order.
- $ gawk 'BEGIN { x = 0.875; y = 0.425
- > printf("%0.17g, %0.17g\n", x, y) }'
- -| 0.875, 0.42499999999999999
+ (2) This is true because locale-based comparison occurs only when in
+POSIX compatibility mode, and since `asort()' and `asorti()' are `gawk'
+extensions, they are not available in that case.
- Often the error is so small you do not even notice it, and if you do,
-you can always specify how much precision you would like in your output.
-Usually this is a format string like `"%.15g"', which when used in the
-previous example, produces an output identical to the input.
+
+File: gawk.info, Node: Two-way I/O, Next: TCP/IP Networking, Prev: Array Sorting, Up: Advanced Features
- Because the underlying representation can be a little bit off from the
-exact value, comparing floating-point values to see if they are equal
-is generally not a good idea. Here is an example where it does not
-work like you expect:
+11.3 Two-Way Communications with Another Process
+================================================
- $ gawk 'BEGIN { print (0.1 + 12.2 == 12.3) }'
- -| 0
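The usual workaround for the failed equality test above is to compare against a small tolerance instead. The following is only a sketch: the helper `approx_equal`, the wrapper `float_eq_demo`, and the tolerance `1e-9` are illustrative assumptions, not part of `gawk`, and plain `awk` is assumed to behave like `gawk` for this POSIX-only code.

```shell
# Hypothetical demo: compare two floating-point values within a tolerance.
float_eq_demo() {
    awk '
    # Return 1 if a and b differ by at most eps, else 0 (d is a local).
    function approx_equal(a, b, eps,    d) {
        d = a - b
        if (d < 0)
            d = -d
        return d <= eps
    }
    BEGIN { print approx_equal(0.1 + 12.2, 12.3, 1e-9) }'
}

float_eq_demo
```

A suitable tolerance depends on the magnitude of the values involved; a fixed absolute value such as the one assumed here suits numbers near 1.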
+ From: address@hidden (Mike Brennan)
+ Newsgroups: comp.lang.awk
+ Subject: Re: Learn the SECRET to Attract Women Easily
+ Date: 4 Aug 1997 17:34:46 GMT
+ Message-ID: <address@hidden>
- The loss of accuracy during a single computation with floating-point
-numbers usually isn't enough to worry about. However, if you compute a
-value which is the result of a sequence of floating point operations,
-the error can accumulate and greatly affect the computation itself.
-Here is an attempt to compute the value of the constant pi using one of
-its many series representations:
+ On 3 Aug 1997 13:17:43 GMT, Want More Dates???
+ <address@hidden> wrote:
+ >Learn the SECRET to Attract Women Easily
+ >
+ >The SCENT(tm) Pheromone Sex Attractant For Men to Attract Women
- BEGIN {
- x = 1.0 / sqrt(3.0)
- n = 6
- for (i = 1; i < 30; i++) {
- n = n * 2.0
- x = (sqrt(x * x + 1) - 1) / x
- printf("%.15f\n", n * x)
- }
- }
+ The scent of awk programmers is a lot more attractive to women than
+ the scent of perl programmers.
+ --
+ Mike Brennan
- When run, the early errors propagating through later computations
-cause the loop to terminate prematurely after an attempt to divide by
-zero.
+ It is often useful to be able to send data to a separate program for
+processing and then read the result. This can always be done with
+temporary files:
- $ gawk -f pi.awk
- -| 3.215390309173475
- -| 3.159659942097510
- -| 3.146086215131467
- -| 3.142714599645573
- ...
- -| 3.224515243534819
- -| 2.791117213058638
- -| 0.000000000000000
- error--> gawk: pi.awk:6: fatal: division by zero attempted
+ # Write the data for processing
+ tempfile = ("mydata." PROCINFO["pid"])
+ while (NOT DONE WITH DATA)
+ print DATA | ("subprogram > " tempfile)
+ close("subprogram > " tempfile)
- Here is one more example where the inaccuracies in internal
-representations yield an unexpected result:
+ # Read the results, remove tempfile when done
+ while ((getline newdata < tempfile) > 0)
+ PROCESS newdata APPROPRIATELY
+ close(tempfile)
+ system("rm " tempfile)
- $ gawk 'BEGIN {
- > for (d = 1.1; d <= 1.5; d += 0.1)
- > i++
- > print i
- > }'
- -| 4
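One way to sidestep this kind of surprise is to drive the loop with an integer counter and derive the floating-point value from it. This is a sketch under that assumption; the names `count_steps` and `k` are hypothetical, and plain `awk` is assumed to work as well as `gawk` here.

```shell
# Hypothetical demo: count the steps 1.1, 1.2, ..., 1.5 with an integer loop.
count_steps() {
    awk 'BEGIN {
        for (k = 11; k <= 15; k++) {
            d = k / 10      # floating-point value for this step, if needed
            i++
        }
        print i
    }'
}

count_steps
```

Because small integers are represented exactly, the loop test no longer depends on accumulated round-off in `d`.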
+This works, but not elegantly. Among other things, it requires that
+the program be run in a directory that cannot be shared among users;
+for example, `/tmp' will not do, as another user might happen to be
+using a temporary file with the same name.
- Can computation using arbitrary precision help with the previous
-examples? If you are impatient to know, see *note Exact Arithmetic::.
+ However, with `gawk', it is possible to open a _two-way_ pipe to
+another process. The second process is termed a "coprocess", since it
+runs in parallel with `gawk'. The two-way connection is created using
+the `|&' operator (borrowed from the Korn shell, `ksh'):(1)
- Instead of arbitrary precision floating-point arithmetic, often all
-you need is an adjustment of your logic or a different order for the
-operations in your calculation. The stability and the accuracy of the
-computation of the constant pi in the previous example can be enhanced
-by using the following simple algebraic transformation:
+ do {
+ print DATA |& "subprogram"
+ "subprogram" |& getline results
+ } while (DATA LEFT TO PROCESS)
+ close("subprogram")
- (sqrt(x * x + 1) - 1) / x = x / (sqrt(x * x + 1) + 1)
+ The first time an I/O operation is executed using the `|&' operator,
+`gawk' creates a two-way pipeline to a child process that runs the
+other program. Output created with `print' or `printf' is written to
+the program's standard input, and output from the program's standard
+output can be read by the `gawk' program using `getline'. As is the
+case with processes started by `|', the subprogram can be any program,
+or pipeline of programs, that can be started by the shell.
-After making this change, the program does converge to pi in under 30
-iterations:
+ There are some cautionary items to be aware of:
- $ gawk -f /tmp/pi2.awk
- -| 3.215390309173473
- -| 3.159659942097501
- -| 3.146086215131436
- -| 3.142714599645370
- -| 3.141873049979825
- ...
- -| 3.141592653589797
- -| 3.141592653589797
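The source of `pi2.awk` is not shown here. A minimal sketch, assuming it differs from `pi.awk` only in using the transformed expression for `x`, might look like the following; the wrapper name `pi2_sketch` is hypothetical, and plain `awk` is assumed to behave like `gawk` for this POSIX-only code.

```shell
# Hypothetical reconstruction of pi2.awk, wrapped in a shell function.
pi2_sketch() {
    awk 'BEGIN {
        x = 1.0 / sqrt(3.0)
        n = 6
        for (i = 1; i < 30; i++) {
            n = n * 2.0
            # Transformed update: avoids subtracting two nearly equal values
            x = x / (sqrt(x * x + 1) + 1)
            printf("%.15f\n", n * x)
        }
    }'
}

pi2_sketch | tail -n 1
```

The rewritten update never subtracts 1 from a value close to 1, which is the step where the original form loses its significant digits.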
+ * As the code inside `gawk' currently stands, the coprocess's
+ standard error goes to the same place that the parent `gawk''s
+ standard error goes. It is not possible to read the child's
+ standard error separately.
- There is no need to be unduly suspicious about the results from
-floating-point arithmetic. The lesson to remember is that
-floating-point arithmetic is always more complex than the arithmetic
-using pencil and paper. In order to take advantage of the power of
-computer floating-point, you need to know its limitations and work
-within them. For most casual use of floating-point arithmetic, you will
-often get the expected result in the end if you simply round the
-display of your final results to the correct number of significant
-decimal digits. And, avoid presenting numerical data in a manner that
-implies better precision than is actually the case.
+ * I/O buffering may be a problem. `gawk' automatically flushes all
+ output down the pipe to the coprocess. However, if the coprocess
+ does not flush its output, `gawk' may hang when doing a `getline'
+ in order to read the coprocess's results. This could lead to a
+ situation known as "deadlock", where each process is waiting for
+ the other one to do something.
-* Menu:
+ It is possible to close just one end of the two-way pipe to a
+coprocess, by supplying a second argument to the `close()' function of
+either `"to"' or `"from"' (*note Close Files And Pipes::). These
+strings tell `gawk' to close the end of the pipe that sends data to the
+coprocess or the end that reads from it, respectively.
-* Floating-point Representation:: Binary floating-point representation.
-* Floating-point Context:: Floating-point context.
-* Rounding Mode:: Floating-point rounding mode.
+ This is particularly necessary in order to use the system `sort'
+utility as part of a coprocess; `sort' must read _all_ of its input
+data before it can produce any output. The `sort' program does not
+receive an end-of-file indication until `gawk' closes the write end of
+the pipe.
- ---------- Footnotes ----------
+ When you have finished writing data to the `sort' utility, you can
+close the `"to"' end of the pipe, and then start reading sorted data
+via `getline'. For example:
- (1) If you are interested in other tools that perform arbitrary
-precision arithmetic, you may want to investigate the POSIX `bc' tool.
-See the POSIX specification for it
-(http://pubs.opengroup.org/onlinepubs/009695399/utilities/bc.html), for
-more information.
+ BEGIN {
+ command = "LC_ALL=C sort"
+ n = split("abcdefghijklmnopqrstuvwxyz", a, "")
-
-File: gawk.info, Node: Floating-point Representation, Next: Floating-point Context, Up: Floating-point Programming
+ for (i = n; i > 0; i--)
+ print a[i] |& command
+ close(command, "to")
-11.2.1 Binary Floating-point Representation
--------------------------------------------
+ while ((command |& getline line) > 0)
+ print "got", line
+ close(command)
+ }
-Although floating-point representations vary from machine to machine,
-the most commonly encountered representation is that defined by the
-IEEE 754 Standard. An IEEE-754 format value has three components:
+ This program writes the letters of the alphabet in reverse order, one
+per line, down the two-way pipe to `sort'. It then closes the write
+end of the pipe, so that `sort' receives an end-of-file indication.
+This causes `sort' to sort the data and write the sorted data back to
+the `gawk' program. Once all of the data has been read, `gawk'
+terminates the coprocess and exits.
- * A sign bit telling whether the number is positive or negative.
+ As a side note, the assignment `LC_ALL=C' in the `sort' command
+ensures traditional Unix (ASCII) sorting from `sort'.
- * An "exponent" giving its order of magnitude, E.
+ You may also use pseudo-ttys (ptys) for two-way communication
+instead of pipes, if your system supports them. This is done on a
+per-command basis, by setting a special element in the `PROCINFO' array
+(*note Auto-set::), like so:
- * A "significand", S, specifying the actual digits of the number.
+ command = "sort -nr" # command, save in convenience variable
+ PROCINFO[command, "pty"] = 1 # update PROCINFO
+ print ... |& command # start two-way pipe
+ ...
- The value of the number is then S * 2^E. The first bit of a
-non-zero binary significand is always one, so the significand in an
-IEEE-754 format only includes the fractional part, leaving the leading
-one implicit.
+Using ptys avoids the buffer deadlock issues described earlier, at some
+loss in performance. If your system does not have ptys, or if all the
+system's ptys are in use, `gawk' automatically falls back to using
+regular pipes.
- Three of the standard IEEE-754 types are 32-bit single precision,
-64-bit double precision and 128-bit quadruple precision. The standard
-also specifies extended precision formats to allow greater precisions
-and larger exponent ranges.
+ ---------- Footnotes ----------
- The significand is stored in "normalized" format, which means that
-the first bit is always a one.
+ (1) This is very different from the same operator in the C shell.
-File: gawk.info, Node: Floating-point Context, Next: Rounding Mode, Prev: Floating-point Representation, Up: Floating-point Programming
+File: gawk.info, Node: TCP/IP Networking, Next: Profiling, Prev: Two-way I/O, Up: Advanced Features
-11.2.2 Floating-point Context
------------------------------
+11.4 Using `gawk' for Network Programming
+=========================================
-A floating-point "context" defines the environment for arithmetic
-operations. It governs precision, sets rules for rounding, and limits
-the range for exponents. The context has the following primary
-components:
+ `EMISTERED':
+ A host is a host from coast to coast,
+ and no-one can talk to host that's close,
+ unless the host that isn't close
+ is busy hung or dead.
-"Precision"
- Precision of the floating-point format in bits.
+ In addition to being able to open a two-way pipeline to a coprocess
+on the same system (*note Two-way I/O::), it is possible to make a
+two-way connection to another process on another system across an IP
+network connection.
-"emax"
- Maximum exponent allowed for this format.
+ You can think of this as just a _very long_ two-way pipeline to a
+coprocess. The way `gawk' decides that you want to use TCP/IP
+networking is by recognizing special file names that begin with one of
+`/inet/', `/inet4/' or `/inet6/'.
-"emin"
- Minimum exponent allowed for this format.
+ The full syntax of the special file name is
+`/NET-TYPE/PROTOCOL/LOCAL-PORT/REMOTE-HOST/REMOTE-PORT'. The
+components are:
-"Underflow behavior"
- The format may or may not support gradual underflow.
+NET-TYPE
+ Specifies the kind of Internet connection to make. Use `/inet4/'
+ to force IPv4, and `/inet6/' to force IPv6. Plain `/inet/' (which
+ used to be the only option) uses the system default, most likely
+ IPv4.
-"Rounding"
- The rounding mode of this context.
+PROTOCOL
+ The protocol to use over IP. This must be either `tcp', or `udp',
+ for a TCP or UDP IP connection, respectively. The use of TCP is
+ recommended for most applications.
- *note table-ieee-formats:: lists the precision and exponent field
-values for the basic IEEE-754 binary formats:
+LOCAL-PORT
+ The local TCP or UDP port number to use. Use a port number of `0'
+ when you want the system to pick a port. This is what you should do
+ when writing a TCP or UDP client. You may also use a well-known
+ service name, such as `smtp' or `http', in which case `gawk'
+ attempts to determine the predefined port number using the C
+ `getaddrinfo()' function.
-Name Total bits Precision emin emax
----------------------------------------------------------------------------
-Single 32 24 -126 +127
-Double 64 53 -1022 +1023
-Quadruple 128 113 -16382 +16383
+REMOTE-HOST
+ The IP address or fully-qualified domain name of the Internet host
+ to which you want to connect.
+
+REMOTE-PORT
+ The TCP or UDP port number to use on the given REMOTE-HOST.
+ Again, use `0' if you don't care, or else a well-known service
+ name.
-Table 11.1: Basic IEEE Format Context Values
+ NOTE: Failure in opening a two-way socket will result in a
+ non-fatal error being returned to the calling code. The value of
+ `ERRNO' indicates the error (*note Auto-set::).
- NOTE: The precision numbers include the implied leading one that
- gives them one extra bit of significand.
+ Consider the following very simple example:
- A floating-point context can also determine which signals are treated
-as exceptions, and can set rules for arithmetic with special values.
-Please consult the IEEE-754 standard or other resources for details.
+ BEGIN {
+ Service = "/inet/tcp/0/localhost/daytime"
+ Service |& getline
+ print $0
+ close(Service)
+ }
- `gawk' ordinarily uses the hardware double precision representation
-for numbers. On most systems, this is IEEE-754 floating-point format,
-corresponding to 64-bit binary with 53 bits of precision.
+ This program reads the current date and time from the local system's
+TCP `daytime' server. It then prints the results and closes the
+connection.
- NOTE: In case an underflow occurs, the standard allows, but does
- not require, the result from an arithmetic operation to be a
- number smaller than the smallest nonzero normalized number. Such
- numbers do not have as many significant digits as normal numbers,
- and are called "denormals" or "subnormals". The alternative,
- simply returning a zero, is called "flush to zero". The basic
- IEEE-754 binary formats support subnormal numbers.
+ Because this topic is extensive, the use of `gawk' for TCP/IP
+programming is documented separately. See *note (General
+Introduction)Top:: gawkinet, TCP/IP Internetworking with `gawk', for a
+much more complete introduction and discussion, as well as extensive
+examples.
-File: gawk.info, Node: Rounding Mode, Prev: Floating-point Context, Up: Floating-point Programming
+File: gawk.info, Node: Profiling, Prev: TCP/IP Networking, Up: Advanced Features
-11.2.3 Floating-point Rounding Mode
------------------------------------
+11.5 Profiling Your `awk' Programs
+==================================
-The "rounding mode" specifies the behavior for the results of numerical
-operations when discarding extra precision. Each rounding mode indicates
-how the least significant returned digit of a rounded result is to be
-calculated. *note table-rounding-modes:: lists the IEEE-754 defined
-rounding modes:
+You may produce execution traces of your `awk' programs. This is done
+by passing the option `--profile' to `gawk'. When `gawk' has finished
+running, it creates a profile of your program in a file named
+`awkprof.out'. Because it is profiling, it also executes up to 45%
+slower than `gawk' normally does.
-Rounding Mode IEEE Name
---------------------------------------------------------------------------
-Round to nearest, ties to even `roundTiesToEven'
-Round toward plus Infinity `roundTowardPositive'
-Round toward negative Infinity `roundTowardNegative'
-Round toward zero `roundTowardZero'
-Round to nearest, ties away `roundTiesToAway'
-from zero
+ As shown in the following example, the `--profile' option can be
+used to change the name of the file where `gawk' will write the profile:
-Table 11.2: IEEE 754 Rounding Modes
+ gawk --profile=myprog.prof -f myprog.awk data1 data2
- The default mode `roundTiesToEven' is the most preferred, but the
-least intuitive. This method does the obvious thing for most values, by
-rounding them up or down to the nearest digit. For example, rounding
-1.132 to two digits yields 1.13, and rounding 1.157 yields 1.16.
+In the above example, `gawk' places the profile in `myprog.prof'
+instead of in `awkprof.out'.
- However, when it comes to rounding a value that is exactly halfway
-between, things do not work the way you probably learned in school. In
-this case, the number is rounded to the nearest even digit. So
-rounding 0.125 to two digits rounds down to 0.12, but rounding 0.6875
-to three digits rounds up to 0.688. You probably have already
-encountered this rounding mode when using the `printf' routine to
-format floating-point numbers. For example:
+ Here is a sample session showing a simple `awk' program, its input
+data, and the results from running `gawk' with the `--profile' option.
+First, the `awk' program:
- BEGIN {
- x = -4.5
- for (i = 1; i < 10; i++) {
- x += 1.0
- printf("%4.1f => %2.0f\n", x, x)
- }
+ BEGIN { print "First BEGIN rule" }
+
+ END { print "First END rule" }
+
+ /foo/ {
+ print "matched /foo/, gosh"
+ for (i = 1; i <= 3; i++)
+ sing()
}
-produces the following output when run:(1)
+ {
+ if (/foo/)
+ print "if is true"
+ else
+ print "else is true"
+ }
- -3.5 => -4
- -2.5 => -2
- -1.5 => -2
- -0.5 => 0
- 0.5 => 0
- 1.5 => 2
- 2.5 => 2
- 3.5 => 4
- 4.5 => 4
+ BEGIN { print "Second BEGIN rule" }
- The theory behind the rounding mode `roundTiesToEven' is that it
-more or less evenly distributes upward and downward rounds of exact
-halves, which might cause the round-off error to cancel itself out.
-This is the default rounding mode used in IEEE-754 computing functions
-and operators.
+ END { print "Second END rule" }
- The other rounding modes are rarely used. Round toward positive
-infinity (`roundTowardPositive') and round toward negative infinity
-(`roundTowardNegative') are often used to implement interval arithmetic,
-where you adjust the rounding mode to calculate upper and lower bounds
-for the range of output. The `roundTowardZero' mode can be used for
-converting floating-point numbers to integers. The rounding mode
-`roundTiesToAway' rounds the result to the nearest number and selects
-the number with the larger magnitude if a tie occurs.
+ function sing( dummy)
+ {
+ print "I gotta be me!"
+ }
- Some numerical analysts will tell you that your choice of rounding
-style has tremendous impact on the final outcome, and advise you to
-wait until final output for any rounding. Instead, you can often avoid
-round-off error problems by setting the precision initially to some
-value sufficiently larger than the final desired precision, so that the
-accumulation of round-off error does not influence the outcome. If you
-suspect that results from your computation are sensitive to
-accumulation of round-off error, one way to be sure is to look for a
-significant difference in output when you change the rounding mode.
-
- ---------- Footnotes ----------
+ Following is the input data:
- (1) It is possible for the output to be completely different if the
-C library in your system does not use the IEEE-754 even-rounding rule
-to round halfway cases for `printf()'.
+ foo
+ bar
+ baz
+ foo
+ junk
-
-File: gawk.info, Node: Gawk and MPFR, Next: Arbitrary Precision Floats, Prev: Floating-point Programming, Up: Arbitrary Precision Arithmetic
+ Here is the `awkprof.out' that results from running the `gawk'
+profiler on this program and data (this example also illustrates that
+`awk' programmers sometimes have to work late):
-11.3 `gawk' + MPFR = Powerful Arithmetic
-========================================
+ # gawk profile, created Sun Aug 13 00:00:15 2000
-The rest of this major node describes how to use the arbitrary precision
-(also known as "multiple precision" or "infinite precision") numeric
-capabilities in `gawk' to produce maximally accurate results when you
-need it.
+ # BEGIN block(s)
- But first you should check if your version of `gawk' supports
-arbitrary precision arithmetic. The easiest way to find out is to look
-at the output of the following command:
+ BEGIN {
+ 1 print "First BEGIN rule"
+ 1 print "Second BEGIN rule"
+ }
- $ gawk --version
- -| GNU Awk 4.1.0 (GNU MPFR 3.1.0, GNU MP 5.0.3)
- -| Copyright (C) 1989, 1991-2012 Free Software Foundation.
- ...
+ # Rule(s)
- `gawk' uses the GNU MPFR (http://www.mpfr.org) and GNU MP
-(http://gmplib.org) (GMP) libraries for arbitrary precision arithmetic
-on numbers. So if you do not see the names of these libraries in the
-output, then your version of `gawk' does not support arbitrary
-precision arithmetic.
+ 5 /foo/ { # 2
+ 2 print "matched /foo/, gosh"
+ 6 for (i = 1; i <= 3; i++) {
+ 6 sing()
+ }
+ }
- Additionally, there are a few elements available in the `PROCINFO'
-array to provide information about the MPFR and GMP libraries. *Note
-Auto-set::, for more information.
+ 5 {
+ 5 if (/foo/) { # 2
+ 2 print "if is true"
+ 3 } else {
+ 3 print "else is true"
+ }
+ }
-
-File: gawk.info, Node: Arbitrary Precision Floats, Next: Arbitrary Precision Integers, Prev: Gawk and MPFR, Up: Arbitrary Precision Arithmetic
+ # END block(s)
-11.4 Arbitrary Precision Floating-point Arithmetic with `gawk'
-==============================================================
+ END {
+ 1 print "First END rule"
+ 1 print "Second END rule"
+ }
-`gawk' uses the GNU MPFR library for arbitrary precision floating-point
-arithmetic. The MPFR library provides precise control over precisions
-and rounding modes, and gives correctly rounded reproducible
-platform-independent results. With the command-line option `--bignum'
-or `-M', all floating-point arithmetic operators and numeric functions
-can yield results to any desired precision level supported by MPFR.
-Two built-in variables `PREC' (*note Setting Precision::) and
-`ROUNDMODE' (*note Setting Rounding Mode::) provide control over the
-working precision and the rounding mode. The precision and the
-rounding mode are set globally for every operation to follow.
+ # Functions, listed alphabetically
- The default working precision for arbitrary precision floating-point
-values is 53, and the default value for `ROUNDMODE' is `"N"', which
-selects the IEEE-754 `roundTiesToEven' (*note Rounding Mode::) rounding
-mode.(1) `gawk' uses the default exponent range in MPFR (EMAX = 2^30 -
-1, EMIN = -EMAX) for all floating-point contexts. There is no explicit
-mechanism to adjust the exponent range. MPFR does not implement
-subnormal numbers by default, and this behavior cannot be changed in
-`gawk'.
+ 6 function sing(dummy)
+ {
+ 6 print "I gotta be me!"
+ }
- NOTE: When emulating an IEEE-754 format (*note Setting
- Precision::), `gawk' internally adjusts the exponent range to the
- value defined for the format and also performs computations needed
- for gradual underflow (subnormal numbers).
+ This example illustrates many of the basic features of profiling
+output. They are as follows:
- NOTE: MPFR numbers are variable-size entities, consuming only as
- much space as needed to store the significant digits. Since the
- performance using MPFR numbers pales in comparison to doing
- arithmetic using the underlying machine types, you should consider
- using only as much precision as needed by your program.
+ * The program is printed in the order `BEGIN' rule, `BEGINFILE' rule,
+ pattern/action rules, `ENDFILE' rule, `END' rule and functions,
+ listed alphabetically. Multiple `BEGIN' and `END' rules are
+ merged together, as are multiple `BEGINFILE' and `ENDFILE' rules.
-* Menu:
+ * Pattern-action rules have two counts. The first count, to the
+ left of the rule, shows how many times the rule's pattern was
+ _tested_. The second count, to the right of the rule's opening
+ left brace in a comment, shows how many times the rule's action
+ was _executed_. The difference between the two indicates how many
+ times the rule's pattern evaluated to false.
-* Setting Precision:: Setting the working precision.
-* Setting Rounding Mode:: Setting the rounding mode.
-* Floating-point Constants:: Representing floating-point constants.
-* Changing Precision:: Changing the precision of a number.
-* Exact Arithmetic:: Exact arithmetic with floating-point numbers.
+ * Similarly, the count for an `if'-`else' statement shows how many
+ times the condition was tested. To the right of the opening left
+ brace for the `if''s body is a count showing how many times the
+ condition was true. The count for the `else' indicates how many
+ times the test failed.
- ---------- Footnotes ----------
+ * The count for a loop header (such as `for' or `while') shows how
+ many times the loop test was executed. (Because of this, you
+ can't just look at the count on the first statement in a rule to
+ determine how many times the rule was executed. If the first
+ statement is a loop, the count is misleading.)
- (1) The default precision is 53, since according to the MPFR
-documentation, the library should be able to exactly reproduce all
-computations with double-precision machine floating-point numbers
-(`double' type in C), except the default exponent range is much wider
-and subnormal numbers are not implemented.
+ * For user-defined functions, the count next to the `function'
+ keyword indicates how many times the function was called. The
+ counts next to the statements in the body show how many times
+ those statements were executed.
-
-File: gawk.info, Node: Setting Precision, Next: Setting Rounding Mode, Up: Arbitrary Precision Floats
+ * The layout uses "K&R" style with TABs. Braces are used
+ everywhere, even when the body of an `if', `else', or loop is only
+ a single statement.
-11.4.1 Setting the Working Precision
-------------------------------------
+ * Parentheses are used only where needed, as indicated by the
+ structure of the program and the precedence rules. For example,
+ `(3 + 5) * 4' means add three plus five, then multiply the total
+ by four. However, `3 + 5 * 4' has no parentheses, and means `3 +
+ (5 * 4)'.
-`gawk' uses a global working precision; it does not keep track of the
-precision or accuracy of individual numbers. Performing an arithmetic
-operation or calling a built-in function rounds the result to the
-current working precision. The default working precision is 53 which
-can be modified using the built-in variable `PREC'. You can also set the
-value to one of the following pre-defined case-insensitive strings to
-emulate an IEEE-754 binary format:
+ * Parentheses are used around the arguments to `print' and `printf'
+ only when the `print' or `printf' statement is followed by a
+ redirection. Similarly, if the target of a redirection isn't a
+ scalar, it gets parenthesized.
-`PREC' IEEE-754 Binary Format
----------------------------------------------------
-`"half"' 16-bit half-precision.
-`"single"' Basic 32-bit single precision.
-`"double"' Basic 64-bit double precision.
-`"quad"' Basic 128-bit quadruple precision.
-`"oct"' 256-bit octuple precision.
+ * `gawk' supplies leading comments in front of the `BEGIN' and `END'
+ rules, the pattern/action rules, and the functions.
- The following example illustrates the effects of changing precision
-on arithmetic operations:
- $ gawk -M -vPREC=100 'BEGIN { x = 1.0e-400; print x + 0; \
- > PREC = "double"; print x + 0 }'
- -| 1e-400
- -| 0
+ The profiled version of your program may not look exactly like what
+you typed when you wrote it. This is because `gawk' creates the
+profiled version by "pretty printing" its internal representation of
+the program. The advantage to this is that `gawk' can produce a
+standard representation. The disadvantage is that all source-code
+comments are lost, as are the distinctions among multiple `BEGIN',
+`END', `BEGINFILE', and `ENDFILE' rules. Also, things such as:
- Binary and decimal precisions are related approximately according to
-the formula:
+ /foo/
- PREC = 3.322 * DPS
+come out as:
-Here, PREC denotes the binary precision (measured in bits) and DPS
-(short for decimal places) is the decimal digits. We can easily
-calculate how many decimal digits the 53-bit significand of an IEEE
-double is equivalent to: 53 / 3.322 which is equal to about 15.95. But
-what does 15.95 digits actually mean? It depends whether you are
-concerned about how many digits you can rely on, or how many digits you
-need.
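The relationship between binary and decimal precision can be checked numerically. The sketch below assumes the factor 3.322 is an approximation of log(10)/log(2); the names `precision_table` and `bits_per_digit` are hypothetical, and plain `awk` is assumed to behave like `gawk` for this POSIX-only code.

```shell
# Hypothetical demo: convert between bits of precision and decimal digits.
precision_table() {
    awk 'BEGIN {
        bits_per_digit = log(10) / log(2)         # about 3.322
        printf("%.2f\n", 53 / bits_per_digit)     # digits in a 53-bit significand
        printf("%.2f\n", 100 * bits_per_digit)    # bits for 100 decimal digits
    }'
}

precision_table
```

The first line of output is the number of decimal digits equivalent to 53 bits, and the second is the number of bits corresponding to 100 decimal digits.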
+ /foo/ {
+ print $0
+ }
- It is important to know how many bits it takes to uniquely identify
-a double-precision value (the C type `double'). If you want to convert
-from `double' to decimal and back to `double' (e.g., saving a `double'
-representing an intermediate result to a file, and later reading it
-back to restart the computation), then a few more decimal digits are
-required. 17 digits is generally enough for a `double'.
+which is correct, but possibly surprising.
- It can also be important to know what decimal numbers can be uniquely
-represented with a `double'. If you want to convert from decimal to
-`double' and back again, 15 digits is the most that you can get. Stated
-differently, you should not present the numbers from your
-floating-point computations with more than 15 significant digits in
-them.
+ Besides creating profiles when a program has completed, `gawk' can
+produce a profile while it is running. This is useful if your `awk'
+program goes into an infinite loop and you want to see what has been
+executed. To use this feature, run `gawk' with the `--profile' option
+in the background:
- Conversely, it takes a precision of 332 bits to hold an approximation
-of the constant pi that is accurate to 100 decimal places. You should
-always add some extra bits in order to avoid the confusing round-off
-issues that occur because numbers are stored internally in binary.
+ $ gawk --profile -f myprog &
+ [1] 13992
-
-File: gawk.info, Node: Setting Rounding Mode, Next: Floating-point Constants, Prev: Setting Precision, Up: Arbitrary Precision Floats
+The shell prints a job number and process ID number; in this case,
+13992. Use the `kill' command to send the `USR1' signal to `gawk':
-11.4.2 Setting the Rounding Mode
---------------------------------
+ $ kill -USR1 13992
-The `ROUNDMODE' variable provides program level control over the
-rounding mode. The correspondence between `ROUNDMODE' and the IEEE
-rounding modes is shown in *note table-gawk-rounding-modes::.
+As usual, the profiled version of the program is written to
+`awkprof.out', or to a different file if one is specified with the
+`--profile' option.
-Rounding Mode IEEE Name `ROUNDMODE'
----------------------------------------------------------------------------
-Round to nearest, ties to even `roundTiesToEven' `"N"' or `"n"'
-Round toward plus Infinity `roundTowardPositive' `"U"' or `"u"'
-Round toward negative Infinity `roundTowardNegative' `"D"' or `"d"'
-Round toward zero `roundTowardZero' `"Z"' or `"z"'
-Round to nearest, ties away `roundTiesToAway' `"A"' or `"a"'
-from zero
+ Along with the regular profile, as shown earlier, the profile
+includes a trace of any active functions:
-Table 11.3: `gawk' Rounding Modes
+ # Function Call Stack:
- `ROUNDMODE' has the default value `"N"', which selects the IEEE-754
-rounding mode `roundTiesToEven'. Besides the values listed in *note
-Table 11.3: table-gawk-rounding-modes, `gawk' also accepts `"A"' to
-select the IEEE-754 mode `roundTiesToAway' if your version of the MPFR
-library supports it; otherwise setting `ROUNDMODE' to this value has no
-effect. *Note Rounding Mode::, for the meanings of the various rounding
-modes.
+ # 3. baz
+ # 2. bar
+ # 1. foo
+ # -- main --
- Here is an example of how to change the default rounding behavior of
-`printf''s output:
+ You may send `gawk' the `USR1' signal as many times as you like.
+Each time, the profile and function call trace are appended to the
+output profile file.
- $ gawk -M -vROUNDMODE="Z" 'BEGIN { printf("%.2f\n", 1.378) }'
- -| 1.37
+ If you use the `HUP' signal instead of the `USR1' signal, `gawk'
+produces the profile and the function call trace and then exits.
+
+ When `gawk' runs on MS-Windows systems, it uses the `INT' and `QUIT'
+signals for producing the profile and, in the case of the `INT' signal,
+`gawk' exits. This is because these systems don't support the `kill'
+command, so the only signals you can deliver to a program are those
+generated by the keyboard. The `INT' signal is generated by the
+`Ctrl-<C>' or `Ctrl-<BREAK>' key, while the `QUIT' signal is generated
+by the `Ctrl-<\>' key.
+
+ Finally, `gawk' also accepts another option `--pretty-print'. When
+called this way, `gawk' "pretty prints" the program into `awkprof.out',
+without any execution counts.
-File: gawk.info, Node: Floating-point Constants, Next: Changing Precision, Prev: Setting Rounding Mode, Up: Arbitrary Precision Floats
+File: gawk.info, Node: Library Functions, Next: Sample Programs, Prev: Advanced Features, Up: Top
-11.4.3 Representing Floating-point Constants
---------------------------------------------
+12 A Library of `awk' Functions
+*******************************
-Be wary of floating-point constants! When reading a floating-point
-constant from program source code, `gawk' uses the default precision,
-unless overridden by an assignment to the special variable `PREC' on
-the command line, to store it internally as a MPFR number. Changing
-the precision using `PREC' in the program text does not change the
-precision of a constant. If you need to represent a floating-point
-constant at a higher precision than the default and cannot use a
-command line assignment to `PREC', you should either specify the
-constant as a string, or as a rational number whenever possible. The
-following example illustrates the differences among various ways to
-print a floating-point constant:
+*note User-defined::, describes how to write your own `awk' functions.
+Writing functions is important, because it allows you to encapsulate
+algorithms and program tasks in a single place. It simplifies
+programming, making program development more manageable, and making
+programs more readable.
- $ gawk -M 'BEGIN { PREC = 113; printf("%0.25f\n", 0.1) }'
- -| 0.1000000000000000055511151
- $ gawk -M -vPREC=113 'BEGIN { printf("%0.25f\n", 0.1) }'
- -| 0.1000000000000000000000000
- $ gawk -M 'BEGIN { PREC = 113; printf("%0.25f\n", "0.1") }'
- -| 0.1000000000000000000000000
- $ gawk -M 'BEGIN { PREC = 113; printf("%0.25f\n", 1/10) }'
- -| 0.1000000000000000000000000
+ One valuable way to learn a new programming language is to _read_
+programs in that language. To that end, this major node and *note
+Sample Programs::, provide a good-sized body of code for you to read,
+and hopefully, to learn from.
- In the first case, the number is stored with the default precision
-of 53.
+ This major node presents a library of useful `awk' functions. Many
+of the sample programs presented later in this Info file use these
+functions. The functions are presented here in a progression from
+simple to complex.
-
-File: gawk.info, Node: Changing Precision, Next: Exact Arithmetic, Prev: Floating-point Constants, Up: Arbitrary Precision Floats
+ *note Extract Program::, presents a program that you can use to
+extract the source code for these example library functions and
+programs from the Texinfo source for this Info file. (This has already
+been done as part of the `gawk' distribution.)
-11.4.4 Changing the Precision of a Number
------------------------------------------
+ If you have written one or more useful, general-purpose `awk'
+functions and would like to contribute them to the `awk' user
+community, see *note How To Contribute::, for more information.
- The point is that in any variable-precision package, a decision is
- made on how to treat numbers given as data, or arising in
- intermediate results, which are represented in floating-point
- format to a precision lower than working precision. Do we promote
- them to full membership of the high-precision club, or do we treat
- them and all their associates as second-class citizens? Sometimes
- the first course is proper, sometimes the second, and it takes
- careful analysis to tell which.
+ The programs in this major node and in *note Sample Programs::,
+freely use features that are `gawk'-specific. Rewriting these programs
+for different implementations of `awk' is pretty straightforward.
- Dirk Laurie(1)
+ * Diagnostic error messages are sent to `/dev/stderr'. Use `| "cat
+ 1>&2"' instead of `> "/dev/stderr"' if your system does not have a
+ `/dev/stderr', or if you cannot use `gawk'.
- `gawk' does not implicitly modify the precision of any previously
-computed results when the working precision is changed with an
-assignment to `PREC'. The precision of a number is always the one that
-was used at the time of its creation, and there is no way for the user
-to explicitly change it afterwards. However, since the result of a
-floating-point arithmetic operation is always an arbitrary precision
-floating-point value--with a precision set by the value of `PREC'--one
-of the following workarounds effectively accomplishes the desired
-behavior:
+ * A number of programs use `nextfile' (*note Nextfile Statement::)
+ to skip any remaining input in the input file.
- x = x + 0.0
+ * Finally, some of the programs choose to ignore upper- and lowercase
+ distinctions in their input. They do so by assigning one to
+ `IGNORECASE'. You can achieve almost the same effect(1) by adding
+ the following rule to the beginning of the program:
-or:
+ # ignore case
+ { $0 = tolower($0) }
- x += 0.0
+ Also, verify that all regexp and string constants used in
+ comparisons use only lowercase letters.
+
+* Menu:
+
+* Library Names:: How to best name private global variables in
+ library functions.
+* General Functions:: Functions that are of general use.
+* Data File Management:: Functions for managing command-line data
+ files.
+* Getopt Function:: A function for processing command-line
+ arguments.
+* Passwd Functions:: Functions for getting user information.
+* Group Functions:: Functions for getting group information.
+* Walking Arrays:: A function to walk arrays of arrays.
---------- Footnotes ----------
- (1) Dirk Laurie. `Variable-precision Arithmetic Considered Perilous
--- A Detective Story'. Electronic Transactions on Numerical Analysis.
-Volume 28, pp. 168-173, 2008.
+ (1) The effects are not identical. Output of the transformed record
+will be in all lowercase, while `IGNORECASE' preserves the original
+contents of the input record.
-File: gawk.info, Node: Exact Arithmetic, Prev: Changing Precision, Up: Arbitrary Precision Floats
+File: gawk.info, Node: Library Names, Next: General Functions, Up: Library Functions
-11.4.5 Exact Arithmetic with Floating-point Numbers
----------------------------------------------------
+12.1 Naming Library Function Global Variables
+=============================================
- CAUTION: Never depend on the exactness of floating-point
- arithmetic, even for apparently simple expressions!
+Due to the way the `awk' language evolved, variables are either
+"global" (usable by the entire program) or "local" (usable just by a
+specific function). There is no intermediate state analogous to
+`static' variables in C.
- Can arbitrary precision arithmetic give exact results? There are no
-easy answers. The standard rules of algebra often do not apply when
-using floating-point arithmetic. Among other things, the distributive
-and associative laws do not hold completely, and order of operation may
-be important for your computation. Rounding error, cumulative precision
-loss and underflow are often troublesome.
+ Library functions often need to have global variables that they can
+use to preserve state information between calls to the function--for
+example, `getopt()''s variable `_opti' (*note Getopt Function::). Such
+variables are called "private", since the only functions that need to
+use them are the ones in the library.
- When `gawk' tests the expressions `0.1 + 12.2' and `12.3' for
-equality using the machine double precision arithmetic, it decides that
-they are not equal! (*Note Floating-point Programming::.) You can get
-the result you want by increasing the precision; 56 in this case will
-get the job done:
+ When writing a library function, you should try to choose names for
+your private variables that will not conflict with any variables used by
+either another library function or a user's main program. For example,
+a name like `i' or `j' is not a good choice, because user programs
+often use variable names like these for their own purposes.
- $ gawk -M -vPREC=56 'BEGIN { print (0.1 + 12.2 == 12.3) }'
- -| 1
+ The example programs shown in this major node all start the names of
+their private variables with an underscore (`_'). Users generally
+don't use leading underscores in their variable names, so this
+convention immediately decreases the chances that the variable name
+will be accidentally shared with the user's program.
- If adding more bits is good, perhaps adding even more bits of
-precision is better? Here is what happens if we use an even larger
-value of `PREC':
+ In addition, several of the library functions use a prefix that helps
+indicate what function or set of functions use the variables--for
+example, `_pw_byname' in the user database routines (*note Passwd
+Functions::). This convention is recommended, since it even further
+decreases the chance of inadvertent conflict among variable names.
+Note that this convention is used equally well for variable names and
+for private function names.(1)
- $ gawk -M -vPREC=201 'BEGIN { print (0.1 + 12.2 == 12.3) }'
- -| 0
+ As a final note on variable naming, if a function makes global
+variables available for use by a main program, it is a good convention
+to start that variable's name with a capital letter--for example,
+`getopt()''s `Opterr' and `Optind' variables (*note Getopt Function::).
+The leading capital letter indicates that it is global, while the fact
+that the variable name is not all capital letters indicates that the
+variable is not one of `awk''s built-in variables, such as `FS'.
- This is not a bug in `gawk' or in the MPFR library. It is easy to
-forget that the finite number of bits used to store the value is often
-just an approximation after proper rounding. The test for equality
-succeeds if and only if _all_ bits in the two operands are exactly the
-same. Since this is not necessarily true after floating-point
-computations with a particular precision and effective rounding rule, a
-straight test for equality may not work.
+ It is also important that _all_ variables in library functions that
+do not need to save state are, in fact, declared local.(2) If this is
+not done, the variable could accidentally be used in the user's
+program, leading to bugs that are very difficult to track down:
- So, don't assume that floating-point values can be compared for
-equality. You should also exercise caution when using other forms of
-comparisons. The standard way to compare between floating-point
-numbers is to determine how much error (or "tolerance") you will allow
-in a comparison and check to see if one value is within this error
-range of the other.
+ function lib_func(x, y, l1, l2)
+ {
+ ...
+ USE VARIABLE some_var # some_var should be local
+ ... # but is not by oversight
+ }
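The oversight described above is easy to reproduce. The following is a minimal sketch, not taken from the `gawk' distribution; the names `leak_demo', `sum_leaky', and `sum_safe' are made up for illustration. It wraps a small `awk' program in a shell function and shows that a loop counter left out of the parameter list overwrites the caller's variable of the same name:

```shell
# Hypothetical demo: a loop counter that leaks out of a library function.
leak_demo() {
    awk '
    # i is NOT in the parameter list, so it is a global variable
    function sum_leaky(n,    total) {
        for (i = 1; i <= n; i++)
            total += i
        return total
    }

    # i is listed as an extra parameter, so it is local to the function
    function sum_safe(n,    total, i) {
        for (i = 1; i <= n; i++)
            total += i
        return total
    }

    BEGIN {
        i = 100                 # the main program uses i for its own purposes
        sum_safe(3)
        print "safe", i         # prints: safe 100
        sum_leaky(3)
        print "leaky", i        # prints: leaky 4
    }'
}

leak_demo
```

After the call to `sum_leaky()', the main program's `i' holds the value the loop left behind. Listing `i' as an extra parameter, as `sum_safe()' does, avoids this.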
- In applications where 15 or fewer decimal places suffice, hardware
-double precision arithmetic can be adequate, and is usually much faster.
-But you do need to keep in mind that every floating-point operation can
-suffer a new rounding error with catastrophic consequences as
-illustrated by our attempt to compute the value of the constant pi
-(*note Floating-point Programming::). Extra precision can greatly
-enhance the stability and the accuracy of your computation in such
-cases.
+ A different convention, common in the Tcl community, is to use a
+single associative array to hold the values needed by the library
+function(s), or "package." This significantly decreases the number of
+actual global names in use. For example, the functions described in
+*note Passwd Functions::, might have used array elements
+`PW_data["inited"]', `PW_data["total"]', `PW_data["count"]', and
+`PW_data["awklib"]', instead of `_pw_inited', `_pw_awklib', `_pw_total',
+and `_pw_count'.
- Repeated addition is not necessarily equivalent to multiplication in
-floating-point arithmetic. In the example in *note Floating-point
-Programming:::
+ The conventions presented in this minor node are exactly that:
+conventions. You are not required to write your programs this way--we
+merely recommend that you do so.
- $ gawk 'BEGIN {
- > for (d = 1.1; d <= 1.5; d += 0.1)
- > i++
- > print i
- > }'
- -| 4
+ ---------- Footnotes ----------
-you may or may not succeed in getting the correct result by choosing an
-arbitrarily large value for `PREC'. Reformulation of the problem at
-hand is often the correct approach in such situations.
+ (1) While all the library routines could have been rewritten to use
+this convention, this was not done, in order to show how our own `awk'
+programming style has evolved and to provide some basis for this
+discussion.
+
+ (2) `gawk''s `--dump-variables' command-line option is useful for
+verifying this.
-File: gawk.info, Node: Arbitrary Precision Integers, Prev: Arbitrary Precision Floats, Up: Arbitrary Precision Arithmetic
+File: gawk.info, Node: General Functions, Next: Data File Management, Prev: Library Names, Up: Library Functions
-11.5 Arbitrary Precision Integer Arithmetic with `gawk'
-=======================================================
+12.2 General Programming
+========================
-If the option `--bignum' or `-M' is specified, `gawk' performs all
-integer arithmetic using GMP arbitrary precision integers. Any number
-that looks like an integer in a program source or data file is stored
-as an arbitrary precision integer. The size of the integer is limited
-only by your computer's memory. The current floating-point context has
-no effect on operations involving integers. For example, the following
-computes 5^4^3^2, the result of which is beyond the limits of ordinary
-`gawk' numbers:
-
- $ gawk -M 'BEGIN {
- > x = 5^4^3^2
- > print "# of digits =", length(x)
- > print substr(x, 1, 20), "...", substr(x, length(x) - 19, 20)
- > }'
- -| # of digits = 183231
- -| 62060698786608744707 ... 92256259918212890625
-
- If you were to compute the same value using arbitrary precision
-floating-point values instead, the precision needed for correct output
-(using the formula `prec = 3.322 * dps'), would be 3.322 x 183231, or
-608693. (Thus, the floating-point representation requires over 30
-times as many decimal digits!)
+This minor node presents a number of functions that are of general
+programming use.
- The result from an arithmetic operation with an integer and a
-floating-point value is a floating-point value with a precision equal
-to the working precision. The following program calculates the eighth
-term in Sylvester's sequence(1) using a recurrence:
+* Menu:
- $ gawk -M 'BEGIN {
- > s = 2.0
- > for (i = 1; i <= 7; i++)
- > s = s * (s - 1) + 1
- > print s
- > }'
- -| 113423713055421845118910464
+* Strtonum Function:: A replacement for the built-in
+ `strtonum()' function.
+* Assert Function:: A function for assertions in `awk'
+ programs.
+* Round Function:: A function for rounding if `sprintf()'
+ does not do it correctly.
+* Cliff Random Function:: The Cliff Random Number Generator.
+* Ordinal Functions:: Functions for using characters as numbers and
+ vice versa.
+* Join Function:: A function to join an array into a string.
+* Getlocaltime Function:: A function to get formatted times.
- The output differs from the actual number,
-113423713055421844361000443, because the default precision of 53 is not
-enough to represent the floating-point results exactly. You can either
-increase the precision (100 is enough in this case), or replace the
-floating-point constant `2.0' with an integer, to perform all
-computations using integer arithmetic to get the correct output.
+
+File: gawk.info, Node: Strtonum Function, Next: Assert Function, Up: General Functions
- It will sometimes be necessary for `gawk' to implicitly convert an
-arbitrary precision integer into an arbitrary precision floating-point
-value. This is primarily because the MPFR library does not always
-provide the relevant interface to process arbitrary precision integers
-or mixed-mode numbers as needed by an operation or function. In such a
-case, the precision is set to the minimum value necessary for exact
-conversion, and the working precision is not used for this purpose. If
-this is not what you need or want, you can employ a subterfuge like
-this:
+12.2.1 Converting Strings To Numbers
+------------------------------------
- gawk -M 'BEGIN { n = 13; print (n + 0.0) % 2.0 }'
+The `strtonum()' function (*note String Functions::) is a `gawk'
+extension. The following function provides an implementation for other
+versions of `awk':
- You can avoid this issue altogether by specifying the number as a
-floating-point value to begin with:
+ # mystrtonum --- convert string to number
- gawk -M 'BEGIN { n = 13.0; print n % 2.0 }'
+ function mystrtonum(str, ret, chars, n, i, k, c)
+ {
+ if (str ~ /^0[0-7]*$/) {
+ # octal
+ n = length(str)
+ ret = 0
+ for (i = 1; i <= n; i++) {
+ c = substr(str, i, 1)
+ if ((k = index("01234567", c)) > 0)
+ k-- # adjust for 1-basing in awk
- Note that for the particular example above, it is likely best to
-just use the following:
+ ret = ret * 8 + k
+ }
+ } else if (str ~ /^0[xX][[:xdigit:]]+$/) {
+ # hexadecimal
+ str = substr(str, 3) # lop off leading 0x
+ n = length(str)
+ ret = 0
+ for (i = 1; i <= n; i++) {
+ c = substr(str, i, 1)
+ c = tolower(c)
+ if ((k = index("0123456789", c)) > 0)
+ k-- # adjust for 1-basing in awk
+ else if ((k = index("abcdef", c)) > 0)
+ k += 9
- gawk -M 'BEGIN { n = 13; print n % 2 }'
+ ret = ret * 16 + k
+ }
+ } else if (str ~ \
+ /^[-+]?([0-9]+([.][0-9]*([Ee][0-9]+)?)?|([.][0-9]+([Ee][-+]?[0-9]+)?))$/) {
+ # decimal number, possibly floating point
+ ret = str + 0
+ } else
+ ret = "NOT-A-NUMBER"
- ---------- Footnotes ----------
+ return ret
+ }
- (1) Weisstein, Eric W. `Sylvester's Sequence'. From MathWorld--A
-Wolfram Web Resource.
-`http://mathworld.wolfram.com/SylvestersSequence.html'
+ # BEGIN { # gawk test harness
+ # a[1] = "25"
+ # a[2] = ".31"
+ # a[3] = "0123"
+ # a[4] = "0xdeadBEEF"
+ # a[5] = "123.45"
+ # a[6] = "1.e3"
+ # a[7] = "1.32"
+ # a[8] = "1.32E2"
+ #
+ # for (i = 1; i in a; i++)
+ # print a[i], strtonum(a[i]), mystrtonum(a[i])
+ # }
-
-File: gawk.info, Node: Advanced Features, Next: Library Functions, Prev: Arbitrary Precision Arithmetic, Up: Top
+ The function first looks for C-style octal numbers (base 8). If the
+input string matches a regular expression describing octal numbers,
+then `mystrtonum()' loops through each character in the string. It
+sets `k' to the index in `"01234567"' of the current octal digit.
+Since the return value is one-based, the `k--' adjusts `k' so it can be
+used in computing the return value.
-12 Advanced Features of `gawk'
-******************************
+ Similar logic applies to the code that checks for and converts a
+hexadecimal value, which starts with `0x' or `0X'. The use of
+`tolower()' simplifies the computation for finding the correct numeric
+value for each hexadecimal digit.
- Write documentation as if whoever reads it is a violent psychopath
- who knows where you live.
- Steve English, as quoted by Peter Langston
+ Finally, if the string matches the (rather complicated) regexp for a
+regular decimal integer or floating-point number, the computation `ret
+= str + 0' lets `awk' convert the value to a number.
- This major node discusses advanced features in `gawk'. It's a bit
-of a "grab bag" of items that are otherwise unrelated to each other.
-First, a command-line option allows `gawk' to recognize nondecimal
-numbers in input data, not just in `awk' programs. Then, `gawk''s
-special features for sorting arrays are presented. Next, two-way I/O,
-discussed briefly in earlier parts of this Info file, is described in
-full detail, along with the basics of TCP/IP networking. Finally,
-`gawk' can "profile" an `awk' program, making it possible to tune it
-for performance.
+ A commented-out test program is included, so that the function can
+be tested with `gawk' and the results compared to the built-in
+`strtonum()' function.
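The octal branch can also be tried on its own. The following is a minimal sketch under stated assumptions: the shell function name `octal_to_decimal' is hypothetical, and the input is assumed to contain only the digits 0 through 7 (no validation is done). It uses the same `index()' technique as `mystrtonum()', subtracting one because `index()' returns a one-based position:

```shell
# Hypothetical helper: convert a C-style octal string to decimal with awk.
# Assumes the argument consists only of the digits 0-7.
octal_to_decimal() {
    awk -v str="$1" '
    BEGIN {
        ret = 0
        n = length(str)
        for (i = 1; i <= n; i++) {
            # index() is one-based: the digit "0" is found at position 1,
            # so subtract one to get the digit value
            k = index("01234567", substr(str, i, 1)) - 1
            ret = ret * 8 + k
        }
        print ret
    }'
}

octal_to_decimal 0123    # prints 83
```

The leading zero contributes nothing to the result, so it does not need to be stripped first.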
- *note Dynamic Extensions::, discusses the ability to dynamically add
-new built-in functions to `gawk'. As this feature is still immature
-and likely to change, its description is relegated to an appendix.
+
+File: gawk.info, Node: Assert Function, Next: Round Function, Prev: Strtonum Function, Up: General Functions
-* Menu:
+12.2.2 Assertions
+-----------------
-* Nondecimal Data:: Allowing nondecimal input data.
-* Array Sorting:: Facilities for controlling array traversal and
- sorting arrays.
-* Two-way I/O:: Two-way communications with another process.
-* TCP/IP Networking:: Using `gawk' for network programming.
-* Profiling:: Profiling your `awk' programs.
+When writing large programs, it is often useful to know that a
+condition or set of conditions is true. Before proceeding with a
+particular computation, you make a statement about what you believe to
+be the case. Such a statement is known as an "assertion". The C
+language provides an `<assert.h>' header file and corresponding
+`assert()' macro that the programmer can use to make assertions. If an
+assertion fails, the `assert()' macro arranges to print a diagnostic
+message describing the condition that should have been true but was
+not, and then it kills the program. In C, using `assert()' looks like this:
-
-File: gawk.info, Node: Nondecimal Data, Next: Array Sorting, Up: Advanced Features
+ #include <assert.h>
-12.1 Allowing Nondecimal Input Data
-===================================
+ int myfunc(int a, double b)
+ {
+ assert(a <= 5 && b >= 17.1);
+ ...
+ }
-If you run `gawk' with the `--non-decimal-data' option, you can have
-nondecimal constants in your input data:
+ If the assertion fails, the program prints a message similar to this:
- $ echo 0123 123 0x123 |
- > gawk --non-decimal-data '{ printf "%d, %d, %d\n",
- > $1, $2, $3 }'
- -| 83, 123, 291
+ prog.c:5: assertion failed: a <= 5 && b >= 17.1
- For this feature to work, write your program so that `gawk' treats
-your data as numeric:
+ The C language makes it possible to turn the condition into a string
+for use in printing the diagnostic message. This is not possible in
+`awk', so this `assert()' function also requires a string version of
+the condition that is being tested. Following is the function:
- $ echo 0123 123 0x123 | gawk '{ print $1, $2, $3 }'
- -| 0123 123 0x123
+ # assert --- assert that a condition is true. Otherwise exit.
-The `print' statement treats its expressions as strings. Although the
-fields can act as numbers when necessary, they are still strings, so
-`print' does not try to treat them numerically. You may need to add
-zero to a field to force it to be treated as a number. For example:
+ function assert(condition, string)
+ {
+ if (! condition) {
+ printf("%s:%d: assertion failed: %s\n",
+ FILENAME, FNR, string) > "/dev/stderr"
+ _assert_exit = 1
+ exit 1
+ }
+ }
- $ echo 0123 123 0x123 | gawk --non-decimal-data '
- > { print $1, $2, $3
- > print $1 + 0, $2 + 0, $3 + 0 }'
- -| 0123 123 0x123
- -| 83 123 291
+ END {
+ if (_assert_exit)
+ exit 1
+ }
- Because it is common to have decimal data with leading zeros, and
-because using this facility could lead to surprising results, the
-default is to leave it disabled. If you want it, you must explicitly
-request it.
+ The `assert()' function tests the `condition' parameter. If it is
+false, it prints a message to standard error, using the `string'
+parameter to describe the failed condition. It then sets the variable
+`_assert_exit' to one and executes the `exit' statement. The `exit'
+statement jumps to the `END' rule. If the `END' rule finds
+`_assert_exit' to be true, it exits immediately.
- CAUTION: _Use of this option is not recommended._ It can break old
- programs very badly. Instead, use the `strtonum()' function to
- convert your data (*note Nondecimal-numbers::). This makes your
- programs easier to write and easier to read, and leads to less
- surprising results.
+ The purpose of the test in the `END' rule is to keep any other `END'
+rules from running. When an assertion fails, the program should exit
+immediately. If no assertions fail, then `_assert_exit' is still false
+when the `END' rule is run normally, and the rest of the program's
+`END' rules execute. For all of this to work correctly, `assert.awk'
+must be the first source file read by `awk'. The function can be used
+in a program in the following way:
-
-File: gawk.info, Node: Array Sorting, Next: Two-way I/O, Prev: Nondecimal Data, Up: Advanced Features
+ function myfunc(a, b)
+ {
+ assert(a <= 5 && b >= 17.1, "a <= 5 && b >= 17.1")
+ ...
+ }
-12.2 Controlling Array Traversal and Array Sorting
-==================================================
+If the assertion fails, you see a message similar to the following:
-`gawk' lets you control the order in which a `for (i in array)' loop
-traverses an array.
+ mydata:1357: assertion failed: a <= 5 && b >= 17.1
- In addition, two built-in functions, `asort()' and `asorti()', let
-you sort arrays based on the array values and indices, respectively.
-These two functions also provide control over the sorting criteria used
-to order the elements during sorting.
-
-* Menu:
+ There is a small problem with this version of `assert()'. An `END'
+rule is automatically added to the program calling `assert()'.
+Normally, if a program consists of just a `BEGIN' rule, the input files
+and/or standard input are not read. However, now that the program has
+an `END' rule, `awk' attempts to read the input data files or standard
+input (*note Using BEGIN/END::), most likely causing the program to
+hang as it waits for input.
-* Controlling Array Traversal:: How to use PROCINFO["sorted_in"].
-* Array Sorting Functions:: How to use `asort()' and `asorti()'.
+ There is a simple workaround to this: make sure that such a `BEGIN'
+rule always ends with an `exit' statement.
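The workaround can be sketched as follows. This is an illustration only, with some assumptions: the shell function `assert_demo' and the variable `x' are hypothetical, the message omits `FILENAME' and `FNR' because they carry no useful value inside a `BEGIN' rule, and output goes through `cat 1>&2' rather than `/dev/stderr' so that the sketch does not depend on `gawk':

```shell
# Hypothetical demo: a BEGIN-only program that uses assert() and ends
# its BEGIN rule with exit, so awk never tries to read input.
assert_demo() {
    awk -v x="$1" '
    function assert(condition, string) {
        if (! condition) {
            printf("assertion failed: %s\n", string) | "cat 1>&2"
            close("cat 1>&2")
            _assert_exit = 1
            exit 1
        }
    }

    BEGIN {
        assert(x <= 5, "x <= 5")
        print "ok"
        exit    # without this exit, the END rule makes awk wait for input
    }

    END {
        if (_assert_exit)
            exit 1
    }'
}

assert_demo 3    # prints "ok" and returns at once, without reading input
```

Calling `assert_demo 7' instead prints the diagnostic on standard error and exits with status 1.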
-File: gawk.info, Node: Controlling Array Traversal, Next: Array Sorting Functions, Up: Array Sorting
+File: gawk.info, Node: Round Function, Next: Cliff Random Function, Prev: Assert Function, Up: General Functions
-12.2.1 Controlling Array Traversal
-----------------------------------
+12.2.3 Rounding Numbers
+-----------------------
-By default, the order in which a `for (i in array)' loop scans an array
-is not defined; it is generally based upon the internal implementation
-of arrays inside `awk'.
+The way `printf' and `sprintf()' (*note Printf::) perform rounding
+often depends upon the system's C `sprintf()' subroutine. On many
+machines, `sprintf()' rounding is "unbiased," which means it doesn't
+always round a trailing `.5' up, contrary to naive expectations. In
+unbiased rounding, `.5' rounds to even, rather than always up, so 1.5
+rounds to 2 but 4.5 rounds to 4. This means that if you are using a
+format that does rounding (e.g., `"%.0f"'), you should check what your
+system does. The following function does traditional rounding; it
+might be useful if your `awk''s `printf' does unbiased rounding:
- Often, though, it is desirable to be able to loop over the elements
-in a particular order that you, the programmer, choose. `gawk' lets
-you do this.
+ # round.awk --- do normal rounding
- *note Controlling Scanning::, describes how you can assign special,
-pre-defined values to `PROCINFO["sorted_in"]' in order to control the
-order in which `gawk' will traverse an array during a `for' loop.
+ function round(x, ival, aval, fraction)
+ {
+ ival = int(x) # integer part, int() truncates
- In addition, the value of `PROCINFO["sorted_in"]' can be a function
-name. This lets you traverse an array based on any custom criterion.
-The array elements are ordered according to the return value of this
-function. The comparison function should be defined with at least four
-arguments:
+ # see if fractional part
+ if (ival == x) # no fraction
+ return ival # ensure no decimals
- function comp_func(i1, v1, i2, v2)
- {
- COMPARE ELEMENTS 1 AND 2 IN SOME FASHION
- RETURN < 0; 0; OR > 0
+ if (x < 0) {
+ aval = -x # absolute value
+ ival = int(aval)
+ fraction = aval - ival
+ if (fraction >= .5)
+ return int(x) - 1 # -2.5 --> -3
+ else
+ return int(x) # -2.3 --> -2
+ } else {
+ fraction = x - ival
+ if (fraction >= .5)
+ return ival + 1
+ else
+ return ival
+ }
}
- Here, I1 and I2 are the indices, and V1 and V2 are the corresponding
-values of the two elements being compared. Either V1 or V2, or both,
-can be arrays if the array being traversed contains subarrays as values.
-(*Note Arrays of Arrays::, for more information about subarrays.) The
-three possible return values are interpreted as follows:
+ # test harness
+ { print $0, round($0) }
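To check what a particular system does, the two kinds of rounding can be printed side by side. The following is a minimal sketch: the shell function `round_compare' is hypothetical, and its `round()' is a condensed variant of the function above (same results, using the sign of the fractional part), not the distributed code. The middle column comes from `"%.0f"' and depends on the system's C library, so no particular value is promised for it:

```shell
# Hypothetical helper: print a value, its "%.0f" rendering, and the result
# of traditional (round-half-away-from-zero) rounding.
round_compare() {
    awk -v x="$1" '
    function round(x,    ival, fraction) {
        ival = int(x)           # int() truncates toward zero
        fraction = x - ival     # carries the sign of x
        if (fraction >= 0.5)
            return ival + 1     #  2.5 -->  3
        if (fraction <= -0.5)
            return ival - 1     # -2.5 --> -3
        return ival
    }

    BEGIN { printf("%s %.0f %d\n", x, x, round(x)) }'
}

round_compare 2.5     # last column is 3
round_compare -2.5    # last column is -3
```

If the middle and last columns differ for inputs such as 2.5 or 4.5, the system's `printf' does unbiased rounding.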
-`comp_func(i1, v1, i2, v2) < 0'
- Index I1 comes before index I2 during loop traversal.
+
+File: gawk.info, Node: Cliff Random Function, Next: Ordinal Functions, Prev: Round Function, Up: General Functions
-`comp_func(i1, v1, i2, v2) == 0'
- Indices I1 and I2 come together but the relative order with
- respect to each other is undefined.
+12.2.4 The Cliff Random Number Generator
+----------------------------------------
-`comp_func(i1, v1, i2, v2) > 0'
- Index I1 comes after index I2 during loop traversal.
+The Cliff random number generator
+(http://mathworld.wolfram.com/CliffRandomNumberGenerator.html) is a
+very simple random number generator that "passes the noise sphere test
+for randomness by showing no structure." It is easily programmed, in
+less than 10 lines of `awk' code:
- Our first comparison function can be used to scan an array in
-numerical order of the indices:
+ # cliff_rand.awk --- generate Cliff random numbers
- function cmp_num_idx(i1, v1, i2, v2)
+ BEGIN { _cliff_seed = 0.1 }
+
+ function cliff_rand()
{
- # numerical index comparison, ascending order
- return (i1 - i2)
+ _cliff_seed = (100 * log(_cliff_seed)) % 1
+ if (_cliff_seed < 0)
+ _cliff_seed = - _cliff_seed
+ return _cliff_seed
}
- Our second function traverses an array based on the string order of
-the element values rather than by indices:
+ This algorithm requires an initial "seed" of 0.1. Each new value
+uses the current seed as input for the calculation. If the built-in
+`rand()' function (*note Numeric Functions::) isn't random enough, you
+might try using this function instead.
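As a rough illustration of how the generator above might be called, here is a minimal sketch. The wrapper name `cliff_demo' and the choice of three draws are assumptions made for the example; the `awk' code is the function from the diff. Exact values depend on the platform's floating-point arithmetic, so none are listed here.

```shell
# Hypothetical wrapper: print three successive Cliff random numbers.
cliff_demo() {
    awk '
        BEGIN { _cliff_seed = 0.1 }    # the seed the algorithm requires

        function cliff_rand()
        {
            _cliff_seed = (100 * log(_cliff_seed)) % 1
            if (_cliff_seed < 0)
                _cliff_seed = - _cliff_seed
            return _cliff_seed
        }

        # each call feeds the previous result back in as the new seed
        BEGIN {
            for (i = 1; i <= 3; i++)
                print cliff_rand()
        }
    '
}
```

Each value falls between 0 and 1, and the sequence is the same on every run because the seed is fixed.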
- function cmp_str_val(i1, v1, i2, v2)
- {
- # string value comparison, ascending order
- v1 = v1 ""
- v2 = v2 ""
- if (v1 < v2)
- return -1
- return (v1 != v2)
- }
+
+File: gawk.info, Node: Ordinal Functions, Next: Join Function, Prev: Cliff Random Function, Up: General Functions
- The third comparison function makes all numbers, and numeric strings
-without any leading or trailing spaces, come out first during loop
-traversal:
+12.2.5 Translating Between Characters and Numbers
+-------------------------------------------------
- function cmp_num_str_val(i1, v1, i2, v2, n1, n2)
- {
- # numbers before string value comparison, ascending order
- n1 = v1 + 0
- n2 = v2 + 0
- if (n1 == v1)
- return (n2 == v2) ? (n1 - n2) : -1
- else if (n2 == v2)
- return 1
- return (v1 < v2) ? -1 : (v1 != v2)
- }
+One commercial implementation of `awk' supplies a built-in function,
+`ord()', which takes a character and returns the numeric value for that
+character in the machine's character set. If the string passed to
+`ord()' has more than one character, only the first one is used.
- Here is a main program to demonstrate how `gawk' behaves using each
-of the previous functions:
+ The inverse of this function is `chr()' (from the function of the
+same name in Pascal), which takes a number and returns the
+corresponding character. Both functions are written very nicely in
+`awk'; there is no real reason to build them into the `awk' interpreter:
- BEGIN {
- data["one"] = 10
- data["two"] = 20
- data[10] = "one"
- data[100] = 100
- data[20] = "two"
+ # ord.awk --- do ord and chr
- f[1] = "cmp_num_idx"
- f[2] = "cmp_str_val"
- f[3] = "cmp_num_str_val"
- for (i = 1; i <= 3; i++) {
- printf("Sort function: %s\n", f[i])
- PROCINFO["sorted_in"] = f[i]
- for (j in data)
- printf("\tdata[%s] = %s\n", j, data[j])
- print ""
- }
- }
+ # Global identifiers:
+ # _ord_: numerical values indexed by characters
+ # _ord_init: function to initialize _ord_
- Here are the results when the program is run:
+ BEGIN { _ord_init() }
- $ gawk -f compdemo.awk
- -| Sort function: cmp_num_idx Sort by numeric index
- -| data[two] = 20
- -| data[one] = 10 Both strings are numerically zero
- -| data[10] = one
- -| data[20] = two
- -| data[100] = 100
- -|
- -| Sort function: cmp_str_val Sort by element values as strings
- -| data[one] = 10
- -| data[100] = 100 String 100 is less than string 20
- -| data[two] = 20
- -| data[10] = one
- -| data[20] = two
- -|
- -| Sort function: cmp_num_str_val Sort all numeric values before all strings
- -| data[one] = 10
- -| data[two] = 20
- -| data[100] = 100
- -| data[10] = one
- -| data[20] = two
+ function _ord_init( low, high, i, t)
+ {
+ low = sprintf("%c", 7) # BEL is ascii 7
+ if (low == "\a") { # regular ascii
+ low = 0
+ high = 127
+ } else if (sprintf("%c", 128 + 7) == "\a") {
+ # ascii, mark parity
+ low = 128
+ high = 255
+ } else { # ebcdic(!)
+ low = 0
+ high = 255
+ }
- Consider sorting the entries of a GNU/Linux system password file
-according to login name. The following program sorts records by a
-specific field position and can be used for this purpose:
+ for (i = low; i <= high; i++) {
+ t = sprintf("%c", i)
+ _ord_[t] = i
+ }
+ }
- # sort.awk --- simple program to sort by field position
- # field position is specified by the global variable POS
+ Some explanation of the numbers used by `chr' is worthwhile. The
+most prominent character set in use today is ASCII.(1) Although an
+8-bit byte can hold 256 distinct values (from 0 to 255), ASCII only
+defines characters that use the values from 0 to 127.(2) In the now
+distant past, at least one minicomputer manufacturer used ASCII, but
+with mark parity, meaning that the leftmost bit in the byte is always
+1. This means that on those systems, characters have numeric values
+from 128 to 255. Finally, large mainframe systems use the EBCDIC
+character set, which uses all 256 values. While there are other
+character sets in use on some older systems, they are not really worth
+worrying about:
- function cmp_field(i1, v1, i2, v2)
+ function ord(str, c)
{
- # comparison by value, as string, and ascending order
- return v1[POS] < v2[POS] ? -1 : (v1[POS] != v2[POS])
+ # only first character is of interest
+ c = substr(str, 1, 1)
+ return _ord_[c]
}
+ function chr(c)
{
- for (i = 1; i <= NF; i++)
- a[NR][i] = $i
+ # force c to be numeric by adding 0
+ return sprintf("%c", c + 0)
}
- END {
- PROCINFO["sorted_in"] = "cmp_field"
- if (POS < 1 || POS > NF)
- POS = 1
- for (i in a) {
- for (j = 1; j <= NF; j++)
- printf("%s%c", a[i][j], j < NF ? ":" : "")
- print ""
- }
- }
+ #### test code ####
+ # BEGIN \
+ # {
+ # for (;;) {
+ # printf("enter a character: ")
+ # if (getline var <= 0)
+ # break
+ # printf("ord(%s) = %d\n", var, ord(var))
+ # }
+ # }
- The first field in each entry of the password file is the user's
-login name, and the fields are separated by colons. Each record
-defines a subarray, with each field as an element in the subarray.
-Running the program produces the following output:
-
- $ gawk -vPOS=1 -F: -f sort.awk /etc/passwd
- -| adm:x:3:4:adm:/var/adm:/sbin/nologin
- -| apache:x:48:48:Apache:/var/www:/sbin/nologin
- -| avahi:x:70:70:Avahi daemon:/:/sbin/nologin
- ...
+ An obvious improvement to these functions is to move the code for the
+`_ord_init' function into the body of the `BEGIN' rule. It was written
+this way initially for ease of development. There is a "test program"
+in a `BEGIN' rule, to test the function. It is commented out for
+production use.
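A short usage sketch for `ord()' and `chr()' follows. It assumes a plain ASCII system, so the initialization is reduced to a single loop over the values 1 to 127 instead of the full `_ord_init()' probing shown in the diff; the wrapper name `ord_demo' is hypothetical.

```shell
# Hypothetical wrapper: exercise ord() and chr() on an ASCII system.
ord_demo() {
    awk '
        BEGIN {
            # Assumption: ASCII only, so map codes 1..127 directly.
            for (i = 1; i <= 127; i++)
                _ord_[sprintf("%c", i)] = i
        }

        function ord(str,    c)
        {
            c = substr(str, 1, 1)        # only the first character counts
            return _ord_[c]
        }

        function chr(c)
        {
            return sprintf("%c", c + 0)  # adding 0 forces a numeric value
        }

        BEGIN { print ord("A"), ord("abc"), chr(66) }
    '
}
```

On an ASCII system this prints `65 97 B': only the first character of `"abc"' is used, and `chr(66)' is the inverse of `ord("B")'.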
- The comparison should normally always return the same value when
-given a specific pair of array elements as its arguments. If
-inconsistent results are returned then the order is undefined. This
-behavior can be exploited to introduce random order into otherwise
-seemingly ordered data:
+ ---------- Footnotes ----------
- function cmp_randomize(i1, v1, i2, v2)
- {
- # random order
- return (2 - 4 * rand())
- }
+ (1) This is changing; many systems use Unicode, a very large
+character set that includes ASCII as a subset. On systems with full
+Unicode support, a character can occupy up to 32 bits, making simple
+tests such as used here prohibitively expensive.
- As mentioned above, the order of the indices is arbitrary if two
-elements compare equal. This is usually not a problem, but letting the
-tied elements come out in arbitrary order can be an issue, especially
-when comparing item values. The partial ordering of the equal elements
-may change during the next loop traversal, if other elements are added
-or removed from the array. One way to resolve ties when comparing
-elements with otherwise equal values is to include the indices in the
-comparison rules. Note that doing this may make the loop traversal
-less efficient, so consider it only if necessary. The following
-comparison functions force a deterministic order, and are based on the
-fact that the indices of two elements are never equal:
+ (2) ASCII has been extended in many countries to use the values from
+128 to 255 for country-specific characters. If your system uses these
+extensions, you can simplify `_ord_init' to loop from 0 to 255.
- function cmp_numeric(i1, v1, i2, v2)
- {
- # numerical value (and index) comparison, descending order
- return (v1 != v2) ? (v2 - v1) : (i2 - i1)
- }
+
+File: gawk.info, Node: Join Function, Next: Getlocaltime Function, Prev: Ordinal Functions, Up: General Functions
- function cmp_string(i1, v1, i2, v2)
- {
- # string value (and index) comparison, descending order
- v1 = v1 i1
- v2 = v2 i2
- return (v1 > v2) ? -1 : (v1 != v2)
- }
+12.2.6 Merging an Array into a String
+-------------------------------------
- A custom comparison function can often simplify ordered loop
-traversal, and the sky is really the limit when it comes to designing
-such a function.
+When doing string processing, it is often useful to be able to join all
+the strings in an array into one long string. The following function,
+`join()', accomplishes this task. It is used later in several of the
+application programs (*note Sample Programs::).
- When string comparisons are made during a sort, either for element
-values where one or both aren't numbers, or for element indices handled
-as strings, the value of `IGNORECASE' (*note Built-in Variables::)
-controls whether the comparisons treat corresponding uppercase and
-lowercase letters as equivalent or distinct.
+ Good function design is important; this function needs to be general
+but it should also have a reasonable default behavior. It is called
+with an array as well as the beginning and ending indices of the
+elements in the array to be merged. This assumes that the array
+indices are numeric--a reasonable assumption since the array was likely
+created with `split()' (*note String Functions::):
- Another point to keep in mind is that in the case of subarrays the
-element values can themselves be arrays; a production comparison
-function should use the `isarray()' function (*note Type Functions::),
-to check for this, and choose a defined sorting order for subarrays.
+ # join.awk --- join an array into a string
- All sorting based on `PROCINFO["sorted_in"]' is disabled in POSIX
-mode, since the `PROCINFO' array is not special in that case.
+ function join(array, start, end, sep, result, i)
+ {
+ if (sep == "")
+ sep = " "
+ else if (sep == SUBSEP) # magic value
+ sep = ""
+ result = array[start]
+ for (i = start + 1; i <= end; i++)
+ result = result sep array[i]
+ return result
+ }
- As a side note, sorting the array indices before traversing the
-array has been reported to add 15% to 20% overhead to the execution
-time of `awk' programs. For this reason, sorted array traversal is not
-the default.
+ An optional additional argument is the separator to use when joining
+the strings back together. If the caller supplies a nonempty value,
+`join()' uses it; if it is not supplied, it has a null value. In this
+case, `join()' uses a single space as a default separator for the
+strings. If the value is equal to `SUBSEP', then `join()' joins the
+strings with no separator between them. `SUBSEP' serves as a "magic"
+value to indicate that there should be no separation between the
+component strings.(1)
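The three separator cases described above can be seen side by side in a small sketch. The wrapper name `join_demo' and the sample input are assumptions for illustration; `join()' itself is copied from the diff.

```shell
# Hypothetical wrapper: show the default, explicit, and "magic" separators.
join_demo() {
    awk '
        function join(array, start, end, sep,    result, i)
        {
            if (sep == "")
                sep = " "
            else if (sep == SUBSEP) # magic value
                sep = ""
            result = array[start]
            for (i = start + 1; i <= end; i++)
                result = result sep array[i]
            return result
        }

        BEGIN {
            n = split("a b c", parts, " ")
            print join(parts, 1, n)           # no separator given: one space
            print join(parts, 1, n, "-")      # caller-supplied separator
            print join(parts, 1, n, SUBSEP)   # magic value: no separator
        }
    '
}
```

This prints `a b c', `a-b-c', and `abc' on three lines, one for each case.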
-
-File: gawk.info, Node: Array Sorting Functions, Prev: Controlling Array Traversal, Up: Array Sorting
+ ---------- Footnotes ----------
-12.2.2 Sorting Array Values and Indices with `gawk'
----------------------------------------------------
+ (1) It would be nice if `awk' had an assignment operator for
+concatenation. The lack of an explicit operator for concatenation
+makes string operations more difficult than they really need to be.
-In most `awk' implementations, sorting an array requires writing a
-`sort()' function. While this can be educational for exploring
-different sorting algorithms, usually that's not the point of the
-program. `gawk' provides the built-in `asort()' and `asorti()'
-functions (*note String Functions::) for sorting arrays. For example:
+
+File: gawk.info, Node: Getlocaltime Function, Prev: Join Function, Up: General Functions
- POPULATE THE ARRAY data
- n = asort(data)
- for (i = 1; i <= n; i++)
- DO SOMETHING WITH data[i]
+12.2.7 Managing the Time of Day
+-------------------------------
- After the call to `asort()', the array `data' is indexed from 1 to
-some number N, the total number of elements in `data'. (This count is
-`asort()''s return value.) `data[1]' <= `data[2]' <= `data[3]', and so
-on. The comparison is based on the type of the elements (*note Typing
-and Comparison::). All numeric values come before all string values,
-which in turn come before all subarrays.
+The `systime()' and `strftime()' functions described in *note Time
+Functions::, provide the minimum functionality necessary for dealing
+with the time of day in human readable form. While `strftime()' is
+extensive, the control formats are not necessarily easy to remember or
+intuitively obvious when reading a program.
- An important side effect of calling `asort()' is that _the array's
-original indices are irrevocably lost_. As this isn't always
-desirable, `asort()' accepts a second argument:
+ The following function, `getlocaltime()', populates a user-supplied
+array with preformatted time information. It returns a string with the
+current time formatted in the same way as the `date' utility:
- POPULATE THE ARRAY source
- n = asort(source, dest)
- for (i = 1; i <= n; i++)
- DO SOMETHING WITH dest[i]
+ # getlocaltime.awk --- get the time of day in a usable format
- In this case, `gawk' copies the `source' array into the `dest' array
-and then sorts `dest', destroying its indices. However, the `source'
-array is not affected.
+ # Returns a string in the format of output of date(1)
+ # Populates the array argument time with individual values:
+ # time["second"] -- seconds (0 - 59)
+ # time["minute"] -- minutes (0 - 59)
+ # time["hour"] -- hours (0 - 23)
+ # time["althour"] -- hours (1 - 12)
+ # time["monthday"] -- day of month (1 - 31)
+ # time["month"] -- month of year (1 - 12)
+ # time["monthname"] -- name of the month
+ # time["shortmonth"] -- short name of the month
+ # time["year"] -- year modulo 100 (0 - 99)
+ # time["fullyear"] -- full year
+ # time["weekday"] -- day of week (Sunday = 0)
+ # time["altweekday"] -- day of week (Monday = 1)
+ # time["dayname"] -- name of weekday
+ # time["shortdayname"] -- short name of weekday
+ # time["yearday"] -- day of year (1 - 366)
+ # time["timezone"] -- abbreviation of timezone name
+ # time["ampm"] -- AM or PM designation
+ # time["weeknum"] -- week number, Sunday first day
+ # time["altweeknum"] -- week number, Monday first day
- `asort()' accepts a third string argument to control comparison of
-array elements. As with `PROCINFO["sorted_in"]', this argument may be
-one of the predefined names that `gawk' provides (*note Controlling
-Scanning::), or the name of a user-defined function (*note Controlling
-Array Traversal::).
+ function getlocaltime(time, ret, now, i)
+ {
+ # get time once, avoids unnecessary system calls
+ now = systime()
- NOTE: In all cases, the sorted element values consist of the
- original array's element values. The ability to control
- comparison merely affects the way in which they are sorted.
+ # return date(1)-style output
+ ret = strftime("%a %b %e %H:%M:%S %Z %Y", now)
- Often, what's needed is to sort on the values of the _indices_
-instead of the values of the elements. To do that, use the `asorti()'
-function. The interface is identical to that of `asort()', except that
-the index values are used for sorting, and become the values of the
-result array:
+ # clear out target array
+ delete time
- { source[$0] = some_func($0) }
+ # fill in values, force numeric values to be
+ # numeric by adding 0
+ time["second"] = strftime("%S", now) + 0
+ time["minute"] = strftime("%M", now) + 0
+ time["hour"] = strftime("%H", now) + 0
+ time["althour"] = strftime("%I", now) + 0
+ time["monthday"] = strftime("%d", now) + 0
+ time["month"] = strftime("%m", now) + 0
+ time["monthname"] = strftime("%B", now)
+ time["shortmonth"] = strftime("%b", now)
+ time["year"] = strftime("%y", now) + 0
+ time["fullyear"] = strftime("%Y", now) + 0
+ time["weekday"] = strftime("%w", now) + 0
+ time["altweekday"] = strftime("%u", now) + 0
+ time["dayname"] = strftime("%A", now)
+ time["shortdayname"] = strftime("%a", now)
+ time["yearday"] = strftime("%j", now) + 0
+ time["timezone"] = strftime("%Z", now)
+ time["ampm"] = strftime("%p", now)
+ time["weeknum"] = strftime("%U", now) + 0
+ time["altweeknum"] = strftime("%W", now) + 0
- END {
- n = asorti(source, dest)
- for (i = 1; i <= n; i++) {
- Work with sorted indices directly:
- DO SOMETHING WITH dest[i]
- ...
- Access original array via sorted indices:
- DO SOMETHING WITH source[dest[i]]
- }
+ return ret
}
- Similar to `asort()', in all cases, the sorted element values
-consist of the original array's indices. The ability to control
-comparison merely affects the way in which they are sorted.
-
- Sorting the array by replacing the indices provides maximal
-flexibility. To traverse the elements in decreasing order, use a loop
-that goes from N down to 1, either over the elements or over the
-indices.(1)
+ The string indices are easier to use and read than the various
+formats required by `strftime()'. The `alarm' program presented in
+*note Alarm Program::, uses this function. A more general design for
+the `getlocaltime()' function would have allowed the user to supply an
+optional timestamp value to use instead of the current time.
- Copying array indices and elements isn't expensive in terms of
-memory. Internally, `gawk' maintains "reference counts" to data. For
-example, when `asort()' copies the first array to the second one, there
-is only one copy of the original array elements' data, even though both
-arrays use the values.
+
+File: gawk.info, Node: Data File Management, Next: Getopt Function, Prev: General Functions, Up: Library Functions
- Because `IGNORECASE' affects string comparisons, the value of
-`IGNORECASE' also affects sorting for both `asort()' and `asorti()'.
-Note also that the locale's sorting order does _not_ come into play;
-comparisons are based on character values only.(2) Caveat Emptor.
+12.3 Data File Management
+=========================
- ---------- Footnotes ----------
+This minor node presents functions that are useful for managing
+command-line data files.
- (1) You may also use one of the predefined sorting names that sorts
-in decreasing order.
+* Menu:
- (2) This is true because locale-based comparison occurs only when in
-POSIX compatibility mode, and since `asort()' and `asorti()' are `gawk'
-extensions, they are not available in that case.
+* Filetrans Function:: A function for handling data file transitions.
+* Rewind Function:: A function for rereading the current file.
+* File Checking:: Checking that data files are readable.
+* Empty Files:: Checking for zero-length files.
+* Ignoring Assigns:: Treating assignments as file names.
-File: gawk.info, Node: Two-way I/O, Next: TCP/IP Networking, Prev: Array Sorting, Up: Advanced Features
-
-12.3 Two-Way Communications with Another Process
-================================================
+File: gawk.info, Node: Filetrans Function, Next: Rewind Function, Up: Data File Management
- From: address@hidden (Mike Brennan)
- Newsgroups: comp.lang.awk
- Subject: Re: Learn the SECRET to Attract Women Easily
- Date: 4 Aug 1997 17:34:46 GMT
- Message-ID: <address@hidden>
+12.3.1 Noting Data File Boundaries
+----------------------------------
- On 3 Aug 1997 13:17:43 GMT, Want More Dates???
- <address@hidden> wrote:
- >Learn the SECRET to Attract Women Easily
- >
- >The SCENT(tm) Pheromone Sex Attractant For Men to Attract Women
+The `BEGIN' and `END' rules are each executed exactly once at the
+beginning and end of your `awk' program, respectively (*note
+BEGIN/END::). We (the `gawk' authors) once had a user who mistakenly
+thought that the `BEGIN' rule is executed at the beginning of each data
+file and the `END' rule is executed at the end of each data file.
- The scent of awk programmers is a lot more attractive to women than
- the scent of perl programmers.
- --
- Mike Brennan
+ When informed that this was not the case, the user requested that we
+add new special patterns to `gawk', named `BEGIN_FILE' and `END_FILE',
+that would have the desired behavior. He even supplied us the code to
+do so.
- It is often useful to be able to send data to a separate program for
-processing and then read the result. This can always be done with
-temporary files:
+ Adding these special patterns to `gawk' wasn't necessary; the job
+can be done cleanly in `awk' itself, as illustrated by the following
+library program. It arranges to call two user-supplied functions,
+`beginfile()' and `endfile()', at the beginning and end of each data
+file. Besides solving the problem in only nine(!) lines of code, it
+does so _portably_; this works with any implementation of `awk':
- # Write the data for processing
- tempfile = ("mydata." PROCINFO["pid"])
- while (NOT DONE WITH DATA)
- print DATA | ("subprogram > " tempfile)
- close("subprogram > " tempfile)
+ # transfile.awk
+ #
+ # Give the user a hook for filename transitions
+ #
+ # The user must supply functions beginfile() and endfile()
+ # that each take the name of the file being started or
+ # finished, respectively.
- # Read the results, remove tempfile when done
- while ((getline newdata < tempfile) > 0)
- PROCESS newdata APPROPRIATELY
- close(tempfile)
- system("rm " tempfile)
+ FILENAME != _oldfilename \
+ {
+ if (_oldfilename != "")
+ endfile(_oldfilename)
+ _oldfilename = FILENAME
+ beginfile(FILENAME)
+ }
-This works, but not elegantly. Among other things, it requires that
-the program be run in a directory that cannot be shared among users;
-for example, `/tmp' will not do, as another user might happen to be
-using a temporary file with the same name.
+ END { endfile(FILENAME) }
- However, with `gawk', it is possible to open a _two-way_ pipe to
-another process. The second process is termed a "coprocess", since it
-runs in parallel with `gawk'. The two-way connection is created using
-the `|&' operator (borrowed from the Korn shell, `ksh'):(1)
+ This file must be loaded before the user's "main" program, so that
+the rule it supplies is executed first.
- do {
- print DATA |& "subprogram"
- "subprogram" |& getline results
- } while (DATA LEFT TO PROCESS)
- close("subprogram")
+ This rule relies on `awk''s `FILENAME' variable that automatically
+changes for each new data file. The current file name is saved in a
+private variable, `_oldfilename'. If `FILENAME' does not equal
+`_oldfilename', then a new data file is being processed and it is
+necessary to call `endfile()' for the old file. Because `endfile()'
+should only be called if a file has been processed, the program first
+checks to make sure that `_oldfilename' is not the null string. The
+program then assigns the current file name to `_oldfilename' and calls
+`beginfile()' for the file. Because, like all `awk' variables,
+`_oldfilename' is initialized to the null string, this rule executes
+correctly even for the first data file.
- The first time an I/O operation is executed using the `|&' operator,
-`gawk' creates a two-way pipeline to a child process that runs the
-other program. Output created with `print' or `printf' is written to
-the program's standard input, and output from the program's standard
-output can be read by the `gawk' program using `getline'. As is the
-case with processes started by `|', the subprogram can be any program,
-or pipeline of programs, that can be started by the shell.
+ The program also supplies an `END' rule to do the final processing
+for the last file. Because this `END' rule comes before any `END' rules
+supplied in the "main" program, `endfile()' is called first. Once
+again the value of multiple `BEGIN' and `END' rules should be clear.
- There are some cautionary items to be aware of:
+ If the same data file occurs twice in a row on the command line, then
+`endfile()' and `beginfile()' are not executed at the end of the first
+pass and at the beginning of the second pass. The following version
+solves the problem:
- * As the code inside `gawk' currently stands, the coprocess's
- standard error goes to the same place that the parent `gawk''s
- standard error goes. It is not possible to read the child's
- standard error separately.
+ # ftrans.awk --- handle data file transitions
+ #
+ # user supplies beginfile() and endfile() functions
- * I/O buffering may be a problem. `gawk' automatically flushes all
- output down the pipe to the coprocess. However, if the coprocess
- does not flush its output, `gawk' may hang when doing a `getline'
- in order to read the coprocess's results. This could lead to a
- situation known as "deadlock", where each process is waiting for
- the other one to do something.
+ FNR == 1 {
+ if (_filename_ != "")
+ endfile(_filename_)
+ _filename_ = FILENAME
+ beginfile(FILENAME)
+ }
- It is possible to close just one end of the two-way pipe to a
-coprocess, by supplying a second argument to the `close()' function of
-either `"to"' or `"from"' (*note Close Files And Pipes::). These
-strings tell `gawk' to close the end of the pipe that sends data to the
-coprocess or the end that reads from it, respectively.
+ END { endfile(_filename_) }
- This is particularly necessary in order to use the system `sort'
-utility as part of a coprocess; `sort' must read _all_ of its input
-data before it can produce any output. The `sort' program does not
-receive an end-of-file indication until `gawk' closes the write end of
-the pipe.
+ *note Wc Program::, shows how this library function can be used and
+how it simplifies writing the main program.
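To make the calling convention concrete, here is a minimal sketch of a "main" program built on the `FNR == 1' version of the library rule. The wrapper name `ftrans_demo', the file names `a.txt' and `b.txt', and the per-file record counter are all assumptions for illustration; only the library rule and its `END' rule come from the diff.

```shell
# Hypothetical wrapper: count records per file using beginfile()/endfile().
ftrans_demo() {
    dir=$(mktemp -d) || return 1
    printf 'one\ntwo\n' > "$dir/a.txt"
    printf 'three\n'    > "$dir/b.txt"
    (cd "$dir" && awk '
        # user-supplied hooks (illustrative only)
        function beginfile(name) { count = 0 }
        function endfile(name)   { print name, count }

        # library rule from ftrans.awk; it must come before the main rules
        FNR == 1 {
            if (_filename_ != "")
                endfile(_filename_)
            _filename_ = FILENAME
            beginfile(FILENAME)
        }

        END { endfile(_filename_) }

        # main program: count the records of the current file
        { count++ }
    ' a.txt b.txt)
    rm -rf "$dir"
}
```

Running `ftrans_demo' prints `a.txt 2' and then `b.txt 1'. The counter is reset in `beginfile()' and reported in `endfile()', so the main rule needs no per-file bookkeeping of its own.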
- When you have finished writing data to the `sort' utility, you can
-close the `"to"' end of the pipe, and then start reading sorted data
-via `getline'. For example:
+Advanced Notes: So Why Does `gawk' have `BEGINFILE' and `ENDFILE'?
+------------------------------------------------------------------
- BEGIN {
- command = "LC_ALL=C sort"
- n = split("abcdefghijklmnopqrstuvwxyz", a, "")
+You are probably wondering, if `beginfile()' and `endfile()' functions
+can do the job, why does `gawk' have `BEGINFILE' and `ENDFILE' patterns
+(*note BEGINFILE/ENDFILE::)?
- for (i = n; i > 0; i--)
- print a[i] |& command
- close(command, "to")
+ Good question. Normally, if `awk' cannot open a file, this causes
+an immediate fatal error. In this case, there is no way for a
+user-defined function to deal with the problem, since the mechanism for
+calling it relies on the file being open and at the first record. Thus,
+the main reason for `BEGINFILE' is to give you a "hook" to catch files
+that cannot be processed. `ENDFILE' exists for symmetry, and because
+it provides an easy way to do per-file cleanup processing.
- while ((command |& getline line) > 0)
- print "got", line
- close(command)
- }
+
+File: gawk.info, Node: Rewind Function, Next: File Checking, Prev: Filetrans Function, Up: Data File Management
- This program writes the letters of the alphabet in reverse order, one
-per line, down the two-way pipe to `sort'. It then closes the write
-end of the pipe, so that `sort' receives an end-of-file indication.
-This causes `sort' to sort the data and write the sorted data back to
-the `gawk' program. Once all of the data has been read, `gawk'
-terminates the coprocess and exits.
+12.3.2 Rereading the Current File
+---------------------------------
- As a side note, the assignment `LC_ALL=C' in the `sort' command
-ensures traditional Unix (ASCII) sorting from `sort'.
+Another request for a new built-in function was for a `rewind()'
+function that would make it possible to reread the current file. The
+requesting user didn't want to have to use `getline' (*note Getline::)
+inside a loop.
- You may also use pseudo-ttys (ptys) for two-way communication
-instead of pipes, if your system supports them. This is done on a
-per-command basis, by setting a special element in the `PROCINFO' array
-(*note Auto-set::), like so:
+ However, as long as you are not in the `END' rule, it is quite easy
+to arrange to immediately close the current input file and then start
+over with it from the top. For lack of a better name, we'll call it
+`rewind()':
- command = "sort -nr" # command, save in convenience variable
- PROCINFO[command, "pty"] = 1 # update PROCINFO
- print ... |& command # start two-way pipe
- ...
+ # rewind.awk --- rewind the current file and start over
-Using ptys avoids the buffer deadlock issues described earlier, at some
-loss in performance. If your system does not have ptys, or if all the
-system's ptys are in use, `gawk' automatically falls back to using
-regular pipes.
+ function rewind( i)
+ {
+ # shift remaining arguments up
+ for (i = ARGC; i > ARGIND; i--)
+ ARGV[i] = ARGV[i-1]
- ---------- Footnotes ----------
+ # make sure gawk knows to keep going
+ ARGC++
- (1) This is very different from the same operator in the C shell.
+ # make current file next to get done
+ ARGV[ARGIND+1] = FILENAME
-
-File: gawk.info, Node: TCP/IP Networking, Next: Profiling, Prev: Two-way I/O, Up: Advanced Features
+ # do it
+ nextfile
+ }
-12.4 Using `gawk' for Network Programming
-=========================================
+ This code relies on the `ARGIND' variable (*note Auto-set::), which
+is specific to `gawk'. If you are not using `gawk', you can use ideas
+presented in *note Filetrans Function::, to either update `ARGIND' on
+your own or modify this code as appropriate.
- `EMISTERED':
- A host is a host from coast to coast,
- and no-one can talk to host that's close,
- unless the host that isn't close
- is busy hung or dead.
+ The `rewind()' function also relies on the `nextfile' keyword (*note
+Nextfile Statement::).
- In addition to being able to open a two-way pipeline to a coprocess
-on the same system (*note Two-way I/O::), it is possible to make a
-two-way connection to another process on another system across an IP
-network connection.
+
+File: gawk.info, Node: File Checking, Next: Empty Files, Prev: Rewind Function, Up: Data File Management
- You can think of this as just a _very long_ two-way pipeline to a
-coprocess. The way `gawk' decides that you want to use TCP/IP
-networking is by recognizing special file names that begin with one of
-`/inet/', `/inet4/' or `/inet6'.
+12.3.3 Checking for Readable Data Files
+---------------------------------------
- The full syntax of the special file name is
-`/NET-TYPE/PROTOCOL/LOCAL-PORT/REMOTE-HOST/REMOTE-PORT'. The
-components are:
+Normally, if you give `awk' a data file that isn't readable, it stops
+with a fatal error. There are times when you might want to just ignore
+such files and keep going. You can do this by prepending the following
+program to your `awk' program:
-NET-TYPE
- Specifies the kind of Internet connection to make. Use `/inet4/'
- to force IPv4, and `/inet6/' to force IPv6. Plain `/inet/' (which
- used to be the only option) uses the system default, most likely
- IPv4.
+ # readable.awk --- library file to skip over unreadable files
-PROTOCOL
- The protocol to use over IP. This must be either `tcp', or `udp',
- for a TCP or UDP IP connection, respectively. The use of TCP is
- recommended for most applications.
+ BEGIN {
+ for (i = 1; i < ARGC; i++) {
+ if (ARGV[i] ~ /^[[:alpha:]_][[:alnum:]_]*=.*/ \
+ || ARGV[i] == "-" || ARGV[i] == "/dev/stdin")
+ continue # assignment or standard input
+ else if ((getline junk < ARGV[i]) < 0) # unreadable
+ delete ARGV[i]
+ else
+ close(ARGV[i])
+ }
+ }
-LOCAL-PORT
- The local TCP or UDP port number to use. Use a port number of `0'
- when you want the system to pick a port. This is what you should do
- when writing a TCP or UDP client. You may also use a well-known
- service name, such as `smtp' or `http', in which case `gawk'
- attempts to determine the predefined port number using the C
- `getaddrinfo()' function.
+ This works, because the `getline' won't be fatal. Removing the
+element from `ARGV' with `delete' skips the file (since it's no longer
+in the list). See also *note ARGC and ARGV::.
-REMOTE-HOST
- The IP address or fully-qualified domain name of the Internet host
- to which you want to connect.
+
+File: gawk.info, Node: Empty Files, Next: Ignoring Assigns, Prev: File Checking, Up: Data File Management
-REMOTE-PORT
- The TCP or UDP port number to use on the given REMOTE-HOST.
- Again, use `0' if you don't care, or else a well-known service
- name.
+12.3.4 Checking For Zero-length Files
+-------------------------------------
- NOTE: Failure in opening a two-way socket will result in a
- non-fatal error being returned to the calling code. The value of
- `ERRNO' indicates the error (*note Auto-set::).
+All known `awk' implementations silently skip over zero-length files.
+This is a by-product of `awk''s implicit
+read-a-record-and-match-against-the-rules loop: when `awk' tries to
+read a record from an empty file, it immediately receives an end of
+file indication, closes the file, and proceeds on to the next
+command-line data file, _without_ executing any user-level `awk'
+program code.
- Consider the following very simple example:
+ Using `gawk''s `ARGIND' variable (*note Built-in Variables::), it is
+possible to detect when an empty data file has been skipped. Similar
+to the library file presented in *note Filetrans Function::, the
+following library file calls a function named `zerofile()' that the
+user must provide. The arguments passed are the file name and the
+position in `ARGV' where it was found:
- BEGIN {
- Service = "/inet/tcp/0/localhost/daytime"
- Service |& getline
- print $0
- close(Service)
+ # zerofile.awk --- library file to process empty input files
+
+ BEGIN { Argind = 0 }
+
+ ARGIND > Argind + 1 {
+ for (Argind++; Argind < ARGIND; Argind++)
+ zerofile(ARGV[Argind], Argind)
}
- This program reads the current date and time from the local system's
-TCP `daytime' server. It then prints the results and closes the
-connection.
+ ARGIND != Argind { Argind = ARGIND }
- Because this topic is extensive, the use of `gawk' for TCP/IP
-programming is documented separately. See *note (General
-Introduction)Top:: gawkinet, TCP/IP Internetworking with `gawk', for a
-much more complete introduction and discussion, as well as extensive
-examples.
+ END {
+ if (ARGIND > Argind)
+ for (Argind++; Argind <= ARGIND; Argind++)
+ zerofile(ARGV[Argind], Argind)
+ }
-
-File: gawk.info, Node: Profiling, Prev: TCP/IP Networking, Up: Advanced Features
+ The user-level variable `Argind' allows the `awk' program to track
+its progress through `ARGV'. Whenever the program detects that
+`ARGIND' is greater than `Argind + 1', it means that one or more empty
+files were skipped. The action then calls `zerofile()' for each such
+file, incrementing `Argind' along the way.
-12.5 Profiling Your `awk' Programs
-==================================
+ The `Argind != ARGIND' rule simply keeps `Argind' up to date in the
+normal case.
-You may produce execution traces of your `awk' programs. This is done
-by passing the option `--profile' to `gawk'. When `gawk' has finished
-running, it creates a profile of your program in a file named
-`awkprof.out'. Because it is profiling, it also executes up to 45%
-slower than `gawk' normally does.
+ Finally, the `END' rule catches the case of any empty files at the
+end of the command-line arguments. Note that the test in the condition
+of the `for' loop uses the `<=' operator, not `<'.
- As shown in the following example, the `--profile' option can be
-used to change the name of the file where `gawk' will write the profile:
+ As an exercise, you might consider whether this same problem can be
+solved without relying on `gawk''s `ARGIND' variable.
- gawk --profile=myprog.prof -f myprog.awk data1 data2
+ As a second exercise, revise this code to handle the case where an
+intervening value in `ARGV' is a variable assignment.
-In the above example, `gawk' places the profile in `myprog.prof'
-instead of in `awkprof.out'.
+
+File: gawk.info, Node: Ignoring Assigns, Prev: Empty Files, Up: Data File Management
- Here is a sample session showing a simple `awk' program, its input
-data, and the results from running `gawk' with the `--profile' option.
-First, the `awk' program:
+12.3.5 Treating Assignments as File Names
+-----------------------------------------
- BEGIN { print "First BEGIN rule" }
+Occasionally, you might not want `awk' to process command-line variable
+assignments (*note Assignment Options::). In particular, if you have a
+file name that contains an `=' character, `awk' treats the file name as
+an assignment, and does not process it.
- END { print "First END rule" }
+ Some users have suggested an additional command-line option for
+`gawk' to disable command-line assignments. However, some simple
+programming with a library file does the trick:
- /foo/ {
- print "matched /foo/, gosh"
- for (i = 1; i <= 3; i++)
- sing()
- }
+ # noassign.awk --- library file to avoid the need for a
+ # special option that disables command-line assignments
+ function disable_assigns(argc, argv, i)
{
- if (/foo/)
- print "if is true"
- else
- print "else is true"
+ for (i = 1; i < argc; i++)
+ if (argv[i] ~ /^[[:alpha:]_][[:alnum:]_]*=.*/)
+ argv[i] = ("./" argv[i])
}
- BEGIN { print "Second BEGIN rule" }
-
- END { print "Second END rule" }
-
- function sing( dummy)
- {
- print "I gotta be me!"
+ BEGIN {
+ if (No_command_assign)
+ disable_assigns(ARGC, ARGV)
}
- Following is the input data:
+ You then run your program this way:
- foo
- bar
- baz
- foo
- junk
+ awk -v No_command_assign=1 -f noassign.awk -f yourprog.awk *
- Here is the `awkprof.out' that results from running the `gawk'
-profiler on this program and data (this example also illustrates that
-`awk' programmers sometimes have to work late):
+ The function works by looping through the arguments. It prepends
+`./' to any argument that matches the form of a variable assignment,
+turning that argument into a file name.
- # gawk profile, created Sun Aug 13 00:00:15 2000
+ The use of `No_command_assign' allows you to disable command-line
+assignments at invocation time, by giving the variable a true value.
+When not set, it is initially zero (i.e., false), so the command-line
+arguments are left alone.
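The rewriting step can be tried on single arguments. This is a hedged
sketch rather than the library file itself: `protect_arg' is a
hypothetical name, and the bracket expression is spelled out as
`[A-Za-z_]' instead of `[[:alpha:]_]' only so that it also runs on
`awk' versions without POSIX character classes:

```sh
# protect_arg ARG  (hypothetical helper, for illustration only)
# Prints ARG with "./" prepended if it has the form of an assignment.
protect_arg() {
    awk -v arg="$1" 'BEGIN {
        # same test as disable_assigns(), with ASCII-only classes
        if (arg ~ /^[A-Za-z_][A-Za-z0-9_]*=/)
            arg = "./" arg
        print arg
    }'
}

protect_arg 'n=1'        # looks like an assignment
protect_arg 'data.txt'   # ordinary file name
```

Once prefixed with `./', the argument no longer starts with a valid
variable name, so `awk' opens it as a file in the current directory.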
- # BEGIN block(s)
+
+File: gawk.info, Node: Getopt Function, Next: Passwd Functions, Prev: Data File Management, Up: Library Functions
- BEGIN {
- 1 print "First BEGIN rule"
- 1 print "Second BEGIN rule"
- }
+12.4 Processing Command-Line Options
+====================================
- # Rule(s)
+Most utilities on POSIX-compatible systems take options on the command
+line that can be used to change the way a program behaves. `awk' is an
+example of such a program (*note Options::). Often, options take
+"arguments"; i.e., data that the program needs to correctly obey the
+command-line option. For example, `awk''s `-F' option requires a
+string to use as the field separator. The first occurrence on the
+command line of either `--' or a string that does not begin with `-'
+ends the options.
- 5 /foo/ { # 2
- 2 print "matched /foo/, gosh"
- 6 for (i = 1; i <= 3; i++) {
- 6 sing()
- }
- }
+ Modern Unix systems provide a C function named `getopt()' for
+processing command-line arguments. The programmer provides a string
+describing the one-letter options. If an option requires an argument,
+it is followed in the string with a colon. `getopt()' is also passed
+the count and values of the command-line arguments and is called in a
+loop. `getopt()' processes the command-line arguments for option
+letters. Each time around the loop, it returns a single character
+representing the next option letter that it finds, or `?' if it finds
+an invalid option. When it returns -1, there are no options left on
+the command line.
- 5 {
- 5 if (/foo/) { # 2
- 2 print "if is true"
- 3 } else {
- 3 print "else is true"
- }
- }
+ When using `getopt()', options that do not take arguments can be
+grouped together. Furthermore, options that take arguments require
+that the argument be present. The argument can immediately follow the
+option letter, or it can be a separate command-line argument.
- # END block(s)
+ Given a hypothetical program that takes three command-line options,
+`-a', `-b', and `-c', where `-b' requires an argument, all of the
+following are valid ways of invoking the program:
- END {
- 1 print "First END rule"
- 1 print "Second END rule"
- }
+ prog -a -b foo -c data1 data2 data3
+ prog -ac -bfoo -- data1 data2 data3
+ prog -acbfoo data1 data2 data3
- # Functions, listed alphabetically
+ Notice that when the argument is grouped with its option, the rest of
+the argument is considered to be the option's argument. In this
+example, `-acbfoo' indicates that all of the `-a', `-b', and `-c'
+options were supplied, and that `foo' is the argument to the `-b'
+option.
- 6 function sing(dummy)
- {
- 6 print "I gotta be me!"
- }
+ `getopt()' provides four external variables that the programmer can
+use:
- This example illustrates many of the basic features of profiling
-output. They are as follows:
+`optind'
+ The index in the argument value array (`argv') where the first
+ nonoption command-line argument can be found.
- * The program is printed in the order `BEGIN' rule, `BEGINFILE' rule,
- pattern/action rules, `ENDFILE' rule, `END' rule and functions,
- listed alphabetically. Multiple `BEGIN' and `END' rules are
- merged together, as are multiple `BEGINFILE' and `ENDFILE' rules.
+`optarg'
+ The string value of the argument to an option.
- * Pattern-action rules have two counts. The first count, to the
- left of the rule, shows how many times the rule's pattern was
- _tested_. The second count, to the right of the rule's opening
- left brace in a comment, shows how many times the rule's action
- was _executed_. The difference between the two indicates how many
- times the rule's pattern evaluated to false.
+`opterr'
+ Usually `getopt()' prints an error message when it finds an invalid
+ option. Setting `opterr' to zero disables this feature. (An
+ application might want to print its own error message.)
- * Similarly, the count for an `if'-`else' statement shows how many
- times the condition was tested. To the right of the opening left
- brace for the `if''s body is a count showing how many times the
- condition was true. The count for the `else' indicates how many
- times the test failed.
+`optopt'
+ The letter representing the command-line option.
- * The count for a loop header (such as `for' or `while') shows how
- many times the loop test was executed. (Because of this, you
- can't just look at the count on the first statement in a rule to
- determine how many times the rule was executed. If the first
- statement is a loop, the count is misleading.)
+ The following C fragment shows how `getopt()' might process
+command-line arguments for `awk':
- * For user-defined functions, the count next to the `function'
- keyword indicates how many times the function was called. The
- counts next to the statements in the body show how many times
- those statements were executed.
+ int
+ main(int argc, char *argv[])
+ {
+ ...
+ /* print our own message */
+ opterr = 0;
+ while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) {
+ switch (c) {
+ case 'f': /* file */
+ ...
+ break;
+ case 'F': /* field separator */
+ ...
+ break;
+ case 'v': /* variable assignment */
+ ...
+ break;
+ case 'W': /* extension */
+ ...
+ break;
+ case '?':
+ default:
+ usage();
+ break;
+ }
+ }
+ ...
+ }
- * The layout uses "K&R" style with TABs. Braces are used
- everywhere, even when the body of an `if', `else', or loop is only
- a single statement.
+ As a side point, `gawk' actually uses the GNU `getopt_long()'
+function to process both normal and GNU-style long options (*note
+Options::).
- * Parentheses are used only where needed, as indicated by the
- structure of the program and the precedence rules. For example,
- `(3 + 5) * 4' means add three plus five, then multiply the total
- by four. However, `3 + 5 * 4' has no parentheses, and means `3 +
- (5 * 4)'.
+ The abstraction provided by `getopt()' is very useful and is quite
+handy in `awk' programs as well. Following is an `awk' version of
+`getopt()'. This function highlights one of the greatest weaknesses in
+`awk', which is that it is very poor at manipulating single characters.
+Repeated calls to `substr()' are necessary for accessing individual
+characters (*note String Functions::).(1)
- * Parentheses are used around the arguments to `print' and `printf'
- only when the `print' or `printf' statement is followed by a
- redirection. Similarly, if the target of a redirection isn't a
- scalar, it gets parenthesized.
+ The discussion that follows walks through the code a bit at a time:
- * `gawk' supplies leading comments in front of the `BEGIN' and `END'
- rules, the pattern/action rules, and the functions.
+ # getopt.awk --- Do C library getopt(3) function in awk
+ # External variables:
+ # Optind -- index in ARGV of first nonoption argument
+ # Optarg -- string value of argument to current option
+ # Opterr -- if nonzero, print our own diagnostic
+ # Optopt -- current option letter
- The profiled version of your program may not look exactly like what
-you typed when you wrote it. This is because `gawk' creates the
-profiled version by "pretty printing" its internal representation of
-the program. The advantage to this is that `gawk' can produce a
-standard representation. The disadvantage is that all source-code
-comments are lost, as are the distinctions among multiple `BEGIN',
-`END', `BEGINFILE', and `ENDFILE' rules. Also, things such as:
+ # Returns:
+ # -1 at end of options
+ # "?" for unrecognized option
+ # <c> a character representing the current option
- /foo/
+ # Private Data:
+ # _opti -- index in multi-flag option, e.g., -abc
-come out as:
+ The function starts out with comments presenting a list of the
+global variables it uses, what the return values are, what they mean,
+and any global variables that are "private" to this library function.
+Such documentation is essential for any program, and particularly for
+library functions.
- /foo/ {
- print $0
- }
+ The `getopt()' function first checks that it was indeed called with
+a string of options (the `options' parameter). If `options' has a zero
+length, `getopt()' immediately returns -1:
-which is correct, but possibly surprising.
+ function getopt(argc, argv, options, thisopt, i)
+ {
+ if (length(options) == 0) # no options given
+ return -1
- Besides creating profiles when a program has completed, `gawk' can
-produce a profile while it is running. This is useful if your `awk'
-program goes into an infinite loop and you want to see what has been
-executed. To use this feature, run `gawk' with the `--profile' option
-in the background:
+ if (argv[Optind] == "--") { # all done
+ Optind++
+ _opti = 0
+ return -1
+ } else if (argv[Optind] !~ /^-[^:[:space:]]/) {
+ _opti = 0
+ return -1
+ }
- $ gawk --profile -f myprog &
- [1] 13992
+ The next thing to check for is the end of the options. A `--' ends
+the command-line options, as does any command-line argument that does
+not begin with a `-'. `Optind' is used to step through the array of
+command-line arguments; it retains its value across calls to
+`getopt()', because it is a global variable.
-The shell prints a job number and process ID number; in this case,
-13992. Use the `kill' command to send the `USR1' signal to `gawk':
+ The regular expression that is used, `/^-[^:[:space:]]/', checks for
+a `-' followed by anything that is not whitespace and not a colon. If
+the current command-line argument does not match this pattern, it is
+not an option, and it ends option processing. Continuing on:
- $ kill -USR1 13992
+ if (_opti == 0)
+ _opti = 2
+ thisopt = substr(argv[Optind], _opti, 1)
+ Optopt = thisopt
+ i = index(options, thisopt)
+ if (i == 0) {
+ if (Opterr)
+ printf("%c -- invalid option\n",
+ thisopt) > "/dev/stderr"
+ if (_opti >= length(argv[Optind])) {
+ Optind++
+ _opti = 0
+ } else
+ _opti++
+ return "?"
+ }
-As usual, the profiled version of the program is written to
-`awkprof.out', or to a different file if one is specified with the
-`--profile' option.
+ The `_opti' variable tracks the position in the current command-line
+argument (`argv[Optind]'). If multiple options are grouped together
+with one `-' (e.g., `-abx'), it is necessary to return them to the user
+one at a time.
- Along with the regular profile, as shown earlier, the profile
-includes a trace of any active functions:
+ If `_opti' is equal to zero, it is set to two, which is the index in
+the string of the next character to look at (we skip the `-', which is
+at position one). The variable `thisopt' holds the character, obtained
+with `substr()'. It is saved in `Optopt' for the main program to use.
- # Function Call Stack:
+ If `thisopt' is not in the `options' string, then it is an invalid
+option. If `Opterr' is nonzero, `getopt()' prints an error message on
+the standard error that is similar to the message from the C version of
+`getopt()'.
- # 3. baz
- # 2. bar
- # 1. foo
- # -- main --
+ Because the option is invalid, it is necessary to skip it and move
+on to the next option character. If `_opti' is greater than or equal
+to the length of the current command-line argument, it is necessary to
+move on to the next argument, so `Optind' is incremented and `_opti' is
+reset to zero. Otherwise, `Optind' is left alone and `_opti' is merely
+incremented.
- You may send `gawk' the `USR1' signal as many times as you like.
-Each time, the profile and function call trace are appended to the
-output profile file.
+ In any case, because the option is invalid, `getopt()' returns `"?"'.
+The main program can examine `Optopt' if it needs to know what the
+invalid option letter actually is. Continuing on:
- If you use the `HUP' signal instead of the `USR1' signal, `gawk'
-produces the profile and the function call trace and then exits.
+ if (substr(options, i + 1, 1) == ":") {
+ # get option argument
+ if (length(substr(argv[Optind], _opti + 1)) > 0)
+ Optarg = substr(argv[Optind], _opti + 1)
+ else
+ Optarg = argv[++Optind]
+ _opti = 0
+ } else
+ Optarg = ""
- When `gawk' runs on MS-Windows systems, it uses the `INT' and `QUIT'
-signals for producing the profile and, in the case of the `INT' signal,
-`gawk' exits. This is because these systems don't support the `kill'
-command, so the only signals you can deliver to a program are those
-generated by the keyboard. The `INT' signal is generated by the
-`Ctrl-<C>' or `Ctrl-<BREAK>' key, while the `QUIT' signal is generated
-by the `Ctrl-<\>' key.
+ If the option requires an argument, the option letter is followed by
+a colon in the `options' string. If there are remaining characters in
+the current command-line argument (`argv[Optind]'), then the rest of
+that string is assigned to `Optarg'. Otherwise, the next command-line
+argument is used (`-xFOO' versus `-x FOO'). In either case, `_opti' is
+reset to zero, because there are no more characters left to examine in
+the current command-line argument. Continuing:
- Finally, `gawk' also accepts another option `--pretty-print'. When
-called this way, `gawk' "pretty prints" the program into `awkprof.out',
-without any execution counts.
+ if (_opti == 0 || _opti >= length(argv[Optind])) {
+ Optind++
+ _opti = 0
+ } else
+ _opti++
+ return thisopt
+ }
-
-File: gawk.info, Node: Library Functions, Next: Sample Programs, Prev: Advanced Features, Up: Top
+ Finally, if `_opti' is either zero or greater than or equal to the
+length of the current command-line argument, this element in `argv' is
+through being processed, so `Optind' is incremented to point to the
+next element in `argv'. If neither condition is true, then only
+`_opti' is incremented, so that the next option letter can be processed
+on the next call to `getopt()'.
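The character-by-character scan described above can be exercised on a
single grouped argument. The sketch below is a simplified stand-in, not
the `getopt()' function itself: it handles one argument in one pass
instead of keeping state in `Optind' and `_opti' across calls, and the
name `split_opts' is made up for illustration:

```sh
# split_opts OPTSTRING ARG  (hypothetical helper, for illustration only)
# Prints one line per option letter found in ARG, "?" for an unknown
# letter, and "LETTER=VALUE" for an option that takes an argument.
split_opts() {
    awk -v options="$1" -v arg="$2" 'BEGIN {
        # position 1 is the leading "-", so start at 2
        for (pos = 2; pos <= length(arg); pos++) {
            c = substr(arg, pos, 1)
            i = index(options, c)
            if (i == 0) {           # letter not in the option string
                print "?"
                continue
            }
            if (substr(options, i + 1, 1) == ":") {
                # rest of the argument belongs to this option
                print c "=" substr(arg, pos + 1)
                break
            }
            print c
        }
    }'
}

split_opts 'ab:c' '-acbfoo'
```

With the option string `ab:c', the argument `-acbfoo' yields `a', `c',
and then `b=foo', matching the grouping rules given earlier.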
-13 A Library of `awk' Functions
-*******************************
+ The `BEGIN' rule initializes both `Opterr' and `Optind' to one.
+`Opterr' is set to one, since the default behavior is for `getopt()' to
+print a diagnostic message upon seeing an invalid option. `Optind' is
+set to one, since there's no reason to look at the program name, which
+is in `ARGV[0]':
-*note User-defined::, describes how to write your own `awk' functions.
-Writing functions is important, because it allows you to encapsulate
-algorithms and program tasks in a single place. It simplifies
-programming, making program development more manageable, and making
-programs more readable.
+ BEGIN {
+ Opterr = 1 # default is to diagnose
+ Optind = 1 # skip ARGV[0]
- One valuable way to learn a new programming language is to _read_
-programs in that language. To that end, this major node and *note
-Sample Programs::, provide a good-sized body of code for you to read,
-and hopefully, to learn from.
+ # test program
+ if (_getopt_test) {
+ while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
+ printf("c = <%c>, optarg = <%s>\n",
+ _go_c, Optarg)
+ printf("non-option arguments:\n")
+ for (; Optind < ARGC; Optind++)
+ printf("\tARGV[%d] = <%s>\n",
+ Optind, ARGV[Optind])
+ }
+ }
- This major node presents a library of useful `awk' functions. Many
-of the sample programs presented later in this Info file use these
-functions. The functions are presented here in a progression from
-simple to complex.
+ The rest of the `BEGIN' rule is a simple test program. Here is the
+result of two sample runs of the test program:
- *note Extract Program::, presents a program that you can use to
-extract the source code for these example library functions and
-programs from the Texinfo source for this Info file. (This has already
-been done as part of the `gawk' distribution.)
-
- If you have written one or more useful, general-purpose `awk'
-functions and would like to contribute them to the `awk' user
-community, see *note How To Contribute::, for more information.
-
- The programs in this major node and in *note Sample Programs::,
-freely use features that are `gawk'-specific. Rewriting these programs
-for different implementations of `awk' is pretty straightforward.
-
- * Diagnostic error messages are sent to `/dev/stderr'. Use `| "cat
- 1>&2"' instead of `> "/dev/stderr"' if your system does not have a
- `/dev/stderr', or if you cannot use `gawk'.
-
- * A number of programs use `nextfile' (*note Nextfile Statement::)
- to skip any remaining input in the input file.
-
- * Finally, some of the programs choose to ignore upper- and lowercase
- distinctions in their input. They do so by assigning one to
- `IGNORECASE'. You can achieve almost the same effect(1) by adding
- the following rule to the beginning of the program:
+ $ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
+ -| c = <a>, optarg = <>
+ -| c = <c>, optarg = <>
+ -| c = <b>, optarg = <ARG>
+ -| non-option arguments:
+ -| ARGV[3] = <bax>
+ -| ARGV[4] = <-x>
- # ignore case
- { $0 = tolower($0) }
+ $ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
+ -| c = <a>, optarg = <>
+ error--> x -- invalid option
+ -| c = <?>, optarg = <>
+ -| non-option arguments:
+ -| ARGV[4] = <xyz>
+ -| ARGV[5] = <abc>
- Also, verify that all regexp and string constants used in
- comparisons use only lowercase letters.
+ In both runs, the first `--' terminates the arguments to `awk', so
+that it does not try to interpret the `-a', etc., as its own options.
-* Menu:
+ NOTE: After `getopt()' is through, it is the responsibility of the
+ user level code to clear out all the elements of `ARGV' from 1 to
+ `Optind', so that `awk' does not try to process the command-line
+ options as file names.
-* Library Names:: How to best name private global variables in
- library functions.
-* General Functions:: Functions that are of general use.
-* Data File Management:: Functions for managing command-line data
- files.
-* Getopt Function:: A function for processing command-line
- arguments.
-* Passwd Functions:: Functions for getting user information.
-* Group Functions:: Functions for getting group information.
-* Walking Arrays:: A function to walk arrays of arrays.
+ Several of the sample programs presented in *note Sample Programs::,
+use `getopt()' to process their arguments.
---------- Footnotes ----------
- (1) The effects are not identical. Output of the transformed record
-will be in all lowercase, while `IGNORECASE' preserves the original
-contents of the input record.
+ (1) This function was written before `gawk' acquired the ability to
+split strings into single characters using `""' as the separator. We
+have left it alone, since using `substr()' is more portable.
-File: gawk.info, Node: Library Names, Next: General Functions, Up: Library Functions
-
-13.1 Naming Library Function Global Variables
-=============================================
+File: gawk.info, Node: Passwd Functions, Next: Group Functions, Prev: Getopt Function, Up: Library Functions
-Due to the way the `awk' language evolved, variables are either
-"global" (usable by the entire program) or "local" (usable just by a
-specific function). There is no intermediate state analogous to
-`static' variables in C.
+12.5 Reading the User Database
+==============================
- Library functions often need to have global variables that they can
-use to preserve state information between calls to the function--for
-example, `getopt()''s variable `_opti' (*note Getopt Function::). Such
-variables are called "private", since the only functions that need to
-use them are the ones in the library.
+The `PROCINFO' array (*note Built-in Variables::) provides access to
+the current user's real and effective user and group ID numbers, and if
+available, the user's supplementary group set. However, because these
+are numbers, they do not provide very useful information to the average
+user. There needs to be some way to find the user information
+associated with the user and group ID numbers. This minor node
+presents a suite of functions for retrieving information from the user
+database. *Note Group Functions::, for a similar suite that retrieves
+information from the group database.
- When writing a library function, you should try to choose names for
-your private variables that will not conflict with any variables used by
-either another library function or a user's main program. For example,
-a name like `i' or `j' is not a good choice, because user programs
-often use variable names like these for their own purposes.
+ The POSIX standard does not define the file where user information is
+kept. Instead, it provides the `<pwd.h>' header file and several C
+language subroutines for obtaining user information. The primary
+function is `getpwent()', for "get password entry." The "password"
+comes from the original user database file, `/etc/passwd', which stores
+user information, along with the encrypted passwords (hence the name).
- The example programs shown in this major node all start the names of
-their private variables with an underscore (`_'). Users generally
-don't use leading underscores in their variable names, so this
-convention immediately decreases the chances that the variable name
-will be accidentally shared with the user's program.
+ While an `awk' program could simply read `/etc/passwd' directly,
+this file may not contain complete information about the system's set
+of users.(1) To be sure you are able to produce a readable and complete
+version of the user database, it is necessary to write a small C
+program that calls `getpwent()'. `getpwent()' is defined as returning
+a pointer to a `struct passwd'. Each time it is called, it returns the
+next entry in the database. When there are no more entries, it returns
+`NULL', the null pointer. When this happens, the C program should call
+`endpwent()' to close the database. Following is `pwcat', a C program
+that "cats" the password database:
- In addition, several of the library functions use a prefix that helps
-indicate what function or set of functions use the variables--for
-example, `_pw_byname' in the user database routines (*note Passwd
-Functions::). This convention is recommended, since it even further
-decreases the chance of inadvertent conflict among variable names.
-Note that this convention is used equally well for variable names and
-for private function names.(1)
+ /*
+ * pwcat.c
+ *
+ * Generate a printable version of the password database
+ */
+ #include <stdio.h>
+ #include <pwd.h>
- As a final note on variable naming, if a function makes global
-variables available for use by a main program, it is a good convention
-to start that variable's name with a capital letter--for example,
-`getopt()''s `Opterr' and `Optind' variables (*note Getopt Function::).
-The leading capital letter indicates that it is global, while the fact
-that the variable name is not all capital letters indicates that the
-variable is not one of `awk''s built-in variables, such as `FS'.
+ int
+ main(int argc, char **argv)
+ {
+ struct passwd *p;
- It is also important that _all_ variables in library functions that
-do not need to save state are, in fact, declared local.(2) If this is
-not done, the variable could accidentally be used in the user's
-program, leading to bugs that are very difficult to track down:
+ while ((p = getpwent()) != NULL)
+ printf("%s:%s:%ld:%ld:%s:%s:%s\n",
+ p->pw_name, p->pw_passwd, (long) p->pw_uid,
+ (long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);
- function lib_func(x, y, l1, l2)
- {
- ...
- USE VARIABLE some_var # some_var should be local
- ... # but is not by oversight
+ endpwent();
+ return 0;
}
- A different convention, common in the Tcl community, is to use a
-single associative array to hold the values needed by the library
-function(s), or "package." This significantly decreases the number of
-actual global names in use. For example, the functions described in
-*note Passwd Functions::, might have used array elements
-`PW_data["inited"]', `PW_data["total"]', `PW_data["count"]', and
-`PW_data["awklib"]', instead of `_pw_inited', `_pw_awklib', `_pw_total',
-and `_pw_count'.
-
- The conventions presented in this minor node are exactly that:
-conventions. You are not required to write your programs this way--we
-merely recommend that you do so.
+ If you don't understand C, don't worry about it. The output from
+`pwcat' is the user database, in the traditional `/etc/passwd' format
+of colon-separated fields. The fields are:
- ---------- Footnotes ----------
+Login name
+ The user's login name.
- (1) While all the library routines could have been rewritten to use
-this convention, this was not done, in order to show how our own `awk'
-programming style has evolved and to provide some basis for this
-discussion.
+Encrypted password
+ The user's encrypted password. This may not be available on some
+ systems.
- (2) `gawk''s `--dump-variables' command-line option is useful for
-verifying this.
+User-ID
+ The user's numeric user ID number. (On some systems it's a C
+ `long', and not an `int'. Thus we cast it to `long' for all
+ cases.)
-
-File: gawk.info, Node: General Functions, Next: Data File Management, Prev: Library Names, Up: Library Functions
+Group-ID
+ The user's numeric group ID number. (Similar comments about
+ `long' vs. `int' apply here.)
-13.2 General Programming
-========================
+Full name
+ The user's full name, and perhaps other information associated
+ with the user.
-This minor node presents a number of functions that are of general
-programming use.
+Home directory
+ The user's login (or "home") directory (familiar to shell
+ programmers as `$HOME').
-* Menu:
+Login shell
+ The program that is run when the user logs in. This is usually a
+ shell, such as Bash.
-* Strtonum Function:: A replacement for the built-in
- `strtonum()' function.
-* Assert Function:: A function for assertions in `awk'
- programs.
-* Round Function:: A function for rounding if `sprintf()'
- does not do it correctly.
-* Cliff Random Function:: The Cliff Random Number Generator.
-* Ordinal Functions:: Functions for using characters as numbers and
- vice versa.
-* Join Function:: A function to join an array into a string.
-* Getlocaltime Function:: A function to get formatted times.
+ A few lines representative of `pwcat''s output are as follows:
-
-File: gawk.info, Node: Strtonum Function, Next: Assert Function, Up: General Functions
+ $ pwcat
+ -| root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
+ -| nobody:*:65534:65534::/:
+ -| daemon:*:1:1::/:
+ -| sys:*:2:2::/:/bin/csh
+ -| bin:*:3:3::/bin:
+ -| arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
+ -| miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
+ -| andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh
+ ...
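A record in this format can be taken apart with `split()'. The following is a minimal sketch, not part of the commit: the sample record is copied from the output above and stands in for whatever a lookup function would return, and the array name `f' is purely illustrative.

```shell
# Split one passwd-format record on colons and pick out three fields.
out=$(awk 'BEGIN {
    # sample record taken from the pwcat output shown above
    rec = "arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh"
    n = split(rec, f, ":")     # f[1] = login name ... f[7] = login shell
    printf("%s has uid %d and home %s\n", f[1], f[3], f[6])
}')
echo "$out"
```

Field 3 is the numeric user ID and field 6 is the home directory, following the field list given earlier.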
-13.2.1 Converting Strings To Numbers
-------------------------------------
+ With that introduction, following is a group of functions for
+getting user information. There are several functions here,
+corresponding to the C functions of the same names:
-The `strtonum()' function (*note String Functions::) is a `gawk'
-extension. The following function provides an implementation for other
-versions of `awk':
+ # passwd.awk --- access password file information
- # mystrtonum --- convert string to number
+ BEGIN {
+ # tailor this to suit your system
+ _pw_awklib = "/usr/local/libexec/awk/"
+ }
- function mystrtonum(str, ret, chars, n, i, k, c)
+ function _pw_init( oldfs, oldrs, olddol0, pwcat, using_fw, using_fpat)
{
- if (str ~ /^0[0-7]*$/) {
- # octal
- n = length(str)
- ret = 0
- for (i = 1; i <= n; i++) {
- c = substr(str, i, 1)
- if ((k = index("01234567", c)) > 0)
- k-- # adjust for 1-basing in awk
+ if (_pw_inited)
+ return
- ret = ret * 8 + k
- }
- } else if (str ~ /^0[xX][[:xdigit:]]+/) {
- # hexadecimal
- str = substr(str, 3) # lop off leading 0x
- n = length(str)
- ret = 0
- for (i = 1; i <= n; i++) {
- c = substr(str, i, 1)
- c = tolower(c)
- if ((k = index("0123456789", c)) > 0)
- k-- # adjust for 1-basing in awk
- else if ((k = index("abcdef", c)) > 0)
- k += 9
-
- ret = ret * 16 + k
- }
- } else if (str ~ \
- /^[-+]?([0-9]+([.][0-9]*([Ee][0-9]+)?)?|([.][0-9]+([Ee][-+]?[0-9]+)?))$/) {
- # decimal number, possibly floating point
- ret = str + 0
- } else
- ret = "NOT-A-NUMBER"
+ oldfs = FS
+ oldrs = RS
+ olddol0 = $0
+ using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
+ using_fpat = (PROCINFO["FS"] == "FPAT")
+ FS = ":"
+ RS = "\n"
- return ret
+ pwcat = _pw_awklib "pwcat"
+ while ((pwcat | getline) > 0) {
+ _pw_byname[$1] = $0
+ _pw_byuid[$3] = $0
+ _pw_bycount[++_pw_total] = $0
+ }
+ close(pwcat)
+ _pw_count = 0
+ _pw_inited = 1
+ FS = oldfs
+ if (using_fw)
+ FIELDWIDTHS = FIELDWIDTHS
+ else if (using_fpat)
+ FPAT = FPAT
+ RS = oldrs
+ $0 = olddol0
}
- # BEGIN { # gawk test harness
- # a[1] = "25"
- # a[2] = ".31"
- # a[3] = "0123"
- # a[4] = "0xdeadBEEF"
- # a[5] = "123.45"
- # a[6] = "1.e3"
- # a[7] = "1.32"
- # a[8] = "1.32E2"
- #
- # for (i = 1; i in a; i++)
- # print a[i], strtonum(a[i]), mystrtonum(a[i])
- # }
-
- The function first looks for C-style octal numbers (base 8). If the
-input string matches a regular expression describing octal numbers,
-then `mystrtonum()' loops through each character in the string. It
-sets `k' to the index in `"01234567"' of the current octal digit.
-Since the return value is one-based, the `k--' adjusts `k' so it can be
-used in computing the return value.
-
- Similar logic applies to the code that checks for and converts a
-hexadecimal value, which starts with `0x' or `0X'. The use of
-`tolower()' simplifies the computation for finding the correct numeric
-value for each hexadecimal digit.
+ The `BEGIN' rule sets a private variable to the directory where
+`pwcat' is stored. Because it is used to help out an `awk' library
+routine, we have chosen to put it in `/usr/local/libexec/awk'; however,
+you might want it to be in a different directory on your system.
- Finally, if the string matches the (rather complicated) regexp for a
-regular decimal integer or floating-point number, the computation `ret
-= str + 0' lets `awk' convert the value to a number.
+ The function `_pw_init()' keeps three copies of the user information
+in three associative arrays. The arrays are indexed by username
+(`_pw_byname'), by user ID number (`_pw_byuid'), and by order of
+occurrence (`_pw_bycount'). The variable `_pw_inited' is used for
+efficiency, since `_pw_init()' needs to be called only once.
- A commented-out test program is included, so that the function can
-be tested with `gawk' and the results compared to the built-in
-`strtonum()' function.
+ Because this function uses `getline' to read information from
+`pwcat', it first saves the values of `FS', `RS', and `$0'. It notes
+in the variable `using_fw' whether field splitting with `FIELDWIDTHS'
+is in effect or not. Doing so is necessary, since these functions
+could be called from anywhere within a user's program, and the user may
+have his or her own way of splitting records and fields.
-
-File: gawk.info, Node: Assert Function, Next: Round Function, Prev: Strtonum Function, Up: General Functions
+ The `using_fw' variable checks `PROCINFO["FS"]', which is
+`"FIELDWIDTHS"' if field splitting is being done with `FIELDWIDTHS'.
+This makes it possible to restore the correct field-splitting mechanism
+later. The test can only be true for `gawk'. It is false if using
+`FS' or `FPAT', or on some other `awk' implementation.
-13.2.2 Assertions
------------------
+ The code that checks for `FPAT', using `using_fpat' and
+`PROCINFO["FS"]', is similar.
-When writing large programs, it is often useful to know that a
-condition or set of conditions is true. Before proceeding with a
-particular computation, you make a statement about what you believe to
-be the case. Such a statement is known as an "assertion". The C
-language provides an `<assert.h>' header file and corresponding
-`assert()' macro that the programmer can use to make assertions. If an
-assertion fails, the `assert()' macro arranges to print a diagnostic
-message describing the condition that should have been true but was
-not, and then it kills the program. In C, using `assert()' looks like this:
+ The main part of the function uses a loop to read database lines,
+split the line into fields, and then store the line into each array as
+necessary. When the loop is done, `_pw_init()' cleans up by closing
+the pipeline, setting `_pw_inited' to one, and restoring `FS' (and
+`FIELDWIDTHS' or `FPAT' if necessary), `RS', and `$0'. The use of
+`_pw_count' is explained shortly.
- #include <assert.h>
+ The `getpwnam()' function takes a username as a string argument. If
+that user is in the database, it returns the appropriate line.
+Otherwise, it relies on the array reference to a nonexistent element to
+create the element with the null string as its value:
- int myfunc(int a, double b)
+ function getpwnam(name)
{
- assert(a <= 5 && b >= 17.1);
- ...
+ _pw_init()
+ return _pw_byname[name]
}
- If the assertion fails, the program prints a message similar to this:
-
- prog.c:5: assertion failed: a <= 5 && b >= 17.1
-
- The C language makes it possible to turn the condition into a string
-for use in printing the diagnostic message. This is not possible in
-`awk', so this `assert()' function also requires a string version of
-the condition that is being tested. Following is the function:
-
- # assert --- assert that a condition is true. Otherwise exit.
+ Similarly, the `getpwuid()' function takes a user ID number argument.
+If that user number is in the database, it returns the appropriate
+line. Otherwise, it returns the null string:
- function assert(condition, string)
+ function getpwuid(uid)
{
- if (! condition) {
- printf("%s:%d: assertion failed: %s\n",
- FILENAME, FNR, string) > "/dev/stderr"
- _assert_exit = 1
- exit 1
- }
+ _pw_init()
+ return _pw_byuid[uid]
}
- END {
- if (_assert_exit)
- exit 1
- }
+ The `getpwent()' function simply steps through the database, one
+entry at a time. It uses `_pw_count' to track its current position in
+the `_pw_bycount' array:
- The `assert()' function tests the `condition' parameter. If it is
-false, it prints a message to standard error, using the `string'
-parameter to describe the failed condition. It then sets the variable
-`_assert_exit' to one and executes the `exit' statement. The `exit'
-statement jumps to the `END' rule. If the `END' rule finds
-`_assert_exit' to be true, it exits immediately.
+ function getpwent()
+ {
+ _pw_init()
+ if (_pw_count < _pw_total)
+ return _pw_bycount[++_pw_count]
+ return ""
+ }
- The purpose of the test in the `END' rule is to keep any other `END'
-rules from running. When an assertion fails, the program should exit
-immediately. If no assertions fail, then `_assert_exit' is still false
-when the `END' rule is run normally, and the rest of the program's
-`END' rules execute. For all of this to work correctly, `assert.awk'
-must be the first source file read by `awk'. The function can be used
-in a program in the following way:
+ The `endpwent()' function resets `_pw_count' to zero, so that
+subsequent calls to `getpwent()' start over again:
- function myfunc(a, b)
+ function endpwent()
{
- assert(a <= 5 && b >= 17.1, "a <= 5 && b >= 17.1")
- ...
+ _pw_count = 0
}
-If the assertion fails, you see a message similar to the following:
+ A conscious design decision in this suite is that each subroutine
+calls `_pw_init()' to initialize the database arrays. The overhead of
+running a separate process to generate the user database, and the I/O
+to scan it, are only incurred if the user's main program actually calls
+one of these functions. If this library file is loaded along with a
+user's program, but none of the routines are ever called, then there is
+no extra runtime overhead. (The alternative is to move the body of
+`_pw_init()' into a `BEGIN' rule, which always runs `pwcat'. This
+simplifies the code but runs an extra process that may never be needed.)
- mydata:1357: assertion failed: a <= 5 && b >= 17.1
+ In turn, calling `_pw_init()' is not too expensive, because the
+`_pw_inited' variable keeps the program from reading the data more than
+once. If you are worried about squeezing every last cycle out of your
+`awk' program, the check of `_pw_inited' could be moved out of
+`_pw_init()' and duplicated in all the other functions. In practice,
+this is not necessary, since most `awk' programs are I/O-bound, and
+such a change would clutter up the code.
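The lazy-initialization pattern described above can be sketched in isolation. This is an illustrative reduction, not code from the library: `_init()', `lookup()', `_data', and `_loads' are hypothetical names, and a simple assignment stands in for the expensive step of running `pwcat'.

```shell
# Each public function calls _init(); the guard variable makes the
# expensive load happen at most once, and not at all if nothing is called.
out=$(awk '
function _init() {
    if (_inited)
        return
    _loads++            # counts how often the load really runs
    _data["a"] = 1      # stands in for reading the database
    _inited = 1
}
function lookup(k) {
    _init()
    return _data[k]
}
BEGIN {
    lookup("a"); lookup("a"); lookup("b")
    print "loads:", _loads
}')
echo "$out"
```

Three lookups trigger a single load, which is the behavior the `_pw_inited' check is meant to guarantee.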
- There is a small problem with this version of `assert()'. An `END'
-rule is automatically added to the program calling `assert()'.
-Normally, if a program consists of just a `BEGIN' rule, the input files
-and/or standard input are not read. However, now that the program has
-an `END' rule, `awk' attempts to read the input data files or standard
-input (*note Using BEGIN/END::), most likely causing the program to
-hang as it waits for input.
+ The `id' program in *note Id Program::, uses these functions.
- There is a simple workaround to this: make sure that such a `BEGIN'
-rule always ends with an `exit' statement.
+ ---------- Footnotes ----------
+
+ (1) It is often the case that password information is stored in a
+network database.
-File: gawk.info, Node: Round Function, Next: Cliff Random Function, Prev: Assert Function, Up: General Functions
+File: gawk.info, Node: Group Functions, Next: Walking Arrays, Prev: Passwd Functions, Up: Library Functions
-13.2.3 Rounding Numbers
------------------------
+12.6 Reading the Group Database
+===============================
-The way `printf' and `sprintf()' (*note Printf::) perform rounding
-often depends upon the system's C `sprintf()' subroutine. On many
-machines, `sprintf()' rounding is "unbiased," which means it doesn't
-always round a trailing `.5' up, contrary to naive expectations. In
-unbiased rounding, `.5' rounds to even, rather than always up, so 1.5
-rounds to 2 but 4.5 rounds to 4. This means that if you are using a
-format that does rounding (e.g., `"%.0f"'), you should check what your
-system does. The following function does traditional rounding; it
-might be useful if your `awk''s `printf' does unbiased rounding:
+Much of the discussion presented in *note Passwd Functions::, applies
+to the group database as well. Although there has traditionally been a
+well-known file (`/etc/group') in a well-known format, the POSIX
+standard only provides a set of C library routines (`<grp.h>' and
+`getgrent()') for accessing the information. Even though this file may
+exist, it may not have complete information. Therefore, as with the
+user database, it is necessary to have a small C program that generates
+the group database as its output. `grcat', a C program that "cats" the
+group database, is as follows:
- # round.awk --- do normal rounding
+ /*
+ * grcat.c
+ *
+ * Generate a printable version of the group database
+ */
+ #include <stdio.h>
+ #include <grp.h>
- function round(x, ival, aval, fraction)
+ int
+ main(int argc, char **argv)
{
- ival = int(x) # integer part, int() truncates
-
- # see if fractional part
- if (ival == x) # no fraction
- return ival # ensure no decimals
+ struct group *g;
+ int i;
- if (x < 0) {
- aval = -x # absolute value
- ival = int(aval)
- fraction = aval - ival
- if (fraction >= .5)
- return int(x) - 1 # -2.5 --> -3
- else
- return int(x) # -2.3 --> -2
- } else {
- fraction = x - ival
- if (fraction >= .5)
- return ival + 1
- else
- return ival
- }
+ while ((g = getgrent()) != NULL) {
+ printf("%s:%s:%ld:", g->gr_name, g->gr_passwd,
+ (long) g->gr_gid);
+ for (i = 0; g->gr_mem[i] != NULL; i++) {
+ printf("%s", g->gr_mem[i]);
+ if (g->gr_mem[i+1] != NULL)
+ putchar(',');
+ }
+ putchar('\n');
+ }
+ endgrent();
+ return 0;
}
- # test harness
- { print $0, round($0) }
+ Each line in the group database represents one group. The fields are
+separated with colons and represent the following information:
-
-File: gawk.info, Node: Cliff Random Function, Next: Ordinal Functions, Prev: Round Function, Up: General Functions
+Group Name
+ The group's name.
-13.2.4 The Cliff Random Number Generator
-----------------------------------------
+Group Password
+ The group's encrypted password. In practice, this field is never
+ used; it is usually empty or set to `*'.
-The Cliff random number generator
-(http://mathworld.wolfram.com/CliffRandomNumberGenerator.html) is a
-very simple random number generator that "passes the noise sphere test
-for randomness by showing no structure." It is easily programmed, in
-less than 10 lines of `awk' code:
+Group ID Number
+ The group's numeric group ID number; this number must be unique
+ within the file. (On some systems it's a C `long', and not an
+ `int'. Thus we cast it to `long' for all cases.)
- # cliff_rand.awk --- generate Cliff random numbers
+Group Member List
+ A comma-separated list of user names. These users are members of
+ the group. Modern Unix systems allow users to be members of
+ several groups simultaneously. If your system does, then there
+ are elements `"group1"' through `"groupN"' in `PROCINFO' for those
+ group ID numbers. (Note that `PROCINFO' is a `gawk' extension;
+ *note Built-in Variables::.)
- BEGIN { _cliff_seed = 0.1 }
+ Here is what running `grcat' might produce:
- function cliff_rand()
+ $ grcat
+ -| wheel:*:0:arnold
+ -| nogroup:*:65534:
+ -| daemon:*:1:
+ -| kmem:*:2:
+ -| staff:*:10:arnold,miriam,andy
+ -| other:*:20:
+ ...
+
+ Here are the functions for obtaining information from the group
+database. There are several, modeled after the C library functions of
+the same names:
+
+ # group.awk --- functions for dealing with the group file
+
+ BEGIN \
{
- _cliff_seed = (100 * log(_cliff_seed)) % 1
- if (_cliff_seed < 0)
- _cliff_seed = - _cliff_seed
- return _cliff_seed
+ # Change to suit your system
+ _gr_awklib = "/usr/local/libexec/awk/"
}
- This algorithm requires an initial "seed" of 0.1. Each new value
-uses the current seed as input for the calculation. If the built-in
-`rand()' function (*note Numeric Functions::) isn't random enough, you
-might try using this function instead.
+ function _gr_init( oldfs, oldrs, olddol0, grcat,
+ using_fw, using_fpat, n, a, i)
+ {
+ if (_gr_inited)
+ return
-
-File: gawk.info, Node: Ordinal Functions, Next: Join Function, Prev: Cliff Random Function, Up: General Functions
+ oldfs = FS
+ oldrs = RS
+ olddol0 = $0
+ using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
+ using_fpat = (PROCINFO["FS"] == "FPAT")
+ FS = ":"
+ RS = "\n"
-13.2.5 Translating Between Characters and Numbers
--------------------------------------------------
+ grcat = _gr_awklib "grcat"
+ while ((grcat | getline) > 0) {
+ if ($1 in _gr_byname)
+ _gr_byname[$1] = _gr_byname[$1] "," $4
+ else
+ _gr_byname[$1] = $0
+ if ($3 in _gr_bygid)
+ _gr_bygid[$3] = _gr_bygid[$3] "," $4
+ else
+ _gr_bygid[$3] = $0
-One commercial implementation of `awk' supplies a built-in function,
-`ord()', which takes a character and returns the numeric value for that
-character in the machine's character set. If the string passed to
-`ord()' has more than one character, only the first one is used.
+ n = split($4, a, "[ \t]*,[ \t]*")
+ for (i = 1; i <= n; i++)
+ if (a[i] in _gr_groupsbyuser)
+ _gr_groupsbyuser[a[i]] = \
+ _gr_groupsbyuser[a[i]] " " $1
+ else
+ _gr_groupsbyuser[a[i]] = $1
- The inverse of this function is `chr()' (from the function of the
-same name in Pascal), which takes a number and returns the
-corresponding character. Both functions are written very nicely in
-`awk'; there is no real reason to build them into the `awk' interpreter:
+ _gr_bycount[++_gr_count] = $0
+ }
+ close(grcat)
+ _gr_count = 0
+ _gr_inited++
+ FS = oldfs
+ if (using_fw)
+ FIELDWIDTHS = FIELDWIDTHS
+ else if (using_fpat)
+ FPAT = FPAT
+ RS = oldrs
+ $0 = olddol0
+ }
- # ord.awk --- do ord and chr
+ The `BEGIN' rule sets a private variable to the directory where
+`grcat' is stored. Because it is used to help out an `awk' library
+routine, we have chosen to put it in `/usr/local/libexec/awk'. You
+might want it to be in a different directory on your system.
- # Global identifiers:
- # _ord_: numerical values indexed by characters
- # _ord_init: function to initialize _ord_
+ These routines follow the same general outline as the user database
+routines (*note Passwd Functions::). The `_gr_inited' variable is used
+to ensure that the database is scanned no more than once. The
+`_gr_init()' function first saves `FS', `RS', and `$0', and then sets
+`FS' and `RS' to the correct values for scanning the group information.
+It also takes care to note whether `FIELDWIDTHS' or `FPAT' is being
+used, and to restore the appropriate field splitting mechanism.
- BEGIN { _ord_init() }
+ The group information is stored in several associative arrays. The
+arrays are indexed by group name (`_gr_byname'), by group ID number
+(`_gr_bygid'), and by position in the database (`_gr_bycount'). There
+is an additional array indexed by user name (`_gr_groupsbyuser'), which
+is a space-separated list of groups to which each user belongs.
- function _ord_init( low, high, i, t)
- {
- low = sprintf("%c", 7) # BEL is ascii 7
- if (low == "\a") { # regular ascii
- low = 0
- high = 127
- } else if (sprintf("%c", 128 + 7) == "\a") {
- # ascii, mark parity
- low = 128
- high = 255
- } else { # ebcdic(!)
- low = 0
- high = 255
- }
+ Unlike the user database, it is possible to have multiple records in
+the database for the same group. This is common when a group has a
+large number of members. A pair of such entries might look like the
+following:
- for (i = low; i <= high; i++) {
- t = sprintf("%c", i)
- _ord_[t] = i
- }
- }
+ tvpeople:*:101:johnny,jay,arsenio
+ tvpeople:*:101:david,conan,tom,joan
- Some explanation of the numbers used by `chr' is worthwhile. The
-most prominent character set in use today is ASCII.(1) Although an
-8-bit byte can hold 256 distinct values (from 0 to 255), ASCII only
-defines characters that use the values from 0 to 127.(2) In the now
-distant past, at least one minicomputer manufacturer used ASCII, but
-with mark parity, meaning that the leftmost bit in the byte is always
-1. This means that on those systems, characters have numeric values
-from 128 to 255. Finally, large mainframe systems use the EBCDIC
-character set, which uses all 256 values. While there are other
-character sets in use on some older systems, they are not really worth
-worrying about:
+ For this reason, `_gr_init()' looks to see if a group name or group
+ID number has already been seen. If it has, then the user names are simply
+concatenated onto the previous list of users. (There is actually a
+subtle problem with the code just presented. Suppose that the first
+time there were no names. This code adds the names with a leading
+comma. It also doesn't check that there is a `$4'.)
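One possible way to avoid the leading-comma problem is sketched below. It is an assumption about how the merge could be guarded, not the fix used in `gawk''s distributed `group.awk': the array name `byname' and the three sample records are hypothetical, and only the by-name array is shown.

```shell
# Merge repeated group records, appending members only when there are
# some, and adding a comma only if the stored record already has members.
out=$(printf '%s\n' 'tvpeople:*:101:' 'tvpeople:*:101:johnny,jay' 'tvpeople:*:101:david' |
awk -F: '
{
    if ($1 in byname) {
        if ($4 != "") {
            # a stored record ending in ":" has an empty member list
            sep = (byname[$1] ~ /:$/) ? "" : ","
            byname[$1] = byname[$1] sep $4
        }
    } else
        byname[$1] = $0
}
END { print byname["tvpeople"] }')
echo "$out"
```

The first record has no members, so the second is appended without a comma; the third is then joined with one.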
- function ord(str, c)
+ Finally, `_gr_init()' closes the pipeline to `grcat', restores `FS'
+(and `FIELDWIDTHS' or `FPAT' if necessary), `RS', and `$0', initializes
+`_gr_count' to zero (it is used later), and makes `_gr_inited' nonzero.
+
+ The `getgrnam()' function takes a group name as its argument, and if
+that group exists, it is returned. Otherwise, it relies on the array
+reference to a nonexistent element to create the element with the null
+string as its value:
+
+ function getgrnam(group)
{
- # only first character is of interest
- c = substr(str, 1, 1)
- return _ord_[c]
+ _gr_init()
+ return _gr_byname[group]
}
- function chr(c)
+ The `getgrgid()' function is similar; it takes a numeric group ID and
+looks up the information associated with that group ID:
+
+ function getgrgid(gid)
{
- # force c to be numeric by adding 0
- return sprintf("%c", c + 0)
+ _gr_init()
+ return _gr_bygid[gid]
}
- #### test code ####
- # BEGIN \
- # {
- # for (;;) {
- # printf("enter a character: ")
- # if (getline var <= 0)
- # break
- # printf("ord(%s) = %d\n", var, ord(var))
- # }
- # }
-
- An obvious improvement to these functions is to move the code for the
-`_ord_init' function into the body of the `BEGIN' rule. It was written
-this way initially for ease of development. There is a "test program"
-in a `BEGIN' rule, to test the function. It is commented out for
-production use.
+ The `getgruser()' function does not have a C counterpart. It takes a
+user name and returns the list of groups that have the user as a member:
- ---------- Footnotes ----------
+ function getgruser(user)
+ {
+ _gr_init()
+ return _gr_groupsbyuser[user]
+ }
- (1) This is changing; many systems use Unicode, a very large
-character set that includes ASCII as a subset. On systems with full
-Unicode support, a character can occupy up to 32 bits, making simple
-tests such as used here prohibitively expensive.
+ The `getgrent()' function steps through the database one entry at a
+time. It uses `_gr_count' to track its position in the list:
- (2) ASCII has been extended in many countries to use the values from
-128 to 255 for country-specific characters. If your system uses these
-extensions, you can simplify `_ord_init' to loop from 0 to 255.
-
-
-File: gawk.info, Node: Join Function, Next: Getlocaltime Function, Prev: Ordinal Functions, Up: General Functions
-
-13.2.6 Merging an Array into a String
--------------------------------------
-
-When doing string processing, it is often useful to be able to join all
-the strings in an array into one long string. The following function,
-`join()', accomplishes this task. It is used later in several of the
-application programs (*note Sample Programs::).
-
- Good function design is important; this function needs to be general
-but it should also have a reasonable default behavior. It is called
-with an array as well as the beginning and ending indices of the
-elements in the array to be merged. This assumes that the array
-indices are numeric--a reasonable assumption since the array was likely
-created with `split()' (*note String Functions::):
+ function getgrent()
+ {
+ _gr_init()
+ if (++_gr_count in _gr_bycount)
+ return _gr_bycount[_gr_count]
+ return ""
+ }
- # join.awk --- join an array into a string
+ The `endgrent()' function resets `_gr_count' to zero so that
+`getgrent()' can start over again:
- function join(array, start, end, sep, result, i)
+ function endgrent()
{
- if (sep == "")
- sep = " "
- else if (sep == SUBSEP) # magic value
- sep = ""
- result = array[start]
- for (i = start + 1; i <= end; i++)
- result = result sep array[i]
- return result
+ _gr_count = 0
}
- An optional additional argument is the separator to use when joining
-the strings back together. If the caller supplies a nonempty value,
-`join()' uses it; if it is not supplied, it has a null value. In this
-case, `join()' uses a single space as a default separator for the
-strings. If the value is equal to `SUBSEP', then `join()' joins the
-strings with no separator between them. `SUBSEP' serves as a "magic"
-value to indicate that there should be no separation between the
-component strings.(1)
+ As with the user database routines, each function calls `_gr_init()'
+to initialize the arrays. Doing so only incurs the extra overhead of
+running `grcat' if these functions are used (as opposed to moving the
+body of `_gr_init()' into a `BEGIN' rule).
- ---------- Footnotes ----------
+ Most of the work is in scanning the database and building the various
+associative arrays. The functions that the user calls are themselves
+very simple, relying on `awk''s associative arrays to do work.
- (1) It would be nice if `awk' had an assignment operator for
-concatenation. The lack of an explicit operator for concatenation
-makes string operations more difficult than they really need to be.
+ The `id' program in *note Id Program::, uses these functions.
-File: gawk.info, Node: Getlocaltime Function, Prev: Join Function, Up: General Functions
-
-13.2.7 Managing the Time of Day
--------------------------------
-
-The `systime()' and `strftime()' functions described in *note Time
-Functions::, provide the minimum functionality necessary for dealing
-with the time of day in human readable form. While `strftime()' is
-extensive, the control formats are not necessarily easy to remember or
-intuitively obvious when reading a program.
-
- The following function, `getlocaltime()', populates a user-supplied
-array with preformatted time information. It returns a string with the
-current time formatted in the same way as the `date' utility:
+File: gawk.info, Node: Walking Arrays, Prev: Group Functions, Up: Library Functions
- # getlocaltime.awk --- get the time of day in a usable format
+12.7 Traversing Arrays of Arrays
+================================
- # Returns a string in the format of output of date(1)
- # Populates the array argument time with individual values:
- # time["second"] -- seconds (0 - 59)
- # time["minute"] -- minutes (0 - 59)
- # time["hour"] -- hours (0 - 23)
- # time["althour"] -- hours (0 - 12)
- # time["monthday"] -- day of month (1 - 31)
- # time["month"] -- month of year (1 - 12)
- # time["monthname"] -- name of the month
- # time["shortmonth"] -- short name of the month
- # time["year"] -- year modulo 100 (0 - 99)
- # time["fullyear"] -- full year
- # time["weekday"] -- day of week (Sunday = 0)
- # time["altweekday"] -- day of week (Monday = 0)
- # time["dayname"] -- name of weekday
- # time["shortdayname"] -- short name of weekday
- # time["yearday"] -- day of year (0 - 365)
- # time["timezone"] -- abbreviation of timezone name
- # time["ampm"] -- AM or PM designation
- # time["weeknum"] -- week number, Sunday first day
- # time["altweeknum"] -- week number, Monday first day
+*note Arrays of Arrays::, described how `gawk' provides arrays of
+arrays. In particular, any element of an array may be either a scalar,
+or another array. The `isarray()' function (*note Type Functions::)
+lets you distinguish an array from a scalar. The following function,
+`walk_array()', recursively traverses an array, printing each element's
+indices and value. You call it with the array and a string
+representing the name of the array:
- function getlocaltime(time, ret, now, i)
+ function walk_array(arr, name, i)
{
- # get time once, avoids unnecessary system calls
- now = systime()
-
- # return date(1)-style output
- ret = strftime("%a %b %e %H:%M:%S %Z %Y", now)
+ for (i in arr) {
+ if (isarray(arr[i]))
+ walk_array(arr[i], (name "[" i "]"))
+ else
+ printf("%s[%s] = %s\n", name, i, arr[i])
+ }
+ }
- # clear out target array
- delete time
+It works by looping over each element of the array. If any given
+element is itself an array, the function calls itself recursively,
+passing the subarray and a new string representing the current index.
+Otherwise, the function simply prints the element's name, index, and
+value. Here is a main program to demonstrate:
- # fill in values, force numeric values to be
- # numeric by adding 0
- time["second"] = strftime("%S", now) + 0
- time["minute"] = strftime("%M", now) + 0
- time["hour"] = strftime("%H", now) + 0
- time["althour"] = strftime("%I", now) + 0
- time["monthday"] = strftime("%d", now) + 0
- time["month"] = strftime("%m", now) + 0
- time["monthname"] = strftime("%B", now)
- time["shortmonth"] = strftime("%b", now)
- time["year"] = strftime("%y", now) + 0
- time["fullyear"] = strftime("%Y", now) + 0
- time["weekday"] = strftime("%w", now) + 0
- time["altweekday"] = strftime("%u", now) + 0
- time["dayname"] = strftime("%A", now)
- time["shortdayname"] = strftime("%a", now)
- time["yearday"] = strftime("%j", now) + 0
- time["timezone"] = strftime("%Z", now)
- time["ampm"] = strftime("%p", now)
- time["weeknum"] = strftime("%U", now) + 0
- time["altweeknum"] = strftime("%W", now) + 0
+ BEGIN {
+ a[1] = 1
+ a[2][1] = 21
+ a[2][2] = 22
+ a[3] = 3
+ a[4][1][1] = 411
+ a[4][2] = 42
- return ret
+ walk_array(a, "a")
}
- The string indices are easier to use and read than the various
-formats required by `strftime()'. The `alarm' program presented in
-*note Alarm Program::, uses this function. A more general design for
-the `getlocaltime()' function would have allowed the user to supply an
-optional timestamp value to use instead of the current time.
+ When run, the program produces the following output:
+
+ $ gawk -f walk_array.awk
+ -| a[4][1][1] = 411
+ -| a[4][2] = 42
+ -| a[1] = 1
+ -| a[2][1] = 21
+ -| a[2][2] = 22
+ -| a[3] = 3
-File: gawk.info, Node: Data File Management, Next: Getopt Function, Prev: General Functions, Up: Library Functions
+File: gawk.info, Node: Sample Programs, Next: Debugger, Prev: Library Functions, Up: Top
-13.3 Data File Management
-=========================
+13 Practical `awk' Programs
+***************************
-This minor node presents functions that are useful for managing
-command-line data files.
+*note Library Functions::, presents the idea that reading programs in a
+language contributes to learning that language. This major node
+continues that theme, presenting a potpourri of `awk' programs for your
+reading enjoyment.
+
+ Many of these programs use library functions presented in *note
+Library Functions::.
* Menu:
-* Filetrans Function:: A function for handling data file transitions.
-* Rewind Function:: A function for rereading the current file.
-* File Checking:: Checking that data files are readable.
-* Empty Files:: Checking for zero-length files.
-* Ignoring Assigns:: Treating assignments as file names.
+* Running Examples:: How to run these examples.
+* Clones:: Clones of common utilities.
+* Miscellaneous Programs:: Some interesting `awk' programs.
-File: gawk.info, Node: Filetrans Function, Next: Rewind Function, Up: Data File Management
+File: gawk.info, Node: Running Examples, Next: Clones, Up: Sample Programs
-13.3.1 Noting Data File Boundaries
-----------------------------------
+13.1 Running the Example Programs
+=================================
-The `BEGIN' and `END' rules are each executed exactly once at the
-beginning and end of your `awk' program, respectively (*note
-BEGIN/END::). We (the `gawk' authors) once had a user who mistakenly
-thought that the `BEGIN' rule is executed at the beginning of each data
-file and the `END' rule is executed at the end of each data file.
+To run a given program, you would typically do something like this:
- When informed that this was not the case, the user requested that we
-add new special patterns to `gawk', named `BEGIN_FILE' and `END_FILE',
-that would have the desired behavior. He even supplied us the code to
-do so.
+ awk -f PROGRAM -- OPTIONS FILES
- Adding these special patterns to `gawk' wasn't necessary; the job
-can be done cleanly in `awk' itself, as illustrated by the following
-library program. It arranges to call two user-supplied functions,
-`beginfile()' and `endfile()', at the beginning and end of each data
-file. Besides solving the problem in only nine(!) lines of code, it
-does so _portably_; this works with any implementation of `awk':
+Here, PROGRAM is the name of the `awk' program (such as `cut.awk'),
+OPTIONS are any command-line options for the program that start with a
+`-', and FILES are the actual data files.
- # transfile.awk
- #
- # Give the user a hook for filename transitions
- #
- # The user must supply functions beginfile() and endfile()
- # that each take the name of the file being started or
- # finished, respectively.
+ If your system supports the `#!' executable interpreter mechanism
+(*note Executable Scripts::), you can instead run your program directly:
- FILENAME != _oldfilename \
- {
- if (_oldfilename != "")
- endfile(_oldfilename)
- _oldfilename = FILENAME
- beginfile(FILENAME)
- }
+ cut.awk -c1-8 myfiles > results
- END { endfile(FILENAME) }
+ If your `awk' is not `gawk', you may instead need to use this:
- This file must be loaded before the user's "main" program, so that
-the rule it supplies is executed first.
+ cut.awk -- -c1-8 myfiles > results
- This rule relies on `awk''s `FILENAME' variable that automatically
-changes for each new data file. The current file name is saved in a
-private variable, `_oldfilename'. If `FILENAME' does not equal
-`_oldfilename', then a new data file is being processed and it is
-necessary to call `endfile()' for the old file. Because `endfile()'
-should only be called if a file has been processed, the program first
-checks to make sure that `_oldfilename' is not the null string. The
-program then assigns the current file name to `_oldfilename' and calls
-`beginfile()' for the file. Because, like all `awk' variables,
-`_oldfilename' is initialized to the null string, this rule executes
-correctly even for the first data file.
+
+File: gawk.info, Node: Clones, Next: Miscellaneous Programs, Prev: Running
Examples, Up: Sample Programs
- The program also supplies an `END' rule to do the final processing
-for the last file. Because this `END' rule comes before any `END' rules
-supplied in the "main" program, `endfile()' is called first. Once
-again the value of multiple `BEGIN' and `END' rules should be clear.
+13.2 Reinventing Wheels for Fun and Profit
+==========================================
- If the same data file occurs twice in a row on the command line, then
-`endfile()' and `beginfile()' are not executed at the end of the first
-pass and at the beginning of the second pass. The following version
-solves the problem:
-
- # ftrans.awk --- handle data file transitions
- #
- # user supplies beginfile() and endfile() functions
-
- FNR == 1 {
- if (_filename_ != "")
- endfile(_filename_)
- _filename_ = FILENAME
- beginfile(FILENAME)
- }
-
- END { endfile(_filename_) }
+This minor node presents a number of POSIX utilities implemented in
+`awk'. Reinventing these programs in `awk' is often enjoyable, because
+the algorithms can be very clearly expressed, and the code is usually
+very concise and simple. This is true because `awk' does so much for
+you.
- *note Wc Program::, shows how this library function can be used and
-how it simplifies writing the main program.
+ Note that these programs are not necessarily intended to replace the
+installed versions on your system, and not all of them may be fully
+compliant with the most recent POSIX standard. This
+is not a problem; their purpose is to illustrate `awk' language
+programming for "real world" tasks.
-Advanced Notes: So Why Does `gawk' have `BEGINFILE' and `ENDFILE'?
-------------------------------------------------------------------
+ The programs are presented in alphabetical order.
-You are probably wondering, if `beginfile()' and `endfile()' functions
-can do the job, why does `gawk' have `BEGINFILE' and `ENDFILE' patterns
-(*note BEGINFILE/ENDFILE::)?
+* Menu:
- Good question. Normally, if `awk' cannot open a file, this causes
-an immediate fatal error. In this case, there is no way for a
-user-defined function to deal with the problem, since the mechanism for
-calling it relies on the file being open and at the first record. Thus,
-the main reason for `BEGINFILE' is to give you a "hook" to catch files
-that cannot be processed. `ENDFILE' exists for symmetry, and because
-it provides an easy way to do per-file cleanup processing.
+* Cut Program:: The `cut' utility.
+* Egrep Program:: The `egrep' utility.
+* Id Program:: The `id' utility.
+* Split Program:: The `split' utility.
+* Tee Program:: The `tee' utility.
+* Uniq Program:: The `uniq' utility.
+* Wc Program:: The `wc' utility.
-File: gawk.info, Node: Rewind Function, Next: File Checking, Prev:
Filetrans Function, Up: Data File Management
-
-13.3.2 Rereading the Current File
----------------------------------
+File: gawk.info, Node: Cut Program, Next: Egrep Program, Up: Clones
-Another request for a new built-in function was for a `rewind()'
-function that would make it possible to reread the current file. The
-requesting user didn't want to have to use `getline' (*note Getline::)
-inside a loop.
+13.2.1 Cutting out Fields and Columns
+-------------------------------------
- However, as long as you are not in the `END' rule, it is quite easy
-to arrange to immediately close the current input file and then start
-over with it from the top. For lack of a better name, we'll call it
-`rewind()':
+The `cut' utility selects, or "cuts," characters or fields from its
+standard input and sends them to its standard output. Fields are
+separated by TABs by default, but you may supply a command-line option
+to change the field "delimiter" (i.e., the field-separator character).
+`cut''s definition of fields is less general than `awk''s.
- # rewind.awk --- rewind the current file and start over
+ A common use of `cut' might be to pull out just the login name of
+logged-on users from the output of `who'. For example, the following
+pipeline generates a sorted, unique list of the logged-on users:
- function rewind( i)
- {
- # shift remaining arguments up
- for (i = ARGC; i > ARGIND; i--)
- ARGV[i] = ARGV[i-1]
+ who | cut -c1-8 | sort | uniq
- # make sure gawk knows to keep going
- ARGC++
+ The options for `cut' are:
- # make current file next to get done
- ARGV[ARGIND+1] = FILENAME
+`-c LIST'
+ Use LIST as the list of characters to cut out. Items within the
+ list may be separated by commas, and ranges of characters can be
+ separated with dashes. The list `1-8,15,22-35' specifies
+ characters 1 through 8, 15, and 22 through 35.
- # do it
- nextfile
- }
+`-f LIST'
+ Use LIST as the list of fields to cut out.
- This code relies on the `ARGIND' variable (*note Auto-set::), which
-is specific to `gawk'. If you are not using `gawk', you can use ideas
-presented in *note Filetrans Function::, to either update `ARGIND' on
-your own or modify this code as appropriate.
+`-d DELIM'
+ Use DELIM as the field-separator character instead of the TAB
+ character.
- The `rewind()' function also relies on the `nextfile' keyword (*note
-Nextfile Statement::).
+`-s'
+ Suppress printing of lines that do not contain the field delimiter.
-
-File: gawk.info, Node: File Checking, Next: Empty Files, Prev: Rewind
Function, Up: Data File Management
+ The `awk' implementation of `cut' uses the `getopt()' library
+function (*note Getopt Function::) and the `join()' library function
+(*note Join Function::).
-13.3.3 Checking for Readable Data Files
----------------------------------------
+ The program begins with a comment describing the options, the library
+functions needed, and a `usage()' function that prints out a usage
+message and exits. `usage()' is called if invalid arguments are
+supplied:
-Normally, if you give `awk' a data file that isn't readable, it stops
-with a fatal error. There are times when you might want to just ignore
-such files and keep going. You can do this by prepending the following
-program to your `awk' program:
+ # cut.awk --- implement cut in awk
- # readable.awk --- library file to skip over unreadable files
+ # Options:
+ # -f list Cut fields
+ # -d c Field delimiter character
+ # -c list Cut characters
+ #
+ # -s Suppress lines without the delimiter
+ #
+ # Requires getopt() and join() library functions
- BEGIN {
- for (i = 1; i < ARGC; i++) {
- if (ARGV[i] ~ /^[[:alpha:]_][[:alnum:]_]*=.*/ \
- || ARGV[i] == "-" || ARGV[i] == "/dev/stdin")
- continue # assignment or standard input
- else if ((getline junk < ARGV[i]) < 0) # unreadable
- delete ARGV[i]
- else
- close(ARGV[i])
- }
+ function usage( e1, e2)
+ {
+ e1 = "usage: cut [-f list] [-d c] [-s] [files...]"
+ e2 = "usage: cut [-c list] [files...]"
+ print e1 > "/dev/stderr"
+ print e2 > "/dev/stderr"
+ exit 1
}
- This works, because the `getline' won't be fatal. Removing the
-element from `ARGV' with `delete' skips the file (since it's no longer
-in the list). See also *note ARGC and ARGV::.
+The variables `e1' and `e2' are used so that the function fits nicely
+on the screen.
-
-File: gawk.info, Node: Empty Files, Next: Ignoring Assigns, Prev: File
Checking, Up: Data File Management
+ Next comes a `BEGIN' rule that parses the command-line options. It
+sets `FS' to a single TAB character, because that is `cut''s default
+field separator. The rule then sets the output field separator to be the
+same as the input field separator. A loop using `getopt()' steps
+through the command-line options. Exactly one of the variables
+`by_fields' or `by_chars' is set to true, to indicate that processing
+should be done by fields or by characters, respectively. When cutting
+by characters, the output field separator is set to the null string:
-13.3.4 Checking For Zero-length Files
--------------------------------------
+ BEGIN \
+ {
+ FS = "\t" # default
+ OFS = FS
+ while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) {
+ if (c == "f") {
+ by_fields = 1
+ fieldlist = Optarg
+ } else if (c == "c") {
+ by_chars = 1
+ fieldlist = Optarg
+ OFS = ""
+ } else if (c == "d") {
+ if (length(Optarg) > 1) {
+ printf("Using first character of %s" \
+ " for delimiter\n", Optarg) > "/dev/stderr"
+ Optarg = substr(Optarg, 1, 1)
+ }
+ FS = Optarg
+ OFS = FS
+ if (FS == " ") # defeat awk semantics
+ FS = "[ ]"
+ } else if (c == "s")
+ suppress++
+ else
+ usage()
+ }
-All known `awk' implementations silently skip over zero-length files.
-This is a by-product of `awk''s implicit
-read-a-record-and-match-against-the-rules loop: when `awk' tries to
-read a record from an empty file, it immediately receives an end of
-file indication, closes the file, and proceeds on to the next
-command-line data file, _without_ executing any user-level `awk'
-program code.
+ # Clear out options
+ for (i = 1; i < Optind; i++)
+ ARGV[i] = ""
- Using `gawk''s `ARGIND' variable (*note Built-in Variables::), it is
-possible to detect when an empty data file has been skipped. Similar
-to the library file presented in *note Filetrans Function::, the
-following library file calls a function named `zerofile()' that the
-user must provide. The arguments passed are the file name and the
-position in `ARGV' where it was found:
+ The code must take special care when the field delimiter is a space.
+Using a single space (`" "') for the value of `FS' is incorrect--`awk'
+would separate fields with runs of spaces, TABs, and/or newlines, and
+we want them to be separated with individual spaces. Also remember
+that after `getopt()' is through (as described in *note Getopt
+Function::), we have to clear out all the elements of `ARGV' from 1 up
+to (but not including) `Optind', so that `awk' does not try to process
+the command-line options as file names.
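 The difference between the two `FS' settings can be checked with a
short experiment. This is only an illustrative sketch: the
`count_fields' helper name is hypothetical and is not part of
`cut.awk'. It should behave the same way in any POSIX `awk':

```shell
# count_fields FS TEXT: print the number of fields awk sees in TEXT
# when FS is set to the given value. (Hypothetical helper, for
# illustration only.)
count_fields() {
    printf '%s\n' "$2" | awk -v fs="$1" 'BEGIN { FS = fs } { print NF }'
}

# The sample text contains two consecutive spaces.
count_fields ' '   'a  b'    # runs of blanks separate fields: prints 2
count_fields '[ ]' 'a  b'    # each single space separates fields: prints 3
```

 With `[ ]', the empty string between the two spaces counts as a field
of its own, which is the behavior `cut -d " "' needs.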
- # zerofile.awk --- library file to process empty input files
+ After dealing with the command-line options, the program verifies
+that the options make sense. Only one or the other of `-c' and `-f'
+should be used, and both require a field list. Then the program calls
+either `set_fieldlist()' or `set_charlist()' to pull apart the list of
+fields or characters:
- BEGIN { Argind = 0 }
+ if (by_fields && by_chars)
+ usage()
- ARGIND > Argind + 1 {
- for (Argind++; Argind < ARGIND; Argind++)
- zerofile(ARGV[Argind], Argind)
- }
+ if (by_fields == 0 && by_chars == 0)
+ by_fields = 1 # default
- ARGIND != Argind { Argind = ARGIND }
+ if (fieldlist == "") {
+ print "cut: needs list for -c or -f" > "/dev/stderr"
+ exit 1
+ }
- END {
- if (ARGIND > Argind)
- for (Argind++; Argind <= ARGIND; Argind++)
- zerofile(ARGV[Argind], Argind)
+ if (by_fields)
+ set_fieldlist()
+ else
+ set_charlist()
}
- The user-level variable `Argind' allows the `awk' program to track
-its progress through `ARGV'. Whenever the program detects that
-`ARGIND' is greater than `Argind + 1', it means that one or more empty
-files were skipped. The action then calls `zerofile()' for each such
-file, incrementing `Argind' along the way.
-
- The `Argind != ARGIND' rule simply keeps `Argind' up to date in the
-normal case.
+ `set_fieldlist()' splits the field list apart at the commas into an
+array. Then, for each element of the array, it looks to see if the
+element is actually a range, and if so, splits it apart. The function
+checks the range to make sure that the first number is smaller than the
+second. Each number in the list is added to the `flist' array, which
+simply lists the fields that will be printed. Because normal field
+splitting is used, the program lets `awk' do the splitting itself:
- Finally, the `END' rule catches the case of any empty files at the
-end of the command-line arguments. Note that the test in the condition
-of the `for' loop uses the `<=' operator, not `<'.
-
- As an exercise, you might consider whether this same problem can be
-solved without relying on `gawk''s `ARGIND' variable.
-
- As a second exercise, revise this code to handle the case where an
-intervening value in `ARGV' is a variable assignment.
+ function set_fieldlist( n, m, i, j, k, f, g)
+ {
+ n = split(fieldlist, f, ",")
+ j = 1 # index in flist
+ for (i = 1; i <= n; i++) {
+ if (index(f[i], "-") != 0) { # a range
+ m = split(f[i], g, "-")
+ if (m != 2 || g[1] >= g[2]) {
+ printf("bad field list: %s\n",
+ f[i]) > "/dev/stderr"
+ exit 1
+ }
+ for (k = g[1]; k <= g[2]; k++)
+ flist[j++] = k
+ } else
+ flist[j++] = f[i]
+ }
+ nfields = j - 1
+ }
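 The range handling can be tried on its own. The following is a
simplified sketch, not the code from `cut.awk': the `expand_list'
helper name is hypothetical, and the `+ 0' conversions are an added
assumption that forces the range endpoints to compare as numbers:

```shell
# expand_list LIST: print the numbers named by a cut-style list such
# as "1-3,7", one per line. (Hypothetical helper; a simplified version
# of the range handling described above.)
expand_list() {
    awk -v list="$1" 'BEGIN {
        n = split(list, f, ",")
        for (i = 1; i <= n; i++) {
            if (index(f[i], "-") != 0) {        # a range such as 2-4
                m = split(f[i], g, "-")
                if (m != 2 || g[1] + 0 >= g[2] + 0) {
                    printf("bad field list: %s\n", f[i]) > "/dev/stderr"
                    exit 1
                }
                for (k = g[1] + 0; k <= g[2] + 0; k++)
                    print k
            } else
                print f[i]                      # a single number
        }
    }'
}

expand_list '1-3,7'    # prints 1, 2, 3, and 7 on separate lines
```

 A reversed range such as `5-2' produces the error message and a
nonzero exit status instead of any output.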
-
-File: gawk.info, Node: Ignoring Assigns, Prev: Empty Files, Up: Data File
Management
+ The `set_charlist()' function is more complicated than
+`set_fieldlist()'. The idea here is to use `gawk''s `FIELDWIDTHS'
+variable (*note Constant Size::), which describes constant-width input.
+When using a character list, that is exactly what we have.
-13.3.5 Treating Assignments as File Names
------------------------------------------
+ Setting up `FIELDWIDTHS' is more complicated than simply listing the
+fields that need to be printed. We have to keep track of the fields to
+print and also the intervening characters that have to be skipped. For
+example, suppose you wanted characters 1 through 8, 15, and 22 through
+35. You would use `-c 1-8,15,22-35'. The necessary value for
+`FIELDWIDTHS' is `"8 6 1 6 14"'. This yields five fields, and the
+fields to print are `$1', `$3', and `$5'. The intermediate fields are
+"filler", which is stuff in between the desired data. `flist' lists
+the fields to print, and `t' tracks the complete field list, including
+filler fields:
-Occasionally, you might not want `awk' to process command-line variable
-assignments (*note Assignment Options::). In particular, if you have a
-file name that contains an `=' character, `awk' treats the file name as
-an assignment, and does not process it.
+ function set_charlist( field, i, j, f, g, t,
+ filler, last, len, n, m)
+ {
+ field = 1 # count total fields
+ n = split(fieldlist, f, ",")
+ j = 1 # index in flist
+ for (i = 1; i <= n; i++) {
+ if (index(f[i], "-") != 0) { # range
+ m = split(f[i], g, "-")
+ if (m != 2 || g[1] >= g[2]) {
+ printf("bad character list: %s\n",
+ f[i]) > "/dev/stderr"
+ exit 1
+ }
+ len = g[2] - g[1] + 1
+ if (g[1] > 1) # compute length of filler
+ filler = g[1] - last - 1
+ else
+ filler = 0
+ if (filler)
+ t[field++] = filler
+ t[field++] = len # length of field
+ last = g[2]
+ flist[j++] = field - 1
+ } else {
+ if (f[i] > 1)
+ filler = f[i] - last - 1
+ else
+ filler = 0
+ if (filler)
+ t[field++] = filler
+ t[field++] = 1
+ last = f[i]
+ flist[j++] = field - 1
+ }
+ }
+ FIELDWIDTHS = join(t, 1, field - 1)
+ nfields = j - 1
+ }
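 The arithmetic behind the `FIELDWIDTHS' value can be reproduced
without `gawk', since only the string itself is being computed. This
is a sketch under stated assumptions: the `widths_for' helper name is
hypothetical, and the list is assumed to be sorted and non-overlapping,
as `set_charlist()' also assumes:

```shell
# widths_for LIST: print the FIELDWIDTHS string for a cut -c style
# list. (Hypothetical helper; assumes a sorted, non-overlapping list.)
widths_for() {
    awk -v list="$1" 'BEGIN {
        n = split(list, f, ",")
        last = 0                        # last character position consumed
        out = ""
        for (i = 1; i <= n; i++) {
            m = split(f[i], g, "-")
            first = g[1] + 0
            final = (m == 2) ? g[2] + 0 : first
            filler = first - last - 1   # characters skipped before this item
            if (filler > 0)
                out = out (out == "" ? "" : " ") filler
            out = out (out == "" ? "" : " ") (final - first + 1)
            last = final
        }
        print out
    }'
}

widths_for '1-8,15,22-35'    # prints: 8 6 1 6 14
```

 The output matches the value given in the text: widths 8, 1, and 14
are the wanted characters, and the two widths of 6 are filler.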
- Some users have suggested an additional command-line option for
-`gawk' to disable command-line assignments. However, some simple
-programming with a library file does the trick:
+ Next is the rule that actually processes the data. If the `-s'
+option is given, then `suppress' is true. The first `if' statement
+makes sure that the input record does have the field separator. If
+`cut' is processing fields, `suppress' is true, and the field separator
+character is not in the record, then the record is skipped.
- # noassign.awk --- library file to avoid the need for a
- # special option that disables command-line assignments
+ If the record is valid, then `gawk' has split the data into fields,
+either using the character in `FS' or using fixed-length fields and
+`FIELDWIDTHS'. The loop goes through the list of fields that should be
+printed. The corresponding field is printed if it contains data. If
+the next field also has data, then the separator character is written
+out between the fields:
- function disable_assigns(argc, argv, i)
{
- for (i = 1; i < argc; i++)
- if (argv[i] ~ /^[[:alpha:]_][[:alnum:]_]*=.*/)
- argv[i] = ("./" argv[i])
- }
+ if (by_fields && suppress && index($0, FS) == 0)
+ next
- BEGIN {
- if (No_command_assign)
- disable_assigns(ARGC, ARGV)
+ for (i = 1; i <= nfields; i++) {
+ if ($flist[i] != "") {
+ printf "%s", $flist[i]
+ if (i < nfields && $flist[i+1] != "")
+ printf "%s", OFS
+ }
+ }
+ print ""
}
- You then run your program this way:
-
- awk -v No_command_assign=1 -f noassign.awk -f yourprog.awk *
+ This version of `cut' relies on `gawk''s `FIELDWIDTHS' variable to
+do the character-based cutting. While it is possible in other `awk'
+implementations to use `substr()' (*note String Functions::), it is
+also extremely painful. The `FIELDWIDTHS' variable supplies an elegant
+solution to the problem of picking the input line apart by characters.
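 For comparison, here is what the `substr()' approach looks like when
the character list is known in advance. This is a hard-coded sketch
for the list `1-8,15,22-35' only (the `cut_fixed' helper name is
hypothetical); a general version would have to build these calls from
the parsed list, which is where the pain comes in:

```shell
# cut_fixed: a hard-coded equivalent of "cut -c 1-8,15,22-35" that
# uses only substr(), for awk versions without FIELDWIDTHS.
# (Hypothetical helper, for illustration only.)
cut_fixed() {
    awk '{ print substr($0, 1, 8) substr($0, 15, 1) substr($0, 22, 14) }'
}

printf '%s\n' 'abcdefghijklmnopqrstuvwxyz0123456789' | cut_fixed
# prints: abcdefghovwxyz012345678
```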
- The function works by looping through the arguments. It prepends
-`./' to any argument that matches the form of a variable assignment,
-turning that argument into a file name.
+
+File: gawk.info, Node: Egrep Program, Next: Id Program, Prev: Cut Program,
Up: Clones
- The use of `No_command_assign' allows you to disable command-line
-assignments at invocation time, by giving the variable a true value.
-When not set, it is initially zero (i.e., false), so the command-line
-arguments are left alone.
+13.2.2 Searching for Regular Expressions in Files
+-------------------------------------------------
-
-File: gawk.info, Node: Getopt Function, Next: Passwd Functions, Prev: Data
File Management, Up: Library Functions
+The `egrep' utility searches files for patterns. It uses regular
+expressions that are almost identical to those available in `awk'
+(*note Regexp::). You invoke it as follows:
-13.4 Processing Command-Line Options
-====================================
+ egrep [ OPTIONS ] 'PATTERN' FILES ...
-Most utilities on POSIX compatible systems take options on the command
-line that can be used to change the way a program behaves. `awk' is an
-example of such a program (*note Options::). Often, options take
-"arguments"; i.e., data that the program needs to correctly obey the
-command-line option. For example, `awk''s `-F' option requires a
-string to use as the field separator. The first occurrence on the
-command line of either `--' or a string that does not begin with `-'
-ends the options.
+ The PATTERN is a regular expression. In typical usage, the regular
+expression is quoted to prevent the shell from expanding any of the
+special characters as file name wildcards. Normally, `egrep' prints
+the lines that matched. If multiple file names are provided on the
+command line, each output line is preceded by the name of the file and
+a colon.
- Modern Unix systems provide a C function named `getopt()' for
-processing command-line arguments. The programmer provides a string
-describing the one-letter options. If an option requires an argument,
-it is followed in the string with a colon. `getopt()' is also passed
-the count and values of the command-line arguments and is called in a
-loop. `getopt()' processes the command-line arguments for option
-letters. Each time around the loop, it returns a single character
-representing the next option letter that it finds, or `?' if it finds
-an invalid option. When it returns -1, there are no options left on
-the command line.
+ The options to `egrep' are as follows:
- When using `getopt()', options that do not take arguments can be
-grouped together. Furthermore, options that take arguments require
-that the argument be present. The argument can immediately follow the
-option letter, or it can be a separate command-line argument.
+`-c'
+ Print out a count of the lines that matched the pattern, instead
+ of the lines themselves.
- Given a hypothetical program that takes three command-line options,
-`-a', `-b', and `-c', where `-b' requires an argument, all of the
-following are valid ways of invoking the program:
+`-s'
+ Be silent. No output is produced and the exit value indicates
+ whether the pattern was matched.
- prog -a -b foo -c data1 data2 data3
- prog -ac -bfoo -- data1 data2 data3
- prog -acbfoo data1 data2 data3
+`-v'
+ Invert the sense of the test. `egrep' prints the lines that do
+ _not_ match the pattern and exits successfully if the pattern is
+ not matched.
- Notice that when the argument is grouped with its option, the rest of
-the argument is considered to be the option's argument. In this
-example, `-acbfoo' indicates that all of the `-a', `-b', and `-c'
-options were supplied, and that `foo' is the argument to the `-b'
-option.
+`-i'
+ Ignore case distinctions in both the pattern and the input data.
- `getopt()' provides four external variables that the programmer can
-use:
+`-l'
+ Only print (list) the names of the files that matched, not the
+ lines that matched.
-`optind'
- The index in the argument value array (`argv') where the first
- nonoption command-line argument can be found.
+`-e PATTERN'
+ Use PATTERN as the regexp to match. The purpose of the `-e'
+ option is to allow patterns that start with a `-'.
-`optarg'
- The string value of the argument to an option.
+ This version uses the `getopt()' library function (*note Getopt
+Function::) and the file transition library program (*note Filetrans
+Function::).
-`opterr'
- Usually `getopt()' prints an error message when it finds an invalid
- option. Setting `opterr' to zero disables this feature. (An
- application might want to print its own error message.)
+ The program begins with a descriptive comment and then a `BEGIN' rule
+that processes the command-line arguments with `getopt()'. The `-i'
+(ignore case) option is particularly easy with `gawk'; we just use the
+`IGNORECASE' built-in variable (*note Built-in Variables::):
-`optopt'
- The letter representing the command-line option.
+ # egrep.awk --- simulate egrep in awk
+ #
+ # Options:
+ # -c count of lines
+ # -s silent - use exit value
+ # -v invert test, success if no match
+ # -i ignore case
+ # -l print filenames only
+ # -e argument is pattern
+ #
+ # Requires getopt and file transition library functions
- The following C fragment shows how `getopt()' might process
-command-line arguments for `awk':
+ BEGIN {
+ while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) {
+ if (c == "c")
+ count_only++
+ else if (c == "s")
+ no_print++
+ else if (c == "v")
+ invert++
+ else if (c == "i")
+ IGNORECASE = 1
+ else if (c == "l")
+ filenames_only++
+ else if (c == "e")
+ pattern = Optarg
+ else
+ usage()
+ }
- int
- main(int argc, char *argv[])
- {
- ...
- /* print our own message */
- opterr = 0;
- while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) {
- switch (c) {
- case 'f': /* file */
- ...
- break;
- case 'F': /* field separator */
- ...
- break;
- case 'v': /* variable assignment */
- ...
- break;
- case 'W': /* extension */
- ...
- break;
- case '?':
- default:
- usage();
- break;
- }
- }
- ...
- }
+ Next comes the code that handles the `egrep'-specific behavior. If no
+pattern is supplied with `-e', the first nonoption on the command line
+is used. The `awk' command-line arguments up to `ARGV[Optind]' are
+cleared, so that `awk' won't try to process them as files. If no files
+are specified, the standard input is used, and if multiple files are
+specified, we make sure to note this so that the file names can precede
+the matched lines in the output:
- As a side point, `gawk' actually uses the GNU `getopt_long()'
-function to process both normal and GNU-style long options (*note
-Options::).
+ if (pattern == "")
+ pattern = ARGV[Optind++]
- The abstraction provided by `getopt()' is very useful and is quite
-handy in `awk' programs as well. Following is an `awk' version of
-`getopt()'. This function highlights one of the greatest weaknesses in
-`awk', which is that it is very poor at manipulating single characters.
-Repeated calls to `substr()' are necessary for accessing individual
-characters (*note String Functions::).(1)
+ for (i = 1; i < Optind; i++)
+ ARGV[i] = ""
+ if (Optind >= ARGC) {
+ ARGV[1] = "-"
+ ARGC = 2
+ } else if (ARGC - Optind > 1)
+ do_filenames++
- The discussion that follows walks through the code a bit at a time:
+ # if (IGNORECASE)
+ # pattern = tolower(pattern)
+ }
- # getopt.awk --- Do C library getopt(3) function in awk
+ The last two lines are commented out, since they are not needed in
+`gawk'. They should be uncommented if you have to use another version
+of `awk'.
- # External variables:
- # Optind -- index in ARGV of first nonoption argument
- # Optarg -- string value of argument to current option
- # Opterr -- if nonzero, print our own diagnostic
- # Optopt -- current option letter
+ The next set of lines should be uncommented if you are not using
+`gawk'. This rule translates all the characters in the input line into
+lowercase if the `-i' option is specified.(1) The rule is commented out
+since it is not necessary with `gawk':
- # Returns:
- # -1 at end of options
- # "?" for unrecognized option
- # <c> a character representing the current option
+ #{
+ # if (IGNORECASE)
+ # $0 = tolower($0)
+ #}
- # Private Data:
- # _opti -- index in multi-flag option, e.g., -abc
+ The `beginfile()' function is called by the rule in `ftrans.awk'
+when each new file is processed. In this case, it is very simple; all
+it does is initialize a variable `fcount' to zero. `fcount' tracks how
+many lines in the current file matched the pattern. Naming the
+parameter `junk' shows we know that `beginfile()' is called with a
+parameter, but that we're not interested in its value:
- The function starts out with comments presenting a list of the
-global variables it uses, what the return values are, what they mean,
-and any global variables that are "private" to this library function.
-Such documentation is essential for any program, and particularly for
-library functions.
+ function beginfile(junk)
+ {
+ fcount = 0
+ }
- The `getopt()' function first checks that it was indeed called with
-a string of options (the `options' parameter). If `options' has a zero
-length, `getopt()' immediately returns -1:
+ The `endfile()' function is called after each file has been
+processed. It affects the output only when the user wants a count of
+the number of lines that matched. `no_print' is true only if the exit
+status is desired. `count_only' is true if line counts are desired.
+`egrep' therefore only prints line counts if printing and counting are
+enabled. The output format must be adjusted depending upon the number
+of files to process. Finally, `fcount' is added to `total', so that we
+know the total number of lines that matched the pattern:
- function getopt(argc, argv, options, thisopt, i)
+ function endfile(file)
{
- if (length(options) == 0) # no options given
- return -1
-
- if (argv[Optind] == "--") { # all done
- Optind++
- _opti = 0
- return -1
- } else if (argv[Optind] !~ /^-[^:[:space:]]/) {
- _opti = 0
- return -1
+ if (! no_print && count_only) {
+ if (do_filenames)
+ print file ":" fcount
+ else
+ print fcount
}
- The next thing to check for is the end of the options. A `--' ends
-the command-line options, as does any command-line argument that does
-not begin with a `-'. `Optind' is used to step through the array of
-command-line arguments; it retains its value across calls to
-`getopt()', because it is a global variable.
+ total += fcount
+ }
- The regular expression that is used, `/^-[^:[:space:]]/', checks for
-a `-' followed by anything that is not whitespace and not a colon. If
-the current command-line argument does not match this pattern, it is
-not an option, and it ends option processing. Continuing on:
+ The following rule does most of the work of matching lines. The
+variable `matches' is true if the line matched the pattern. If the user
+wants lines that did not match, the sense of `matches' is inverted
+using the `!' operator. `fcount' is incremented with the value of
+`matches', which is either one or zero, depending upon a successful or
+unsuccessful match. If the line does not match, the `next' statement
+just moves on to the next record.
- if (_opti == 0)
- _opti = 2
- thisopt = substr(argv[Optind], _opti, 1)
- Optopt = thisopt
- i = index(options, thisopt)
- if (i == 0) {
- if (Opterr)
- printf("%c -- invalid option\n",
- thisopt) > "/dev/stderr"
- if (_opti >= length(argv[Optind])) {
- Optind++
- _opti = 0
- } else
- _opti++
- return "?"
- }
+ A number of additional tests are made, but they are only done if we
+are not counting lines. First, if the user only wants exit status
+(`no_print' is true), then it is enough to know that _one_ line in this
+file matched, and we can skip on to the next file with `nextfile'.
+Similarly, if we are only printing file names, we can print the file
+name, and then skip to the next file with `nextfile'. Finally, each
+line is printed, with a leading file name and colon if necessary:
- The `_opti' variable tracks the position in the current command-line
-argument (`argv[Optind]'). If multiple options are grouped together
-with one `-' (e.g., `-abx'), it is necessary to return them to the user
-one at a time.
+ {
+ matches = ($0 ~ pattern)
+ if (invert)
+ matches = ! matches
- If `_opti' is equal to zero, it is set to two, which is the index in
-the string of the next character to look at (we skip the `-', which is
-at position one). The variable `thisopt' holds the character, obtained
-with `substr()'. It is saved in `Optopt' for the main program to use.
+ fcount += matches # 1 or 0
- If `thisopt' is not in the `options' string, then it is an invalid
-option. If `Opterr' is nonzero, `getopt()' prints an error message on
-the standard error that is similar to the message from the C version of
-`getopt()'.
+ if (! matches)
+ next
- Because the option is invalid, it is necessary to skip it and move
-on to the next option character. If `_opti' is greater than or equal
-to the length of the current command-line argument, it is necessary to
-move on to the next argument, so `Optind' is incremented and `_opti' is
-reset to zero. Otherwise, `Optind' is left alone and `_opti' is merely
-incremented.
+ if (! count_only) {
+ if (no_print)
+ nextfile
- In any case, because the option is invalid, `getopt()' returns `"?"'.
-The main program can examine `Optopt' if it needs to know what the
-invalid option letter actually is. Continuing on:
+ if (filenames_only) {
+ print FILENAME
+ nextfile
+ }
- if (substr(options, i + 1, 1) == ":") {
- # get option argument
- if (length(substr(argv[Optind], _opti + 1)) > 0)
- Optarg = substr(argv[Optind], _opti + 1)
+ if (do_filenames)
+ print FILENAME ":" $0
else
- Optarg = argv[++Optind]
- _opti = 0
- } else
- Optarg = ""
+ print
+ }
+ }
- If the option requires an argument, the option letter is followed by
-a colon in the `options' string. If there are remaining characters in
-the current command-line argument (`argv[Optind]'), then the rest of
-that string is assigned to `Optarg'. Otherwise, the next command-line
-argument is used (`-xFOO' versus `-x FOO'). In either case, `_opti' is
-reset to zero, because there are no more characters left to examine in
-the current command-line argument. Continuing:
+ The `END' rule takes care of producing the correct exit status. If
+there are no matches, the exit status is one; otherwise it is zero:
- if (_opti == 0 || _opti >= length(argv[Optind])) {
- Optind++
- _opti = 0
- } else
- _opti++
- return thisopt
+ END \
+ {
+ if (total == 0)
+ exit 1
+ exit 0
}
- Finally, if `_opti' is either zero or greater than the length of the
-current command-line argument, it means this element in `argv' is
-through being processed, so `Optind' is incremented to point to the
-next element in `argv'. If neither condition is true, then only
-`_opti' is incremented, so that the next option letter can be processed
-on the next call to `getopt()'.
+ The `usage()' function prints a usage message in case of invalid
+options, and then exits:
- The `BEGIN' rule initializes both `Opterr' and `Optind' to one.
-`Opterr' is set to one, since the default behavior is for `getopt()' to
-print a diagnostic message upon seeing an invalid option. `Optind' is
-set to one, since there's no reason to look at the program name, which
-is in `ARGV[0]':
+ function usage( e)
+ {
+ e = "Usage: egrep [-csvil] [-e pat] [files ...]"
+ e = e "\n\tegrep [-csvil] pat [files ...]"
+ print e > "/dev/stderr"
+ exit 1
+ }
- BEGIN {
- Opterr = 1 # default is to diagnose
- Optind = 1 # skip ARGV[0]
+ The variable `e' is used so that the function fits nicely on the
+printed page.
- # test program
- if (_getopt_test) {
- while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
- printf("c = <%c>, optarg = <%s>\n",
- _go_c, Optarg)
- printf("non-option arguments:\n")
- for (; Optind < ARGC; Optind++)
- printf("\tARGV[%d] = <%s>\n",
- Optind, ARGV[Optind])
- }
- }
+ Just a note on programming style: you may have noticed that the `END'
+rule uses backslash continuation, with the open brace on a line by
+itself. This is so that it more closely resembles the way functions
+are written. Many of the examples in this major node use this style.
+You can decide for yourself if you like writing your `BEGIN' and `END'
+rules this way or not.
- The rest of the `BEGIN' rule is a simple test program. Here is the
-result of two sample runs of the test program:
+ ---------- Footnotes ----------
- $ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
- -| c = <a>, optarg = <>
- -| c = <c>, optarg = <>
- -| c = <b>, optarg = <ARG>
- -| non-option arguments:
- -| ARGV[3] = <bax>
- -| ARGV[4] = <-x>
+ (1) It also introduces a subtle bug; if a match happens, we output
+the translated line, not the original.
- $ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
- -| c = <a>, optarg = <>
- error--> x -- invalid option
- -| c = <?>, optarg = <>
- -| non-option arguments:
- -| ARGV[4] = <xyz>
- -| ARGV[5] = <abc>
+
+File: gawk.info, Node: Id Program, Next: Split Program, Prev: Egrep Program, Up: Clones
- In both runs, the first `--' terminates the arguments to `awk', so
-that it does not try to interpret the `-a', etc., as its own options.
+13.2.3 Printing out User Information
+------------------------------------
- NOTE: After `getopt()' is through, it is the responsibility of the
- user level code to clear out all the elements of `ARGV' from 1 to
- `Optind', so that `awk' does not try to process the command-line
- options as file names.
+The `id' utility lists a user's real and effective user ID numbers,
+real and effective group ID numbers, and the user's group set, if any.
+`id' only prints the effective user ID and group ID if they are
+different from the real ones. If possible, `id' also supplies the
+corresponding user and group names. The output might look like this:
- Several of the sample programs presented in *note Sample Programs::,
-use `getopt()' to process their arguments.
+ $ id
+ -| uid=500(arnold) gid=500(arnold) groups=6(disk),7(lp),19(floppy)
- ---------- Footnotes ----------
+ This information is part of what is provided by `gawk''s `PROCINFO'
+array (*note Built-in Variables::). However, the `id' utility provides
+a more palatable output than just individual numbers.
- (1) This function was written before `gawk' acquired the ability to
-split strings into single characters using `""' as the separator. We
-have left it alone, since using `substr()' is more portable.
+ Here is a simple version of `id' written in `awk'. It uses the user
+database library functions (*note Passwd Functions::) and the group
+database library functions (*note Group Functions::):
-
-File: gawk.info, Node: Passwd Functions, Next: Group Functions, Prev: Getopt Function, Up: Library Functions
+ The program is fairly straightforward. All the work is done in the
+`BEGIN' rule. The user and group ID numbers are obtained from
+`PROCINFO'. The code is repetitive. The entry in the user database
+for the real user ID number is split into parts at the `:'. The name is
+the first field. Similar code is used for the effective user ID number
+and the group numbers:
-13.5 Reading the User Database
-==============================
+ # id.awk --- implement id in awk
+ #
+ # Requires user and group library functions
+ # output is:
+ # uid=12(foo) euid=34(bar) gid=3(baz) \
+ # egid=5(blat) groups=9(nine),2(two),1(one)
-The `PROCINFO' array (*note Built-in Variables::) provides access to
-the current user's real and effective user and group ID numbers, and if
-available, the user's supplementary group set. However, because these
-are numbers, they do not provide very useful information to the average
-user. There needs to be some way to find the user information
-associated with the user and group ID numbers. This minor node
-presents a suite of functions for retrieving information from the user
-database. *Note Group Functions::, for a similar suite that retrieves
-information from the group database.
+ BEGIN \
+ {
+ uid = PROCINFO["uid"]
+ euid = PROCINFO["euid"]
+ gid = PROCINFO["gid"]
+ egid = PROCINFO["egid"]
- The POSIX standard does not define the file where user information is
-kept. Instead, it provides the `<pwd.h>' header file and several C
-language subroutines for obtaining user information. The primary
-function is `getpwent()', for "get password entry." The "password"
-comes from the original user database file, `/etc/passwd', which stores
-user information, along with the encrypted passwords (hence the name).
+ printf("uid=%d", uid)
+ pw = getpwuid(uid)
+ if (pw != "") {
+ split(pw, a, ":")
+ printf("(%s)", a[1])
+ }
- While an `awk' program could simply read `/etc/passwd' directly,
-this file may not contain complete information about the system's set
-of users.(1) To be sure you are able to produce a readable and complete
-version of the user database, it is necessary to write a small C
-program that calls `getpwent()'. `getpwent()' is defined as returning
-a pointer to a `struct passwd'. Each time it is called, it returns the
-next entry in the database. When there are no more entries, it returns
-`NULL', the null pointer. When this happens, the C program should call
-`endpwent()' to close the database. Following is `pwcat', a C program
-that "cats" the password database:
+ if (euid != uid) {
+ printf(" euid=%d", euid)
+ pw = getpwuid(euid)
+ if (pw != "") {
+ split(pw, a, ":")
+ printf("(%s)", a[1])
+ }
+ }
- /*
- * pwcat.c
- *
- * Generate a printable version of the password database
- */
- #include <stdio.h>
- #include <pwd.h>
+ printf(" gid=%d", gid)
+ pw = getgrgid(gid)
+ if (pw != "") {
+ split(pw, a, ":")
+ printf("(%s)", a[1])
+ }
- int
- main(int argc, char **argv)
- {
- struct passwd *p;
+ if (egid != gid) {
+ printf(" egid=%d", egid)
+ pw = getgrgid(egid)
+ if (pw != "") {
+ split(pw, a, ":")
+ printf("(%s)", a[1])
+ }
+ }
- while ((p = getpwent()) != NULL)
- printf("%s:%s:%ld:%ld:%s:%s:%s\n",
- p->pw_name, p->pw_passwd, (long) p->pw_uid,
- (long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);
+ for (i = 1; ("group" i) in PROCINFO; i++) {
+ if (i == 1)
+ printf(" groups=")
+ group = PROCINFO["group" i]
+ printf("%d", group)
+ pw = getgrgid(group)
+ if (pw != "") {
+ split(pw, a, ":")
+ printf("(%s)", a[1])
+ }
+ if (("group" (i+1)) in PROCINFO)
+ printf(",")
+ }
- endpwent();
- return 0;
+ print ""
}
- If you don't understand C, don't worry about it. The output from
-`pwcat' is the user database, in the traditional `/etc/passwd' format
-of colon-separated fields. The fields are:
-
-Login name
- The user's login name.
+ The test in the `for' loop is worth noting. Any supplementary
+groups in the `PROCINFO' array have the indices `"group1"' through
+`"groupN"' for some N, i.e., the total number of supplementary groups.
+However, we don't know in advance how many of these groups there are.
-Encrypted password
- The user's encrypted password. This may not be available on some
- systems.
+ This loop works by starting at one, concatenating the value with
+`"group"', and then using `in' to see if that value is in the array.
+Eventually, `i' is incremented past the last group in the array and the
+loop exits.
-User-ID
- The user's numeric user ID number. (On some systems it's a C
- `long', and not an `int'. Thus we cast it to `long' for all
- cases.)
+ The loop is also correct if there are _no_ supplementary groups;
+then the condition is false the first time it's tested, and the loop
+body never executes.
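The loop test described above can be tried without `gawk' by letting an ordinary array stand in for `PROCINFO'. The following is a minimal sketch and not part of the commit: the shell function `list_groups' and the array `info' are hypothetical names, and the group numbers come from the command line instead of from the running process.

```sh
# list_groups: hypothetical helper; prints its arguments as a
# comma-separated " groups=" list, using the same loop test as id.awk.
list_groups() {
    awk 'BEGIN {
        # info[] stands in for PROCINFO; fill "group1" .. "groupN"
        for (i = 1; i < ARGC; i++)
            info["group" i] = ARGV[i]

        # Stop as soon as "group" i is not an index in the array.
        for (i = 1; ("group" i) in info; i++) {
            if (i == 1)
                printf(" groups=")
            printf("%d", info["group" i])
            if (("group" (i+1)) in info)
                printf(",")
        }
        print ""
    }' "$@"
}

list_groups 6 7 19
```

Called with no arguments, the function prints only an empty line, because the `in' test fails on the first pass and the loop body never runs.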
-Group-ID
- The user's numeric group ID number. (Similar comments about
- `long' vs. `int' apply here.)
+
+File: gawk.info, Node: Split Program, Next: Tee Program, Prev: Id Program, Up: Clones
-Full name
- The user's full name, and perhaps other information associated
- with the user.
+13.2.4 Splitting a Large File into Pieces
+-----------------------------------------
-Home directory
- The user's login (or "home") directory (familiar to shell
- programmers as `$HOME').
+The `split' program splits large text files into smaller pieces. Usage
+is as follows:(1)
-Login shell
- The program that is run when the user logs in. This is usually a
- shell, such as Bash.
+ split [-COUNT] file [ PREFIX ]
- A few lines representative of `pwcat''s output are as follows:
+ By default, the output files are named `xaa', `xab', and so on. Each
+file has 1000 lines in it, with the likely exception of the last file.
+To change the number of lines in each file, supply a number on the
+command line preceded with a minus; e.g., `-500' for files with 500
+lines in them instead of 1000. To change the name of the output files
+to something like `myfileaa', `myfileab', and so on, supply an
+additional argument that specifies the file name prefix.
- $ pwcat
- -| root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
- -| nobody:*:65534:65534::/:
- -| daemon:*:1:1::/:
- -| sys:*:2:2::/:/bin/csh
- -| bin:*:3:3::/bin:
- -| arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
- -| miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
- -| andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh
- ...
+ Here is a version of `split' in `awk'. It uses the `ord()' and
+`chr()' functions presented in *note Ordinal Functions::.
- With that introduction, following is a group of functions for
-getting user information. There are several functions here,
-corresponding to the C functions of the same names:
+ The program first sets its defaults, and then tests to make sure
+there are not too many arguments. It then looks at each argument in
+turn. The first argument could be a minus sign followed by a number.
+If it is, this happens to look like a negative number, so it is made
+positive, and that is the count of lines. The data file name is
+skipped over and the final argument is used as the prefix for the
+output file names:
- # passwd.awk --- access password file information
+ # split.awk --- do split in awk
+ #
+ # Requires ord() and chr() library functions
+ # usage: split [-num] [file] [outname]
BEGIN {
- # tailor this to suit your system
- _pw_awklib = "/usr/local/libexec/awk/"
- }
-
- function _pw_init( oldfs, oldrs, olddol0, pwcat, using_fw, using_fpat)
- {
- if (_pw_inited)
- return
+ outfile = "x" # default
+ count = 1000
+ if (ARGC > 4)
+ usage()
- oldfs = FS
- oldrs = RS
- olddol0 = $0
- using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
- using_fpat = (PROCINFO["FS"] == "FPAT")
- FS = ":"
- RS = "\n"
+ i = 1
+ if (ARGV[i] ~ /^-[[:digit:]]+$/) {
+ count = -ARGV[i]
+ ARGV[i] = ""
+ i++
+ }
+ # test argv in case reading from stdin instead of file
+ if (i in ARGV)
+ i++ # skip data file name
+ if (i in ARGV) {
+ outfile = ARGV[i]
+ ARGV[i] = ""
+ }
- pwcat = _pw_awklib "pwcat"
- while ((pwcat | getline) > 0) {
- _pw_byname[$1] = $0
- _pw_byuid[$3] = $0
- _pw_bycount[++_pw_total] = $0
+ s1 = s2 = "a"
+ out = (outfile s1 s2)
+ }
+
+ The next rule does most of the work. `tcount' (temporary count)
+tracks how many lines have been printed to the output file so far. If
+it is greater than `count', it is time to close the current file and
+start a new one. `s1' and `s2' track the current suffixes for the file
+name. If they are both `z', the file is just too big. Otherwise, if
+`s2' is `z', `s1' moves to the next letter in the alphabet and `s2'
+starts over again at `a'; in all other cases only `s2' moves to the
+next letter:
+
+ {
+ if (++tcount > count) {
+ close(out)
+ if (s2 == "z") {
+ if (s1 == "z") {
+ printf("split: %s is too large to split\n",
+ FILENAME) > "/dev/stderr"
+ exit 1
+ }
+ s1 = chr(ord(s1) + 1)
+ s2 = "a"
+ }
+ else
+ s2 = chr(ord(s2) + 1)
+ out = (outfile s1 s2)
+ tcount = 1
}
- close(pwcat)
- _pw_count = 0
- _pw_inited = 1
- FS = oldfs
- if (using_fw)
- FIELDWIDTHS = FIELDWIDTHS
- else if (using_fpat)
- FPAT = FPAT
- RS = oldrs
- $0 = olddol0
+ print > out
}
- The `BEGIN' rule sets a private variable to the directory where
-`pwcat' is stored. Because it is used to help out an `awk' library
-routine, we have chosen to put it in `/usr/local/libexec/awk'; however,
-you might want it to be in a different directory on your system.
+The `usage()' function simply prints an error message and exits:
- The function `_pw_init()' keeps three copies of the user information
-in three associative arrays. The arrays are indexed by username
-(`_pw_byname'), by user ID number (`_pw_byuid'), and by order of
-occurrence (`_pw_bycount'). The variable `_pw_inited' is used for
-efficiency, since `_pw_init()' needs to be called only once.
+ function usage( e)
+ {
+ e = "usage: split [-num] [file] [outname]"
+ print e > "/dev/stderr"
+ exit 1
+ }
- Because this function uses `getline' to read information from
-`pwcat', it first saves the values of `FS', `RS', and `$0'. It notes
-in the variable `using_fw' whether field splitting with `FIELDWIDTHS'
-is in effect or not. Doing so is necessary, since these functions
-could be called from anywhere within a user's program, and the user may
-have his or her own way of splitting records and fields.
+The variable `e' is used so that the function fits nicely on the screen.
- The `using_fw' variable checks `PROCINFO["FS"]', which is
-`"FIELDWIDTHS"' if field splitting is being done with `FIELDWIDTHS'.
-This makes it possible to restore the correct field-splitting mechanism
-later. The test can only be true for `gawk'. It is false if using
-`FS' or `FPAT', or on some other `awk' implementation.
+ This program is a bit sloppy; it relies on `awk' to automatically
+close the last file instead of doing it in an `END' rule. It also
+assumes that letters are contiguous in the character set, which isn't
+true for EBCDIC systems.
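One way to avoid the assumption about contiguous letters is to look up the next letter in an explicit alphabet string instead of using `ord()' and `chr()'. The following is a sketch of that alternative, not the method used in `split.awk': the shell function `next_suffix' and the variable `letters' are hypothetical, and the input is assumed to be two lowercase letters.

```sh
# next_suffix: hypothetical helper; prints the two-letter suffix that
# follows the one given, or fails if the suffix is already "zz".
next_suffix() {
    awk -v cur="$1" 'BEGIN {
        letters = "abcdefghijklmnopqrstuvwxyz"
        s1 = substr(cur, 1, 1)
        s2 = substr(cur, 2, 1)
        if (s2 == "z") {
            if (s1 == "z")
                exit 1          # no suffixes left
            # carry: advance the first letter, restart the second
            s1 = substr(letters, index(letters, s1) + 1, 1)
            s2 = "a"
        } else
            s2 = substr(letters, index(letters, s2) + 1, 1)
        print s1 s2
    }'
}

next_suffix aa    # prints ab
next_suffix az    # prints ba
```

Because the order of the letters is spelled out in the string, the result does not depend on the character set in use.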
- The code that checks for using `FPAT', using `using_fpat' and
-`PROCINFO["FS"]' is similar.
+ ---------- Footnotes ----------
- The main part of the function uses a loop to read database lines,
-split the line into fields, and then store the line into each array as
-necessary. When the loop is done, `_pw_init()' cleans up by closing
-the pipeline, setting `_pw_inited' to one, and restoring `FS' (and
-`FIELDWIDTHS' or `FPAT' if necessary), `RS', and `$0'. The use of
-`_pw_count' is explained shortly.
+ (1) This is the traditional usage. The POSIX usage is different, but
+not relevant for what the program aims to demonstrate.
- The `getpwnam()' function takes a username as a string argument. If
-that user is in the database, it returns the appropriate line.
-Otherwise, it relies on the array reference to a nonexistent element to
-create the element with the null string as its value:
+
+File: gawk.info, Node: Tee Program, Next: Uniq Program, Prev: Split Program, Up: Clones
- function getpwnam(name)
- {
- _pw_init()
- return _pw_byname[name]
- }
+13.2.5 Duplicating Output into Multiple Files
+---------------------------------------------
- Similarly, the `getpwuid' function takes a user ID number argument.
-If that user number is in the database, it returns the appropriate
-line. Otherwise, it returns the null string:
+The `tee' program is known as a "pipe fitting." `tee' copies its
+standard input to its standard output and also duplicates it to the
+files named on the command line. Its usage is as follows:
- function getpwuid(uid)
- {
- _pw_init()
- return _pw_byuid[uid]
- }
+ tee [-a] file ...
- The `getpwent()' function simply steps through the database, one
-entry at a time. It uses `_pw_count' to track its current position in
-the `_pw_bycount' array:
+ The `-a' option tells `tee' to append to the named files, instead of
+truncating them and starting over.
- function getpwent()
+ The `BEGIN' rule first makes a copy of all the command-line arguments
+into an array named `copy'. `ARGV[0]' is not copied, since it is not
+needed. `tee' cannot use `ARGV' directly, since `awk' attempts to
+process each file name in `ARGV' as input data.
+
+ If the first argument is `-a', then the flag variable `append' is
+set to true, and both `ARGV[1]' and `copy[1]' are deleted. If `ARGC' is
+less than two, then no file names were supplied and `tee' prints a
+usage message and exits. Finally, `awk' is forced to read the standard
+input by setting `ARGV[1]' to `"-"' and `ARGC' to two:
+
+ # tee.awk --- tee in awk
+ #
+ # Copy standard input to all named output files.
+ # Append content if -a option is supplied.
+ #
+ BEGIN \
{
- _pw_init()
- if (_pw_count < _pw_total)
- return _pw_bycount[++_pw_count]
- return ""
+ for (i = 1; i < ARGC; i++)
+ copy[i] = ARGV[i]
+
+ if (ARGV[1] == "-a") {
+ append = 1
+ delete ARGV[1]
+ delete copy[1]
+ ARGC--
+ }
+ if (ARGC < 2) {
+ print "usage: tee [-a] file ..." > "/dev/stderr"
+ exit 1
+ }
+ ARGV[1] = "-"
+ ARGC = 2
}
- The `endpwent()' function resets `_pw_count' to zero, so that
-subsequent calls to `getpwent()' start over again:
+ The following single rule does all the work. Since there is no
+pattern, it is executed for each line of input. The body of the rule
+simply prints the line into each file on the command line, and then to
+the standard output:
- function endpwent()
{
- _pw_count = 0
+ # moving the if outside the loop makes it run faster
+ if (append)
+ for (i in copy)
+ print >> copy[i]
+ else
+ for (i in copy)
+ print > copy[i]
+ print
}
- A conscious design decision in this suite is that each subroutine
-calls `_pw_init()' to initialize the database arrays. The overhead of
-running a separate process to generate the user database, and the I/O
-to scan it, are only incurred if the user's main program actually calls
-one of these functions. If this library file is loaded along with a
-user's program, but none of the routines are ever called, then there is
-no extra runtime overhead. (The alternative is to move the body of
-`_pw_init()' into a `BEGIN' rule, which always runs `pwcat'. This
-simplifies the code but runs an extra process that may never be needed.)
+It is also possible to write the loop this way:
- In turn, calling `_pw_init()' is not too expensive, because the
-`_pw_inited' variable keeps the program from reading the data more than
-once. If you are worried about squeezing every last cycle out of your
-`awk' program, the check of `_pw_inited' could be moved out of
-`_pw_init()' and duplicated in all the other functions. In practice,
-this is not necessary, since most `awk' programs are I/O-bound, and
-such a change would clutter up the code.
+ for (i in copy)
+ if (append)
+ print >> copy[i]
+ else
+ print > copy[i]
- The `id' program in *note Id Program::, uses these functions.
+This is more concise but it is also less efficient. The `if' is tested
+for each record and for each output file. By duplicating the loop
+body, the `if' is only tested once for each input record. If there are
+N input records and M output files, the first method only executes N
+`if' statements, while the second executes N`*'M `if' statements.
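The N versus N*M claim can be checked by counting. This is a small sketch under stated assumptions: `count_tests', `outside', and `inside' are hypothetical names, and the counters are incremented where each arrangement would evaluate `append'; no files are written.

```sh
# count_tests: hypothetical helper; for N records and M files, prints how
# many times each loop arrangement evaluates the append test.
count_tests() {
    awk -v n="$1" -v m="$2" 'BEGIN {
        for (rec = 1; rec <= n; rec++) {
            outside++                   # test hoisted out of the loop
            for (f = 1; f <= m; f++)
                inside++                # test repeated per output file
        }
        printf("%d %d\n", outside, inside)
    }'
}

count_tests 1000 3    # prints 1000 3000
```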
- ---------- Footnotes ----------
+ Finally, the `END' rule cleans up by closing all the output files:
- (1) It is often the case that password information is stored in a
-network database.
+ END \
+ {
+ for (i in copy)
+ close(copy[i])
+ }
-File: gawk.info, Node: Group Functions, Next: Walking Arrays, Prev: Passwd Functions, Up: Library Functions
+File: gawk.info, Node: Uniq Program, Next: Wc Program, Prev: Tee Program, Up: Clones
-13.6 Reading the Group Database
-===============================
+13.2.6 Printing Nonduplicated Lines of Text
+-------------------------------------------
-Much of the discussion presented in *note Passwd Functions::, applies
-to the group database as well. Although there has traditionally been a
-well-known file (`/etc/group') in a well-known format, the POSIX
-standard only provides a set of C library routines (`<grp.h>' and
-`getgrent()') for accessing the information. Even though this file may
-exist, it may not have complete information. Therefore, as with the
-user database, it is necessary to have a small C program that generates
-the group database as its output. `grcat', a C program that "cats" the
-group database, is as follows:
+The `uniq' utility reads sorted lines of data on its standard input,
+and by default removes duplicate lines. In other words, it only prints
+unique lines--hence the name. `uniq' has a number of options. The
+usage is as follows:
- /*
- * grcat.c
- *
- * Generate a printable version of the group database
- */
- #include <stdio.h>
- #include <grp.h>
+ uniq [-udc [-N]] [+N] [ INPUT FILE [ OUTPUT FILE ]]
- int
- main(int argc, char **argv)
- {
- struct group *g;
- int i;
+ The options for `uniq' are:
- while ((g = getgrent()) != NULL) {
- printf("%s:%s:%ld:", g->gr_name, g->gr_passwd,
- (long) g->gr_gid);
- for (i = 0; g->gr_mem[i] != NULL; i++) {
- printf("%s", g->gr_mem[i]);
- if (g->gr_mem[i+1] != NULL)
- putchar(',');
- }
- putchar('\n');
- }
- endgrent();
- return 0;
- }
+`-d'
+ Print only repeated lines.
- Each line in the group database represents one group. The fields are
-separated with colons and represent the following information:
+`-u'
+ Print only nonrepeated lines.
-Group Name
- The group's name.
+`-c'
+ Count lines. This option overrides `-d' and `-u'. Both repeated
+ and nonrepeated lines are counted.
-Group Password
- The group's encrypted password. In practice, this field is never
- used; it is usually empty or set to `*'.
+`-N'
+ Skip N fields before comparing lines. The definition of fields is
+ similar to `awk''s default: nonwhitespace characters separated by
+ runs of spaces and/or TABs.
-Group ID Number
- The group's numeric group ID number; this number must be unique
- within the file. (On some systems it's a C `long', and not an
- `int'. Thus we cast it to `long' for all cases.)
+`+N'
+ Skip N characters before comparing lines. Any fields specified
+ with `-N' are skipped first.
-Group Member List
- A comma-separated list of user names. These users are members of
- the group. Modern Unix systems allow users to be members of
- several groups simultaneously. If your system does, then there
- are elements `"group1"' through `"groupN"' in `PROCINFO' for those
- group ID numbers. (Note that `PROCINFO' is a `gawk' extension;
- *note Built-in Variables::.)
+`INPUT FILE'
+ Data is read from the input file named on the command line,
+ instead of from the standard input.
- Here is what running `grcat' might produce:
+`OUTPUT FILE'
+ The generated output is sent to the named output file, instead of
+ to the standard output.
- $ grcat
- -| wheel:*:0:arnold
- -| nogroup:*:65534:
- -| daemon:*:1:
- -| kmem:*:2:
- -| staff:*:10:arnold,miriam,andy
- -| other:*:20:
- ...
+ Normally `uniq' behaves as if both the `-d' and `-u' options are
+provided.
- Here are the functions for obtaining information from the group
-database. There are several, modeled after the C library functions of
-the same names:
+ `uniq' uses the `getopt()' library function (*note Getopt Function::)
+and the `join()' library function (*note Join Function::).
- # group.awk --- functions for dealing with the group file
+ The program begins with a `usage()' function and then a brief
+outline of the options and their meanings in comments. The `BEGIN'
+rule deals with the command-line arguments and options. It uses a trick
+to get `getopt()' to handle options of the form `-25', treating such an
+option as the option letter `2' with an argument of `5'. If indeed two
+or more digits are supplied (`Optarg' looks like a number), `Optarg' is
+concatenated with the option digit and then the result is added to zero
+to make it into a number. If there is only one digit in the option,
+then `Optarg' is not needed. In this case, `Optind' must be decremented
+so that `getopt()' processes it next time. This code is admittedly a
+bit tricky.
- BEGIN \
+ If no options are supplied, then the default is taken, to print both
+repeated and nonrepeated lines. The output file, if provided, is
+assigned to `outputfile'. Early on, `outputfile' is initialized to the
+standard output, `/dev/stdout':
+
+ # uniq.awk --- do uniq in awk
+ #
+ # Requires getopt() and join() library functions
+
+ function usage( e)
{
- # Change to suit your system
- _gr_awklib = "/usr/local/libexec/awk/"
+ e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
+ print e > "/dev/stderr"
+ exit 1
}
- function _gr_init( oldfs, oldrs, olddol0, grcat,
- using_fw, using_fpat, n, a, i)
+ # -c count lines. overrides -d and -u
+ # -d only repeated lines
+ # -u only nonrepeated lines
+ # -n skip n fields
+ # +n skip n characters, skip fields first
+
+ BEGIN \
{
- if (_gr_inited)
- return
+ count = 1
+ outputfile = "/dev/stdout"
+ opts = "udc0:1:2:3:4:5:6:7:8:9:"
+ while ((c = getopt(ARGC, ARGV, opts)) != -1) {
+ if (c == "u")
+ non_repeated_only++
+ else if (c == "d")
+ repeated_only++
+ else if (c == "c")
+ do_count++
+ else if (index("0123456789", c) != 0) {
+ # getopt requires args to options
+ # this messes us up for things like -5
+ if (Optarg ~ /^[[:digit:]]+$/)
+ fcount = (c Optarg) + 0
+ else {
+ fcount = c + 0
+ Optind--
+ }
+ } else
+ usage()
+ }
- oldfs = FS
- oldrs = RS
- olddol0 = $0
- using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
- using_fpat = (PROCINFO["FS"] == "FPAT")
- FS = ":"
- RS = "\n"
+ if (ARGV[Optind] ~ /^\+[[:digit:]]+$/) {
+ charcount = substr(ARGV[Optind], 2) + 0
+ Optind++
+ }
- grcat = _gr_awklib "grcat"
- while ((grcat | getline) > 0) {
- if ($1 in _gr_byname)
- _gr_byname[$1] = _gr_byname[$1] "," $4
- else
- _gr_byname[$1] = $0
- if ($3 in _gr_bygid)
- _gr_bygid[$3] = _gr_bygid[$3] "," $4
- else
- _gr_bygid[$3] = $0
+ for (i = 1; i < Optind; i++)
+ ARGV[i] = ""
- n = split($4, a, "[ \t]*,[ \t]*")
- for (i = 1; i <= n; i++)
- if (a[i] in _gr_groupsbyuser)
- _gr_groupsbyuser[a[i]] = \
- _gr_groupsbyuser[a[i]] " " $1
- else
- _gr_groupsbyuser[a[i]] = $1
+ if (repeated_only == 0 && non_repeated_only == 0)
+ repeated_only = non_repeated_only = 1
- _gr_bycount[++_gr_count] = $0
+ if (ARGC - Optind == 2) {
+ outputfile = ARGV[ARGC - 1]
+ ARGV[ARGC - 1] = ""
}
- close(grcat)
- _gr_count = 0
- _gr_inited++
- FS = oldfs
- if (using_fw)
- FIELDWIDTHS = FIELDWIDTHS
- else if (using_fpat)
- FPAT = FPAT
- RS = oldrs
- $0 = olddol0
}
- The `BEGIN' rule sets a private variable to the directory where
-`grcat' is stored. Because it is used to help out an `awk' library
-routine, we have chosen to put it in `/usr/local/libexec/awk'. You
-might want it to be in a different directory on your system.
+ The following function, `are_equal()', compares the current line,
+`$0', to the previous line, `last'. It handles skipping fields and
+characters. If no field count and no character count are specified,
+`are_equal()' simply returns one or zero depending upon the result of a
+simple string comparison of `last' and `$0'. Otherwise, things get more
+complicated. If fields have to be skipped, each line is broken into an
+array using `split()' (*note String Functions::); the desired fields
+are then joined back into a line using `join()'. The joined lines are
+stored in `clast' and `cline'. If no fields are skipped, `clast' and
+`cline' are set to `last' and `$0', respectively. Finally, if
+characters are skipped, `substr()' is used to strip off the leading
+`charcount' characters in `clast' and `cline'. The two strings are
+then compared and `are_equal()' returns the result:
- These routines follow the same general outline as the user database
-routines (*note Passwd Functions::). The `_gr_inited' variable is used
-to ensure that the database is scanned no more than once. The
-`_gr_init()' function first saves `FS', `RS', and `$0', and then sets
-`FS' and `RS' to the correct values for scanning the group information.
-It also takes care to note whether `FIELDWIDTHS' or `FPAT' is being
-used, and to restore the appropriate field splitting mechanism.
+ function are_equal( n, m, clast, cline, alast, aline)
+ {
+ if (fcount == 0 && charcount == 0)
+ return (last == $0)
- The group information is stored in several associative arrays. The
-arrays are indexed by group name (`_gr_byname'), by group ID number
-(`_gr_bygid'), and by position in the database (`_gr_bycount'). There
-is an additional array indexed by user name (`_gr_groupsbyuser'), which
-is a space-separated list of groups to which each user belongs.
+ if (fcount > 0) {
+ n = split(last, alast)
+ m = split($0, aline)
+ clast = join(alast, fcount+1, n)
+ cline = join(aline, fcount+1, m)
+ } else {
+ clast = last
+ cline = $0
+ }
+ if (charcount) {
+ clast = substr(clast, charcount + 1)
+ cline = substr(cline, charcount + 1)
+ }
- Unlike the user database, it is possible to have multiple records in
-the database for the same group. This is common when a group has a
-large number of members. A pair of such entries might look like the
-following:
+ return (clast == cline)
+ }
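To see what `are_equal()' actually compares, it can help to print the part of a line that remains after the skipping. The following is a self-contained sketch and not the code from `uniq.awk': the shell function `cmp_key' is hypothetical, and an inline loop that rejoins the fields with single spaces stands in for the `join()' library function.

```sh
# cmp_key: hypothetical helper; prints the part of each input line that
# is left after skipping FCOUNT fields and then CHARCOUNT characters.
cmp_key() {
    awk -v fcount="$1" -v charcount="$2" '{
        key = $0
        if (fcount > 0) {
            # rebuild the line from the fields after the skipped ones
            n = split($0, a)
            key = ""
            for (i = fcount + 1; i <= n; i++)
                key = key (i > fcount + 1 ? " " : "") a[i]
        }
        if (charcount > 0)
            key = substr(key, charcount + 1)
        print key
    }'
}

echo "10 apple pie" | cmp_key 1 0    # prints: apple pie
echo "10 apple pie" | cmp_key 1 2    # prints: ple pie
```

Two lines count as equal for a given field count and character count exactly when their keys are identical, so `1 x y' and `2 x y' match once one field is skipped.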
- tvpeople:*:101:johnny,jay,arsenio
- tvpeople:*:101:david,conan,tom,joan
+ The following two rules are the body of the program. The first one
+is executed only for the very first line of data. It sets `last' equal
+to `$0', so that subsequent lines of text have something to be compared
+to.
- For this reason, `_gr_init()' looks to see if a group name or group
-ID number is already seen. If it is, then the user names are simply
-concatenated onto the previous list of users. (There is actually a
-subtle problem with the code just presented. Suppose that the first
-time there were no names. This code adds the names with a leading
-comma. It also doesn't check that there is a `$4'.)
+ The second rule does the work. The variable `equal' is one or zero,
+depending upon the results of `are_equal()''s comparison. If `uniq' is
+counting repeated lines, and the lines are equal, then it increments
+the `count' variable. Otherwise, it prints the line and resets `count',
+since the two lines are not equal.
- Finally, `_gr_init()' closes the pipeline to `grcat', restores `FS'
-(and `FIELDWIDTHS' or `FPAT' if necessary), `RS', and `$0', initializes
-`_gr_count' to zero (it is used later), and makes `_gr_inited' nonzero.
+ If `uniq' is not counting, and if the lines are equal, `count' is
+incremented. Nothing is printed, since the point is to remove
+duplicates. Otherwise, if `uniq' is counting repeated lines and more
+than one line is seen, or if `uniq' is counting nonrepeated lines and
+only one line is seen, then the line is printed, and `count' is reset.
- The `getgrnam()' function takes a group name as its argument, and if
-that group exists, it is returned. Otherwise, it relies on the array
-reference to a nonexistent element to create the element with the null
-string as its value:
+ Finally, similar logic is used in the `END' rule to print the final
+line of input data:
- function getgrnam(group)
- {
- _gr_init()
- return _gr_byname[group]
+ NR == 1 {
+ last = $0
+ next
}
- The `getgrgid()' function is similar; it takes a numeric group ID and
-looks up the information associated with that group ID:
-
- function getgrgid(gid)
{
- _gr_init()
- return _gr_bygid[gid]
- }
+ equal = are_equal()
- The `getgruser()' function does not have a C counterpart. It takes a
-user name and returns the list of groups that have the user as a member:
+ if (do_count) { # overrides -d and -u
+ if (equal)
+ count++
+ else {
+ printf("%4d %s\n", count, last) > outputfile
+ last = $0
+ count = 1 # reset
+ }
+ next
+ }
- function getgruser(user)
- {
- _gr_init()
- return _gr_groupsbyuser[user]
+ if (equal)
+ count++
+ else {
+ if ((repeated_only && count > 1) ||
+ (non_repeated_only && count == 1))
+ print last > outputfile
+ last = $0
+ count = 1
+ }
}
- The `getgrent()' function steps through the database one entry at a
-time. It uses `_gr_count' to track its position in the list:
-
- function getgrent()
- {
- _gr_init()
- if (++_gr_count in _gr_bycount)
- return _gr_bycount[_gr_count]
- return ""
+ END {
+ if (do_count)
+ printf("%4d %s\n", count, last) > outputfile
+ else if ((repeated_only && count > 1) ||
+ (non_repeated_only && count == 1))
+ print last > outputfile
+ close(outputfile)
}
- The `endgrent()' function resets `_gr_count' to zero so that
-`getgrent()' can start over again:
-
- function endgrent()
- {
- _gr_count = 0
- }
+
+File: gawk.info, Node: Wc Program, Prev: Uniq Program, Up: Clones
- As with the user database routines, each function calls `_gr_init()'
-to initialize the arrays. Doing so only incurs the extra overhead of
-running `grcat' if these functions are used (as opposed to moving the
-body of `_gr_init()' into a `BEGIN' rule).
+13.2.7 Counting Things
+----------------------
- Most of the work is in scanning the database and building the various
-associative arrays. The functions that the user calls are themselves
-very simple, relying on `awk''s associative arrays to do work.
+The `wc' (word count) utility counts lines, words, and characters in
+one or more input files. Its usage is as follows:
- The `id' program in *note Id Program::, uses these functions.
+ wc [-lwc] [ FILES ... ]
-
-File: gawk.info, Node: Walking Arrays, Prev: Group Functions, Up: Library Functions
+ If no files are specified on the command line, `wc' reads its
+standard input. If there are multiple files, it also prints total
+counts for all the files. The options and their meanings are shown in
+the following list:
-13.7 Traversing Arrays of Arrays
-================================
+`-l'
+ Count only lines.
-*note Arrays of Arrays::, described how `gawk' provides arrays of
-arrays. In particular, any element of an array may be either a scalar,
-or another array. The `isarray()' function (*note Type Functions::)
-lets you distinguish an array from a scalar. The following function,
-`walk_array()', recursively traverses an array, printing each element's
-indices and value. You call it with the array and a string
-representing the name of the array:
+`-w'
+ Count only words. A "word" is a contiguous sequence of
+ nonwhitespace characters, separated by spaces and/or TABs.
+ Luckily, this is the normal way `awk' separates fields in its
+ input data.
- function walk_array(arr, name, i)
- {
- for (i in arr) {
- if (isarray(arr[i]))
- walk_array(arr[i], (name "[" i "]"))
- else
- printf("%s[%s] = %s\n", name, i, arr[i])
- }
- }
+`-c'
+ Count only characters.
-It works by looping over each element of the array. If any given
-element is itself an array, the function calls itself recursively,
-passing the subarray and a new string representing the current index.
-Otherwise, the function simply prints the element's name, index, and
-value. Here is a main program to demonstrate:
+ Implementing `wc' in `awk' is particularly elegant, since `awk' does
+a lot of the work for us; it splits lines into words (i.e., fields) and
+counts them, it counts lines (i.e., records), and it can easily tell us
+how long a line is.
- BEGIN {
- a[1] = 1
- a[2][1] = 21
- a[2][2] = 22
- a[3] = 3
- a[4][1][1] = 411
- a[4][2] = 42
+ This program uses the `getopt()' library function (*note Getopt
+Function::) and the file-transition functions (*note Filetrans
+Function::).
- walk_array(a, "a")
- }
+ This version has one notable difference from traditional versions of
+`wc': it always prints the counts in the order lines, words, and
+characters. Traditional versions note the order of the `-l', `-w', and
+`-c' options on the command line, and print the counts in that order.
- When run, the program produces the following output:
+ The `BEGIN' rule does the argument processing. The variable
+`print_total' is true if more than one file is named on the command
+line:
- $ gawk -f walk_array.awk
- -| a[4][1][1] = 411
- -| a[4][2] = 42
- -| a[1] = 1
- -| a[2][1] = 21
- -| a[2][2] = 22
- -| a[3] = 3
+ # wc.awk --- count lines, words, characters
-
-File: gawk.info, Node: Sample Programs, Next: Debugger, Prev: Library Functions, Up: Top
+ # Options:
+ # -l only count lines
+ # -w only count words
+ # -c only count characters
+ #
+ # Default is to count lines, words, characters
+ #
+ # Requires getopt() and file transition library functions
-14 Practical `awk' Programs
-***************************
+ BEGIN {
+ # let getopt() print a message about
+ # invalid options. we ignore them
+ while ((c = getopt(ARGC, ARGV, "lwc")) != -1) {
+ if (c == "l")
+ do_lines = 1
+ else if (c == "w")
+ do_words = 1
+ else if (c == "c")
+ do_chars = 1
+ }
+ for (i = 1; i < Optind; i++)
+ ARGV[i] = ""
-*note Library Functions::, presents the idea that reading programs in a
-language contributes to learning that language. This major node
-continues that theme, presenting a potpourri of `awk' programs for your
-reading enjoyment.
+ # if no options, do all
+ if (! do_lines && ! do_words && ! do_chars)
+ do_lines = do_words = do_chars = 1
- Many of these programs use library functions presented in *note
-Library Functions::.
+ print_total = (ARGC - i > 1)
+ }
-* Menu:
+ The `beginfile()' function is simple; it just resets the counts of
+lines, words, and characters to zero, and saves the current file name in
+`fname':
-* Running Examples:: How to run these examples.
-* Clones:: Clones of common utilities.
-* Miscellaneous Programs:: Some interesting `awk' programs.
+ function beginfile(file)
+ {
+ lines = words = chars = 0
+ fname = FILENAME
+ }
-
-File: gawk.info, Node: Running Examples, Next: Clones, Up: Sample Programs
+ The `endfile()' function adds the current file's numbers to the
+running totals of lines, words, and characters.(1) It then prints out
+those numbers for the file that was just read. It relies on
+`beginfile()' to reset the numbers for the following data file:
-14.1 Running the Example Programs
-=================================
+ function endfile(file)
+ {
+ tlines += lines
+ twords += words
+ tchars += chars
+ if (do_lines)
+ printf "\t%d", lines
+ if (do_words)
+ printf "\t%d", words
+ if (do_chars)
+ printf "\t%d", chars
+ printf "\t%s\n", fname
+ }
-To run a given program, you would typically do something like this:
+ There is one rule that is executed for each line. It adds the length
+of the record, plus one, to `chars'.(2) Adding one plus the record
+length is needed because the newline character separating records (the
+value of `RS') is not part of the record itself, and thus not included
+in its length. Next, `lines' is incremented for each line read, and
+`words' is incremented by the value of `NF', which is the number of
+"words" on this line:
- awk -f PROGRAM -- OPTIONS FILES
+ # do per line
+ {
+ chars += length($0) + 1 # get newline
+ lines++
+ words += NF
+ }
-Here, PROGRAM is the name of the `awk' program (such as `cut.awk'),
-OPTIONS are any command-line options for the program that start with a
-`-', and FILES are the actual data files.
+ Finally, the `END' rule simply prints the totals for all the files:
- If your system supports the `#!' executable interpreter mechanism
-(*note Executable Scripts::), you can instead run your program directly:
+ END {
+ if (print_total) {
+ if (do_lines)
+ printf "\t%d", tlines
+ if (do_words)
+ printf "\t%d", twords
+ if (do_chars)
+ printf "\t%d", tchars
+ print "\ttotal"
+ }
+ }
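As a rough illustration of the per-line counting rule described above, the following sketch runs only that rule and an `END' summary from the shell. It is not part of the commit or of `wc.awk' itself: the function name `wc_sketch' is hypothetical, and option handling, per-file totals, and the `getopt()' and file-transition library functions are all left out.

```shell
# Minimal sketch: count lines, words, and characters on standard input.
# wc_sketch is a hypothetical name; it is not part of wc.awk.
wc_sketch() {
    awk '
    {
        chars += length($0) + 1   # +1 for the newline that ended the record
        lines++
        words += NF               # NF is the number of fields ("words")
    }
    END { printf "%d %d %d\n", lines, words, chars }'
}

# Two lines, three words, 12 + 4 = 16 characters.
printf 'hello world\nfoo\n' | wc_sketch    # prints: 2 3 16
```

The sketch always prints the counts in the order lines, words, characters, which mirrors the fixed ordering that the text notes for this version of `wc'.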
- cut.awk -c1-8 myfiles > results
+ ---------- Footnotes ----------
- If your `awk' is not `gawk', you may instead need to use this:
+ (1) `wc' can't just use the value of `FNR' in `endfile()'. If you
+examine the code in *note Filetrans Function::, you will see that `FNR'
+has already been reset by the time `endfile()' is called.
- cut.awk -- -c1-8 myfiles > results
+ (2) Since `gawk' understands multibyte locales, this code counts
+characters, not bytes.
-File: gawk.info, Node: Clones, Next: Miscellaneous Programs, Prev: Running Examples, Up: Sample Programs
-
-14.2 Reinventing Wheels for Fun and Profit
-==========================================
-
-This minor node presents a number of POSIX utilities implemented in
-`awk'. Reinventing these programs in `awk' is often enjoyable, because
-the algorithms can be very clearly expressed, and the code is usually
-very concise and simple. This is true because `awk' does so much for
-you.
+File: gawk.info, Node: Miscellaneous Programs, Prev: Clones, Up: Sample Programs
- It should be noted that these programs are not necessarily intended
-to replace the installed versions on your system. Nor may all of these
-programs be fully compliant with the most recent POSIX standard. This
-is not a problem; their purpose is to illustrate `awk' language
-programming for "real world" tasks.
+13.3 A Grab Bag of `awk' Programs
+=================================
- The programs are presented in alphabetical order.
+This minor node is a large "grab bag" of miscellaneous programs. We
+hope you find them both interesting and enjoyable.
* Menu:
-* Cut Program:: The `cut' utility.
-* Egrep Program:: The `egrep' utility.
-* Id Program:: The `id' utility.
-* Split Program:: The `split' utility.
-* Tee Program:: The `tee' utility.
-* Uniq Program:: The `uniq' utility.
-* Wc Program:: The `wc' utility.
+* Dupword Program:: Finding duplicated words in a document.
+* Alarm Program:: An alarm clock.
+* Translate Program:: A program similar to the `tr' utility.
+* Labels Program:: Printing mailing labels.
+* Word Sorting:: A program to produce a word usage count.
+* History Sorting:: Eliminating duplicate entries from a history
+ file.
+* Extract Program:: Pulling out programs from Texinfo source
+ files.
+* Simple Sed:: A Simple Stream Editor.
+* Igawk Program:: A wrapper for `awk' that includes
+ files.
+* Anagram Program:: Finding anagrams from a dictionary.
+* Signature Program:: People do amazing things with too much time on
+ their hands.
-File: gawk.info, Node: Cut Program, Next: Egrep Program, Up: Clones
+File: gawk.info, Node: Dupword Program, Next: Alarm Program, Up: Miscellaneous Programs
-14.2.1 Cutting out Fields and Columns
--------------------------------------
+13.3.1 Finding Duplicated Words in a Document
+---------------------------------------------
-The `cut' utility selects, or "cuts," characters or fields from its
-standard input and sends them to its standard output. Fields are
-separated by TABs by default, but you may supply a command-line option
-to change the field "delimiter" (i.e., the field-separator character).
-`cut''s definition of fields is less general than `awk''s.
+A common error when writing large amounts of prose is to accidentally
+duplicate words. Typically you will see this in text as something like
+"the the program does the following..." When the text is online, often
+the duplicated words occur at the end of one line and the beginning of
+another, making them very difficult to spot.
- A common use of `cut' might be to pull out just the login name of
-logged-on users from the output of `who'. For example, the following
-pipeline generates a sorted, unique list of the logged-on users:
+ This program, `dupword.awk', scans through a file one line at a time
+and looks for adjacent occurrences of the same word. It also saves the
+last word on a line (in the variable `prev') for comparison with the
+first word on the next line.
- who | cut -c1-8 | sort | uniq
+ The first two statements make sure that the line is all lowercase,
+so that, for example, "The" and "the" compare equal to each other. The
+next statement replaces nonalphanumeric and nonwhitespace characters
+with spaces, so that punctuation does not affect the comparison either.
+The characters are replaced with spaces so that formatting controls
+don't create nonsense words (e.g., the Texinfo `@code{NF}' becomes
+`codeNF' if punctuation is simply deleted). The record is then resplit
+into fields, yielding just the actual words on the line, and ensuring
+that there are no empty fields.
- The options for `cut' are:
+ If there are no fields left after removing all the punctuation, the
+current record is skipped. Otherwise, the program loops through each
+word, comparing it to the previous one:
-`-c LIST'
- Use LIST as the list of characters to cut out. Items within the
- list may be separated by commas, and ranges of characters can be
- separated with dashes. The list `1-8,15,22-35' specifies
- characters 1 through 8, 15, and 22 through 35.
+ # dupword.awk --- find duplicate words in text
+ {
+ $0 = tolower($0)
+ gsub(/[^[:alnum:][:blank:]]/, " ");
+ $0 = $0 # re-split
+ if (NF == 0)
+ next
+ if ($1 == prev)
+ printf("%s:%d: duplicate %s\n",
+ FILENAME, FNR, $1)
+ for (i = 2; i <= NF; i++)
+ if ($i == $(i-1))
+ printf("%s:%d: duplicate %s\n",
+ FILENAME, FNR, $i)
+ prev = $NF
+ }
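To see the duplicate-word logic in action, here is a small shell sketch that applies the same steps to a three-line sample. It is an illustration only, not part of the commit: the function name `dupword_sketch' is hypothetical, it reports `NR' instead of `FILENAME' and `FNR' because it reads standard input, and it spells the character class out as `[^a-z0-9 \t]' on the assumption that the `awk' in use may lack POSIX character classes.

```shell
# Minimal sketch of the dupword.awk logic, reading standard input.
# dupword_sketch is a hypothetical name; the explicit character class
# stands in for [^[:alnum:][:blank:]] after the line is lowercased.
dupword_sketch() {
    awk '
    {
        $0 = tolower($0)
        gsub(/[^a-z0-9 \t]/, " ")   # punctuation becomes spaces
        $0 = $0                     # re-split into fields
        if (NF == 0)
            next
        if ($1 == prev)             # duplicate across a line break
            printf("%d: duplicate %s\n", NR, $1)
        for (i = 2; i <= NF; i++)
            if ($i == $(i-1))       # duplicate within the line
                printf("%d: duplicate %s\n", NR, $i)
        prev = $NF
    }'
}

printf 'The the program does\nthe following and\nand more\n' | dupword_sketch
# prints:
# 1: duplicate the
# 3: duplicate and
```

The second report shows why `prev' is kept: "and" ends line 2 and begins line 3, so the repetition is only visible across the line boundary.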
-`-f LIST'
- Use LIST as the list of fields to cut out.
+
+File: gawk.info, Node: Alarm Program, Next: Translate Program, Prev: Dupword Program, Up: Miscellaneous Programs
-`-d DELIM'
- Use DELIM as the field-separator character instead of the TAB
- character.
+13.3.2 An Alarm Clock Program
+-----------------------------
-`-s'
- Suppress printing of lines that do not contain the field delimiter.
+ Nothing cures insomnia like a ringing alarm clock.
+ Arnold Robbins
- The `awk' implementation of `cut' uses the `getopt()' library
-function (*note Getopt Function::) and the `join()' library function
-(*note Join Function::).
+ The following program is a simple "alarm clock" program. You give
+it a time of day and an optional message. At the specified time, it
+prints the message on the standard output. In addition, you can give it
+the number of times to repeat the message as well as a delay between
+repetitions.
- The program begins with a comment describing the options, the library
-functions needed, and a `usage()' function that prints out a usage
-message and exits. `usage()' is called if invalid arguments are
-supplied:
+ This program uses the `getlocaltime()' function from *note
+Getlocaltime Function::.
- # cut.awk --- implement cut in awk
+ All the work is done in the `BEGIN' rule. The first part is argument
+checking and setting of defaults: the delay, the count, and the message
+to print. If the user supplied a message without the ASCII BEL
+character (known as the "alert" character, `"\a"'), then it is added to
+the message. (On many systems, printing the ASCII BEL generates an
+audible alert. Thus when the alarm goes off, the system calls attention
+to itself in case the user is not looking at the computer.) Just for a
+change, this program uses a `switch' statement (*note Switch
+Statement::), but the processing could be done with a series of
+`if'-`else' statements instead. Here is the program:
- # Options:
- # -f list Cut fields
- # -d c Field delimiter character
- # -c list Cut characters
- #
- # -s Suppress lines without the delimiter
+ # alarm.awk --- set an alarm
#
- # Requires getopt() and join() library functions
-
- function usage( e1, e2)
- {
- e1 = "usage: cut [-f list] [-d c] [-s] [files...]"
- e2 = "usage: cut [-c list] [files...]"
- print e1 > "/dev/stderr"
- print e2 > "/dev/stderr"
- exit 1
- }
-
-The variables `e1' and `e2' are used so that the function fits nicely
-on the screen.
-
- Next comes a `BEGIN' rule that parses the command-line options. It
-sets `FS' to a single TAB character, because that is `cut''s default
-field separator. The rule then sets the output field separator to be the
-same as the input field separator. A loop using `getopt()' steps
-through the command-line options. Exactly one of the variables
-`by_fields' or `by_chars' is set to true, to indicate that processing
-should be done by fields or by characters, respectively. When cutting
-by characters, the output field separator is set to the null string:
+ # Requires getlocaltime() library function
+ # usage: alarm time [ "message" [ count [ delay ] ] ]
BEGIN \
{
- FS = "\t" # default
- OFS = FS
- while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) {
- if (c == "f") {
- by_fields = 1
- fieldlist = Optarg
- } else if (c == "c") {
- by_chars = 1
- fieldlist = Optarg
- OFS = ""
- } else if (c == "d") {
- if (length(Optarg) > 1) {
- printf("Using first character of %s" \
- " for delimiter\n", Optarg) > "/dev/stderr"
- Optarg = substr(Optarg, 1, 1)
- }
- FS = Optarg
- OFS = FS
- if (FS == " ") # defeat awk semantics
- FS = "[ ]"
- } else if (c == "s")
- suppress++
- else
- usage()
+ # Initial argument sanity checking
+ usage1 = "usage: alarm time ['message' [count [delay]]]"
+ usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])
+
+ if (ARGC < 2) {
+ print usage1 > "/dev/stderr"
+ print usage2 > "/dev/stderr"
+ exit 1
+ }
+ switch (ARGC) {
+ case 5:
+ delay = ARGV[4] + 0
+ # fall through
+ case 4:
+ count = ARGV[3] + 0
+ # fall through
+ case 3:
+ message = ARGV[2]
+ break
+ default:
+ if (ARGV[1] !~ /[[:digit:]]?[[:digit:]]:[[:digit:]]{2}/) {
+ print usage1 > "/dev/stderr"
+ print usage2 > "/dev/stderr"
+ exit 1
+ }
+ break
}
- # Clear out options
- for (i = 1; i < Optind; i++)
- ARGV[i] = ""
+ # set defaults for once we reach the desired time
+ if (delay == 0)
+ delay = 180 # 3 minutes
+ if (count == 0)
+ count = 5
+ if (message == "")
+ message = sprintf("\aIt is now %s!\a", ARGV[1])
+ else if (index(message, "\a") == 0)
+ message = "\a" message "\a"
- The code must take special care when the field delimiter is a space.
-Using a single space (`" "') for the value of `FS' is incorrect--`awk'
-would separate fields with runs of spaces, TABs, and/or newlines, and
-we want them to be separated with individual spaces. Also remember
-that after `getopt()' is through (as described in *note Getopt
-Function::), we have to clear out all the elements of `ARGV' from 1 to
-`Optind', so that `awk' does not try to process the command-line options
-as file names.
+ The next section of code turns the alarm time into hours and
+minutes, converts it (if necessary) to a 24-hour clock, and then turns
+that time into a count of the seconds since midnight. Next it turns
+the current time into a count of seconds since midnight. The
+difference between the two is how long to wait before setting off the
+alarm:
- After dealing with the command-line options, the program verifies
-that the options make sense. Only one or the other of `-c' and `-f'
-should be used, and both require a field list. Then the program calls
-either `set_fieldlist()' or `set_charlist()' to pull apart the list of
-fields or characters:
+ # split up alarm time
+ split(ARGV[1], atime, ":")
+ hour = atime[1] + 0 # force numeric
+ minute = atime[2] + 0 # force numeric
- if (by_fields && by_chars)
- usage()
+ # get current broken down time
+ getlocaltime(now)
- if (by_fields == 0 && by_chars == 0)
- by_fields = 1 # default
-
- if (fieldlist == "") {
- print "cut: needs list for -c or -f" > "/dev/stderr"
- exit 1
- }
+ # if time given is 12-hour hours and it's after that
+ # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
+ # then add 12 to real hour
+ if (hour < 12 && now["hour"] > hour)
+ hour += 12
- if (by_fields)
- set_fieldlist()
- else
- set_charlist()
- }
+ # set target time in seconds since midnight
+ target = (hour * 60 * 60) + (minute * 60)
- `set_fieldlist()' splits the field list apart at the commas into an
-array. Then, for each element of the array, it looks to see if the
-element is actually a range, and if so, splits it apart. The function
-checks the range to make sure that the first number is smaller than the
-second. Each number in the list is added to the `flist' array, which
-simply lists the fields that will be printed. Normal field splitting
-is used. The program lets `awk' handle the job of doing the field
-splitting:
+ # get current time in seconds since midnight
+ current = (now["hour"] * 60 * 60) + \
+ (now["minute"] * 60) + now["second"]
- function set_fieldlist( n, m, i, j, k, f, g)
- {
- n = split(fieldlist, f, ",")
- j = 1 # index in flist
- for (i = 1; i <= n; i++) {
- if (index(f[i], "-") != 0) { # a range
- m = split(f[i], g, "-")
- if (m != 2 || g[1] >= g[2]) {
- printf("bad field list: %s\n",
- f[i]) > "/dev/stderr"
- exit 1
- }
- for (k = g[1]; k <= g[2]; k++)
- flist[j++] = k
- } else
- flist[j++] = f[i]
+ # how long to sleep for
+ naptime = target - current
+ if (naptime <= 0) {
+ print "time is in the past!" > "/dev/stderr"
+ exit 1
}
- nfields = j - 1
- }
- The `set_charlist()' function is more complicated than
-`set_fieldlist()'. The idea here is to use `gawk''s `FIELDWIDTHS'
-variable (*note Constant Size::), which describes constant-width input.
-When using a character list, that is exactly what we have.
+ Finally, the program uses the `system()' function (*note I/O
+Functions::) to call the `sleep' utility. The `sleep' utility simply
+pauses for the given number of seconds. If the exit status is not zero,
+the program assumes that `sleep' was interrupted and exits. If `sleep'
+exited with an OK status (zero), then the program prints the message in
+a loop, again using `sleep' to delay for however many seconds are
+necessary:
- Setting up `FIELDWIDTHS' is more complicated than simply listing the
-fields that need to be printed. We have to keep track of the fields to
-print and also the intervening characters that have to be skipped. For
-example, suppose you wanted characters 1 through 8, 15, and 22 through
-35. You would use `-c 1-8,15,22-35'. The necessary value for
-`FIELDWIDTHS' is `"8 6 1 6 14"'. This yields five fields, and the
-fields to print are `$1', `$3', and `$5'. The intermediate fields are
-"filler", which is stuff in between the desired data. `flist' lists
-the fields to print, and `t' tracks the complete field list, including
-filler fields:
+ # zzzzzz..... go away if interrupted
+ if (system(sprintf("sleep %d", naptime)) != 0)
+ exit 1
- function set_charlist( field, i, j, f, g, t,
- filler, last, len)
- {
- field = 1 # count total fields
- n = split(fieldlist, f, ",")
- j = 1 # index in flist
- for (i = 1; i <= n; i++) {
- if (index(f[i], "-") != 0) { # range
- m = split(f[i], g, "-")
- if (m != 2 || g[1] >= g[2]) {
- printf("bad character list: %s\n",
- f[i]) > "/dev/stderr"
- exit 1
- }
- len = g[2] - g[1] + 1
- if (g[1] > 1) # compute length of filler
- filler = g[1] - last - 1
- else
- filler = 0
- if (filler)
- t[field++] = filler
- t[field++] = len # length of field
- last = g[2]
- flist[j++] = field - 1
- } else {
- if (f[i] > 1)
- filler = f[i] - last - 1
- else
- filler = 0
- if (filler)
- t[field++] = filler
- t[field++] = 1
- last = f[i]
- flist[j++] = field - 1
- }
+ # time to notify!
+ command = sprintf("sleep %d", delay)
+ for (i = 1; i <= count; i++) {
+ print message
+ # if sleep command interrupted, go away
+ if (system(command) != 0)
+ break
}
- FIELDWIDTHS = join(t, 1, field - 1)
- nfields = j - 1
- }
-
- Next is the rule that actually processes the data. If the `-s'
-option is given, then `suppress' is true. The first `if' statement
-makes sure that the input record does have the field separator. If
-`cut' is processing fields, `suppress' is true, and the field separator
-character is not in the record, then the record is skipped.
-
- If the record is valid, then `gawk' has split the data into fields,
-either using the character in `FS' or using fixed-length fields and
-`FIELDWIDTHS'. The loop goes through the list of fields that should be
-printed. The corresponding field is printed if it contains data. If
-the next field also has data, then the separator character is written
-out between the fields:
-
- {
- if (by_fields && suppress && index($0, FS) == 0)
- next
- for (i = 1; i <= nfields; i++) {
- if ($flist[i] != "") {
- printf "%s", $flist[i]
- if (i < nfields && $flist[i+1] != "")
- printf "%s", OFS
- }
- }
- print ""
+ exit 0
}
- This version of `cut' relies on `gawk''s `FIELDWIDTHS' variable to
-do the character-based cutting. While it is possible in other `awk'
-implementations to use `substr()' (*note String Functions::), it is
-also extremely painful. The `FIELDWIDTHS' variable supplies an elegant
-solution to the problem of picking the input line apart by characters.
-
-File: gawk.info, Node: Egrep Program, Next: Id Program, Prev: Cut Program, Up: Clones
-
-14.2.2 Searching for Regular Expressions in Files
--------------------------------------------------
+File: gawk.info, Node: Translate Program, Next: Labels Program, Prev: Alarm Program, Up: Miscellaneous Programs
-The `egrep' utility searches files for patterns. It uses regular
-expressions that are almost identical to those available in `awk'
-(*note Regexp::). You invoke it as follows:
+13.3.3 Transliterating Characters
+---------------------------------
- egrep [ OPTIONS ] 'PATTERN' FILES ...
+The system `tr' utility transliterates characters. For example, it is
+often used to map uppercase letters into lowercase for further
+processing:
- The PATTERN is a regular expression. In typical usage, the regular
-expression is quoted to prevent the shell from expanding any of the
-special characters as file name wildcards. Normally, `egrep' prints
-the lines that matched. If multiple file names are provided on the
-command line, each output line is preceded by the name of the file and
-a colon.
+ GENERATE DATA | tr 'A-Z' 'a-z' | PROCESS DATA ...
- The options to `egrep' are as follows:
+ `tr' requires two lists of characters.(1) When processing the
+input, the first character in the first list is replaced with the first
+character in the second list, the second character in the first list is
+replaced with the second character in the second list, and so on. If
+there are more characters in the "from" list than in the "to" list, the
+last character of the "to" list is used for the remaining characters in
+the "from" list.
-`-c'
- Print out a count of the lines that matched the pattern, instead
- of the lines themselves.
+ Some time ago, a user proposed that a transliteration function should
+be added to `gawk'. The following program was written to prove that
+character transliteration could be done with a user-level function.
+This program is not as complete as the system `tr' utility but it does
+most of the job.
-`-s'
- Be silent. No output is produced and the exit value indicates
- whether the pattern was matched.
+ The `translate' program demonstrates one of the few weaknesses of
+standard `awk': dealing with individual characters is very painful,
+requiring repeated use of the `substr()', `index()', and `gsub()'
+built-in functions (*note String Functions::).(2) There are two
+functions. The first, `stranslate()', takes three arguments:
-`-v'
- Invert the sense of the test. `egrep' prints the lines that do
- _not_ match the pattern and exits successfully if the pattern is
- not matched.
+`from'
+ A list of characters from which to translate.
-`-i'
- Ignore case distinctions in both the pattern and the input data.
+`to'
+ A list of characters to which to translate.
-`-l'
- Only print (list) the names of the files that matched, not the
- lines that matched.
+`target'
+ The string on which to do the translation.
-`-e PATTERN'
- Use PATTERN as the regexp to match. The purpose of the `-e'
- option is to allow patterns that start with a `-'.
+ Associative arrays make the translation part fairly easy. `t_ar'
+holds the "to" characters, indexed by the "from" characters. Then a
+simple loop goes through `from', one character at a time. For each
+character in `from', if the character appears in `target', it is
+replaced with the corresponding `to' character.
- This version uses the `getopt()' library function (*note Getopt
-Function::) and the file transition library program (*note Filetrans
-Function::).
+ The `translate()' function simply calls `stranslate()' using `$0' as
+the target. The main program sets two global variables, `FROM' and
+`TO', from the command line, and then changes `ARGV' so that `awk'
+reads from the standard input.
- The program begins with a descriptive comment and then a `BEGIN' rule
-that processes the command-line arguments with `getopt()'. The `-i'
-(ignore case) option is particularly easy with `gawk'; we just use the
-`IGNORECASE' built-in variable (*note Built-in Variables::):
+ Finally, the processing rule simply calls `translate()' for each
+record:
- # egrep.awk --- simulate egrep in awk
- #
- # Options:
- # -c count of lines
- # -s silent - use exit value
- # -v invert test, success if no match
- # -i ignore case
- # -l print filenames only
- # -e argument is pattern
- #
- # Requires getopt and file transition library functions
+ # translate.awk --- do tr-like stuff
+ # Bugs: does not handle things like: tr A-Z a-z, it has
+ # to be spelled out. However, if `to' is shorter than `from',
+ # the last character in `to' is used for the rest of `from'.
- BEGIN {
- while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) {
- if (c == "c")
- count_only++
- else if (c == "s")
- no_print++
- else if (c == "v")
- invert++
- else if (c == "i")
- IGNORECASE = 1
- else if (c == "l")
- filenames_only++
- else if (c == "e")
- pattern = Optarg
- else
- usage()
+ function stranslate(from, to, target, lf, lt, ltarget, t_ar, i, c,
+ result)
+ {
+ lf = length(from)
+ lt = length(to)
+ ltarget = length(target)
+ for (i = 1; i <= lt; i++)
+ t_ar[substr(from, i, 1)] = substr(to, i, 1)
+ if (lt < lf)
+ for (; i <= lf; i++)
+ t_ar[substr(from, i, 1)] = substr(to, lt, 1)
+ for (i = 1; i <= ltarget; i++) {
+ c = substr(target, i, 1)
+ if (c in t_ar)
+ c = t_ar[c]
+ result = result c
}
+ return result
+ }
- Next comes the code that handles the `egrep'-specific behavior. If no
-pattern is supplied with `-e', the first nonoption on the command line
-is used. The `awk' command-line arguments up to `ARGV[Optind]' are
-cleared, so that `awk' won't try to process them as files. If no files
-are specified, the standard input is used, and if multiple files are
-specified, we make sure to note this so that the file names can precede
-the matched lines in the output:
-
- if (pattern == "")
- pattern = ARGV[Optind++]
+ function translate(from, to)
+ {
+ return $0 = stranslate(from, to, $0)
+ }
- for (i = 1; i < Optind; i++)
- ARGV[i] = ""
- if (Optind >= ARGC) {
- ARGV[1] = "-"
- ARGC = 2
- } else if (ARGC - Optind > 1)
- do_filenames++
+ # main program
+ BEGIN {
+ if (ARGC < 3) {
+ print "usage: translate from to" > "/dev/stderr"
+ exit
+ }
+ FROM = ARGV[1]
+ TO = ARGV[2]
+ ARGC = 2
+ ARGV[1] = "-"
+ }
- # if (IGNORECASE)
- # pattern = tolower(pattern)
+ {
+ translate(FROM, TO)
+ print
}
- The last two lines are commented out, since they are not needed in
-`gawk'. They should be uncommented if you have to use another version
-of `awk'.
+ While it is possible to do character transliteration in a user-level
+function, it is not necessarily efficient, and we (the `gawk' authors)
+started to consider adding a built-in function. However, shortly after
+writing this program, we learned that the System V Release 4 `awk' had
+added the `toupper()' and `tolower()' functions (*note String
+Functions::). These functions handle the vast majority of the cases
+where character transliteration is necessary, and so we chose to simply
+add those functions to `gawk' as well and then leave well enough alone.
- The next set of lines should be uncommented if you are not using
-`gawk'. This rule translates all the characters in the input line into
-lowercase if the `-i' option is specified.(1) The rule is commented out
-since it is not necessary with `gawk':
+ An obvious improvement to this program would be to set up the `t_ar'
+array only once, in a `BEGIN' rule. However, this assumes that the
+"from" and "to" lists will never change throughout the lifetime of the
+program.
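The note in the program's header comment (a short `to' list reuses its last character for the rest of `from') can be checked with a small driver. This is only a sketch: the `stranslate()' body is copied from the listing above, while the sample strings `abc', `xy', and `aabbcc' are made up for illustration.

```shell
# Sketch: exercise stranslate() from translate.awk on a fixed string.
# The sample strings "abc", "xy", and "aabbcc" are invented test data.
result=$(awk '
function stranslate(from, to, target,     lf, lt, ltarget, t_ar, i, c, result)
{
    lf = length(from)
    lt = length(to)
    ltarget = length(target)
    # map each character of "from" to the character at the same position in "to"
    for (i = 1; i <= lt; i++)
        t_ar[substr(from, i, 1)] = substr(to, i, 1)
    # "to" is shorter: its last character covers the rest of "from"
    if (lt < lf)
        for (; i <= lf; i++)
            t_ar[substr(from, i, 1)] = substr(to, lt, 1)
    for (i = 1; i <= ltarget; i++) {
        c = substr(target, i, 1)
        if (c in t_ar)
            c = t_ar[c]
        result = result c
    }
    return result
}
BEGIN { print stranslate("abc", "xy", "aabbcc") }
')
printf '%s\n' "$result"
```

Here `a' maps to `x', `b' maps to `y', and `c' falls back to `y' as well, so the driver prints `xxyyyy'.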
- #{
- # if (IGNORECASE)
- # $0 = tolower($0)
- #}
+ ---------- Footnotes ----------
- The `beginfile()' function is called by the rule in `ftrans.awk'
-when each new file is processed. In this case, it is very simple; all
-it does is initialize a variable `fcount' to zero. `fcount' tracks how
-many lines in the current file matched the pattern. Naming the
-parameter `junk' shows we know that `beginfile()' is called with a
-parameter, but that we're not interested in its value:
+ (1) On some older systems, `tr' may require that the lists be
+written as range expressions enclosed in square brackets (`[a-z]') and
+quoted, to prevent the shell from attempting a file name expansion.
+This is not a feature.
- function beginfile(junk)
- {
- fcount = 0
- }
+ (2) This program was written before `gawk' acquired the ability to
+split each character in a string into separate array elements.
- The `endfile()' function is called after each file has been
-processed. It affects the output only when the user wants a count of
-the number of lines that matched. `no_print' is true only if the exit
-status is desired. `count_only' is true if line counts are desired.
-`egrep' therefore only prints line counts if printing and counting are
-enabled. The output format must be adjusted depending upon the number
-of files to process. Finally, `fcount' is added to `total', so that we
-know the total number of lines that matched the pattern:
+
+File: gawk.info, Node: Labels Program, Next: Word Sorting, Prev: Translate Program, Up: Miscellaneous Programs
- function endfile(file)
- {
- if (! no_print && count_only) {
- if (do_filenames)
- print file ":" fcount
- else
- print fcount
- }
+13.3.4 Printing Mailing Labels
+------------------------------
- total += fcount
- }
+Here is a "real world"(1) program. This script reads lists of names and
+addresses and generates mailing labels. Each page of labels has 20
+labels on it, two across and 10 down. The addresses are guaranteed to
+be no more than five lines of data. Each address is separated from the
+next by a blank line.
- The following rule does most of the work of matching lines. The
-variable `matches' is true if the line matched the pattern. If the user
-wants lines that did not match, the sense of `matches' is inverted
-using the `!' operator. `fcount' is incremented with the value of
-`matches', which is either one or zero, depending upon a successful or
-unsuccessful match. If the line does not match, the `next' statement
-just moves on to the next record.
+ The basic idea is to read 20 labels worth of data. Each line of
+each label is stored in the `line' array. The single rule takes care
+of filling the `line' array and printing the page when 20 labels have
+been read.
- A number of additional tests are made, but they are only done if we
-are not counting lines. First, if the user only wants exit status
-(`no_print' is true), then it is enough to know that _one_ line in this
-file matched, and we can skip on to the next file with `nextfile'.
-Similarly, if we are only printing file names, we can print the file
-name, and then skip to the next file with `nextfile'. Finally, each
-line is printed, with a leading file name and colon if necessary:
+ The `BEGIN' rule simply sets `RS' to the empty string, so that `awk'
+splits records at blank lines (*note Records::). It sets `MAXLINES' to
+100, since 100 is the maximum number of lines on the page (20 * 5 =
+100).
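A minimal sketch of the record splitting described above, assuming two invented addresses as input: with `RS' set to the empty string each blank-line-separated block becomes one record, and `split()' on a newline recovers the individual lines, as the main rule of `labels.awk' does.

```shell
# Sketch: paragraph-mode records (RS = "") and per-record line counts.
# The two addresses are made-up sample data.
out=$(printf 'Ann Lee\n1 Elm St\n\nBob Roy\n2 Oak Ave\nApt 3\n' |
awk 'BEGIN { RS = "" }
{
    n = split($0, a, "\n")    # one array element per address line
    print "record " NR " has " n " lines, first: " a[1]
}')
printf '%s\n' "$out"
```

The first record has two lines and the second has three; `labels.awk' pads each one out to five entries in `line' so that every label occupies the same number of slots.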
- {
- matches = ($0 ~ pattern)
- if (invert)
- matches = ! matches
+ Most of the work is done in the `printpage()' function. The label
+lines are stored sequentially in the `line' array. But they have to
+print horizontally; `line[1]' next to `line[6]', `line[2]' next to
+`line[7]', and so on. Two loops are used to accomplish this. The
+outer loop, controlled by `i', steps through every 10 lines of data;
+this is each row of labels. The inner loop, controlled by `j', goes
+through the lines within the row. As `j' goes from 0 to 4, `i+j' is
+the `j'-th line in the row, and `i+j+5' is the entry next to it. The
+output ends up looking something like this:
- fcount += matches # 1 or 0
+ line 1 line 6
+ line 2 line 7
+ line 3 line 8
+ line 4 line 9
+ line 5 line 10
+ ...
- if (! matches)
- next
+The `printf' format string `%-41s' left-aligns the data and prints it
+within a fixed-width field.
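The index arithmetic for one row of labels can be tried on its own. The following sketch fills `line' with ten placeholder strings and uses a field width of 10 instead of 41 purely to keep the output narrow; both choices are assumptions made for the example, not part of `labels.awk'.

```shell
# Sketch: print one row of two labels side by side.
out=$(awk 'BEGIN {
    # ten stored lines: two labels of five lines each (placeholder data)
    for (k = 1; k <= 10; k++)
        line[k] = "line " k
    i = 1                      # first row of labels
    for (j = 0; j < 5; j++)    # pair line[i+j] with line[i+j+5]
        printf "%-10s%s\n", line[i+j], line[i+j+5]
}')
printf '%s\n' "$out"
```

Each output line places entry `i+j' in a left-aligned, fixed-width field, followed by entry `i+j+5' from the neighboring label.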
- if (! count_only) {
- if (no_print)
- nextfile
+ As a final note, an extra blank line is printed at lines 21 and 61,
+to keep the output lined up on the labels. This is dependent on the
+particular brand of labels in use when the program was written. You
+will also note that there are two blank lines at the top and two blank
+lines at the bottom.
- if (filenames_only) {
- print FILENAME
- nextfile
- }
+ The `END' rule arranges to flush the final page of labels; there may
+not have been an even multiple of 20 labels in the data:
- if (do_filenames)
- print FILENAME ":" $0
- else
- print
+ # labels.awk --- print mailing labels
+
+ # Each label is 5 lines of data that may have blank lines.
+ # The label sheets have 2 blank lines at the top and 2 at
+ # the bottom.
+
+ BEGIN { RS = "" ; MAXLINES = 100 }
+
+ function printpage( i, j)
+ {
+ if (Nlines <= 0)
+ return
+
+ printf "\n\n" # header
+
+ for (i = 1; i <= Nlines; i += 10) {
+ if (i == 21 || i == 61)
+ print ""
+ for (j = 0; j < 5; j++) {
+ if (i + j > MAXLINES)
+ break
+ printf " %-41s %s\n", line[i+j], line[i+j+5]
+ }
+ print ""
}
+
+ printf "\n\n" # footer
+
+ delete line
}
- The `END' rule takes care of producing the correct exit status. If
-there are no matches, the exit status is one; otherwise it is zero:
+ # main rule
+ {
+ if (Count >= 20) {
+ printpage()
+ Count = 0
+ Nlines = 0
+ }
+ n = split($0, a, "\n")
+ for (i = 1; i <= n; i++)
+ line[++Nlines] = a[i]
+ for (; i <= 5; i++)
+ line[++Nlines] = ""
+ Count++
+ }
END \
{
- if (total == 0)
- exit 1
- exit 0
+ printpage()
}
- The `usage()' function prints a usage message in case of invalid
-options, and then exits:
+ ---------- Footnotes ----------
- function usage( e)
- {
- e = "Usage: egrep [-csvil] [-e pat] [files ...]"
- e = e "\n\tegrep [-csvil] pat [files ...]"
- print e > "/dev/stderr"
- exit 1
- }
+ (1) "Real world" is defined as "a program actually used to get
+something done."
- The variable `e' is used so that the function fits nicely on the
-printed page.
+
+File: gawk.info, Node: Word Sorting, Next: History Sorting, Prev: Labels Program, Up: Miscellaneous Programs
- Just a note on programming style: you may have noticed that the `END'
-rule uses backslash continuation, with the open brace on a line by
-itself. This is so that it more closely resembles the way functions
-are written. Many of the examples in this major node use this style.
-You can decide for yourself if you like writing your `BEGIN' and `END'
-rules this way or not.
+13.3.5 Generating Word-Usage Counts
+-----------------------------------
- ---------- Footnotes ----------
+When working with large amounts of text, it can be interesting to know
+how often different words appear. For example, an author may overuse
+certain words, in which case she might wish to find synonyms to
+substitute for words that appear too often. This node develops a
+program for counting words and presenting the frequency information in
+a useful format.
- (1) It also introduces a subtle bug; if a match happens, we output
-the translated line, not the original.
+ At first glance, a program like this would seem to do the job:
-
-File: gawk.info, Node: Id Program, Next: Split Program, Prev: Egrep Program, Up: Clones
+ # Print list of word frequencies
-14.2.3 Printing out User Information
-------------------------------------
+ {
+ for (i = 1; i <= NF; i++)
+ freq[$i]++
+ }
-The `id' utility lists a user's real and effective user ID numbers,
-real and effective group ID numbers, and the user's group set, if any.
-`id' only prints the effective user ID and group ID if they are
-different from the real ones. If possible, `id' also supplies the
-corresponding user and group names. The output might look like this:
+ END {
+ for (word in freq)
+ printf "%s\t%d\n", word, freq[word]
+ }
- $ id
- -| uid=500(arnold) gid=500(arnold) groups=6(disk),7(lp),19(floppy)
+ The program relies on `awk''s default field splitting mechanism to
+break each line up into "words," and uses an associative array named
+`freq', indexed by each word, to count the number of times the word
+occurs. In the `END' rule, it prints the counts.
- This information is part of what is provided by `gawk''s `PROCINFO'
-array (*note Built-in Variables::). However, the `id' utility provides
-a more palatable output than just individual numbers.
+ This program has several problems that prevent it from being useful
+on real text files:
- Here is a simple version of `id' written in `awk'. It uses the user
-database library functions (*note Passwd Functions::) and the group
-database library functions (*note Group Functions::):
+ * The `awk' language considers upper- and lowercase characters to be
+ distinct. Therefore, "bartender" and "Bartender" are not treated
+ as the same word. This is undesirable, since in normal text, words
+ are capitalized if they begin sentences, and a frequency analyzer
+ should not be sensitive to capitalization.
- The program is fairly straightforward. All the work is done in the
-`BEGIN' rule. The user and group ID numbers are obtained from
-`PROCINFO'. The code is repetitive. The entry in the user database
-for the real user ID number is split into parts at the `:'. The name is
-the first field. Similar code is used for the effective user ID number
-and the group numbers:
+ * Words are detected using the `awk' convention that fields are
+ separated just by whitespace. Other characters in the input
+ (except newlines) don't have any special meaning to `awk'. This
+ means that punctuation characters count as part of words.
- # id.awk --- implement id in awk
- #
- # Requires user and group library functions
- # output is:
- # uid=12(foo) euid=34(bar) gid=3(baz) \
- # egid=5(blat) groups=9(nine),2(two),1(one)
+ * The output does not come out in any useful order. You're more
+ likely to be interested in which words occur most frequently or in
+ having an alphabetized table of how frequently each word occurs.
+
+ The first problem can be solved by using `tolower()' to remove case
+distinctions. The second problem can be solved by using `gsub()' to
+remove punctuation characters. Finally, we solve the third problem by
+using the system `sort' utility to process the output of the `awk'
+script. Here is the new version of the program:
+
+ # wordfreq.awk --- print list of word frequencies
- BEGIN \
{
- uid = PROCINFO["uid"]
- euid = PROCINFO["euid"]
- gid = PROCINFO["gid"]
- egid = PROCINFO["egid"]
+ $0 = tolower($0) # remove case distinctions
+ # remove punctuation
+ gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
+ for (i = 1; i <= NF; i++)
+ freq[$i]++
+ }
- printf("uid=%d", uid)
- pw = getpwuid(uid)
- if (pw != "") {
- split(pw, a, ":")
- printf("(%s)", a[1])
- }
+ END {
+ for (word in freq)
+ printf "%s\t%d\n", word, freq[word]
+ }
- if (euid != uid) {
- printf(" euid=%d", euid)
- pw = getpwuid(euid)
- if (pw != "") {
- split(pw, a, ":")
- printf("(%s)", a[1])
- }
- }
+ Assuming we have saved this program in a file named `wordfreq.awk',
+and that the data is in `file1', the following pipeline:
- printf(" gid=%d", gid)
- pw = getgrgid(gid)
- if (pw != "") {
- split(pw, a, ":")
- printf("(%s)", a[1])
- }
+ awk -f wordfreq.awk file1 | sort -k 2nr
- if (egid != gid) {
- printf(" egid=%d", egid)
- pw = getgrgid(egid)
- if (pw != "") {
- split(pw, a, ":")
- printf("(%s)", a[1])
- }
- }
+produces a table of the words appearing in `file1' in order of
+decreasing frequency.
- for (i = 1; ("group" i) in PROCINFO; i++) {
- if (i == 1)
- printf(" groups=")
- group = PROCINFO["group" i]
- printf("%d", group)
- pw = getgrgid(group)
- if (pw != "") {
- split(pw, a, ":")
- printf("(%s)", a[1])
- }
- if (("group" (i+1)) in PROCINFO)
- printf(",")
- }
+ The `awk' program suitably massages the data and produces a word
+frequency table, which is not ordered. The `awk' script's output is
+then sorted by the `sort' utility and printed on the screen.
- print ""
- }
+ The options given to `sort' specify a sort that uses the second
+field of each input line (skipping one field), that the sort keys
+should be treated as numeric quantities (otherwise `15' would come
+before `5'), and that the sorting should be done in descending
+(reverse) order.
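As a hedged illustration of the whole pipeline, the sketch below runs the same logic on two invented lines of text. It spells the punctuation filter with explicit ASCII ranges instead of the `[:alnum:]' and `[:blank:]' classes, on the assumption that some `awk' implementations lack POSIX character classes; the rest follows `wordfreq.awk'.

```shell
# Sketch: count words in a tiny sample and sort by descending frequency.
out=$(printf 'The cat. The dog!\nthe cat?\n' |
awk '{
    $0 = tolower($0)                 # remove case distinctions
    gsub(/[^a-z0-9_ \t]/, "", $0)    # remove punctuation (explicit ASCII ranges)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}' | sort -k 2nr)
printf '%s\n' "$out"
```

The three spellings of "the" collapse into one entry with a count of 3, followed by "cat" with 2 and "dog" with 1.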
- The test in the `for' loop is worth noting. Any supplementary
-groups in the `PROCINFO' array have the indices `"group1"' through
-`"groupN"' for some N, i.e., the total number of supplementary groups.
-However, we don't know in advance how many of these groups there are.
+ The `sort' could even be done from within the program, by changing
+the `END' action to:
- This loop works by starting at one, concatenating the value with
-`"group"', and then using `in' to see if that value is in the array.
-Eventually, `i' is incremented past the last group in the array and the
-loop exits.
+ END {
+ sort = "sort -k 2nr"
+ for (word in freq)
+ printf "%s\t%d\n", word, freq[word] | sort
+ close(sort)
+ }
- The loop is also correct if there are _no_ supplementary groups;
-then the condition is false the first time it's tested, and the loop
-body never executes.
+ This way of sorting must be used on systems that do not have true
+pipes at the command-line (or batch-file) level. See the general
+operating system documentation for more information on how to use the
+`sort' program.
-File: gawk.info, Node: Split Program, Next: Tee Program, Prev: Id Program, Up: Clones
+File: gawk.info, Node: History Sorting, Next: Extract Program, Prev: Word Sorting, Up: Miscellaneous Programs
-14.2.4 Splitting a Large File into Pieces
------------------------------------------
+13.3.6 Removing Duplicates from Unsorted Text
+---------------------------------------------
-The `split' program splits large text files into smaller pieces. Usage
-is as follows:(1)
+The `uniq' program (*note Uniq Program::), removes duplicate lines from
+_sorted_ data.
- split [-COUNT] file [ PREFIX ]
+ Suppose, however, you need to remove duplicate lines from a data
+file but that you want to preserve the order the lines are in. A good
+example of this might be a shell history file. The history file keeps
+a copy of all the commands you have entered, and it is not unusual to
+repeat a command several times in a row. Occasionally you might want
+to compact the history by removing duplicate entries. Yet it is
+desirable to maintain the order of the original commands.
- By default, the output files are named `xaa', `xab', and so on. Each
-file has 1000 lines in it, with the likely exception of the last file.
-To change the number of lines in each file, supply a number on the
-command line preceded with a minus; e.g., `-500' for files with 500
-lines in them instead of 1000. To change the name of the output files
-to something like `myfileaa', `myfileab', and so on, supply an
-additional argument that specifies the file name prefix.
+ This simple program does the job. It uses two arrays. The `data'
+array is indexed by the text of each line. For each line, `data[$0]'
+is incremented. If a particular line has not been seen before, then
+`data[$0]' is zero. In this case, the text of the line is stored in
+`lines[count]'. Each element of `lines' is a unique command, and the
+indices of `lines' indicate the order in which those lines are
+encountered. The `END' rule simply prints out the lines, in order:
- Here is a version of `split' in `awk'. It uses the `ord()' and
-`chr()' functions presented in *note Ordinal Functions::.
+ # histsort.awk --- compact a shell history file
+ # Thanks to Byron Rakitzis for the general idea
- The program first sets its defaults, and then tests to make sure
-there are not too many arguments. It then looks at each argument in
-turn. The first argument could be a minus sign followed by a number.
-If it is, this happens to look like a negative number, so it is made
-positive, and that is the count of lines. The data file name is
-skipped over and the final argument is used as the prefix for the
-output file names:
+ {
+ if (data[$0]++ == 0)
+ lines[++count] = $0
+ }
- # split.awk --- do split in awk
- #
- # Requires ord() and chr() library functions
- # usage: split [-num] [file] [outname]
+ END {
+ for (i = 1; i <= count; i++)
+ print lines[i]
+ }
- BEGIN {
- outfile = "x" # default
- count = 1000
- if (ARGC > 4)
- usage()
+ This program also provides a foundation for generating other useful
+information. For example, using the following `print' statement in the
+`END' rule indicates how often a particular command is used:
- i = 1
- if (ARGV[i] ~ /^-[[:digit:]]+$/) {
- count = -ARGV[i]
- ARGV[i] = ""
- i++
- }
- # test argv in case reading from stdin instead of file
- if (i in ARGV)
- i++ # skip data file name
- if (i in ARGV) {
- outfile = ARGV[i]
- ARGV[i] = ""
- }
+ print data[lines[i]], lines[i]
- s1 = s2 = "a"
- out = (outfile s1 s2)
- }
+ This works because `data[$0]' is incremented each time a line is
+seen.
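A short run on an invented history shows both effects at once: duplicates are dropped, first-seen order is kept, and the alternative `print' statement reports the counts. The command list is sample data only.

```shell
# Sketch: histsort.awk with the counting variant of the END rule.
out=$(printf 'ls\ncd /tmp\nls\nmake\ncd /tmp\nls\n' |
awk '{
    if (data[$0]++ == 0)       # first time this line is seen
        lines[++count] = $0    # remember it, in order of appearance
}
END {
    for (i = 1; i <= count; i++)
        print data[lines[i]], lines[i]
}')
printf '%s\n' "$out"
```

The output lists `ls' (3 times), `cd /tmp' (2 times), and `make' (once), in the order in which each command first appeared.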
- The next rule does most of the work. `tcount' (temporary count)
-tracks how many lines have been printed to the output file so far. If
-it is greater than `count', it is time to close the current file and
-start a new one. `s1' and `s2' track the current suffixes for the file
-name. If they are both `z', the file is just too big. Otherwise, `s1'
-moves to the next letter in the alphabet and `s2' starts over again at
-`a':
+
+File: gawk.info, Node: Extract Program, Next: Simple Sed, Prev: History Sorting, Up: Miscellaneous Programs
- {
- if (++tcount > count) {
- close(out)
- if (s2 == "z") {
- if (s1 == "z") {
- printf("split: %s is too large to split\n",
- FILENAME) > "/dev/stderr"
- exit 1
- }
- s1 = chr(ord(s1) + 1)
- s2 = "a"
- }
- else
- s2 = chr(ord(s2) + 1)
- out = (outfile s1 s2)
- tcount = 1
- }
- print > out
- }
+13.3.7 Extracting Programs from Texinfo Source Files
+----------------------------------------------------
-The `usage()' function simply prints an error message and exits:
+The nodes *note Library Functions::, and *note Sample Programs::, are
+the top level nodes for a large number of `awk' programs. If you want
+to experiment with these programs, it is tedious to have to type them
+in by hand. Here we present a program that can extract parts of a
+Texinfo input file into separate files.
- function usage( e)
- {
- e = "usage: split [-num] [file] [outname]"
- print e > "/dev/stderr"
- exit 1
- }
+This Info file is written in Texinfo (http://texinfo.org), the GNU
+project's document formatting language. A single Texinfo source file
+can be used to produce both printed and online documentation. The
+Texinfo language is described fully in `Texinfo--The GNU Documentation
+Format', starting with *note (Texinfo)Top::.
-The variable `e' is used so that the function fits nicely on the screen.
+ For our purposes, it is enough to know three things about Texinfo
+input files:
- This program is a bit sloppy; it relies on `awk' to automatically
-close the last file instead of doing it in an `END' rule. It also
-assumes that letters are contiguous in the character set, which isn't
-true for EBCDIC systems.
+ * The "at" symbol (`@') is special in Texinfo, much as the backslash
+ (`\') is in C or `awk'. Literal `@' symbols are represented in
+ Texinfo source files as `@@'.
- ---------- Footnotes ----------
+ * Comments start with either `@c' or `@comment'. The
+ file-extraction program works by using special comments that start
+ at the beginning of a line.
- (1) This is the traditional usage. The POSIX usage is different, but
-not relevant for what the program aims to demonstrate.
+ * Lines containing `@group' and `@end group' commands bracket
+ example text that should not be split across a page boundary.
+ (Unfortunately, TeX isn't always smart enough to do things exactly
+ right, so we have to give it some help.)
-
-File: gawk.info, Node: Tee Program, Next: Uniq Program, Prev: Split Program, Up: Clones
+ The following program, `extract.awk', reads through a Texinfo source
+file and does two things, based on the special comments. Upon seeing
+`@c system ...', it runs a command, by extracting the command text from
+the control line and passing it on to the `system()' function (*note
+I/O Functions::). Upon seeing `@c file FILENAME', each subsequent line
+is sent to the file FILENAME, until `@c endfile' is encountered. The
+rules in `extract.awk' match either `@c' or `@comment' by letting the
+`omment' part be optional. Lines containing `@group' and `@end group'
+are simply removed. `extract.awk' uses the `join()' library function
+(*note Join Function::).
-14.2.5 Duplicating Output into Multiple Files
----------------------------------------------
+ The example programs in the online Texinfo source for `GAWK:
+Effective AWK Programming' (`gawk.texi') have all been bracketed inside
+`file' and `endfile' lines. The `gawk' distribution uses a copy of
+`extract.awk' to extract the sample programs and install many of them
+in a standard directory where `gawk' can find them. The Texinfo file
+looks something like this:
-The `tee' program is known as a "pipe fitting." `tee' copies its
-standard input to its standard output and also duplicates it to the
-files named on the command line. Its usage is as follows:
+ ...
+ This program has a @code{BEGIN} rule,
+ that prints a nice message:
- tee [-a] file ...
+ @example
+ @c file examples/messages.awk
+ BEGIN @{ print "Don't panic!" @}
+ @c end file
+ @end example
- The `-a' option tells `tee' to append to the named files, instead of
-truncating them and starting over.
+ It also prints some final advice:
- The `BEGIN' rule first makes a copy of all the command-line arguments
-into an array named `copy'. `ARGV[0]' is not copied, since it is not
-needed. `tee' cannot use `ARGV' directly, since `awk' attempts to
-process each file name in `ARGV' as input data.
+ @example
+ @c file examples/messages.awk
+ END @{ print "Always avoid bored archeologists!" @}
+ @c end file
+ @end example
+ ...
- If the first argument is `-a', then the flag variable `append' is
-set to true, and both `ARGV[1]' and `copy[1]' are deleted. If `ARGC' is
-less than two, then no file names were supplied and `tee' prints a
-usage message and exits. Finally, `awk' is forced to read the standard
-input by setting `ARGV[1]' to `"-"' and `ARGC' to two:
+ `extract.awk' begins by setting `IGNORECASE' to one, so that mixed
+upper- and lowercase letters in the directives won't matter.
- # tee.awk --- tee in awk
- #
- # Copy standard input to all named output files.
- # Append content if -a option is supplied.
- #
- BEGIN \
- {
- for (i = 1; i < ARGC; i++)
- copy[i] = ARGV[i]
+ The first rule handles calling `system()', checking that a command is
+given (`NF' is at least three) and also checking that the command exits
+with a zero exit status, signifying OK:
- if (ARGV[1] == "-a") {
- append = 1
- delete ARGV[1]
- delete copy[1]
- ARGC--
- }
- if (ARGC < 2) {
- print "usage: tee [-a] file ..." > "/dev/stderr"
- exit 1
- }
- ARGV[1] = "-"
- ARGC = 2
- }
+ # extract.awk --- extract files and run programs
+ # from texinfo files
- The following single rule does all the work. Since there is no
-pattern, it is executed for each line of input. The body of the rule
-simply prints the line into each file on the command line, and then to
-the standard output:
+ BEGIN { IGNORECASE = 1 }
+ /^@c(omment)?[ \t]+system/ \
{
- # moving the if outside the loop makes it run faster
- if (append)
- for (i in copy)
- print >> copy[i]
- else
- for (i in copy)
- print > copy[i]
- print
+ if (NF < 3) {
+ e = (FILENAME ":" FNR)
+ e = (e ": badly formed `system' line")
+ print e > "/dev/stderr"
+ next
+ }
+ $1 = ""
+ $2 = ""
+ stat = system($0)
+ if (stat != 0) {
+ e = (FILENAME ":" FNR)
+ e = (e ": warning: system returned " stat)
+ print e > "/dev/stderr"
+ }
}
-It is also possible to write the loop this way:
+The variable `e' is used so that the rule fits nicely on the screen.
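The effect of clearing `$1' and `$2' before the call to `system()' can be seen without running anything. In this sketch the directive line and the `mkdir' command are invented, and the command text is printed rather than executed.

```shell
# Sketch: what is left of $0 after the first two fields are emptied.
out=$(printf '%s\n' '@c system mkdir -p eg/lib' |
awk '{
    $1 = ""          # drop "@c"
    $2 = ""          # drop "system"
    print "[" $0 "]" # brackets make the leftover leading blanks visible
}')
printf '%s\n' "$out"
```

Assigning to a field rebuilds `$0' with `OFS' between fields, so two leading spaces remain in front of the command. The shell ignores them when `system()' runs the text.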
- for (i in copy)
- if (append)
- print >> copy[i]
- else
- print > copy[i]
+ The second rule handles moving data into files. It verifies that a
+file name is given in the directive. If the file named is not the
+current file, then the current file is closed. Keeping the current file
+open until a new file is encountered allows the use of the `>'
+redirection for printing the contents, keeping open file management
+simple.
-This is more concise but it is also less efficient. The `if' is tested
-for each record and for each output file. By duplicating the loop
-body, the `if' is only tested once for each input record. If there are
-N input records and M output files, the first method only executes N
-`if' statements, while the second executes N`*'M `if' statements.
+ The `for' loop does the work. It reads lines using `getline' (*note
+Getline::). For an unexpected end of file, it calls the
+`unexpected_eof()' function. If the line is an "endfile" line, then it
+breaks out of the loop. If the line is an `@group' or `@end group'
+line, then it ignores it and goes on to the next line. Similarly,
+comments within examples are also ignored.
- Finally, the `END' rule cleans up by closing all the output files:
+ Most of the work is in the following few lines. If the line has no
+`@' symbols, the program can print it directly. Otherwise, each
+leading `@' must be stripped off. To remove the `@' symbols, the line
+is split into separate elements of the array `a', using the `split()'
+function (*note String Functions::). The `@' symbol is used as the
+separator character. Each element of `a' that is empty indicates two
+successive `@' symbols in the original line. For each two empty
+elements (`@@' in the original file), we have to add a single `@'
+symbol back in.(1)
- END \
+ When the processing of the array is finished, `join()' is called
+with the value of `SUBSEP', to rejoin the pieces back into a single
+line. That line is then printed to the output file:
+
+ /^@c(omment)?[ \t]+file/ \
{
- for (i in copy)
- close(copy[i])
+ if (NF != 3) {
+ e = (FILENAME ":" FNR ": badly formed `file' line")
+ print e > "/dev/stderr"
+ next
+ }
+ if ($3 != curfile) {
+ if (curfile != "")
+ close(curfile)
+ curfile = $3
+ }
+
+ for (;;) {
+ if ((getline line) <= 0)
+ unexpected_eof()
+ if (line ~ /^@c(omment)?[ \t]+endfile/)
+ break
+ else if (line ~ /^@(end[ \t]+)?group/)
+ continue
+ else if (line ~ /^@c(omment+)?[ \t]+/)
+ continue
+ if (index(line, "@") == 0) {
+ print line > curfile
+ continue
+ }
+ n = split(line, a, "@")
+ # if a[1] == "", means leading @,
+ # don't add one back in.
+ for (i = 2; i <= n; i++) {
+ if (a[i] == "") { # was an @@
+ a[i] = "@"
+ if (a[i+1] == "")
+ i++
+ }
+ }
+ print join(a, 1, n, SUBSEP) > curfile
+ }
}
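The `@' handling in the rule above can be tried in isolation. In the sketch below, `strip_at()' is a hypothetical helper (it is not part of `extract.awk') that repeats the `split()' loop and then concatenates the pieces directly, which is the effect of calling `join()' with `SUBSEP' as the separator. The input line is invented.

```shell
# Sketch: remove Texinfo @ escapes from one line, keeping @@ as a literal @.
out=$(printf '%s\n' 'BEGIN @{ print "mail user@@host" @}' |
awk '
# strip_at() is a hypothetical helper, not part of extract.awk.
function strip_at(line,    a, n, i, result)
{
    n = split(line, a, "@")
    for (i = 2; i <= n; i++) {
        if (a[i] == "") {       # was an @@
            a[i] = "@"
            if (a[i+1] == "")
                i++
        }
    }
    # concatenate the pieces with no separator between them
    result = a[1]
    for (i = 2; i <= n; i++)
        result = result a[i]
    return result
}
{ print strip_at($0) }')
printf '%s\n' "$out"
```

The escaped braces lose their leading `@', and the doubled `@@' comes back as a single `@' in the extracted program text.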
-
-File: gawk.info, Node: Uniq Program, Next: Wc Program, Prev: Tee Program, Up: Clones
+ An important thing to note is the use of the `>' redirection.
+Output done with `>' only opens the file once; it stays open and
+subsequent output is appended to the file (*note Redirection::). This
+makes it easy to mix program text and explanatory prose for the same
+sample source file (as has been done here!) without any hassle. The
+file is only closed when a new data file name is encountered or at the
+end of the input file.
-14.2.6 Printing Nonduplicated Lines of Text
--------------------------------------------
+ Finally, the function `unexpected_eof()' prints an appropriate error
+message and then exits. The `END' rule handles the final cleanup,
+closing the open file:
-The `uniq' utility reads sorted lines of data on its standard input,
-and by default removes duplicate lines. In other words, it only prints
-unique lines--hence the name. `uniq' has a number of options. The
-usage is as follows:
+ function unexpected_eof()
+ {
+ printf("%s:%d: unexpected EOF or error\n",
+ FILENAME, FNR) > "/dev/stderr"
+ exit 1
+ }
- uniq [-udc [-N]] [+N] [ INPUT FILE [ OUTPUT FILE ]]
+ END {
+ if (curfile)
+ close(curfile)
+ }
- The options for `uniq' are:
+ ---------- Footnotes ----------
-`-d'
- Print only repeated lines.
+ (1) This program was written before `gawk' had the `gensub()'
+function. Consider how you might use it to simplify the code.
-`-u'
- Print only nonrepeated lines.
+
+File: gawk.info, Node: Simple Sed, Next: Igawk Program, Prev: Extract Program, Up: Miscellaneous Programs
-`-c'
- Count lines. This option overrides `-d' and `-u'. Both repeated
- and nonrepeated lines are counted.
+13.3.8 A Simple Stream Editor
+-----------------------------
-`-N'
- Skip N fields before comparing lines. The definition of fields is
- similar to `awk''s default: nonwhitespace characters separated by
- runs of spaces and/or TABs.
+The `sed' utility is a stream editor, a program that reads a stream of
+data, makes changes to it, and passes it on. It is often used to make
+global changes to a large file or to a stream of data generated by a
+pipeline of commands. While `sed' is a complicated program in its own
+right, its most common use is to perform global substitutions in the
+middle of a pipeline:
-`+N'
- Skip N characters before comparing lines. Any fields specified
- with `-N' are skipped first.
+ command1 < orig.data | sed 's/old/new/g' | command2 > result
-`INPUT FILE'
- Data is read from the input file named on the command line,
- instead of from the standard input.
+ Here, `s/old/new/g' tells `sed' to look for the regexp `old' on each
+input line and globally replace it with the text `new', i.e., all the
+occurrences on a line. This is similar to `awk''s `gsub()' function
+(*note String Functions::).
-`OUTPUT FILE'
- The generated output is sent to the named output file, instead of
- to the standard output.
+ The following program, `awksed.awk', accepts at least two
+command-line arguments: the pattern to look for and the text to replace
+it with. Any additional arguments are treated as data file names to
+process. If none are provided, the standard input is used:
- Normally `uniq' behaves as if both the `-d' and `-u' options are
-provided.
+ # awksed.awk --- do s/foo/bar/g using just print
+ # Thanks to Michael Brennan for the idea
- `uniq' uses the `getopt()' library function (*note Getopt Function::)
-and the `join()' library function (*note Join Function::).
+ function usage()
+ {
+ print "usage: awksed pat repl [files...]" > "/dev/stderr"
+ exit 1
+ }
- The program begins with a `usage()' function and then a brief
-outline of the options and their meanings in comments. The `BEGIN'
-rule deals with the command-line arguments and options. It uses a trick
-to get `getopt()' to handle options of the form `-25', treating such an
-option as the option letter `2' with an argument of `5'. If indeed two
-or more digits are supplied (`Optarg' looks like a number), `Optarg' is
-concatenated with the option digit and then the result is added to zero
-to make it into a number. If there is only one digit in the option,
-then `Optarg' is not needed. In this case, `Optind' must be decremented
-so that `getopt()' processes it next time. This code is admittedly a
-bit tricky.
+ BEGIN {
+ # validate arguments
+ if (ARGC < 3)
+ usage()
- If no options are supplied, then the default is taken, to print both
-repeated and nonrepeated lines. The output file, if provided, is
-assigned to `outputfile'. Early on, `outputfile' is initialized to the
-standard output, `/dev/stdout':
+ RS = ARGV[1]
+ ORS = ARGV[2]
- # uniq.awk --- do uniq in awk
- #
- # Requires getopt() and join() library functions
+ # don't use arguments as files
+ ARGV[1] = ARGV[2] = ""
+ }
- function usage( e)
+ # look ma, no hands!
{
- e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
- print e > "/dev/stderr"
- exit 1
+ if (RT == "")
+ printf "%s", $0
+ else
+ print
}
- # -c count lines. overrides -d and -u
- # -d only repeated lines
- # -u only nonrepeated lines
- # -n skip n fields
- # +n skip n characters, skip fields first
+ The program relies on `gawk''s ability to have `RS' be a regexp, as
+well as on the setting of `RT' to the actual text that terminates the
+record (*note Records::).
- BEGIN \
- {
- count = 1
- outputfile = "/dev/stdout"
- opts = "udc0:1:2:3:4:5:6:7:8:9:"
- while ((c = getopt(ARGC, ARGV, opts)) != -1) {
- if (c == "u")
- non_repeated_only++
- else if (c == "d")
- repeated_only++
- else if (c == "c")
- do_count++
- else if (index("0123456789", c) != 0) {
- # getopt requires args to options
- # this messes us up for things like -5
- if (Optarg ~ /^[[:digit:]]+$/)
- fcount = (c Optarg) + 0
- else {
- fcount = c + 0
- Optind--
- }
- } else
- usage()
- }
+ The idea is to have `RS' be the pattern to look for. `gawk'
+automatically sets `$0' to the text between matches of the pattern.
+This is text that we want to keep, unmodified. Then, by setting `ORS'
+to the replacement text, a simple `print' statement outputs the text we
+want to keep, followed by the replacement text.
- if (ARGV[Optind] ~ /^\+[[:digit:]]+$/) {
- charcount = substr(ARGV[Optind], 2) + 0
- Optind++
- }
+ There is one wrinkle to this scheme, which is what to do if the last
+record doesn't end with text that matches `RS'. Using a `print'
+statement unconditionally prints the replacement text, which is not
+correct. However, if the file did not end in text that matches `RS',
+`RT' is set to the null string. In this case, we can print `$0' using
+`printf' (*note Printf::).
- for (i = 1; i < Optind; i++)
- ARGV[i] = ""
+ The `BEGIN' rule handles the setup, checking for the right number of
+arguments and calling `usage()' if there is a problem. Then it sets
+`RS' and `ORS' from the command-line arguments and sets `ARGV[1]' and
+`ARGV[2]' to the null string, so that they are not treated as file names
+(*note ARGC and ARGV::).
- if (repeated_only == 0 && non_repeated_only == 0)
- repeated_only = non_repeated_only = 1
+ The `usage()' function prints an error message and exits. Finally,
+the single rule handles the printing scheme outlined above, using
+`print' or `printf' as appropriate, depending upon the value of `RT'.
- if (ARGC - Optind == 2) {
- outputfile = ARGV[ARGC - 1]
- ARGV[ARGC - 1] = ""
- }
- }
+
+File: gawk.info, Node: Igawk Program, Next: Anagram Program, Prev: Simple Sed, Up: Miscellaneous Programs
- The following function, `are_equal()', compares the current line,
-`$0', to the previous line, `last'. It handles skipping fields and
-characters. If no field count and no character count are specified,
-`are_equal()' simply returns one or zero depending upon the result of a
-simple string comparison of `last' and `$0'. Otherwise, things get more
-complicated. If fields have to be skipped, each line is broken into an
-array using `split()' (*note String Functions::); the desired fields
-are then joined back into a line using `join()'. The joined lines are
-stored in `clast' and `cline'. If no fields are skipped, `clast' and
-`cline' are set to `last' and `$0', respectively. Finally, if
-characters are skipped, `substr()' is used to strip off the leading
-`charcount' characters in `clast' and `cline'. The two strings are
-then compared and `are_equal()' returns the result:
+13.3.9 An Easy Way to Use Library Functions
+-------------------------------------------
- function are_equal( n, m, clast, cline, alast, aline)
- {
- if (fcount == 0 && charcount == 0)
- return (last == $0)
+In *note Include Files::, we saw how `gawk' provides a built-in
+file-inclusion capability. However, this is a `gawk' extension. This
+minor node provides the motivation for making file inclusion available
+for standard `awk', and shows how to do it using a combination of shell
+and `awk' programming.
- if (fcount > 0) {
- n = split(last, alast)
- m = split($0, aline)
- clast = join(alast, fcount+1, n)
- cline = join(aline, fcount+1, m)
- } else {
- clast = last
- cline = $0
- }
- if (charcount) {
- clast = substr(clast, charcount + 1)
- cline = substr(cline, charcount + 1)
- }
+ Using library functions in `awk' can be very beneficial. It
+encourages code reuse and the writing of general functions. Programs are
+smaller and therefore clearer. However, using library functions is
+only easy when writing `awk' programs; it is painful when running them,
+requiring multiple `-f' options. If `gawk' is unavailable, then so too
+is the `AWKPATH' environment variable and the ability to put `awk'
+functions into a library directory (*note Options::). It would be nice
+to be able to write programs in the following manner:
- return (clast == cline)
+ # library functions
+ @include getopt.awk
+ @include join.awk
+ ...
+
+ # main program
+ BEGIN {
+ while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
+ ...
+ ...
}
- The following two rules are the body of the program. The first one
-is executed only for the very first line of data. It sets `last' equal
-to `$0', so that subsequent lines of text have something to be compared
-to.
+ The following program, `igawk.sh', provides this service. It
+simulates `gawk''s searching of the `AWKPATH' variable and also allows
+"nested" includes; i.e., a file that is included with `@include' can
+contain further `@include' statements. `igawk' makes an effort to only
+include files once, so that nested includes don't accidentally include
+a library function twice.
- The second rule does the work. The variable `equal' is one or zero,
-depending upon the results of `are_equal()''s comparison. If `uniq' is
-counting repeated lines, and the lines are equal, then it increments
-the `count' variable. Otherwise, it prints the line and resets `count',
-since the two lines are not equal.
+ `igawk' should behave just like `gawk' externally. This means it
+should accept all of `gawk''s command-line arguments, including the
+ability to have multiple source files specified via `-f', and the
+ability to mix command-line and library source files.
- If `uniq' is not counting, and if the lines are equal, `count' is
-incremented. Nothing is printed, since the point is to remove
-duplicates. Otherwise, if `uniq' is counting repeated lines and more
-than one line is seen, or if `uniq' is counting nonrepeated lines and
-only one line is seen, then the line is printed, and `count' is reset.
+ The program is written using the POSIX Shell (`sh') command
+language.(1) It works as follows:
- Finally, similar logic is used in the `END' rule to print the final
-line of input data:
+ 1. Loop through the arguments, saving anything that doesn't represent
+ `awk' source code for later, when the expanded program is run.
- NR == 1 {
- last = $0
- next
- }
+ 2. For any arguments that do represent `awk' text, put the arguments
+ into a shell variable that will be expanded. There are two cases:
- {
- equal = are_equal()
+ a. Literal text, provided with `--source' or `--source='. This
+ text is just appended directly.
- if (do_count) { # overrides -d and -u
- if (equal)
- count++
- else {
- printf("%4d %s\n", count, last) > outputfile
- last = $0
- count = 1 # reset
- }
- next
- }
+ b. Source file names, provided with `-f'. We use a neat trick
+ and append `@include FILENAME' to the shell variable's
+ contents. Since the file-inclusion program works the way
+ `gawk' does, this gets the text of the file included into the
+ program at the correct point.
- if (equal)
- count++
- else {
- if ((repeated_only && count > 1) ||
- (non_repeated_only && count == 1))
- print last > outputfile
- last = $0
- count = 1
- }
- }
+ 3. Run an `awk' program (naturally) over the shell variable's
+ contents to expand `@include' statements. The expanded program is
+ placed in a second shell variable.
- END {
- if (do_count)
- printf("%4d %s\n", count, last) > outputfile
- else if ((repeated_only && count > 1) ||
- (non_repeated_only && count == 1))
- print last > outputfile
- close(outputfile)
- }
+ 4. Run the expanded program with `gawk' and any other original
+ command-line arguments that the user supplied (such as the data
+ file names).
-
-File: gawk.info, Node: Wc Program, Prev: Uniq Program, Up: Clones
+ This program uses shell variables extensively: for storing
+command-line arguments, the text of the `awk' program that will expand
+the user's program, for the user's original program, and for the
+expanded program. Doing so removes some potential problems that might
+arise were we to use temporary files instead, at the cost of making the
+script somewhat more complicated.
-14.2.7 Counting Things
-----------------------
+ The initial part of the program turns on shell tracing if the first
+argument is `debug'.
-The `wc' (word count) utility counts lines, words, and characters in
-one or more input files. Its usage is as follows:
+ The next part loops through all the command-line arguments. There
+are several cases of interest:
- wc [-lwc] [ FILES ... ]
+`--'
+ This ends the arguments to `igawk'. Anything else should be
+ passed on to the user's `awk' program without being evaluated.
- If no files are specified on the command line, `wc' reads its
-standard input. If there are multiple files, it also prints total
-counts for all the files. The options and their meanings are shown in
-the following list:
+`-W'
+ This indicates that the next option is specific to `gawk'. To make
+ argument processing easier, the `-W' is appended to the front of
+ the remaining arguments and the loop continues. (This is an `sh'
+ programming trick. Don't worry about it if you are not familiar
+ with `sh'.)
-`-l'
- Count only lines.
+`-v, -F'
+ These are saved and passed on to `gawk'.
-`-w'
- Count only words. A "word" is a contiguous sequence of
- nonwhitespace characters, separated by spaces and/or TABs.
- Luckily, this is the normal way `awk' separates fields in its
- input data.
+`-f, --file, --file=, -Wfile='
+ The file name is appended to the shell variable `program' with an
+ address@hidden' statement. The `expr' utility is used to remove the
+ leading option part of the argument (e.g., `--file='). (Typical
+ `sh' usage would be to use the `echo' and `sed' utilities to do
+ this work. Unfortunately, some versions of `echo' evaluate escape
+ sequences in their arguments, possibly mangling the program text.
+ Using `expr' avoids this problem.)
-`-c'
- Count only characters.
+`--source, --source=, -Wsource='
+ The source text is appended to `program'.
- Implementing `wc' in `awk' is particularly elegant, since `awk' does
-a lot of the work for us; it splits lines into words (i.e., fields) and
-counts them, it counts lines (i.e., records), and it can easily tell us
-how long a line is.
+`--version, -Wversion'
+ `igawk' prints its version number, runs `gawk --version' to get
+ the `gawk' version information, and then exits.
- This program uses the `getopt()' library function (*note Getopt
-Function::) and the file-transition functions (*note Filetrans
-Function::).
+ If none of the `-f', `--file', `-Wfile', `--source', or `-Wsource'
+arguments are supplied, then the first nonoption argument should be the
+`awk' program. If there are no command-line arguments left, `igawk'
+prints an error message and exits. Otherwise, the first argument is
+appended to `program'. In any case, after the arguments have been
+processed, `program' contains the complete text of the original `awk'
+program.
- This version has one notable difference from traditional versions of
-`wc': it always prints the counts in the order lines, words, and
-characters. Traditional versions note the order of the `-l', `-w', and
-`-c' options on the command line, and print the counts in that order.
+ The program is as follows:
- The `BEGIN' rule does the argument processing. The variable
-`print_total' is true if more than one file is named on the command
-line:
+ #! /bin/sh
+ # igawk --- like gawk but do @include processing
- # wc.awk --- count lines, words, characters
+ if [ "$1" = debug ]
+ then
+ set -x
+ shift
+ fi
- # Options:
- # -l only count lines
- # -w only count words
- # -c only count characters
- #
- # Default is to count lines, words, characters
- #
- # Requires getopt() and file transition library functions
+ # A literal newline, so that program text is formatted correctly
+ n='
+ '
- BEGIN {
- # let getopt() print a message about
- # invalid options. we ignore them
- while ((c = getopt(ARGC, ARGV, "lwc")) != -1) {
- if (c == "l")
- do_lines = 1
- else if (c == "w")
- do_words = 1
- else if (c == "c")
- do_chars = 1
- }
- for (i = 1; i < Optind; i++)
- ARGV[i] = ""
+ # Initialize variables to empty
+ program=
+ opts=
- # if no options, do all
- if (! do_lines && ! do_words && ! do_chars)
- do_lines = do_words = do_chars = 1
+ while [ $# -ne 0 ] # loop over arguments
+ do
+ case $1 in
+ --) shift
+ break ;;
- print_total = (ARGC - i > 2)
- }
+ -W) shift
+ # The ${x?'message here'} construct prints a
+ # diagnostic if $x is the null string
+ set -- -W"${@?'missing operand'}"
+ continue ;;
- The `beginfile()' function is simple; it just resets the counts of
-lines, words, and characters to zero, and saves the current file name in
-`fname':
+ -[vF]) opts="$opts $1 '${2?'missing operand'}'"
+ shift ;;
- function beginfile(file)
- {
- lines = words = chars = 0
- fname = FILENAME
- }
+ -[vF]*) opts="$opts '$1'" ;;
- The `endfile()' function adds the current file's numbers to the
-running totals of lines, words, and characters.(1) It then prints out
-those numbers for the file that was just read. It relies on
-`beginfile()' to reset the numbers for the following data file:
+ -f) program="$program$n@include ${2?'missing operand'}"
+ shift ;;
- function endfile(file)
- {
- tlines += lines
- twords += words
- tchars += chars
- if (do_lines)
- printf "\t%d", lines
- if (do_words)
- printf "\t%d", words
- if (do_chars)
- printf "\t%d", chars
- printf "\t%s\n", fname
- }
+ -f*) f=$(expr "$1" : '-f\(.*\)')
+ program="$program$n@include $f" ;;
- There is one rule that is executed for each line. It adds the length
-of the record, plus one, to `chars'.(2) Adding one plus the record
-length is needed because the newline character separating records (the
-value of `RS') is not part of the record itself, and thus not included
-in its length. Next, `lines' is incremented for each line read, and
-`words' is incremented by the value of `NF', which is the number of
-"words" on this line:
+ -[W-]file=*)
+ f=$(expr "$1" : '-.file=\(.*\)')
+ program="$program$n@include $f" ;;
- # do per line
- {
- chars += length($0) + 1 # get newline
- lines++
- words += NF
- }
+ -[W-]file)
+ program="$program$n@include ${2?'missing operand'}"
+ shift ;;
- Finally, the `END' rule simply prints the totals for all the files:
+ -[W-]source=*)
+ t=$(expr "$1" : '-.source=\(.*\)')
+ program="$program$n$t" ;;
- END {
- if (print_total) {
- if (do_lines)
- printf "\t%d", tlines
- if (do_words)
- printf "\t%d", twords
- if (do_chars)
- printf "\t%d", tchars
- print "\ttotal"
- }
- }
+ -[W-]source)
+ program="$program$n${2?'missing operand'}"
+ shift ;;
- ---------- Footnotes ----------
+ -[W-]version)
+ echo igawk: version 3.0 1>&2
+ gawk --version
+ exit 0 ;;
- (1) `wc' can't just use the value of `FNR' in `endfile()'. If you
-examine the code in *note Filetrans Function::, you will see that `FNR'
-has already been reset by the time `endfile()' is called.
+ -[W-]*) opts="$opts '$1'" ;;
- (2) Since `gawk' understands multibyte locales, this code counts
-characters, not bytes.
+ *) break ;;
+ esac
+ shift
+ done
-
-File: gawk.info, Node: Miscellaneous Programs, Prev: Clones, Up: Sample Programs
+ if [ -z "$program" ]
+ then
+ program=${1?'missing program'}
+ shift
+ fi
-14.3 A Grab Bag of `awk' Programs
-=================================
+ # At this point, `program' has the program.
-This minor node is a large "grab bag" of miscellaneous programs. We
-hope you find them both interesting and enjoyable.
+ The `awk' program to process `@include' directives is stored in the
+shell variable `expand_prog'. Doing this keeps the shell script
+readable. The `awk' program reads through the user's program, one line
+at a time, using `getline' (*note Getline::). The input file names and
+`@include' statements are managed using a stack. As each `@include' is
+encountered, the current file name is "pushed" onto the stack and the
+file named in the `@include' directive becomes the current file name.
+As each file is finished, the stack is "popped," and the previous input
+file becomes the current input file again. The process is started by
+making the original file the first one on the stack.
-* Menu:
+ The `pathto()' function does the work of finding the full path to a
+file. It simulates `gawk''s behavior when searching the `AWKPATH'
+environment variable (*note AWKPATH Variable::). If a file name has a
+`/' in it, no path search is done. Similarly, if the file name is
+`"-"', then that string is used as-is. Otherwise, the file name is
+concatenated with the name of each directory in the path, and an
+attempt is made to open the generated file name. The only way to test
+if a file can be read in `awk' is to go ahead and try to read it with
+`getline'; this is what `pathto()' does.(2) If the file can be read, it
+is closed and the file name is returned:
-* Dupword Program:: Finding duplicated words in a document.
-* Alarm Program:: An alarm clock.
-* Translate Program:: A program similar to the `tr' utility.
-* Labels Program:: Printing mailing labels.
-* Word Sorting:: A program to produce a word usage count.
-* History Sorting:: Eliminating duplicate entries from a history
- file.
-* Extract Program:: Pulling out programs from Texinfo source
- files.
-* Simple Sed:: A Simple Stream Editor.
-* Igawk Program:: A wrapper for `awk' that includes
- files.
-* Anagram Program:: Finding anagrams from a dictionary.
-* Signature Program:: People do amazing things with too much time on
- their hands.
+ expand_prog='
-
-File: gawk.info, Node: Dupword Program, Next: Alarm Program, Up: Miscellaneous Programs
+ function pathto(file, i, t, junk)
+ {
+ if (index(file, "/") != 0)
+ return file
-14.3.1 Finding Duplicated Words in a Document
----------------------------------------------
+ if (file == "-")
+ return file
-A common error when writing large amounts of prose is to accidentally
-duplicate words. Typically you will see this in text as something like
-"the the program does the following..." When the text is online, often
-the duplicated words occur at the end of one line and the beginning of
-another, making them very difficult to spot.
+ for (i = 1; i <= ndirs; i++) {
+ t = (pathlist[i] "/" file)
+ if ((getline junk < t) > 0) {
+ # found it
+ close(t)
+ return t
+ }
+ }
+ return ""
+ }
- This program, `dupword.awk', scans through a file one line at a time
-and looks for adjacent occurrences of the same word. It also saves the
-last word on a line (in the variable `prev') for comparison with the
-first word on the next line.
+ The main program is contained inside one `BEGIN' rule. The first
+thing it does is set up the `pathlist' array that `pathto()' uses.
+After splitting the path on `:', null elements are replaced with `"."',
+which represents the current directory:
- The first two statements make sure that the line is all lowercase,
-so that, for example, "The" and "the" compare equal to each other. The
-next statement replaces nonalphanumeric and nonwhitespace characters
-with spaces, so that punctuation does not affect the comparison either.
-The characters are replaced with spaces so that formatting controls
-don't create nonsense words (e.g., the Texinfo `@code{NF}' becomes
-`codeNF' if punctuation is simply deleted). The record is then resplit
-into fields, yielding just the actual words on the line, and ensuring
-that there are no empty fields.
+ BEGIN {
+ path = ENVIRON["AWKPATH"]
+ ndirs = split(path, pathlist, ":")
+ for (i = 1; i <= ndirs; i++) {
+ if (pathlist[i] == "")
+ pathlist[i] = "."
+ }
- If there are no fields left after removing all the punctuation, the
-current record is skipped. Otherwise, the program loops through each
-word, comparing it to the previous one:
+ The stack is initialized with `ARGV[1]', which will be `/dev/stdin'.
+The main loop comes next. Input lines are read in succession. Lines
+that do not start with `@include' are printed verbatim. If the line
+does start with `@include', the file name is in `$2'. `pathto()' is
+called to generate the full path. If it cannot, then the program
+prints an error message and continues.
- # dupword.awk --- find duplicate words in text
- {
- $0 = tolower($0)
- gsub(/[^[:alnum:][:blank:]]/, " ");
- $0 = $0 # re-split
- if (NF == 0)
- next
- if ($1 == prev)
- printf("%s:%d: duplicate %s\n",
- FILENAME, FNR, $1)
- for (i = 2; i <= NF; i++)
- if ($i == $(i-1))
- printf("%s:%d: duplicate %s\n",
- FILENAME, FNR, $i)
- prev = $NF
- }
+ The next thing to check is if the file is included already. The
+`processed' array is indexed by the full file name of each included
+file and it tracks this information for us. If the file is seen again,
+a warning message is printed. Otherwise, the new file name is pushed
+onto the stack and processing continues.
-
-File: gawk.info, Node: Alarm Program, Next: Translate Program, Prev: Dupword Program, Up: Miscellaneous Programs
+ Finally, when `getline' encounters the end of the input file, the
+file is closed and the stack is popped. When `stackptr' is less than
+zero, the program is done:
-14.3.2 An Alarm Clock Program
------------------------------
+ stackptr = 0
+ input[stackptr] = ARGV[1] # ARGV[1] is first file
- Nothing cures insomnia like a ringing alarm clock.
- Arnold Robbins
+ for (; stackptr >= 0; stackptr--) {
+ while ((getline < input[stackptr]) > 0) {
+ if (tolower($1) != "@include") {
+ print
+ continue
+ }
+ fpath = pathto($2)
+ if (fpath == "") {
+ printf("igawk:%s:%d: cannot find %s\n",
+ input[stackptr], FNR, $2) > "/dev/stderr"
+ continue
+ }
+ if (! (fpath in processed)) {
+ processed[fpath] = input[stackptr]
+ input[++stackptr] = fpath # push onto stack
+ } else
+ print $2, "included in", input[stackptr],
+ "already included in",
+ processed[fpath] > "/dev/stderr"
+ }
+ close(input[stackptr])
+ }
+ }' # close quote ends `expand_prog' variable
- The following program is a simple "alarm clock" program. You give
-it a time of day and an optional message. At the specified time, it
-prints the message on the standard output. In addition, you can give it
-the number of times to repeat the message as well as a delay between
-repetitions.
+ processed_program=$(gawk -- "$expand_prog" /dev/stdin << EOF
+ $program
+ EOF
+ )
- This program uses the `getlocaltime()' function from *note
-Getlocaltime Function::.
+ The shell construct `COMMAND << MARKER' is called a "here document".
+Everything in the shell script up to the MARKER is fed to COMMAND as
+input. The shell processes the contents of the here document for
+variable and command substitution (and possibly other things as well,
+depending upon the shell).
- All the work is done in the `BEGIN' rule. The first part is argument
-checking and setting of defaults: the delay, the count, and the message
-to print. If the user supplied a message without the ASCII BEL
-character (known as the "alert" character, `"\a"'), then it is added to
-the message. (On many systems, printing the ASCII BEL generates an
-audible alert. Thus when the alarm goes off, the system calls attention
-to itself in case the user is not looking at the computer.) Just for a
-change, this program uses a `switch' statement (*note Switch
-Statement::), but the processing could be done with a series of
-`if'-`else' statements instead. Here is the program:
+ The shell construct `$(...)' is called "command substitution". The
+output of the command inside the parentheses is substituted into the
+command line. Because the result is used in a variable assignment, it
+is saved as a single string, even if the results contain whitespace.
- # alarm.awk --- set an alarm
- #
- # Requires getlocaltime() library function
- # usage: alarm time [ "message" [ count [ delay ] ] ]
+ The expanded program is saved in the variable `processed_program'.
+It's done in these steps:
- BEGIN \
- {
- # Initial argument sanity checking
- usage1 = "usage: alarm time ['message' [count [delay]]]"
- usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])
+ 1. Run `gawk' with the `@include'-processing program (the value of
+ the `expand_prog' shell variable) on standard input.
- if (ARGC < 2) {
- print usage1 > "/dev/stderr"
- print usage2 > "/dev/stderr"
- exit 1
- }
- switch (ARGC) {
- case 5:
- delay = ARGV[4] + 0
- # fall through
- case 4:
- count = ARGV[3] + 0
- # fall through
- case 3:
- message = ARGV[2]
- break
- default:
- if (ARGV[1] !~ /[[:digit:]]?[[:digit:]]:[[:digit:]]{2}/) {
- print usage1 > "/dev/stderr"
- print usage2 > "/dev/stderr"
- exit 1
- }
- break
- }
+ 2. Standard input is the contents of the user's program, from the
+ shell variable `program'. Its contents are fed to `gawk' via a
+ here document.
- # set defaults for once we reach the desired time
- if (delay == 0)
- delay = 180 # 3 minutes
- if (count == 0)
- count = 5
- if (message == "")
- message = sprintf("\aIt is now %s!\a", ARGV[1])
- else if (index(message, "\a") == 0)
- message = "\a" message "\a"
+ 3. The results of this processing are saved in the shell variable
+ `processed_program' by using command substitution.
- The next minor node of code turns the alarm time into hours and
-minutes, converts it (if necessary) to a 24-hour clock, and then turns
-that time into a count of the seconds since midnight. Next it turns
-the current time into a count of seconds since midnight. The
-difference between the two is how long to wait before setting off the
-alarm:
+ The last step is to call `gawk' with the expanded program, along
+with the original options and command-line arguments that the user
+supplied.
- # split up alarm time
- split(ARGV[1], atime, ":")
- hour = atime[1] + 0 # force numeric
- minute = atime[2] + 0 # force numeric
+ eval gawk $opts -- '"$processed_program"' '"$@"'
- # get current broken down time
- getlocaltime(now)
+ The `eval' command is a shell construct that reruns the shell's
+parsing process. This keeps things properly quoted.
- # if time given is 12-hour hours and it's after that
- # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
- # then add 12 to real hour
- if (hour < 12 && now["hour"] > hour)
- hour += 12
+ This version of `igawk' represents my fifth version of this program.
+There are four key simplifications that make the program work better:
- # set target time in seconds since midnight
- target = (hour * 60 * 60) + (minute * 60)
+ * Using `@include' even for the files named with `-f' makes building
+ the initial collected `awk' program much simpler; all the
+ `@include' processing can be done once.
- # get current time in seconds since midnight
- current = (now["hour"] * 60 * 60) + \
- (now["minute"] * 60) + now["second"]
+ * Not trying to save the line read with `getline' in the `pathto()'
+ function when testing for the file's accessibility for use with
+ the main program simplifies things considerably.
- # how long to sleep for
- naptime = target - current
- if (naptime <= 0) {
- print "time is in the past!" > "/dev/stderr"
- exit 1
- }
+ * Using a `getline' loop in the `BEGIN' rule does it all in one
+ place. It is not necessary to call out to a separate loop for
+ processing nested `@include' statements.
- Finally, the program uses the `system()' function (*note I/O
-Functions::) to call the `sleep' utility. The `sleep' utility simply
-pauses for the given number of seconds. If the exit status is not zero,
-the program assumes that `sleep' was interrupted and exits. If `sleep'
-exited with an OK status (zero), then the program prints the message in
-a loop, again using `sleep' to delay for however many seconds are
-necessary:
+ * Instead of saving the expanded program in a temporary file,
+ putting it in a shell variable avoids some potential security
+ problems. This has the disadvantage that the script relies upon
+ more features of the `sh' language, making it harder to follow for
+ those who aren't familiar with `sh'.
- # zzzzzz..... go away if interrupted
- if (system(sprintf("sleep %d", naptime)) != 0)
- exit 1
+ Also, this program illustrates that it is often worthwhile to combine
+`sh' and `awk' programming together. You can usually accomplish quite
+a lot, without having to resort to low-level programming in C or C++,
+and it is frequently easier to do certain kinds of string and argument
+manipulation using the shell than it is in `awk'.
- # time to notify!
- command = sprintf("sleep %d", delay)
- for (i = 1; i <= count; i++) {
- print message
- # if sleep command interrupted, go away
- if (system(command) != 0)
- break
- }
+ Finally, `igawk' shows that it is not always necessary to add new
+features to a program; they can often be layered on top.
- exit 0
- }
+ As an additional example of this, consider the idea of having two
+files in a directory in the search path:
-
-File: gawk.info, Node: Translate Program, Next: Labels Program, Prev: Alarm Program, Up: Miscellaneous Programs
+`default.awk'
+ This file contains a set of default library functions, such as
+ `getopt()' and `assert()'.
-14.3.3 Transliterating Characters
----------------------------------
+`site.awk'
+ This file contains library functions that are specific to a site or
+ installation; i.e., locally developed functions. Having a
+ separate file allows `default.awk' to change with new `gawk'
+ releases, without requiring the system administrator to update it
+ each time by adding the local functions.
-The system `tr' utility transliterates characters. For example, it is
-often used to map uppercase letters into lowercase for further
-processing:
+ One user suggested that `gawk' be modified to automatically read
+these files upon startup. Instead, it would be very simple to modify
+`igawk' to do this. Since `igawk' can process nested `@include'
+directives, `default.awk' could simply contain `@include' statements
+for the desired library functions.
- GENERATE DATA | tr 'A-Z' 'a-z' | PROCESS DATA ...
+ ---------- Footnotes ----------
- `tr' requires two lists of characters.(1) When processing the
-input, the first character in the first list is replaced with the first
-character in the second list, the second character in the first list is
-replaced with the second character in the second list, and so on. If
-there are more characters in the "from" list than in the "to" list, the
-last character of the "to" list is used for the remaining characters in
-the "from" list.
+ (1) Fully explaining the `sh' language is beyond the scope of this
+book. We provide some minimal explanations, but see a good shell
+programming book if you wish to understand things in more depth.
- Some time ago, a user proposed that a transliteration function should
-be added to `gawk'. The following program was written to prove that
-character transliteration could be done with a user-level function.
-This program is not as complete as the system `tr' utility but it does
-most of the job.
+ (2) On some very old versions of `awk', the test `getline junk < t'
+can loop forever if the file exists but is empty. Caveat emptor.
- The `translate' program demonstrates one of the few weaknesses of
-standard `awk': dealing with individual characters is very painful,
-requiring repeated use of the `substr()', `index()', and `gsub()'
-built-in functions (*note String Functions::).(2) There are two
-functions. The first, `stranslate()', takes three arguments:
+
+File: gawk.info, Node: Anagram Program, Next: Signature Program, Prev: Igawk Program, Up: Miscellaneous Programs
-`from'
- A list of characters from which to translate.
+13.3.10 Finding Anagrams From A Dictionary
+------------------------------------------
-`to'
- A list of characters to which to translate.
+An interesting programming challenge is to search for "anagrams" in a
+word list (such as `/usr/share/dict/words' on many GNU/Linux systems).
+One word is an anagram of another if both words contain the same letters
+(for example, "babbling" and "blabbing").
-`target'
- The string on which to do the translation.
+ An elegant algorithm is presented in Column 2, Problem C of Jon
+Bentley's `Programming Pearls', second edition. The idea is to give
+words that are anagrams a common signature, sort all the words together
+by their signature, and then print them. Dr. Bentley observes that
+taking the letters in each word and sorting them produces that common
+signature.
- Associative arrays make the translation part fairly easy. `t_ar'
-holds the "to" characters, indexed by the "from" characters. Then a
-simple loop goes through `from', one character at a time. For each
-character in `from', if the character appears in `target', it is
-replaced with the corresponding `to' character.
+ The following program uses arrays of arrays to bring together words
+with the same signature and array sorting to print the words in sorted
+order.
- The `translate()' function simply calls `stranslate()' using `$0' as
-the target. The main program sets two global variables, `FROM' and
-`TO', from the command line, and then changes `ARGV' so that `awk'
-reads from the standard input.
+ # anagram.awk --- An implementation of the anagram finding algorithm
+ # from Jon Bentley's "Programming Pearls", 2nd edition.
+ # Addison Wesley, 2000, ISBN 0-201-65788-0.
+ # Column 2, Problem C, section 2.8, pp 18-20.
- Finally, the processing rule simply calls `translate()' for each
-record:
+ /'s$/ { next } # Skip possessives
- # translate.awk --- do tr-like stuff
- # Bugs: does not handle things like: tr A-Z a-z, it has
- # to be spelled out. However, if `to' is shorter than `from',
- # the last character in `to' is used for the rest of `from'.
+ The program starts with a header, and then a rule to skip
+possessives in the dictionary file. The next rule builds up the data
+structure. The first dimension of the array is indexed by the
+signature; the second dimension is the word itself:
- function stranslate(from, to, target, lf, lt, ltarget, t_ar, i, c,
- result)
{
- lf = length(from)
- lt = length(to)
- ltarget = length(target)
- for (i = 1; i <= lt; i++)
- t_ar[substr(from, i, 1)] = substr(to, i, 1)
- if (lt < lf)
- for (; i <= lf; i++)
- t_ar[substr(from, i, 1)] = substr(to, lt, 1)
- for (i = 1; i <= ltarget; i++) {
- c = substr(target, i, 1)
- if (c in t_ar)
- c = t_ar[c]
- result = result c
- }
- return result
+ key = word2key($1) # Build signature
+ data[key][$1] = $1 # Store word with signature
}
- function translate(from, to)
+ The `word2key()' function creates the signature. It splits the word
+apart into individual letters, sorts the letters, and then joins them
+back together:
+
+ # word2key --- split word apart into letters, sort, joining back together
+
+ function word2key(word, a, i, n, result)
{
- return $0 = stranslate(from, to, $0)
+ n = split(word, a, "")
+ asort(a)
+
+ for (i = 1; i <= n; i++)
+ result = result a[i]
+
+ return result
}
- # main program
- BEGIN {
- if (ARGC < 3) {
- print "usage: translate from to" > "/dev/stderr"
- exit
+ Finally, the `END' rule traverses the array and prints out the
+anagram lists. It sends the output to the system `sort' command, since
+otherwise the anagrams would appear in arbitrary order:
+
+ END {
+ sort = "sort"
+ for (key in data) {
+ # Sort words with same key
+ nwords = asorti(data[key], words)
+ if (nwords == 1)
+ continue
+
+ # And print. Minor glitch: trailing space at end of each line
+ for (j = 1; j <= nwords; j++)
+ printf("%s ", words[j]) | sort
+ print "" | sort
}
- FROM = ARGV[1]
- TO = ARGV[2]
- ARGC = 2
- ARGV[1] = "-"
+ close(sort)
}
- {
- translate(FROM, TO)
- print
- }
+ Here is some partial output when the program is run:
- While it is possible to do character transliteration in a user-level
-function, it is not necessarily efficient, and we (the `gawk' authors)
-started to consider adding a built-in function. However, shortly after
-writing this program, we learned that the System V Release 4 `awk' had
-added the `toupper()' and `tolower()' functions (*note String
-Functions::). These functions handle the vast majority of the cases
-where character transliteration is necessary, and so we chose to simply
-add those functions to `gawk' as well and then leave well enough alone.
+ $ gawk -f anagram.awk /usr/share/dict/words | grep '^b'
+ ...
+ babbled blabbed
+ babbler blabber brabble
+ babblers blabbers brabbles
+ babbling blabbing
+ babbly blabby
+ babel bable
+ babels beslab
+ babery yabber
+ ...
- An obvious improvement to this program would be to set up the `t_ar'
-array only once, in a `BEGIN' rule. However, this assumes that the
-"from" and "to" lists will never change throughout the lifetime of the
-program.
+
+File: gawk.info, Node: Signature Program, Prev: Anagram Program, Up: Miscellaneous Programs
+
+13.3.11 And Now For Something Completely Different
+--------------------------------------------------
+
+The following program was written by Davide Brini and is published on
+his website (http://backreference.org/2011/02/03/obfuscated-awk/). It
+serves as his signature in the Usenet group `comp.lang.awk'. He
+supplies the following copyright terms:
+
+ Copyright (C) 2008 Davide Brini
+
+ Copying and distribution of the code published in this page, with
+ or without modification, are permitted in any medium without
+ royalty provided the copyright notice and this notice are
+ preserved.
- ---------- Footnotes ----------
+ Here is the program:
- (1) On some older systems, `tr' may require that the lists be
-written as range expressions enclosed in square brackets (`[a-z]') and
-quoted, to prevent the shell from attempting a file name expansion.
-This is not a feature.
+ awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c";
+ printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O,
+ X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O,
+ O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),x-O}'
- (2) This program was written before `gawk' acquired the ability to
-split each character in a string into separate array elements.
+ We leave it to you to determine what the program does.
-File: gawk.info, Node: Labels Program, Next: Word Sorting, Prev: Translate Program, Up: Miscellaneous Programs
-
-14.3.4 Printing Mailing Labels
-------------------------------
-
-Here is a "real world"(1) program. This script reads lists of names and
-addresses and generates mailing labels. Each page of labels has 20
-labels on it, two across and 10 down. The addresses are guaranteed to
-be no more than five lines of data. Each address is separated from the
-next by a blank line.
+File: gawk.info, Node: Debugger, Next: Dynamic Extensions, Prev: Sample Programs, Up: Top
- The basic idea is to read 20 labels worth of data. Each line of
-each label is stored in the `line' array. The single rule takes care
-of filling the `line' array and printing the page when 20 labels have
-been read.
+14 Debugging `awk' Programs
+***************************
- The `BEGIN' rule simply sets `RS' to the empty string, so that `awk'
-splits records at blank lines (*note Records::). It sets `MAXLINES' to
-100, since 100 is the maximum number of lines on the page (20 * 5 =
-100).
+It would be nice if computer programs worked perfectly the first time
+they were run, but in real life, this rarely happens for programs of
+any complexity. Thus, most programming languages have facilities
+available for "debugging" programs, and now `awk' is no exception.
- Most of the work is done in the `printpage()' function. The label
-lines are stored sequentially in the `line' array. But they have to
-print horizontally; `line[1]' next to `line[6]', `line[2]' next to
-`line[7]', and so on. Two loops are used to accomplish this. The
-outer loop, controlled by `i', steps through every 10 lines of data;
-this is each row of labels. The inner loop, controlled by `j', goes
-through the lines within the row. As `j' goes from 0 to 4, `i+j' is
-the `j'-th line in the row, and `i+j+5' is the entry next to it. The
-output ends up looking something like this:
+ The `gawk' debugger is purposely modeled after the GNU Debugger
+(GDB) (http://www.gnu.org/software/gdb/) command-line debugger. If you
+are familiar with GDB, learning how to use `gawk' for debugging your
+program is easy.
- line 1 line 6
- line 2 line 7
- line 3 line 8
- line 4 line 9
- line 5 line 10
- ...
+* Menu:
-The `printf' format string `%-41s' left-aligns the data and prints it
-within a fixed-width field.
+* Debugging:: Introduction to `gawk' debugger.
+* Sample Debugging Session:: Sample debugging session.
+* List of Debugger Commands:: Main debugger commands.
+* Readline Support:: Readline support.
+* Limitations:: Limitations and future plans.
- As a final note, an extra blank line is printed at lines 21 and 61,
-to keep the output lined up on the labels. This is dependent on the
-particular brand of labels in use when the program was written. You
-will also note that there are two blank lines at the top and two blank
-lines at the bottom.
+
+File: gawk.info, Node: Debugging, Next: Sample Debugging Session, Up: Debugger
- The `END' rule arranges to flush the final page of labels; there may
-not have been an even multiple of 20 labels in the data:
+14.1 Introduction to `gawk' Debugger
+====================================
- # labels.awk --- print mailing labels
+This minor node introduces debugging in general and begins the
+discussion of debugging in `gawk'.
- # Each label is 5 lines of data that may have blank lines.
- # The label sheets have 2 blank lines at the top and 2 at
- # the bottom.
+* Menu:
- BEGIN { RS = "" ; MAXLINES = 100 }
+* Debugging Concepts:: Debugging in General.
+* Debugging Terms:: Additional Debugging Concepts.
+* Awk Debugging:: Awk Debugging.
- function printpage( i, j)
- {
- if (Nlines <= 0)
- return
+
+File: gawk.info, Node: Debugging Concepts, Next: Debugging Terms, Up: Debugging
- printf "\n\n" # header
+14.1.1 Debugging in General
+---------------------------
- for (i = 1; i <= Nlines; i += 10) {
- if (i == 21 || i == 61)
- print ""
- for (j = 0; j < 5; j++) {
- if (i + j > MAXLINES)
- break
- printf " %-41s %s\n", line[i+j], line[i+j+5]
- }
- print ""
- }
+(If you have used debuggers in other languages, you may want to skip
+ahead to the next section on the specific features of the `awk'
+debugger.)
- printf "\n\n" # footer
+ Of course, a debugging program cannot remove bugs for you, since it
+has no way of knowing what you or your users consider a "bug" and what
+is a "feature." (Sometimes, we humans have a hard time with this
+ourselves.) In that case, what can you expect from such a tool? The
+answer to that depends on the language being debugged, but in general,
+you can expect at least the following:
- delete line
- }
+ * The ability to watch a program execute its instructions one by one,
+ giving you, the programmer, the opportunity to think about what is
+ happening on a time scale of seconds, minutes, or hours, rather
+ than the nanosecond time scale at which the code usually runs.
- # main rule
- {
- if (Count >= 20) {
- printpage()
- Count = 0
- Nlines = 0
- }
- n = split($0, a, "\n")
- for (i = 1; i <= n; i++)
- line[++Nlines] = a[i]
- for (; i <= 5; i++)
- line[++Nlines] = ""
- Count++
- }
+ * The opportunity to not only passively observe the operation of your
+ program, but to control it and try different paths of execution,
+ without having to change your source files.
- END \
- {
- printpage()
- }
+ * The chance to see the values of data in the program at any point in
+ execution, and also to change that data on the fly, to see how that
+ affects what happens afterwards. (This often includes the ability
+ to look at internal data structures besides the variables you
+ actually defined in your code.)
- ---------- Footnotes ----------
+ * The ability to obtain additional information about your program's
+ state or even its internal structure.
- (1) "Real world" is defined as "a program actually used to get
-something done."
+ All of these tools provide a great amount of help in using your own
+skills and understanding of the goals of your program to find where it
+is going wrong (or, for that matter, to better comprehend a perfectly
+functional program that you or someone else wrote).
-File: gawk.info, Node: Word Sorting, Next: History Sorting, Prev: Labels Program, Up: Miscellaneous Programs
+File: gawk.info, Node: Debugging Terms, Next: Awk Debugging, Prev: Debugging Concepts, Up: Debugging
-14.3.5 Generating Word-Usage Counts
------------------------------------
+14.1.2 Additional Debugging Concepts
+------------------------------------
-When working with large amounts of text, it can be interesting to know
-how often different words appear. For example, an author may overuse
-certain words, in which case she might wish to find synonyms to
-substitute for words that appear too often. This node develops a
-program for counting words and presenting the frequency information in
-a useful format.
+Before diving in to the details, we need to introduce several important
+concepts that apply to just about all debuggers. The following list
+defines terms used throughout the rest of this major node.
- At first glance, a program like this would seem to do the job:
+"Stack Frame"
+ Programs generally call functions during the course of their
+ execution. One function can call another, or a function can call
+ itself (recursion). You can view the chain of called functions
+ (main program calls A, which calls B, which calls C), as a stack
+ of executing functions: the currently running function is the
+ topmost one on the stack, and when it finishes (returns), the next
+ one down then becomes the active function. Such a stack is termed
+ a "call stack".
- # Print list of word frequencies
+ For each function on the call stack, the system maintains a data
+ area that contains the function's parameters, local variables, and
+ return value, as well as any other "bookkeeping" information
+ needed to manage the call stack. This data area is termed a
+ "stack frame".
- {
- for (i = 1; i <= NF; i++)
- freq[$i]++
- }
+ `gawk' also follows this model, and gives you access to the call
+ stack and to each stack frame. You can see the call stack, as well
+ as from where each function on the stack was invoked. Commands
+ that print the call stack print information about each stack frame
+ (as detailed later on).
- END {
- for (word in freq)
- printf "%s\t%d\n", word, freq[word]
- }
+"Breakpoint"
+ During debugging, you often wish to let the program run until it
+ reaches a certain point, and then continue execution from there one
+ statement (or instruction) at a time. The way to do this is to set
+ a "breakpoint" within the program. A breakpoint is where the
+ execution of the program should break off (stop), so that you can
+ take over control of the program's execution. You can add and
+ remove as many breakpoints as you like.
- The program relies on `awk''s default field splitting mechanism to
-break each line up into "words," and uses an associative array named
-`freq', indexed by each word, to count the number of times the word
-occurs. In the `END' rule, it prints the counts.
+"Watchpoint"
+ A watchpoint is similar to a breakpoint. The difference is that
+ breakpoints are oriented around the code: stop when a certain
+ point in the code is reached. A watchpoint, however, specifies
+ that program execution should stop when a _data value_ is changed.
+ This is useful, since sometimes it happens that a variable
+ receives an erroneous value, and it's hard to track down where
+ this happens just by looking at the code. By using a watchpoint,
+ you can stop whenever a variable is assigned to, and usually find
+ the errant code quite quickly.
- This program has several problems that prevent it from being useful
-on real text files:
+
+File: gawk.info, Node: Awk Debugging, Prev: Debugging Terms, Up: Debugging
- * The `awk' language considers upper- and lowercase characters to be
- distinct. Therefore, "bartender" and "Bartender" are not treated
- as the same word. This is undesirable, since in normal text, words
- are capitalized if they begin sentences, and a frequency analyzer
- should not be sensitive to capitalization.
+14.1.3 Awk Debugging
+--------------------
- * Words are detected using the `awk' convention that fields are
- separated just by whitespace. Other characters in the input
- (except newlines) don't have any special meaning to `awk'. This
- means that punctuation characters count as part of words.
+Debugging an `awk' program has some specific aspects that are not
+shared with other programming languages.
- * The output does not come out in any useful order. You're more
- likely to be interested in which words occur most frequently or in
- having an alphabetized table of how frequently each word occurs.
+ First of all, the fact that `awk' programs usually take input
+line-by-line from a file or files and operate on those lines using
+specific rules makes it especially useful to organize viewing the
+execution of the program in terms of these rules. As we will see, each
+`awk' rule is treated almost like a function call, with its own
+specific block of instructions.
- The first problem can be solved by using `tolower()' to remove case
-distinctions. The second problem can be solved by using `gsub()' to
-remove punctuation characters. Finally, we solve the third problem by
-using the system `sort' utility to process the output of the `awk'
-script. Here is the new version of the program:
+ In addition, since `awk' is by design a very concise language, it is
+easy to lose sight of everything that is going on "inside" each line of
+`awk' code. The debugger provides the opportunity to look at the
+individual primitive instructions carried out by the higher-level `awk'
+commands.
+
+
+File: gawk.info, Node: Sample Debugging Session, Next: List of Debugger Commands, Prev: Debugging, Up: Debugger
- # wordfreq.awk --- print list of word frequencies
+14.2 Sample Debugging Session
+=============================
- {
- $0 = tolower($0) # remove case distinctions
- # remove punctuation
- gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
- for (i = 1; i <= NF; i++)
- freq[$i]++
- }
+In order to illustrate the use of `gawk' as a debugger, let's look at a
+sample debugging session. We will use the `awk' implementation of the
+POSIX `uniq' command described earlier (*note Uniq Program::) as our
+example.
- END {
- for (word in freq)
- printf "%s\t%d\n", word, freq[word]
- }
+* Menu:
- Assuming we have saved this program in a file named `wordfreq.awk',
-and that the data is in `file1', the following pipeline:
+* Debugger Invocation:: How to Start the Debugger.
+* Finding The Bug:: Finding the Bug.
- awk -f wordfreq.awk file1 | sort -k 2nr
+
+File: gawk.info, Node: Debugger Invocation, Next: Finding The Bug, Up: Sample Debugging Session
-produces a table of the words appearing in `file1' in order of
-decreasing frequency.
+14.2.1 How to Start the Debugger
+--------------------------------
- The `awk' program suitably massages the data and produces a word
-frequency table, which is not ordered. The `awk' script's output is
-then sorted by the `sort' utility and printed on the screen.
+Starting the debugger is almost exactly like running `awk', except you
+have to pass an additional option `--debug' or the corresponding short
+option `-D'. The file(s) containing the program and any supporting
+code are given on the command line as arguments to one or more `-f'
+options. (`gawk' is not designed to debug command-line programs, only
+programs contained in files.) In our case, we invoke the debugger like
+this:
- The options given to `sort' specify a sort that uses the second
-field of each input line (skipping one field), that the sort keys
-should be treated as numeric quantities (otherwise `15' would come
-before `5'), and that the sorting should be done in descending
-(reverse) order.
+ $ gawk -D -f getopt.awk -f join.awk -f uniq.awk inputfile
- The `sort' could even be done from within the program, by changing
-the `END' action to:
+where both `getopt.awk' and `uniq.awk' are in `$AWKPATH'. (Experienced
+users of GDB or similar debuggers should note that this syntax is
+slightly different from what they are used to. With `gawk' debugger,
+the arguments for running the program are given in the command line to
+the debugger rather than as part of the `run' command at the debugger
+prompt.)
- END {
- sort = "sort -k 2nr"
- for (word in freq)
- printf "%s\t%d\n", word, freq[word] | sort
- close(sort)
- }
+ Instead of immediately running the program on `inputfile', as `gawk'
+would ordinarily do, the debugger merely loads all the program source
+files, compiles them internally, and then gives us a prompt:
- This way of sorting must be used on systems that do not have true
-pipes at the command-line (or batch-file) level. See the general
-operating system documentation for more information on how to use the
-`sort' program.
+ gawk>
+
+from which we can issue commands to the debugger. At this point, no
+code has been executed.
-File: gawk.info, Node: History Sorting, Next: Extract Program, Prev: Word Sorting, Up: Miscellaneous Programs
+File: gawk.info, Node: Finding The Bug, Prev: Debugger Invocation, Up: Sample Debugging Session
-14.3.6 Removing Duplicates from Unsorted Text
----------------------------------------------
+14.2.2 Finding the Bug
+----------------------
-The `uniq' program (*note Uniq Program::), removes duplicate lines from
-_sorted_ data.
+Let's say that we are having a problem using (a faulty version of)
+`uniq.awk' in the "field-skipping" mode, and it doesn't seem to be
+catching lines which should be identical when skipping the first field,
+such as:
- Suppose, however, you need to remove duplicate lines from a data
-file but that you want to preserve the order the lines are in. A good
-example of this might be a shell history file. The history file keeps
-a copy of all the commands you have entered, and it is not unusual to
-repeat a command several times in a row. Occasionally you might want
-to compact the history by removing duplicate entries. Yet it is
-desirable to maintain the order of the original commands.
+ awk is a wonderful program!
+ gawk is a wonderful program!
- This simple program does the job. It uses two arrays. The `data'
-array is indexed by the text of each line. For each line, `data[$0]'
-is incremented. If a particular line has not been seen before, then
-`data[$0]' is zero. In this case, the text of the line is stored in
-`lines[count]'. Each element of `lines' is a unique command, and the
-indices of `lines' indicate the order in which those lines are
-encountered. The `END' rule simply prints out the lines, in order:
+ This could happen if we were thinking (C-like) of the fields in a
+record as being numbered in a zero-based fashion, so instead of the
+lines:
- # histsort.awk --- compact a shell history file
- # Thanks to Byron Rakitzis for the general idea
+ clast = join(alast, fcount+1, n)
+ cline = join(aline, fcount+1, m)
- {
- if (data[$0]++ == 0)
- lines[++count] = $0
- }
+we wrote:
- END {
- for (i = 1; i <= count; i++)
- print lines[i]
- }
+ clast = join(alast, fcount, n)
+ cline = join(aline, fcount, m)
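The effect of that off-by-one can be reproduced outside the debugger. The following is a minimal sketch and is not part of `uniq.awk': the shell function `field_key' and its simplified local `join()' are hypothetical stand-ins used only for illustration. It prints the comparison key built from field START through the last field, so START plays the role of the second argument to `join()' in the lines above.

```shell
# field_key START: for each input line, print fields START..NF joined by spaces.
# Hypothetical helper for illustration; join() here is a simplified stand-in
# for the library function of the same name.
field_key() {
    awk -v start="$1" '
    function join(array, first, last,    result, i) {
        result = array[first]
        for (i = first + 1; i <= last; i++)
            result = result " " array[i]
        return result
    }
    {
        n = split($0, fields)
        print join(fields, start + 0, n)
    }'
}

# Skipping one field (fcount = 1): the correct start index is fcount + 1 = 2.
printf '%s\n' 'awk is a wonderful program!' 'gawk is a wonderful program!' |
    field_key 2     # both keys are identical

# The faulty version starts at fcount = 1, so the first field is not skipped.
printf '%s\n' 'awk is a wonderful program!' 'gawk is a wonderful program!' |
    field_key 1     # keys differ
```

With a start index of 2 the two sample lines yield the same key and would be treated as duplicates; with a start index of 1 the keys still contain `awk' and `gawk', so the lines compare as different.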
- This program also provides a foundation for generating other useful
-information. For example, using the following `print' statement in the
-`END' rule indicates how often a particular command is used:
+ The first thing we usually want to do when trying to investigate a
+problem like this is to put a breakpoint in the program so that we can
+watch it at work and catch what it is doing wrong. A reasonable spot
+for a breakpoint in `uniq.awk' is at the beginning of the function
+`are_equal()', which compares the current line with the previous one.
+To set the breakpoint, use the `b' (breakpoint) command:
- print data[lines[i]], lines[i]
+ gawk> b are_equal
+ -| Breakpoint 1 set at file `awklib/eg/prog/uniq.awk', line 64
- This works because `data[$0]' is incremented each time a line is
-seen.
+ The debugger tells us the file and line number where the breakpoint
+is. Now type `r' or `run' and the program runs until it hits the
+breakpoint for the first time:
-
-File: gawk.info, Node: Extract Program, Next: Simple Sed, Prev: History Sorting, Up: Miscellaneous Programs
+ gawk> r
+ -| Starting program:
+ -| Stopping in Rule ...
+ -| Breakpoint 1, are_equal(n, m, clast, cline, alast, aline)
+ at `awklib/eg/prog/uniq.awk':64
+ -| 64 if (fcount == 0 && charcount == 0)
+ gawk>
-14.3.7 Extracting Programs from Texinfo Source Files
-----------------------------------------------------
+ Now we can look at what's going on inside our program. First of all,
+let's see how we got to where we are. At the prompt, we type `bt'
+(short for "backtrace"), and the debugger responds with a listing of
+the current stack frames:
-The nodes *note Library Functions::, and *note Sample Programs::, are
-the top level nodes for a large number of `awk' programs. If you want
-to experiment with these programs, it is tedious to have to type them
-in by hand. Here we present a program that can extract parts of a
-Texinfo input file into separate files.
+ gawk> bt
+ -| #0 are_equal(n, m, clast, cline, alast, aline)
+ at `awklib/eg/prog/uniq.awk':69
+ -| #1 in main() at `awklib/eg/prog/uniq.awk':89
-This Info file is written in Texinfo (http://texinfo.org), the GNU
-project's document formatting language. A single Texinfo source file
-can be used to produce both printed and online documentation. The
-Texinfo language is described fully, starting with *note (Texinfo)Top::
-texinfo,Texinfo--The GNU Documentation Format.
+ This tells us that `are_equal()' was called by the main program at
+line 89 of `uniq.awk'. (This is not a big surprise, since this is the
+only call to `are_equal()' in the program, but in more complex
+programs, knowing who called a function and with what parameters can be
+the key to finding the source of the problem.)
- For our purposes, it is enough to know three things about Texinfo
-input files:
+ Now that we're in `are_equal()', we can start looking at the values
+of some variables. Let's say we type `p n' (`p' is short for "print").
+We would expect to see the value of `n', a parameter to `are_equal()'.
+Actually, the debugger gives us:
- * The "at" symbol (`@') is special in Texinfo, much as the backslash
- (`\') is in C or `awk'. Literal `@' symbols are represented in
- Texinfo source files as `@@'.
+ gawk> p n
+ -| n = untyped variable
- * Comments start with either `@c' or `@comment'. The
- file-extraction program works by using special comments that start
- at the beginning of a line.
+In this case, `n' is an uninitialized local variable, since the
+function was called without arguments (*note Function Calls::).
- * Lines containing `@group' and `@end group' commands bracket
- example text that should not be split across a page boundary.
- (Unfortunately, TeX isn't always smart enough to do things exactly
- right, so we have to give it some help.)
+ A more useful variable to display might be the current record:
- The following program, `extract.awk', reads through a Texinfo source
-file and does two things, based on the special comments. Upon seeing
-`@c system ...', it runs a command, by extracting the command text from
-the control line and passing it on to the `system()' function (*note
-I/O Functions::). Upon seeing `@c file FILENAME', each subsequent line
-is sent to the file FILENAME, until `@c endfile' is encountered. The
-rules in `extract.awk' match either `@c' or `@comment' by letting the
-`omment' part be optional. Lines containing `@group' and `@end group'
-are simply removed. `extract.awk' uses the `join()' library function
-(*note Join Function::).
+ gawk> p $0
+ -| $0 = string ("gawk is a wonderful program!")
- The example programs in the online Texinfo source for `GAWK:
-Effective AWK Programming' (`gawk.texi') have all been bracketed inside
-`file' and `endfile' lines. The `gawk' distribution uses a copy of
-`extract.awk' to extract the sample programs and install many of them
-in a standard directory where `gawk' can find them. The Texinfo file
-looks something like this:
+This might be a bit puzzling at first since this is the second line of
+our test input above. Let's look at `NR':
- ...
- This program has a @code{BEGIN} rule,
- that prints a nice message:
+ gawk> p NR
+ -| NR = number (2)
- @example
- @c file examples/messages.awk
- BEGIN @{ print "Don't panic!" @}
- @c end file
- @end example
+So we can see that `are_equal()' was only called for the second record
+of the file. Of course, this is because our program contained a rule
+for `NR == 1':
- It also prints some final advice:
+ NR == 1 {
+ last = $0
+ next
+ }
- @example
- @c file examples/messages.awk
- END @{ print "Always avoid bored archeologists!" @}
- @c end file
- @end example
- ...
+ OK, let's just check that that rule worked correctly:
- `extract.awk' begins by setting `IGNORECASE' to one, so that mixed
-upper- and lowercase letters in the directives won't matter.
+ gawk> p last
+ -| last = string ("awk is a wonderful program!")
- The first rule handles calling `system()', checking that a command is
-given (`NF' is at least three) and also checking that the command exits
-with a zero exit status, signifying OK:
+ Everything we have done so far has verified that the program has
+worked as planned, up to and including the call to `are_equal()', so
+the problem must be inside this function. To investigate further, we
+must begin "stepping through" the lines of `are_equal()'. We start by
+typing `n' (for "next"):
- # extract.awk --- extract files and run programs
- # from texinfo files
+ gawk> n
+ -| 67 if (fcount > 0) {
- BEGIN { IGNORECASE = 1 }
+ This tells us that `gawk' is now ready to execute line 67, which
+decides whether to give the lines the special "field skipping" treatment
+indicated by the `-f' command-line option. (Notice that we skipped
+from where we were before at line 64 to here, since the condition in
+line 64
+
+ if (fcount == 0 && charcount == 0)
+
+was false.)
+
+ Continuing to step, we now get to the splitting of the current and
+last records:
- /^@c(omment)?[ \t]+system/ \
- {
- if (NF < 3) {
- e = (FILENAME ":" FNR)
- e = (e ": badly formed `system' line")
- print e > "/dev/stderr"
- next
- }
- $1 = ""
- $2 = ""
- stat = system($0)
- if (stat != 0) {
- e = (FILENAME ":" FNR)
- e = (e ": warning: system returned " stat)
- print e > "/dev/stderr"
- }
- }
+ gawk> n
+ -| 68 n = split(last, alast)
+ gawk> n
+ -| 69 m = split($0, aline)
-The variable `e' is used so that the rule fits nicely on the screen.
+ At this point, we should be curious to see what our records were
+split into, so we try to look:
- The second rule handles moving data into files. It verifies that a
-file name is given in the directive. If the file named is not the
-current file, then the current file is closed. Keeping the current file
-open until a new file is encountered allows the use of the `>'
-redirection for printing the contents, keeping open file management
-simple.
+ gawk> p n m alast aline
+ -| n = number (5)
+ -| m = number (5)
+ -| alast = array, 5 elements
+ -| aline = array, 5 elements
- The `for' loop does the work. It reads lines using `getline' (*note
-Getline::). For an unexpected end of file, it calls the
-`unexpected_eof()' function. If the line is an "endfile" line, then it
-breaks out of the loop. If the line is an `@group' or `@end group'
-line, then it ignores it and goes on to the next line. Similarly,
-comments within examples are also ignored.
+(The `p' command can take more than one argument, similar to `awk''s
+`print' statement.)
- Most of the work is in the following few lines. If the line has no
-`@' symbols, the program can print it directly. Otherwise, each
-leading `@' must be stripped off. To remove the `@' symbols, the line
-is split into separate elements of the array `a', using the `split()'
-function (*note String Functions::). The `@' symbol is used as the
-separator character. Each element of `a' that is empty indicates two
-successive `@' symbols in the original line. For each two empty
-elements (`@@' in the original file), we have to add a single `@'
-symbol back in.(1)
+ This is kind of disappointing, though. All we found out is that
+there are five elements in each of our arrays. Useful enough (we now
+know that none of the words were accidentally left out), but what if we
+want to see inside the array?
- When the processing of the array is finished, `join()' is called
-with the value of `SUBSEP', to rejoin the pieces back into a single
-line. That line is then printed to the output file:
+ The first choice would be to use subscripts:
- /^@c(omment)?[ \t]+file/ \
- {
- if (NF != 3) {
- e = (FILENAME ":" FNR ": badly formed `file' line")
- print e > "/dev/stderr"
- next
- }
- if ($3 != curfile) {
- if (curfile != "")
- close(curfile)
- curfile = $3
- }
+ gawk> p alast[0]
+ -| "0" not in array `alast'
- for (;;) {
- if ((getline line) <= 0)
- unexpected_eof()
- if (line ~ /^@c(omment)?[ \t]+endfile/)
- break
- else if (line ~ /^@(end[ \t]+)?group/)
- continue
- else if (line ~ /^@c(omment+)?[ \t]+/)
- continue
- if (index(line, "@") == 0) {
- print line > curfile
- continue
- }
- n = split(line, a, "@")
- # if a[1] == "", means leading @,
- # don't add one back in.
- for (i = 2; i <= n; i++) {
- if (a[i] == "") { # was an @@
- a[i] = "@"
- if (a[i+1] == "")
- i++
- }
- }
- print join(a, 1, n, SUBSEP) > curfile
- }
- }
+Oops!
- An important thing to note is the use of the `>' redirection.
-Output done with `>' only opens the file once; it stays open and
-subsequent output is appended to the file (*note Redirection::). This
-makes it easy to mix program text and explanatory prose for the same
-sample source file (as has been done here!) without any hassle. The
-file is only closed when a new data file name is encountered or at the
-end of the input file.
+ gawk> p alast[1]
+ -| alast["1"] = string ("awk")
- Finally, the function `unexpected_eof()' prints an appropriate error
-message and then exits. The `END' rule handles the final cleanup,
-closing the open file:
+ This would be kind of slow for a 100-member array, though, so `gawk'
+provides a shortcut (reminiscent of another language not to be
+mentioned):
- function unexpected_eof()
- {
- printf("%s:%d: unexpected EOF or error\n",
- FILENAME, FNR) > "/dev/stderr"
- exit 1
- }
+ gawk> p @alast
+ -| alast["1"] = string ("awk")
+ -| alast["2"] = string ("is")
+ -| alast["3"] = string ("a")
+ -| alast["4"] = string ("wonderful")
+ -| alast["5"] = string ("program!")
- END {
- if (curfile)
- close(curfile)
- }
+ It looks like we got this far OK. Let's take another step or two:
- ---------- Footnotes ----------
+ gawk> n
+ -| 70 clast = join(alast, fcount, n)
+ gawk> n
+ -| 71 cline = join(aline, fcount, m)
- (1) This program was written before `gawk' had the `gensub()'
-function. Consider how you might use it to simplify the code.
+ Well, here we are at our error (sorry to spoil the suspense). What
+we had in mind was to join the fields starting from the second one to
+make the virtual record to compare, and if the first field was numbered
+zero, this would work. Let's look at what we've got:
-
-File: gawk.info, Node: Simple Sed, Next: Igawk Program, Prev: Extract Program, Up: Miscellaneous Programs
+ gawk> p cline clast
+ -| cline = string ("gawk is a wonderful program!")
+ -| clast = string ("awk is a wonderful program!")
-14.3.8 A Simple Stream Editor
------------------------------
+ Hey, those look pretty familiar! They're just our original,
+unaltered, input records. A little thinking (the human brain is still
+the best debugging tool), and we realize that we were off by one!
-The `sed' utility is a stream editor, a program that reads a stream of
-data, makes changes to it, and passes it on. It is often used to make
-global changes to a large file or to a stream of data generated by a
-pipeline of commands. While `sed' is a complicated program in its own
-right, its most common use is to perform global substitutions in the
-middle of a pipeline:
+ We get out of the debugger:
- command1 < orig.data | sed 's/old/new/g' | command2 > result
+ gawk> q
+ -| The program is running. Exit anyway (y/n)? y
- Here, `s/old/new/g' tells `sed' to look for the regexp `old' on each
-input line and globally replace it with the text `new', i.e., all the
-occurrences on a line. This is similar to `awk''s `gsub()' function
-(*note String Functions::).
+Then we get into an editor:
- The following program, `awksed.awk', accepts at least two
-command-line arguments: the pattern to look for and the text to replace
-it with. Any additional arguments are treated as data file names to
-process. If none are provided, the standard input is used:
+ clast = join(alast, fcount+1, n)
+ cline = join(aline, fcount+1, m)
- # awksed.awk --- do s/foo/bar/g using just print
- # Thanks to Michael Brennan for the idea
+and problem solved!
- function usage()
- {
- print "usage: awksed pat repl [files...]" > "/dev/stderr"
- exit 1
- }
+
+File: gawk.info, Node: List of Debugger Commands, Next: Readline Support, Prev: Sample Debugging Session, Up: Debugger
- BEGIN {
- # validate arguments
- if (ARGC < 3)
- usage()
+14.3 Main Debugger Commands
+===========================
- RS = ARGV[1]
- ORS = ARGV[2]
+The `gawk' debugger command set can be divided into the following
+categories:
- # don't use arguments as files
- ARGV[1] = ARGV[2] = ""
- }
+ * Breakpoint control
- # look ma, no hands!
- {
- if (RT == "")
- printf "%s", $0
- else
- print
- }
+ * Execution control
- The program relies on `gawk''s ability to have `RS' be a regexp, as
-well as on the setting of `RT' to the actual text that terminates the
-record (*note Records::).
+ * Viewing and changing data
- The idea is to have `RS' be the pattern to look for. `gawk'
-automatically sets `$0' to the text between matches of the pattern.
-This is text that we want to keep, unmodified. Then, by setting `ORS'
-to the replacement text, a simple `print' statement outputs the text we
-want to keep, followed by the replacement text.
+ * Working with the stack
- There is one wrinkle to this scheme, which is what to do if the last
-record doesn't end with text that matches `RS'. Using a `print'
-statement unconditionally prints the replacement text, which is not
-correct. However, if the file did not end in text that matches `RS',
-`RT' is set to the null string. In this case, we can print `$0' using
-`printf' (*note Printf::).
+ * Getting information
- The `BEGIN' rule handles the setup, checking for the right number of
-arguments and calling `usage()' if there is a problem. Then it sets
-`RS' and `ORS' from the command-line arguments and sets `ARGV[1]' and
-`ARGV[2]' to the null string, so that they are not treated as file names
-(*note ARGC and ARGV::).
+ * Miscellaneous
- The `usage()' function prints an error message and exits. Finally,
-the single rule handles the printing scheme outlined above, using
-`print' or `printf' as appropriate, depending upon the value of `RT'.
+ Each of these are discussed in the following subsections. In the
+following descriptions, commands which may be abbreviated show the
+abbreviation on a second description line. A debugger command name may
+also be truncated if that partial name is unambiguous. The debugger has
+the built-in capability to automatically repeat the previous command
+when just hitting <Enter>. This works for the commands `list', `next',
+`nexti', `step', `stepi' and `continue' executed without any argument.
-
-File: gawk.info, Node: Igawk Program, Next: Anagram Program, Prev: Simple Sed, Up: Miscellaneous Programs
+* Menu:
-14.3.9 An Easy Way to Use Library Functions
--------------------------------------------
+* Breakpoint Control:: Control of Breakpoints.
+* Debugger Execution Control:: Control of Execution.
+* Viewing And Changing Data:: Viewing and Changing Data.
+* Execution Stack:: Dealing with the Stack.
+* Debugger Info:: Obtaining Information about the Program and
+ the Debugger State.
+* Miscellaneous Debugger Commands:: Miscellaneous Commands.
-In *note Include Files::, we saw how `gawk' provides a built-in
-file-inclusion capability. However, this is a `gawk' extension. This
-minor node provides the motivation for making file inclusion available
-for standard `awk', and shows how to do it using a combination of shell
-and `awk' programming.
+
+File: gawk.info, Node: Breakpoint Control, Next: Debugger Execution Control, Up: List of Debugger Commands
- Using library functions in `awk' can be very beneficial. It
-encourages code reuse and the writing of general functions. Programs are
-smaller and therefore clearer. However, using library functions is
-only easy when writing `awk' programs; it is painful when running them,
-requiring multiple `-f' options. If `gawk' is unavailable, then so too
-is the `AWKPATH' environment variable and the ability to put `awk'
-functions into a library directory (*note Options::). It would be nice
-to be able to write programs in the following manner:
+14.3.1 Control of Breakpoints
+-----------------------------
- # library functions
- @include getopt.awk
- @include join.awk
- ...
+As we saw above, the first thing you probably want to do in a debugging
+session is to get your breakpoints set up, since otherwise your program
+will just run as if it was not under the debugger. The commands for
+controlling breakpoints are:
- # main program
- BEGIN {
- while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
- ...
- ...
- }
+`break' [[FILENAME`:']N | FUNCTION] [`"EXPRESSION"']
+`b' [[FILENAME`:']N | FUNCTION] [`"EXPRESSION"']
+ Without any argument, set a breakpoint at the next instruction to
+ be executed in the selected stack frame. Arguments can be one of
+ the following:
- The following program, `igawk.sh', provides this service. It
-simulates `gawk''s searching of the `AWKPATH' variable and also allows
-"nested" includes; i.e., a file that is included with `@include' can
-contain further `@include' statements. `igawk' makes an effort to only
-include files once, so that nested includes don't accidentally include
-a library function twice.
+ N
+ Set a breakpoint at line number N in the current source file.
- `igawk' should behave just like `gawk' externally. This means it
-should accept all of `gawk''s command-line arguments, including the
-ability to have multiple source files specified via `-f', and the
-ability to mix command-line and library source files.
+ FILENAME`:'N
+ Set a breakpoint at line number N in source file FILENAME.
- The program is written using the POSIX Shell (`sh') command
-language.(1) It works as follows:
+ FUNCTION
+ Set a breakpoint at entry to (the first instruction of)
+ function FUNCTION.
- 1. Loop through the arguments, saving anything that doesn't represent
- `awk' source code for later, when the expanded program is run.
+ Each breakpoint is assigned a number which can be used to delete
+ it from the breakpoint list using the `delete' command.
- 2. For any arguments that do represent `awk' text, put the arguments
- into a shell variable that will be expanded. There are two cases:
+ With a breakpoint, you may also supply a condition. This is an
+ `awk' expression (enclosed in double quotes) that the debugger
+ evaluates whenever the breakpoint is reached. If the condition is
+ true, then the debugger stops execution and prompts for a command.
+ Otherwise, it continues executing the program.
- a. Literal text, provided with `--source' or `--source='. This
- text is just appended directly.
+`clear' [[FILENAME`:']N | FUNCTION]
+ Without any argument, delete any breakpoint at the next instruction
+ to be executed in the selected stack frame. If the program stops at
+ a breakpoint, this deletes that breakpoint so that the program
+ does not stop at that location again. Arguments can be one of the
+ following:
- b. Source file names, provided with `-f'. We use a neat trick
- and append `@include FILENAME' to the shell variable's
- contents. Since the file-inclusion program works the way
- `gawk' does, this gets the text of the file included into the
- program at the correct point.
+ N
+ Delete breakpoint(s) set at line number N in the current
+ source file.
- 3. Run an `awk' program (naturally) over the shell variable's
- contents to expand `@include' statements. The expanded program is
- placed in a second shell variable.
+ FILENAME`:'N
+ Delete breakpoint(s) set at line number N in source file
+ FILENAME.
- 4. Run the expanded program with `gawk' and any other original
- command-line arguments that the user supplied (such as the data
- file names).
+ FUNCTION
+ Delete breakpoint(s) set at entry to function FUNCTION.
- This program uses shell variables extensively: for storing
-command-line arguments, the text of the `awk' program that will expand
-the user's program, for the user's original program, and for the
-expanded program. Doing so removes some potential problems that might
-arise were we to use temporary files instead, at the cost of making the
-script somewhat more complicated.
+`condition' N `"EXPRESSION"'
+ Add a condition to existing breakpoint or watchpoint N. The
+ condition is an `awk' expression that the debugger evaluates
+ whenever the breakpoint or watchpoint is reached. If the condition
+ is true, then the debugger stops execution and prompts for a
+ command. Otherwise, the debugger continues executing the program.
+ If the condition expression is not specified, any existing
+ condition is removed; i.e., the breakpoint or watchpoint is made
+ unconditional.
- The initial part of the program turns on shell tracing if the first
-argument is `debug'.
+`delete' [N1 N2 ...] [N-M]
+`d' [N1 N2 ...] [N-M]
+ Delete specified breakpoints or a range of breakpoints. Deletes
+ all defined breakpoints if no argument is supplied.
- The next part loops through all the command-line arguments. There
-are several cases of interest:
+`disable' [N1 N2 ... | N-M]
+ Disable specified breakpoints or a range of breakpoints. Without
+ any argument, disables all breakpoints.
-`--'
- This ends the arguments to `igawk'. Anything else should be
- passed on to the user's `awk' program without being evaluated.
+`enable' [`del' | `once'] [N1 N2 ...] [N-M]
+`e' [`del' | `once'] [N1 N2 ...] [N-M]
+ Enable specified breakpoints or a range of breakpoints. Without
+ any argument, enables all breakpoints. Optionally, you can
+ specify how to enable the breakpoint:
-`-W'
- This indicates that the next option is specific to `gawk'. To make
- argument processing easier, the `-W' is appended to the front of
- the remaining arguments and the loop continues. (This is an `sh'
- programming trick. Don't worry about it if you are not familiar
- with `sh'.)
+ `del'
+ Enable the breakpoint(s) temporarily, then delete it when the
+ program stops at the breakpoint.
-`-v, -F'
- These are saved and passed on to `gawk'.
+ `once'
+ Enable the breakpoint(s) temporarily, then disable it when
+ the program stops at the breakpoint.
-`-f, --file, --file=, -Wfile='
- The file name is appended to the shell variable `program' with an
- `@include' statement. The `expr' utility is used to remove the
- leading option part of the argument (e.g., `--file='). (Typical
- `sh' usage would be to use the `echo' and `sed' utilities to do
- this work. Unfortunately, some versions of `echo' evaluate escape
- sequences in their arguments, possibly mangling the program text.
- Using `expr' avoids this problem.)
+`ignore' N COUNT
+ Ignore breakpoint number N the next COUNT times it is hit.
-`--source, --source=, -Wsource='
- The source text is appended to `program'.
+`tbreak' [[FILENAME`:']N | FUNCTION]
+`t' [[FILENAME`:']N | FUNCTION]
+ Set a temporary breakpoint (enabled for only one stop). The
+ arguments are the same as for `break'.
-`--version, -Wversion'
- `igawk' prints its version number, runs `gawk --version' to get
- the `gawk' version information, and then exits.
+
+File: gawk.info, Node: Debugger Execution Control, Next: Viewing And Changing Data, Prev: Breakpoint Control, Up: List of Debugger Commands
- If none of the `-f', `--file', `-Wfile', `--source', or `-Wsource'
-arguments are supplied, then the first nonoption argument should be the
-`awk' program. If there are no command-line arguments left, `igawk'
-prints an error message and exits. Otherwise, the first argument is
-appended to `program'. In any case, after the arguments have been
-processed, `program' contains the complete text of the original `awk'
-program.
+14.3.2 Control of Execution
+---------------------------
- The program is as follows:
+Now that your breakpoints are ready, you can start running the program
+and observing its behavior. There are more commands for controlling
+execution of the program than we saw in our earlier example:
- #! /bin/sh
- # igawk --- like gawk but do @include processing
+`commands' [N]
+`silent'
+...
+`end'
+ Set a list of commands to be executed upon stopping at a
+ breakpoint or watchpoint. N is the breakpoint or watchpoint number.
+ Without a number, the last one set is used. The actual commands
+ follow, starting on the next line, and terminated by the `end'
+ command. If the command `silent' is in the list, the usual
+ messages about stopping at a breakpoint and the source line are
+ not printed. Any command in the list that resumes execution (e.g.,
+ `continue') terminates the list (an implicit `end'), and
+ subsequent commands are ignored. For example:
- if [ "$1" = debug ]
- then
- set -x
- shift
- fi
+ gawk> commands
+ > silent
+ > printf "A silent breakpoint; i = %d\n", i
+ > info locals
+ > set i = 10
+ > continue
+ > end
+ gawk>
- # A literal newline, so that program text is formatted correctly
- n='
- '
+`continue' [COUNT]
+`c' [COUNT]
+ Resume program execution. If continued from a breakpoint and COUNT
+ is specified, ignores the breakpoint at that location the next
+ COUNT times before stopping.
- # Initialize variables to empty
- program=
- opts=
+`finish'
+ Execute until the selected stack frame returns. Print the
+ returned value.
- while [ $# -ne 0 ] # loop over arguments
- do
- case $1 in
- --) shift
- break ;;
+`next' [COUNT]
+`n' [COUNT]
+ Continue execution to the next source line, stepping over function
+ calls. The argument COUNT controls how many times to repeat the
+ action, as in `step'.
- -W) shift
- # The ${x?'message here'} construct prints a
- # diagnostic if $x is the null string
- set -- -W"${@?'missing operand'}"
- continue ;;
+`nexti' [COUNT]
+`ni' [COUNT]
+ Execute one (or COUNT) instruction(s), stepping over function
+ calls.
- -[vF]) opts="$opts $1 '${2?'missing operand'}'"
- shift ;;
+`return' [VALUE]
+ Cancel execution of a function call. If VALUE (either a string or a
+ number) is specified, it is used as the function's return value.
+ If used in a frame other than the innermost one (the currently
+ executing function, i.e., frame number 0), discard all inner
+ frames in addition to the selected one, and the caller of that
+ frame becomes the innermost frame.
- -[vF]*) opts="$opts '$1'" ;;
+`run'
+`r'
+ Start/restart execution of the program. When restarting, the
+ debugger retains the current breakpoints, watchpoints, command
+ history, automatic display variables, and debugger options.
- -f) program="$program$n@include ${2?'missing operand'}"
- shift ;;
+`step' [COUNT]
+`s' [COUNT]
+ Continue execution until control reaches a different source line
+ in the current stack frame. `step' steps inside any function
+ called within the line. If the argument COUNT is supplied, steps
+ that many times before stopping, unless it encounters a breakpoint
+ or watchpoint.
+
+`stepi' [COUNT]
+`si' [COUNT]
+ Execute one (or COUNT) instruction(s), stepping inside function
+ calls. (For illustration of what is meant by an "instruction" in
+ `gawk', see the output shown under `dump' in *note Miscellaneous
+ Debugger Commands::.)
- -f*) f=$(expr "$1" : '-f\(.*\)')
- program="$program$n@include $f" ;;
+`until' [[FILENAME`:']N | FUNCTION]
+`u' [[FILENAME`:']N | FUNCTION]
+ Without any argument, continue execution until a line past the
+ current line in current stack frame is reached. With an argument,
+ continue execution until the specified location is reached, or the
+ current stack frame returns.
- -[W-]file=*)
- f=$(expr "$1" : '-.file=\(.*\)')
- program="$program$n@include $f" ;;
+
+File: gawk.info, Node: Viewing And Changing Data, Next: Execution Stack, Prev: Debugger Execution Control, Up: List of Debugger Commands
- -[W-]file)
- program="$program$n@include ${2?'missing operand'}"
- shift ;;
+14.3.3 Viewing and Changing Data
+--------------------------------
- -[W-]source=*)
- t=$(expr "$1" : '-.source=\(.*\)')
- program="$program$n$t" ;;
+The commands for viewing and changing variables inside of `gawk' are:
- -[W-]source)
- program="$program$n${2?'missing operand'}"
- shift ;;
+`display' [VAR | `$'N]
+ Add variable VAR (or field `$N') to the display list. The value
+ of the variable or field is displayed each time the program stops.
+ Each variable added to the list is identified by a unique number:
- -[W-]version)
- echo igawk: version 3.0 1>&2
- gawk --version
- exit 0 ;;
+ gawk> display x
+ -| 10: x = 1
- -[W-]*) opts="$opts '$1'" ;;
+ displays the assigned item number, the variable name and its
+ current value. If the display variable refers to a function
+ parameter, it is silently deleted from the list as soon as the
+ execution reaches a context where no such variable of the given
+ name exists. Without argument, `display' displays the current
+ values of items on the list.
- *) break ;;
- esac
- shift
- done
+`eval "AWK STATEMENTS"'
+ Evaluate AWK STATEMENTS in the context of the running program.
+ You can do anything that an `awk' program would do: assign values
+ to variables, call functions, and so on.
- if [ -z "$program" ]
- then
- program=${1?'missing program'}
- shift
- fi
+`eval' PARAM, ...
+AWK STATEMENTS
+`end'
+ This form of `eval' is similar, but it allows you to define "local
+ variables" that exist in the context of the AWK STATEMENTS,
+ instead of using variables or function parameters defined by the
+ program.
- # At this point, `program' has the program.
+`print' VAR1[`,' VAR2 ...]
+`p' VAR1[`,' VAR2 ...]
+ Print the value of a `gawk' variable or field. Fields must be
+ referenced by constants:
- The `awk' program to process `@include' directives is stored in the
-shell variable `expand_prog'. Doing this keeps the shell script
-readable. The `awk' program reads through the user's program, one line
-at a time, using `getline' (*note Getline::). The input file names and
-`@include' statements are managed using a stack. As each `@include' is
-encountered, the current file name is "pushed" onto the stack and the
-file named in the `@include' directive becomes the current file name.
-As each file is finished, the stack is "popped," and the previous input
-file becomes the current input file again. The process is started by
-making the original file the first one on the stack.
+ gawk> print $3
- The `pathto()' function does the work of finding the full path to a
-file. It simulates `gawk''s behavior when searching the `AWKPATH'
-environment variable (*note AWKPATH Variable::). If a file name has a
-`/' in it, no path search is done. Similarly, if the file name is
-`"-"', then that string is used as-is. Otherwise, the file name is
-concatenated with the name of each directory in the path, and an
-attempt is made to open the generated file name. The only way to test
-if a file can be read in `awk' is to go ahead and try to read it with
-`getline'; this is what `pathto()' does.(2) If the file can be read, it
-is closed and the file name is returned:
+ This prints the third field in the input record (if the specified
+ field does not exist, it prints `Null field'). A variable can be
+ an array element, with the subscripts being constant values. To
+ print the contents of an array, prefix the name of the array with
+ the `@' symbol:
- expand_prog='
+ gawk> print @a
- function pathto(file, i, t, junk)
- {
- if (index(file, "/") != 0)
- return file
+ This prints the indices and the corresponding values for all
+ elements in the array `a'.
- if (file == "-")
- return file
+`printf' FORMAT [`,' ARG ...]
+ Print formatted text. The FORMAT may include escape sequences,
+ such as `\n' (*note Escape Sequences::). No newline is printed
+ unless one is specified.
- for (i = 1; i <= ndirs; i++) {
- t = (pathlist[i] "/" file)
- if ((getline junk < t) > 0) {
- # found it
- close(t)
- return t
- }
- }
- return ""
- }
+`set' VAR`='VALUE
+ Assign a constant (number or string) value to an `awk' variable or
+ field. String values must be enclosed between double quotes
+ (`"..."').
- The main program is contained inside one `BEGIN' rule. The first
-thing it does is set up the `pathlist' array that `pathto()' uses.
-After splitting the path on `:', null elements are replaced with `"."',
-which represents the current directory:
+ You can also set special `awk' variables, such as `FS', `NF',
+ `NR', etc.
- BEGIN {
- path = ENVIRON["AWKPATH"]
- ndirs = split(path, pathlist, ":")
- for (i = 1; i <= ndirs; i++) {
- if (pathlist[i] == "")
- pathlist[i] = "."
- }
+`watch' VAR | `$'N [`"EXPRESSION"']
+`w' VAR | `$'N [`"EXPRESSION"']
+ Add variable VAR (or field `$N') to the watch list. The debugger
+ then stops whenever the value of the variable or field changes.
+ Each watched item is assigned a number which can be used to delete
+ it from the watch list using the `unwatch' command.
- The stack is initialized with `ARGV[1]', which will be `/dev/stdin'.
-The main loop comes next. Input lines are read in succession. Lines
-that do not start with `@include' are printed verbatim. If the line
-does start with `@include', the file name is in `$2'. `pathto()' is
-called to generate the full path. If it cannot, then the program
-prints an error message and continues.
+ With a watchpoint, you may also supply a condition. This is an
+ `awk' expression (enclosed in double quotes) that the debugger
+ evaluates whenever the watchpoint is reached. If the condition is
+ true, then the debugger stops execution and prompts for a command.
+ Otherwise, `gawk' continues executing the program.
- The next thing to check is if the file is included already. The
-`processed' array is indexed by the full file name of each included
-file and it tracks this information for us. If the file is seen again,
-a warning message is printed. Otherwise, the new file name is pushed
-onto the stack and processing continues.
+`undisplay' [N]
+ Remove item number N (or all items, if no argument) from the
+ automatic display list.
- Finally, when `getline' encounters the end of the input file, the
-file is closed and the stack is popped. When `stackptr' is less than
-zero, the program is done:
+`unwatch' [N]
+ Remove item number N (or all items, if no argument) from the watch
+ list.
- stackptr = 0
- input[stackptr] = ARGV[1] # ARGV[1] is first file
- for (; stackptr >= 0; stackptr--) {
- while ((getline < input[stackptr]) > 0) {
- if (tolower($1) != "@include") {
- print
- continue
- }
- fpath = pathto($2)
- if (fpath == "") {
- printf("igawk:%s:%d: cannot find %s\n",
- input[stackptr], FNR, $2) > "/dev/stderr"
- continue
- }
- if (! (fpath in processed)) {
- processed[fpath] = input[stackptr]
- input[++stackptr] = fpath # push onto stack
- } else
- print $2, "included in", input[stackptr],
- "already included in",
- processed[fpath] > "/dev/stderr"
- }
- close(input[stackptr])
- }
- }' # close quote ends `expand_prog' variable
+
+File: gawk.info, Node: Execution Stack, Next: Debugger Info, Prev: Viewing And Changing Data, Up: List of Debugger Commands
- processed_program=$(gawk -- "$expand_prog" /dev/stdin << EOF
- $program
- EOF
- )
+14.3.4 Dealing with the Stack
+-----------------------------
- The shell construct `COMMAND << MARKER' is called a "here document".
-Everything in the shell script up to the MARKER is fed to COMMAND as
-input. The shell processes the contents of the here document for
-variable and command substitution (and possibly other things as well,
-depending upon the shell).
+Whenever you run a program which contains any function calls, `gawk'
+maintains a stack of all of the function calls leading up to where the
+program is right now. You can see how you got to where you are, and
+also move around in the stack to see what the state of things was in the
+functions which called the one you are in. The commands for doing this
+are:
- The shell construct `$(...)' is called "command substitution". The
-output of the command inside the parentheses is substituted into the
-command line. Because the result is used in a variable assignment, it
-is saved as a single string, even if the results contain whitespace.
+`backtrace' [COUNT]
+`bt' [COUNT]
+ Print a backtrace of all function calls (stack frames), or
+ innermost COUNT frames if COUNT > 0. Print the outermost COUNT
+ frames if COUNT < 0. The backtrace displays the name and
+ arguments to each function, the source file name, and the line
+ number.
- The expanded program is saved in the variable `processed_program'.
-It's done in these steps:
+`down' [COUNT]
+ Move COUNT (default 1) frames down the stack toward the innermost
+ frame. Then select and print the frame.
- 1. Run `gawk' with the `@include'-processing program (the value of
- the `expand_prog' shell variable) on standard input.
+`frame' [N]
+`f' [N]
+ Select and print (frame number, function and argument names,
+ source file, and the source line) stack frame N. Frame 0 is the
+ currently executing, or "innermost", frame (function call), frame
+ 1 is the frame that called the innermost one. The highest numbered
+ frame is the one for the main program.
- 2. Standard input is the contents of the user's program, from the
- shell variable `program'. Its contents are fed to `gawk' via a
- here document.
+`up' [COUNT]
+ Move COUNT (default 1) frames up the stack toward the outermost
+ frame. Then select and print the frame.
- 3. The results of this processing are saved in the shell variable
- `processed_program' by using command substitution.
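The three steps above can be tried in isolation. The following is a minimal sketch, not the `igawk' script itself: it assumes a POSIX shell and uses `cat' as a stand-in for the `gawk' invocation, so only the here document and command substitution mechanics are shown. The contents of `program' are a made-up example.

```shell
# Hypothetical stand-in for the user's awk program text.
program='BEGIN { print "hello,  world" }'

# The here document feeds $program to the command's standard input;
# the shell expands $program because the EOF marker is unquoted.
# Command substitution then captures the command's output.
processed_program=$(cat << EOF
$program
EOF
)

# Quoting the variable keeps the embedded whitespace intact.
printf '%s\n' "$processed_program"
```

Because the result is assigned to a variable and later used inside double quotes, the two spaces after the comma survive unchanged.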
+
+File: gawk.info, Node: Debugger Info, Next: Miscellaneous Debugger Commands, Prev: Execution Stack, Up: List of Debugger Commands
- The last step is to call `gawk' with the expanded program, along
-with the original options and command-line arguments that the user
-supplied.
+14.3.5 Obtaining Information about the Program and the Debugger State
+---------------------------------------------------------------------
- eval gawk $opts -- '"$processed_program"' '"$@"'
+Besides looking at the values of variables, there is often a need to get
+other sorts of information about the state of your program and of the
+debugging environment itself. The `gawk' debugger has one command which
+provides this information, appropriately called `info'. `info' is used
+with one of a number of arguments that tell it exactly what you want to
+know:
- The `eval' command is a shell construct that reruns the shell's
-parsing process. This keeps things properly quoted.
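A small sketch of why the quoting in that `eval' line matters, assuming a POSIX shell. `count_args' is a hypothetical helper standing in for `gawk', and the option string and arguments are made up for illustration.

```shell
# Hypothetical helper standing in for gawk: reports its argument count.
count_args() { echo "$#"; }

opts='-v x=1'                        # two words once the shell splits it
processed_program='{ print $1 }'     # must stay a single argument
set -- "a file" other                # two user arguments, one with a space

# Single quotes hide "$processed_program" and "$@" from the first parse;
# eval re-parses the line, and the inner double quotes then protect them.
with_eval=$(eval count_args $opts -- '"$processed_program"' '"$@"')

# Without any quoting, every space splits a word.
without_quotes=$(count_args $opts -- $processed_program $@)

echo "$with_eval $without_quotes"
```

With `eval', the helper sees six arguments: the two option words, `--', the program text as one argument, and the two user arguments. Without quoting it sees ten, because the program text and the file name are split at every space.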
+`info' WHAT
+`i' WHAT
+ The value for WHAT should be one of the following:
- This version of `igawk' represents my fifth version of this program.
-There are four key simplifications that make the program work better:
+ `args'
+ Arguments of the selected frame.
- * Using `@include' even for the files named with `-f' makes building
- the initial collected `awk' program much simpler; all the
- `@include' processing can be done once.
+ `break'
+ List all currently set breakpoints.
- * Not trying to save the line read with `getline' in the `pathto()'
- function when testing for the file's accessibility for use with
- the main program simplifies things considerably.
+ `display'
+ List all items in the automatic display list.
- * Using a `getline' loop in the `BEGIN' rule does it all in one
- place. It is not necessary to call out to a separate loop for
- processing nested `@include' statements.
+ `frame'
+ Description of the selected stack frame.
- * Instead of saving the expanded program in a temporary file,
- putting it in a shell variable avoids some potential security
- problems. This has the disadvantage that the script relies upon
- more features of the `sh' language, making it harder to follow for
- those who aren't familiar with `sh'.
+ `functions'
+ List all function definitions including source file names and
+ line numbers.
- Also, this program illustrates that it is often worthwhile to combine
-`sh' and `awk' programming together. You can usually accomplish quite
-a lot, without having to resort to low-level programming in C or C++,
-and it is frequently easier to do certain kinds of string and argument
-manipulation using the shell than it is in `awk'.
+ `locals'
+ Local variables of the selected frame.
- Finally, `igawk' shows that it is not always necessary to add new
-features to a program; they can often be layered on top.
+ `source'
+ The name of the current source file. Each time the program
+ stops, the current source file is the file containing the
+ current instruction. When the debugger first starts, the
+ current source file is the first file included via the `-f'
+ option. The `list FILENAME:LINENO' command can be used at any
+ time to change the current source.
- As an additional example of this, consider the idea of having two
-files in a directory in the search path:
+ `sources'
+ List all program sources.
-`default.awk'
- This file contains a set of default library functions, such as
- `getopt()' and `assert()'.
+ `variables'
+ List all global variables.
-`site.awk'
- This file contains library functions that are specific to a site or
- installation; i.e., locally developed functions. Having a
- separate file allows `default.awk' to change with new `gawk'
- releases, without requiring the system administrator to update it
- each time by adding the local functions.
+ `watch'
+ List all items in the watch list.
- One user suggested that `gawk' be modified to automatically read
-these files upon startup. Instead, it would be very simple to modify
-`igawk' to do this. Since `igawk' can process nested `@include'
-directives, `default.awk' could simply contain `@include' statements
-for the desired library functions.
+ Additional commands give you control over the debugger, the ability
+to save the debugger's state, and the ability to run debugger commands
+from a file. The commands are:
- ---------- Footnotes ----------
+`option' [NAME[`='VALUE]]
+`o' [NAME[`='VALUE]]
+ Without an argument, display the available debugger options and
+ their current values. `option NAME' shows the current value of the
+ named option. `option NAME=VALUE' assigns a new value to the named
+ option. The available options are:
- (1) Fully explaining the `sh' language is beyond the scope of this
-book. We provide some minimal explanations, but see a good shell
-programming book if you wish to understand things in more depth.
+ `history_size'
+ The maximum number of lines to keep in the history file
+ `./.gawk_history'. The default is 100.
- (2) On some very old versions of `awk', the test `getline junk < t'
-can loop forever if the file exists but is empty. Caveat emptor.
+ `listsize'
+ The number of lines that `list' prints. The default is 15.
-
-File: gawk.info, Node: Anagram Program, Next: Signature Program, Prev: Igawk Program, Up: Miscellaneous Programs
+ `outfile'
+ Send `gawk' output to a file; debugger output still goes to
+ standard output. An empty string (`""') resets output to
+ standard output.
-14.3.10 Finding Anagrams From A Dictionary
-------------------------------------------
+ `prompt'
+ The debugger prompt. The default is `gawk> '.
-An interesting programming challenge is to search for "anagrams" in a
-word list (such as `/usr/share/dict/words' on many GNU/Linux systems).
-One word is an anagram of another if both words contain the same letters
-(for example, "babbling" and "blabbing").
+ `save_history [on | off]'
+ Save command history to file `./.gawk_history'. The default
+ is `on'.
- An elegant algorithm is presented in Column 2, Problem C of Jon
-Bentley's `Programming Pearls', second edition. The idea is to give
-words that are anagrams a common signature, sort all the words together
-by their signature, and then print them. Dr. Bentley observes that
-taking the letters in each word and sorting them produces that common
-signature.
+ `save_options [on | off]'
+ Save current options to file `./.gawkrc' upon exit. The
+ default is `on'. Options are read back in to the next
+ session upon startup.
- The following program uses arrays of arrays to bring together words
-with the same signature and array sorting to print the words in sorted
-order.
+ `trace [on | off]'
+ Turn instruction tracing on or off. The default is `off'.
- # anagram.awk --- An implementation of the anagram finding algorithm
- # from Jon Bentley's "Programming Pearls", 2nd edition.
- # Addison Wesley, 2000, ISBN 0-201-65788-0.
- # Column 2, Problem C, section 2.8, pp 18-20.
+`save' FILENAME
+ Save the commands from the current session to the given file name,
+ so that they can be replayed using the `source' command.
- /'s$/ { next } # Skip possessives
+`source' FILENAME
+ Run command(s) from a file; an error in any command does not
+ terminate execution of subsequent commands. Comments (lines
+ starting with `#') are allowed in a command file. Empty lines are
+ ignored; they do _not_ repeat the last command. You can't restart
+ the program by having more than one `run' command in the file.
+ Also, the list of commands may include additional `source'
+ commands; however, the `gawk' debugger will not source the same
+ file more than once in order to avoid infinite recursion.
- The program starts with a header, and then a rule to skip
-possessives in the dictionary file. The next rule builds up the data
-structure. The first dimension of the array is indexed by the
-signature; the second dimension is the word itself:
+ In addition to, or instead of the `source' command, you can use
+ the `-D FILE' or `--debug=FILE' command-line options to execute
+ commands from a file non-interactively (*note Options::).
- {
- key = word2key($1) # Build signature
- data[key][$1] = $1 # Store word with signature
- }
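A command file for the debugger's `source' command or the `--debug=FILE' option might be prepared as follows. This is only a sketch: the file name `uniq.cmds' and its location are hypothetical, the commands `break', `run', and `bt' are the ones described in this chapter, and the `gawk' invocation in the final comment is not run here.

```shell
# Hypothetical location for the command file.
cmdfile="${TMPDIR:-/tmp}/uniq.cmds"

# A quoted marker ('EOF') stops the shell from expanding the text.
cat > "$cmdfile" << 'EOF'
# Comment lines and empty lines are allowed in a command file.

break are_equal
run
bt
EOF

# Count the debugger commands, skipping comments and empty lines.
ncmds=$(grep -c -v -e '^#' -e '^$' "$cmdfile")
echo "$ncmds commands written to $cmdfile"

# Possible use (not run here):
#   gawk --debug="$cmdfile" -f getopt.awk -f join.awk -f uniq.awk inputfile
```

Keeping the commands in a file makes a debugging session repeatable, which helps when the same breakpoints are needed after each edit of the program.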
+
+File: gawk.info, Node: Miscellaneous Debugger Commands, Prev: Debugger Info, Up: List of Debugger Commands
- The `word2key()' function creates the signature. It splits the word
-apart into individual letters, sorts the letters, and then joins them
-back together:
+14.3.6 Miscellaneous Commands
+-----------------------------
- # word2key --- split word apart into letters, sort, joining back together
+There are a few more commands which do not fit into the previous
+categories, as follows:
- function word2key(word, a, i, n, result)
- {
- n = split(word, a, "")
- asort(a)
+`dump' [FILENAME]
+ Dump bytecode of the program to standard output or to the file
+ named in FILENAME. This prints a representation of the internal
+ instructions which `gawk' executes to implement the `awk' commands
+ in a program. This can be very enlightening, as the following
+ partial dump of Davide Brini's obfuscated code (*note Signature
+ Program::) demonstrates:
- for (i = 1; i <= n; i++)
- result = result a[i]
+ gawk> dump
+ -| # BEGIN
+ -|
+ -| [ 2:0x89faef4] Op_rule : [in_rule = BEGIN] [source_file = brini.awk]
+ -| [ 3:0x89fa428] Op_push_i : "~" [PERM|STRING|STRCUR]
+ -| [ 3:0x89fa464] Op_push_i : "~" [PERM|STRING|STRCUR]
+ -| [ 3:0x89fa450] Op_match :
+ -| [ 3:0x89fa3ec] Op_store_var : O [do_reference = FALSE]
+ -| [ 4:0x89fa48c] Op_push_i : "==" [PERM|STRING|STRCUR]
+ -| [ 4:0x89fa4c8] Op_push_i : "==" [PERM|STRING|STRCUR]
+ -| [ 4:0x89fa4b4] Op_equal :
+ -| [ 4:0x89fa400] Op_store_var : o [do_reference = FALSE]
+ -| [ 5:0x89fa4f0] Op_push : o
+ -| [ 5:0x89fa4dc] Op_plus_i : 0 [PERM|NUMCUR|NUMBER]
+ -| [ 5:0x89fa414] Op_push_lhs : o [do_reference = TRUE]
+ -| [ 5:0x89fa4a0] Op_assign_plus :
+ -| [ :0x89fa478] Op_pop :
+ -| [ 6:0x89fa540] Op_push : O
+ -| [ 6:0x89fa554] Op_push_i : "" [PERM|STRING|STRCUR]
+ -| [ :0x89fa5a4] Op_no_op :
+ -| [ 6:0x89fa590] Op_push : O
+ -| [ :0x89fa5b8] Op_concat : [expr_count = 3] [concat_flag = 0]
+ -| [ 6:0x89fa518] Op_store_var : x [do_reference = FALSE]
+ -| [ 7:0x89fa504] Op_push_loop : [target_continue = 0x89fa568] [target_break = 0x89fa680]
+ -| [ 7:0x89fa568] Op_push_lhs : X [do_reference = TRUE]
+ -| [ 7:0x89fa52c] Op_postincrement :
+ -| [ 7:0x89fa5e0] Op_push : x
+ -| [ 7:0x89fa61c] Op_push : o
+ -| [ 7:0x89fa5f4] Op_plus :
+ -| [ 7:0x89fa644] Op_push : o
+ -| [ 7:0x89fa630] Op_plus :
+ -| [ 7:0x89fa5cc] Op_leq :
+ -| [ :0x89fa57c] Op_jmp_false : [target_jmp = 0x89fa680]
+ -| [ 7:0x89fa694] Op_push_i : "%c" [PERM|STRING|STRCUR]
+ -| [ :0x89fa6d0] Op_no_op :
+ -| [ 7:0x89fa608] Op_assign_concat : c
+ -| [ :0x89fa6a8] Op_jmp : [target_jmp = 0x89fa568]
+ -| [ :0x89fa680] Op_pop_loop :
+ -|
+ ...
+ -|
+ -| [ 8:0x89fa658] Op_K_printf : [expr_count = 17] [redir_type = ""]
+ -| [ :0x89fa374] Op_no_op :
+ -| [ :0x89fa3d8] Op_atexit :
+ -| [ :0x89fa6bc] Op_stop :
+ -| [ :0x89fa39c] Op_no_op :
+ -| [ :0x89fa3b0] Op_after_beginfile :
+ -| [ :0x89fa388] Op_no_op :
+ -| [ :0x89fa3c4] Op_after_endfile :
+ gawk>
- return result
- }
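The signature idea used by `word2key()' can also be tried outside `awk'. The following sketch assumes a POSIX shell with the `fold', `sort', and `tr' utilities and lowercase ASCII words; `signature' is a hypothetical name, not part of the original program.

```shell
# signature: split a word into letters, sort them, join them back.
# A hypothetical shell counterpart to word2key().
signature() {
    printf '%s\n' "$1" | fold -w 1 | sort | tr -d '\n'
}

sig1=$(signature babbling)
sig2=$(signature blabbing)

# Anagrams share a signature.
[ "$sig1" = "$sig2" ] && echo "babbling and blabbing are anagrams"
```

Both words reduce to the same sorted letter string, which is what lets the `awk' program group them under one array index.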
+`help'
+`h'
+ Print a list of all of the `gawk' debugger commands with a short
+ summary of their usage. `help COMMAND' prints the information
+ about the command COMMAND.
- Finally, the `END' rule traverses the array and prints out the
-anagram lists. It sends the output to the system `sort' command, since
-otherwise the anagrams would appear in arbitrary order:
+`list' [`-' | `+' | N | FILENAME`:'N | N-M | FUNCTION]
+`l' [`-' | `+' | N | FILENAME`:'N | N-M | FUNCTION]
+ Print the specified lines (default 15) from the current source file
+ or the file named FILENAME. The possible arguments to `list' are
+ as follows:
- END {
- sort = "sort"
- for (key in data) {
- # Sort words with same key
- nwords = asorti(data[key], words)
- if (nwords == 1)
- continue
+ `-'
+ Print lines before the lines last printed.
- # And print. Minor glitch: trailing space at end of each line
- for (j = 1; j <= nwords; j++)
- printf("%s ", words[j]) | sort
- print "" | sort
- }
- close(sort)
- }
+ `+'
+ Print lines after the lines last printed. `list' without any
+ argument does the same thing.
- Here is some partial output when the program is run:
+ N
+ Print lines centered around line number N.
- $ gawk -f anagram.awk /usr/share/dict/words | grep '^b'
- ...
- babbled blabbed
- babbler blabber brabble
- babblers blabbers brabbles
- babbling blabbing
- babbly blabby
- babel bable
- babels beslab
- babery yabber
- ...
+ N-M
+ Print lines from N to M.
-
-File: gawk.info, Node: Signature Program, Prev: Anagram Program, Up: Miscellaneous Programs
+ FILENAME`:'N
+ Print lines centered around line number N in source file
+ FILENAME. This command may change the current source file.
-14.3.11 And Now For Something Completely Different
---------------------------------------------------
+ FUNCTION
+ Print lines centered around the beginning of the function
+ FUNCTION. This command may change the current source file.
-The following program was written by Davide Brini and is published on
-his website (http://backreference.org/2011/02/03/obfuscated-awk/). It
-serves as his signature in the Usenet group `comp.lang.awk'. He
-supplies the following copyright terms:
+`quit'
+`q'
+ Exit the debugger. Debugging is great fun, but sometimes we all
+ have to tend to other obligations in life, and sometimes we find
+ the bug, and are free to go on to the next one! As we saw above,
+ if you are running a program, the debugger warns you if you
+ accidentally type `q' or `quit', to make sure you really want to
+ quit.
- Copyright (C) 2008 Davide Brini
+`trace' `on' | `off'
+ Turn on or off a continuous printing of instructions which are
+ about to be executed, along with printing the `awk' line which they
+ implement. The default is `off'.
- Copying and distribution of the code published in this page, with
- or without modification, are permitted in any medium without
- royalty provided the copyright notice and this notice are
- preserved.
+ It is to be hoped that most of the "opcodes" in these instructions
+ are fairly self-explanatory, and using `stepi' and `nexti' while
+ `trace' is on will make them into familiar friends.
- Here is the program:
- awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c";
- printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O,
- X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O,
- O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),x-O}'
+
+File: gawk.info, Node: Readline Support, Next: Limitations, Prev: List of Debugger Commands, Up: Debugger
- We leave it to you to determine what the program does.
+14.4 Readline Support
+=====================
-
-File: gawk.info, Node: Debugger, Next: Dynamic Extensions, Prev: Sample Programs, Up: Top
+If `gawk' is compiled with the `readline' library, you can take
+advantage of that library's command completion and history expansion
+features. The following types of completion are available:
-15 Debugging `awk' Programs
-***************************
+Command completion
+ Command names.
-It would be nice if computer programs worked perfectly the first time
-they were run, but in real life, this rarely happens for programs of
-any complexity. Thus, most programming languages have facilities
-available for "debugging" programs, and now `awk' is no exception.
+Source file name completion
+ Source file names. Relevant commands are `break', `clear', `list',
+ `tbreak', and `until'.
- The `gawk' debugger is purposely modeled after the GNU Debugger
-(GDB) (http://www.gnu.org/software/gdb/) command-line debugger. If you
-are familiar with GDB, learning how to use `gawk' for debugging your
-program is easy.
+Argument completion
+ Non-numeric arguments to a command. Relevant commands are
+ `enable' and `info'.
-* Menu:
+Variable name completion
+ Global variable names, and function arguments in the current
+ context if the program is running. Relevant commands are `display',
+ `print', `set', and `watch'.
-* Debugging:: Introduction to `gawk' debugger.
-* Sample Debugging Session:: Sample debugging session.
-* List of Debugger Commands:: Main debugger commands.
-* Readline Support:: Readline support.
-* Limitations:: Limitations and future plans.
-File: gawk.info, Node: Debugging, Next: Sample Debugging Session, Up: Debugger
+File: gawk.info, Node: Limitations, Prev: Readline Support, Up: Debugger
-15.1 Introduction to `gawk' Debugger
-====================================
+14.5 Limitations and Future Plans
+=================================
-This minor node introduces debugging in general and begins the
-discussion of debugging in `gawk'.
+We hope you find the `gawk' debugger useful and enjoyable to work with,
+but as with any program, especially in its early releases, it still has
+some limitations. A few which are worth being aware of are:
-* Menu:
+ * At this point, the debugger does not give a detailed explanation of
+ what you did wrong when you type in something it doesn't like.
+ Rather, it just responds `syntax error'. When you do figure out
+ what your mistake was, though, you'll feel like a real guru.
-* Debugging Concepts:: Debugging in General.
-* Debugging Terms:: Additional Debugging Concepts.
-* Awk Debugging:: Awk Debugging.
+ * If you perused the dump of opcodes in *note Miscellaneous Debugger
+ Commands::, (or if you are already familiar with `gawk' internals),
+ you will realize that much of the internal manipulation of data in
+ `gawk', as in many interpreters, is done on a stack. `Op_push',
+ `Op_pop', etc., are the "bread and butter" of most `gawk' code.
+ Unfortunately, as of now, the `gawk' debugger does not allow you
+ to examine the stack's contents.
-
-File: gawk.info, Node: Debugging Concepts, Next: Debugging Terms, Up: Debugging
+ That is, the intermediate results of expression evaluation are on
+ the stack, but cannot be printed. Rather, only variables which
+ are defined in the program can be printed. Of course, a
+ workaround for this is to use more explicit variables at the
+ debugging stage and then change back to obscure, perhaps more
+ optimal code later.
-15.1.1 Debugging in General
----------------------------
+ * There is no way to look "inside" the process of compiling regular
+ expressions to see if you got it right. As an `awk' programmer,
+ you are expected to know what `/[^[:alnum:][:blank:]]/' means.
-(If you have used debuggers in other languages, you may want to skip
-ahead to the next section on the specific features of the `awk'
-debugger.)
+ * The `gawk' debugger is designed to be used by running a program
+ (with all its parameters) on the command line, as described in
+ *note Debugger Invocation::. There is no way (as of now) to
+ attach or "break in" to a running program. This seems reasonable
+ for a language which is used mainly for quickly executing, short
+ programs.
- Of course, a debugging program cannot remove bugs for you, since it
-has no way of knowing what you or your users consider a "bug" and what
-is a "feature." (Sometimes, we humans have a hard time with this
-ourselves.) In that case, what can you expect from such a tool? The
-answer to that depends on the language being debugged, but in general,
-you can expect at least the following:
+ * The `gawk' debugger only accepts source supplied with the `-f'
+ option.
- * The ability to watch a program execute its instructions one by one,
- giving you, the programmer, the opportunity to think about what is
- happening on a time scale of seconds, minutes, or hours, rather
- than the nanosecond time scale at which the code usually runs.
+ Look forward to a future release when these and other missing
+features may be added, and of course feel free to try to add them
+yourself!
- * The opportunity to not only passively observe the operation of your
- program, but to control it and try different paths of execution,
- without having to change your source files.
+
+File: gawk.info, Node: Dynamic Extensions, Next: Language History, Prev: Debugger, Up: Top
- * The chance to see the values of data in the program at any point in
- execution, and also to change that data on the fly, to see how that
- affects what happens afterwards. (This often includes the ability
- to look at internal data structures besides the variables you
- actually defined in your code.)
+15 Writing Extensions for `gawk'
+********************************
- * The ability to obtain additional information about your program's
- state or even its internal structure.
+This chapter is a placeholder, pending a rewrite for the new API. Some
+of the old bits remain, since they can be partially reused.
- All of these tools provide a great amount of help in using your own
-skills and understanding of the goals of your program to find where it
-is going wrong (or, for that matter, to better comprehend a perfectly
-functional program that you or someone else wrote).
+ It is possible to add new built-in functions to `gawk' using
+dynamically loaded libraries. This facility is available on systems
+(such as GNU/Linux) that support the C `dlopen()' and `dlsym()'
+functions. This major node describes how to write and use dynamically
+loaded extensions for `gawk'. Experience with programming in C or C++
+is necessary when reading this major node.
-
-File: gawk.info, Node: Debugging Terms, Next: Awk Debugging, Prev: Debugging Concepts, Up: Debugging
+ NOTE: When `--sandbox' is specified, extensions are disabled
+ (*note Options::).
-15.1.2 Additional Debugging Concepts
-------------------------------------
+* Menu:
-Before diving in to the details, we need to introduce several important
-concepts that apply to just about all debuggers. The following list
-defines terms used throughout the rest of this major node.
+* Plugin License:: A note about licensing.
+* Sample Library:: An example of new functions.
-"Stack Frame"
- Programs generally call functions during the course of their
- execution. One function can call another, or a function can call
- itself (recursion). You can view the chain of called functions
- (main program calls A, which calls B, which calls C), as a stack
- of executing functions: the currently running function is the
- topmost one on the stack, and when it finishes (returns), the next
- one down then becomes the active function. Such a stack is termed
- a "call stack".
+
+File: gawk.info, Node: Plugin License, Next: Sample Library, Up: Dynamic Extensions
- For each function on the call stack, the system maintains a data
- area that contains the function's parameters, local variables, and
- return value, as well as any other "bookkeeping" information
- needed to manage the call stack. This data area is termed a
- "stack frame".
+15.1 Extension Licensing
+========================
- `gawk' also follows this model, and gives you access to the call
- stack and to each stack frame. You can see the call stack, as well
- as from where each function on the stack was invoked. Commands
- that print the call stack print information about each stack frame
- (as detailed later on).
+Every dynamic extension should define the global symbol
+`plugin_is_GPL_compatible' to assert that it has been licensed under a
+GPL-compatible license. If this symbol does not exist, `gawk' will
+emit a fatal error and exit.
-"Breakpoint"
- During debugging, you often wish to let the program run until it
- reaches a certain point, and then continue execution from there one
- statement (or instruction) at a time. The way to do this is to set
- a "breakpoint" within the program. A breakpoint is where the
- execution of the program should break off (stop), so that you can
- take over control of the program's execution. You can add and
- remove as many breakpoints as you like.
+ The declared type of the symbol should be `int'. It does not need
+to be in any allocated section, though. The code merely asserts that
+the symbol exists in the global scope. Something like this is enough:
-"Watchpoint"
- A watchpoint is similar to a breakpoint. The difference is that
- breakpoints are oriented around the code: stop when a certain
- point in the code is reached. A watchpoint, however, specifies
- that program execution should stop when a _data value_ is changed.
- This is useful, since sometimes it happens that a variable
- receives an erroneous value, and it's hard to track down where
- this happens just by looking at the code. By using a watchpoint,
- you can stop whenever a variable is assigned to, and usually find
- the errant code quite quickly.
+ int plugin_is_GPL_compatible;
-File: gawk.info, Node: Awk Debugging, Prev: Debugging Terms, Up: Debugging
-
-15.1.3 Awk Debugging
---------------------
+File: gawk.info, Node: Sample Library, Prev: Plugin License, Up: Dynamic Extensions
-Debugging an `awk' program has some specific aspects that are not
-shared with other programming languages.
+15.2 Example: Directory and File Operation Built-ins
+====================================================
- First of all, the fact that `awk' programs usually take input
-line-by-line from a file or files and operate on those lines using
-specific rules makes it especially useful to organize viewing the
-execution of the program in terms of these rules. As we will see, each
-`awk' rule is treated almost like a function call, with its own
-specific block of instructions.
+Two useful functions that are not in `awk' are `chdir()' (so that an
+`awk' program can change its directory) and `stat()' (so that an `awk'
+program can gather information about a file). This minor node
+implements these functions for `gawk' in an external extension library.
- In addition, since `awk' is by design a very concise language, it is
-easy to lose sight of everything that is going on "inside" each line of
-`awk' code. The debugger provides the opportunity to look at the
-individual primitive instructions carried out by the higher-level `awk'
-commands.
+* Menu:
+
+* Internal File Description:: What the new functions will do.
+* Internal File Ops:: The code for internal file operations.
+* Using Internal File Ops:: How to use an external extension.
-File: gawk.info, Node: Sample Debugging Session, Next: List of Debugger Commands, Prev: Debugging, Up: Debugger
+File: gawk.info, Node: Internal File Description, Next: Internal File Ops, Up: Sample Library
-15.2 Sample Debugging Session
-=============================
+15.2.1 Using `chdir()' and `stat()'
+-----------------------------------
-In order to illustrate the use of `gawk' as a debugger, let's look at a
-sample debugging session. We will use the `awk' implementation of the
-POSIX `uniq' command described earlier (*note Uniq Program::) as our
-example.
+This minor node shows how to use the new functions at the `awk' level
+once they've been integrated into the running `gawk' interpreter.
+Using `chdir()' is very straightforward. It takes one argument, the new
+directory to change to:
-* Menu:
+ ...
+ newdir = "/home/arnold/funstuff"
+ ret = chdir(newdir)
+ if (ret < 0) {
+ printf("could not change to %s: %s\n",
+ newdir, ERRNO) > "/dev/stderr"
+ exit 1
+ }
+ ...
-* Debugger Invocation:: How to Start the Debugger.
-* Finding The Bug:: Finding the Bug.
+ The return value is negative if the `chdir' failed, and `ERRNO'
+(*note Built-in Variables::) is set to a string indicating the error.
-
-File: gawk.info, Node: Debugger Invocation, Next: Finding The Bug, Up: Sample Debugging Session
+ Using `stat()' is a bit more complicated. The C `stat()' function
+fills in a structure that has a fair amount of information. The right
+way to model this in `awk' is to fill in an associative array with the
+appropriate information:
-15.2.1 How to Start the Debugger
---------------------------------
+ file = "/home/arnold/.profile"
+ fdata[1] = "x" # force `fdata' to be an array
+ ret = stat(file, fdata)
+ if (ret < 0) {
+ printf("could not stat %s: %s\n",
+ file, ERRNO) > "/dev/stderr"
+ exit 1
+ }
+ printf("size of %s is %d bytes\n", file, fdata["size"])
-Starting the debugger is almost exactly like running `awk', except you
-have to pass an additional option `--debug' or the corresponding short
-option `-D'. The file(s) containing the program and any supporting
-code are given on the command line as arguments to one or more `-f'
-options. (`gawk' is not designed to debug command-line programs, only
-programs contained in files.) In our case, we invoke the debugger like
-this:
+ The `stat()' function always clears the data array, even if the
+`stat()' fails. It fills in the following elements:
- $ gawk -D -f getopt.awk -f join.awk -f uniq.awk inputfile
+`"name"'
+ The name of the file that was `stat()''ed.
-where both `getopt.awk' and `uniq.awk' are in `$AWKPATH'. (Experienced
-users of GDB or similar debuggers should note that this syntax is
-slightly different from what they are used to. With `gawk' debugger,
-the arguments for running the program are given in the command line to
-the debugger rather than as part of the `run' command at the debugger
-prompt.)
+`"dev"'
+`"ino"'
+ The file's device and inode numbers, respectively.
- Instead of immediately running the program on `inputfile', as `gawk'
-would ordinarily do, the debugger merely loads all the program source
-files, compiles them internally, and then gives us a prompt:
+`"mode"'
+ The file's mode, as a numeric value. This includes both the file's
+ type and its permissions.
- gawk>
+`"nlink"'
+ The number of hard links (directory entries) the file has.
-from which we can issue commands to the debugger. At this point, no
-code has been executed.
+`"uid"'
+`"gid"'
+ The numeric user and group ID numbers of the file's owner.
-
-File: gawk.info, Node: Finding The Bug, Prev: Debugger Invocation, Up: Sample Debugging Session
+`"size"'
+ The size in bytes of the file.
-15.2.2 Finding the Bug
-----------------------
+`"blocks"'
+ The number of disk blocks the file actually occupies. This may not
+ be a function of the file's size if the file has holes.
-Let's say that we are having a problem using (a faulty version of)
-`uniq.awk' in the "field-skipping" mode, and it doesn't seem to be
-catching lines which should be identical when skipping the first field,
-such as:
+`"atime"'
+`"mtime"'
+`"ctime"'
+ The file's last access, modification, and inode update times,
+ respectively. These are numeric timestamps, suitable for
+ formatting with `strftime()' (*note Built-in::).
- awk is a wonderful program!
- gawk is a wonderful program!
+`"pmode"'
+ The file's "printable mode." This is a string representation of
+ the file's type and permissions, such as what is produced by `ls
+ -l'--for example, `"drwxr-xr-x"'.
- This could happen if we were thinking (C-like) of the fields in a
-record as being numbered in a zero-based fashion, so instead of the
-lines:
+`"type"'
+ A printable string representation of the file's type. The value
+ is one of the following:
- clast = join(alast, fcount+1, n)
- cline = join(aline, fcount+1, m)
+ `"blockdev"'
+ `"chardev"'
+ The file is a block or character device ("special file").
-we wrote:
+ `"directory"'
+ The file is a directory.
- clast = join(alast, fcount, n)
- cline = join(aline, fcount, m)
+ `"fifo"'
+ The file is a named-pipe (also known as a FIFO).
- The first thing we usually want to do when trying to investigate a
-problem like this is to put a breakpoint in the program so that we can
-watch it at work and catch what it is doing wrong. A reasonable spot
-for a breakpoint in `uniq.awk' is at the beginning of the function
-`are_equal()', which compares the current line with the previous one.
-To set the breakpoint, use the `b' (breakpoint) command:
+ `"file"'
+ The file is just a regular file.
- gawk> b are_equal
- -| Breakpoint 1 set at file `awklib/eg/prog/uniq.awk', line 64
+ `"socket"'
+ The file is an `AF_UNIX' ("Unix domain") socket in the
+ filesystem.
- The debugger tells us the file and line number where the breakpoint
-is. Now type `r' or `run' and the program runs until it hits the
-breakpoint for the first time:
+ `"symlink"'
+ The file is a symbolic link.
- gawk> r
- -| Starting program:
- -| Stopping in Rule ...
- -| Breakpoint 1, are_equal(n, m, clast, cline, alast, aline)
- at `awklib/eg/prog/uniq.awk':64
- -| 64 if (fcount == 0 && charcount == 0)
- gawk>
+ Several additional elements may be present depending upon the
+operating system and the type of the file. You can test for them in
+your `awk' program by using the `in' operator (*note Reference to
+Elements::):
- Now we can look at what's going on inside our program. First of all,
-let's see how we got to where we are. At the prompt, we type `bt'
-(short for "backtrace"), and the debugger responds with a listing of
-the current stack frames:
+`"blksize"'
+ The preferred block size for I/O to the file. This field is not
+ present on all POSIX-like systems in the C `stat' structure.
- gawk> bt
- -| #0 are_equal(n, m, clast, cline, alast, aline)
- at `awklib/eg/prog/uniq.awk':69
- -| #1 in main() at `awklib/eg/prog/uniq.awk':89
+`"linkval"'
+ If the file is a symbolic link, this element is the name of the
+ file the link points to (i.e., the value of the link).
- This tells us that `are_equal()' was called by the main program at
-line 89 of `uniq.awk'. (This is not a big surprise, since this is the
-only call to `are_equal()' in the program, but in more complex
-programs, knowing who called a function and with what parameters can be
-the key to finding the source of the problem.)
+`"rdev"'
+`"major"'
+`"minor"'
+ If the file is a block or character device file, then these values
+ represent the numeric device number and the major and minor
+ components of that number, respectively.
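As a rough sketch of the `in' test described above, the following fills an array by hand in place of the one that `stat()' would populate, since the extension may not be loaded. The array contents and the file name are invented for illustration:

```shell
# Sketch only: the data array is filled by hand to stand in for the
# array that stat() would populate; "target.txt" is a made-up value.
in_test=$(awk 'BEGIN {
    data["type"] = "symlink"
    data["linkval"] = "target.txt"
    # check optional elements with the in operator before using them
    if ("linkval" in data)
        print "link points to " data["linkval"]
    if (! ("rdev" in data))
        print "no rdev element"
}')
printf '%s\n' "$in_test"
```

Testing with `in' does not create the element, which is why it is preferable to referencing `data["rdev"]' directly.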
- Now that we're in `are_equal()', we can start looking at the values
-of some variables. Let's say we type `p n' (`p' is short for "print").
-We would expect to see the value of `n', a parameter to `are_equal()'.
-Actually, the debugger gives us:
+
+File: gawk.info, Node: Internal File Ops, Next: Using Internal File Ops, Prev: Internal File Description, Up: Sample Library
- gawk> p n
- -| n = untyped variable
+15.2.2 C Code for `chdir()' and `stat()'
+----------------------------------------
-In this case, `n' is an uninitialized local variable, since the
-function was called without arguments (*note Function Calls::).
+Here is the C code for these extensions. They were written for
+GNU/Linux. The code needs some more work for complete portability to
+other POSIX-compliant systems:(1)
- A more useful variable to display might be the current record:
+ #include "awk.h"
- gawk> p $0
- -| $0 = string ("gawk is a wonderful program!")
+ #include <sys/sysmacros.h>
-This might be a bit puzzling at first since this is the second line of
-our test input above. Let's look at `NR':
+ int plugin_is_GPL_compatible;
- gawk> p NR
- -| NR = number (2)
+ /* do_chdir --- provide dynamically loaded chdir() builtin for gawk */
-So we can see that `are_equal()' was only called for the second record
-of the file. Of course, this is because our program contained a rule
-for `NR == 1':
+ static NODE *
+ do_chdir(int nargs)
+ {
+ NODE *newdir;
+ int ret = -1;
- NR == 1 {
- last = $0
- next
- }
+ if (do_lint && nargs != 1)
+ lintwarn("chdir: called with incorrect number of arguments");
- OK, let's just check that that rule worked correctly:
+ newdir = get_scalar_argument(0, FALSE);
- gawk> p last
- -| last = string ("awk is a wonderful program!")
+ The file includes the `"awk.h"' header file for definitions for the
+`gawk' internals. It includes `<sys/sysmacros.h>' for access to the
+`major()' and `minor()' macros.
- Everything we have done so far has verified that the program has
-worked as planned, up to and including the call to `are_equal()', so
-the problem must be inside this function. To investigate further, we
-must begin "stepping through" the lines of `are_equal()'. We start by
-typing `n' (for "next"):
+ By convention, for an `awk' function `foo', the function that
+implements it is called `do_foo'. The function should take an `int'
+argument, usually called `nargs', that represents the number of defined
+arguments for the function. The `newdir' variable represents the new
+directory to change to, retrieved with `get_scalar_argument()'. Note
+that the first argument is numbered zero.
- gawk> n
- -| 67 if (fcount > 0) {
+ This code actually accomplishes the `chdir()'. It first forces the
+argument to be a string and passes the string value to the `chdir()'
+system call. If the `chdir()' fails, `ERRNO' is updated.
- This tells us that `gawk' is now ready to execute line 67, which
-decides whether to give the lines the special "field skipping" treatment
-indicated by the `-f' command-line option. (Notice that we skipped
-from where we were before at line 64 to here, since the condition in
-line 64
+ (void) force_string(newdir);
+ ret = chdir(newdir->stptr);
+ if (ret < 0)
+ update_ERRNO_int(errno);
- if (fcount == 0 && charcount == 0)
+ Finally, the function returns the return value to the `awk' level:
-was false.)
+ return make_number((AWKNUM) ret);
+ }
- Continuing to step, we now get to the splitting of the current and
-last records:
+ The `stat()' built-in is more involved. First comes a function that
+turns a numeric mode into a printable representation (e.g., octal 644
+becomes `-rw-r--r--'). This is omitted here for brevity:
- gawk> n
- -| 68 n = split(last, alast)
- gawk> n
- -| 69 m = split($0, aline)
+ /* format_mode --- turn a stat mode field into something readable */
- At this point, we should be curious to see what our records were
-split into, so we try to look:
+ static char *
+ format_mode(unsigned long fmode)
+ {
+ ...
+ }
- gawk> p n m alast aline
- -| n = number (5)
- -| m = number (5)
- -| alast = array, 5 elements
- -| aline = array, 5 elements
+ Next comes the `do_stat()' function. It starts with variable
+declarations and argument checking:
-(The `p' command can take more than one argument, similar to `awk''s
-`print' statement.)
+ /* do_stat --- provide a stat() function for gawk */
- This is kind of disappointing, though. All we found out is that
-there are five elements in each of our arrays. Useful enough (we now
-know that none of the words were accidentally left out), but what if we
-want to see inside the array?
+ static NODE *
+ do_stat(int nargs)
+ {
+ NODE *file, *array, *tmp;
+ struct stat sbuf;
+ int ret;
+ NODE **aptr;
+ char *pmode; /* printable mode */
+ char *type = "unknown";
- The first choice would be to use subscripts:
+ if (do_lint && nargs > 2)
+ lintwarn("stat: called with too many arguments");
- gawk> p alast[0]
- -| "0" not in array `alast'
+ Then comes the actual work. First, the function gets the arguments.
+Then, it always clears the array. The code uses `lstat()' (instead of
+`stat()') to get the file information, in case the file is a symbolic
+link. If there's an error, it sets `ERRNO' and returns:
-Oops!
+ /* file is first arg, array to hold results is second */
+ file = get_scalar_argument(0, FALSE);
+ array = get_array_argument(1, FALSE);
- gawk> p alast[1]
- -| alast["1"] = string ("awk")
+ /* empty out the array */
+ assoc_clear(array);
- This would be kind of slow for a 100-member array, though, so `gawk'
-provides a shortcut (reminiscent of another language not to be
-mentioned):
+ /* lstat the file, if error, set ERRNO and return */
+ (void) force_string(file);
+ ret = lstat(file->stptr, & sbuf);
+ if (ret < 0) {
+ update_ERRNO_int(errno);
+ return make_number((AWKNUM) ret);
+ }
- gawk> p @alast
- -| alast["1"] = string ("awk")
- -| alast["2"] = string ("is")
- -| alast["3"] = string ("a")
- -| alast["4"] = string ("wonderful")
- -| alast["5"] = string ("program!")
+ Now comes the tedious part: filling in the array. Only a few of the
+calls are shown here, since they all follow the same pattern:
- It looks like we got this far OK. Let's take another step or two:
+ /* fill in the array */
+ aptr = assoc_lookup(array, tmp = make_string("name", 4));
+ *aptr = dupnode(file);
+ unref(tmp);
- gawk> n
- -| 70 clast = join(alast, fcount, n)
- gawk> n
- -| 71 cline = join(aline, fcount, m)
+ aptr = assoc_lookup(array, tmp = make_string("mode", 4));
+ *aptr = make_number((AWKNUM) sbuf.st_mode);
+ unref(tmp);
- Well, here we are at our error (sorry to spoil the suspense). What
-we had in mind was to join the fields starting from the second one to
-make the virtual record to compare, and if the first field was numbered
-zero, this would work. Let's look at what we've got:
+ aptr = assoc_lookup(array, tmp = make_string("pmode", 5));
+ pmode = format_mode(sbuf.st_mode);
+ *aptr = make_string(pmode, strlen(pmode));
+ unref(tmp);
- gawk> p cline clast
- -| cline = string ("gawk is a wonderful program!")
- -| clast = string ("awk is a wonderful program!")
+ When done, return the `lstat()' return value:
- Hey, those look pretty familiar! They're just our original,
-unaltered, input records. A little thinking (the human brain is still
-the best debugging tool), and we realize that we were off by one!
- We get out of the debugger:
+ return make_number((AWKNUM) ret);
+ }
- gawk> q
- -| The program is running. Exit anyway (y/n)? y
+ Finally, it's necessary to provide the "glue" that loads the new
+function(s) into `gawk'. By convention, each library has a routine
+named `dl_load()' that does the job. The simplest way is to use the
+`dl_load_func' macro in `gawkapi.h'.
-Then we get into an editor:
+ And that's it! As an exercise, consider adding functions to
+implement system calls such as `chown()', `chmod()', and `umask()'.
- clast = join(alast, fcount+1, n)
- cline = join(aline, fcount+1, m)
+ ---------- Footnotes ----------
-and problem solved!
+ (1) This version is edited slightly for presentation. See
+`extension/filefuncs.c' in the `gawk' distribution for the complete
+version.
-File: gawk.info, Node: List of Debugger Commands, Next: Readline Support, Prev: Sample Debugging Session, Up: Debugger
-
-15.3 Main Debugger Commands
-===========================
+File: gawk.info, Node: Using Internal File Ops, Prev: Internal File Ops, Up: Sample Library
-The `gawk' debugger command set can be divided into the following
-categories:
+15.2.3 Integrating the Extensions
+---------------------------------
- * Breakpoint control
+Now that the code is written, it must be possible to add it at runtime
+to the running `gawk' interpreter. First, the code must be compiled.
+Assuming that the functions are in a file named `filefuncs.c', and IDIR
+is the location of the `gawk' include files, the following steps create
+a GNU/Linux shared library:
- * Execution control
+ $ gcc -fPIC -shared -DHAVE_CONFIG_H -c -O -g -IIDIR filefuncs.c
+ $ ld -o filefuncs.so -shared filefuncs.o
- * Viewing and changing data
+ Once the library exists, it is loaded by calling the `extension()'
+built-in function. This function takes two arguments: the name of the
+library to load and the name of a function to call when the library is
+first loaded. This function adds the new functions to `gawk'. It
+returns the value returned by the initialization function within the
+shared library:
- * Working with the stack
+ # file testff.awk
+ BEGIN {
+ extension("./filefuncs.so", "dl_load")
- * Getting information
+ chdir(".") # no-op
- * Miscellaneous
+ data[1] = 1 # force `data' to be an array
+ print "Info for testff.awk"
+ ret = stat("testff.awk", data)
+ print "ret =", ret
+ for (i in data)
+ printf "data[\"%s\"] = %s\n", i, data[i]
+ print "testff.awk modified:",
+ strftime("%m %d %y %H:%M:%S", data["mtime"])
- Each of these is discussed in the following subsections. In the
-following descriptions, commands which may be abbreviated show the
-abbreviation on a second description line. A debugger command name may
-also be truncated if that partial name is unambiguous. The debugger has
-the built-in capability to automatically repeat the previous command
-when just hitting <Enter>. This works for the commands `list', `next',
-`nexti', `step', `stepi' and `continue' executed without any argument.
+ print "\nInfo for JUNK"
+ ret = stat("JUNK", data)
+ print "ret =", ret
+ for (i in data)
+ printf "data[\"%s\"] = %s\n", i, data[i]
+ print "JUNK modified:", strftime("%m %d %y %H:%M:%S", data["mtime"])
+ }
-* Menu:
+ Here are the results of running the program:
-* Breakpoint Control:: Control of Breakpoints.
-* Debugger Execution Control:: Control of Execution.
-* Viewing And Changing Data:: Viewing and Changing Data.
-* Execution Stack:: Dealing with the Stack.
-* Debugger Info:: Obtaining Information about the Program and
- the Debugger State.
-* Miscellaneous Debugger Commands:: Miscellaneous Commands.
+ $ gawk -f testff.awk
+ -| Info for testff.awk
+ -| ret = 0
+ -| data["size"] = 607
+ -| data["ino"] = 14945891
+ -| data["name"] = testff.awk
+ -| data["pmode"] = -rw-rw-r--
+ -| data["nlink"] = 1
+ -| data["atime"] = 1293993369
+ -| data["mtime"] = 1288520752
+ -| data["mode"] = 33204
+ -| data["blksize"] = 4096
+ -| data["dev"] = 2054
+ -| data["type"] = file
+ -| data["gid"] = 500
+ -| data["uid"] = 500
+ -| data["blocks"] = 8
+ -| data["ctime"] = 1290113572
+ -| testff.awk modified: 10 31 10 12:25:52
+ -|
+ -| Info for JUNK
+ -| ret = -1
+ -| JUNK modified: 01 01 70 02:00:00
-File: gawk.info, Node: Breakpoint Control, Next: Debugger Execution Control, Up: List of Debugger Commands
-
-15.3.1 Control of Breakpoints
------------------------------
+File: gawk.info, Node: Arbitrary Precision Arithmetic, Next: Advanced Features, Prev: Internationalization, Up: Top
-As we saw above, the first thing you probably want to do in a debugging
-session is to get your breakpoints set up, since otherwise your program
-will just run as if it was not under the debugger. The commands for
-controlling breakpoints are:
+16 Arithmetic and Arbitrary Precision Arithmetic with `gawk'
+************************************************************
-`break' [[FILENAME`:']N | FUNCTION] [`"EXPRESSION"']
-`b' [[FILENAME`:']N | FUNCTION] [`"EXPRESSION"']
- Without any argument, set a breakpoint at the next instruction to
- be executed in the selected stack frame. Arguments can be one of
- the following:
+ There's a credibility gap: We don't know how much of the
+ computer's answers to believe. Novice computer users solve this
+ problem by implicitly trusting in the computer as an infallible
+ authority; they tend to believe that all digits of a printed
+ answer are significant. Disillusioned computer users have just the
+ opposite approach; they are constantly afraid that their answers
+ are almost meaningless.
+ Donald Knuth(1)
- N
- Set a breakpoint at line number N in the current source file.
+ This major node discusses issues that you may encounter when
+performing arithmetic. It begins by discussing some of the general
+attributes of computer arithmetic, along with how this can influence
+what you see when running `awk' programs. This discussion applies to
+all versions of `awk'.
- FILENAME`:'N
- Set a breakpoint at line number N in source file FILENAME.
+ Then the discussion moves on to "arbitrary precision arithmetic", a
+feature which is specific to `gawk'.
- FUNCTION
- Set a breakpoint at entry to (the first instruction of)
- function FUNCTION.
+* Menu:
- Each breakpoint is assigned a number which can be used to delete
- it from the breakpoint list using the `delete' command.
+* General Arithmetic:: An introduction to computer arithmetic.
+* Floating-point Programming:: Effective Floating-point Programming.
+* Gawk and MPFR:: How `gawk' provides
+ arbitrary-precision arithmetic.
+* Arbitrary Precision Floats:: Arbitrary Precision Floating-point Arithmetic
+ with `gawk'.
+* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic with
+ `gawk'.
- With a breakpoint, you may also supply a condition. This is an
- `awk' expression (enclosed in double quotes) that the debugger
- evaluates whenever the breakpoint is reached. If the condition is
- true, then the debugger stops execution and prompts for a command.
- Otherwise, it continues executing the program.
+ ---------- Footnotes ----------
-`clear' [[FILENAME`:']N | FUNCTION]
- Without any argument, delete any breakpoint at the next instruction
- to be executed in the selected stack frame. If the program stops at
- a breakpoint, this deletes that breakpoint so that the program
- does not stop at that location again. Arguments can be one of the
- following:
+ (1) Donald E. Knuth. `The Art of Computer Programming'. Volume 2,
+`Seminumerical Algorithms', third edition, 1998, ISBN 0-201-89683-4, p.
+229.
- N
- Delete breakpoint(s) set at line number N in the current
- source file.
+
+File: gawk.info, Node: General Arithmetic, Next: Floating-point Programming, Up: Arbitrary Precision Arithmetic
- FILENAME`:'N
- Delete breakpoint(s) set at line number N in source file
- FILENAME.
+16.1 A General Description of Computer Arithmetic
+=================================================
- FUNCTION
- Delete breakpoint(s) set at entry to function FUNCTION.
+Within computers, there are two kinds of numeric values: "integers" and
+"floating-point". In school, integer values were referred to as
+"whole" numbers--that is, numbers without any fractional part, such as
+1, 42, or -17. The advantage to integer numbers is that they represent
+values exactly. The disadvantage is that their range is limited. On
+most systems, this range is -2,147,483,648 to 2,147,483,647. However,
+many systems now support a range from -9,223,372,036,854,775,808 to
+9,223,372,036,854,775,807.
-`condition' N `"EXPRESSION"'
- Add a condition to existing breakpoint or watchpoint N. The
- condition is an `awk' expression that the debugger evaluates
- whenever the breakpoint or watchpoint is reached. If the condition
- is true, then the debugger stops execution and prompts for a
- command. Otherwise, the debugger continues executing the program.
- If the condition expression is not specified, any existing
- condition is removed; i.e., the breakpoint or watchpoint is made
- unconditional.
+ Integer values come in two flavors: "signed" and "unsigned". Signed
+values may be negative or positive, with the range of values just
+described. Unsigned values are never negative. On most systems, the
+range is from 0 to 4,294,967,295. However, many systems now support a
+range from 0 to 18,446,744,073,709,551,615.
-`delete' [N1 N2 ...] [N-M]
-`d' [N1 N2 ...] [N-M]
- Delete specified breakpoints or a range of breakpoints. Deletes
- all defined breakpoints if no argument is supplied.
+ Floating-point numbers represent what are called "real" numbers;
+i.e., those that do have a fractional part, such as 3.1415927. The
+advantage to floating-point numbers is that they can represent a much
+larger range of values. The disadvantage is that there are numbers
+that they cannot represent exactly. `awk' uses "double precision"
+floating-point numbers, which can hold more digits than "single
+precision" floating-point numbers.
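One way to see the limit on exact representation is to look at the first integer that a double precision value cannot hold. This is a minimal sketch, assuming IEEE 754 double precision arithmetic and an `awk' run without any arbitrary precision option; the variable name is only illustrative:

```shell
# Sketch: 2^53 + 1 has no exact double precision representation,
# so the sum is rounded and compares equal to 2^53.
# Assumes IEEE 754 doubles; results could differ on other hardware.
same=$(awk 'BEGIN { print (2^53 + 1 == 2^53) }')
echo "2^53 + 1 == 2^53 -> $same"
```

A result of 1 means that the two values compared equal, that is, the addition of 1 was lost to rounding.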
-`disable' [N1 N2 ... | N-M]
- Disable specified breakpoints or a range of breakpoints. Without
- any argument, disables all breakpoints.
+ There are several important issues to be aware of, described next.
-`enable' [`del' | `once'] [N1 N2 ...] [N-M]
-`e' [`del' | `once'] [N1 N2 ...] [N-M]
- Enable specified breakpoints or a range of breakpoints. Without
- any argument, enables all breakpoints. Optionally, you can
- specify how to enable the breakpoint:
+* Menu:
- `del'
- Enable the breakpoint(s) temporarily, then delete it when the
- program stops at the breakpoint.
+* Floating Point Issues:: Stuff to know about floating-point numbers.
+* Integer Programming:: Effective integer programming.
- `once'
- Enable the breakpoint(s) temporarily, then disable it when
- the program stops at the breakpoint.
+
+File: gawk.info, Node: Floating Point Issues, Next: Integer Programming, Up: General Arithmetic
-`ignore' N COUNT
- Ignore breakpoint number N the next COUNT times it is hit.
+16.1.1 Floating-Point Number Caveats
+------------------------------------
-`tbreak' [[FILENAME`:']N | FUNCTION]
-`t' [[FILENAME`:']N | FUNCTION]
- Set a temporary breakpoint (enabled for only one stop). The
- arguments are the same as for `break'.
+As mentioned earlier, floating-point numbers represent what are called
+"real" numbers, i.e., those that have a fractional part. `awk' uses
+double precision floating-point numbers to represent all numeric
+values. This minor node describes some of the issues involved in using
+floating-point numbers.
-
-File: gawk.info, Node: Debugger Execution Control, Next: Viewing And Changing Data, Prev: Breakpoint Control, Up: List of Debugger Commands
+ There is a very nice paper on floating-point arithmetic
+(http://www.validlab.com/goldberg/paper.pdf) by David Goldberg, "What
+Every Computer Scientist Should Know About Floating-point Arithmetic,"
+`ACM Computing Surveys' *23*, 1 (1991-03), 5-48. This is worth reading
+if you are interested in the details, but it does require a background
+in computer science.
-15.3.2 Control of Execution
----------------------------
+* Menu:
-Now that your breakpoints are ready, you can start running the program
-and observing its behavior. There are more commands for controlling
-execution of the program than we saw in our earlier example:
+* String Conversion Precision:: The String Value Can Lie.
+* Unexpected Results:: Floating Point Numbers Are Not Abstract
+ Numbers.
+* POSIX Floating Point Problems:: Standards Versus Existing Practice.
-`commands' [N]
-`silent'
-...
-`end'
- Set a list of commands to be executed upon stopping at a
- breakpoint or watchpoint. N is the breakpoint or watchpoint number.
- Without a number, the last one set is used. The actual commands
- follow, starting on the next line, and terminated by the `end'
- command. If the command `silent' is in the list, the usual
- messages about stopping at a breakpoint and the source line are
- not printed. Any command in the list that resumes execution (e.g.,
- `continue') terminates the list (an implicit `end'), and
- subsequent commands are ignored. For example:
+
+File: gawk.info, Node: String Conversion Precision, Next: Unexpected Results, Up: Floating Point Issues
- gawk> commands
- > silent
- > printf "A silent breakpoint; i = %d\n", i
- > info locals
- > set i = 10
- > continue
- > end
- gawk>
+16.1.1.1 The String Value Can Lie
+.................................
-`continue' [COUNT]
-`c' [COUNT]
- Resume program execution. If continued from a breakpoint and COUNT
- is specified, ignores the breakpoint at that location the next
- COUNT times before stopping.
+Internally, `awk' keeps both the numeric value (double precision
+floating-point) and the string value for a variable. Separately, `awk'
+keeps track of what type the variable has (*note Typing and
+Comparison::), which plays a role in how variables are used in
+comparisons.
-`finish'
- Execute until the selected stack frame returns. Print the
- returned value.
+ It is important to note that the string value for a number may not
+reflect the full value (all the digits) that the numeric value actually
+contains. The following program (`values.awk') illustrates this:
-`next' [COUNT]
-`n' [COUNT]
- Continue execution to the next source line, stepping over function
- calls. The argument COUNT controls how many times to repeat the
- action, as in `step'.
+ {
+ sum = $1 + $2
+ # see it for what it is
+ printf("sum = %.12g\n", sum)
+ # use CONVFMT
+ a = "<" sum ">"
+ print "a =", a
+ # use OFMT
+ print "sum =", sum
+ }
-`nexti' [COUNT]
-`ni' [COUNT]
- Execute one (or COUNT) instruction(s), stepping over function
- calls.
+This program shows the full value of the sum of `$1' and `$2' using
+`printf', and then prints the string values obtained from both
+automatic conversion (via `CONVFMT') and from printing (via `OFMT').
-`return' [VALUE]
- Cancel execution of a function call. If VALUE (either a string or a
- number) is specified, it is used as the function's return value.
- If used in a frame other than the innermost one (the currently
- executing function, i.e., frame number 0), discard all inner
- frames in addition to the selected one, and the caller of that
- frame becomes the innermost frame.
+ Here is what happens when the program is run:
-`run'
-`r'
- Start/restart execution of the program. When restarting, the
- debugger retains the current breakpoints, watchpoints, command
- history, automatic display variables, and debugger options.
+ $ echo 3.654321 1.2345678 | awk -f values.awk
+ -| sum = 4.8888888
+ -| a = <4.88889>
+ -| sum = 4.88889
-`step' [COUNT]
-`s' [COUNT]
- Continue execution until control reaches a different source line
- in the current stack frame. `step' steps inside any function
- called within the line. If the argument COUNT is supplied, steps
- that many times before stopping, unless it encounters a breakpoint
- or watchpoint.
+ This makes it clear that the full numeric value is different from
+what the default string representations show.
-`stepi' [COUNT]
-`si' [COUNT]
- Execute one (or COUNT) instruction(s), stepping inside function
- calls. (For illustration of what is meant by an "instruction" in
- `gawk', see the output shown under `dump' in *note Miscellaneous
- Debugger Commands::.)
+ `CONVFMT''s default value is `"%.6g"', which yields a value with at
+most six significant digits. For some applications, you might want to
+change it to specify more precision. On most modern machines, most of
+the time, 17 digits is enough to capture a floating-point number's
+value exactly.(1)
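The effect of changing `CONVFMT' can be sketched as follows. The value 3.14159265 and the format `"%.9g"' are arbitrary choices for illustration, not recommendations:

```shell
# Sketch: number-to-string conversion with the default CONVFMT
# and with a wider, arbitrarily chosen format.
default_str=$(awk 'BEGIN { x = 3.14159265; s = x ""; print s }')
wide_str=$(awk 'BEGIN { CONVFMT = "%.9g"; x = 3.14159265; s = x ""; print s }')
echo "default CONVFMT: $default_str"
echo "CONVFMT = %.9g:  $wide_str"
```

Concatenating with the empty string forces the conversion, so `s' holds whatever `CONVFMT' produced at that moment. Integral values are converted as integers regardless of `CONVFMT'.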
-`until' [[FILENAME`:']N | FUNCTION]
-`u' [[FILENAME`:']N | FUNCTION]
- Without any argument, continue execution until a line past the
- current line in current stack frame is reached. With an argument,
- continue execution until the specified location is reached, or the
- current stack frame returns.
+ ---------- Footnotes ----------
+
+ (1) Pathological cases can require up to 752 digits (!), but we
+doubt that you need to worry about this.
-File: gawk.info, Node: Viewing And Changing Data, Next: Execution Stack, Prev: Debugger Execution Control, Up: List of Debugger Commands
+File: gawk.info, Node: Unexpected Results, Next: POSIX Floating Point Problems, Prev: String Conversion Precision, Up: Floating Point Issues
-15.3.3 Viewing and Changing Data
---------------------------------
+16.1.1.2 Floating Point Numbers Are Not Abstract Numbers
+........................................................
-The commands for viewing and changing variables inside of `gawk' are:
+Unlike numbers in the abstract sense (such as what you studied in high
+school or college arithmetic), numbers stored in computers are limited
+in certain ways. They cannot represent an infinite number of digits,
+nor can they always represent things exactly. In particular,
+floating-point numbers cannot always represent values exactly. Here is
+an example:
-`display' [VAR | `$'N]
- Add variable VAR (or field `$N') to the display list. The value
- of the variable or field is displayed each time the program stops.
- Each variable added to the list is identified by a unique number:
+ $ awk '{ printf("%010d\n", $1 * 100) }'
+ 515.79
+ -| 0000051579
+ 515.80
+ -| 0000051579
+ 515.81
+ -| 0000051580
+ 515.82
+ -| 0000051582
+ Ctrl-d
- gawk> display x
- -| 10: x = 1
+This shows that some values can be represented exactly, whereas others
+are only approximated. This is not a "bug" in `awk', but simply an
+artifact of how computers represent numbers.
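The surprising result for 515.80 comes from `%d' discarding the fractional part of a product that is slightly below 51580. A possible workaround, sketched here under the assumption of IEEE 754 doubles and the C locale, is to round with `%.0f' instead of truncating:

```shell
# Sketch: truncation versus rounding of the same product.
# LC_ALL=C keeps the decimal point interpretation predictable.
truncated=$(echo 515.80 | LC_ALL=C awk '{ printf("%d\n", $1 * 100) }')
rounded=$(echo 515.80 | LC_ALL=C awk '{ printf("%.0f\n", $1 * 100) }')
echo "truncated with %d:   $truncated"
echo "rounded with %.0f:   $rounded"
```

Rounding hides the representation error in the output; it does not make the stored value exact.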
- displays the assigned item number, the variable name and its
- current value. If the display variable refers to a function
- parameter, it is silently deleted from the list as soon as the
- execution reaches a context where no such variable of the given
- name exists. Without argument, `display' displays the current
- values of items on the list.
+ NOTE: It cannot be emphasized enough that the behavior just
+ described is fundamental to modern computers. You will see this
+ kind of thing happen in _any_ programming language using hardware
+ floating-point numbers. It is _not_ a bug in `gawk', nor is it
+ something that can be "just fixed."
+
+ Another peculiarity of floating-point numbers on modern systems is
+that they often have more than one representation for the number zero!
+In particular, it is possible to represent "minus zero" as well as
+regular, or "positive" zero.
-`eval "AWK STATEMENTS"'
- Evaluate AWK STATEMENTS in the context of the running program.
- You can do anything that an `awk' program would do: assign values
- to variables, call functions, and so on.
+ This example shows that negative and positive zero are distinct
+values when stored internally, but that they are in fact equal to each
+other, as well as to "regular" zero:
-`eval' PARAM, ...
-AWK STATEMENTS
-`end'
- This form of `eval' is similar, but it allows you to define "local
- variables" that exist in the context of the AWK STATEMENTS,
- instead of using variables or function parameters defined by the
- program.
+ $ gawk 'BEGIN { mz = -0 ; pz = 0
+ > printf "-0 = %g, +0 = %g, (-0 == +0) -> %d\n", mz, pz, mz == pz
+ > printf "mz == 0 -> %d, pz == 0 -> %d\n", mz == 0, pz == 0
+ > }'
+ -| -0 = -0, +0 = 0, (-0 == +0) -> 1
+ -| mz == 0 -> 1, pz == 0 -> 1
-`print' VAR1[`,' VAR2 ...]
-`p' VAR1[`,' VAR2 ...]
- Print the value of a `gawk' variable or field. Fields must be
- referenced by constants:
+ It helps to keep this in mind should you process numeric data that
+contains negative zero values; the fact that the zero is negative is
+noted and can affect comparisons.
- gawk> print $3
+
+File: gawk.info, Node: POSIX Floating Point Problems, Prev: Unexpected Results, Up: Floating Point Issues
- This prints the third field in the input record (if the specified
- field does not exist, it prints `Null field'). A variable can be
- an array element, with the subscripts being constant values. To
- print the contents of an array, prefix the name of the array with
- the `@' symbol:
+16.1.1.3 Standards Versus Existing Practice
+...........................................
- gawk> print @a
+Historically, `awk' has converted any non-numeric looking string to the
+numeric value zero, when required. Furthermore, the original
+definition of the language and the original POSIX standards specified
+that `awk' only understands decimal numbers (base 10), and not octal
+(base 8) or hexadecimal numbers (base 16).
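The historical conversion rule can be checked with a short sketch. The input values are arbitrary, and hexadecimal-looking input is deliberately left out because its treatment may differ between `awk' versions:

```shell
# Sketch: a string that does not look like a decimal number
# converts to zero when used in a numeric context.
nonnum=$(echo "abc 42" | awk '{ print $1 + 0, $2 + 0 }')
echo "abc and 42 as numbers: $nonnum"
```

Adding zero forces each field to be treated as a number.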
- This prints the indices and the corresponding values for all
- elements in the array `a'.
+ Changes in the language of the 2001 and 2004 POSIX standards can be
+interpreted to imply that `awk' should support additional features.
+These features are:
-`printf' FORMAT [`,' ARG ...]
- Print formatted text. The FORMAT may include escape sequences,
- such as `\n' (*note Escape Sequences::). No newline is printed
- unless one is specified.
+ * Interpretation of floating point data values specified in
+ hexadecimal notation (`0xDEADBEEF'). (Note: data values, _not_
+ source code constants.)
-`set' VAR`='VALUE
- Assign a constant (number or string) value to an `awk' variable or
- field. String values must be enclosed between double quotes
- (`"..."').
+ * Support for the special IEEE 754 floating point values "Not A
+ Number" (NaN), positive Infinity ("inf") and negative Infinity
+ ("-inf"). In particular, the format for these values is as
+ specified by the ISO 1999 C standard, which ignores case and can
+ allow machine-dependent additional characters after the `nan' and
+ allow either `inf' or `infinity'.
- You can also set special `awk' variables, such as `FS', `NF',
- `NR', etc.
+ The first problem is that both of these are clear changes to
+historical practice:
-`watch' VAR | `$'N [`"EXPRESSION"']
-`w' VAR | `$'N [`"EXPRESSION"']
- Add variable VAR (or field `$N') to the watch list. The debugger
- then stops whenever the value of the variable or field changes.
- Each watched item is assigned a number which can be used to delete
- it from the watch list using the `unwatch' command.
+ * The `gawk' maintainer feels that supporting hexadecimal floating
+ point values, in particular, is ugly, and was never intended by the
+ original designers to be part of the language.
- With a watchpoint, you may also supply a condition. This is an
- `awk' expression (enclosed in double quotes) that the debugger
- evaluates whenever the watchpoint is reached. If the condition is
- true, then the debugger stops execution and prompts for a command.
- Otherwise, `gawk' continues executing the program.
+ * Allowing completely alphabetic strings to have valid numeric
+ values is also a very severe departure from historical practice.
-`undisplay' [N]
- Remove item number N (or all items, if no argument) from the
- automatic display list.
+ The second problem is that the `gawk' maintainer feels that this
+interpretation of the standard, which requires a certain amount of
+"language lawyering" to arrive at in the first place, was not even
+intended by the standard developers. In other words, "we see how you
+got where you are, but we don't think that that's where you want to be."
-`unwatch' [N]
- Remove item number N (or all items, if no argument) from the watch
- list.
+ Recognizing the above issues, but attempting to provide compatibility
+with the earlier versions of the standard, the 2008 POSIX standard
+added explicit wording to allow, but not require, that `awk' support
+hexadecimal floating point values and special values for "Not A Number"
+and infinity.
+ Although the `gawk' maintainer continues to feel that providing
+those features is inadvisable, nevertheless, on systems that support
+IEEE floating point, it seems reasonable to provide _some_ way to
+support NaN and Infinity values. The solution implemented in `gawk' is
+as follows:
-
-File: gawk.info, Node: Execution Stack, Next: Debugger Info, Prev: Viewing And Changing Data, Up: List of Debugger Commands
+ * With the `--posix' command-line option, `gawk' becomes "hands
+ off." String values are passed directly to the system library's
+ `strtod()' function, and if it successfully returns a numeric
+ value, that is what's used.(1) By definition, the results are not
+ portable across different systems. They are also a little
+ surprising:
-15.3.4 Dealing with the Stack
------------------------------
+ $ echo nanny | gawk --posix '{ print $1 + 0 }'
+ -| nan
+ $ echo 0xDeadBeef | gawk --posix '{ print $1 + 0 }'
+ -| 3735928559
-Whenever you run a program which contains any function calls, `gawk'
-maintains a stack of all of the function calls leading up to where the
-program is right now. You can see how you got to where you are, and
-also move around in the stack to see what the state of things was in the
-functions which called the one you are in. The commands for doing this
-are:
+ * Without `--posix', `gawk' interprets the four strings `+inf',
+ `-inf', `+nan', and `-nan' specially, producing the corresponding
+ special numeric values. The leading sign acts as a signal to `gawk'
+ (and the user) that the value is really numeric. Hexadecimal
+ floating point is not supported (unless you also use
+ `--non-decimal-data', which is _not_ recommended). For example:
-`backtrace' [COUNT]
-`bt' [COUNT]
- Print a backtrace of all function calls (stack frames), or
- innermost COUNT frames if COUNT > 0. Print the outermost COUNT
- frames if COUNT < 0. The backtrace displays the name and
- arguments to each function, the source file name, and the line
- number.
+ $ echo nanny | gawk '{ print $1 + 0 }'
+ -| 0
+ $ echo +nan | gawk '{ print $1 + 0 }'
+ -| nan
+ $ echo 0xDeadBeef | gawk '{ print $1 + 0 }'
+ -| 0
-`down' [COUNT]
- Move COUNT (default 1) frames down the stack toward the innermost
- frame. Then select and print the frame.
+ `gawk' does ignore case in the four special values. Thus `+nan'
+ and `+NaN' are the same.
-`frame' [N]
-`f' [N]
- Select and print (frame number, function and argument names,
- source file, and the source line) stack frame N. Frame 0 is the
- currently executing, or "innermost", frame (function call), frame
- 1 is the frame that called the innermost one. The highest numbered
- frame is the one for the main program.
+ ---------- Footnotes ----------
-`up' [COUNT]
- Move COUNT (default 1) frames up the stack toward the outermost
- frame. Then select and print the frame.
+ (1) You asked for it, you got it.
-File: gawk.info, Node: Debugger Info, Next: Miscellaneous Debugger Commands, Prev: Execution Stack, Up: List of Debugger Commands
+File: gawk.info, Node: Integer Programming, Prev: Floating Point Issues, Up: General Arithmetic
-15.3.5 Obtaining Information about the Program and the Debugger State
----------------------------------------------------------------------
+16.1.2 Mixing Integers And Floating-point
+-----------------------------------------
-Besides looking at the values of variables, there is often a need to get
-other sorts of information about the state of your program and of the
-debugging environment itself. The `gawk' debugger has one command which
-provides this information, appropriately called `info'. `info' is used
-with one of a number of arguments that tell it exactly what you want to
-know:
+As has been mentioned already, `gawk' ordinarily uses hardware double
+precision with 64-bit IEEE binary floating-point representation for
+numbers on most systems. A large integer like 9007199254740997 has a
+binary representation that, although finite, is more than 53 bits long;
+it must also be rounded to 53 bits. The biggest integer that can be
+stored in a C `double' is usually the same as the largest possible
+value of a `double'. If your system `double' is an IEEE 64-bit
+`double', this largest possible value is an integer and can be
+represented precisely. What more should one know about integers?
-`info' WHAT
-`i' WHAT
- The value for WHAT should be one of the following:
+ If you want to know what is the largest integer, such that it and
+all smaller integers can be stored in 64-bit doubles without losing
+precision, then the answer is 2^53. The next representable number is
+the even number 2^53 + 2, meaning it is unlikely that you will be able
+to make `gawk' print 2^53 + 1 in integer format. The range in which
+every integer is exactly representable by a 64-bit double is [-2^53,
+2^53]. If you ever see an integer outside this range in `gawk' using
+64-bit doubles, you have reason to be very suspicious about the accuracy
+of the output.
+Here is a simple program with erroneous output:
- `args'
- Arguments of the selected frame.
+ $ gawk 'BEGIN { i = 2^53 - 1; for (j = 0; j < 4; j++) print i + j }'
+ -| 9007199254740991
+ -| 9007199254740992
+ -| 9007199254740992
+ -| 9007199254740994
- `break'
- List all currently set breakpoints.
+ The lesson is to not assume that any large integer printed by `gawk'
+represents an exact result from your computation, especially if it wraps
+around on your screen.
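The 2^53 boundary described above can be probed directly. The following is a minimal sketch, assuming an `awk' that stores numbers as IEEE 64-bit doubles; the shell function name `absorbs_one' is hypothetical and not part of `gawk'. It prints 1 when adding one to 2^E leaves the value unchanged, and 0 otherwise:

```shell
# Hypothetical helper (not part of gawk): prints 1 if adding 1 to 2^E
# is lost to rounding, 0 if the sum is still exactly representable.
# Assumes awk numbers are IEEE 64-bit doubles.
absorbs_one() {
    awk -v e="$1" 'BEGIN { x = 2 ^ e; print (x + 1 == x) }'
}

absorbs_one 52    # 2^52 + 1 is still representable
absorbs_one 53    # 2^53 + 1 rounds back to 2^53
```

Comparing values this way avoids relying on integer output formats, which differ between `awk' implementations.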
- `display'
- List all items in the automatic display list.
+
+File: gawk.info, Node: Floating-point Programming, Next: Gawk and MPFR, Prev: General Arithmetic, Up: Arbitrary Precision Arithmetic
- `frame'
- Description of the selected stack frame.
+16.2 Understanding Floating-point Programming
+=============================================
- `functions'
- List all function definitions including source file names and
- line numbers.
+Numerical programming is an extensive area; if you need to develop
+sophisticated numerical algorithms then `gawk' may not be the ideal
+tool, and this documentation may not be sufficient. It might require
+digesting a book or two to really internalize how to compute with ideal
+accuracy and precision, and the result often depends on the particular
+application.
- `locals'
- Local variables of the selected frame.
+ NOTE: A floating-point calculation's "accuracy" is how close it
+ comes to the real value. This is as opposed to the "precision",
+ which usually refers to the number of bits used to represent the
+ number (see the Wikipedia article
+ (http://en.wikipedia.org/wiki/Accuracy_and_precision) for more
+ information).
- `source'
- The name of the current source file. Each time the program
- stops, the current source file is the file containing the
- current instruction. When the debugger first starts, the
- current source file is the first file included via the `-f'
- option. The `list FILENAME:LINENO' command can be used at any
- time to change the current source.
+ There are two options for doing floating-point calculations:
+hardware floating-point (as used by standard `awk' and the default for
+`gawk'), and "arbitrary-precision" floating-point, which is software
+based. This major node aims to provide enough information to
+understand both, and then will focus on `gawk''s facilities for the
+latter.(1)
- `sources'
- List all program sources.
+ Binary floating-point representations and arithmetic are inexact.
+Simple values like 0.1 cannot be precisely represented using binary
+floating-point numbers, and the limited precision of floating-point
+numbers means that slight changes in the order of operations or the
+precision of intermediate storage can change the result. To make
+matters worse, with arbitrary precision floating-point, you can set the
+precision before starting a computation, but then you cannot be sure of
+the number of significant decimal places in the final result.
- `variables'
- List all global variables.
+ Sometimes, before you start to write any code, you should think more
+about what you really want and what's really happening. Consider the
+two numbers in the following example:
- `watch'
- List all items in the watch list.
+ x = 0.875 # 1/2 + 1/4 + 1/8
+ y = 0.425
- Additional commands give you control over the debugger, the ability
-to save the debugger's state, and the ability to run debugger commands
-from a file. The commands are:
+ Unlike the number in `y', the number stored in `x' is exactly
+representable in binary since it can be written as a finite sum of one
+or more fractions whose denominators are all powers of two. When
+`gawk' reads a floating-point number from program source, it
+automatically rounds that number to whatever precision your machine
+supports. If you try to print the numeric content of a variable using
+an output format string of `"%.17g"', it may not produce the same
+number as you assigned to it:
-`option' [NAME[`='VALUE]]
-`o' [NAME[`='VALUE]]
- Without an argument, display the available debugger options and
- their current values. `option NAME' shows the current value of the
- named option. `option NAME=VALUE' assigns a new value to the named
- option. The available options are:
+ $ gawk 'BEGIN { x = 0.875; y = 0.425
+ > printf("%0.17g, %0.17g\n", x, y) }'
+ -| 0.875, 0.42499999999999999
- `history_size'
- The maximum number of lines to keep in the history file
- `./.gawk_history'. The default is 100.
+ Often the error is so small you do not even notice it, and if you do,
+you can always specify how much precision you would like in your output.
+Usually this is a format string like `"%.15g"', which when used in the
+previous example, produces an output identical to the input.
- `listsize'
- The number of lines that `list' prints. The default is 15.
+ Because the underlying representation can be a little bit off from the
+exact value, comparing floating-point values to see if they are equal
+is generally not a good idea. Here is an example where it does not
+work like you expect:
- `outfile'
- Send `gawk' output to a file; debugger output still goes to
- standard output. An empty string (`""') resets output to
- standard output.
+ $ gawk 'BEGIN { print (0.1 + 12.2 == 12.3) }'
+ -| 0
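One common workaround is to compare within a tolerance instead of testing for exact equality. The following is a minimal sketch, not a `gawk' facility: the names `demo_compare' and `approx_equal' and the tolerance 1e-9 are assumptions chosen for illustration, and a fixed absolute tolerance is only appropriate when the values involved are of modest magnitude.

```shell
# Hypothetical demo: exact equality versus comparison within a tolerance.
demo_compare() {
    awk '
    # Return 1 if a and b differ by at most eps, 0 otherwise.
    # d is a local variable (extra parameter, per awk convention).
    function approx_equal(a, b, eps,    d) {
        d = a - b
        if (d < 0)
            d = -d
        return (d <= eps)
    }
    BEGIN {
        print (0.1 + 12.2 == 12.3)                  # exact test fails
        print approx_equal(0.1 + 12.2, 12.3, 1e-9)  # tolerant test succeeds
    }'
}

demo_compare
```

The first line of output is 0 and the second is 1, since the two values differ only by an amount far smaller than the chosen tolerance.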
- `prompt'
- The debugger prompt. The default is `gawk> '.
+ The loss of accuracy during a single computation with floating-point
+numbers usually isn't enough to worry about. However, if you compute a
+value which is the result of a sequence of floating point operations,
+the error can accumulate and greatly affect the computation itself.
+Here is an attempt to compute the value of the constant pi using one of
+its many series representations:
- `save_history [on | off]'
- Save command history to file `./.gawk_history'. The default
- is `on'.
+ BEGIN {
+ x = 1.0 / sqrt(3.0)
+ n = 6
+ for (i = 1; i < 30; i++) {
+ n = n * 2.0
+ x = (sqrt(x * x + 1) - 1) / x
+ printf("%.15f\n", n * x)
+ }
+ }
- `save_options [on | off]'
- Save current options to file `./.gawkrc' upon exit. The
- default is `on'. Options are read back in to the next
- session upon startup.
+ When run, the early errors propagating through later computations
+cause the loop to terminate prematurely after an attempt to divide by
+zero.
- `trace [on | off]'
- Turn instruction tracing on or off. The default is `off'.
+ $ gawk -f pi.awk
+ -| 3.215390309173475
+ -| 3.159659942097510
+ -| 3.146086215131467
+ -| 3.142714599645573
+ ...
+ -| 3.224515243534819
+ -| 2.791117213058638
+ -| 0.000000000000000
+ error--> gawk: pi.awk:6: fatal: division by zero attempted
-`save' FILENAME
- Save the commands from the current session to the given file name,
- so that they can be replayed using the `source' command.
+ Here is one more example where the inaccuracies in internal
+representations yield an unexpected result:
-`source' FILENAME
- Run command(s) from a file; an error in any command does not
- terminate execution of subsequent commands. Comments (lines
- starting with `#') are allowed in a command file. Empty lines are
- ignored; they do _not_ repeat the last command. You can't restart
- the program by having more than one `run' command in the file.
- Also, the list of commands may include additional `source'
- commands; however, the `gawk' debugger will not source the same
- file more than once in order to avoid infinite recursion.
+ $ gawk 'BEGIN {
+ > for (d = 1.1; d <= 1.5; d += 0.1)
+ > i++
+ > print i
+ > }'
+ -| 4
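The loop above stops one iteration early because the accumulated value of `d' drifts slightly past 1.5. A minimal sketch of one way around this, assuming it is acceptable to restructure the loop, is to count with an integer and derive the fractional value from it; the function name `count_steps' is hypothetical:

```shell
# Hypothetical rewrite: drive the loop with an exact integer counter and
# compute the floating-point value from it, so no error accumulates.
count_steps() {
    awk 'BEGIN {
        for (k = 11; k <= 15; k++) {
            d = k / 10      # 1.1, 1.2, ..., 1.5, each rounded only once
            i++
        }
        print i
    }'
}

count_steps
```

With an integer counter the loop runs for all five values, so this prints 5.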
- In addition to, or instead of the `source' command, you can use
- the `-D FILE' or `--debug=FILE' command-line options to execute
- commands from a file non-interactively (*note Options::.
+ Can computation using arbitrary precision help with the previous
+examples? If you are impatient to know, see *note Exact Arithmetic::.
-
-File: gawk.info, Node: Miscellaneous Debugger Commands, Prev: Debugger Info, Up: List of Debugger Commands
+ Instead of arbitrary precision floating-point arithmetic, often all
+you need is an adjustment of your logic or a different order for the
+operations in your calculation. The stability and the accuracy of the
+computation of the constant pi in the previous example can be enhanced
+by using the following simple algebraic transformation:
-15.3.6 Miscellaneous Commands
------------------------------
+ (sqrt(x * x + 1) - 1) / x = x / (sqrt(x * x + 1) + 1)
-There are a few more commands which do not fit into the previous
-categories, as follows:
+After making this change, the program does converge to pi in under 30
+iterations:
-`dump' [FILENAME]
- Dump bytecode of the program to standard output or to the file
- named in FILENAME. This prints a representation of the internal
- instructions which `gawk' executes to implement the `awk' commands
- in a program. This can be very enlightening, as the following
- partial dump of Davide Brini's obfuscated code (*note Signature
- Program::) demonstrates:
+ $ gawk -f /tmp/pi2.awk
+ -| 3.215390309173473
+ -| 3.159659942097501
+ -| 3.146086215131436
+ -| 3.142714599645370
+ -| 3.141873049979825
+ ...
+ -| 3.141592653589797
+ -| 3.141592653589797
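The revised program itself is not listed. The following is a sketch of what it presumably looks like, assuming it is the earlier program with only the update of `x' replaced by the transformed expression; the shell wrapper `stable_pi' is a hypothetical name used here for convenience:

```shell
# Sketch of the stabilized computation: same loop as before, but the
# update of x uses x / (sqrt(x*x + 1) + 1), which avoids subtracting
# two nearly equal quantities.
stable_pi() {
    awk 'BEGIN {
        x = 1.0 / sqrt(3.0)
        n = 6
        for (i = 1; i < 30; i++) {
            n = n * 2.0
            x = x / (sqrt(x * x + 1) + 1)
            printf("%.15f\n", n * x)
        }
    }'
}

stable_pi | tail -n 1
```

The original form loses accuracy because sqrt(x * x + 1) - 1 cancels almost all significant bits once `x' becomes small; the transformed form contains no such subtraction.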
- gawk> dump
- -| # BEGIN
- -|
- -| [ 2:0x89faef4] Op_rule : [in_rule = BEGIN] [source_file = brini.awk]
- -| [ 3:0x89fa428] Op_push_i : "~" [PERM|STRING|STRCUR]
- -| [ 3:0x89fa464] Op_push_i : "~" [PERM|STRING|STRCUR]
- -| [ 3:0x89fa450] Op_match :
- -| [ 3:0x89fa3ec] Op_store_var : O [do_reference = FALSE]
- -| [ 4:0x89fa48c] Op_push_i : "==" [PERM|STRING|STRCUR]
- -| [ 4:0x89fa4c8] Op_push_i : "==" [PERM|STRING|STRCUR]
- -| [ 4:0x89fa4b4] Op_equal :
- -| [ 4:0x89fa400] Op_store_var : o [do_reference = FALSE]
- -| [ 5:0x89fa4f0] Op_push : o
- -| [ 5:0x89fa4dc] Op_plus_i : 0 [PERM|NUMCUR|NUMBER]
- -| [ 5:0x89fa414] Op_push_lhs : o [do_reference = TRUE]
- -| [ 5:0x89fa4a0] Op_assign_plus :
- -| [ :0x89fa478] Op_pop :
- -| [ 6:0x89fa540] Op_push : O
- -| [ 6:0x89fa554] Op_push_i : "" [PERM|STRING|STRCUR]
- -| [ :0x89fa5a4] Op_no_op :
- -| [ 6:0x89fa590] Op_push : O
- -| [ :0x89fa5b8] Op_concat : [expr_count = 3] [concat_flag = 0]
- -| [ 6:0x89fa518] Op_store_var : x [do_reference = FALSE]
- -| [ 7:0x89fa504] Op_push_loop : [target_continue = 0x89fa568] [target_break = 0x89fa680]
- -| [ 7:0x89fa568] Op_push_lhs : X [do_reference = TRUE]
- -| [ 7:0x89fa52c] Op_postincrement :
- -| [ 7:0x89fa5e0] Op_push : x
- -| [ 7:0x89fa61c] Op_push : o
- -| [ 7:0x89fa5f4] Op_plus :
- -| [ 7:0x89fa644] Op_push : o
- -| [ 7:0x89fa630] Op_plus :
- -| [ 7:0x89fa5cc] Op_leq :
- -| [ :0x89fa57c] Op_jmp_false : [target_jmp = 0x89fa680]
- -| [ 7:0x89fa694] Op_push_i : "%c" [PERM|STRING|STRCUR]
- -| [ :0x89fa6d0] Op_no_op :
- -| [ 7:0x89fa608] Op_assign_concat : c
- -| [ :0x89fa6a8] Op_jmp : [target_jmp = 0x89fa568]
- -| [ :0x89fa680] Op_pop_loop :
- -|
- ...
- -|
- -| [ 8:0x89fa658] Op_K_printf : [expr_count = 17] [redir_type = ""]
- -| [ :0x89fa374] Op_no_op :
- -| [ :0x89fa3d8] Op_atexit :
- -| [ :0x89fa6bc] Op_stop :
- -| [ :0x89fa39c] Op_no_op :
- -| [ :0x89fa3b0] Op_after_beginfile :
- -| [ :0x89fa388] Op_no_op :
- -| [ :0x89fa3c4] Op_after_endfile :
- gawk>
+ There is no need to be unduly suspicious about the results from
+floating-point arithmetic. The lesson to remember is that
+floating-point arithmetic is always more complex than the arithmetic
+using pencil and paper. In order to take advantage of the power of
+computer floating-point, you need to know its limitations and work
+within them. For most casual use of floating-point arithmetic, you will
+often get the expected result in the end if you simply round the
+display of your final results to the correct number of significant
+decimal digits. And, avoid presenting numerical data in a manner that
+implies better precision than is actually the case.
-`help'
-`h'
- Print a list of all of the `gawk' debugger commands with a short
- summary of their usage. `help COMMAND' prints the information
- about the command COMMAND.
+* Menu:
-`list' [`-' | `+' | N | FILENAME`:'N | N-M | FUNCTION]
-`l' [`-' | `+' | N | FILENAME`:'N | N-M | FUNCTION]
- Print the specified lines (default 15) from the current source file
- or the file named FILENAME. The possible arguments to `list' are
- as follows:
+* Floating-point Representation:: Binary floating-point representation.
+* Floating-point Context:: Floating-point context.
+* Rounding Mode:: Floating-point rounding mode.
- `-'
- Print lines before the lines last printed.
+ ---------- Footnotes ----------
- `+'
- Print lines after the lines last printed. `list' without any
- argument does the same thing.
+ (1) If you are interested in other tools that perform arbitrary
+precision arithmetic, you may want to investigate the POSIX `bc' tool.
+See the POSIX specification for it
+(http://pubs.opengroup.org/onlinepubs/009695399/utilities/bc.html), for
+more information.
- N
- Print lines centered around line number N.
+
+File: gawk.info, Node: Floating-point Representation, Next: Floating-point Context, Up: Floating-point Programming
- N-M
- Print lines from N to M.
+16.2.1 Binary Floating-point Representation
+-------------------------------------------
- FILENAME`:'N
- Print lines centered around line number N in source file
- FILENAME. This command may change the current source file.
+Although floating-point representations vary from machine to machine,
+the most commonly encountered representation is that defined by the
+IEEE 754 Standard. An IEEE-754 format value has three components:
- FUNCTION
- Print lines centered around beginning of the function
- FUNCTION. This command may change the current source file.
+ * A sign bit telling whether the number is positive or negative.
-`quit'
-`q'
- Exit the debugger. Debugging is great fun, but sometimes we all
- have to tend to other obligations in life, and sometimes we find
- the bug, and are free to go on to the next one! As we saw above,
- if you are running a program, the debugger warns you if you
- accidentally type `q' or `quit', to make sure you really want to
- quit.
+ * An "exponent" giving its order of magnitude, E.
-`trace' `on' | `off'
- Turn on or off a continuous printing of instructions which are
- about to be executed, along with printing the `awk' line which they
- implement. The default is `off'.
+ * A "significand", S, specifying the actual digits of the number.
+
+ The value of the number is then S * 2^E. The first bit of a
+non-zero binary significand is always one, so the significand in an
+IEEE-754 format only includes the fractional part, leaving the leading
+one implicit.
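To make the S * 2^E form concrete, the following minimal sketch splits a finite, nonzero number into a sign, a significand S in the range [1, 2), and an exponent E. The function name `decompose' is hypothetical, and the sketch works on the numeric value rather than inspecting the stored bits; it relies on the fact that multiplying or dividing by two is exact in binary floating point for normal values.

```shell
# Hypothetical helper: express a finite, nonzero value as S * 2^E with
# 1 <= S < 2. Infinities and NaN are not handled (the loops would not end).
decompose() {
    awk -v v="$1" 'BEGIN {
        v = v + 0                       # force numeric interpretation
        if (v == 0) { print "zero"; exit }
        sign = (v < 0) ? 1 : 0
        s = (v < 0) ? -v : v
        e = 0
        while (s >= 2) { s /= 2; e++ }  # scale down into [1, 2)
        while (s < 1)  { s *= 2; e-- }  # scale up into [1, 2)
        printf("sign=%d S=%.17g E=%d\n", sign, s, e)
    }'
}

decompose 0.875    # 0.875 = 1.75 * 2^-1
decompose -10      # -10   = -(1.25 * 2^3)
```

In the stored format only the fractional part of S (0.75 and 0.25 in these two cases) occupies the significand field, since the leading one is implicit.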
- It is to be hoped that most of the "opcodes" in these instructions
- are fairly self-explanatory, and using `stepi' and `nexti' while
- `trace' is on will make them into familiar friends.
+ Three of the standard IEEE-754 types are 32-bit single precision,
+64-bit double precision and 128-bit quadruple precision. The standard
+also specifies extended precision formats to allow greater precisions
+and larger exponent ranges.
+ The significand is stored in "normalized" format, which means that
+the first bit is always a one.
-File: gawk.info, Node: Readline Support, Next: Limitations, Prev: List of Debugger Commands, Up: Debugger
+File: gawk.info, Node: Floating-point Context, Next: Rounding Mode, Prev: Floating-point Representation, Up: Floating-point Programming
-15.4 Readline Support
-=====================
+16.2.2 Floating-point Context
+-----------------------------
-If `gawk' is compiled with the `readline' library, you can take
-advantage of that library's command completion and history expansion
-features. The following types of completion are available:
+A floating-point "context" defines the environment for arithmetic
+operations. It governs precision, sets rules for rounding, and limits
+the range for exponents. The context has the following primary
+components:
-Command completion
- Command names.
+"Precision"
+ Precision of the floating-point format in bits.
-Source file name completion
- Source file names. Relevant commands are `break', `clear', `list',
- `tbreak', and `until'.
+"emax"
+ Maximum exponent allowed for this format.
-Argument completion
- Non-numeric arguments to a command. Relevant commands are
- `enable' and `info'.
+"emin"
+ Minimum exponent allowed for this format.
-Variable name completion
- Global variable names, and function arguments in the current
- context if the program is running. Relevant commands are `display',
- `print', `set', and `watch'.
+"Underflow behavior"
+ The format may or may not support gradual underflow.
+"Rounding"
+ The rounding mode of this context.
-
-File: gawk.info, Node: Limitations, Prev: Readline Support, Up: Debugger
+ *note table-ieee-formats:: lists the precision and exponent field
+values for the basic IEEE-754 binary formats:
-15.5 Limitations and Future Plans
-=================================
+Name Total bits Precision emin emax
+---------------------------------------------------------------------------
+Single 32 24 -126 +127
+Double 64 53 -1022 +1023
+Quadruple 128 113 -16382 +16383
-We hope you find the `gawk' debugger useful and enjoyable to work with,
-but as with any program, especially in its early releases, it still has
-some limitations. A few which are worth being aware of are:
+Table 16.1: Basic IEEE Format Context Values
- * At this point, the debugger does not give a detailed explanation of
- what you did wrong when you type in something it doesn't like.
- Rather, it just responds `syntax error'. When you do figure out
- what your mistake was, though, you'll feel like a real guru.
+ NOTE: The precision numbers include the implied leading one that
+ gives them one extra bit of significand.
- * If you perused the dump of opcodes in *note Miscellaneous Debugger
- Commands::, (or if you are already familiar with `gawk' internals),
- you will realize that much of the internal manipulation of data in
- `gawk', as in many interpreters, is done on a stack. `Op_push',
- `Op_pop', etc., are the "bread and butter" of most `gawk' code.
- Unfortunately, as of now, the `gawk' debugger does not allow you
- to examine the stack's contents.
+ A floating-point context can also determine which signals are treated
+as exceptions, and can set rules for arithmetic with special values.
+Please consult the IEEE-754 standard or other resources for details.
- That is, the intermediate results of expression evaluation are on
- the stack, but cannot be printed. Rather, only variables which
- are defined in the program can be printed. Of course, a
- workaround for this is to use more explicit variables at the
- debugging stage and then change back to obscure, perhaps more
- optimal code later.
+ `gawk' ordinarily uses the hardware double precision representation
+for numbers. On most systems, this is IEEE-754 floating-point format,
+corresponding to 64-bit binary with 53 bits of precision.
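The 53-bit figure can be checked empirically. The following is a minimal sketch, assuming the `awk' in use performs arithmetic in IEEE 64-bit doubles with the default rounding mode; the function name `precision_bits' is hypothetical. It halves a value until adding it to 1 no longer changes the result, and counts the steps:

```shell
# Hypothetical probe: count how many halvings of eps it takes before
# 1 + eps is indistinguishable from 1. On IEEE double hardware this
# equals the precision in bits, including the implicit leading one.
precision_bits() {
    awk 'BEGIN {
        eps = 1
        bits = 0
        while (1 + eps != 1) {
            eps /= 2
            bits++
        }
        print bits
    }'
}

precision_bits
```

On a system using IEEE double precision this prints 53. A different result suggests that the platform evaluates expressions in some other format, such as extended precision registers.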
- * There is no way to look "inside" the process of compiling regular
- expressions to see if you got it right. As an `awk' programmer,
- you are expected to know what `/[^[:alnum:][:blank:]]/' means.
+ NOTE: In case an underflow occurs, the standard allows, but does
+ not require, the result from an arithmetic operation to be a
+ number smaller than the smallest nonzero normalized number. Such
+ numbers do not have as many significant digits as normal numbers,
+ and are called "denormals" or "subnormals". The alternative,
+ simply returning a zero, is called "flush to zero". The basic
+ IEEE-754 binary formats support subnormal numbers.
- * The `gawk' debugger is designed to be used by running a program
- (with all its parameters) on the command line, as described in
- *note Debugger Invocation::. There is no way (as of now) to
- attach or "break in" to a running program. This seems reasonable
- for a language which is used mainly for quickly executing, short
- programs.
+
+File: gawk.info, Node: Rounding Mode, Prev: Floating-point Context, Up: Floating-point Programming
- * The `gawk' debugger only accepts source supplied with the `-f'
- option.
+16.2.3 Floating-point Rounding Mode
+-----------------------------------
- Look forward to a future release when these and other missing
-features may be added, and of course feel free to try to add them
-yourself!
+The "rounding mode" specifies the behavior for the results of numerical
+operations when discarding extra precision. Each rounding mode indicates
+how the least significant returned digit of a rounded result is to be
+calculated. *note table-rounding-modes:: lists the IEEE-754 defined
+rounding modes:
-
-File: gawk.info, Node: Dynamic Extensions, Next: Language History, Prev: Debugger, Up: Top
+Rounding Mode IEEE Name
+--------------------------------------------------------------------------
+Round to nearest, ties to even `roundTiesToEven'
+Round toward plus Infinity `roundTowardPositive'
+Round toward negative Infinity `roundTowardNegative'
+Round toward zero `roundTowardZero'
+Round to nearest, ties away `roundTiesToAway'
+from zero
-16 Writing Extensions for `gawk'
-********************************
+Table 16.2: IEEE 754 Rounding Modes
-This chapter is a placeholder, pending a rewrite for the new API. Some
-of the old bits remain, since they can be partially reused.
+ The default mode `roundTiesToEven' is the most preferred, but the
+least intuitive. This method does the obvious thing for most values, by
+rounding them up or down to the nearest digit. For example, rounding
+1.132 to two digits yields 1.13, and rounding 1.157 yields 1.16.
- It is possible to add new built-in functions to `gawk' using
-dynamically loaded libraries. This facility is available on systems
-(such as GNU/Linux) that support the C `dlopen()' and `dlsym()'
-functions. This major node describes how to write and use dynamically
-loaded extensions for `gawk'. Experience with programming in C or C++
-is necessary when reading this minor node.
+ However, when it comes to rounding a value that is exactly halfway
+between, things do not work the way you probably learned in school. In
+this case, the number is rounded to the nearest even digit. So
+rounding 0.125 to two digits rounds down to 0.12, but rounding 0.6875
+to three digits rounds up to 0.688. You probably have already
+encountered this rounding mode when using the `printf' routine to
+format floating-point numbers. For example:
- NOTE: When `--sandbox' is specified, extensions are disabled
- (*note Options::.
+ BEGIN {
+ x = -4.5
+ for (i = 1; i < 10; i++) {
+ x += 1.0
+ printf("%4.1f => %2.0f\n", x, x)
+ }
+ }
-* Menu:
+produces the following output when run:(1)
-* Plugin License:: A note about licensing.
-* Sample Library:: A example of new functions.
+ -3.5 => -4
+ -2.5 => -2
+ -1.5 => -2
+ -0.5 => 0
+ 0.5 => 0
+ 1.5 => 2
+ 2.5 => 2
+ 3.5 => 4
+ 4.5 => 4
-
-File: gawk.info, Node: Plugin License, Next: Sample Library, Up: Dynamic Extensions
+ The theory behind the rounding mode `roundTiesToEven' is that it
+more or less evenly distributes upward and downward rounds of exact
+halves, which might cause the round-off error to cancel itself out.
+This is the default rounding mode used in IEEE-754 computing functions
+and operators.
-16.1 Extension Licensing
-========================
+ The other rounding modes are rarely used. Round toward positive
+infinity (`roundTowardPositive') and round toward negative infinity
+(`roundTowardNegative') are often used to implement interval arithmetic,
+where you adjust the rounding mode to calculate upper and lower bounds
+for the range of output. The `roundTowardZero' mode can be used for
+converting floating-point numbers to integers. The rounding mode
+`roundTiesToAway' rounds the result to the nearest number and selects
+the number with the larger magnitude if a tie occurs.
-Every dynamic extension should define the global symbol
-`plugin_is_GPL_compatible' to assert that it has been licensed under a
-GPL-compatible license. If this symbol does not exist, `gawk' will
-emit a fatal error and exit.
+ Some numerical analysts will tell you that your choice of rounding
+style has tremendous impact on the final outcome, and advise you to
+wait until final output for any rounding. Instead, you can often avoid
+round-off error problems by setting the precision initially to some
+value sufficiently larger than the final desired precision, so that the
+accumulation of round-off error does not influence the outcome. If you
+suspect that results from your computation are sensitive to
+accumulation of round-off error, one way to be sure is to look for a
+significant difference in output when you change the rounding mode.
- The declared type of the symbol should be `int'. It does not need
-to be in any allocated section, though. The code merely asserts that
-the symbol exists in the global scope. Something like this is enough:
+ ---------- Footnotes ----------
- int plugin_is_GPL_compatible;
+ (1) It is possible for the output to be completely different if the
+C library in your system does not use the IEEE-754 even-rounding rule
+to round halfway cases for `printf()'.
-File: gawk.info, Node: Sample Library, Prev: Plugin License, Up: Dynamic
Extensions
+File: gawk.info, Node: Gawk and MPFR, Next: Arbitrary Precision Floats,
Prev: Floating-point Programming, Up: Arbitrary Precision Arithmetic
-16.2 Example: Directory and File Operation Built-ins
-====================================================
+16.3 `gawk' + MPFR = Powerful Arithmetic
+========================================
-Two useful functions that are not in `awk' are `chdir()' (so that an
-`awk' program can change its directory) and `stat()' (so that an `awk'
-program can gather information about a file). This minor node
-implements these functions for `gawk' in an external extension library.
+The rest of this major node describes how to use the arbitrary precision
+(also known as "multiple precision" or "infinite precision") numeric
+capabilities in `gawk' to produce maximally accurate results when you
+need them.
-* Menu:
+ But first you should check if your version of `gawk' supports
+arbitrary precision arithmetic. The easiest way to find out is to look
+at the output of the following command:
-* Internal File Description:: What the new functions will do.
-* Internal File Ops:: The code for internal file operations.
-* Using Internal File Ops:: How to use an external extension.
+ $ gawk --version
+ -| GNU Awk 4.1.0 (GNU MPFR 3.1.0, GNU MP 5.0.3)
+ -| Copyright (C) 1989, 1991-2012 Free Software Foundation.
+ ...
-
-File: gawk.info, Node: Internal File Description, Next: Internal File Ops,
Up: Sample Library
+ `gawk' uses the GNU MPFR (http://www.mpfr.org) and GNU MP
+(http://gmplib.org) (GMP) libraries for arbitrary precision arithmetic
+on numbers. So if you do not see the names of these libraries in the
+output, then your version of `gawk' does not support arbitrary
+precision arithmetic.
-16.2.1 Using `chdir()' and `stat()'
------------------------------------
+ Additionally, there are a few elements available in the `PROCINFO'
+array to provide information about the MPFR and GMP libraries. *Note
+Auto-set::, for more information.
-This minor node shows how to use the new functions at the `awk' level
-once they've been integrated into the running `gawk' interpreter.
-Using `chdir()' is very straightforward. It takes one argument, the new
-directory to change to:
+
+File: gawk.info, Node: Arbitrary Precision Floats, Next: Arbitrary Precision
Integers, Prev: Gawk and MPFR, Up: Arbitrary Precision Arithmetic
- ...
- newdir = "/home/arnold/funstuff"
- ret = chdir(newdir)
- if (ret < 0) {
- printf("could not change to %s: %s\n",
- newdir, ERRNO) > "/dev/stderr"
- exit 1
- }
- ...
+16.4 Arbitrary Precision Floating-point Arithmetic with `gawk'
+==============================================================
- The return value is negative if the `chdir' failed, and `ERRNO'
-(*note Built-in Variables::) is set to a string indicating the error.
+`gawk' uses the GNU MPFR library for arbitrary precision floating-point
+arithmetic. The MPFR library provides precise control over precisions
+and rounding modes, and gives correctly rounded reproducible
+platform-independent results. With the command-line option `--bignum'
+or `-M', all floating-point arithmetic operators and numeric functions
+can yield results to any desired precision level supported by MPFR.
+Two built-in variables `PREC' (*note Setting Precision::) and
+`ROUNDMODE' (*note Setting Rounding Mode::) provide control over the
+working precision and the rounding mode. The precision and the
+rounding mode are set globally for every operation to follow.
- Using `stat()' is a bit more complicated. The C `stat()' function
-fills in a structure that has a fair amount of information. The right
-way to model this in `awk' is to fill in an associative array with the
-appropriate information:
+ The default working precision for arbitrary precision floating-point
+values is 53, and the default value for `ROUNDMODE' is `"N"', which
+selects the IEEE-754 `roundTiesToEven' (*note Rounding Mode::) rounding
+mode.(1) `gawk' uses the default exponent range in MPFR (EMAX = 2^30 -
+1, EMIN = -EMAX) for all floating-point contexts. There is no explicit
+mechanism to adjust the exponent range. MPFR does not implement
+subnormal numbers by default, and this behavior cannot be changed in
+`gawk'.
- file = "/home/arnold/.profile"
- fdata[1] = "x" # force `fdata' to be an array
- ret = stat(file, fdata)
- if (ret < 0) {
- printf("could not stat %s: %s\n",
- file, ERRNO) > "/dev/stderr"
- exit 1
- }
- printf("size of %s is %d bytes\n", file, fdata["size"])
+ NOTE: When emulating an IEEE-754 format (*note Setting
+ Precision::), `gawk' internally adjusts the exponent range to the
+ value defined for the format and also performs computations needed
+ for gradual underflow (subnormal numbers).
- The `stat()' function always clears the data array, even if the
-`stat()' fails. It fills in the following elements:
+ NOTE: MPFR numbers are variable-size entities, consuming only as
+ much space as needed to store the significant digits. Since the
+ performance using MPFR numbers pales in comparison to doing
+ arithmetic using the underlying machine types, you should consider
+ using only as much precision as needed by your program.
-`"name"'
- The name of the file that was `stat()''ed.
+* Menu:
-`"dev"'
-`"ino"'
- The file's device and inode numbers, respectively.
+* Setting Precision:: Setting the working precision.
+* Setting Rounding Mode:: Setting the rounding mode.
+* Floating-point Constants:: Representing floating-point constants.
+* Changing Precision:: Changing the precision of a number.
+* Exact Arithmetic:: Exact arithmetic with floating-point numbers.
-`"mode"'
- The file's mode, as a numeric value. This includes both the file's
- type and its permissions.
+ ---------- Footnotes ----------
-`"nlink"'
- The number of hard links (directory entries) the file has.
+ (1) The default precision is 53, since according to the MPFR
+documentation, the library should be able to exactly reproduce all
+computations with double-precision machine floating-point numbers
+(`double' type in C), except the default exponent range is much wider
+and subnormal numbers are not implemented.
-`"uid"'
-`"gid"'
- The numeric user and group ID numbers of the file's owner.
+
+File: gawk.info, Node: Setting Precision, Next: Setting Rounding Mode, Up:
Arbitrary Precision Floats
-`"size"'
- The size in bytes of the file.
+16.4.1 Setting the Working Precision
+------------------------------------
-`"blocks"'
- The number of disk blocks the file actually occupies. This may not
- be a function of the file's size if the file has holes.
+`gawk' uses a global working precision; it does not keep track of the
+precision or accuracy of individual numbers. Performing an arithmetic
+operation or calling a built-in function rounds the result to the
+current working precision. The default working precision is 53 bits,
+which can be modified using the built-in variable `PREC'. You can also set the
+value to one of the following pre-defined case-insensitive strings to
+emulate an IEEE-754 binary format:
-`"atime"'
-`"mtime"'
-`"ctime"'
- The file's last access, modification, and inode update times,
- respectively. These are numeric timestamps, suitable for
- formatting with `strftime()' (*note Built-in::).
+`PREC' IEEE-754 Binary Format
+---------------------------------------------------
+`"half"' 16-bit half-precision.
+`"single"' Basic 32-bit single precision.
+`"double"' Basic 64-bit double precision.
+`"quad"' Basic 128-bit quadruple precision.
+`"oct"' 256-bit octuple precision.
-`"pmode"'
- The file's "printable mode." This is a string representation of
- the file's type and permissions, such as what is produced by `ls
- -l'--for example, `"drwxr-xr-x"'.
+ The following example illustrates the effects of changing precision
+on arithmetic operations:
-`"type"'
- A printable string representation of the file's type. The value
- is one of the following:
+ $ gawk -M -vPREC=100 'BEGIN { x = 1.0e-400; print x + 0; \
+ > PREC = "double"; print x + 0 }'
+ -| 1e-400
+ -| 0
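The same underflow can be observed without an MPFR-enabled `gawk'. This is a rough sketch in Python, assuming that Python's `float' is an IEEE-754 double and that `Decimal' stands in for a format with a wider exponent range. It mirrors the exponent-range effect only, not MPFR's binary precision.

```python
from decimal import Decimal

# 1e-400 is smaller than the smallest subnormal double, so the
# conversion to a double underflows to zero.
tiny_as_double = float("1e-400")

# Decimal has a much wider exponent range, so the value survives.
tiny_as_decimal = Decimal("1e-400")

print(tiny_as_double)
print(tiny_as_decimal)
```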
- `"blockdev"'
- `"chardev"'
- The file is a block or character device ("special file").
+ Binary and decimal precisions are related approximately according to
+the formula:
- `"directory"'
- The file is a directory.
+ PREC = 3.322 * DPS
- `"fifo"'
- The file is a named-pipe (also known as a FIFO).
+Here, PREC denotes the binary precision (measured in bits) and DPS
+(short for decimal places) is the number of decimal digits. We can easily
+calculate how many decimal digits the 53-bit significand of an IEEE
+double is equivalent to: 53 / 3.322 which is equal to about 15.95. But
+what does 15.95 digits actually mean? It depends whether you are
+concerned about how many digits you can rely on, or how many digits you
+need.
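The conversion factor in the formula is the base-2 logarithm of 10, which the text rounds to 3.322. A small sketch in Python (the function names `bits_for_digits()' and `digits_for_bits()' are made up for illustration) shows the arithmetic in both directions:

```python
import math

# log2(10) is about 3.3219; the formula in the text rounds it to 3.322.
BITS_PER_DIGIT = math.log2(10)

def bits_for_digits(dps):
    """Approximate binary precision needed for `dps` decimal digits."""
    return dps * BITS_PER_DIGIT

def digits_for_bits(prec):
    """Approximate number of decimal digits carried by `prec` bits."""
    return prec / BITS_PER_DIGIT

print(round(digits_for_bits(53), 2))   # 15.95
print(round(bits_for_digits(100)))     # 332
```

Both results are estimates. In practice it is safer to round the bit count up and add a few guard bits.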
- `"file"'
- The file is just a regular file.
+ It is important to know how many decimal digits it takes to uniquely
+identify a double-precision value (the C type `double'). If you want to convert
+from `double' to decimal and back to `double' (e.g., saving a `double'
+representing an intermediate result to a file, and later reading it
+back to restart the computation), then a few more decimal digits are
+required. 17 digits is generally enough for a `double'.
- `"socket"'
- The file is an `AF_UNIX' ("Unix domain") socket in the
- filesystem.
+ It can also be important to know what decimal numbers can be uniquely
+represented with a `double'. If you want to convert from decimal to
+`double' and back again, 15 digits is the most that you can get. Stated
+differently, you should not present the numbers from your
+floating-point computations with more than 15 significant digits in
+them.
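The 17-digit and 15-digit rules can be checked directly on machine doubles. The sketch below uses Python, whose `float' type is an IEEE-754 double. The sample values are arbitrary choices for illustration.

```python
x = 0.1 + 0.2                 # an intermediate result held in a double

as_15 = "%.15g" % x           # 15 significant digits
as_17 = "%.17g" % x           # 17 significant digits

# double -> decimal -> double: 15 digits lose information, 17 do not.
print(as_15, float(as_15) == x)
print(as_17, float(as_17) == x)

# decimal -> double -> decimal: a 15-digit decimal string survives.
s = "0.123456789012345"
print("%.15g" % float(s) == s)
```

Here the 15-digit rendering is `0.3', which reads back as a different double from the one that was written out.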
- `"symlink"'
- The file is a symbolic link.
+ Conversely, it takes a precision of 332 bits to hold an approximation
+of the constant pi that is accurate to 100 decimal places. You should
+always add some extra bits in order to avoid the confusing round-off
+issues that occur because numbers are stored internally in binary.
- Several additional elements may be present depending upon the
-operating system and the type of the file. You can test for them in
-your `awk' program by using the `in' operator (*note Reference to
-Elements::):
+
+File: gawk.info, Node: Setting Rounding Mode, Next: Floating-point
Constants, Prev: Setting Precision, Up: Arbitrary Precision Floats
-`"blksize"'
- The preferred block size for I/O to the file. This field is not
- present on all POSIX-like systems in the C `stat' structure.
+16.4.2 Setting the Rounding Mode
+--------------------------------
-`"linkval"'
- If the file is a symbolic link, this element is the name of the
- file the link points to (i.e., the value of the link).
+The `ROUNDMODE' variable provides program level control over the
+rounding mode. The correspondence between `ROUNDMODE' and the IEEE
+rounding modes is shown in *note table-gawk-rounding-modes::.
-`"rdev"'
-`"major"'
-`"minor"'
- If the file is a block or character device file, then these values
- represent the numeric device number and the major and minor
- components of that number, respectively.
+Rounding Mode IEEE Name `ROUNDMODE'
+---------------------------------------------------------------------------
+Round to nearest, ties to even `roundTiesToEven' `"N"' or `"n"'
+Round toward plus Infinity `roundTowardPositive' `"U"' or `"u"'
+Round toward negative Infinity `roundTowardNegative' `"D"' or `"d"'
+Round toward zero `roundTowardZero' `"Z"' or `"z"'
+Round to nearest, ties away `roundTiesToAway' `"A"' or `"a"'
+from zero
-
-File: gawk.info, Node: Internal File Ops, Next: Using Internal File Ops,
Prev: Internal File Description, Up: Sample Library
+Table 16.3: `gawk' Rounding Modes
-16.2.2 C Code for `chdir()' and `stat()'
-----------------------------------------
+ `ROUNDMODE' has the default value `"N"', which selects the IEEE-754
+rounding mode `roundTiesToEven'. Of the values listed in *note
+Table 16.3: table-gawk-rounding-modes, `"A"' selects the IEEE-754 mode
+`roundTiesToAway' only if your version of the MPFR library supports
+it; otherwise setting `ROUNDMODE' to this value has no effect. *Note
+Rounding Mode::, for the meanings of the various rounding modes.
-Here is the C code for these extensions. They were written for
-GNU/Linux. The code needs some more work for complete portability to
-other POSIX-compliant systems:(1)
+ Here is an example of how to change the default rounding behavior of
+`printf''s output:
- #include "awk.h"
+ $ gawk -M -vROUNDMODE="Z" 'BEGIN { printf("%.2f\n", 1.378) }'
+ -| 1.37
- #include <sys/sysmacros.h>
+
+File: gawk.info, Node: Floating-point Constants, Next: Changing Precision,
Prev: Setting Rounding Mode, Up: Arbitrary Precision Floats
- int plugin_is_GPL_compatible;
+16.4.3 Representing Floating-point Constants
+--------------------------------------------
- /* do_chdir --- provide dynamically loaded chdir() builtin for gawk */
+Be wary of floating-point constants! When reading a floating-point
+constant from program source code, `gawk' uses the default precision,
+unless overridden by an assignment to the special variable `PREC' on
+the command line, to store it internally as an MPFR number. Changing
+the precision using `PREC' in the program text does not change the
+precision of a constant. If you need to represent a floating-point
+constant at a higher precision than the default and cannot use a
+command line assignment to `PREC', you should either specify the
+constant as a string, or as a rational number whenever possible. The
+following example illustrates the differences among various ways to
+print a floating-point constant:
- static NODE *
- do_chdir(int nargs)
- {
- NODE *newdir;
- int ret = -1;
+ $ gawk -M 'BEGIN { PREC = 113; printf("%0.25f\n", 0.1) }'
+ -| 0.1000000000000000055511151
+ $ gawk -M -vPREC=113 'BEGIN { printf("%0.25f\n", 0.1) }'
+ -| 0.1000000000000000000000000
+ $ gawk -M 'BEGIN { PREC = 113; printf("%0.25f\n", "0.1") }'
+ -| 0.1000000000000000000000000
+ $ gawk -M 'BEGIN { PREC = 113; printf("%0.25f\n", 1/10) }'
+ -| 0.1000000000000000000000000
- if (do_lint && nargs != 1)
- lintwarn("chdir: called with incorrect number of arguments");
+ In the first case, the number is stored with the default precision
+of 53.
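The difference between a constant stored as a 53-bit double and the same constant given as a string or as a rational can be reproduced without `gawk'. The following Python sketch assumes that `Decimal' and `Fraction' are reasonable stand-ins for a higher-precision representation. It illustrates the idea and is not how `gawk' stores numbers.

```python
from decimal import Decimal
from fractions import Fraction

from_double = Decimal(0.1)     # exact value of the double nearest to 0.1
from_string = Decimal("0.1")   # exactly one tenth, parsed from text
from_ratio = Fraction(1, 10)   # exactly one tenth, as a rational

print(format(from_double, ".25f"))
print(format(from_string, ".25f"))
print(from_ratio == Fraction(from_string))
print(from_ratio == Fraction(0.1))
```

The first line of output shows the same trailing digits as the first `gawk' example above, because both start from the same 53-bit approximation of 0.1.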
- newdir = get_scalar_argument(0, FALSE);
+
+File: gawk.info, Node: Changing Precision, Next: Exact Arithmetic, Prev:
Floating-point Constants, Up: Arbitrary Precision Floats
- The file includes the `"awk.h"' header file for definitions for the
-`gawk' internals. It includes `<sys/sysmacros.h>' for access to the
-`major()' and `minor'() macros.
+16.4.4 Changing the Precision of a Number
+-----------------------------------------
- By convention, for an `awk' function `foo', the function that
-implements it is called `do_foo'. The function should take a `int'
-argument, usually called `nargs', that represents the number of defined
-arguments for the function. The `newdir' variable represents the new
-directory to change to, retrieved with `get_scalar_argument()'. Note
-that the first argument is numbered zero.
+ The point is that in any variable-precision package, a decision is
+ made on how to treat numbers given as data, or arising in
+ intermediate results, which are represented in floating-point
+ format to a precision lower than working precision. Do we promote
+ them to full membership of the high-precision club, or do we treat
+ them and all their associates as second-class citizens? Sometimes
+ the first course is proper, sometimes the second, and it takes
+ careful analysis to tell which.
- This code actually accomplishes the `chdir()'. It first forces the
-argument to be a string and passes the string value to the `chdir()'
-system call. If the `chdir()' fails, `ERRNO' is updated.
+ Dirk Laurie(1)
- (void) force_string(newdir);
- ret = chdir(newdir->stptr);
- if (ret < 0)
- update_ERRNO_int(errno);
+ `gawk' does not implicitly modify the precision of any previously
+computed results when the working precision is changed with an
+assignment to `PREC'. The precision of a number is always the one that
+was used at the time of its creation, and there is no way for the user
+to explicitly change it afterwards. However, since the result of a
+floating-point arithmetic operation is always an arbitrary precision
+floating-point value--with a precision set by the value of `PREC'--one
+of the following workarounds effectively accomplishes the desired
+behavior:
- Finally, the function returns the return value to the `awk' level:
+ x = x + 0.0
- return make_number((AWKNUM) ret);
- }
+or:
- The `stat()' built-in is more involved. First comes a function that
-turns a numeric mode into a printable representation (e.g., 644 becomes
-`-rw-r--r--'). This is omitted here for brevity:
+ x += 0.0
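The idea that a stored number keeps its original precision until an arithmetic operation touches it can be illustrated by analogy. This sketch uses Python's `decimal' module, where the working precision is counted in decimal digits rather than bits. It is an analogy under that assumption, not a description of MPFR internals.

```python
from decimal import Decimal, localcontext

# Construction from a string is exact; the context precision is not applied.
x = Decimal("3.14159265358979")

with localcontext() as ctx:
    ctx.prec = 6          # working precision: 6 significant digits
    unchanged = x         # merely referring to x does not round it
    rounded = x + 0       # an arithmetic operation rounds to ctx.prec

print(unchanged)
print(rounded)
```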
- /* format_mode --- turn a stat mode field into something readable */
+ ---------- Footnotes ----------
- static char *
- format_mode(unsigned long fmode)
- {
- ...
- }
+ (1) Dirk Laurie. `Variable-precision Arithmetic Considered Perilous
+-- A Detective Story'. Electronic Transactions on Numerical Analysis.
+Volume 28, pp. 168-173, 2008.
- Next comes the `do_stat()' function. It starts with variable
-declarations and argument checking:
+
+File: gawk.info, Node: Exact Arithmetic, Prev: Changing Precision, Up:
Arbitrary Precision Floats
- /* do_stat --- provide a stat() function for gawk */
+16.4.5 Exact Arithmetic with Floating-point Numbers
+---------------------------------------------------
- static NODE *
- do_stat(int nargs)
- {
- NODE *file, *array, *tmp;
- struct stat sbuf;
- int ret;
- NODE **aptr;
- char *pmode; /* printable mode */
- char *type = "unknown";
+ CAUTION: Never depend on the exactness of floating-point
+ arithmetic, even for apparently simple expressions!
- if (do_lint && nargs > 2)
- lintwarn("stat: called with too many arguments");
+ Can arbitrary precision arithmetic give exact results? There are no
+easy answers. The standard rules of algebra often do not apply when
+using floating-point arithmetic. Among other things, the distributive
+and associative laws do not hold completely, and order of operation may
+be important for your computation. Rounding error, cumulative precision
+loss and underflow are often troublesome.
- Then comes the actual work. First, the function gets the arguments.
-Then, it always clears the array. The code use `lstat()' (instead of
-`stat()') to get the file information, in case the file is a symbolic
-link. If there's an error, it sets `ERRNO' and returns:
+ When `gawk' tests the expressions `0.1 + 12.2' and `12.3' for
+equality using the machine double precision arithmetic, it decides that
+they are not equal! (*Note Floating-point Programming::.) You can get
+the result you want by increasing the precision; 56 in this case will
+get the job done:
- /* file is first arg, array to hold results is second */
- file = get_scalar_argument(0, FALSE);
- array = get_array_argument(1, FALSE);
+ $ gawk -M -vPREC=56 'BEGIN { print (0.1 + 12.2 == 12.3) }'
+ -| 1
- /* empty out the array */
- assoc_clear(array);
+ If adding more bits is good, perhaps adding even more bits of
+precision is better? Here is what happens if we use an even larger
+value of `PREC':
- /* lstat the file, if error, set ERRNO and return */
- (void) force_string(file);
- ret = lstat(file->stptr, & sbuf);
- if (ret < 0) {
- update_ERRNO_int(errno);
- return make_number((AWKNUM) ret);
- }
+ $ gawk -M -vPREC=201 'BEGIN { print (0.1 + 12.2 == 12.3) }'
+ -| 0
- Now comes the tedious part: filling in the array. Only a few of the
-calls are shown here, since they all follow the same pattern:
+ This is not a bug in `gawk' or in the MPFR library. It is easy to
+forget that the finite number of bits used to store the value is often
+just an approximation after proper rounding. The test for equality
+succeeds if and only if _all_ bits in the two operands are exactly the
+same. Since this is not necessarily true after floating-point
+computations with a particular precision and effective rounding rule, a
+straight test for equality may not work.
- /* fill in the array */
- aptr = assoc_lookup(array, tmp = make_string("name", 4));
- *aptr = dupnode(file);
- unref(tmp);
+ So, don't assume that floating-point values can be compared for
+equality. You should also exercise caution when using other forms of
+comparisons. The standard way to compare floating-point
+numbers is to determine how much error (or "tolerance") you will allow
+in a comparison and check to see if one value is within this error
+range of the other.
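A tolerance-based comparison might look like the following sketch. It is written in Python, whose `float' is the same machine double that `gawk' uses without `-M'. The helper `nearly_equal()' and the tolerance of 1e-9 are arbitrary choices for illustration, and a suitable tolerance depends on the problem at hand.

```python
import math

a = 0.1 + 12.2
b = 12.3

def nearly_equal(x, y, tolerance=1e-9):
    """True when x and y differ by no more than `tolerance` (absolute error)."""
    return abs(x - y) <= tolerance

print(a == b)                              # False
print(nearly_equal(a, b))                  # True
print(math.isclose(a, b, rel_tol=1e-9))    # True
```

An absolute tolerance suits values of known magnitude. A relative tolerance, as used by `math.isclose()', scales with the operands.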
- aptr = assoc_lookup(array, tmp = make_string("mode", 4));
- *aptr = make_number((AWKNUM) sbuf.st_mode);
- unref(tmp);
+ In applications where 15 or fewer decimal places suffice, hardware
+double precision arithmetic can be adequate, and is usually much faster.
+But you do need to keep in mind that every floating-point operation can
+suffer a new rounding error with catastrophic consequences as
+illustrated by our attempt to compute the value of the constant pi
+(*note Floating-point Programming::). Extra precision can greatly
+enhance the stability and the accuracy of your computation in such
+cases.
- aptr = assoc_lookup(array, tmp = make_string("pmode", 5));
- pmode = format_mode(sbuf.st_mode);
- *aptr = make_string(pmode, strlen(pmode));
- unref(tmp);
+ Repeated addition is not necessarily equivalent to multiplication in
+floating-point arithmetic. In the example in *note Floating-point
+Programming:::
- When done, return the `lstat()' return value:
+ $ gawk 'BEGIN {
+ > for (d = 1.1; d <= 1.5; d += 0.1)
+ > i++
+ > print i
+ > }'
+ -| 4
+you may or may not succeed in getting the correct result by choosing an
+arbitrarily large value for `PREC'. Reformulation of the problem at
+hand is often the correct approach in such situations.
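One possible reformulation is to drive the loop with an integer counter and derive the floating-point value from it, so that no round-off error accumulates. The sketch below is in Python with machine doubles. The function names are illustrative only.

```python
def count_by_accumulation():
    """Repeatedly add 0.1, as in the example above."""
    count = 0
    d = 1.1
    while d <= 1.5:
        count += 1
        d += 0.1          # round-off error accumulates in d
    return count

def count_by_integer_steps():
    """Step an integer counter and derive d from it each time."""
    count = 0
    for tenths in range(11, 16):   # 11, 12, ..., 15
        d = tenths / 10            # one rounding per value, no accumulation
        if d <= 1.5:
            count += 1
    return count

print(count_by_accumulation())    # 4
print(count_by_integer_steps())   # 5
```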
- return make_number((AWKNUM) ret);
- }
+
+File: gawk.info, Node: Arbitrary Precision Integers, Prev: Arbitrary
Precision Floats, Up: Arbitrary Precision Arithmetic
- Finally, it's necessary to provide the "glue" that loads the new
-function(s) into `gawk'. By convention, each library has a routine
-named `dl_load()' that does the job. The simplest way is to use the
-`dl_load_func' macro in `gawkapi.h'.
+16.5 Arbitrary Precision Integer Arithmetic with `gawk'
+=======================================================
- And that's it! As an exercise, consider adding functions to
-implement system calls such as `chown()', `chmod()', and `umask()'.
+If the option `--bignum' or `-M' is specified, `gawk' performs all
+integer arithmetic using GMP arbitrary precision integers. Any number
+that looks like an integer in a program source or data file is stored
+as an arbitrary precision integer. The size of the integer is limited
+only by your computer's memory. The current floating-point context has
+no effect on operations involving integers. For example, the following
+computes 5^4^3^2, the result of which is beyond the limits of ordinary
+`gawk' numbers:
- ---------- Footnotes ----------
+ $ gawk -M 'BEGIN {
+ > x = 5^4^3^2
+ > print "# of digits =", length(x)
+ > print substr(x, 1, 20), "...", substr(x, length(x) - 19, 20)
+ > }'
+ -| # of digits = 183231
+ -| 62060698786608744707 ... 92256259918212890625
- (1) This version is edited slightly for presentation. See
-`extension/filefuncs.c' in the `gawk' distribution for the complete
-version.
+ If you were to compute the same value using arbitrary precision
+floating-point values instead, the precision needed for correct output
+(using the formula `prec = 3.322 * dps'), would be 3.322 x 183231, or
+608693. (Thus, the floating-point representation requires a working
+precision of more than 600,000 bits!)
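The digit count and the leading and trailing digits shown above can be cross-checked with any arbitrary precision integer package. Here is a sketch in Python, whose built-in integers are unbounded. It avoids converting the whole number to a string and is not meant to reflect how `gawk' or GMP compute the result.

```python
import math

# ** is right-associative, so this is 5 ** (4 ** (3 ** 2)).
x = 5 ** 4 ** 3 ** 2

digits = math.floor(math.log10(x)) + 1   # number of decimal digits
leading = x // 10 ** (digits - 20)       # first 20 digits
trailing = x % 10 ** 20                  # last 20 digits
bits_needed = x.bit_length()             # bits required to hold x exactly

print("# of digits =", digits)
print(leading, "...", trailing)
print("bits =", bits_needed)
```

The exact bit length comes out slightly below the 608693 estimate, since the formula uses the rounded factor 3.322.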
-
-File: gawk.info, Node: Using Internal File Ops, Prev: Internal File Ops,
Up: Sample Library
+ The result from an arithmetic operation with an integer and a
+floating-point value is a floating-point value with a precision equal
+to the working precision. The following program calculates the eighth
+term in Sylvester's sequence(1) using a recurrence:
-16.2.3 Integrating the Extensions
----------------------------------
+ $ gawk -M 'BEGIN {
+ > s = 2.0
+ > for (i = 1; i <= 7; i++)
+ > s = s * (s - 1) + 1
+ > print s
+ > }'
+ -| 113423713055421845118910464
-Now that the code is written, it must be possible to add it at runtime
-to the running `gawk' interpreter. First, the code must be compiled.
-Assuming that the functions are in a file named `filefuncs.c', and IDIR
-is the location of the `gawk' include files, the following steps create
-a GNU/Linux shared library:
+ The output differs from the actual number,
+113423713055421844361000443, because the default precision of 53 is not
+enough to represent the floating-point results exactly. You can either
+increase the precision (100 is enough in this case), or replace the
+floating-point constant `2.0' with an integer, to perform all
+computations using integer arithmetic to get the correct output.
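The effect of starting the recurrence from `2.0' instead of `2' can be reproduced with machine doubles. The following Python sketch assumes that Python's `float' behaves like a 53-bit working precision. The function `sylvester()' is an illustrative name.

```python
def sylvester(start, steps=7):
    """Apply s = s * (s - 1) + 1 `steps` times, starting from `start`."""
    s = start
    for _ in range(steps):
        s = s * (s - 1) + 1
    return s

exact = sylvester(2)       # integer arithmetic throughout
approx = sylvester(2.0)    # double precision arithmetic throughout

print(exact)               # 113423713055421844361000443
print(int(approx))
print(int(approx) == exact)
```

The two results agree in their leading digits only, because the last multiplication produces a value that needs more than 53 bits.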
- $ gcc -fPIC -shared -DHAVE_CONFIG_H -c -O -g -IIDIR filefuncs.c
- $ ld -o filefuncs.so -shared filefuncs.o
+ It will sometimes be necessary for `gawk' to implicitly convert an
+arbitrary precision integer into an arbitrary precision floating-point
+value. This is primarily because the MPFR library does not always
+provide the relevant interface to process arbitrary precision integers
+or mixed-mode numbers as needed by an operation or function. In such a
+case, the precision is set to the minimum value necessary for exact
+conversion, and the working precision is not used for this purpose. If
+this is not what you need or want, you can employ a subterfuge like
+this:
- Once the library exists, it is loaded by calling the `extension()'
-built-in function. This function takes two arguments: the name of the
-library to load and the name of a function to call when the library is
-first loaded. This function adds the new functions to `gawk'. It
-returns the value returned by the initialization function within the
-shared library:
+ gawk -M 'BEGIN { n = 13; print (n + 0.0) % 2.0 }'
- # file testff.awk
- BEGIN {
- extension("./filefuncs.so", "dl_load")
+ You can avoid this issue altogether by specifying the number as a
+floating-point value to begin with:
- chdir(".") # no-op
+ gawk -M 'BEGIN { n = 13.0; print n % 2.0 }'
- data[1] = 1 # force `data' to be an array
- print "Info for testff.awk"
- ret = stat("testff.awk", data)
- print "ret =", ret
- for (i in data)
- printf "data[\"%s\"] = %s\n", i, data[i]
- print "testff.awk modified:",
- strftime("%m %d %y %H:%M:%S", data["mtime"])
+ Note that for the particular example above, it is likely best to
+just use the following:
- print "\nInfo for JUNK"
- ret = stat("JUNK", data)
- print "ret =", ret
- for (i in data)
- printf "data[\"%s\"] = %s\n", i, data[i]
- print "JUNK modified:", strftime("%m %d %y %H:%M:%S", data["mtime"])
- }
+ gawk -M 'BEGIN { n = 13; print n % 2 }'
- Here are the results of running the program:
+ ---------- Footnotes ----------
- $ gawk -f testff.awk
- -| Info for testff.awk
- -| ret = 0
- -| data["size"] = 607
- -| data["ino"] = 14945891
- -| data["name"] = testff.awk
- -| data["pmode"] = -rw-rw-r--
- -| data["nlink"] = 1
- -| data["atime"] = 1293993369
- -| data["mtime"] = 1288520752
- -| data["mode"] = 33204
- -| data["blksize"] = 4096
- -| data["dev"] = 2054
- -| data["type"] = file
- -| data["gid"] = 500
- -| data["uid"] = 500
- -| data["blocks"] = 8
- -| data["ctime"] = 1290113572
- -| testff.awk modified: 10 31 10 12:25:52
- -|
- -| Info for JUNK
- -| ret = -1
- -| JUNK modified: 01 01 70 02:00:00
+ (1) Weisstein, Eric W. `Sylvester's Sequence'. From MathWorld--A
+Wolfram Web Resource.
+`http://mathworld.wolfram.com/SylvestersSequence.html'
File: gawk.info, Node: Language History, Next: Installation, Prev: Dynamic
Extensions, Up: Top
@@ -28426,441 +28429,441 @@ Node: History39607
Node: Names41998
Ref: Names-Footnote-143475
Node: This Manual43547
-Ref: This Manual-Footnote-148451
-Node: Conventions48551
-Node: Manual History50685
-Ref: Manual History-Footnote-153955
-Ref: Manual History-Footnote-253996
-Node: How To Contribute54070
-Node: Acknowledgments55214
-Node: Getting Started59710
-Node: Running gawk62089
-Node: One-shot63275
-Node: Read Terminal64500
-Ref: Read Terminal-Footnote-166150
-Ref: Read Terminal-Footnote-266426
-Node: Long66597
-Node: Executable Scripts67973
-Ref: Executable Scripts-Footnote-169842
-Ref: Executable Scripts-Footnote-269944
-Node: Comments70491
-Node: Quoting72958
-Node: DOS Quoting77581
-Node: Sample Data Files78256
-Node: Very Simple81288
-Node: Two Rules85887
-Node: More Complex88034
-Ref: More Complex-Footnote-190964
-Node: Statements/Lines91049
-Ref: Statements/Lines-Footnote-195511
-Node: Other Features95776
-Node: When96704
-Node: Invoking Gawk98851
-Node: Command Line100312
-Node: Options101095
-Ref: Options-Footnote-1116493
-Node: Other Arguments116518
-Node: Naming Standard Input119176
-Node: Environment Variables120270
-Node: AWKPATH Variable120828
-Ref: AWKPATH Variable-Footnote-1123586
-Node: AWKLIBPATH Variable123846
-Node: Other Environment Variables124443
-Node: Exit Status126938
-Node: Include Files127613
-Node: Loading Shared Libraries131182
-Node: Obsolete132407
-Node: Undocumented133104
-Node: Regexp133347
-Node: Regexp Usage134736
-Node: Escape Sequences136762
-Node: Regexp Operators142525
-Ref: Regexp Operators-Footnote-1149905
-Ref: Regexp Operators-Footnote-2150052
-Node: Bracket Expressions150150
-Ref: table-char-classes152040
-Node: GNU Regexp Operators154563
-Node: Case-sensitivity158286
-Ref: Case-sensitivity-Footnote-1161254
-Ref: Case-sensitivity-Footnote-2161489
-Node: Leftmost Longest161597
-Node: Computed Regexps162798
-Node: Reading Files166208
-Node: Records168211
-Ref: Records-Footnote-1177135
-Node: Fields177172
-Ref: Fields-Footnote-1180205
-Node: Nonconstant Fields180291
-Node: Changing Fields182493
-Node: Field Separators188474
-Node: Default Field Splitting191103
-Node: Regexp Field Splitting192220
-Node: Single Character Fields195562
-Node: Command Line Field Separator196621
-Node: Field Splitting Summary200062
-Ref: Field Splitting Summary-Footnote-1203254
-Node: Constant Size203355
-Node: Splitting By Content207939
-Ref: Splitting By Content-Footnote-1211665
-Node: Multiple Line211705
-Ref: Multiple Line-Footnote-1217552
-Node: Getline217731
-Node: Plain Getline219947
-Node: Getline/Variable222036
-Node: Getline/File223177
-Node: Getline/Variable/File224499
-Ref: Getline/Variable/File-Footnote-1226098
-Node: Getline/Pipe226185
-Node: Getline/Variable/Pipe228745
-Node: Getline/Coprocess229852
-Node: Getline/Variable/Coprocess231095
-Node: Getline Notes231809
-Node: Getline Summary234596
-Ref: table-getline-variants235004
-Node: Read Timeout235860
-Ref: Read Timeout-Footnote-1239605
-Node: Command line directories239662
-Node: Printing240292
-Node: Print241923
-Node: Print Examples243260
-Node: Output Separators246044
-Node: OFMT247804
-Node: Printf249162
-Node: Basic Printf250068
-Node: Control Letters251607
-Node: Format Modifiers255419
-Node: Printf Examples261428
-Node: Redirection264143
-Node: Special Files271127
-Node: Special FD271660
-Ref: Special FD-Footnote-1275285
-Node: Special Network275359
-Node: Special Caveats276209
-Node: Close Files And Pipes277005
-Ref: Close Files And Pipes-Footnote-1284028
-Ref: Close Files And Pipes-Footnote-2284176
-Node: Expressions284326
-Node: Values285458
-Node: Constants286134
-Node: Scalar Constants286814
-Ref: Scalar Constants-Footnote-1287673
-Node: Nondecimal-numbers287855
-Node: Regexp Constants290914
-Node: Using Constant Regexps291389
-Node: Variables294444
-Node: Using Variables295099
-Node: Assignment Options296823
-Node: Conversion298695
-Ref: table-locale-affects304071
-Ref: Conversion-Footnote-1304695
-Node: All Operators304804
-Node: Arithmetic Ops305434
-Node: Concatenation307939
-Ref: Concatenation-Footnote-1310732
-Node: Assignment Ops310852
-Ref: table-assign-ops315840
-Node: Increment Ops317248
-Node: Truth Values and Conditions320718
-Node: Truth Values321801
-Node: Typing and Comparison322850
-Node: Variable Typing323639
-Ref: Variable Typing-Footnote-1327536
-Node: Comparison Operators327658
-Ref: table-relational-ops328068
-Node: POSIX String Comparison331617
-Ref: POSIX String Comparison-Footnote-1332573
-Node: Boolean Ops332711
-Ref: Boolean Ops-Footnote-1336789
-Node: Conditional Exp336880
-Node: Function Calls338612
-Node: Precedence342206
-Node: Locales345875
-Node: Patterns and Actions346964
-Node: Pattern Overview348018
-Node: Regexp Patterns349687
-Node: Expression Patterns350230
-Node: Ranges353915
-Node: BEGIN/END356881
-Node: Using BEGIN/END357643
-Ref: Using BEGIN/END-Footnote-1360374
-Node: I/O And BEGIN/END360480
-Node: BEGINFILE/ENDFILE362762
-Node: Empty365666
-Node: Using Shell Variables365982
-Node: Action Overview368267
-Node: Statements370624
-Node: If Statement372478
-Node: While Statement373977
-Node: Do Statement376021
-Node: For Statement377177
-Node: Switch Statement380329
-Node: Break Statement382426
-Node: Continue Statement384416
-Node: Next Statement386209
-Node: Nextfile Statement388599
-Node: Exit Statement391144
-Node: Built-in Variables393560
-Node: User-modified394655
-Ref: User-modified-Footnote-1403010
-Node: Auto-set403072
-Ref: Auto-set-Footnote-1412980
-Node: ARGC and ARGV413185
-Node: Arrays417036
-Node: Array Basics418541
-Node: Array Intro419367
-Node: Reference to Elements423685
-Node: Assigning Elements425955
-Node: Array Example426446
-Node: Scanning an Array428178
-Node: Controlling Scanning430492
-Ref: Controlling Scanning-Footnote-1435425
-Node: Delete435741
-Ref: Delete-Footnote-1438176
-Node: Numeric Array Subscripts438233
-Node: Uninitialized Subscripts440416
-Node: Multi-dimensional442044
-Node: Multi-scanning445138
-Node: Arrays of Arrays446729
-Node: Functions451374
-Node: Built-in452196
-Node: Calling Built-in453274
-Node: Numeric Functions455262
-Ref: Numeric Functions-Footnote-1459094
-Ref: Numeric Functions-Footnote-2459451
-Ref: Numeric Functions-Footnote-3459499
-Node: String Functions459768
-Ref: String Functions-Footnote-1483265
-Ref: String Functions-Footnote-2483394
-Ref: String Functions-Footnote-3483642
-Node: Gory Details483729
-Ref: table-sub-escapes485408
-Ref: table-sub-posix-92486762
-Ref: table-sub-proposed488105
-Ref: table-posix-sub489455
-Ref: table-gensub-escapes491001
-Ref: Gory Details-Footnote-1492208
-Ref: Gory Details-Footnote-2492259
-Node: I/O Functions492410
-Ref: I/O Functions-Footnote-1499065
-Node: Time Functions499212
-Ref: Time Functions-Footnote-1510104
-Ref: Time Functions-Footnote-2510172
-Ref: Time Functions-Footnote-3510330
-Ref: Time Functions-Footnote-4510441
-Ref: Time Functions-Footnote-5510553
-Ref: Time Functions-Footnote-6510780
-Node: Bitwise Functions511046
-Ref: table-bitwise-ops511604
-Ref: Bitwise Functions-Footnote-1515825
-Node: Type Functions516009
-Node: I18N Functions516479
-Node: User-defined518106
-Node: Definition Syntax518910
-Ref: Definition Syntax-Footnote-1523820
-Node: Function Example523889
-Node: Function Caveats526483
-Node: Calling A Function526904
-Node: Variable Scope528019
-Node: Pass By Value/Reference529994
-Node: Return Statement533434
-Node: Dynamic Typing536415
-Node: Indirect Calls537150
-Node: Internationalization546835
-Node: I18N and L10N548274
-Node: Explaining gettext548960
-Ref: Explaining gettext-Footnote-1554026
-Ref: Explaining gettext-Footnote-2554210
-Node: Programmer i18n554375
-Node: Translator i18n558575
-Node: String Extraction559368
-Ref: String Extraction-Footnote-1560329
-Node: Printf Ordering560415
-Ref: Printf Ordering-Footnote-1563199
-Node: I18N Portability563263
-Ref: I18N Portability-Footnote-1565712
-Node: I18N Example565775
-Ref: I18N Example-Footnote-1568410
-Node: Gawk I18N568482
-Node: Arbitrary Precision Arithmetic569099
-Ref: Arbitrary Precision Arithmetic-Footnote-1570751
-Node: General Arithmetic570899
-Node: Floating Point Issues572619
-Node: String Conversion Precision573714
-Ref: String Conversion Precision-Footnote-1575420
-Node: Unexpected Results575529
-Node: POSIX Floating Point Problems577682
-Ref: POSIX Floating Point Problems-Footnote-1581507
-Node: Integer Programming581545
-Node: Floating-point Programming583293
-Ref: Floating-point Programming-Footnote-1589557
-Node: Floating-point Representation589821
-Node: Floating-point Context590988
-Ref: table-ieee-formats591830
-Node: Rounding Mode593214
-Ref: table-rounding-modes593693
-Ref: Rounding Mode-Footnote-1596697
-Node: Gawk and MPFR596878
-Node: Arbitrary Precision Floats598119
-Ref: Arbitrary Precision Floats-Footnote-1600541
-Node: Setting Precision600852
-Node: Setting Rounding Mode603579
-Ref: table-gawk-rounding-modes603983
-Node: Floating-point Constants605180
-Node: Changing Precision606602
-Ref: Changing Precision-Footnote-1608002
-Node: Exact Arithmetic608176
-Node: Arbitrary Precision Integers611274
-Ref: Arbitrary Precision Integers-Footnote-1614356
-Node: Advanced Features614503
-Node: Nondecimal Data616026
-Node: Array Sorting617609
-Node: Controlling Array Traversal618306
-Node: Array Sorting Functions626543
-Ref: Array Sorting Functions-Footnote-1630217
-Ref: Array Sorting Functions-Footnote-2630310
-Node: Two-way I/O630504
-Ref: Two-way I/O-Footnote-1635936
-Node: TCP/IP Networking636006
-Node: Profiling638850
-Node: Library Functions646304
-Ref: Library Functions-Footnote-1649311
-Node: Library Names649482
-Ref: Library Names-Footnote-1652953
-Ref: Library Names-Footnote-2653173
-Node: General Functions653259
-Node: Strtonum Function654212
-Node: Assert Function657142
-Node: Round Function660468
-Node: Cliff Random Function662011
-Node: Ordinal Functions663027
-Ref: Ordinal Functions-Footnote-1666097
-Ref: Ordinal Functions-Footnote-2666349
-Node: Join Function666558
-Ref: Join Function-Footnote-1668329
-Node: Getlocaltime Function668529
-Node: Data File Management672244
-Node: Filetrans Function672876
-Node: Rewind Function677015
-Node: File Checking678402
-Node: Empty Files679496
-Node: Ignoring Assigns681726
-Node: Getopt Function683279
-Ref: Getopt Function-Footnote-1694583
-Node: Passwd Functions694786
-Ref: Passwd Functions-Footnote-1703761
-Node: Group Functions703849
-Node: Walking Arrays711933
-Node: Sample Programs713502
-Node: Running Examples714167
-Node: Clones714895
-Node: Cut Program716119
-Node: Egrep Program725964
-Ref: Egrep Program-Footnote-1733737
-Node: Id Program733847
-Node: Split Program737463
-Ref: Split Program-Footnote-1740982
-Node: Tee Program741110
-Node: Uniq Program743913
-Node: Wc Program751342
-Ref: Wc Program-Footnote-1755608
-Ref: Wc Program-Footnote-2755808
-Node: Miscellaneous Programs755900
-Node: Dupword Program757088
-Node: Alarm Program759119
-Node: Translate Program763868
-Ref: Translate Program-Footnote-1768255
-Ref: Translate Program-Footnote-2768483
-Node: Labels Program768617
-Ref: Labels Program-Footnote-1771988
-Node: Word Sorting772072
-Node: History Sorting775956
-Node: Extract Program777795
-Ref: Extract Program-Footnote-1785278
-Node: Simple Sed785406
-Node: Igawk Program788468
-Ref: Igawk Program-Footnote-1803625
-Ref: Igawk Program-Footnote-2803826
-Node: Anagram Program803964
-Node: Signature Program807032
-Node: Debugger808132
-Node: Debugging809086
-Node: Debugging Concepts809519
-Node: Debugging Terms811375
-Node: Awk Debugging813972
-Node: Sample Debugging Session814864
-Node: Debugger Invocation815384
-Node: Finding The Bug816713
-Node: List of Debugger Commands823201
-Node: Breakpoint Control824535
-Node: Debugger Execution Control828199
-Node: Viewing And Changing Data831559
-Node: Execution Stack834915
-Node: Debugger Info836382
-Node: Miscellaneous Debugger Commands840363
-Node: Readline Support845808
-Node: Limitations846639
-Node: Dynamic Extensions848891
-Node: Plugin License849787
-Node: Sample Library850401
-Node: Internal File Description851085
-Node: Internal File Ops854798
-Ref: Internal File Ops-Footnote-1859361
-Node: Using Internal File Ops859501
-Node: Language History861877
-Node: V7/SVR3.1863399
-Node: SVR4865720
-Node: POSIX867162
-Node: BTL868170
-Node: POSIX/GNU868904
-Node: Common Extensions874439
-Node: Ranges and Locales875546
-Ref: Ranges and Locales-Footnote-1880164
-Ref: Ranges and Locales-Footnote-2880191
-Ref: Ranges and Locales-Footnote-3880451
-Node: Contributors880672
-Node: Installation884968
-Node: Gawk Distribution885862
-Node: Getting886346
-Node: Extracting887172
-Node: Distribution contents888864
-Node: Unix Installation894086
-Node: Quick Installation894703
-Node: Additional Configuration Options896665
-Node: Configuration Philosophy898142
-Node: Non-Unix Installation900484
-Node: PC Installation900942
-Node: PC Binary Installation902241
-Node: PC Compiling904089
-Node: PC Testing907033
-Node: PC Using908209
-Node: Cygwin912394
-Node: MSYS913394
-Node: VMS Installation913908
-Node: VMS Compilation914511
-Ref: VMS Compilation-Footnote-1915518
-Node: VMS Installation Details915576
-Node: VMS Running917211
-Node: VMS Old Gawk918818
-Node: Bugs919292
-Node: Other Versions923144
-Node: Notes928459
-Node: Compatibility Mode929046
-Node: Additions929829
-Node: Accessing The Source930756
-Node: Adding Code932181
-Node: New Ports938189
-Node: Derived Files942324
-Ref: Derived Files-Footnote-1947628
-Ref: Derived Files-Footnote-2947662
-Ref: Derived Files-Footnote-3948262
-Node: Future Extensions948360
-Node: Basic Concepts949847
-Node: Basic High Level950528
-Ref: Basic High Level-Footnote-1954563
-Node: Basic Data Typing954748
-Node: Glossary958103
-Node: Copying983079
-Node: GNU Free Documentation License1020636
-Node: Index1045773
+Ref: This Manual-Footnote-148556
+Node: Conventions48656
+Node: Manual History50790
+Ref: Manual History-Footnote-154060
+Ref: Manual History-Footnote-254101
+Node: How To Contribute54175
+Node: Acknowledgments55319
+Node: Getting Started59815
+Node: Running gawk62194
+Node: One-shot63380
+Node: Read Terminal64605
+Ref: Read Terminal-Footnote-166255
+Ref: Read Terminal-Footnote-266531
+Node: Long66702
+Node: Executable Scripts68078
+Ref: Executable Scripts-Footnote-169947
+Ref: Executable Scripts-Footnote-270049
+Node: Comments70596
+Node: Quoting73063
+Node: DOS Quoting77686
+Node: Sample Data Files78361
+Node: Very Simple81393
+Node: Two Rules85992
+Node: More Complex88139
+Ref: More Complex-Footnote-191069
+Node: Statements/Lines91154
+Ref: Statements/Lines-Footnote-195616
+Node: Other Features95881
+Node: When96809
+Node: Invoking Gawk98956
+Node: Command Line100417
+Node: Options101200
+Ref: Options-Footnote-1116598
+Node: Other Arguments116623
+Node: Naming Standard Input119281
+Node: Environment Variables120375
+Node: AWKPATH Variable120933
+Ref: AWKPATH Variable-Footnote-1123691
+Node: AWKLIBPATH Variable123951
+Node: Other Environment Variables124548
+Node: Exit Status127043
+Node: Include Files127718
+Node: Loading Shared Libraries131287
+Node: Obsolete132512
+Node: Undocumented133209
+Node: Regexp133452
+Node: Regexp Usage134841
+Node: Escape Sequences136867
+Node: Regexp Operators142630
+Ref: Regexp Operators-Footnote-1150010
+Ref: Regexp Operators-Footnote-2150157
+Node: Bracket Expressions150255
+Ref: table-char-classes152145
+Node: GNU Regexp Operators154668
+Node: Case-sensitivity158391
+Ref: Case-sensitivity-Footnote-1161359
+Ref: Case-sensitivity-Footnote-2161594
+Node: Leftmost Longest161702
+Node: Computed Regexps162903
+Node: Reading Files166313
+Node: Records168316
+Ref: Records-Footnote-1177240
+Node: Fields177277
+Ref: Fields-Footnote-1180310
+Node: Nonconstant Fields180396
+Node: Changing Fields182598
+Node: Field Separators188579
+Node: Default Field Splitting191208
+Node: Regexp Field Splitting192325
+Node: Single Character Fields195667
+Node: Command Line Field Separator196726
+Node: Field Splitting Summary200167
+Ref: Field Splitting Summary-Footnote-1203359
+Node: Constant Size203460
+Node: Splitting By Content208044
+Ref: Splitting By Content-Footnote-1211770
+Node: Multiple Line211810
+Ref: Multiple Line-Footnote-1217657
+Node: Getline217836
+Node: Plain Getline220052
+Node: Getline/Variable222141
+Node: Getline/File223282
+Node: Getline/Variable/File224604
+Ref: Getline/Variable/File-Footnote-1226203
+Node: Getline/Pipe226290
+Node: Getline/Variable/Pipe228850
+Node: Getline/Coprocess229957
+Node: Getline/Variable/Coprocess231200
+Node: Getline Notes231914
+Node: Getline Summary234701
+Ref: table-getline-variants235109
+Node: Read Timeout235965
+Ref: Read Timeout-Footnote-1239710
+Node: Command line directories239767
+Node: Printing240397
+Node: Print242028
+Node: Print Examples243365
+Node: Output Separators246149
+Node: OFMT247909
+Node: Printf249267
+Node: Basic Printf250173
+Node: Control Letters251712
+Node: Format Modifiers255524
+Node: Printf Examples261533
+Node: Redirection264248
+Node: Special Files271232
+Node: Special FD271765
+Ref: Special FD-Footnote-1275390
+Node: Special Network275464
+Node: Special Caveats276314
+Node: Close Files And Pipes277110
+Ref: Close Files And Pipes-Footnote-1284133
+Ref: Close Files And Pipes-Footnote-2284281
+Node: Expressions284431
+Node: Values285563
+Node: Constants286239
+Node: Scalar Constants286919
+Ref: Scalar Constants-Footnote-1287778
+Node: Nondecimal-numbers287960
+Node: Regexp Constants291019
+Node: Using Constant Regexps291494
+Node: Variables294549
+Node: Using Variables295204
+Node: Assignment Options296928
+Node: Conversion298800
+Ref: table-locale-affects304176
+Ref: Conversion-Footnote-1304800
+Node: All Operators304909
+Node: Arithmetic Ops305539
+Node: Concatenation308044
+Ref: Concatenation-Footnote-1310837
+Node: Assignment Ops310957
+Ref: table-assign-ops315945
+Node: Increment Ops317353
+Node: Truth Values and Conditions320823
+Node: Truth Values321906
+Node: Typing and Comparison322955
+Node: Variable Typing323744
+Ref: Variable Typing-Footnote-1327641
+Node: Comparison Operators327763
+Ref: table-relational-ops328173
+Node: POSIX String Comparison331722
+Ref: POSIX String Comparison-Footnote-1332678
+Node: Boolean Ops332816
+Ref: Boolean Ops-Footnote-1336894
+Node: Conditional Exp336985
+Node: Function Calls338717
+Node: Precedence342311
+Node: Locales345980
+Node: Patterns and Actions347069
+Node: Pattern Overview348123
+Node: Regexp Patterns349792
+Node: Expression Patterns350335
+Node: Ranges354020
+Node: BEGIN/END356986
+Node: Using BEGIN/END357748
+Ref: Using BEGIN/END-Footnote-1360479
+Node: I/O And BEGIN/END360585
+Node: BEGINFILE/ENDFILE362867
+Node: Empty365771
+Node: Using Shell Variables366087
+Node: Action Overview368372
+Node: Statements370729
+Node: If Statement372583
+Node: While Statement374082
+Node: Do Statement376126
+Node: For Statement377282
+Node: Switch Statement380434
+Node: Break Statement382531
+Node: Continue Statement384521
+Node: Next Statement386314
+Node: Nextfile Statement388704
+Node: Exit Statement391249
+Node: Built-in Variables393665
+Node: User-modified394760
+Ref: User-modified-Footnote-1403115
+Node: Auto-set403177
+Ref: Auto-set-Footnote-1413085
+Node: ARGC and ARGV413290
+Node: Arrays417141
+Node: Array Basics418646
+Node: Array Intro419472
+Node: Reference to Elements423790
+Node: Assigning Elements426060
+Node: Array Example426551
+Node: Scanning an Array428283
+Node: Controlling Scanning430597
+Ref: Controlling Scanning-Footnote-1435530
+Node: Delete435846
+Ref: Delete-Footnote-1438281
+Node: Numeric Array Subscripts438338
+Node: Uninitialized Subscripts440521
+Node: Multi-dimensional442149
+Node: Multi-scanning445243
+Node: Arrays of Arrays446834
+Node: Functions451479
+Node: Built-in452301
+Node: Calling Built-in453379
+Node: Numeric Functions455367
+Ref: Numeric Functions-Footnote-1459199
+Ref: Numeric Functions-Footnote-2459556
+Ref: Numeric Functions-Footnote-3459604
+Node: String Functions459873
+Ref: String Functions-Footnote-1483370
+Ref: String Functions-Footnote-2483499
+Ref: String Functions-Footnote-3483747
+Node: Gory Details483834
+Ref: table-sub-escapes485513
+Ref: table-sub-posix-92486867
+Ref: table-sub-proposed488210
+Ref: table-posix-sub489560
+Ref: table-gensub-escapes491106
+Ref: Gory Details-Footnote-1492313
+Ref: Gory Details-Footnote-2492364
+Node: I/O Functions492515
+Ref: I/O Functions-Footnote-1499170
+Node: Time Functions499317
+Ref: Time Functions-Footnote-1510209
+Ref: Time Functions-Footnote-2510277
+Ref: Time Functions-Footnote-3510435
+Ref: Time Functions-Footnote-4510546
+Ref: Time Functions-Footnote-5510658
+Ref: Time Functions-Footnote-6510885
+Node: Bitwise Functions511151
+Ref: table-bitwise-ops511709
+Ref: Bitwise Functions-Footnote-1515930
+Node: Type Functions516114
+Node: I18N Functions516584
+Node: User-defined518211
+Node: Definition Syntax519015
+Ref: Definition Syntax-Footnote-1523925
+Node: Function Example523994
+Node: Function Caveats526588
+Node: Calling A Function527009
+Node: Variable Scope528124
+Node: Pass By Value/Reference530099
+Node: Return Statement533539
+Node: Dynamic Typing536520
+Node: Indirect Calls537255
+Node: Internationalization546940
+Node: I18N and L10N548379
+Node: Explaining gettext549065
+Ref: Explaining gettext-Footnote-1554131
+Ref: Explaining gettext-Footnote-2554315
+Node: Programmer i18n554480
+Node: Translator i18n558680
+Node: String Extraction559473
+Ref: String Extraction-Footnote-1560434
+Node: Printf Ordering560520
+Ref: Printf Ordering-Footnote-1563304
+Node: I18N Portability563368
+Ref: I18N Portability-Footnote-1565817
+Node: I18N Example565880
+Ref: I18N Example-Footnote-1568515
+Node: Gawk I18N568587
+Node: Advanced Features569204
+Node: Nondecimal Data570727
+Node: Array Sorting572310
+Node: Controlling Array Traversal573007
+Node: Array Sorting Functions581244
+Ref: Array Sorting Functions-Footnote-1584918
+Ref: Array Sorting Functions-Footnote-2585011
+Node: Two-way I/O585205
+Ref: Two-way I/O-Footnote-1590637
+Node: TCP/IP Networking590707
+Node: Profiling593551
+Node: Library Functions601005
+Ref: Library Functions-Footnote-1604012
+Node: Library Names604183
+Ref: Library Names-Footnote-1607654
+Ref: Library Names-Footnote-2607874
+Node: General Functions607960
+Node: Strtonum Function608913
+Node: Assert Function611843
+Node: Round Function615169
+Node: Cliff Random Function616712
+Node: Ordinal Functions617728
+Ref: Ordinal Functions-Footnote-1620798
+Ref: Ordinal Functions-Footnote-2621050
+Node: Join Function621259
+Ref: Join Function-Footnote-1623030
+Node: Getlocaltime Function623230
+Node: Data File Management626945
+Node: Filetrans Function627577
+Node: Rewind Function631716
+Node: File Checking633103
+Node: Empty Files634197
+Node: Ignoring Assigns636427
+Node: Getopt Function637980
+Ref: Getopt Function-Footnote-1649284
+Node: Passwd Functions649487
+Ref: Passwd Functions-Footnote-1658462
+Node: Group Functions658550
+Node: Walking Arrays666634
+Node: Sample Programs668203
+Node: Running Examples668868
+Node: Clones669596
+Node: Cut Program670820
+Node: Egrep Program680665
+Ref: Egrep Program-Footnote-1688438
+Node: Id Program688548
+Node: Split Program692164
+Ref: Split Program-Footnote-1695683
+Node: Tee Program695811
+Node: Uniq Program698614
+Node: Wc Program706043
+Ref: Wc Program-Footnote-1710309
+Ref: Wc Program-Footnote-2710509
+Node: Miscellaneous Programs710601
+Node: Dupword Program711789
+Node: Alarm Program713820
+Node: Translate Program718569
+Ref: Translate Program-Footnote-1722956
+Ref: Translate Program-Footnote-2723184
+Node: Labels Program723318
+Ref: Labels Program-Footnote-1726689
+Node: Word Sorting726773
+Node: History Sorting730657
+Node: Extract Program732496
+Ref: Extract Program-Footnote-1739979
+Node: Simple Sed740107
+Node: Igawk Program743169
+Ref: Igawk Program-Footnote-1758326
+Ref: Igawk Program-Footnote-2758527
+Node: Anagram Program758665
+Node: Signature Program761733
+Node: Debugger762833
+Node: Debugging763787
+Node: Debugging Concepts764220
+Node: Debugging Terms766076
+Node: Awk Debugging768673
+Node: Sample Debugging Session769565
+Node: Debugger Invocation770085
+Node: Finding The Bug771414
+Node: List of Debugger Commands777902
+Node: Breakpoint Control779236
+Node: Debugger Execution Control782900
+Node: Viewing And Changing Data786260
+Node: Execution Stack789616
+Node: Debugger Info791083
+Node: Miscellaneous Debugger Commands795064
+Node: Readline Support800509
+Node: Limitations801340
+Node: Dynamic Extensions803592
+Node: Plugin License804488
+Node: Sample Library805102
+Node: Internal File Description805786
+Node: Internal File Ops809499
+Ref: Internal File Ops-Footnote-1814062
+Node: Using Internal File Ops814202
+Node: Arbitrary Precision Arithmetic816578
+Ref: Arbitrary Precision Arithmetic-Footnote-1818230
+Node: General Arithmetic818378
+Node: Floating Point Issues820098
+Node: String Conversion Precision821193
+Ref: String Conversion Precision-Footnote-1822899
+Node: Unexpected Results823008
+Node: POSIX Floating Point Problems825161
+Ref: POSIX Floating Point Problems-Footnote-1828986
+Node: Integer Programming829024
+Node: Floating-point Programming830772
+Ref: Floating-point Programming-Footnote-1837036
+Node: Floating-point Representation837300
+Node: Floating-point Context838467
+Ref: table-ieee-formats839309
+Node: Rounding Mode840693
+Ref: table-rounding-modes841172
+Ref: Rounding Mode-Footnote-1844176
+Node: Gawk and MPFR844357
+Node: Arbitrary Precision Floats845598
+Ref: Arbitrary Precision Floats-Footnote-1848020
+Node: Setting Precision848331
+Node: Setting Rounding Mode851058
+Ref: table-gawk-rounding-modes851462
+Node: Floating-point Constants852659
+Node: Changing Precision854081
+Ref: Changing Precision-Footnote-1855481
+Node: Exact Arithmetic855655
+Node: Arbitrary Precision Integers858753
+Ref: Arbitrary Precision Integers-Footnote-1861835
+Node: Language History861982
+Node: V7/SVR3.1863504
+Node: SVR4865825
+Node: POSIX867267
+Node: BTL868275
+Node: POSIX/GNU869009
+Node: Common Extensions874544
+Node: Ranges and Locales875651
+Ref: Ranges and Locales-Footnote-1880269
+Ref: Ranges and Locales-Footnote-2880296
+Ref: Ranges and Locales-Footnote-3880556
+Node: Contributors880777
+Node: Installation885073
+Node: Gawk Distribution885967
+Node: Getting886451
+Node: Extracting887277
+Node: Distribution contents888969
+Node: Unix Installation894191
+Node: Quick Installation894808
+Node: Additional Configuration Options896770
+Node: Configuration Philosophy898247
+Node: Non-Unix Installation900589
+Node: PC Installation901047
+Node: PC Binary Installation902346
+Node: PC Compiling904194
+Node: PC Testing907138
+Node: PC Using908314
+Node: Cygwin912499
+Node: MSYS913499
+Node: VMS Installation914013
+Node: VMS Compilation914616
+Ref: VMS Compilation-Footnote-1915623
+Node: VMS Installation Details915681
+Node: VMS Running917316
+Node: VMS Old Gawk918923
+Node: Bugs919397
+Node: Other Versions923249
+Node: Notes928564
+Node: Compatibility Mode929151
+Node: Additions929934
+Node: Accessing The Source930861
+Node: Adding Code932286
+Node: New Ports938294
+Node: Derived Files942429
+Ref: Derived Files-Footnote-1947733
+Ref: Derived Files-Footnote-2947767
+Ref: Derived Files-Footnote-3948367
+Node: Future Extensions948465
+Node: Basic Concepts949952
+Node: Basic High Level950633
+Ref: Basic High Level-Footnote-1954668
+Node: Basic Data Typing954853
+Node: Glossary958208
+Node: Copying983184
+Node: GNU Free Documentation License1020741
+Node: Index1045878
End Tag Table
diff --git a/doc/gawk.texi b/doc/gawk.texi
index 7d463a3..d700f2a 100644
--- a/doc/gawk.texi
+++ b/doc/gawk.texi
@@ -296,14 +296,14 @@ particular records in a file and perform operations upon
them.
* Functions:: Built-in and user-defined functions.
* Internationalization:: Getting @command{gawk} to speak your
language.
-* Arbitrary Precision Arithmetic:: Arbitrary precision arithmetic with
- @command{gawk}.
* Advanced Features:: Stuff for advanced users, specific to
@command{gawk}.
* Library Functions:: A Library of @command{awk} Functions.
* Sample Programs:: Many @command{awk} programs with complete
explanations.
* Debugger:: The @code{gawk} debugger.
+* Arbitrary Precision Arithmetic:: Arbitrary precision arithmetic with
+ @command{gawk}.
* Dynamic Extensions:: Adding new built-in functions to
@command{gawk}.
* Language History:: The evolution of the @command{awk}
@@ -569,29 +569,6 @@ particular records in a file and perform operations upon
them.
* I18N Portability:: @command{awk}-level portability issues.
* I18N Example:: A simple i18n example.
* Gawk I18N:: @command{gawk} is also internationalized.
-* General Arithmetic:: An introduction to computer arithmetic.
-* Floating Point Issues:: Stuff to know about floating-point numbers.
-* String Conversion Precision:: The String Value Can Lie.
-* Unexpected Results:: Floating Point Numbers Are Not Abstract
- Numbers.
-* POSIX Floating Point Problems:: Standards Versus Existing Practice.
-* Integer Programming:: Effective integer programming.
-* Floating-point Programming:: Effective Floating-point Programming.
-* Floating-point Representation:: Binary floating-point representation.
-* Floating-point Context:: Floating-point context.
-* Rounding Mode:: Floating-point rounding mode.
-* Gawk and MPFR:: How @command{gawk} provides
- aribitrary-precision arithmetic.
-* Arbitrary Precision Floats:: Arbitrary Precision Floating-point
- Arithmetic with @command{gawk}.
-* Setting Precision:: Setting the working precision.
-* Setting Rounding Mode:: Setting the rounding mode.
-* Floating-point Constants:: Representing floating-point constants.
-* Changing Precision:: Changing the precision of a number.
-* Exact Arithmetic:: Exact arithmetic with floating-point
- numbers.
-* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic with
- @command{gawk}.
* Nondecimal Data:: Allowing nondecimal input data.
* Array Sorting:: Facilities for controlling array traversal
and sorting arrays.
@@ -673,6 +650,29 @@ particular records in a file and perform operations upon
them.
* Miscellaneous Debugger Commands:: Miscellaneous Commands.
* Readline Support:: Readline support.
* Limitations:: Limitations and future plans.
+* General Arithmetic:: An introduction to computer arithmetic.
+* Floating Point Issues:: Stuff to know about floating-point numbers.
+* String Conversion Precision:: The String Value Can Lie.
+* Unexpected Results:: Floating Point Numbers Are Not Abstract
+ Numbers.
+* POSIX Floating Point Problems:: Standards Versus Existing Practice.
+* Integer Programming:: Effective integer programming.
+* Floating-point Programming:: Effective Floating-point Programming.
+* Floating-point Representation:: Binary floating-point representation.
+* Floating-point Context:: Floating-point context.
+* Rounding Mode:: Floating-point rounding mode.
+* Gawk and MPFR:: How @command{gawk} provides
+ aribitrary-precision arithmetic.
+* Arbitrary Precision Floats:: Arbitrary Precision Floating-point
+ Arithmetic with @command{gawk}.
+* Setting Precision:: Setting the working precision.
+* Setting Rounding Mode:: Setting the rounding mode.
+* Floating-point Constants:: Representing floating-point constants.
+* Changing Precision:: Changing the precision of a number.
+* Exact Arithmetic:: Exact arithmetic with floating-point
+ numbers.
+* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic with
+ @command{gawk}.
* Plugin License:: A note about licensing.
* Sample Library:: A example of new functions.
* Internal File Description:: What the new functions will do.
@@ -1201,6 +1201,13 @@ solving real problems.
@ref{Debugger}, describes the @command{awk} debugger.
+@ref{Arbitrary Precision Arithmetic},
+describes advanced arithmetic facilities provided by
+@command{gawk}.
+
+@ref{Dynamic Extensions}, describes how to add new variables and
+functions to @command{gawk} by writing extensions in C.
+
@ref{Language History},
describes how the @command{awk} language has evolved since
its first release to present. It also describes how @command{gawk}
@@ -18497,9447 +18504,9447 @@ then @command{gawk} produces usage messages,
warnings,
and fatal errors in the local language.
@c ENDOFRANGE inloc
-@node Arbitrary Precision Arithmetic
-@chapter Arithmetic and Arbitrary Precision Arithmetic with @command{gawk}
-@cindex arbitrary precision
-@cindex multiple precision
-@cindex infinite precision
-@cindex floating-point numbers, arbitrary precision
-@cindex MPFR
-@cindex GMP
+@node Advanced Features
+@chapter Advanced Features of @command{gawk}
+@cindex advanced features, network connections, See Also networks,
connections
+@c STARTOFRANGE gawadv
+@cindex @command{gawk}, features, advanced
+@c STARTOFRANGE advgaw
+@cindex advanced features, @command{gawk}
+@ignore
+Contributed by: Peter Langston <address@hidden>
-@cindex Knuth, Donald
+ Found in Steve English's "signature" line:
+
+"Write documentation as if whoever reads it is a violent psychopath
+who knows where you live."
+@end ignore
@quotation
-@i{There's a credibility gap: We don't know how much of the computer's
answers
-to believe. Novice computer users solve this problem by implicitly trusting
-in the computer as an infallible authority; they tend to believe that all
-digits of a printed answer are significant. Disillusioned computer users have
-just the opposite approach; they are constantly afraid that their answers
-are almost meaningless.}@*
-Donald Knuth@footnote{Donald E.@: Knuth.
-@cite{The Art of Computer Programming}. Volume 2,
-@cite{Seminumerical Algorithms}, third edition,
-1998, ISBN 0-201-89683-4, p.@: 229.}
+@i{Write documentation as if whoever reads it is
+a violent psychopath who knows where you live.}@*
+Steve English, as quoted by Peter Langston
@end quotation
-This @value{CHAPTER} discusses issues that you may encounter
-when performing arithmetic. It begins by discussing some of
-the general atributes of computer arithmetic, along with how
-this can influence what you see when running @command{awk} programs.
-This discussion applies to all versions of @command{awk}.
+This @value{CHAPTER} discusses advanced features in @command{gawk}.
+It's a bit of a ``grab bag'' of items that are otherwise unrelated
+to each other.
+First, a command-line option allows @command{gawk} to recognize
+nondecimal numbers in input data, not just in @command{awk}
+programs.
+Then, @command{gawk}'s special features for sorting arrays are presented.
+Next, two-way I/O, discussed briefly in earlier parts of this
+@value{DOCUMENT}, is described in full detail, along with the basics
+of TCP/IP networking. Finally, @command{gawk}
+can @dfn{profile} an @command{awk} program, making it possible to tune
+it for performance.
-Then the discussion moves on to @dfn{arbitrary precsion
-arithmetic}, a feature which is specific to @command{gawk}.
+@ref{Dynamic Extensions},
+discusses the ability to dynamically add new built-in functions to
+@command{gawk}. As this feature is still immature and likely to change,
+its description is relegated to an appendix.
@menu
-* General Arithmetic:: An introduction to computer arithmetic.
-* Floating-point Programming:: Effective Floating-point Programming.
-* Gawk and MPFR:: How @command{gawk} provides
- aribitrary-precision arithmetic.
-* Arbitrary Precision Floats:: Arbitrary Precision Floating-point Arithmetic
- with @command{gawk}.
-* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic with
- @command{gawk}.
+* Nondecimal Data:: Allowing nondecimal input data.
+* Array Sorting:: Facilities for controlling array traversal and
+ sorting arrays.
+* Two-way I/O:: Two-way communications with another process.
+* TCP/IP Networking:: Using @command{gawk} for network programming.
+* Profiling:: Profiling your @command{awk} programs.
@end menu
-@node General Arithmetic
-@section A General Description of Computer Arithmetic
+@node Nondecimal Data
+@section Allowing Nondecimal Input Data
+@cindex @code{--non-decimal-data} option
+@cindex advanced features, @command{gawk}, nondecimal input data
+@cindex input, data@comma{} nondecimal
+@cindex constants, nondecimal
-@cindex integers
-@cindex floating-point, numbers
-@cindex numbers, floating-point
-Within computers, there are two kinds of numeric values: @dfn{integers}
-and @dfn{floating-point}.
-In school, integer values were referred to as ``whole'' numbers---that is,
-numbers without any fractional part, such as 1, 42, or @minus{}17.
-The advantage to integer numbers is that they represent values exactly.
-The disadvantage is that their range is limited. On most systems,
-this range is @minus{}2,147,483,648 to 2,147,483,647.
-However, many systems now support a range from
-@minus{}9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
+If you run @command{gawk} with the @option{--non-decimal-data} option,
+you can have nondecimal constants in your input data:
-@cindex unsigned integers
-@cindex integers, unsigned
-Integer values come in two flavors: @dfn{signed} and @dfn{unsigned}.
-Signed values may be negative or positive, with the range of values just
-described.
-Unsigned values are always positive. On most systems,
-the range is from 0 to 4,294,967,295.
-However, many systems now support a range from
-0 to 18,446,744,073,709,551,615.
+@c line break here for small book format
+@example
+$ @kbd{echo 0123 123 0x123 |}
+> @kbd{gawk --non-decimal-data '@{ printf "%d, %d, %d\n",}
+> @kbd{$1, $2, $3 @}'}
+@print{} 83, 123, 291
+@end example
-@cindex double precision floating-point
-@cindex single precision floating-point
-Floating-point numbers represent what are called ``real'' numbers; i.e.,
-those that do have a fractional part, such as 3.1415927.
-The advantage to floating-point numbers is that they
-can represent a much larger range of values.
-The disadvantage is that there are numbers that they cannot represent
-exactly.
-@command{awk} uses @dfn{double precision} floating-point numbers, which
-can hold more digits than @dfn{single precision}
-floating-point numbers.
address@hidden Floating-point issues are discussed more fully in
address@hidden @ref{Floating Point Issues}.
+For this feature to work, write your program so that
+@command{gawk} treats your data as numeric:
-There a several important issues to be aware of, described next.
+@example
+$ @kbd{echo 0123 123 0x123 | gawk '@{ print $1, $2, $3 @}'}
+@print{} 0123 123 0x123
+@end example
-@menu
-* Floating Point Issues:: Stuff to know about floating-point numbers.
-* Integer Programming:: Effective integer programming.
-@end menu
+@noindent
+The @code{print} statement treats its expressions as strings.
+Although the fields can act as numbers when necessary,
+they are still strings, so @code{print} does not try to treat them
+numerically. You may need to add zero to a field to force it to
+be treated as a number. For example:
-@node Floating Point Issues
-@subsection Floating-Point Number Caveats
+@example
+$ @kbd{echo 0123 123 0x123 | gawk --non-decimal-data '}
+> @kbd{@{ print $1, $2, $3}
+> @kbd{print $1 + 0, $2 + 0, $3 + 0 @}'}
+@print{} 0123 123 0x123
+@print{} 83 123 291
+@end example
-As mentioned earlier, floating-point numbers represent what are called
-``real'' numbers, i.e., those that have a fractional part. @command{awk}
-uses double precision floating-point numbers to represent all
-numeric values. This @value{SECTION} describes some of the issues
-involved in using floating-point numbers.
+Because it is common to have decimal data with leading zeros, and because
+using this facility could lead to surprising results, the default is to leave it
+disabled. If you want it, you must explicitly request it.
-There is a very nice
-@uref{http://www.validlab.com/goldberg/paper.pdf, paper on floating-point arithmetic}
-by David Goldberg,
-``What Every Computer Scientist Should Know About Floating-point Arithmetic,''
-@cite{ACM Computing Surveys} @strong{23}, 1 (1991-03), 5-48.
-This is worth reading if you are interested in the details,
-but it does require a background in computer science.
+@cindex programming conventions, @code{--non-decimal-data} option
+@cindex @code{--non-decimal-data} option, @code{strtonum()} function and
+@cindex @code{strtonum()} function (@command{gawk}), @code{--non-decimal-data} option and
+@quotation CAUTION
+@emph{Use of this option is not recommended.}
+It can break old programs very badly.
+Instead, use the @code{strtonum()} function to convert your data
+(@pxref{Nondecimal-numbers}).
+This makes your programs easier to write and easier to read, and
+leads to less surprising results.
+@end quotation
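+
+As a minimal sketch of the @code{strtonum()} approach (reusing the
+illustrative input from the earlier examples), each field can be
+converted explicitly, with no special command-line option:
+
+@example
+$ @kbd{echo 0123 123 0x123 |}
+> @kbd{gawk '@{ print strtonum($1), strtonum($2), strtonum($3) @}'}
+@print{} 83 123 291
+@end example
+
+@noindent
+Only the fields passed to @code{strtonum()} are interpreted as octal
+or hexadecimal; all other input is handled as ordinary decimal data.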
+
+@node Array Sorting
+@section Controlling Array Traversal and Array Sorting
+
+@command{gawk} lets you control the order in which a @samp{for (i in array)}
+loop traverses an array.
+
+In addition, two built-in functions, @code{asort()} and @code{asorti()},
+let you sort arrays based on the array values and indices, respectively.
+These two functions also provide control over the sorting criteria used
+to order the elements during sorting.
@menu
-* String Conversion Precision:: The String Value Can Lie.
-* Unexpected Results:: Floating Point Numbers Are Not Abstract
- Numbers.
-* POSIX Floating Point Problems:: Standards Versus Existing Practice.
+* Controlling Array Traversal:: How to use PROCINFO["sorted_in"].
+* Array Sorting Functions:: How to use @code{asort()} and @code{asorti()}.
@end menu
-@node String Conversion Precision
-@subsubsection The String Value Can Lie
-
-Internally, @command{awk} keeps both the numeric value
-(double precision floating-point) and the string value for a variable.
-Separately, @command{awk} keeps
-track of what type the variable has
-(@pxref{Typing and Comparison}),
-which plays a role in how variables are used in comparisons.
+@node Controlling Array Traversal
+@subsection Controlling Array Traversal
-It is important to note that the string value for a number may not
-reflect the full value (all the digits) that the numeric value
-actually contains.
-The following program (@file{values.awk}) illustrates this:
+By default, the order in which a @samp{for (i in array)} loop
+scans an array is not defined; it is generally based upon
+the internal implementation of arrays inside @command{awk}.
+
+Often, though, it is desirable to be able to loop over the elements
+in a particular order that you, the programmer, choose. @command{gawk}
+lets you do this.
+
+@ref{Controlling Scanning}, describes how you can assign special,
+pre-defined values to @code{PROCINFO["sorted_in"]} in order to
+control the order in which @command{gawk} will traverse an array
+during a @code{for} loop.
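+
+For instance, the following sketch (the array name and its contents
+are made up for illustration) assigns the predefined value
+@code{"@@ind_str_asc"} so that indices are visited in ascending
+string order:
+
+@example
+BEGIN @{
+    # hypothetical sample data
+    fruit["pear"] = 1; fruit["apple"] = 2; fruit["fig"] = 3
+    PROCINFO["sorted_in"] = "@@ind_str_asc"
+    for (name in fruit)
+        print name, fruit[name]
+@}
+@end example
+
+@noindent
+With this setting the loop should visit @code{apple}, @code{fig},
+and @code{pear}, in that order.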
+
+In addition, the value of @code{PROCINFO["sorted_in"]} can be a function name.
+This lets you traverse an array based on any custom criterion.
+The array elements are ordered according to the return value of this
+function. The comparison function should be defined with at least
+four arguments:
@example
+function comp_func(i1, v1, i2, v2)
@{
- sum = $1 + $2
- # see it for what it is
- printf("sum = %.12g\n", sum)
- # use CONVFMT
- a = "<" sum ">"
- print "a =", a
- # use OFMT
- print "sum =", sum
+ @var{compare elements 1 and 2 in some fashion}
+ @var{return < 0; 0; or > 0}
@}
@end example
-@noindent
-This program shows the full value of the sum of @code{$1} and @code{$2}
-using @code{printf}, and then prints the string values obtained
-from both automatic conversion (via @code{CONVFMT}) and
-from printing (via @code{OFMT}).
+Here, @var{i1} and @var{i2} are the indices, and @var{v1} and @var{v2}
+are the corresponding values of the two elements being compared.
+Either @var{v1} or @var{v2}, or both, can be arrays if the array being
+traversed contains subarrays as values.
+(@xref{Arrays of Arrays}, for more information about subarrays.)
+The three possible return values are interpreted as follows:
-Here is what happens when the program is run:
+@table @code
+@item comp_func(i1, v1, i2, v2) < 0
+Index @var{i1} comes before index @var{i2} during loop traversal.
-@example
-$ @kbd{echo 3.654321 1.2345678 | awk -f values.awk}
-@print{} sum = 4.8888888
-@print{} a = <4.88889>
-@print{} sum = 4.88889
-@end example
+@item comp_func(i1, v1, i2, v2) == 0
+Indices @var{i1} and @var{i2}
+come together but the relative order with respect to each other is undefined.
-This makes it clear that the full numeric value is different from
-what the default string representations show.
+@item comp_func(i1, v1, i2, v2) > 0
+Index @var{i1} comes after index @var{i2} during loop traversal.
+@end table
-@code{CONVFMT}'s default value is @code{"%.6g"}, which yields a value with
-at least six significant digits. For some applications, you might want to
-change it to specify more precision.
-On most modern machines, most of the time,
-17 digits is enough to capture a floating-point number's
-value exactly.@footnote{Pathological cases can require up to
-752 digits (!), but we doubt that you need to worry about this.}
+Our first comparison function can be used to scan an array in
+numerical order of the indices:
-@node Unexpected Results
-@subsubsection Floating Point Numbers Are Not Abstract Numbers
+@example
+function cmp_num_idx(i1, v1, i2, v2)
+@{
+    # numerical index comparison, ascending order
+    return (i1 - i2)
+@}
+@end example
-@cindex floating-point, numbers
-Unlike numbers in the abstract sense (such as what you studied in high school
-or college arithmetic), numbers stored in computers are limited in certain ways.
-They cannot represent an infinite number of digits, nor can they always
-represent things exactly.
-In particular,
-floating-point numbers cannot
-always represent values exactly. Here is an example:
+Our second function traverses an array based on the string order of
+the element values rather than by indices:
@example
-$ @kbd{awk '@{ printf("%010d\n", $1 * 100) @}'}
-515.79
-@print{} 0000051579
-515.80
-@print{} 0000051579
-515.81
-@print{} 0000051580
-515.82
-@print{} 0000051582
-@kbd{@value{CTL}-d}
+function cmp_str_val(i1, v1, i2, v2)
+@{
+    # string value comparison, ascending order
+    v1 = v1 ""
+    v2 = v2 ""
+    if (v1 < v2)
+        return -1
+    return (v1 != v2)
+@}
@end example
-@noindent
-This shows that some values can be represented exactly,
-whereas others are only approximated. This is not a ``bug''
-in @command{awk}, but simply an artifact of how computers
-represent numbers.
-
-@quotation NOTE
-It cannot be emphasized enough that the behavior just
-described is fundamental to modern computers. You will
-see this kind of thing happen in @emph{any} programming
-language using hardware floating-point numbers. It is @emph{not}
-a bug in @command{gawk}, nor is it something that can be ``just
-fixed.''
-@end quotation
+The third
+comparison function makes all numbers, and numeric strings without
+any leading or trailing spaces, come out first during loop traversal:
-@cindex negative zero
-@cindex positive zero
-@cindex zero@comma{} negative vs.@: positive
-Another peculiarity of floating-point numbers on modern systems
-is that they often have more than one representation for the number zero!
-In particular, it is possible to represent ``minus zero'' as well as
-regular, or ``positive'' zero.
+@example
+function cmp_num_str_val(i1, v1, i2, v2,   n1, n2)
+@{
+    # numbers before string value comparison, ascending order
+    n1 = v1 + 0
+    n2 = v2 + 0
+    if (n1 == v1)
+        return (n2 == v2) ? (n1 - n2) : -1
+    else if (n2 == v2)
+        return 1
+    return (v1 < v2) ? -1 : (v1 != v2)
+@}
+@end example
-This example shows that negative and positive zero are distinct values
-when stored internally, but that they are in fact equal to each other,
-as well as to ``regular'' zero:
+Here is a main program to demonstrate how @command{gawk}
+behaves using each of the previous functions:
@example
-$ @kbd{gawk 'BEGIN @{ mz = -0 ; pz = 0}
-> @kbd{printf "-0 = %g, +0 = %g, (-0 == +0) -> %d\n", mz, pz, mz == pz}
-> @kbd{printf "mz == 0 -> %d, pz == 0 -> %d\n", mz == 0, pz == 0}
-> @kbd{@}'}
-@print{} -0 = -0, +0 = 0, (-0 == +0) -> 1
-@print{} mz == 0 -> 1, pz == 0 -> 1
+BEGIN @{
+    data["one"] = 10
+    data["two"] = 20
+    data[10] = "one"
+    data[100] = 100
+    data[20] = "two"
+
+    f[1] = "cmp_num_idx"
+    f[2] = "cmp_str_val"
+    f[3] = "cmp_num_str_val"
+    for (i = 1; i <= 3; i++) @{
+        printf("Sort function: %s\n", f[i])
+        PROCINFO["sorted_in"] = f[i]
+        for (j in data)
+            printf("\tdata[%s] = %s\n", j, data[j])
+        print ""
+    @}
+@}
@end example
-It helps to keep this in mind should you process numeric data
-that contains negative zero values; the fact that the zero is negative
-is noted and can affect comparisons.
+Here are the results when the program is run:
address@hidden
-@node POSIX Floating Point Problems
-@subsubsection Standards Versus Existing Practice
+@example
+$ @kbd{gawk -f compdemo.awk}
+@print{} Sort function: cmp_num_idx      @ii{Sort by numeric index}
+@print{}     data[two] = 20
+@print{}     data[one] = 10              @ii{Both strings are numerically zero}
+@print{}     data[10] = one
+@print{}     data[20] = two
+@print{}     data[100] = 100
+@print{}
+@print{} Sort function: cmp_str_val      @ii{Sort by element values as strings}
+@print{}     data[one] = 10
+@print{}     data[100] = 100             @ii{String 100 is less than string 20}
+@print{}     data[two] = 20
+@print{}     data[10] = one
+@print{}     data[20] = two
+@print{}
+@print{} Sort function: cmp_num_str_val  @ii{Sort all numeric values before all strings}
+@print{}     data[one] = 10
+@print{}     data[two] = 20
+@print{}     data[100] = 100
+@print{}     data[10] = one
+@print{}     data[20] = two
+@end example
-Historically, @command{awk} has converted any non-numeric looking string
-to the numeric value zero, when required. Furthermore, the original
-definition of the language and the original POSIX standards specified that
-@command{awk} only understands decimal numbers (base 10), and not octal
-(base 8) or hexadecimal numbers (base 16).
+Consider sorting the entries of a GNU/Linux system password file
+according to login name. The following program sorts records
+by a specific field position and can be used for this purpose:
-Changes in the language of the
-2001 and 2004 POSIX standards can be interpreted to imply that @command{awk}
-should support additional features. These features are:
+@example
+# sort.awk --- simple program to sort by field position
+# field position is specified by the global variable POS
-@itemize @bullet
-@item
-Interpretation of floating point data values specified in hexadecimal
-notation (@samp{0xDEADBEEF}). (Note: data values, @emph{not}
-source code constants.)
+function cmp_field(i1, v1, i2, v2)
+@{
+    # comparison by value, as string, and ascending order
+    return v1[POS] < v2[POS] ? -1 : (v1[POS] != v2[POS])
+@}
-@item
-Support for the special IEEE 754 floating point values ``Not A Number''
-(NaN), positive Infinity (``inf'') and negative Infinity (``@minus{}inf'').
-In particular, the format for these values is as specified by the ISO 1999
-C standard, which ignores case and can allow machine-dependent additional
-characters after the @samp{nan} and allow either @samp{inf} or @samp{infinity}.
-@end itemize
+@{
+    for (i = 1; i <= NF; i++)
+        a[NR][i] = $i
+@}
-The first problem is that both of these are clear changes to historical
-practice:
+END @{
+    PROCINFO["sorted_in"] = "cmp_field"
+    if (POS < 1 || POS > NF)
+        POS = 1
+    for (i in a) @{
+        for (j = 1; j <= NF; j++)
+            printf("%s%c", a[i][j], j < NF ? ":" : "")
+        print ""
+    @}
+@}
+@end example
-@itemize @bullet
-@item
-The @command{gawk} maintainer feels that supporting hexadecimal floating
-point values, in particular, is ugly, and was never intended by the
-original designers to be part of the language.
-
-@item
-Allowing completely alphabetic strings to have valid numeric
-values is also a very severe departure from historical practice.
-@end itemize
-
-The second problem is that the @code{gawk} maintainer feels that this
-interpretation of the standard, which requires a certain amount of
-``language lawyering'' to arrive at in the first place, was not even
-intended by the standard developers. In other words, ``we see how you
-got where you are, but we don't think that that's where you want to be.''
-
-Recognizing the above issues, but attempting to provide compatibility
-with the earlier versions of the standard,
-the 2008 POSIX standard added explicit wording to allow, but not require,
-that @command{awk} support hexadecimal floating point values and
-special values for ``Not A Number'' and infinity.
-
-Although the @command{gawk} maintainer continues to feel that
-providing those features is inadvisable,
-nevertheless, on systems that support IEEE floating point, it seems
-reasonable to provide @emph{some} way to support NaN and Infinity values.
-The solution implemented in @command{gawk} is as follows:
-
-@itemize @bullet
-@item
-With the @option{--posix} command-line option, @command{gawk} becomes
-``hands off.'' String values are passed directly to the system library's
-@code{strtod()} function, and if it successfully returns a numeric value,
-that is what's used.@footnote{You asked for it, you got it.}
-By definition, the results are not portable across
-different systems. They are also a little surprising:
+The first field in each entry of the password file is the user's login name,
+and the fields are separated by colons.
+Each record defines a subarray,
+with each field as an element in the subarray.
+Running the program produces the
+following output:
@example
-$ @kbd{echo nanny | gawk --posix '@{ print $1 + 0 @}'}
-@print{} nan
-$ @kbd{echo 0xDeadBeef | gawk --posix '@{ print $1 + 0 @}'}
-@print{} 3735928559
+$ @kbd{gawk -vPOS=1 -F: -f sort.awk /etc/passwd}
+@print{} adm:x:3:4:adm:/var/adm:/sbin/nologin
+@print{} apache:x:48:48:Apache:/var/www:/sbin/nologin
+@print{} avahi:x:70:70:Avahi daemon:/:/sbin/nologin
+@dots{}
@end example
-@item
-Without @option{--posix}, @command{gawk} interprets the four strings
-@samp{+inf},
-@samp{-inf},
-@samp{+nan},
-and
-@samp{-nan}
-specially, producing the corresponding special numeric values.
-The leading sign acts a signal to @command{gawk} (and the user)
-that the value is really numeric. Hexadecimal floating point is
-not supported (unless you also use @option{--non-decimal-data},
-which is @emph{not} recommended). For example:
+The comparison should normally always return the same value when given a
+specific pair of array elements as its arguments. If inconsistent
+results are returned then the order is undefined. This behavior can be
+exploited to introduce random order into otherwise seemingly
+ordered data:
@example
-$ @kbd{echo nanny | gawk '@{ print $1 + 0 @}'}
-@print{} 0
-$ @kbd{echo +nan | gawk '@{ print $1 + 0 @}'}
-@print{} nan
-$ @kbd{echo 0xDeadBeef | gawk '@{ print $1 + 0 @}'}
-@print{} 0
+function cmp_randomize(i1, v1, i2, v2)
+@{
+    # random order
+    return (2 - 4 * rand())
+@}
@end example
-@command{gawk} does ignore case in the four special values.
-Thus @samp{+nan} and @samp{+NaN} are the same.
-@end itemize
+As mentioned above, the order of the indices is arbitrary if two
+elements compare equal. This is usually not a problem, but letting
+the tied elements come out in arbitrary order can be an issue, especially
+when comparing item values. The partial ordering of the equal elements
+may change during the next loop traversal, if other elements are added or
+removed from the array. One way to resolve ties when comparing elements
+with otherwise equal values is to include the indices in the comparison
+rules. Note that doing this may make the loop traversal less efficient,
+so consider it only if necessary. The following comparison functions
+force a deterministic order, and are based on the fact that the
+indices of two elements are never equal:
-@node Integer Programming
-@subsection Mixing Integers And Floating-point
+@example
+function cmp_numeric(i1, v1, i2, v2)
+@{
+    # numerical value (and index) comparison, descending order
+    return (v1 != v2) ? (v2 - v1) : (i2 - i1)
+@}
-As has been mentioned already, @command{gawk} ordinarily uses hardware double
-precision with 64-bit IEEE binary floating-point representation
-for numbers on most systems. A large integer like 9007199254740997
-has a binary representation that, although finite, is more than 53 bits long;
-it must also be rounded to 53 bits.
-The biggest integer that can be stored in a C @code{double} is usually the same
-as the largest possible value of a @code{double}. If your system @code{double}
-is an IEEE 64-bit @code{double}, this largest possible value is an integer and
-can be represented precisely. What more should one know about integers?
+function cmp_string(i1, v1, i2, v2)
+@{
+    # string value (and index) comparison, descending order
+    v1 = v1 i1
+    v2 = v2 i2
+    return (v1 > v2) ? -1 : (v1 != v2)
+@}
+@end example
-If you want to know what is the largest integer, such that it and
-all smaller integers can be stored in 64-bit doubles without losing precision,
-then the answer is
-@iftex
-@math{2^{53}}.
-@end iftex
-@ifnottex
-2^53.
-@end ifnottex
-The next representable number is the even number
-@iftex
-@math{2^{53} + 2},
-@end iftex
-@ifnottex
-2^53 + 2,
-@end ifnottex
-meaning it is unlikely that you will be able to make
-@command{gawk} print
-@iftex
-@math{2^{53} + 1}
-@end iftex
-@ifnottex
-2^53 + 1
-@end ifnottex
-in integer format.
-The range of integers exactly representable by a 64-bit double
-is
-@iftex
-@math{[-2^{53}, 2^{53}]}.
-@end iftex
-@ifnottex
-[@minus{}2^53, 2^53].
-@end ifnottex
-If you ever see an integer outside this range in @command{gawk}
-using 64-bit doubles, you have reason to be very suspicious about
-the accuracy of the output. Here is a simple program with erroneous output:
+@c Avoid using the term ``stable'' when describing the unpredictable behavior
+@c if two items compare equal. Usually, the goal of a "stable algorithm"
+@c is to maintain the original order of the items, which is a meaningless
+@c concept for a list constructed from a hash.
-@example
-$ @kbd{gawk 'BEGIN @{ i = 2^53 - 1; for (j = 0; j < 4; j++) print i + j @}'}
-@print{} 9007199254740991
-@print{} 9007199254740992
-@print{} 9007199254740992
-@print{} 9007199254740994
-@end example
+A custom comparison function can often simplify ordered loop
+traversal, and the sky is really the limit when it comes to
+designing such a function.
-The lesson is to not assume that any large integer printed by @command{gawk}
-represents an exact result from your computation, especially if it wraps
-around on your screen.
+When string comparisons are made during a sort, either for element
+values where one or both aren't numbers, or for element indices
+handled as strings, the value of @code{IGNORECASE}
+(@pxref{Built-in Variables}) controls whether
+the comparisons treat corresponding uppercase and lowercase letters as
+equivalent or distinct.
-@node Floating-point Programming
-@section Understanding Floating-point Programming
+Another point to keep in mind is that in the case of subarrays
+the element values can themselves be arrays; a production comparison
+function should use the @code{isarray()} function
+(@pxref{Type Functions}),
+to check for this, and choose a defined sorting order for subarrays.
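+
+One possible way to do this is sketched below. The function name
+@code{cmp_val_arrays_last} and the policy it implements (scalars
+first, subarrays last, subarrays ordered among themselves by index)
+are assumptions chosen for illustration, not something @command{gawk}
+mandates:
+
+@example
+function cmp_val_arrays_last(i1, v1, i2, v2)
+@{
+    # hypothetical policy: subarrays sort after all scalar values
+    if (isarray(v1))
+        # two subarrays are ordered by comparing their indices
+        return isarray(v2) ? (i1 < i2 ? -1 : (i1 != i2)) : 1
+    if (isarray(v2))
+        return -1
+    # both values are scalars: ordinary ascending comparison
+    return (v1 < v2) ? -1 : (v1 != v2)
+@}
+@end example
+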
-Numerical programming is an extensive area; if you need to develop
-sophisticated numerical algorithms then @command{gawk} may not be
-the ideal tool, and this documentation may not be sufficient.
-@c FIXME: JOHN: Do you want to cite some actual books?
-It might require digesting a book or two to really internalize how to compute
-with ideal accuracy and precision
-and the result often depends on the particular application.
+All sorting based on @code{PROCINFO["sorted_in"]}
+is disabled in POSIX mode,
+since the @code{PROCINFO} array is not special in that case.
-@quotation NOTE
-A floating-point calculation's @dfn{accuracy} is how close it comes
-to the real value. This is as opposed to the @dfn{precision}, which
-usually refers to the number of bits used to represent the number
-(see @uref{http://en.wikipedia.org/wiki/Accuracy_and_precision,
-the Wikipedia article} for more information).
-@end quotation
+As a side note, sorting the array indices before traversing
+the array has been reported to add 15% to 20% overhead to the
+execution time of @command{awk} programs. For this reason,
+sorted array traversal is not the default.
-There are two options for doing floating-point calculations:
-hardware floating-point (as used by standard @command{awk} and
-the default for @command{gawk}), and @dfn{arbitrary-precision}
-floating-point, which is software based. This @value{CHAPTER}
-aims to provide enough information to understand both, and then
-will focus on @command{gawk}'s facilities for the latter.@footnote{If you
-are interested in other tools that perform arbitrary precision arithmetic,
-you may want to investigate the POSIX @command{bc} tool. See
-@uref{http://pubs.opengroup.org/onlinepubs/009695399/utilities/bc.html,
-the POSIX specification for it}, for more information.}
+@c The @command{gawk}
+@c maintainers believe that only the people who wish to use a
+@c feature should have to pay for it.
-Binary floating-point representations and arithmetic are inexact.
-Simple values like 0.1 cannot be precisely represented using
-binary floating-point numbers, and the limited precision of
-floating-point numbers means that slight changes in
-the order of operations or the precision of intermediate storage
-can change the result. To make matters worse, with arbitrary precision
-floating-point, you can set the precision before starting a computation,
-but then you cannot be sure of the number of significant decimal places
-in the final result.
+@node Array Sorting Functions
+@subsection Sorting Array Values and Indices with @command{gawk}
-Sometimes, before you start to write any code, you should think more
-about what you really want and what's really happening. Consider the
-two numbers in the following example:
+@cindex arrays, sorting
+@cindex @code{asort()} function (@command{gawk})
+@cindex @code{asort()} function (@command{gawk}), arrays@comma{} sorting
+@cindex sort function, arrays, sorting
+In most @command{awk} implementations, sorting an array requires
+writing a @code{sort()} function.
+While this can be educational for exploring different sorting algorithms,
+usually that's not the point of the program.
+@command{gawk} provides the built-in @code{asort()}
+and @code{asorti()} functions
+(@pxref{String Functions})
+for sorting arrays. For example:
@example
-x = 0.875 # 1/2 + 1/4 + 1/8
-y = 0.425
+@var{populate the array} data
+n = asort(data)
+for (i = 1; i <= n; i++)
+ @var{do something with} data[i]
@end example
-Unlike the number in @code{y}, the number stored in @code{x}
-is exactly representable
-in binary since it can be written as a finite sum of one or
-more fractions whose denominators are all powers of two.
-When @command{gawk} reads a floating-point number from
-program source, it automatically rounds that number to whatever
-precision your machine supports. If you try to print the numeric
-content of a variable using an output format string of @code{"%.17g"},
-it may not produce the same number as you assigned to it:
+After the call to @code{asort()}, the array @code{data} is indexed from 1
+to some number @var{n}, the total number of elements in @code{data}.
+(This count is @code{asort()}'s return value.)
+@code{data[1]} @value{LEQ} @code{data[2]} @value{LEQ} @code{data[3]}, and so on.
+The comparison is based on the type of the elements
+(@pxref{Typing and Comparison}).
+All numeric values come before all string values,
+which in turn come before all subarrays.
+
+@cindex side effects, @code{asort()} function
+An important side effect of calling @code{asort()} is that
+@emph{the array's original indices are irrevocably lost}.
+As this isn't always desirable, @code{asort()} accepts a
+second argument:
@example
-$ @kbd{gawk 'BEGIN @{ x = 0.875; y = 0.425}
-> @kbd{ printf("%0.17g, %0.17g\n", x, y) @}'}
address@hidden 0.875, 0.42499999999999999
+@var{populate the array} source
+n = asort(source, dest)
+for (i = 1; i <= n; i++)
+ @var{do something with} dest[i]
@end example
-Often the error is so small you do not even notice it, and if you do,
-you can always specify how much precision you would like in your output.
-Usually this is a format string like @code{"%.15g"}, which when
-used in the previous example, produces an output identical to the input.
+In this case, @command{gawk} copies the @code{source} array into the
+@code{dest} array and then sorts @code{dest}, destroying its indices.
+However, the @code{source} array is not affected.
-Because the underlying representation can be little bit off from the exact
value,
-comparing floating-point values to see if they are equal is generally not a good idea.
-Here is an example where it does not work like you expect:
+@code{asort()} accepts a third string argument to control comparison of
+array elements. As with @code{PROCINFO["sorted_in"]}, this argument
+may be one of the predefined names that @command{gawk} provides
+(@pxref{Controlling Scanning}), or the name of a user-defined function
+(@pxref{Controlling Array Traversal}).
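+
+For example, this sketch (with made-up data) sorts element values
+numerically in descending order by passing the predefined name
+@code{"@@val_num_desc"} as the third argument:
+
+@example
+BEGIN @{
+    # hypothetical sample data
+    source["a"] = 3; source["b"] = 10; source["c"] = 7
+    n = asort(source, dest, "@@val_num_desc")
+    for (i = 1; i <= n; i++)
+        print dest[i]     # prints 10, then 7, then 3
+@}
+@end example
+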
-@example
-$ @kbd{gawk 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'}
-@print{} 0
-@end example
+@quotation NOTE
+In all cases, the sorted element values consist of the original
+array's element values. The ability to control comparison merely
+affects the way in which they are sorted.
+@end quotation
-The loss of accuracy during a single computation with floating-point numbers
-usually isn't enough to worry about. However, if you compute a value
-which is the result of a sequence of floating point operations,
-the error can accumulate and greatly affect the computation itself.
-Here is an attempt to compute the value of the constant
address@hidden using one of its many series representations:
+Often, what's needed is to sort on the values of the @emph{indices}
+instead of the values of the elements.
+To do that, use the
+@code{asorti()} function. The interface is identical to that of
+@code{asort()}, except that the index values are used for sorting, and
+become the values of the result array:
@example
-BEGIN @{
- x = 1.0 / sqrt(3.0)
- n = 6
- for (i = 1; i < 30; i++) @{
- n = n * 2.0
- x = (sqrt(x * x + 1) - 1) / x
- printf("%.15f\n", n * x)
address@hidden source[$0] = some_func($0) @}
+
+END @{
+ n = asorti(source, dest)
+ for (i = 1; i <= n; i++) @{
+ @ii{Work with sorted indices directly:}
+ @var{do something with} dest[i]
+ @dots{}
+ @ii{Access original array via sorted indices:}
+ @var{do something with} source[dest[i]]
@}
@}
@end example
-When run, the early errors propagating through later computations
-cause the loop to terminate prematurely after an attempt to divide by zero.
+Similar to @code{asort()},
+in all cases, the sorted element values consist of the original
+array's indices. The ability to control comparison merely
+affects the way in which they are sorted.
address@hidden
-$ @kbd{gawk -f pi.awk}
address@hidden 3.215390309173475
address@hidden 3.159659942097510
address@hidden 3.146086215131467
address@hidden 3.142714599645573
address@hidden
address@hidden 3.224515243534819
address@hidden 2.791117213058638
address@hidden 0.000000000000000
address@hidden gawk: pi.awk:6: fatal: division by zero attempted
address@hidden example
+Sorting the array by replacing the indices provides maximal flexibility.
+To traverse the elements in decreasing order, use a loop that goes from
address@hidden down to 1, either over the elements or over the address@hidden
+may also use one of the predefined sorting names that sorts in
+decreasing order.}
-Here is one more example where the inaccuracies in internal representations
-yield an unexpected result:
address@hidden reference counting, sorting arrays
+Copying array indices and elements isn't expensive in terms of memory.
+Internally, @command{gawk} maintains @dfn{reference counts} to data.
+For example, when @code{asort()} copies the first array to the second one,
+there is only one copy of the original array elements' data, even though
+both arrays use the values.
address@hidden
-$ @kbd{gawk 'BEGIN @{}
-> @kbd{for (d = 1.1; d <= 1.5; d += 0.1)}
-> @kbd{i++}
-> @kbd{print i}
-> @address@hidden'}
address@hidden 4
address@hidden example
address@hidden Document It And Call It A Feature. Sigh.
address@hidden @command{gawk}, @code{IGNORECASE} variable in
address@hidden @code{IGNORECASE} variable
address@hidden arrays, sorting, @code{IGNORECASE} variable and
address@hidden @code{IGNORECASE} variable, array sorting and
+Because @code{IGNORECASE} affects string comparisons, the value
+of @code{IGNORECASE} also affects sorting for both @code{asort()} and
@code{asorti()}.
+Note also that the locale's sorting order does @emph{not}
+come into play; comparisons are based on character values address@hidden
+is true because locale-based comparison occurs only when in POSIX
+compatibility mode, and since @code{asort()} and @code{asorti()} are
address@hidden extensions, they are not available in that case.}
+Caveat Emptor.
-Can computation using aribitrary precision help with the previous examples?
-If you are impatient to know, see
address@hidden Arithmetic}.
address@hidden Two-way I/O
address@hidden Two-Way Communications with Another Process
address@hidden Brennan, Michael
address@hidden programmers, attractiveness of
address@hidden
address@hidden Path:
cssun.mathcs.emory.edu!gatech!newsxfer3.itd.umich.edu!news-peer.sprintlink.net!news-sea-19.sprintlink.net!news-in-west.sprintlink.net!news.sprintlink.net!Sprint!204.94.52.5!news.whidbey.com!brennan
+From: brennan@@whidbey.com (Mike Brennan)
+Newsgroups: comp.lang.awk
+Subject: Re: Learn the SECRET to Attract Women Easily
+Date: 4 Aug 1997 17:34:46 GMT
address@hidden Organization: WhidbeyNet
address@hidden Lines: 12
+Message-ID: <5s53rm$eca@@news.whidbey.com>
address@hidden References: <address@hidden>
address@hidden Reply-To: address@hidden
address@hidden NNTP-Posting-Host: asn202.whidbey.com
address@hidden X-Newsreader: slrn (0.9.4.1 UNIX)
address@hidden Xref: cssun.mathcs.emory.edu comp.lang.awk:5403
-Instead of aribitrary precision floating-point arithmetic,
-often all you need is an adjustment of your logic
-or a different order for the operations in your calculation.
-The stability and the accuracy of the computation of the constant @value{PI}
-in the previous example can be enhanced by using the following
-simple algebraic transformation:
+On 3 Aug 1997 13:17:43 GMT, Want More Dates???
+<tracy78@@kilgrona.com> wrote:
+>Learn the SECRET to Attract Women Easily
+>
+>The SCENT(tm) Pheromone Sex Attractant For Men to Attract Women
+
+The scent of awk programmers is a lot more attractive to women than
+the scent of perl programmers.
+--
+Mike Brennan
address@hidden brennan@@whidbey.com
address@hidden smallexample
+
+@cindex advanced features, @command{gawk}, processes@comma{} communicating with
address@hidden processes, two-way communications with
+It is often useful to be able to
+send data to a separate program for
+processing and then read the result. This can always be
+done with temporary files:
@example
-(sqrt(x * x + 1) - 1) / x = x / (sqrt(x * x + 1) + 1)
+# Write the data for processing
+tempfile = ("mydata." PROCINFO["pid"])
+while (@var{not done with data})
+ print @var{data} | ("subprogram > " tempfile)
+close("subprogram > " tempfile)
+
+# Read the results, remove tempfile when done
+while ((getline newdata < tempfile) > 0)
+ @var{process} newdata @var{appropriately}
+close(tempfile)
+system("rm " tempfile)
@end example
@noindent
-After making this, change the program does converge to
address@hidden in under 30 iterations:
+This works, but not elegantly. Among other things, it requires that
+the program be run in a directory that cannot be shared among users;
+for example, @file{/tmp} will not do, as another user might happen
+to be using a temporary file with the same name.
+
address@hidden coprocesses
address@hidden input/output, two-way
address@hidden @code{|} (vertical bar), @code{|&} operator (I/O)
address@hidden vertical bar (@code{|}), @code{|&} operator (I/O)
address@hidden @command{csh} utility, @code{|&} operator, comparison with
+However, with @command{gawk}, it is possible to
+open a @emph{two-way} pipe to another process. The second process is
+termed a @dfn{coprocess}, since it runs in parallel with @command{gawk}.
+The two-way connection is created using the @samp{|&} operator
+(borrowed from the Korn shell, @command{ksh}):@footnote{This is very
+different from the same operator in the C shell.}
@example
-$ @kbd{gawk -f /tmp/pi2.awk}
address@hidden 3.215390309173473
address@hidden 3.159659942097501
address@hidden 3.146086215131436
address@hidden 3.142714599645370
address@hidden 3.141873049979825
address@hidden
address@hidden 3.141592653589797
address@hidden 3.141592653589797
+do @{
+ print @var{data} |& "subprogram"
+ "subprogram" |& getline results
address@hidden while (@var{data left to process})
+close("subprogram")
@end example
-There is no need to be unduly suspicious about the results from
-floating-point arithmetic. The lesson to remember is that
-floating-point arithmetic is always more complex than the arithmetic using
-pencil and paper. In order to take advantage of the power
-of computer floating-point, you need to know its limitations
-and work within them. For most casual use of floating-point arithmetic,
-you will often get the expected result in the end if you simply round
-the display of your final results to the correct number of significant
-decimal digits. And, avoid presenting numerical data in a manner that
-implies better precision than is actually the case.
-
address@hidden
-* Floating-point Representation:: Binary floating-point representation.
-* Floating-point Context:: Floating-point context.
-* Rounding Mode:: Floating-point rounding mode.
address@hidden menu
-
address@hidden Floating-point Representation
address@hidden Binary Floating-point Representation
address@hidden IEEE-754 format
+The first time an I/O operation is executed using the @samp{|&}
+operator, @command{gawk} creates a two-way pipeline to a child process
+that runs the other program. Output created with @code{print}
+or @code{printf} is written to the program's standard input, and
+output from the program's standard output can be read by the @command{gawk}
+program using @code{getline}.
+As is the case with processes started by @samp{|}, the subprogram
+can be any program, or pipeline of programs, that can be started by
+the shell.
-Although floating-point representations vary from machine to machine,
-the most commonly encountered representation is that defined by the
-IEEE 754 Standard. An IEEE-754 format value has three components:
+There are some cautionary items to be aware of:
@itemize @bullet
@item
-A sign bit telling whether the number is positive or negative.
-
address@hidden
-An @dfn{exponent} giving its order of magnitude, @var{e}.
+As the code inside @command{gawk} currently stands, the coprocess's
+standard error goes to the same place that the parent @command{gawk}'s
+standard error goes. It is not possible to read the child's
+standard error separately.
address@hidden deadlocks
address@hidden buffering, input/output
address@hidden @code{getline} command, deadlock and
@item
-A @dfn{significand}, @var{s},
-specifying the actual digits of the number.
+I/O buffering may be a problem. @command{gawk} automatically
+flushes all output down the pipe to the coprocess.
+However, if the coprocess does not flush its output,
+@command{gawk} may hang when doing a @code{getline} in order to read
+the coprocess's results. This could lead to a situation
+known as @dfn{deadlock}, where each process is waiting for the
+other one to do something.
@end itemize
-The value of the
-number is then
address@hidden
address@hidden @cdot 2^e}.
address@hidden iftex
address@hidden
address@hidden * 2^e}.
address@hidden ifnottex
-The first bit of a non-zero binary significand
-is always one, so the significand in an IEEE-754 format only includes the
-fractional part, leaving the leading one implicit.
address@hidden @code{close()} function, two-way pipes and
+It is possible to close just one end of the two-way pipe to
+a coprocess, by supplying a second argument to the @code{close()}
+function of either @code{"to"} or @code{"from"}
+(@pxref{Close Files And Pipes}).
+These strings tell @command{gawk} to close the end of the pipe
+that sends data to the coprocess or the end that reads from it,
+respectively.
-Three of the standard IEEE-754 types are 32-bit single precision,
-64-bit double precision and 128-bit quadruple precision.
-The standard also specifies extended precision formats
-to allow greater precisions and larger exponent ranges.
address@hidden @command{sort} utility, coprocesses and
+This is particularly necessary in order to use
+the system @command{sort} utility as part of a coprocess;
+@command{sort} must read @emph{all} of its input
+data before it can produce any output.
+The @command{sort} program does not receive an end-of-file indication
+until @command{gawk} closes the write end of the pipe.
-The significand is stored in @dfn{normalized} format,
-which means that the first bit is always a one.
+When you have finished writing data to the @command{sort}
+utility, you can close the @code{"to"} end of the pipe, and
+then start reading sorted data via @code{getline}.
+For example:
address@hidden Floating-point Context
address@hidden Floating-point Context
address@hidden context, floating-point
+@example
+BEGIN @{
+ command = "LC_ALL=C sort"
+ n = split("abcdefghijklmnopqrstuvwxyz", a, "")
-A floating-point @dfn{context} defines the environment for arithmetic operations.
-It governs precision, sets rules for rounding, and limits the range for exponents.
-The context has the following primary components:
+ for (i = n; i > 0; i--)
+ print a[i] |& command
+ close(command, "to")
address@hidden @dfn
address@hidden Precision
-Precision of the floating-point format in bits.
address@hidden emax
-Maximum exponent allowed for this format.
address@hidden emin
-Minimum exponent allowed for this format.
address@hidden Underflow behavior
-The format may or may not support gradual underflow.
address@hidden Rounding
-The rounding mode of this context.
address@hidden table
+ while ((command |& getline line) > 0)
+ print "got", line
+ close(command)
+@}
+@end example
address@hidden lists the precision and exponent
-field values for the basic IEEE-754 binary formats:
+This program writes the letters of the alphabet in reverse order, one
+per line, down the two-way pipe to @command{sort}. It then closes the
+write end of the pipe, so that @command{sort} receives an end-of-file
+indication. This causes @command{sort} to sort the data and write the
+sorted data back to the @command{gawk} program. Once all of the data
+has been read, @command{gawk} terminates the coprocess and exits.
address@hidden Table,table-ieee-formats
address@hidden IEEE Format Context Values}
address@hidden @columnfractions .20 .20 .20 .20 .20
address@hidden Name @tab Total bits @tab Precision @tab emin @tab emax
address@hidden Single @tab 32 @tab 24 @tab @minus{}126 @tab +127
address@hidden Double @tab 64 @tab 53 @tab @minus{}1022 @tab +1023
address@hidden Quadruple @tab 128 @tab 113 @tab @minus{}16382 @tab +16383
address@hidden multitable
address@hidden float
+As a side note, the assignment @samp{LC_ALL=C} in the @command{sort}
+command ensures traditional Unix (ASCII) sorting from @command{sort}.
address@hidden NOTE
-The precision numbers include the implied leading one that gives them
-one extra bit of significand.
address@hidden quotation
address@hidden @command{gawk}, @code{PROCINFO} array in
address@hidden @code{PROCINFO} array
+You may also use pseudo-ttys (ptys) for
+two-way communication instead of pipes, if your system supports them.
+This is done on a per-command basis, by setting a special element
+in the @code{PROCINFO} array
+(@pxref{Auto-set}),
+like so:
-A floating-point context can also determine which signals are treated
-as exceptions, and can set rules for arithmetic with special values.
-Please consult the IEEE-754 standard or other resources for details.
address@hidden
+command = "sort -nr" # command, save in convenience variable
+PROCINFO[command, "pty"] = 1 # update PROCINFO
+print @dots{} |& command # start two-way pipe
address@hidden
address@hidden example
address@hidden ordinarily uses the hardware double precision
-representation for numbers. On most systems, this is IEEE-754
-floating-point format, corresponding to 64-bit binary with 53 bits
-of precision.
address@hidden
+Using ptys avoids the buffer deadlock issues described earlier, at some
+loss in performance. If your system does not have ptys, or if all the
+system's ptys are in use, @command{gawk} automatically falls back to
+using regular pipes.
address@hidden NOTE
-In case an underflow occurs, the standard allows, but does not require,
-the result from an arithmetic operation to be a number smaller than
-the smallest nonzero normalized number. Such numbers do
-not have as many significant digits as normal numbers, and are called
address@hidden or @dfn{subnormals}. The alternative, simply returning a zero,
-is called @dfn{flush to zero}. The basic IEEE-754 binary formats
-support subnormal numbers.
address@hidden TCP/IP Networking
address@hidden Using @command{gawk} for Network Programming
address@hidden advanced features, @command{gawk}, network programming
address@hidden networks, programming
address@hidden STARTOFRANGE tcpip
address@hidden TCP/IP
address@hidden @code{/inet/@dots{}} special files (@command{gawk})
address@hidden files, @code{/inet/@dots{}} (@command{gawk})
address@hidden @code{/inet4/@dots{}} special files (@command{gawk})
address@hidden files, @code{/inet4/@dots{}} (@command{gawk})
address@hidden @code{/inet6/@dots{}} special files (@command{gawk})
address@hidden files, @code{/inet6/@dots{}} (@command{gawk})
address@hidden @code{EMISTERED}
address@hidden
address@hidden:@*
+@ @ @ @ @i{A host is a host from coast to coast,@*
+@ @ @ @ and no-one can talk to host that's close,@*
+@ @ @ @ unless the host that isn't address@hidden
+@ @ @ @ is busy hung or dead.}
@end quotation
address@hidden Rounding Mode
address@hidden Floating-point Rounding Mode
address@hidden rounding mode, floating-point
+In addition to being able to open a two-way pipeline to a coprocess
+on the same system
+(@pxref{Two-way I/O}),
+it is possible to make a two-way connection to
+another process on another system across an IP network connection.
-The @dfn{rounding mode} specifies the behavior for the results of numerical
-operations when discarding extra precision. Each rounding mode indicates
-how the least significant returned digit of a rounded result is to
-be calculated.
address@hidden lists the IEEE-754 defined
-rounding modes:
+You can think of this as just a @emph{very long} two-way pipeline to
+a coprocess.
+The way @command{gawk} decides that you want to use TCP/IP networking is
+by recognizing special @value{FN}s that begin with one of @samp{/inet/},
+@samp{/inet4/} or @samp{/inet6}.
address@hidden Table,table-rounding-modes
address@hidden 754 Rounding Modes}
address@hidden @columnfractions .45 .55
address@hidden Rounding Mode @tab IEEE Name
address@hidden Round to nearest, ties to even @tab @code{roundTiesToEven}
address@hidden Round toward plus Infinity @tab @code{roundTowardPositive}
address@hidden Round toward negative Infinity @tab @code{roundTowardNegative}
address@hidden Round toward zero @tab @code{roundTowardZero}
address@hidden Round to nearest, ties away from zero @tab @code{roundTiesToAway}
address@hidden multitable
address@hidden float
+The full syntax of the special @value{FN} is
address@hidden/@var{net-type}/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}}.
+The components are:
-The default mode @code{roundTiesToEven} is the most preferred,
-but the least intuitive. This method does the obvious thing for most values,
-by rounding them up or down to the nearest digit.
-For example, rounding 1.132 to two digits yields 1.13,
-and rounding 1.157 yields 1.16.
address@hidden @var
address@hidden net-type
+Specifies the kind of Internet connection to make.
+Use @samp{/inet4/} to force IPv4, and
+@samp{/inet6/} to force IPv6.
+Plain @samp{/inet/} (which used to be the only option) uses
+the system default, most likely IPv4.
-However, when it comes to rounding a value that is exactly halfway between,
-things do not work the way you probably learned in school.
-In this case, the number is rounded to the nearest even digit.
-So rounding 0.125 to two digits rounds down to 0.12,
-but rounding 0.6875 to three digits rounds up to 0.688.
-You probably have already encountered this rounding mode when
-using the @code{printf} routine to format floating-point numbers.
-For example:
address@hidden protocol
+The protocol to use over IP. This must be either @samp{tcp}, or
+@samp{udp}, for a TCP or UDP IP connection,
+respectively. The use of TCP is recommended for most applications.
+
address@hidden local-port
address@hidden @code{getaddrinfo()} function (C library)
+The local TCP or UDP port number to use. Use a port number of @samp{0}
+when you want the system to pick a port. This is what you should do
+when writing a TCP or UDP client.
+You may also use a well-known service name, such as @samp{smtp}
+or @samp{http}, in which case @command{gawk} attempts to determine
+the predefined port number using the C @code{getaddrinfo()} function.
+
address@hidden remote-host
+The IP address or fully-qualified domain name of the Internet
+host to which you want to connect.
+
address@hidden remote-port
+The TCP or UDP port number to use on the given @var{remote-host}.
+Again, use @samp{0} if you don't care, or else a well-known
+service name.
address@hidden table
+
address@hidden @command{gawk}, @code{ERRNO} variable in
address@hidden @code{ERRNO} variable
address@hidden NOTE
+Failure in opening a two-way socket will result in a non-fatal error
+being returned to the calling code. The value of @code{ERRNO} indicates
+the error (@pxref{Auto-set}).
address@hidden quotation
+
+Consider the following very simple example:
@example
BEGIN @{
- x = -4.5
- for (i = 1; i < 10; i++) @{
- x += 1.0
- printf("%4.1f => %2.0f\n", x, x)
- @}
+ Service = "/inet/tcp/0/localhost/daytime"
+ Service |& getline
+ print $0
+ close(Service)
@}
@end example
address@hidden
-produces the following output when run:@footnote{It
-is possible for the output to be completely different if the
-C library in your system does not use the IEEE-754 even-rounding
-rule to round halfway cases for @code{printf()}.}
-
address@hidden
--3.5 => -4
--2.5 => -2
--1.5 => -2
--0.5 => 0
- 0.5 => 0
- 1.5 => 2
- 2.5 => 2
- 3.5 => 4
- 4.5 => 4
address@hidden example
-
-The theory behind the rounding mode @code{roundTiesToEven} is that
-it more or less evenly distributes upward and downward rounds
-of exact halves, which might cause the round-off error
-to cancel itself out. This is the default rounding mode used
-in IEEE-754 computing functions and operators.
+This program reads the current date and time from the local system's
+TCP @samp{daytime} server.
+It then prints the results and closes the connection.
-The other rounding modes are rarely used.
-Round toward positive infinity (@code{roundTowardPositive})
-and round toward negative infinity (@code{roundTowardNegative})
-are often used to implement interval arithmetic,
-where you adjust the rounding mode to calculate upper and lower bounds
-for the range of output. The @code{roundTowardZero}
-mode can be used for converting floating-point numbers to integers.
-The rounding mode @code{roundTiesToAway} rounds the result to the
-nearest number and selects the number with the larger magnitude
-if a tie occurs.
+Because this topic is extensive, the use of @command{gawk} for
+TCP/IP programming is documented separately.
address@hidden
+See
address@hidden, , General Introduction, gawkinet, TCP/IP Internetworking with
@command{gawk}},
address@hidden ifinfo
address@hidden
+See @cite{TCP/IP Internetworking with @command{gawk}},
+which comes as part of the @command{gawk} distribution,
address@hidden ifnotinfo
+for a much more complete introduction and discussion, as well as
+extensive examples.
-Some numerical analysts will tell you that your choice of rounding style
-has tremendous impact on the final outcome, and advise you to wait until
-final output for any rounding. Instead, you can often avoid round-off error problems by
-setting the precision initially to some value sufficiently larger than
-the final desired precision, so that the accumulation of round-off error
-does not influence the outcome.
-If you suspect that results from your computation are
-sensitive to accumulation of round-off error,
-one way to be sure is to look for a significant difference in output
-when you change the rounding mode.
address@hidden ENDOFRANGE tcpip
address@hidden Gawk and MPFR
address@hidden @command{gawk} + MPFR = Powerful Arithmetic
address@hidden Profiling
address@hidden Profiling Your @command{awk} Programs
address@hidden STARTOFRANGE awkp
address@hidden @command{awk} programs, profiling
address@hidden STARTOFRANGE proawk
address@hidden profiling @command{awk} programs
address@hidden profiling @command{gawk}
address@hidden @code{awkprof.out} file
address@hidden files, @code{awkprof.out}
-The rest of this @value{CHAPTER} decsribes how to use the arbitrary precision
-(also known as @dfn{multiple precision} or @dfn{infinite precision}) numeric
-capabilites in @command{gawk} to produce maximally accurate results
-when you need it.
+You may produce execution traces of your @command{awk} programs.
+This is done by passing the option @option{--profile} to @command{gawk}.
+When @command{gawk} has finished running, it creates a profile of your program in a file
+named @file{awkprof.out}. Because it is profiling, it also executes up to 45% slower than
+@command{gawk} normally does.
-But first you should check if your version of
address@hidden supports arbitrary precision arithmetic.
-The easiest way to find out is to look at the output of
-the following command:
address@hidden @code{--profile} option
+As shown in the following example,
+the @option{--profile} option can be used to change the name of the file
+where @command{gawk} will write the profile:
@example
-$ @kbd{gawk --version}
address@hidden GNU Awk 4.1.0 (GNU MPFR 3.1.0, GNU MP 5.0.3)
address@hidden Copyright (C) 1989, 1991-2012 Free Software Foundation.
address@hidden
+gawk --profile=myprog.prof -f myprog.awk data1 data2
@end example
address@hidden uses the
address@hidden://www.mpfr.org, GNU MPFR}
-and
address@hidden://gmplib.org, GNU MP} (GMP)
-libraries for arbitrary precision
-arithmetic on numbers. So if you do not see the names of these libraries
-in the output, then your version of @command{gawk} does not support
-arbitrary precision arithmetic.
-
-Additionally,
-there are a few elements available in the @code{PROCINFO} array
-to provide information about the MPFR and GMP libraries.
address@hidden, for more information.
-
address@hidden
-Even if you aren't interested in arbitrary precision arithmetic, you
-may still benefit from knowing about how @command{gawk} handles numbers
-in general, and the limitations of doing arithmetic with ordinary
address@hidden numbers.
address@hidden ignore
address@hidden
+In the above example, @command{gawk} places the profile in
+@file{myprog.prof} instead of in @file{awkprof.out}.
+Here is a sample session showing a simple @command{awk} program, its input data, and the
+results from running @command{gawk} with the @option{--profile} option.
+First, the @command{awk} program:
address@hidden Arbitrary Precision Floats
address@hidden Arbitrary Precision Floating-point Arithmetic with @command{gawk}
address@hidden
+BEGIN @{ print "First BEGIN rule" @}
address@hidden uses the GNU MPFR library
-for arbitrary precision floating-point arithmetic. The MPFR library
-provides precise control over precisions and rounding modes, and gives
-correctly rounded reproducible platform-independent results. With the
-command-line option @option{--bignum} or @option{-M},
-all floating-point arithmetic operators and numeric functions can yield
-results to any desired precision level supported by MPFR.
-Two built-in
-variables @code{PREC}
-(@pxref{Setting Precision})
-and @code{ROUNDMODE}
-(@pxref{Setting Rounding Mode})
-provide control over the working precision and the rounding mode.
-The precision and the rounding mode are set globally for every operation
-to follow.
+END @{ print "First END rule" @}
-The default working precision for arbitrary precision floating-point values is 53,
-and the default value for @code{ROUNDMODE} is @code{"N"},
-which selects the IEEE-754
address@hidden (@pxref{Rounding Mode}) rounding address@hidden
-default precision is 53, since according to the MPFR documentation,
-the library should be able to exactly reproduce all computations with
-double-precision machine floating-point numbers (@code{double} type
-in C), except the default exponent range is much wider and subnormal
-numbers are not implemented.}
address@hidden uses the default exponent range in MPFR
address@hidden
-(@math{emax = 2^{30} - 1, emin = -emax})
address@hidden iftex
address@hidden
-(@var{emax} = 2^30 @minus{} 1, @var{emin} = @address@hidden)
address@hidden ifnottex
-for all floating-point contexts.
-There is no explicit mechanism to adjust the exponent range.
-MPFR does not implement subnormal numbers by default,
-and this behavior cannot be changed in @command{gawk}.
+/foo/ @{
+ print "matched /foo/, gosh"
+ for (i = 1; i <= 3; i++)
+ sing()
address@hidden
address@hidden NOTE
-When emulating an IEEE-754 format (@pxref{Setting Precision}),
address@hidden internally adjusts the exponent range
-to the value defined for the format and also performs computations needed for
-gradual underflow (subnormal numbers).
address@hidden quotation
address@hidden
+ if (/foo/)
+ print "if is true"
+ else
+ print "else is true"
address@hidden
address@hidden NOTE
-MPFR numbers are variable-size entities, consuming only as much space as
-needed to store the significant digits. Since the performance using MPFR
-numbers pales in comparison to doing arithmetic using the underlying machine
-types, you should consider using only as much precision as needed by
-your program.
address@hidden quotation
+BEGIN @{ print "Second BEGIN rule" @}
address@hidden
-* Setting Precision:: Setting the working precision.
-* Setting Rounding Mode:: Setting the rounding mode.
-* Floating-point Constants:: Representing floating-point constants.
-* Changing Precision:: Changing the precision of a number.
-* Exact Arithmetic:: Exact arithmetic with floating-point numbers.
address@hidden menu
+END @{ print "Second END rule" @}
address@hidden Setting Precision
address@hidden Setting the Working Precision
address@hidden @code{PREC} variable
+function sing( dummy)
address@hidden
+ print "I gotta be me!"
address@hidden
address@hidden example
address@hidden uses a global working precision; it does not keep track of
-the precision or accuracy of individual numbers. Performing an arithmetic
-operation or calling a built-in function rounds the result to the current
-working precision. The default working precision is 53 which can be
-modified using the built-in variable @code{PREC}. You can also set the
-value to one of the following pre-defined case-insensitive strings
-to emulate an IEEE-754 binary format:
+Following is the input data:
address@hidden address@hidden"double"}} {12345678901234567890123456789012345}
address@hidden @code{PREC} @tab IEEE-754 Binary Format
address@hidden @code{"half"} @tab 16-bit half-precision.
address@hidden @code{"single"} @tab Basic 32-bit single precision.
address@hidden @code{"double"} @tab Basic 64-bit double precision.
address@hidden @code{"quad"} @tab Basic 128-bit quadruple precision.
address@hidden @code{"oct"} @tab 256-bit octuple precision.
address@hidden multitable
address@hidden
+foo
+bar
+baz
+foo
+junk
address@hidden example
-The following example illustrates the effects of changing precision
-on arithmetic operations:
+Here is the @file{awkprof.out} that results from running the @command{gawk}
+profiler on this program and data (this example also illustrates that @command{awk}
+programmers sometimes have to work late):
address@hidden @code{BEGIN} pattern
address@hidden @code{END} pattern
@example
-$ @kbd{gawk -M -vPREC=100 'BEGIN @{ x = 1.0e-400; print x + 0; \}
-> @kbd{PREC = "double"; print x + 0 @}'}
address@hidden 1e-400
address@hidden 0
address@hidden example
+ # gawk profile, created Sun Aug 13 00:00:15 2000
-Binary and decimal precisions are related approximately according to the
-formula:
+ # BEGIN block(s)
address@hidden
address@hidden = 3.322 @cdot dps}
address@hidden iftex
address@hidden
address@hidden = 3.322 * @var{dps}
address@hidden ifnottex
+ BEGIN @{
+ 1 print "First BEGIN rule"
+ 1 print "Second BEGIN rule"
+ @}
address@hidden
-Here, @var{prec} denotes the binary precision
-(measured in bits) and @var{dps} (short for decimal places)
-is the decimal digits. We can easily calculate how many decimal
-digits the 53-bit significand of an IEEE double is equivalent to:
-53 / 3.332 which is equal to about 15.95.
-But what does 15.95 digits actually mean? It depends whether you are
-concerned about how many digits you can rely on, or how many digits
-you need.
+ # Rule(s)
-It is important to know how many bits it takes to uniquely identify
-a double-precision value (the C type @code{double}). If you want to
-convert from @code{double} to decimal and back to @code{double} (e.g.,
-saving a @code{double} representing an intermediate result to a file, and
-later reading it back to restart the computation), then a few more decimal
-digits are required. 17 digits is generally enough for a @code{double}.
+ 5 /foo/ @{ # 2
+ 2 print "matched /foo/, gosh"
+ 6 for (i = 1; i <= 3; i++) @{
+ 6 sing()
+ @}
+ @}
-It can also be important to know what decimal numbers can be uniquely
-represented with a @code{double}. If you want to convert
-from decimal to @code{double} and back again, 15 digits is the most that
-you can get. Stated differently, you should not present
-the numbers from your floating-point computations with more than 15
-significant digits in them.
+ 5 @{
+ 5 if (/foo/) @{ # 2
+ 2 print "if is true"
+ 3 @} else @{
+ 3 print "else is true"
+ @}
+ @}
-Conversely, it takes a precision of 332 bits to hold an approximation
-of the constant @value{PI} that is accurate to 100 decimal places.
-You should always add some extra bits in order to avoid the confusing round-off
-issues that occur because numbers are stored internally in binary.
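The two figures quoted in the removed text (about 15.95 digits for 53 bits, and 332 bits for 100 digits) can be checked with a quick calculation. This is only a sketch: it uses plain @command{awk} in ordinary double precision rather than @option{-M}, and it assumes the constant 3.322 from the formula as the approximation of log2(10). The variable names `digits` and `bits` are illustrative.

```shell
# Assumption: 3.322 approximates log2(10), as in prec = 3.322 * dps.
digits=$(awk 'BEGIN { printf "%.2f", 53 / 3.322 }')    # decimal digits held by 53 bits
bits=$(awk 'BEGIN { printf "%.1f", 100 * 3.322 }')     # bits needed for 100 digits
echo "53 bits hold about $digits decimal digits"
echo "100 decimal digits need about $bits bits"
```

Rounding the second result down gives the 332 bits mentioned for the approximation of pi; in practice the text advises adding a few extra bits on top of that.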
+ # END block(s)
address@hidden Setting Rounding Mode
address@hidden Setting the Rounding Mode
address@hidden @code{ROUNDMODE} variable
+ END @{
+ 1 print "First END rule"
+ 1 print "Second END rule"
+ @}
-The @code{ROUNDMODE} variable provides
-program level control over the rounding mode.
-The correspondence between @code{ROUNDMODE} and the IEEE
-rounding modes is shown in @ref{table-gawk-rounding-modes}.
+ # Functions, listed alphabetically
address@hidden Table,table-gawk-rounding-modes
address@hidden@command{gawk} Rounding Modes}
address@hidden @columnfractions .45 .30 .25
address@hidden Rounding Mode @tab IEEE Name @tab @code{ROUNDMODE}
address@hidden Round to nearest, ties to even @tab @code{roundTiesToEven} @tab
@code{"N"} or @code{"n"}
address@hidden Round toward plus Infinity @tab @code{roundTowardPositive} @tab
@code{"U"} or @code{"u"}
address@hidden Round toward negative Infinity @tab @code{roundTowardNegative}
@tab @code{"D"} or @code{"d"}
address@hidden Round toward zero @tab @code{roundTowardZero} @tab @code{"Z"} or
@code{"z"}
address@hidden Round to nearest, ties away from zero @tab
@code{roundTiesToAway} @tab @code{"A"} or @code{"a"}
address@hidden multitable
address@hidden float
+ 6 function sing(dummy)
+ @{
+ 6 print "I gotta be me!"
+ @}
address@hidden example
address@hidden has the default value @code{"N"},
-which selects the IEEE-754 rounding mode @code{roundTiesToEven}.
-Besides the values listed in @ref{table-gawk-rounding-modes},
address@hidden also accepts @code{"A"} to select the IEEE-754 mode
address@hidden
-if your version of the MPFR library supports it; otherwise setting
address@hidden to this value has no effect. @xref{Rounding Mode},
-for the meanings of the various rounding modes.
+This example illustrates many of the basic features of profiling output.
+They are as follows:
-Here is an example of how to change the default rounding behavior of
address@hidden's output:
address@hidden @bullet
address@hidden
+The program is printed in the order @code{BEGIN} rule,
+@code{BEGINFILE} rule,
+pattern/action rules,
+@code{ENDFILE} rule, @code{END} rule and functions, listed
+alphabetically.
+Multiple @code{BEGIN} and @code{END} rules are merged together,
+as are multiple @code{BEGINFILE} and @code{ENDFILE} rules.
address@hidden
-$ @kbd{gawk -M -vROUNDMODE="Z" 'BEGIN @{ printf("%.2f\n", 1.378) @}'}
address@hidden 1.37
address@hidden example
address@hidden patterns, counts
address@hidden
+Pattern-action rules have two counts.
+The first count, to the left of the rule, shows how many times
+the rule's pattern was @emph{tested}.
+The second count, to the right of the rule's opening left brace
+in a comment,
+shows how many times the rule's action was @emph{executed}.
+The difference between the two indicates how many times the rule's
+pattern evaluated to false.
address@hidden Floating-point Constants
address@hidden Representing Floating-point Constants
address@hidden constants, floating-point
address@hidden
+Similarly,
+the count for an @code{if}-@code{else} statement shows how many times
+the condition was tested.
+To the right of the opening left brace for the @code{if}'s body
+is a count showing how many times the condition was true.
+The count for the @code{else}
+indicates how many times the test failed.
-Be wary of floating-point constants! When reading a floating-point constant
-from program source code, @command{gawk} uses the default precision,
-unless overridden
-by an assignment to the special variable @code{PREC} on the command
-line, to store it internally as a MPFR number.
-Changing the precision using @code{PREC} in the program text does
-not change the precision of a constant. If you need to
-represent a floating-point constant at a higher precision than the
-default and cannot use a command line assignment to @code{PREC},
-you should either specify the constant as a string, or
-as a rational number whenever possible. The following example
-illustrates the differences among various ways to
-print a floating-point constant:
address@hidden loops, count for header
address@hidden
+The count for a loop header (such as @code{for}
+or @code{while}) shows how many times the loop test was executed.
+(Because of this, you can't just look at the count on the first
+statement in a rule to determine how many times the rule was executed.
+If the first statement is a loop, the count is misleading.)
address@hidden
-$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 0.1) @}'}
address@hidden 0.1000000000000000055511151
-$ @kbd{gawk -M -vPREC=113 'BEGIN @{ printf("%0.25f\n", 0.1) @}'}
address@hidden 0.1000000000000000000000000
-$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", "0.1") @}'}
address@hidden 0.1000000000000000000000000
-$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 1/10) @}'}
address@hidden 0.1000000000000000000000000
address@hidden example
address@hidden functions, user-defined, counts
address@hidden user-defined, functions, counts
address@hidden
+For user-defined functions, the count next to the @code{function}
+keyword indicates how many times the function was called.
+The counts next to the statements in the body show how many times
+those statements were executed.
-In the first case, the number is stored with the default precision of 53.
address@hidden @address@hidden@}} (braces)
address@hidden braces (@address@hidden@}})
address@hidden
+The layout uses ``K&R'' style with TABs.
+Braces are used everywhere, even when
+the body of an @code{if}, @code{else}, or loop is only a single statement.
address@hidden Changing Precision
address@hidden Changing the Precision of a Number
address@hidden @code{()} (parentheses)
address@hidden parentheses @code{()}
address@hidden
+Parentheses are used only where needed, as indicated by the structure
+of the program and the precedence rules.
address@hidden extra verbiage here satisfies the copyeditor. ugh.
+For example, @samp{(3 + 5) * 4} means add three plus five, then multiply
+the total by four. However, @samp{3 + 5 * 4} has no parentheses, and
+means @samp{3 + (5 * 4)}.
address@hidden Laurie, Dirk
address@hidden
address@hidden point is that in any variable-precision package,
-a decision is made on how to treat numbers given as data,
-or arising in intermediate results, which are represented in
-floating-point format to a precision lower than working precision.
-Do we promote them to full membership of the high-precision club,
-or do we treat them and all their associates as second-class citizens?
-Sometimes the first course is proper, sometimes the second, and it takes
-careful analysis to tell which.}
address@hidden
address@hidden
+All string concatenations are parenthesized too.
+(This could be made a bit smarter.)
address@hidden ignore
-Dirk address@hidden Laurie.
address@hidden Arithmetic Considered Perilous --- A Detective Story}.
-Electronic Transactions on Numerical Analysis. Volume 28, pp. 168-173, 2008.}
address@hidden quotation
address@hidden
+Parentheses are used around the arguments to @code{print}
+and @code{printf} only when
+the @code{print} or @code{printf} statement is followed by a redirection.
+Similarly, if
+the target of a redirection isn't a scalar, it gets parenthesized.
address@hidden does not implicitly modify the precision of any previously
-computed results when the working precision is changed with an assignment
-to @code{PREC}. The precision of a number is always the one that was
-used at the time of its creation, and there is no way for the user
-to explicitly change it afterwards. However, since the result of a
-floating-point arithmetic operation is always an arbitrary precision
-floating-point value---with a precision set by the value of @code{PREC}---one of the
-following workarounds effectively accomplishes the desired behavior:
address@hidden
+@command{gawk} supplies leading comments in
+front of the @code{BEGIN} and @code{END} rules,
+the pattern/action rules, and the functions.
+
address@hidden itemize
+
+The profiled version of your program may not look exactly like what you
+typed when you wrote it. This is because @command{gawk} creates the
+profiled version by ``pretty printing'' its internal representation of
+the program. The advantage to this is that @command{gawk} can produce
+a standard representation. The disadvantage is that all source-code
+comments are lost, as are the distinctions among multiple @code{BEGIN},
+@code{END}, @code{BEGINFILE}, and @code{ENDFILE} rules. Also, things such as:
@example
-x = x + 0.0
+/foo/
@end example
@noindent
-or:
+come out as:
@example
-x += 0.0
+/foo/ @{
+ print $0
address@hidden
@end example
address@hidden Exact Arithmetic
address@hidden Exact Arithmetic with Floating-point Numbers
-
address@hidden CAUTION
-Never depend on the exactness of floating-point arithmetic,
-even for apparently simple expressions!
address@hidden quotation
-
-Can arbitrary precision arithmetic give exact results? There are
-no easy answers. The standard rules of algebra often do not apply
-when using floating-point arithmetic.
-Among other things, the distributive and associative laws
-do not hold completely, and order of operation may be important
-for your computation. Rounding error, cumulative precision loss
-and underflow are often troublesome.
address@hidden
+which is correct, but possibly surprising.
-When @command{gawk} tests the expressions @samp{0.1 + 12.2} and @samp{12.3}
-for equality
-using the machine double precision arithmetic, it decides that they
-are not equal!
-(@xref{Floating-point Programming}.)
-You can get the result you want by increasing the precision;
-56 in this case will get the job done:
address@hidden profiling @command{awk} programs, dynamically
address@hidden @command{gawk} program, dynamic profiling
+Besides creating profiles when a program has completed,
+@command{gawk} can produce a profile while it is running.
+This is useful if your @command{awk} program goes into an
+infinite loop and you want to see what has been executed.
+To use this feature, run @command{gawk} with the @option{--profile}
+option in the background:
address@hidden
-$ @kbd{gawk -M -vPREC=56 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'}
address@hidden 1
address@hidden
+$ @kbd{gawk --profile -f myprog &}
+[1] 13992
@end example
-If adding more bits is good, perhaps adding even more bits of
-precision is better?
-Here is what happens if we use an even larger value of @code{PREC}:
-
address@hidden
-$ @kbd{gawk -M -vPREC=201 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'}
address@hidden 0
address@hidden example
-
-This is not a bug in @command{gawk} or in the MPFR library.
-It is easy to forget that the finite number of bits used to store the value
-is often just an approximation after proper rounding.
-The test for equality succeeds if and only if @emph{all} bits in the two operands
-are exactly the same. Since this is not necessarily true after floating-point
-computations with a particular precision and effective rounding rule,
-a straight test for equality may not work.
address@hidden @command{kill} address@hidden dynamic profiling
address@hidden @code{USR1} signal
address@hidden @code{SIGUSR1} signal
address@hidden signals, @code{USR1}/@code{SIGUSR1}
address@hidden
+The shell prints a job number and process ID number; in this case, 13992.
+Use the @command{kill} command to send the @code{USR1} signal
+to @command{gawk}:
-So, don't assume that floating-point values can be compared for equality.
-You should also exercise caution when using other forms of comparisons.
-The standard way to compare between floating-point numbers is to determine
-how much error (or @dfn{tolerance}) you will allow in a comparison and
-check to see if one value is within this error range of the other.
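The tolerance-based comparison described in the removed paragraph might look like the following. It is a minimal sketch in plain double-precision @command{awk} (no @option{-M}); the tolerance `1e-9` and the variable names are assumptions chosen for illustration, not values taken from the manual.

```shell
result=$(awk 'BEGIN {
    a = 0.1 + 12.2
    b = 12.3
    tol = 1e-9                      # hypothetical tolerance; choose one that suits your data
    diff = a - b
    if (diff < 0)
        diff = -diff                # absolute difference
    exact = (a == b)                # straight equality test, which the text warns against
    near = (diff <= tol)            # tolerance test
    print exact, near
}')
echo "$result"
```

The first field shows the result of the straight equality test and the second the result of the tolerance test; the tolerance test accepts the two values as equal regardless of how the last few bits were rounded.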
address@hidden
+$ @kbd{kill -USR1 13992}
address@hidden example
-In applications where 15 or fewer decimal places suffice,
-hardware double precision arithmetic can be adequate, and is usually much faster.
-But you do need to keep in mind that every floating-point operation
-can suffer a new rounding error with catastrophic consequences as illustrated
-by our attempt to compute the value of the constant @value{PI}
-(@pxref{Floating-point Programming}).
-Extra precision can greatly enhance the stability and the accuracy
-of your computation in such cases.
address@hidden
+As usual, the profiled version of the program is written to
+@file{awkprof.out}, or to a different file if one is specified with
+the @option{--profile} option.
-Repeated addition is not necessarily equivalent to multiplication
-in floating-point arithmetic. In the example in
address@hidden Programming}:
+Along with the regular profile, as shown earlier, the profile
+includes a trace of any active functions:
@example
-$ @kbd{gawk 'BEGIN @{}
-> @kbd{for (d = 1.1; d <= 1.5; d += 0.1)}
-> @kbd{i++}
-> @kbd{print i}
-> @address@hidden'}
address@hidden 4
+# Function Call Stack:
+
+# 3. baz
+# 2. bar
+# 1. foo
+# -- main --
@end example
address@hidden
-you may or may not succeed in getting the correct result by choosing
-an arbitrarily large value for @code{PREC}. Reformulation of
-the problem at hand is often the correct approach in such situations.
+You may send @command{gawk} the @code{USR1} signal as many times as you like.
+Each time, the profile and function call trace are appended to the output
+profile file.
address@hidden Arbitrary Precision Integers
address@hidden Arbitrary Precision Integer Arithmetic with @command{gawk}
address@hidden integer, arbitrary precision
address@hidden @code{HUP} signal
address@hidden @code{SIGHUP} signal
address@hidden signals, @code{HUP}/@code{SIGHUP}
+If you use the @code{HUP} signal instead of the @code{USR1} signal,
+@command{gawk} produces the profile and the function call trace and then exits.
-If the option @option{--bignum} or @option{-M} is specified,
address@hidden performs all
-integer arithmetic using GMP arbitrary precision integers.
-Any number that looks like an integer in a program source or data file
-is stored as an arbitrary precision integer.
-The size of the integer is limited only by your computer's memory.
-The current floating-point context has no effect on operations involving integers.
-For example, the following computes
address@hidden
address@hidden,
address@hidden iftex
address@hidden
-5^4^3^2,
address@hidden ifnottex
-the result of which is beyond the
-limits of ordinary @command{gawk} numbers:
address@hidden @code{INT} signal (MS-Windows)
address@hidden @code{SIGINT} signal (MS-Windows)
address@hidden signals, @code{INT}/@code{SIGINT} (MS-Windows)
address@hidden @code{QUIT} signal (MS-Windows)
address@hidden @code{SIGQUIT} signal (MS-Windows)
address@hidden signals, @code{QUIT}/@code{SIGQUIT} (MS-Windows)
+When @command{gawk} runs on MS-Windows systems, it uses the
+@code{INT} and @code{QUIT} signals for producing the profile and, in
+the case of the @code{INT} signal, @command{gawk} exits. This is
+because these systems don't support the @command{kill} command, so the
+only signals you can deliver to a program are those generated by the
+keyboard. The @code{INT} signal is generated by the
+@kbd{@key{CTRL}-@key{C}} or @kbd{@key{CTRL}-@key{BREAK}} key, while the
+@code{QUIT} signal is generated by the @kbd{@key{CTRL}-@key{\}} key.
address@hidden
-$ @kbd{gawk -M 'BEGIN @{}
-> @kbd{x = 5^4^3^2}
-> @kbd{print "# of digits =", length(x)}
-> @kbd{print substr(x, 1, 20), "...", substr(x, length(x) - 19, 20)}
-> @address@hidden'}
address@hidden # of digits = 183231
address@hidden 62060698786608744707 ... 92256259918212890625
address@hidden example
+Finally, @command{gawk} also accepts another option @option{--pretty-print}.
+When called this way, @command{gawk} ``pretty prints'' the program into
+@file{awkprof.out}, without any execution counts.
address@hidden ENDOFRANGE advgaw
address@hidden ENDOFRANGE gawadv
address@hidden ENDOFRANGE awkp
address@hidden ENDOFRANGE proawk
-If you were to compute the same value using arbitrary precision
-floating-point values instead, the precision needed for correct output
-(using the formula
address@hidden
address@hidden = 3.322 @cdot dps}),
-would be @math{3.322 @cdot 183231},
address@hidden iftex
address@hidden
address@hidden = 3.322 * dps}),
-would be 3.322 x 183231,
address@hidden ifnottex
-or 608693.
-(Thus, the floating-point representation requires over 30 times as
-many decimal digits!)
address@hidden Library Functions
address@hidden A Library of @command{awk} Functions
address@hidden STARTOFRANGE libf
address@hidden libraries of @command{awk} functions
address@hidden STARTOFRANGE flib
address@hidden functions, library
address@hidden STARTOFRANGE fudlib
address@hidden functions, user-defined, library of
-The result from an arithmetic operation with an integer and a floating-point value
-is a floating-point value with a precision equal to the working precision.
-The following program calculates the eighth term in
-Sylvester's address@hidden, Eric W.
address@hidden's Sequence}. From MathWorld---A Wolfram Web Resource.
address@hidden://mathworld.wolfram.com/SylvestersSequence.html}}
-using a recurrence:
address@hidden, describes how to write
+your own @command{awk} functions. Writing functions is important, because
+it allows you to encapsulate algorithms and program tasks in a single
+place. It simplifies programming, making program development more
+manageable, and making programs more readable.
address@hidden
-$ @kbd{gawk -M 'BEGIN @{}
-> @kbd{s = 2.0}
-> @kbd{for (i = 1; i <= 7; i++)}
-> @kbd{s = s * (s - 1) + 1}
-> @kbd{print s}
-> @address@hidden'}
address@hidden 113423713055421845118910464
address@hidden example
+One valuable way to learn a new programming language is to @emph{read}
+programs in that language. To that end, this @value{CHAPTER}
+and @ref{Sample Programs},
+provide a good-sized body of code for you to read,
+and hopefully, to learn from.
-The output differs from the actual number, 113423713055421844361000443,
-because the default precision of 53 is not enough to represent the
-floating-point results exactly. You can either increase the precision
-(100 is enough in this case), or replace the floating-point constant
address@hidden with an integer, to perform all computations using integer
-arithmetic to get the correct output.
address@hidden 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
+This @value{CHAPTER} presents a library of useful @command{awk} functions.
+Many of the sample programs presented later in this @value{DOCUMENT}
+use these functions.
+The functions are presented here in a progression from simple to complex.
-It will sometimes be necessary for @command{gawk} to implicitly convert an
-arbitrary precision integer into an arbitrary precision floating-point value.
-This is primarily because the MPFR library does not always provide the
-relevant interface to process arbitrary precision integers or mixed-mode
-numbers as needed by an operation or function.
-In such a case, the precision is set to the minimum value necessary
-for exact conversion, and the working precision is not used for this purpose.
-If this is not what you need or want, you can employ a subterfuge
-like this:
address@hidden Texinfo
address@hidden Program},
+presents a program that you can use to extract the source code for
+these example library functions and programs from the Texinfo source
+for this @value{DOCUMENT}.
+(This has already been done as part of the @command{gawk} distribution.)
address@hidden
-gawk -M 'BEGIN @{ n = 13; print (n + 0.0) % 2.0 @}'
address@hidden example
+If you have written one or more useful, general-purpose @command{awk} functions
+and would like to contribute them to the @command{awk} user community, see
address@hidden To Contribute}, for more information.
-You can avoid this issue altogether by specifying the number as a floating-point value
-to begin with:
address@hidden portability, example programs
+The programs in this @value{CHAPTER} and in
+@ref{Sample Programs},
+freely use features that are @command{gawk}-specific.
+Rewriting these programs for different implementations of @command{awk}
+is pretty straightforward.
address@hidden
-gawk -M 'BEGIN @{ n = 13.0; print n % 2.0 @}'
address@hidden example
address@hidden @bullet
address@hidden
+Diagnostic error messages are sent to @file{/dev/stderr}.
+Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"} if your system
+does not have a @file{/dev/stderr}, or if you cannot use @command{gawk}.
-Note that for the particular example above, it is likely best
-to just use the following:
address@hidden
+A number of programs use @code{nextfile}
+(@pxref{Nextfile Statement})
+to skip any remaining input in the input file.
+
address@hidden
address@hidden 12/2000: Thanks to Nelson Beebe for pointing out the output
issue.
address@hidden case sensitivity, example programs
address@hidden @code{IGNORECASE} variable, in example programs
+Finally, some of the programs choose to ignore upper- and lowercase
+distinctions in their input. They do so by assigning one to @code{IGNORECASE}.
+You can achieve almost the same effect@footnote{The effects are
+not identical. Output of the transformed
+record will be in all lowercase, while @code{IGNORECASE} preserves the original
+contents of the input record.} by adding the following rule to the
+beginning of the program:
@example
-gawk -M 'BEGIN @{ n = 13; print n % 2 @}'
+# ignore case
+@{ $0 = tolower($0) @}
@end example
address@hidden Advanced Features
address@hidden Advanced Features of @command{gawk}
address@hidden advanced features, network connections, See Also networks,
connections
address@hidden STARTOFRANGE gawadv
address@hidden @command{gawk}, features, advanced
address@hidden STARTOFRANGE advgaw
address@hidden advanced features, @command{gawk}
address@hidden
-Contributed by: Peter Langston <address@hidden>
-
- Found in Steve English's "signature" line:
-
-"Write documentation as if whoever reads it is a violent psychopath
-who knows where you live."
address@hidden ignore
address@hidden
address@hidden documentation as if whoever reads it is
-a violent psychopath who knows where you address@hidden
-Steve English, as quoted by Peter Langston
address@hidden quotation
-
-This @value{CHAPTER} discusses advanced features in @command{gawk}.
-It's a bit of a ``grab bag'' of items that are otherwise unrelated
-to each other.
-First, a command-line option allows @command{gawk} to recognize
-nondecimal numbers in input data, not just in @command{awk}
-programs.
-Then, @command{gawk}'s special features for sorting arrays are presented.
-Next, two-way I/O, discussed briefly in earlier parts of this
address@hidden, is described in full detail, along with the basics
-of TCP/IP networking. Finally, @command{gawk}
-can @dfn{profile} an @command{awk} program, making it possible to tune
-it for performance.
-
address@hidden Extensions},
-discusses the ability to dynamically add new built-in functions to
address@hidden As this feature is still immature and likely to change,
-its description is relegated to an appendix.
address@hidden
+Also, verify that all regexp and string constants used in
+comparisons use only lowercase letters.
address@hidden itemize
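The first portability note in the list above, writing diagnostics through `| "cat 1>&2"` when @file{/dev/stderr} is unavailable, can be tried out as follows. This is a sketch that assumes a POSIX shell and any POSIX @command{awk}; the message text and the shell variable `err` are made up for the demonstration.

```shell
# Assumption: plain POSIX awk; "cat 1>&2" stands in for "/dev/stderr".
err=$(awk 'BEGIN {
    print "demo: something went wrong" | "cat 1>&2"
    close("cat 1>&2")               # flush the pipe before awk exits
}' 2>&1 >/dev/null)                 # capture stderr only, discard stdout
echo "captured from stderr: $err"
```

Because standard output is discarded, the message appears in `err` only if it really travelled over standard error.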
@menu
-* Nondecimal Data:: Allowing nondecimal input data.
-* Array Sorting:: Facilities for controlling array traversal and
- sorting arrays.
-* Two-way I/O:: Two-way communications with another process.
-* TCP/IP Networking:: Using @command{gawk} for network programming.
-* Profiling:: Profiling your @command{awk} programs.
+* Library Names:: How to best name private global variables in
+ library functions.
+* General Functions:: Functions that are of general use.
+* Data File Management:: Functions for managing command-line data
+ files.
+* Getopt Function:: A function for processing command-line
+ arguments.
+* Passwd Functions:: Functions for getting user information.
+* Group Functions:: Functions for getting group information.
+* Walking Arrays:: A function to walk arrays of arrays.
@end menu
address@hidden Nondecimal Data
address@hidden Allowing Nondecimal Input Data
address@hidden @code{--non-decimal-data} option
address@hidden advanced features, @command{gawk}, nondecimal input data
address@hidden input, address@hidden nondecimal
address@hidden constants, nondecimal
address@hidden Library Names
address@hidden Naming Library Function Global Variables
-If you run @command{gawk} with the @option{--non-decimal-data} option,
-you can have nondecimal constants in your input data:
address@hidden names, arrays/variables
address@hidden names, functions
address@hidden namespace issues
address@hidden @command{awk} programs, documenting
address@hidden documentation, of @command{awk} programs
+Due to the way the @command{awk} language evolved, variables are either
+@dfn{global} (usable by the entire program) or @dfn{local} (usable just by
+a specific function). There is no intermediate state analogous to
+@code{static} variables in C.
address@hidden line break here for small book format
address@hidden
-$ @kbd{echo 0123 123 0x123 |}
-> @kbd{gawk --non-decimal-data '@{ printf "%d, %d, %d\n",}
-> @kbd{$1, $2, $3 @}'}
address@hidden 83, 123, 291
address@hidden example
address@hidden variables, global, for library functions
address@hidden private variables
address@hidden variables, private
+Library functions often need to have global variables that they can use to
+preserve state information between calls to the function---for example,
+@code{getopt()}'s variable @code{_opti}
+(@pxref{Getopt Function}).
+Such variables are called @dfn{private}, since the only functions that need to
+use them are the ones in the library.
-For this feature to work, write your program so that
address@hidden treats your data as numeric:
+When writing a library function, you should try to choose names for your
+private variables that will not conflict with any variables used by
+either another library function or a user's main program. For example, a
+name like @code{i} or @code{j} is not a good choice, because user programs
+often use variable names like these for their own purposes.
address@hidden
-$ @kbd{echo 0123 123 0x123 | gawk '@{ print $1, $2, $3 @}'}
address@hidden 0123 123 0x123
address@hidden example
address@hidden programming conventions, private variable names
+The example programs shown in this @value{CHAPTER} all start the names of their
+private variables with an underscore (@samp{_}). Users generally don't use
+leading underscores in their variable names, so this convention immediately
+decreases the chances that the variable name will be accidentally shared
+with the user's program.
address@hidden
-The @code{print} statement treats its expressions as strings.
-Although the fields can act as numbers when necessary,
-they are still strings, so @code{print} does not try to treat them
-numerically. You may need to add zero to a field to force it to
-be treated as a number. For example:
address@hidden @code{_} (underscore), in names of private variables
address@hidden underscore (@code{_}), in names of private variables
+In addition, several of the library functions use a prefix that helps
+indicate what function or set of functions use the variables---for example,
address@hidden in the user database routines
+(@pxref{Passwd Functions}).
+This convention is recommended, since it even further decreases the
+chance of inadvertent conflict among variable names. Note that this
+convention is used equally well for variable names and for private
+function names.@footnote{While all the library routines could have
+been rewritten to use this convention, this was not done, in order to
+show how our own @command{awk} programming style has evolved and to
+provide some basis for this discussion.}
+
+As a final note on variable naming, if a function makes global variables
+available for use by a main program, it is a good convention to start that
+variable's name with a capital letter---for
+example, @code{getopt()}'s @code{Opterr} and @code{Optind} variables
+(@pxref{Getopt Function}).
+The leading capital letter indicates that it is global, while the fact that
+the variable name is not all capital letters indicates that the variable is
+not one of @command{awk}'s built-in variables, such as @code{FS}.
+
address@hidden @code{--dump-variables} option
+It is also important that @emph{all} variables in library
+functions that do not need to save state are, in fact, declared
+@dfn{local}.@footnote{@command{gawk}'s @option{--dump-variables} command-line
+option is useful for verifying this.} If this is not done, the variable
+could accidentally be used in the user's program, leading to bugs that
+are very difficult to track down:
@example
-$ @kbd{echo 0123 123 0x123 | gawk --non-decimal-data '}
-> @address@hidden print $1, $2, $3}
-> @kbd{print $1 + 0, $2 + 0, $3 + 0 @}'}
address@hidden 0123 123 0x123
address@hidden 83 123 291
+function lib_func(x, y, l1, l2)
address@hidden
+ @dots{}
+ @var{use variable} some_var # some_var should be local
+ @dots{} # but is not by oversight
address@hidden
@end example
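A concrete version of this pitfall is sketched below. The function names `leaky` and `tidy` are hypothetical; both do the same work, but only `tidy` lists its loop counter as an extra parameter, which is how @command{awk} makes a variable local. Plain POSIX @command{awk} is assumed.

```shell
out=$(awk '
# leaky() forgets to declare i, so its loop counter is a global variable.
function leaky(n,    total) {
    for (i = 1; i <= n; i++)
        total += i
    return total
}
# tidy() lists i as an extra parameter, which makes it local.
function tidy(n,    total, i) {
    for (i = 1; i <= n; i++)
        total += i
    return total
}
BEGIN {
    i = 100; leaky(3); after_leaky = i   # the caller i is overwritten
    i = 100; tidy(3);  after_tidy = i    # the caller i is untouched
    print after_leaky, after_tidy
}')
echo "$out"    # prints "4 100"
```

After `leaky(3)` the main program's `i` has silently become 4, while `tidy(3)` leaves it at 100.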
-Because it is common to have decimal data with leading zeros, and because
-using this facility could lead to surprising results, the default is to leave it
-disabled. If you want it, you must explicitly request it.
-
address@hidden programming conventions, @code{--non-decimal-data} option
address@hidden @code{--non-decimal-data} option, @code{strtonum()} function and
address@hidden @code{strtonum()} function (@command{gawk}),
@code{--non-decimal-data} option and
address@hidden CAUTION
address@hidden of this option is not recommended.}
-It can break old programs very badly.
-Instead, use the @code{strtonum()} function to convert your data
-(@pxref{Nondecimal-numbers}).
-This makes your programs easier to write and easier to read, and
-leads to less surprising results.
address@hidden quotation
address@hidden arrays, associative, library functions and
address@hidden libraries of @command{awk} functions, associative arrays and
address@hidden functions, library, associative arrays and
address@hidden Tcl
+A different convention, common in the Tcl community, is to use a single
+associative array to hold the values needed by the library function(s), or
+``package.'' This significantly decreases the number of actual global names
+in use. For example, the functions described in
+@ref{Passwd Functions},
+might have used array elements @code{@w{PW_data["inited"]}}, @code{@w{PW_data["total"]}},
+@code{@w{PW_data["count"]}}, and @code{@w{PW_data["awklib"]}}, instead of
address@hidden@w{_pw_inited}}, @address@hidden, @address@hidden,
+and @address@hidden
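The single-array convention can be illustrated with a small, made-up library. Everything here (`counter_bump` and the array `Counter_data`) is hypothetical and only follows the naming pattern described above: one capitalized global array holds all of the package's state, so the library adds a single name to the global namespace.

```shell
out=$(awk '
# Hypothetical library: all state lives in one global array, Counter_data.
function counter_bump(name) {
    Counter_data[name]++
    return Counter_data[name]
}
BEGIN {
    counter_bump("lines")
    counter_bump("lines")
    counter_bump("words")
    print Counter_data["lines"], Counter_data["words"]
}')
echo "$out"    # prints "2 1"
```

The trade-off is that array subscripts are strings, so a misspelled key creates a new element silently instead of being caught by @option{--dump-variables}.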
address@hidden Array Sorting
address@hidden Controlling Array Traversal and Array Sorting
+The conventions presented in this @value{SECTION} are exactly
+that: conventions. You are not required to write your programs this
+way---we merely recommend that you do so.
address@hidden lets you control the order in which a @samp{for (i in array)}
-loop traverses an array.
address@hidden General Functions
address@hidden General Programming
-In addition, two built-in functions, @code{asort()} and @code{asorti()},
-let you sort arrays based on the array values and indices, respectively.
-These two functions also provide control over the sorting criteria used
-to order the elements during sorting.
+This @value{SECTION} presents a number of functions that are of general
+programming use.
@menu
-* Controlling Array Traversal:: How to use PROCINFO["sorted_in"].
-* Array Sorting Functions:: How to use @code{asort()} and @code{asorti()}.
+* Strtonum Function:: A replacement for the built-in
+ @code{strtonum()} function.
+* Assert Function:: A function for assertions in @command{awk}
+ programs.
+* Round Function:: A function for rounding if @code{sprintf()}
+ does not do it correctly.
+* Cliff Random Function:: The Cliff Random Number Generator.
+* Ordinal Functions:: Functions for using characters as numbers and
+ vice versa.
+* Join Function:: A function to join an array into a string.
+* Getlocaltime Function:: A function to get formatted times.
@end menu
address@hidden Controlling Array Traversal
address@hidden Controlling Array Traversal
-
-By default, the order in which a @samp{for (i in array)} loop
-scans an array is not defined; it is generally based upon
-the internal implementation of arrays inside @command{awk}.
-
-Often, though, it is desirable to be able to loop over the elements
-in a particular order that you, the programmer, choose. @command{gawk}
-lets you do this.
-
address@hidden Scanning}, describes how you can assign special,
-pre-defined values to @code{PROCINFO["sorted_in"]} in order to
-control the order in which @command{gawk} will traverse an array
-during a @code{for} loop.
address@hidden Strtonum Function
address@hidden Converting Strings To Numbers
-In addition, the value of @code{PROCINFO["sorted_in"]} can be a function name.
-This lets you traverse an array based on any custom criterion.
-The array elements are ordered according to the return value of this
-function. The comparison function should be defined with at least
-four arguments:
+The @code{strtonum()} function (@pxref{String Functions})
+is a @command{gawk} extension. The following function
+provides an implementation for other versions of @command{awk}:
@example
-function comp_func(i1, v1, i2, v2)
address@hidden
- @var{compare elements 1 and 2 in some fashion}
- @var{return < 0; 0; or > 0}
address@hidden
address@hidden example
-
-Here, @var{i1} and @var{i2} are the indices, and @var{v1} and @var{v2}
-are the corresponding values of the two elements being compared.
-Either @var{v1} or @var{v2}, or both, can be arrays if the array being
-traversed contains subarrays as values.
-(@xref{Arrays of Arrays}, for more information about subarrays.)
-The three possible return values are interpreted as follows:
-
address@hidden @code
address@hidden comp_func(i1, v1, i2, v2) < 0
-Index @var{i1} comes before index @var{i2} during loop traversal.
-
address@hidden comp_func(i1, v1, i2, v2) == 0
-Indices @var{i1} and @var{i2}
-come together but the relative order with respect to each other is undefined.
-
address@hidden comp_func(i1, v1, i2, v2) > 0
-Index @var{i1} comes after index @var{i2} during loop traversal.
address@hidden table
address@hidden file eg/lib/strtonum.awk
+# mystrtonum --- convert string to number
-Our first comparison function can be used to scan an array in
-numerical order of the indices:
address@hidden endfile
address@hidden
address@hidden file eg/lib/strtonum.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# February, 2004
address@hidden
-function cmp_num_idx(i1, v1, i2, v2)
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/strtonum.awk
+function mystrtonum(str, ret, chars, n, i, k, c)
@{
- # numerical index comparison, ascending order
- return (i1 - i2)
address@hidden
address@hidden example
+ if (str ~ /^0[0-7]*$/) @{
+ # octal
+ n = length(str)
+ ret = 0
+ for (i = 1; i <= n; i++) @{
+ c = substr(str, i, 1)
+ if ((k = index("01234567", c)) > 0)
+ k-- # adjust for 1-basing in awk
-Our second function traverses an array based on the string order of
-the element values rather than by indices:
+ ret = ret * 8 + k
+ @}
+ @} else if (str ~ /^0[xX][[:xdigit:]]+$/) @{
+ # hexadecimal
+ str = substr(str, 3) # lop off leading 0x
+ n = length(str)
+ ret = 0
+ for (i = 1; i <= n; i++) @{
+ c = substr(str, i, 1)
+ c = tolower(c)
+ if ((k = index("0123456789", c)) > 0)
+ k-- # adjust for 1-basing in awk
+ else if ((k = index("abcdef", c)) > 0)
+ k += 9
address@hidden
-function cmp_str_val(i1, v1, i2, v2)
address@hidden
- # string value comparison, ascending order
- v1 = v1 ""
- v2 = v2 ""
- if (v1 < v2)
- return -1
- return (v1 != v2)
+ ret = ret * 16 + k
+ @}
+ @} else if (str ~ \
+ /^[-+]?([0-9]+([.][0-9]*([Ee][0-9]+)?)?|([.][0-9]+([Ee][-+]?[0-9]+)?))$/) @{
+ # decimal number, possibly floating point
+ ret = str + 0
+ @} else
+ ret = "NOT-A-NUMBER"
+
+ return ret
@}
+
+# BEGIN @{ # gawk test harness
+# a[1] = "25"
+# a[2] = ".31"
+# a[3] = "0123"
+# a[4] = "0xdeadBEEF"
+# a[5] = "123.45"
+# a[6] = "1.e3"
+# a[7] = "1.32"
+# a[8] = "1.32E2"
+#
+# for (i = 1; i in a; i++)
+# print a[i], strtonum(a[i]), mystrtonum(a[i])
+# @}
address@hidden endfile
@end example
-The third
-comparison function makes all numbers, and numeric strings without
-any leading or trailing spaces, come out first during loop traversal:
+The function first looks for C-style octal numbers (base 8).
+If the input string matches a regular expression describing octal
+numbers, then @code{mystrtonum()} loops through each character in the
+string. It sets @code{k} to the index in @code{"01234567"} of the current
+octal digit. Since the return value is one-based, the @samp{k--}
+adjusts @code{k} so it can be used in computing the return value.
address@hidden
-function cmp_num_str_val(i1, v1, i2, v2, n1, n2)
address@hidden
- # numbers before string value comparison, ascending order
- n1 = v1 + 0
- n2 = v2 + 0
- if (n1 == v1)
- return (n2 == v2) ? (n1 - n2) : -1
- else if (n2 == v2)
- return 1
- return (v1 < v2) ? -1 : (v1 != v2)
address@hidden
address@hidden example
+Similar logic applies to the code that checks for and converts a
+hexadecimal value, which starts with @samp{0x} or @samp{0X}.
+The use of @code{tolower()} simplifies the computation for finding
+the correct numeric value for each hexadecimal digit.
-Here is a main program to demonstrate how @command{gawk}
-behaves using each of the previous functions:
+Finally, if the string matches the (rather complicated) regexp for a
+regular decimal integer or floating-point number, the computation
+@code{ret = str + 0} lets @command{awk} convert the value to a
+number.
+
+A commented-out test program is included, so that the function can
+be tested with @command{gawk} and the results compared to the built-in
+@code{strtonum()} function.
+
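The digit-by-digit accumulation described above can be traced with a short sketch. The `trace_octal` shell wrapper is a hypothetical name used only for this illustration; the loop inside it mirrors the octal branch of `mystrtonum()` and assumes that the argument contains only octal digits.

```shell
# Hypothetical helper: print the running value of ret after each octal digit.
trace_octal() {
    awk -v str="$1" 'BEGIN {
        ret = 0
        n = length(str)
        for (i = 1; i <= n; i++) {
            c = substr(str, i, 1)
            if ((k = index("01234567", c)) > 0)
                k--    # index() is one-based; the digit value is zero-based
            ret = ret * 8 + k
            printf("digit %s -> ret = %d\n", c, ret)
        }
    }'
}

trace_octal 0123
```

For the input `0123`, the running value goes 0, 1, 10, 83, which matches 1*64 + 2*8 + 3 = 83.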
address@hidden Assert Function
address@hidden Assertions
+
address@hidden STARTOFRANGE asse
address@hidden assertions
address@hidden STARTOFRANGE assef
address@hidden @code{assert()} function (C library)
address@hidden STARTOFRANGE libfass
address@hidden libraries of @command{awk} functions, assertions
address@hidden STARTOFRANGE flibass
address@hidden functions, library, assertions
address@hidden @command{awk} programs, lengthy, assertions
+When writing large programs, it is often useful to know
+that a condition or set of conditions is true. Before proceeding with a
+particular computation, you make a statement about what you believe to be
+the case. Such a statement is known as an
+@dfn{assertion}. The C language provides an @code{<assert.h>} header file
+and corresponding @code{assert()} macro that the programmer can use to make
+assertions. If an assertion fails, the @code{assert()} macro arranges to
+print a diagnostic message describing the condition that should have
+been true but was not, and then it kills the program. In C, using
+@code{assert()} looks like this:
@example
-BEGIN @{
- data["one"] = 10
- data["two"] = 20
- data[10] = "one"
- data[100] = 100
- data[20] = "two"
-
- f[1] = "cmp_num_idx"
- f[2] = "cmp_str_val"
- f[3] = "cmp_num_str_val"
- for (i = 1; i <= 3; i++) @{
- printf("Sort function: %s\n", f[i])
- PROCINFO["sorted_in"] = f[i]
- for (j in data)
- printf("\tdata[%s] = %s\n", j, data[j])
- print ""
- @}
+#include <assert.h>
+
+int myfunc(int a, double b)
address@hidden
+ assert(a <= 5 && b >= 17.1);
+ @dots{}
@}
@end example
-Here are the results when the program is run:
address@hidden
+If the assertion fails, the program prints a message similar to this:
@example
-$ @kbd{gawk -f compdemo.awk}
address@hidden Sort function: cmp_num_idx @ii{Sort by numeric index}
address@hidden data[two] = 20
address@hidden data[one] = 10 @ii{Both strings are numerically zero}
address@hidden data[10] = one
address@hidden data[20] = two
address@hidden data[100] = 100
address@hidden
address@hidden Sort function: cmp_str_val @ii{Sort by element values as strings}
address@hidden data[one] = 10
address@hidden data[100] = 100 @ii{String 100 is less than string 20}
address@hidden data[two] = 20
address@hidden data[10] = one
address@hidden data[20] = two
address@hidden
address@hidden Sort function: cmp_num_str_val @ii{Sort all numeric values before all strings}
address@hidden data[one] = 10
address@hidden data[two] = 20
address@hidden data[100] = 100
address@hidden data[10] = one
address@hidden data[20] = two
+prog.c:5: assertion failed: a <= 5 && b >= 17.1
@end example
-Consider sorting the entries of a GNU/Linux system password file
-according to login name. The following program sorts records
-by a specific field position and can be used for this purpose:
address@hidden @code{assert()} user-defined function
+The C language makes it possible to turn the condition into a string for use
+in printing the diagnostic message. This is not possible in @command{awk}, so
+this @code{assert()} function also requires a string version of the condition
+that is being tested.
+Following is the function:
@example
-# sort.awk --- simple program to sort by field position
-# field position is specified by the global variable POS
address@hidden file eg/lib/assert.awk
+# assert --- assert that a condition is true. Otherwise exit.
-function cmp_field(i1, v1, i2, v2)
address@hidden
- # comparison by value, as string, and ascending order
- return v1[POS] < v2[POS] ? -1 : (v1[POS] != v2[POS])
address@hidden
address@hidden endfile
address@hidden
address@hidden file eg/lib/assert.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# May, 1993
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/assert.awk
+function assert(condition, string)
@{
- for (i = 1; i <= NF; i++)
- a[NR][i] = $i
+ if (! condition) @{
+ printf("%s:%d: assertion failed: %s\n",
+ FILENAME, FNR, string) > "/dev/stderr"
+ _assert_exit = 1
+ exit 1
+ @}
@}
address@hidden
END @{
- PROCINFO["sorted_in"] = "cmp_field"
- if (POS < 1 || POS > NF)
- POS = 1
- for (i in a) @{
- for (j = 1; j <= NF; j++)
- printf("%s%c", a[i][j], j < NF ? ":" : "")
- print ""
- @}
+ if (_assert_exit)
+ exit 1
@}
address@hidden group
address@hidden endfile
@end example
-The first field in each entry of the password file is the user's login name,
-and the fields are separated by colons.
-Each record defines a subarray,
-with each field as an element in the subarray.
-Running the program produces the
-following output:
+The @code{assert()} function tests the @code{condition} parameter. If it
+is false, it prints a message to standard error, using the @code{string}
+parameter to describe the failed condition. It then sets the variable
+@code{_assert_exit} to one and executes the @code{exit} statement.
+The @code{exit} statement jumps to the @code{END} rule. If the @code{END}
+rule finds @code{_assert_exit} to be true, it exits immediately.
+
+The purpose of the test in the @code{END} rule is to
+keep any other @code{END} rules from running. When an assertion fails, the
+program should exit immediately.
+If no assertions fail, then @code{_assert_exit} is still
+false when the @code{END} rule is run normally, and the rest of the
+program's @code{END} rules execute.
+For all of this to work correctly, @file{assert.awk} must be the
+first source file read by @command{awk}.
+The function can be used in a program in the following way:
@example
-$ @kbd{gawk -vPOS=1 -F: -f sort.awk /etc/passwd}
address@hidden adm:x:3:4:adm:/var/adm:/sbin/nologin
address@hidden apache:x:48:48:Apache:/var/www:/sbin/nologin
address@hidden avahi:x:70:70:Avahi daemon:/:/sbin/nologin
address@hidden
+function myfunc(a, b)
address@hidden
+ assert(a <= 5 && b >= 17.1, "a <= 5 && b >= 17.1")
+ @dots{}
address@hidden
@end example
-The comparison should normally always return the same value when given a
-specific pair of array elements as its arguments. If inconsistent
-results are returned then the order is undefined. This behavior can be
-exploited to introduce random order into otherwise seemingly
-ordered data:
address@hidden
+If the assertion fails, you see a message similar to the following:
@example
-function cmp_randomize(i1, v1, i2, v2)
address@hidden
- # random order
- return (2 - 4 * rand())
address@hidden
+mydata:1357: assertion failed: a <= 5 && b >= 17.1
@end example
-As mentioned above, the order of the indices is arbitrary if two
-elements compare equal. This is usually not a problem, but letting
-the tied elements come out in arbitrary order can be an issue, especially
-when comparing item values. The partial ordering of the equal elements
-may change during the next loop traversal, if other elements are added or
-removed from the array. One way to resolve ties when comparing elements
-with otherwise equal values is to include the indices in the comparison
-rules. Note that doing this may make the loop traversal less efficient,
-so consider it only if necessary. The following comparison functions
-force a deterministic order, and are based on the fact that the
-indices of two elements are never equal:
-
address@hidden
-function cmp_numeric(i1, v1, i2, v2)
address@hidden
- # numerical value (and index) comparison, descending order
- return (v1 != v2) ? (v2 - v1) : (i2 - i1)
address@hidden
address@hidden @code{END} pattern, @code{assert()} user-defined function and
+There is a small problem with this version of @code{assert()}.
+An @code{END} rule is automatically added
+to the program calling @code{assert()}. Normally, if a program consists
+of just a @code{BEGIN} rule, the input files and/or standard input are
+not read. However, now that the program has an @code{END} rule, @command{awk}
+attempts to read the input @value{DF}s or standard input
+(@pxref{Using BEGIN/END}),
+most likely causing the program to hang as it waits for input.
-function cmp_string(i1, v1, i2, v2)
address@hidden
- # string value (and index) comparison, descending order
- v1 = v1 i1
- v2 = v2 i2
- return (v1 > v2) ? -1 : (v1 != v2)
address@hidden
address@hidden example
address@hidden @code{BEGIN} pattern, @code{assert()} user-defined function and
+There is a simple workaround to this:
+make sure that such a @code{BEGIN} rule always ends
+with an @code{exit} statement.
address@hidden ENDOFRANGE asse
address@hidden ENDOFRANGE assef
address@hidden ENDOFRANGE flibass
address@hidden ENDOFRANGE libfass
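The interaction between `assert()` and the `END` rules can be seen in a small sketch. The `assert_demo` shell wrapper, the `$1 <= 5` condition, and the second `END` rule are assumptions made for this illustration. The `assert()` function and its own `END` rule come first in the program text, standing in for `assert.awk` being the first source file read.

```shell
# Hypothetical demo: assert that every input value is at most 5.
assert_demo() {
    awk '
    function assert(condition, string)
    {
        if (! condition) {
            printf("%s:%d: assertion failed: %s\n",
                   FILENAME, FNR, string) > "/dev/stderr"
            _assert_exit = 1
            exit 1
        }
    }

    # This END rule comes first, so it can stop later END rules from running
    END {
        if (_assert_exit)
            exit 1
    }

    { assert($1 <= 5, "$1 <= 5") }

    # A later END rule belonging to the "real" program
    END { print "second END rule ran" }'
}

# The second record violates the assertion, so awk exits with status 1
printf '3\n9\n4\n' | assert_demo || echo "awk exited with status $?"
```

When every record satisfies the condition, `_assert_exit` stays unset, the first `END` rule does nothing, and the second `END` rule prints its message.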
address@hidden Avoid using the term ``stable'' when describing the unpredictable behavior
address@hidden if two items compare equal. Usually, the goal of a "stable algorithm"
address@hidden is to maintain the original order of the items, which is a meaningless
address@hidden concept for a list constructed from a hash.
address@hidden Round Function
address@hidden Rounding Numbers
-A custom comparison function can often simplify ordered loop
-traversal, and the sky is really the limit when it comes to
-designing such a function.
address@hidden rounding numbers
address@hidden numbers, rounding
address@hidden libraries of @command{awk} functions, rounding numbers
address@hidden functions, library, rounding numbers
address@hidden @code{print} statement, @code{sprintf()} function and
address@hidden @code{printf} statement, @code{sprintf()} function and
address@hidden @code{sprintf()} function, @code{print}/@code{printf} statements and
+The way @code{printf} and @code{sprintf()}
+(@pxref{Printf})
+perform rounding often depends upon the system's C @code{sprintf()}
+subroutine. On many machines, @code{sprintf()} rounding is ``unbiased,''
+which means it doesn't always round a trailing @samp{.5} up, contrary
+to naive expectations. In unbiased rounding, @samp{.5} rounds to even,
+rather than always up, so 1.5 rounds to 2 but 4.5 rounds to 4. This means
+that if you are using a format that does rounding (e.g., @code{"%.0f"}),
+you should check what your system does. The following function does
+traditional rounding; it might be useful if your @command{awk}'s @code{printf}
+does unbiased rounding:
-When string comparisons are made during a sort, either for element
-values where one or both aren't numbers, or for element indices
-handled as strings, the value of @code{IGNORECASE}
-(@pxref{Built-in Variables}) controls whether
-the comparisons treat corresponding uppercase and lowercase letters as
-equivalent or distinct.
address@hidden @code{round()} user-defined function
address@hidden
address@hidden file eg/lib/round.awk
+# round.awk --- do normal rounding
address@hidden endfile
address@hidden
address@hidden file eg/lib/round.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# August, 1996
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/round.awk
-Another point to keep in mind is that in the case of subarrays
-the element values can themselves be arrays; a production comparison
-function should use the @code{isarray()} function
-(@pxref{Type Functions}),
-to check for this, and choose a defined sorting order for subarrays.
+function round(x, ival, aval, fraction)
address@hidden
+ ival = int(x) # integer part, int() truncates
-All sorting based on @code{PROCINFO["sorted_in"]}
-is disabled in POSIX mode,
-since the @code{PROCINFO} array is not special in that case.
+ # see if fractional part
+ if (ival == x) # no fraction
+ return ival # ensure no decimals
-As a side note, sorting the array indices before traversing
-the array has been reported to add 15% to 20% overhead to the
-execution time of @command{awk} programs. For this reason,
-sorted array traversal is not the default.
+ if (x < 0) @{
+ aval = -x # absolute value
+ ival = int(aval)
+ fraction = aval - ival
+ if (fraction >= .5)
+ return int(x) - 1 # -2.5 --> -3
+ else
+ return int(x) # -2.3 --> -2
+ @} else @{
+ fraction = x - ival
+ if (fraction >= .5)
+ return ival + 1
+ else
+ return ival
+ @}
address@hidden
address@hidden endfile
address@hidden don't include test harness in the file that gets installed
address@hidden The @command{gawk}
address@hidden maintainers believe that only the people who wish to use a
address@hidden feature should have to pay for it.
+# test harness
address@hidden print $0, round($0) @}
address@hidden example
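To check what a particular system does, the two approaches can be compared side by side. The `round_demo` shell wrapper is a hypothetical name for this sketch; the `round()` function inside it is the one shown above. The middle column depends on the system's C library, so no particular output is promised for it.

```shell
# Hypothetical demo: print each input, the "%.0f" result, and the round() result.
round_demo() {
    awk '
    function round(x,   ival, aval, fraction)
    {
        ival = int(x)    # integer part, int() truncates

        if (ival == x)   # no fraction
            return ival

        if (x < 0) {
            aval = -x    # absolute value
            ival = int(aval)
            fraction = aval - ival
            if (fraction >= .5)
                return int(x) - 1   # -2.5 --> -3
            else
                return int(x)       # -2.3 --> -2
        } else {
            fraction = x - ival
            if (fraction >= .5)
                return ival + 1
            else
                return ival
        }
    }

    # Column 2 is system dependent; column 3 always rounds .5 away from zero
    { printf("%s %.0f %d\n", $1, $1, round($1)) }'
}

printf '1.5\n2.5\n4.5\n-2.5\n' | round_demo
```

If the second and third columns differ for inputs such as 2.5 or 4.5, the system's `printf` is doing unbiased rounding.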
address@hidden Array Sorting Functions
address@hidden Sorting Array Values and Indices with @command{gawk}
address@hidden Cliff Random Function
address@hidden The Cliff Random Number Generator
address@hidden random numbers, Cliff
address@hidden Cliff random numbers
address@hidden numbers, Cliff random
address@hidden functions, library, Cliff random numbers
address@hidden arrays, sorting
address@hidden @code{asort()} function (@command{gawk})
address@hidden @code{asort()} function (@command{gawk}), address@hidden sorting
address@hidden sort function, arrays, sorting
-In most @command{awk} implementations, sorting an array requires
-writing a @code{sort()} function.
-While this can be educational for exploring different sorting algorithms,
-usually that's not the point of the program.
address@hidden provides the built-in @code{asort()}
-and @code{asorti()} functions
-(@pxref{String Functions})
-for sorting arrays. For example:
+The
+@uref{http://mathworld.wolfram.com/CliffRandomNumberGenerator.html, Cliff random number generator}
+is a very simple random number generator that ``passes the noise sphere test
+for randomness by showing no structure.''
+It is easily programmed, in fewer than 10 lines of @command{awk} code:
address@hidden @code{cliff_rand()} user-defined function
@example
address@hidden the array} data
-n = asort(data)
-for (i = 1; i <= n; i++)
- @var{do something with} data[i]
address@hidden example
-
-After the call to @code{asort()}, the array @code{data} is indexed from 1
-to some number @var{n}, the total number of elements in @code{data}.
-(This count is @code{asort()}'s return value.)
address@hidden @value{LEQ} @code{data[2]} @value{LEQ} @code{data[3]}, and so on.
-The comparison is based on the type of the elements
-(@pxref{Typing and Comparison}).
-All numeric values come before all string values,
-which in turn come before all subarrays.
address@hidden file eg/lib/cliff_rand.awk
+# cliff_rand.awk --- generate Cliff random numbers
address@hidden endfile
address@hidden
address@hidden file eg/lib/cliff_rand.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# December 2000
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/cliff_rand.awk
address@hidden side effects, @code{asort()} function
-An important side effect of calling @code{asort()} is that
address@hidden array's original indices are irrevocably lost}.
-As this isn't always desirable, @code{asort()} accepts a
-second argument:
+BEGIN @{ _cliff_seed = 0.1 @}
address@hidden
address@hidden the array} source
-n = asort(source, dest)
-for (i = 1; i <= n; i++)
- @var{do something with} dest[i]
+function cliff_rand()
address@hidden
+ _cliff_seed = (100 * log(_cliff_seed)) % 1
+ if (_cliff_seed < 0)
+ _cliff_seed = - _cliff_seed
+ return _cliff_seed
address@hidden
address@hidden endfile
@end example
-In this case, @command{gawk} copies the @code{source} array into the
address@hidden array and then sorts @code{dest}, destroying its indices.
-However, the @code{source} array is not affected.
+This algorithm requires an initial ``seed'' of 0.1. Each new value
+uses the current seed as input for the calculation.
+If the built-in @code{rand()} function
+(@pxref{Numeric Functions})
+isn't random enough, you might try using this function instead.
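A short sketch of how the generator might be called. The `cliff_demo` shell wrapper and the choice of five values are assumptions for this illustration. Because the seed is fixed at 0.1, every run produces the same sequence.

```shell
# Hypothetical demo: print the first five values from cliff_rand().
cliff_demo() {
    awk '
    BEGIN { _cliff_seed = 0.1 }

    function cliff_rand()
    {
        # Keep only the fractional part of 100 * log(seed), made non-negative
        _cliff_seed = (100 * log(_cliff_seed)) % 1
        if (_cliff_seed < 0)
            _cliff_seed = - _cliff_seed
        return _cliff_seed
    }

    BEGIN {
        for (i = 1; i <= 5; i++)
            printf("%.6f\n", cliff_rand())
    }'
}

cliff_demo
```

Each value becomes the seed for the next call, so a program that wants a different sequence would have to change `_cliff_seed` itself; the function as shown offers no seeding interface.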
address@hidden()} accepts a third string argument to control comparison of
-array elements. As with @code{PROCINFO["sorted_in"]}, this argument
-may be one of the predefined names that @command{gawk} provides
-(@pxref{Controlling Scanning}), or the name of a user-defined function
-(@pxref{Controlling Array Traversal}).
address@hidden Ordinal Functions
address@hidden Translating Between Characters and Numbers
address@hidden NOTE
-In all cases, the sorted element values consist of the original
-array's element values. The ability to control comparison merely
-affects the way in which they are sorted.
address@hidden quotation
address@hidden libraries of @command{awk} functions, character values as numbers
address@hidden functions, library, character values as numbers
address@hidden characters, values of as numbers
address@hidden numbers, as values of characters
+One commercial implementation of @command{awk} supplies a built-in function,
+@code{ord()}, which takes a character and returns the numeric value for that
+character in the machine's character set. If the string passed to
+@code{ord()} has more than one character, only the first one is used.
-Often, what's needed is to sort on the values of the @emph{indices}
-instead of the values of the elements.
-To do that, use the
address@hidden()} function. The interface is identical to that of
address@hidden()}, except that the index values are used for sorting, and
-become the values of the result array:
+The inverse of this function is @code{chr()} (from the function of the same
+name in Pascal), which takes a number and returns the corresponding character.
+Both functions are written very nicely in @command{awk}; there is no real
+reason to build them into the @command{awk} interpreter:
address@hidden @code{ord()} user-defined function
address@hidden @code{chr()} user-defined function
@example
address@hidden source[$0] = some_func($0) @}
address@hidden file eg/lib/ord.awk
+# ord.awk --- do ord and chr
-END @{
- n = asorti(source, dest)
- for (i = 1; i <= n; i++) @{
- @ii{Work with sorted indices directly:}
- @var{do something with} dest[i]
- @dots{}
- @ii{Access original array via sorted indices:}
- @var{do something with} source[dest[i]]
- @}
address@hidden
address@hidden example
-
-Similar to @code{asort()},
-in all cases, the sorted element values consist of the original
-array's indices. The ability to control comparison merely
-affects the way in which they are sorted.
-
-Sorting the array by replacing the indices provides maximal flexibility.
-To traverse the elements in decreasing order, use a loop that goes from
address@hidden down to 1, either over the elements or over the address@hidden
-may also use one of the predefined sorting names that sorts in
-decreasing order.}
-
address@hidden reference counting, sorting arrays
-Copying array indices and elements isn't expensive in terms of memory.
-Internally, @command{gawk} maintains @dfn{reference counts} to data.
-For example, when @code{asort()} copies the first array to the second one,
-there is only one copy of the original array elements' data, even though
-both arrays use the values.
-
address@hidden Document It And Call It A Feature. Sigh.
address@hidden @command{gawk}, @code{IGNORECASE} variable in
address@hidden @code{IGNORECASE} variable
address@hidden arrays, sorting, @code{IGNORECASE} variable and
address@hidden @code{IGNORECASE} variable, array sorting and
-Because @code{IGNORECASE} affects string comparisons, the value
-of @code{IGNORECASE} also affects sorting for both @code{asort()} and @code{asorti()}.
-Note also that the locale's sorting order does @emph{not}
-come into play; comparisons are based on character values address@hidden
-is true because locale-based comparison occurs only when in POSIX
-compatibility mode, and since @code{asort()} and @code{asorti()} are
address@hidden extensions, they are not available in that case.}
-Caveat Emptor.
-
address@hidden Two-way I/O
address@hidden Two-Way Communications with Another Process
address@hidden Brennan, Michael
address@hidden programmers, attractiveness of
address@hidden
address@hidden Path: cssun.mathcs.emory.edu!gatech!newsxfer3.itd.umich.edu!news-peer.sprintlink.net!news-sea-19.sprintlink.net!news-in-west.sprintlink.net!news.sprintlink.net!Sprint!204.94.52.5!news.whidbey.com!brennan
-From: brennan@@whidbey.com (Mike Brennan)
-Newsgroups: comp.lang.awk
-Subject: Re: Learn the SECRET to Attract Women Easily
-Date: 4 Aug 1997 17:34:46 GMT
address@hidden Organization: WhidbeyNet
address@hidden Lines: 12
-Message-ID: <5s53rm$eca@@news.whidbey.com>
address@hidden References: <address@hidden>
address@hidden Reply-To: address@hidden
address@hidden NNTP-Posting-Host: asn202.whidbey.com
address@hidden X-Newsreader: slrn (0.9.4.1 UNIX)
address@hidden Xref: cssun.mathcs.emory.edu comp.lang.awk:5403
-
-On 3 Aug 1997 13:17:43 GMT, Want More Dates???
-<tracy78@@kilgrona.com> wrote:
->Learn the SECRET to Attract Women Easily
->
->The SCENT(tm) Pheromone Sex Attractant For Men to Attract Women
-
-The scent of awk programmers is a lot more attractive to women than
-the scent of perl programmers.
---
-Mike Brennan
address@hidden brennan@@whidbey.com
address@hidden smallexample
-
address@hidden advanced features, @command{gawk}, address@hidden communicating with
address@hidden processes, two-way communications with
-It is often useful to be able to
-send data to a separate program for
-processing and then read the result. This can always be
-done with temporary files:
-
address@hidden
-# Write the data for processing
-tempfile = ("mydata." PROCINFO["pid"])
-while (@var{not done with data})
- print @var{data} | ("subprogram > " tempfile)
-close("subprogram > " tempfile)
-
-# Read the results, remove tempfile when done
-while ((getline newdata < tempfile) > 0)
- @var{process} newdata @var{appropriately}
-close(tempfile)
-system("rm " tempfile)
address@hidden example
-
address@hidden
-This works, but not elegantly. Among other things, it requires that
-the program be run in a directory that cannot be shared among users;
-for example, @file{/tmp} will not do, as another user might happen
-to be using a temporary file with the same name.
-
address@hidden coprocesses
address@hidden input/output, two-way
address@hidden @code{|} (vertical bar), @code{|&} operator (I/O)
address@hidden vertical bar (@code{|}), @code{|&} operator (I/O)
address@hidden @command{csh} utility, @code{|&} operator, comparison with
-However, with @command{gawk}, it is possible to
-open a @emph{two-way} pipe to another process. The second process is
-termed a @dfn{coprocess}, since it runs in parallel with @command{gawk}.
-The two-way connection is created using the @samp{|&} operator
-(borrowed from the Korn shell, @command{ksh}):@footnote{This is very
-different from the same operator in the C shell.}
-
address@hidden
-do @{
- print @var{data} |& "subprogram"
- "subprogram" |& getline results
address@hidden while (@var{data left to process})
-close("subprogram")
address@hidden example
-
-The first time an I/O operation is executed using the @samp{|&}
-operator, @command{gawk} creates a two-way pipeline to a child process
-that runs the other program. Output created with @code{print}
-or @code{printf} is written to the program's standard input, and
-output from the program's standard output can be read by the @command{gawk}
-program using @code{getline}.
-As is the case with processes started by @samp{|}, the subprogram
-can be any program, or pipeline of programs, that can be started by
-the shell.
-
-There are some cautionary items to be aware of:
-
address@hidden @bullet
address@hidden
-As the code inside @command{gawk} currently stands, the coprocess's
-standard error goes to the same place that the parent @command{gawk}'s
-standard error goes. It is not possible to read the child's
-standard error separately.
-
address@hidden deadlocks
address@hidden buffering, input/output
address@hidden @code{getline} command, deadlock and
address@hidden
-I/O buffering may be a problem. @command{gawk} automatically
-flushes all output down the pipe to the coprocess.
-However, if the coprocess does not flush its output,
address@hidden may hang when doing a @code{getline} in order to read
-the coprocess's results. This could lead to a situation
-known as @dfn{deadlock}, where each process is waiting for the
-other one to do something.
address@hidden itemize
-
address@hidden @code{close()} function, two-way pipes and
-It is possible to close just one end of the two-way pipe to
-a coprocess, by supplying a second argument to the @code{close()}
-function of either @code{"to"} or @code{"from"}
-(@pxref{Close Files And Pipes}).
-These strings tell @command{gawk} to close the end of the pipe
-that sends data to the coprocess or the end that reads from it,
-respectively.
-
address@hidden @command{sort} utility, coprocesses and
-This is particularly necessary in order to use
-the system @command{sort} utility as part of a coprocess;
address@hidden must read @emph{all} of its input
-data before it can produce any output.
-The @command{sort} program does not receive an end-of-file indication
-until @command{gawk} closes the write end of the pipe.
-
-When you have finished writing data to the @command{sort}
-utility, you can close the @code{"to"} end of the pipe, and
-then start reading sorted data via @code{getline}.
-For example:
-
address@hidden
-BEGIN @{
- command = "LC_ALL=C sort"
- n = split("abcdefghijklmnopqrstuvwxyz", a, "")
-
- for (i = n; i > 0; i--)
- print a[i] |& command
- close(command, "to")
-
- while ((command |& getline line) > 0)
- print "got", line
- close(command)
address@hidden
address@hidden example
-
-This program writes the letters of the alphabet in reverse order, one
-per line, down the two-way pipe to @command{sort}. It then closes the
-write end of the pipe, so that @command{sort} receives an end-of-file
-indication. This causes @command{sort} to sort the data and write the
-sorted data back to the @command{gawk} program. Once all of the data
-has been read, @command{gawk} terminates the coprocess and exits.
-
-As a side note, the assignment @samp{LC_ALL=C} in the @command{sort}
-command ensures traditional Unix (ASCII) sorting from @command{sort}.
-
address@hidden @command{gawk}, @code{PROCINFO} array in
address@hidden @code{PROCINFO} array
-You may also use pseudo-ttys (ptys) for
-two-way communication instead of pipes, if your system supports them.
-This is done on a per-command basis, by setting a special element
-in the @code{PROCINFO} array
-(@pxref{Auto-set}),
-like so:
-
address@hidden
-command = "sort -nr" # command, save in convenience variable
-PROCINFO[command, "pty"] = 1 # update PROCINFO
-print @dots{} |& command # start two-way pipe
address@hidden
address@hidden example
-
address@hidden
-Using ptys avoids the buffer deadlock issues described earlier, at some
-loss in performance. If your system does not have ptys, or if all the
-system's ptys are in use, @command{gawk} automatically falls back to
-using regular pipes.
-
address@hidden TCP/IP Networking
address@hidden Using @command{gawk} for Network Programming
address@hidden advanced features, @command{gawk}, network programming
address@hidden networks, programming
address@hidden STARTOFRANGE tcpip
address@hidden TCP/IP
address@hidden @code{/inet/@dots{}} special files (@command{gawk})
address@hidden files, @code{/inet/@dots{}} (@command{gawk})
address@hidden @code{/inet4/@dots{}} special files (@command{gawk})
address@hidden files, @code{/inet4/@dots{}} (@command{gawk})
address@hidden @code{/inet6/@dots{}} special files (@command{gawk})
address@hidden files, @code{/inet6/@dots{}} (@command{gawk})
address@hidden @code{EMISTERED}
address@hidden
address@hidden:@*
-@ @ @ @ @i{A host is a host from coast to coast,@*
-@ @ @ @ and no-one can talk to host that's close,@*
-@ @ @ @ unless the host that isn't close@*
-@ @ @ @ is busy hung or dead.}
address@hidden quotation
-
-In addition to being able to open a two-way pipeline to a coprocess
-on the same system
-(@pxref{Two-way I/O}),
-it is possible to make a two-way connection to
-another process on another system across an IP network connection.
-
-You can think of this as just a @emph{very long} two-way pipeline to
-a coprocess.
-The way @command{gawk} decides that you want to use TCP/IP networking is
-by recognizing special @value{FN}s that begin with one of @samp{/inet/},
-@samp{/inet4/} or @samp{/inet6}.
-
-The full syntax of the special @value{FN} is
address@hidden/@var{net-type}/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}}.
-The components are:
-
address@hidden @var
address@hidden net-type
-Specifies the kind of Internet connection to make.
-Use @samp{/inet4/} to force IPv4, and
-@samp{/inet6/} to force IPv6.
-Plain @samp{/inet/} (which used to be the only option) uses
-the system default, most likely IPv4.
-
address@hidden protocol
-The protocol to use over IP. This must be either @samp{tcp}, or
-@samp{udp}, for a TCP or UDP IP connection,
-respectively. The use of TCP is recommended for most applications.
-
address@hidden local-port
address@hidden @code{getaddrinfo()} function (C library)
-The local TCP or UDP port number to use. Use a port number of @samp{0}
-when you want the system to pick a port. This is what you should do
-when writing a TCP or UDP client.
-You may also use a well-known service name, such as @samp{smtp}
-or @samp{http}, in which case @command{gawk} attempts to determine
-the predefined port number using the C @code{getaddrinfo()} function.
-
address@hidden remote-host
-The IP address or fully-qualified domain name of the Internet
-host to which you want to connect.
-
address@hidden remote-port
-The TCP or UDP port number to use on the given @var{remote-host}.
-Again, use @samp{0} if you don't care, or else a well-known
-service name.
address@hidden table
+# Global identifiers:
+# _ord_: numerical values indexed by characters
+# _ord_init: function to initialize _ord_
address@hidden endfile
address@hidden
address@hidden file eg/lib/ord.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# 16 January, 1992
+# 20 July, 1992, revised
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/ord.awk
address@hidden @command{gawk}, @code{ERRNO} variable in
address@hidden @code{ERRNO} variable
address@hidden NOTE
-Failure in opening a two-way socket will result in a non-fatal error
-being returned to the calling code. The value of @code{ERRNO} indicates
-the error (@pxref{Auto-set}).
address@hidden quotation
+BEGIN @{ _ord_init() @}
-Consider the following very simple example:
+function _ord_init( low, high, i, t)
address@hidden
+ low = sprintf("%c", 7) # BEL is ascii 7
+ if (low == "\a") @{ # regular ascii
+ low = 0
+ high = 127
+ @} else if (sprintf("%c", 128 + 7) == "\a") @{
+ # ascii, mark parity
+ low = 128
+ high = 255
+ @} else @{ # ebcdic(!)
+ low = 0
+ high = 255
+ @}
address@hidden
-BEGIN @{
- Service = "/inet/tcp/0/localhost/daytime"
- Service |& getline
- print $0
- close(Service)
+ for (i = low; i <= high; i++) @{
+ t = sprintf("%c", i)
+ _ord_[t] = i
+ @}
@}
address@hidden endfile
@end example
-This program reads the current date and time from the local system's
-TCP @samp{daytime} server.
-It then prints the results and closes the connection.
-
-Because this topic is extensive, the use of @command{gawk} for
-TCP/IP programming is documented separately.
address@hidden
-See
address@hidden, , General Introduction, gawkinet, TCP/IP Internetworking with @command{gawk}},
address@hidden ifinfo
address@hidden
-See @cite{TCP/IP Internetworking with @command{gawk}},
-which comes as part of the @command{gawk} distribution,
address@hidden ifnotinfo
-for a much more complete introduction and discussion, as well as
-extensive examples.
-
address@hidden ENDOFRANGE tcpip
-
address@hidden Profiling
address@hidden Profiling Your @command{awk} Programs
address@hidden STARTOFRANGE awkp
address@hidden @command{awk} programs, profiling
address@hidden STARTOFRANGE proawk
address@hidden profiling @command{awk} programs
address@hidden profiling @command{gawk}
address@hidden @code{awkprof.out} file
address@hidden files, @code{awkprof.out}
-
-You may produce execution traces of your @command{awk} programs.
-This is done by passing the option @option{--profile} to @command{gawk}.
-When @command{gawk} has finished running, it creates a profile of your program in a file
-named @file{awkprof.out}. Because it is profiling, it also executes up to 45% slower than
-@command{gawk} normally does.
-
address@hidden @code{--profile} option
-As shown in the following example,
-the @option{--profile} option can be used to change the name of the file
-where @command{gawk} will write the profile:
-
address@hidden
-gawk --profile=myprog.prof -f myprog.awk data1 data2
address@hidden example
-
address@hidden
-In the above example, @command{gawk} places the profile in
-@file{myprog.prof} instead of in @file{awkprof.out}.
-
-Here is a sample session showing a simple @command{awk} program, its input data, and the
-results from running @command{gawk} with the @option{--profile} option.
-First, the @command{awk} program:
address@hidden character sets (machine character encodings)
address@hidden ASCII
address@hidden EBCDIC
address@hidden mark parity
+Some explanation of the numbers used by @code{chr} is worthwhile.
+The most prominent character set in use today is ASCII.@footnote{This
+is changing; many systems use Unicode, a very large character set
+that includes ASCII as a subset. On systems with full Unicode support,
+a character can occupy up to 32 bits, making simple tests such as
+used here prohibitively expensive.}
+Although an
+8-bit byte can hold 256 distinct values (from 0 to 255), ASCII only
+defines characters that use the values from 0 to 127.@footnote{ASCII
+has been extended in many countries to use the values from 128 to 255
+for country-specific characters. If your system uses these extensions,
+you can simplify @code{_ord_init} to loop from 0 to 255.}
+In the now distant past,
+at least one minicomputer manufacturer
address@hidden Pr1me, blech
+used ASCII, but with mark parity, meaning that the leftmost bit in the byte
+is always 1. This means that on those systems, characters
+have numeric values from 128 to 255.
+Finally, large mainframe systems use the EBCDIC character set, which
+uses all 256 values.
+While there are other character sets in use on some older systems,
+they are not really worth worrying about:
@example
-BEGIN @{ print "First BEGIN rule" @}
-
-END @{ print "First END rule" @}
-
-/foo/ @{
- print "matched /foo/, gosh"
- for (i = 1; i <= 3; i++)
- sing()
address@hidden
-
address@hidden file eg/lib/ord.awk
+function ord(str, c)
@{
- if (/foo/)
- print "if is true"
- else
- print "else is true"
+ # only first character is of interest
+ c = substr(str, 1, 1)
+ return _ord_[c]
@}
-BEGIN @{ print "Second BEGIN rule" @}
-
-END @{ print "Second END rule" @}
-
-function sing( dummy)
+function chr(c)
@{
- print "I gotta be me!"
+ # force c to be numeric by adding 0
+ return sprintf("%c", c + 0)
@}
address@hidden example
-
-Following is the input data:
address@hidden endfile
address@hidden
-foo
-bar
-baz
-foo
-junk
+#### test code ####
+# BEGIN \
+# @{
+# for (;;) @{
+# printf("enter a character: ")
+# if (getline var <= 0)
+# break
+# printf("ord(%s) = %d\n", var, ord(var))
+# @}
+# @}
address@hidden endfile
@end example
-Here is the @file{awkprof.out} that results from running the @command{gawk}
-profiler on this program and data (this example also illustrates that @command{awk}
-programmers sometimes have to work late):
-
address@hidden @code{BEGIN} pattern
address@hidden @code{END} pattern
address@hidden
- # gawk profile, created Sun Aug 13 00:00:15 2000
+An obvious improvement to these functions is to move the code for the
+@code{@w{_ord_init}} function into the body of the @code{BEGIN} rule. It was
+written this way initially for ease of development.
+There is a ``test program'' in a @code{BEGIN} rule, to test the
+function. It is commented out for production use.
- # BEGIN block(s)
address@hidden Join Function
address@hidden Merging an Array into a String
- BEGIN @{
- 1 print "First BEGIN rule"
- 1 print "Second BEGIN rule"
- @}
address@hidden libraries of @command{awk} functions, merging arrays into strings
address@hidden functions, library, merging arrays into strings
address@hidden strings, merging arrays into
address@hidden arrays, merging into strings
+When doing string processing, it is often useful to be able to join
+all the strings in an array into one long string. The following function,
+@code{join()}, accomplishes this task. It is used later in several of
+the application programs
+(@pxref{Sample Programs}).
- # Rule(s)
+Good function design is important; this function needs to be general but it
+should also have a reasonable default behavior. It is called with an array
+as well as the beginning and ending indices of the elements in the array to be
+merged. This assumes that the array indices are numeric---a reasonable
+assumption since the array was likely created with @code{split()}
+(@pxref{String Functions}):
- 5 /foo/ @{ # 2
- 2 print "matched /foo/, gosh"
- 6 for (i = 1; i <= 3; i++) @{
- 6 sing()
- @}
- @}
address@hidden @code{join()} user-defined function
address@hidden
address@hidden file eg/lib/join.awk
+# join.awk --- join an array into a string
address@hidden endfile
address@hidden
address@hidden file eg/lib/join.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# May 1993
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/join.awk
- 5 @{
- 5 if (/foo/) @{ # 2
- 2 print "if is true"
- 3 @} else @{
- 3 print "else is true"
- @}
- @}
+function join(array, start, end, sep, result, i)
+@{
+ if (sep == "")
+ sep = " "
+ else if (sep == SUBSEP) # magic value
+ sep = ""
+ result = array[start]
+ for (i = start + 1; i <= end; i++)
+ result = result sep array[i]
+ return result
+@}
+@c endfile
+@end example
- # END block(s)
+An optional additional argument is the separator to use when joining the
+strings back together. If the caller supplies a nonempty value,
+@code{join()} uses it; if it is not supplied, it has a null
+value. In this case, @code{join()} uses a single space as a default
+separator for the strings. If the value is equal to @code{SUBSEP},
+then @code{join()} joins the strings with no separator between them.
+@code{SUBSEP} serves as a ``magic'' value to indicate that there should
+be no separation between the component substrings.@footnote{It would
+be nice if @command{awk} had an assignment operator for concatenation.
+The lack of an explicit operator for concatenation makes string operations
+more difficult than they really need to be.}
- END @{
- 1 print "First END rule"
- 1 print "Second END rule"
- @}
address@hidden Getlocaltime Function
address@hidden Managing the Time of Day
- # Functions, listed alphabetically
address@hidden libraries of @command{awk} functions, managing, time
address@hidden functions, library, managing time
address@hidden timestamps, formatted
address@hidden time, managing
+The @code{systime()} and @code{strftime()} functions described in
+@ref{Time Functions},
+provide the minimum functionality necessary for dealing with the time of day
+in human readable form. While @code{strftime()} is extensive, the control
+formats are not necessarily easy to remember or intuitively obvious when
+reading a program.
- 6 function sing(dummy)
- @{
- 6 print "I gotta be me!"
- @}
address@hidden example
+The following function, @code{getlocaltime()}, populates a user-supplied array
+with preformatted time information. It returns a string with the current
+time formatted in the same way as the @command{date} utility:
-This example illustrates many of the basic features of profiling output.
-They are as follows:
address@hidden @code{getlocaltime()} user-defined function
address@hidden
address@hidden file eg/lib/gettime.awk
+# getlocaltime.awk --- get the time of day in a usable format
address@hidden endfile
address@hidden
address@hidden file eg/lib/gettime.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain, May 1993
+#
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/gettime.awk
address@hidden @bullet
address@hidden
-The program is printed in the order @code{BEGIN} rule,
-@code{BEGINFILE} rule,
-pattern/action rules,
-@code{ENDFILE} rule, @code{END} rule and functions, listed
-alphabetically.
-Multiple @code{BEGIN} and @code{END} rules are merged together,
-as are multiple @code{BEGINFILE} and @code{ENDFILE} rules.
+# Returns a string in the format of output of date(1)
+# Populates the array argument time with individual values:
+# time["second"] -- seconds (0 - 59)
+# time["minute"] -- minutes (0 - 59)
+# time["hour"] -- hours (0 - 23)
+# time["althour"] -- hours (0 - 12)
+# time["monthday"] -- day of month (1 - 31)
+# time["month"] -- month of year (1 - 12)
+# time["monthname"] -- name of the month
+# time["shortmonth"] -- short name of the month
+# time["year"] -- year modulo 100 (0 - 99)
+# time["fullyear"] -- full year
+# time["weekday"] -- day of week (Sunday = 0)
+# time["altweekday"] -- day of week (Monday = 0)
+# time["dayname"] -- name of weekday
+# time["shortdayname"] -- short name of weekday
+# time["yearday"] -- day of year (0 - 365)
+# time["timezone"] -- abbreviation of timezone name
+# time["ampm"] -- AM or PM designation
+# time["weeknum"] -- week number, Sunday first day
+# time["altweeknum"] -- week number, Monday first day
address@hidden patterns, counts
address@hidden
-Pattern-action rules have two counts.
-The first count, to the left of the rule, shows how many times
-the rule's pattern was @emph{tested}.
-The second count, to the right of the rule's opening left brace
-in a comment,
-shows how many times the rule's action was @emph{executed}.
-The difference between the two indicates how many times the rule's
-pattern evaluated to false.
+function getlocaltime(time, ret, now, i)
+@{
+ # get time once, avoids unnecessary system calls
+ now = systime()
address@hidden
-Similarly,
-the count for an @code{if}-@code{else} statement shows how many times
-the condition was tested.
-To the right of the opening left brace for the @code{if}'s body
-is a count showing how many times the condition was true.
-The count for the @code{else}
-indicates how many times the test failed.
+ # return date(1)-style output
+ ret = strftime("%a %b %e %H:%M:%S %Z %Y", now)
address@hidden loops, count for header
address@hidden
-The count for a loop header (such as @code{for}
-or @code{while}) shows how many times the loop test was executed.
-(Because of this, you can't just look at the count on the first
-statement in a rule to determine how many times the rule was executed.
-If the first statement is a loop, the count is misleading.)
+ # clear out target array
+ delete time
address@hidden functions, user-defined, counts
address@hidden user-defined, functions, counts
address@hidden
-For user-defined functions, the count next to the @code{function}
-keyword indicates how many times the function was called.
-The counts next to the statements in the body show how many times
-those statements were executed.
+ # fill in values, force numeric values to be
+ # numeric by adding 0
+ time["second"] = strftime("%S", now) + 0
+ time["minute"] = strftime("%M", now) + 0
+ time["hour"] = strftime("%H", now) + 0
+ time["althour"] = strftime("%I", now) + 0
+ time["monthday"] = strftime("%d", now) + 0
+ time["month"] = strftime("%m", now) + 0
+ time["monthname"] = strftime("%B", now)
+ time["shortmonth"] = strftime("%b", now)
+ time["year"] = strftime("%y", now) + 0
+ time["fullyear"] = strftime("%Y", now) + 0
+ time["weekday"] = strftime("%w", now) + 0
+ time["altweekday"] = strftime("%u", now) + 0
+ time["dayname"] = strftime("%A", now)
+ time["shortdayname"] = strftime("%a", now)
+ time["yearday"] = strftime("%j", now) + 0
+ time["timezone"] = strftime("%Z", now)
+ time["ampm"] = strftime("%p", now)
+ time["weeknum"] = strftime("%U", now) + 0
+ time["altweeknum"] = strftime("%W", now) + 0
address@hidden @address@hidden@}} (braces)
address@hidden braces (@address@hidden@}})
address@hidden
-The layout uses ``K&R'' style with TABs.
-Braces are used everywhere, even when
-the body of an @code{if}, @code{else}, or loop is only a single statement.
+ return ret
+@}
+@c endfile
+@end example
address@hidden @code{()} (parentheses)
address@hidden parentheses @code{()}
address@hidden
-Parentheses are used only where needed, as indicated by the structure
-of the program and the precedence rules.
address@hidden extra verbiage here satisfies the copyeditor. ugh.
-For example, @samp{(3 + 5) * 4} means add three plus five, then multiply
-the total by four. However, @samp{3 + 5 * 4} has no parentheses, and
-means @samp{3 + (5 * 4)}.
+The string indices are easier to use and read than the various formats
+required by @code{strftime()}. The @code{alarm} program presented in
+@ref{Alarm Program},
+uses this function.
+A more general design for the @code{getlocaltime()} function would have
+allowed the user to supply an optional timestamp value to use instead
+of the current time.
address@hidden
address@hidden
-All string concatenations are parenthesized too.
-(This could be made a bit smarter.)
address@hidden ignore
address@hidden Data File Management
address@hidden @value{DDF} Management
address@hidden
-Parentheses are used around the arguments to @code{print}
-and @code{printf} only when
-the @code{print} or @code{printf} statement is followed by a redirection.
-Similarly, if
-the target of a redirection isn't a scalar, it gets parenthesized.
address@hidden STARTOFRANGE dataf
address@hidden files, managing
address@hidden STARTOFRANGE libfdataf
address@hidden libraries of @command{awk} functions, managing, @value{DF}s
address@hidden STARTOFRANGE flibdataf
address@hidden functions, library, managing @value{DF}s
+This @value{SECTION} presents functions that are useful for managing
+command-line @value{DF}s.
address@hidden
-@command{gawk} supplies leading comments in
-front of the @code{BEGIN} and @code{END} rules,
-the pattern/action rules, and the functions.
address@hidden
+* Filetrans Function:: A function for handling data file transitions.
+* Rewind Function:: A function for rereading the current file.
+* File Checking:: Checking that data files are readable.
+* Empty Files:: Checking for zero-length files.
+* Ignoring Assigns:: Treating assignments as file names.
address@hidden menu
address@hidden itemize
address@hidden Filetrans Function
address@hidden Noting @value{DDF} Boundaries
-The profiled version of your program may not look exactly like what you
-typed when you wrote it. This is because @command{gawk} creates the
-profiled version by ``pretty printing'' its internal representation of
-the program. The advantage to this is that @command{gawk} can produce
-a standard representation. The disadvantage is that all source-code
-comments are lost, as are the distinctions among multiple @code{BEGIN},
-@code{END}, @code{BEGINFILE}, and @code{ENDFILE} rules. Also, things such as:
address@hidden files, managing, @value{DF} boundaries
address@hidden files, initialization and cleanup
+The @code{BEGIN} and @code{END} rules are each executed exactly once at
+the beginning and end of your @command{awk} program, respectively
+(@pxref{BEGIN/END}).
+We (the @command{gawk} authors) once had a user who mistakenly thought that the
+@code{BEGIN} rule is executed at the beginning of each @value{DF} and the
+@code{END} rule is executed at the end of each @value{DF}.
address@hidden
-/foo/
address@hidden example
+When informed
+that this was not the case, the user requested that we add new special
+patterns to @command{gawk}, named @code{BEGIN_FILE} and @code{END_FILE}, that
+would have the desired behavior. He even supplied us the code to do so.
address@hidden
-come out as:
+Adding these special patterns to @command{gawk} wasn't necessary;
+the job can be done cleanly in @command{awk} itself, as illustrated
+by the following library program.
+It arranges to call two user-supplied functions, @code{beginfile()} and
+@code{endfile()}, at the beginning and end of each @value{DF}.
+Besides solving the problem in only nine(!) lines of code, it does so
+@emph{portably}; this works with any implementation of @command{awk}:
@example
-/foo/ @{
- print $0
+# transfile.awk
+#
+# Give the user a hook for filename transitions
+#
+# The user must supply functions beginfile() and endfile()
+# that each take the name of the file being started or
+# finished, respectively.
address@hidden #
address@hidden # Arnold Robbins, arnold@@skeeve.com, Public Domain
address@hidden # January 1992
+
+FILENAME != _oldfilename \
+@{
+ if (_oldfilename != "")
+ endfile(_oldfilename)
+ _oldfilename = FILENAME
+ beginfile(FILENAME)
@}
+
+END @{ endfile(FILENAME) @}
@end example
address@hidden
-which is correct, but possibly surprising.
+This file must be loaded before the user's ``main'' program, so that the
+rule it supplies is executed first.
address@hidden profiling @command{awk} programs, dynamically
address@hidden @command{gawk} program, dynamic profiling
-Besides creating profiles when a program has completed,
-@command{gawk} can produce a profile while it is running.
-This is useful if your @command{awk} program goes into an
-infinite loop and you want to see what has been executed.
-To use this feature, run @command{gawk} with the @option{--profile}
-option in the background:
+This rule relies on @command{awk}'s @code{FILENAME} variable that
+automatically changes for each new @value{DF}. The current @value{FN} is
+saved in a private variable, @code{_oldfilename}. If @code{FILENAME} does
+not equal @code{_oldfilename}, then a new @value{DF} is being processed and
+it is necessary to call @code{endfile()} for the old file. Because
+@code{endfile()} should only be called if a file has been processed, the
+program first checks to make sure that @code{_oldfilename} is not the null
+string. The program then assigns the current @value{FN} to
+@code{_oldfilename} and calls @code{beginfile()} for the file.
+Because, like all @command{awk} variables, @code{_oldfilename} is
+initialized to the null string, this rule executes correctly even for the
+first @value{DF}.
address@hidden
-$ @kbd{gawk --profile -f myprog &}
-[1] 13992
address@hidden example
+The program also supplies an @code{END} rule to do the final processing for
+the last file. Because this @code{END} rule comes before any @code{END} rules
+supplied in the ``main'' program, @code{endfile()} is called first. Once
+again the value of multiple @code{BEGIN} and @code{END} rules should be clear.
address@hidden @command{kill} address@hidden dynamic profiling
address@hidden @code{USR1} signal
address@hidden @code{SIGUSR1} signal
address@hidden signals, @code{USR1}/@code{SIGUSR1}
address@hidden
-The shell prints a job number and process ID number; in this case, 13992.
-Use the @command{kill} command to send the @code{USR1} signal
-to @command{gawk}:
address@hidden @code{beginfile()} user-defined function
address@hidden @code{endfile()} user-defined function
+If the same @value{DF} occurs twice in a row on the command line, then
+@code{endfile()} and @code{beginfile()} are not executed at the end of the
+first pass and at the beginning of the second pass.
+The following version solves the problem:
@example
-$ @kbd{kill -USR1 13992}
address@hidden example
address@hidden file eg/lib/ftrans.awk
+# ftrans.awk --- handle data file transitions
+#
+# user supplies beginfile() and endfile() functions
address@hidden endfile
address@hidden
address@hidden file eg/lib/ftrans.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# November 1992
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/ftrans.awk
address@hidden
-As usual, the profiled version of the program is written to
-@file{awkprof.out}, or to a different file if one specified with
-the @option{--profile} option.
+FNR == 1 @{
+ if (_filename_ != "")
+ endfile(_filename_)
+ _filename_ = FILENAME
+ beginfile(FILENAME)
+@}
-Along with the regular profile, as shown earlier, the profile
-includes a trace of any active functions:
+END @{ endfile(_filename_) @}
address@hidden endfile
address@hidden example
address@hidden
-# Function Call Stack:
+@ref{Wc Program},
+shows how this library function can be used and
+how it simplifies writing the main program.
-# 3. baz
-# 2. bar
-# 1. foo
-# -- main --
address@hidden example
address@hidden fakenode --- for prepinfo
address@hidden Advanced Notes: So Why Does @command{gawk} have @code{BEGINFILE} and @code{ENDFILE}?
-You may send @command{gawk} the @code{USR1} signal as many times as you like.
-Each time, the profile and function call trace are appended to the output
-profile file.
+You are probably wondering, if @code{beginfile()} and @code{endfile()}
+functions can do the job, why does @command{gawk} have
+@code{BEGINFILE} and @code{ENDFILE} patterns (@pxref{BEGINFILE/ENDFILE})?
address@hidden @code{HUP} signal
address@hidden @code{SIGHUP} signal
address@hidden signals, @code{HUP}/@code{SIGHUP}
-If you use the @code{HUP} signal instead of the @code{USR1} signal,
-@command{gawk} produces the profile and the function call trace and then exits.
+Good question. Normally, if @command{awk} cannot open a file, this
+causes an immediate fatal error. In this case, there is no way for a
+user-defined function to deal with the problem, since the mechanism for
+calling it relies on the file being open and at the first record. Thus,
+the main reason for @code{BEGINFILE} is to give you a ``hook'' to catch
+files that cannot be processed. @code{ENDFILE} exists for symmetry,
+and because it provides an easy way to do per-file cleanup processing.
address@hidden @code{INT} signal (MS-Windows)
address@hidden @code{SIGINT} signal (MS-Windows)
address@hidden signals, @code{INT}/@code{SIGINT} (MS-Windows)
address@hidden @code{QUIT} signal (MS-Windows)
address@hidden @code{SIGQUIT} signal (MS-Windows)
address@hidden signals, @code{QUIT}/@code{SIGQUIT} (MS-Windows)
-When @command{gawk} runs on MS-Windows systems, it uses the
-@code{INT} and @code{QUIT} signals for producing the profile and, in
-the case of the @code{INT} signal, @command{gawk} exits. This is
-because these systems don't support the @command{kill} command, so the
-only signals you can deliver to a program are those generated by the
-keyboard. The @code{INT} signal is generated by the
-@kbd{@key{Ctrl}-@key{C}} or @kbd{@key{Ctrl}-@key{BREAK}} key, while the
-@code{QUIT} signal is generated by the @kbd{@key{Ctrl}-@key{\}} key.
address@hidden Rewind Function
address@hidden Rereading the Current File
-Finally, @command{gawk} also accepts another option @option{--pretty-print}.
-When called this way, @command{gawk} ``pretty prints'' the program into
-@file{awkprof.out}, without any execution counts.
address@hidden ENDOFRANGE advgaw
address@hidden ENDOFRANGE gawadv
address@hidden ENDOFRANGE awkp
address@hidden ENDOFRANGE proawk
address@hidden files, reading
+Another request for a new built-in function was for a @code{rewind()}
+function that would make it possible to reread the current file.
+The requesting user didn't want to have to use @code{getline}
+(@pxref{Getline})
+inside a loop.
address@hidden Library Functions
address@hidden A Library of @command{awk} Functions
address@hidden STARTOFRANGE libf
address@hidden libraries of @command{awk} functions
address@hidden STARTOFRANGE flib
address@hidden functions, library
address@hidden STARTOFRANGE fudlib
address@hidden functions, user-defined, library of
+However, as long as you are not in the @code{END} rule, it is
+quite easy to arrange to immediately close the current input file
+and then start over with it from the top.
+For lack of a better name, we'll call it @code{rewind()}:
-@ref{User-defined}, describes how to write
-your own @command{awk} functions. Writing functions is important, because
-it allows you to encapsulate algorithms and program tasks in a single
-place. It simplifies programming, making program development more
-manageable, and making programs more readable.
address@hidden @code{rewind()} user-defined function
address@hidden
address@hidden file eg/lib/rewind.awk
+# rewind.awk --- rewind the current file and start over
address@hidden endfile
address@hidden
address@hidden file eg/lib/rewind.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# September 2000
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/rewind.awk
-One valuable way to learn a new programming language is to @emph{read}
-programs in that language. To that end, this @value{CHAPTER}
-and @ref{Sample Programs},
-provide a good-sized body of code for you to read,
-and hopefully, to learn from.
+function rewind( i)
+@{
+ # shift remaining arguments up
+ for (i = ARGC; i > ARGIND; i--)
+ ARGV[i] = ARGV[i-1]
address@hidden 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
-This @value{CHAPTER} presents a library of useful @command{awk} functions.
-Many of the sample programs presented later in this @value{DOCUMENT}
-use these functions.
-The functions are presented here in a progression from simple to complex.
+ # make sure gawk knows to keep going
+ ARGC++
address@hidden Texinfo
-@ref{Extract Program},
-presents a program that you can use to extract the source code for
-these example library functions and programs from the Texinfo source
-for this @value{DOCUMENT}.
-(This has already been done as part of the @command{gawk} distribution.)
+ # make current file next to get done
+ ARGV[ARGIND+1] = FILENAME
-If you have written one or more useful, general-purpose @command{awk} functions
-and would like to contribute them to the @command{awk} user community, see
address@hidden To Contribute}, for more information.
+ # do it
+ nextfile
address@hidden
address@hidden endfile
address@hidden example
address@hidden portability, example programs
-The programs in this @value{CHAPTER} and in
address@hidden Programs},
-freely use features that are @command{gawk}-specific.
-Rewriting these programs for different implementations of @command{awk}
-is pretty straightforward.
+This code relies on the @code{ARGIND} variable
+(@pxref{Auto-set}),
+which is specific to @command{gawk}.
+If you are not using
+@command{gawk}, you can use ideas presented in
address@hidden
+the previous @value{SECTION}
address@hidden ifnotinfo
address@hidden
address@hidden Function},
address@hidden ifinfo
+to either update @code{ARGIND} on your own
+or modify this code as appropriate.
address@hidden @bullet
address@hidden
-Diagnostic error messages are sent to @file{/dev/stderr}.
-Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"} if your system
-does not have a @file{/dev/stderr}, or if you cannot use @command{gawk}.
+The @code{rewind()} function also relies on the @code{nextfile} keyword
+(@pxref{Nextfile Statement}).
address@hidden
-A number of programs use @code{nextfile}
-(@pxref{Nextfile Statement})
-to skip any remaining input in the input file.
address@hidden File Checking
address@hidden Checking for Readable @value{DDF}s
address@hidden
address@hidden 12/2000: Thanks to Nelson Beebe for pointing out the output
issue.
address@hidden case sensitivity, example programs
address@hidden @code{IGNORECASE} variable, in example programs
-Finally, some of the programs choose to ignore upper- and lowercase
-distinctions in their input. They do so by assigning one to @code{IGNORECASE}.
-You can achieve almost the same address@hidden effects are
-not identical. Output of the transformed
-record will be in all lowercase, while @code{IGNORECASE} preserves the original
-contents of the input record.} by adding the following rule to the
-beginning of the program:
address@hidden troubleshooting, readable @value{DF}s
address@hidden readable @address@hidden checking
address@hidden files, skipping
+Normally, if you give @command{awk} a @value{DF} that isn't readable,
+it stops with a fatal error. There are times when you
+might want to just ignore such files and keep going. You can
+do this by prepending the following program to your @command{awk}
+program:
address@hidden @code{readable.awk} program
@example
-# ignore case
address@hidden $0 = tolower($0) @}
address@hidden example
-
address@hidden
-Also, verify that all regexp and string constants used in
-comparisons use only lowercase letters.
address@hidden itemize
address@hidden file eg/lib/readable.awk
+# readable.awk --- library file to skip over unreadable files
address@hidden endfile
address@hidden
address@hidden file eg/lib/readable.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# October 2000
+# December 2010
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/readable.awk
address@hidden
-* Library Names:: How to best name private global variables in
- library functions.
-* General Functions:: Functions that are of general use.
-* Data File Management:: Functions for managing command-line data
- files.
-* Getopt Function:: A function for processing command-line
- arguments.
-* Passwd Functions:: Functions for getting user information.
-* Group Functions:: Functions for getting group information.
-* Walking Arrays:: A function to walk arrays of arrays.
address@hidden menu
+BEGIN @{
+ for (i = 1; i < ARGC; i++) @{
+ if (ARGV[i] ~ /^[[:alpha:]_][[:alnum:]_]*=.*/ \
+ || ARGV[i] == "-" || ARGV[i] == "/dev/stdin")
+ continue # assignment or standard input
+ else if ((getline junk < ARGV[i]) < 0) # unreadable
+ delete ARGV[i]
+ else
+ close(ARGV[i])
+ @}
address@hidden
address@hidden endfile
address@hidden example
address@hidden Library Names
address@hidden Naming Library Function Global Variables
address@hidden troubleshooting, @code{getline} function
+This works because a failed @code{getline} is not fatal; it simply returns a negative value.
+Removing the element from @code{ARGV} with @code{delete}
+skips the file (since it's no longer in the list).
+See also @ref{ARGC and ARGV}.
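The @code{getline} test used above can be tried in isolation. The following is a minimal sketch, not part of @file{readable.awk}: the shell wrapper name @code{check_readable} is hypothetical, and the script only reports what @code{getline} returns for each argument instead of editing @code{ARGV}.

```shell
# check_readable FILE... --- hypothetical helper, for illustration only.
# Reports whether a redirected getline can open each argument.
check_readable() {
    awk 'BEGIN {
        for (i = 1; i < ARGC; i++) {
            # a negative return value means the file could not be opened
            if ((getline junk < ARGV[i]) < 0)
                print ARGV[i] ": unreadable"
            else {
                print ARGV[i] ": readable"
                close(ARGV[i])    # release the file so it can be reread later
            }
        }
    }' "$@"
}
```

A return value of zero (an empty but readable file) counts as readable here, just as it does in the library file.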
address@hidden names, arrays/variables
address@hidden names, functions
address@hidden namespace issues
address@hidden @command{awk} programs, documenting
address@hidden documentation, of @command{awk} programs
-Due to the way the @command{awk} language evolved, variables are either
address@hidden (usable by the entire program) or @dfn{local} (usable just by
-a specific function). There is no intermediate state analogous to
address@hidden variables in C.
address@hidden Empty Files
address@hidden Checking For Zero-length Files
address@hidden variables, global, for library functions
address@hidden private variables
address@hidden variables, private
-Library functions often need to have global variables that they can use to
-preserve state information between calls to the function---for example,
address@hidden()}'s variable @code{_opti}
-(@pxref{Getopt Function}).
-Such variables are called @dfn{private}, since the only functions that need to
-use them are the ones in the library.
+All known @command{awk} implementations silently skip over zero-length files.
+This is a by-product of @command{awk}'s implicit
+read-a-record-and-match-against-the-rules loop: when @command{awk}
+tries to read a record from an empty file, it immediately receives an
+end of file indication, closes the file, and proceeds on to the next
+command-line @value{DF}, @emph{without} executing any user-level
+@command{awk} program code.
-When writing a library function, you should try to choose names for your
-private variables that will not conflict with any variables used by
-either another library function or a user's main program. For example, a
-name like @code{i} or @code{j} is not a good choice, because user programs
-often use variable names like these for their own purposes.
+Using @command{gawk}'s @code{ARGIND} variable
+(@pxref{Built-in Variables}), it is possible to detect when an empty
+@value{DF} has been skipped. Similar to the library file presented
+in @ref{Filetrans Function}, the following library file calls a function named
+@code{zerofile()} that the user must provide. The arguments passed are
+the @value{FN} and the position in @code{ARGV} where it was found:
address@hidden programming conventions, private variable names
-The example programs shown in this @value{CHAPTER} all start the names of their
-private variables with an underscore (@samp{_}). Users generally don't use
-leading underscores in their variable names, so this convention immediately
-decreases the chances that the variable name will be accidentally shared
-with the user's program.
address@hidden @code{zerofile.awk} program
address@hidden
address@hidden file eg/lib/zerofile.awk
+# zerofile.awk --- library file to process empty input files
address@hidden endfile
address@hidden
address@hidden file eg/lib/zerofile.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# June 2003
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/zerofile.awk
address@hidden @code{_} (underscore), in names of private variables
address@hidden underscore (@code{_}), in names of private variables
-In addition, several of the library functions use a prefix that helps
-indicate what function or set of functions use the variables---for example,
address@hidden in the user database routines
-(@pxref{Passwd Functions}).
-This convention is recommended, since it even further decreases the
-chance of inadvertent conflict among variable names. Note that this
-convention is used equally well for variable names and for private
-function address@hidden all the library routines could have
-been rewritten to use this convention, this was not done, in order to
-show how our own @command{awk} programming style has evolved and to
-provide some basis for this discussion.}
+BEGIN @{ Argind = 0 @}
-As a final note on variable naming, if a function makes global variables
-available for use by a main program, it is a good convention to start that
-variable's name with a capital letter---for
-example, @code{getopt()}'s @code{Opterr} and @code{Optind} variables
-(@pxref{Getopt Function}).
-The leading capital letter indicates that it is global, while the fact that
-the variable name is not all capital letters indicates that the variable is
-not one of @command{awk}'s built-in variables, such as @code{FS}.
+ARGIND > Argind + 1 @{
+ for (Argind++; Argind < ARGIND; Argind++)
+ zerofile(ARGV[Argind], Argind)
address@hidden
address@hidden @code{--dump-variables} option
-It is also important that @emph{all} variables in library
-functions that do not need to save state are, in fact, declared
address@hidden@command{gawk}'s @option{--dump-variables} command-line
-option is useful for verifying this.} If this is not done, the variable
-could accidentally be used in the user's program, leading to bugs that
-are very difficult to track down:
+ARGIND != Argind @{ Argind = ARGIND @}
address@hidden
-function lib_func(x, y, l1, l2)
address@hidden
- @dots{}
- @var{use variable} some_var # some_var should be local
- @dots{} # but is not by oversight
+END @{
+ if (ARGIND > Argind)
+ for (Argind++; Argind <= ARGIND; Argind++)
+ zerofile(ARGV[Argind], Argind)
@}
address@hidden endfile
@end example
address@hidden arrays, associative, library functions and
address@hidden libraries of @command{awk} functions, associative arrays and
address@hidden functions, library, associative arrays and
address@hidden Tcl
-A different convention, common in the Tcl community, is to use a single
-associative array to hold the values needed by the library function(s), or
-``package.'' This significantly decreases the number of actual global names
-in use. For example, the functions described in
address@hidden Functions},
-might have used array elements @address@hidden"inited"]}},
@address@hidden"total"]}},
address@hidden@w{PW_data["count"]}}, and @address@hidden"awklib"]}}, instead of
address@hidden@w{_pw_inited}}, @address@hidden, @address@hidden,
-and @address@hidden
+The user-level variable @code{Argind} allows the @command{awk} program
+to track its progress through @code{ARGV}. Whenever the program detects
+that @code{ARGIND} is greater than @samp{Argind + 1}, it means that one or
+more empty files were skipped. The action then calls @code{zerofile()} for
+each such file, incrementing @code{Argind} along the way.
-The conventions presented in this @value{SECTION} are exactly
-that: conventions. You are not required to write your programs this
-way---we merely recommend that you do so.
+The @samp{ARGIND != Argind} rule simply keeps @code{Argind} up to date
+in the normal case.
address@hidden General Functions
address@hidden General Programming
+Finally, the @code{END} rule catches the case of any empty files at
+the end of the command-line arguments. Note that the test in the
+condition of the @code{for} loop uses the @samp{<=} operator,
+not @samp{<}.
-This @value{SECTION} presents a number of functions that are of general
-programming use.
+As an exercise, you might consider whether this same problem can
+be solved without relying on @command{gawk}'s @code{ARGIND} variable.
+
+As a second exercise, revise this code to handle the case where
+an intervening value in @code{ARGV} is a variable assignment.
+
address@hidden
+# zerofile2.awk --- same thing, portably
+
+BEGIN @{
+ ARGIND = Argind = 0
+ for (i = 1; i < ARGC; i++)
+ Fnames[ARGV[i]]++
+
address@hidden
+FNR == 1 @{
+ while (ARGV[ARGIND] != FILENAME)
+ ARGIND++
+ Seen[FILENAME]++
+ if (Seen[FILENAME] == Fnames[FILENAME])
+ do
+ ARGIND++
+ while (ARGV[ARGIND] != FILENAME)
address@hidden
+ARGIND > Argind + 1 @{
+ for (Argind++; Argind < ARGIND; Argind++)
+ zerofile(ARGV[Argind], Argind)
address@hidden
+ARGIND != Argind @{
+ Argind = ARGIND
address@hidden
+END @{
+ if (ARGIND < ARGC - 1)
+ ARGIND = ARGC - 1
+ if (ARGIND > Argind)
+ for (Argind++; Argind <= ARGIND; Argind++)
+ zerofile(ARGV[Argind], Argind)
address@hidden
address@hidden ignore
address@hidden
-* Strtonum Function:: A replacement for the built-in
- @code{strtonum()} function.
-* Assert Function:: A function for assertions in @command{awk}
- programs.
-* Round Function:: A function for rounding if @code{sprintf()}
- does not do it correctly.
-* Cliff Random Function:: The Cliff Random Number Generator.
-* Ordinal Functions:: Functions for using characters as numbers and
- vice versa.
-* Join Function:: A function to join an array into a string.
-* Getlocaltime Function:: A function to get formatted times.
address@hidden menu
address@hidden Ignoring Assigns
address@hidden Treating Assignments as @value{FFN}s
address@hidden Strtonum Function
address@hidden Converting Strings To Numbers
address@hidden assignments as filenames
address@hidden filenames, assignments as
+Occasionally, you might not want @command{awk} to process command-line
+variable assignments
+(@pxref{Assignment Options}).
+In particular, if you have a @value{FN} that contains an @samp{=} character,
+@command{awk} treats the @value{FN} as an assignment, and does not process it.
-The @code{strtonum()} function (@pxref{String Functions})
-is a @command{gawk} extension. The following function
-provides an implementation for other versions of @command{awk}:
+Some users have suggested an additional command-line option for @command{gawk}
+to disable command-line assignments. However, some simple programming with
+a library file does the trick:
address@hidden @code{noassign.awk} program
@example
address@hidden file eg/lib/strtonum.awk
-# mystrtonum --- convert string to number
-
address@hidden file eg/lib/noassign.awk
+# noassign.awk --- library file to avoid the need for a
+# special option that disables command-line assignments
@c endfile
@ignore
address@hidden file eg/lib/strtonum.awk
address@hidden file eg/lib/noassign.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# February, 2004
-
+# October 1999
@c endfile
@end ignore
address@hidden file eg/lib/strtonum.awk
-function mystrtonum(str, ret, chars, n, i, k, c)
address@hidden
- if (str ~ /^0[0-7]*$/) @{
- # octal
- n = length(str)
- ret = 0
- for (i = 1; i <= n; i++) @{
- c = substr(str, i, 1)
- if ((k = index("01234567", c)) > 0)
- k-- # adjust for 1-basing in awk
-
- ret = ret * 8 + k
- @}
- @} else if (str ~ /^0[xX][[:xdigit:]]+/) @{
- # hexadecimal
- str = substr(str, 3) # lop off leading 0x
- n = length(str)
- ret = 0
- for (i = 1; i <= n; i++) @{
- c = substr(str, i, 1)
- c = tolower(c)
- if ((k = index("0123456789", c)) > 0)
- k-- # adjust for 1-basing in awk
- else if ((k = index("abcdef", c)) > 0)
- k += 9
-
- ret = ret * 16 + k
- @}
- @} else if (str ~ \
- /^[-+]?([0-9]+([.][0-9]*([Ee][0-9]+)?)?|([.][0-9]+([Ee][-+]?[0-9]+)?))$/) @{
- # decimal number, possibly floating point
- ret = str + 0
- @} else
- ret = "NOT-A-NUMBER"
address@hidden file eg/lib/noassign.awk
- return ret
+function disable_assigns(argc, argv, i)
address@hidden
+ for (i = 1; i < argc; i++)
+ if (argv[i] ~ /^[[:alpha:]_][[:alnum:]_]*=.*/)
+ argv[i] = ("./" argv[i])
@}
-# BEGIN @{ # gawk test harness
-# a[1] = "25"
-# a[2] = ".31"
-# a[3] = "0123"
-# a[4] = "0xdeadBEEF"
-# a[5] = "123.45"
-# a[6] = "1.e3"
-# a[7] = "1.32"
-# a[7] = "1.32E2"
-#
-# for (i = 1; i in a; i++)
-# print a[i], strtonum(a[i]), mystrtonum(a[i])
-# @}
+BEGIN @{
+ if (No_command_assign)
+ disable_assigns(ARGC, ARGV)
address@hidden
@c endfile
@end example
-The function first looks for C-style octal numbers (base 8).
-If the input string matches a regular expression describing octal
-numbers, then @code{mystrtonum()} loops through each character in the
-string. It sets @code{k} to the index in @code{"01234567"} of the current
-octal digit. Since the return value is one-based, the @samp{k--}
-adjusts @code{k} so it can be used in computing the return value.
+You then run your program this way:
-Similar logic applies to the code that checks for and converts a
-hexadecimal value, which starts with @samp{0x} or @samp{0X}.
-The use of @code{tolower()} simplifies the computation for finding
-the correct numeric value for each hexadecimal digit.
address@hidden
+awk -v No_command_assign=1 -f noassign.awk -f yourprog.awk *
address@hidden example
-Finally, if the string matches the (rather complicated) regexp for a
-regular decimal integer or floating-point number, the computation
address@hidden = str + 0} lets @command{awk} convert the value to a
-number.
+The function works by looping through the arguments.
+It prepends @samp{./} to
+any argument that matches the form
+of a variable assignment, turning that argument into a @value{FN}.
-A commented-out test program is included, so that the function can
-be tested with @command{gawk} and the results compared to the built-in
address@hidden()} function.
+The use of @code{No_command_assign} allows you to disable command-line
+assignments at invocation time, by giving the variable a true value.
+When not set, it is initially zero (i.e., false), so the command-line arguments
+are left alone.
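The effect of the rewriting step can be seen without any data files. The following is a minimal sketch under stated assumptions: the wrapper name @code{show_protected} is hypothetical, and the bracket expressions are spelled out as ASCII ranges (instead of the @code{[:alpha:]} and @code{[:alnum:]} classes used in @file{noassign.awk}) so that the sketch also runs on @command{awk} versions without POSIX character classes.

```shell
# show_protected ARG... --- hypothetical helper, for illustration only.
# Prints each argument the way the rewriting loop would leave it.
show_protected() {
    awk 'BEGIN {
        for (i = 1; i < ARGC; i++) {
            arg = ARGV[i]
            # looks like var=value, so make it look like a file name
            if (arg ~ /^[A-Za-z_][A-Za-z0-9_]*=/)
                arg = "./" arg
            print arg
        }
    }' "$@"
}
```

For example, @samp{show_protected a=1 data.txt 9x=2} prints @samp{./a=1}, @samp{data.txt}, and @samp{9x=2}: the last argument is left alone because a variable name cannot start with a digit.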
address@hidden ENDOFRANGE dataf
address@hidden ENDOFRANGE flibdataf
address@hidden ENDOFRANGE libfdataf
address@hidden Assert Function
address@hidden Assertions
address@hidden Getopt Function
address@hidden Processing Command-Line Options
address@hidden STARTOFRANGE asse
address@hidden assertions
address@hidden STARTOFRANGE assef
address@hidden @code{assert()} function (C library)
address@hidden STARTOFRANGE libfass
address@hidden libraries of @command{awk} functions, assertions
address@hidden STARTOFRANGE flibass
address@hidden functions, library, assertions
address@hidden @command{awk} programs, lengthy, assertions
-When writing large programs, it is often useful to know
-that a condition or set of conditions is true. Before proceeding with a
-particular computation, you make a statement about what you believe to be
-the case. Such a statement is known as an
address@hidden The C language provides an @code{<assert.h>} header file
-and corresponding @code{assert()} macro that the programmer can use to make
-assertions. If an assertion fails, the @code{assert()} macro arranges to
-print a diagnostic message describing the condition that should have
-been true but was not, and then it kills the program. In C, using
address@hidden()} looks this:
address@hidden STARTOFRANGE libfclo
address@hidden libraries of @command{awk} functions, command-line options
address@hidden STARTOFRANGE flibclo
address@hidden functions, library, command-line options
address@hidden STARTOFRANGE clop
address@hidden command-line options, processing
address@hidden STARTOFRANGE oclp
address@hidden options, command-line, processing
address@hidden STARTOFRANGE clibf
address@hidden functions, library, C library
address@hidden arguments, processing
+Most utilities on POSIX-compatible systems take options on
+the command line that can be used to change the way a program behaves.
+@command{awk} is an example of such a program
+(@pxref{Options}).
+Often, options take @dfn{arguments}; i.e., data that the program needs to
+correctly obey the command-line option. For example, @command{awk}'s
+@option{-F} option requires a string to use as the field separator.
+The first occurrence on the command line of either @option{--} or a
+string that does not begin with @samp{-} ends the options.
+
address@hidden @code{getopt()} function (C library)
+Modern Unix systems provide a C function named @code{getopt()} for processing
+command-line arguments. The programmer provides a string describing the
+one-letter options. If an option requires an argument, it is followed in the
+string with a colon. @code{getopt()} is also passed the
+count and values of the command-line arguments and is called in a loop.
+@code{getopt()} processes the command-line arguments for option letters.
+Each time around the loop, it returns a single character representing the
+next option letter that it finds, or @samp{?} if it finds an invalid option.
+When it returns @minus{}1, there are no options left on the command line.
+
+When using @code{getopt()}, options that do not take arguments can be
+grouped together. Furthermore, options that take arguments require that the
+argument be present. The argument can immediately follow the option letter,
+or it can be a separate command-line argument.
+
+Given a hypothetical program that takes
+three command-line options, @option{-a}, @option{-b}, and @option{-c}, where
+@option{-b} requires an argument, all of the following are valid ways of
+invoking the program:
@example
-#include <assert.h>
+prog -a -b foo -c data1 data2 data3
+prog -ac -bfoo -- data1 data2 data3
+prog -acbfoo data1 data2 data3
address@hidden example
-int myfunc(int a, double b)
+Notice that when the argument is grouped with its option, the rest of
+the argument is considered to be the option's argument.
+In this example, @option{-acbfoo} indicates that all of the
+@option{-a}, @option{-b}, and @option{-c} options were supplied,
+and that @samp{foo} is the argument to the @option{-b} option.
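The grouping rule can be sketched in a few lines of @command{awk}. This is only an illustration of how one grouped argument is taken apart, not the @code{getopt()} function itself: the wrapper name @code{split_group} and its output format (@samp{letter=argument}) are assumptions made for this example.

```shell
# split_group OPTSTRING ARG --- hypothetical helper, for illustration only.
# Breaks one grouped argument such as -acbfoo into its option letters.
split_group() {
    awk -v opts="$1" -v arg="$2" 'BEGIN {
        # position 1 is the leading "-", so start at position 2
        for (i = 2; i <= length(arg); i++) {
            c = substr(arg, i, 1)
            pos = index(opts, c)
            if (pos == 0) {               # letter is not in the option string
                print "?"
                continue
            }
            if (substr(opts, pos + 1, 1) == ":") {
                # the rest of the argument belongs to this option; a full
                # getopt would use the next argument when the rest is empty
                print c "=" substr(arg, i + 1)
                break
            }
            print c
        }
    }'
}
```

With the option string @samp{ab:c}, @samp{split_group ab:c -acbfoo} reports @samp{a}, then @samp{c}, then @samp{b=foo}.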
+
+@code{getopt()} provides four external variables that the programmer can use:
+
address@hidden @code
address@hidden optind
+The index in the argument value array (@code{argv}) where the first
+nonoption command-line argument can be found.
+
address@hidden optarg
+The string value of the argument to an option.
+
address@hidden opterr
+Usually @code{getopt()} prints an error message when it finds an invalid
+option. Setting @code{opterr} to zero disables this feature. (An
+application might want to print its own error message.)
+
address@hidden optopt
+The letter representing the command-line option.
address@hidden While not usually documented, most versions supply this variable.
address@hidden table
+
+The following C fragment shows how @code{getopt()} might process command-line
+arguments for @command{awk}:
+
address@hidden
+int
+main(int argc, char *argv[])
@{
- assert(a <= 5 && b >= 17.1);
- @dots{}
+ @dots{}
+ /* print our own message */
+ opterr = 0;
+ while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) @{
+ switch (c) @{
+ case 'f': /* file */
+ @dots{}
+ break;
+ case 'F': /* field separator */
+ @dots{}
+ break;
+ case 'v': /* variable assignment */
+ @dots{}
+ break;
+ case 'W': /* extension */
+ @dots{}
+ break;
+ case '?':
+ default:
+ usage();
+ break;
+ @}
+ @}
+ @dots{}
@}
@end example
-If the assertion fails, the program prints a message similar to this:
+As a side point, @command{gawk} actually uses the GNU @code{getopt_long()}
+function to process both normal and GNU-style long options
+(@pxref{Options}).
address@hidden
-prog.c:5: assertion failed: a <= 5 && b >= 17.1
address@hidden example
+The abstraction provided by @code{getopt()} is very useful and is quite
+handy in @command{awk} programs as well. Following is an @command{awk}
+version of @code{getopt()}. This function highlights one of the
+greatest weaknesses in @command{awk}, which is that it is very poor at
+manipulating single characters. Repeated calls to @code{substr()} are
+necessary for accessing individual characters
+(@pxref{String Functions}).@footnote{This
+function was written before @command{gawk} acquired the ability to
+split strings into single characters using @code{""} as the separator.
+We have left it alone, since using @code{substr()} is more portable.}
address@hidden FIXME: could use split(str, a, "") to do it more easily.
address@hidden @code{assert()} user-defined function
-The C language makes it possible to turn the condition into a string for use
-in printing the diagnostic message. This is not possible in @command{awk}, so
-this @code{assert()} function also requires a string version of the condition
-that is being tested.
-Following is the function:
+The discussion that follows walks through the code a bit at a time:
address@hidden @code{getopt()} user-defined function
@example
address@hidden file eg/lib/assert.awk
-# assert --- assert that a condition is true. Otherwise exit.
-
address@hidden file eg/lib/getopt.awk
+# getopt.awk --- Do C library getopt(3) function in awk
@c endfile
@ignore
address@hidden file eg/lib/assert.awk
address@hidden file eg/lib/getopt.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# May, 1993
-
+#
+# Initial version: March, 1991
+# Revised: May, 1993
@c endfile
@end ignore
address@hidden file eg/lib/assert.awk
-function assert(condition, string)
address@hidden
- if (! condition) @{
- printf("%s:%d: assertion failed: %s\n",
- FILENAME, FNR, string) > "/dev/stderr"
- _assert_exit = 1
- exit 1
- @}
address@hidden
address@hidden file eg/lib/getopt.awk
address@hidden
-END @{
- if (_assert_exit)
- exit 1
address@hidden
address@hidden group
+# External variables:
+# Optind -- index in ARGV of first nonoption argument
+# Optarg -- string value of argument to current option
+# Opterr -- if nonzero, print our own diagnostic
+# Optopt -- current option letter
+
+# Returns:
+# -1 at end of options
+# "?" for unrecognized option
+# <c> a character representing the current option
+
+# Private Data:
+# _opti -- index in multi-flag option, e.g., -abc
@c endfile
@end example
-The @code{assert()} function tests the @code{condition} parameter. If it
-is false, it prints a message to standard error, using the @code{string}
-parameter to describe the failed condition. It then sets the variable
address@hidden to one and executes the @code{exit} statement.
-The @code{exit} statement jumps to the @code{END} rule. If the @code{END}
-rules finds @code{_assert_exit} to be true, it exits immediately.
+The function starts out with comments presenting
+a list of the global variables it uses,
+what the return values are, what they mean, and any global variables that
+are ``private'' to this library function. Such documentation is essential
+for any program, and particularly for library functions.
-The purpose of the test in the @code{END} rule is to
-keep any other @code{END} rules from running. When an assertion fails, the
-program should exit immediately.
-If no assertions fail, then @code{_assert_exit} is still
-false when the @code{END} rule is run normally, and the rest of the
-program's @code{END} rules execute.
-For all of this to work correctly, @file{assert.awk} must be the
-first source file read by @command{awk}.
-The function can be used in a program in the following way:
+The @code{getopt()} function first checks that it was indeed called with
+a string of options (the @code{options} parameter). If @code{options}
+has a zero length, @code{getopt()} immediately returns @minus{}1:
address@hidden @code{getopt()} user-defined function
@example
-function myfunc(a, b)
address@hidden file eg/lib/getopt.awk
+function getopt(argc, argv, options, thisopt, i)
@{
- assert(a <= 5 && b >= 17.1, "a <= 5 && b >= 17.1")
- @dots{}
address@hidden
+ if (length(options) == 0) # no options given
+ return -1
+
address@hidden
+ if (argv[Optind] == "--") @{ # all done
+ Optind++
+ _opti = 0
+ return -1
address@hidden group
+ @} else if (argv[Optind] !~ /^-[^:[:space:]]/) @{
+ _opti = 0
+ return -1
+ @}
address@hidden endfile
@end example
address@hidden
-If the assertion fails, you see a message similar to the following:
+The next thing to check for is the end of the options. A @option{--}
+ends the command-line options, as does any command-line argument that
+does not begin with a @samp{-}. @code{Optind} is used to step through
+the array of command-line arguments; it retains its value across calls
+to @code{getopt()}, because it is a global variable.
+
+The regular expression that is used, @code{@w{/^-[^:[:space:]]/}},
+checks for a @samp{-} followed by anything
+that is not whitespace and not a colon.
+If the current command-line argument does not match this pattern,
+it is not an option, and it ends option processing. Continuing on:
@example
-mydata:1357: assertion failed: a <= 5 && b >= 17.1
address@hidden file eg/lib/getopt.awk
+ if (_opti == 0)
+ _opti = 2
+ thisopt = substr(argv[Optind], _opti, 1)
+ Optopt = thisopt
+ i = index(options, thisopt)
+ if (i == 0) @{
+ if (Opterr)
+ printf("%c -- invalid option\n",
+ thisopt) > "/dev/stderr"
+ if (_opti >= length(argv[Optind])) @{
+ Optind++
+ _opti = 0
+ @} else
+ _opti++
+ return "?"
+ @}
address@hidden endfile
@end example
address@hidden @code{END} pattern, @code{assert()} user-defined function and
-There is a small problem with this version of @code{assert()}.
-An @code{END} rule is automatically added
-to the program calling @code{assert()}. Normally, if a program consists
-of just a @code{BEGIN} rule, the input files and/or standard input are
-not read. However, now that the program has an @code{END} rule, @command{awk}
-attempts to read the input @value{DF}s or standard input
-(@pxref{Using BEGIN/END}),
-most likely causing the program to hang as it waits for input.
-
address@hidden @code{BEGIN} pattern, @code{assert()} user-defined function and
-There is a simple workaround to this:
-make sure that such a @code{BEGIN} rule always ends
-with an @code{exit} statement.
address@hidden ENDOFRANGE asse
address@hidden ENDOFRANGE assef
address@hidden ENDOFRANGE flibass
address@hidden ENDOFRANGE libfass
-
address@hidden Round Function
address@hidden Rounding Numbers
+The @code{_opti} variable tracks the position in the current command-line
+argument (@code{argv[Optind]}). If multiple options are
+grouped together with one @samp{-} (e.g., @option{-abx}), it is necessary
+to return them to the user one at a time.
address@hidden rounding numbers
address@hidden numbers, rounding
address@hidden libraries of @command{awk} functions, rounding numbers
address@hidden functions, library, rounding numbers
address@hidden @code{print} statement, @code{sprintf()} function and
address@hidden @code{printf} statement, @code{sprintf()} function and
address@hidden @code{sprintf()} function, @code{print}/@code{printf} statements
and
-The way @code{printf} and @code{sprintf()}
-(@pxref{Printf})
-perform rounding often depends upon the system's C @code{sprintf()}
-subroutine. On many machines, @code{sprintf()} rounding is ``unbiased,''
-which means it doesn't always round a trailing @samp{.5} up, contrary
-to naive expectations. In unbiased rounding, @samp{.5} rounds to even,
-rather than always up, so 1.5 rounds to 2 but 4.5 rounds to 4. This means
-that if you are using a format that does rounding (e.g., @code{"%.0f"}),
-you should check what your system does. The following function does
-traditional rounding; it might be useful if your @command{awk}'s @code{printf}
-does unbiased rounding:
+If @code{_opti} is equal to zero, it is set to two, which is the index in
+the string of the next character to look at (we skip the @samp{-}, which
+is at position one). The variable @code{thisopt} holds the character,
+obtained with @code{substr()}. It is saved in @code{Optopt} for the main
+program to use.
address@hidden @code{round()} user-defined function
address@hidden
address@hidden file eg/lib/round.awk
-# round.awk --- do normal rounding
address@hidden endfile
address@hidden
address@hidden file eg/lib/round.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# August, 1996
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/round.awk
+If @code{thisopt} is not in the @code{options} string, then it is an
+invalid option. If @code{Opterr} is nonzero, @code{getopt()} prints an error
+message on the standard error that is similar to the message from the C
+version of @code{getopt()}.
-function round(x, ival, aval, fraction)
address@hidden
- ival = int(x) # integer part, int() truncates
+Because the option is invalid, it is necessary to skip it and move on to the
+next option character. If @code{_opti} is greater than or equal to the
+length of the current command-line argument, it is necessary to move on
+to the next argument, so @code{Optind} is incremented and @code{_opti} is reset
+to zero. Otherwise, @code{Optind} is left alone and @code{_opti} is merely
+incremented.
- # see if fractional part
- if (ival == x) # no fraction
- return ival # ensure no decimals
+In any case, because the option is invalid, @code{getopt()} returns @code{"?"}.
+The main program can examine @code{Optopt} if it needs to know what the
+invalid option letter actually is. Continuing on:
- if (x < 0) @{
- aval = -x # absolute value
- ival = int(aval)
- fraction = aval - ival
- if (fraction >= .5)
- return int(x) - 1 # -2.5 --> -3
- else
- return int(x) # -2.3 --> -2
- @} else @{
- fraction = x - ival
- if (fraction >= .5)
- return ival + 1
- else
- return ival
- @}
address@hidden
address@hidden
address@hidden file eg/lib/getopt.awk
+ if (substr(options, i + 1, 1) == ":") @{
+ # get option argument
+ if (length(substr(argv[Optind], _opti + 1)) > 0)
+ Optarg = substr(argv[Optind], _opti + 1)
+ else
+ Optarg = argv[++Optind]
+ _opti = 0
+ @} else
+ Optarg = ""
@c endfile
address@hidden don't include test harness in the file that gets installed
-
-# test harness
address@hidden print $0, round($0) @}
@end example
address@hidden Cliff Random Function
address@hidden The Cliff Random Number Generator
address@hidden random numbers, Cliff
address@hidden Cliff random numbers
address@hidden numbers, Cliff random
address@hidden functions, library, Cliff random numbers
-
-The
address@hidden://mathworld.wolfram.com/CliffRandomNumberGenerator.html, Cliff random number generator}
-is a very simple random number generator that ``passes the noise sphere test
-for randomness by showing no structure.''
-It is easily programmed, in less than 10 lines of @command{awk} code:
+If the option requires an argument, the option letter is followed by a colon
+in the @code{options} string. If there are remaining characters in the
+current command-line argument (@code{argv[Optind]}), then the rest of that
+string is assigned to @code{Optarg}. Otherwise, the next command-line
+argument is used (@samp{-xFOO} versus @samp{@w{-x FOO}}). In either case,
+@code{_opti} is reset to zero, because there are no more characters left to
+examine in the current command-line argument. Continuing:
address@hidden @code{cliff_rand()} user-defined function
@example
address@hidden file eg/lib/cliff_rand.awk
-# cliff_rand.awk --- generate Cliff random numbers
address@hidden endfile
address@hidden
address@hidden file eg/lib/cliff_rand.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# December 2000
address@hidden file eg/lib/getopt.awk
+ if (_opti == 0 || _opti >= length(argv[Optind])) @{
+ Optind++
+ _opti = 0
+ @} else
+ _opti++
+ return thisopt
address@hidden
@c endfile
address@hidden ignore
address@hidden file eg/lib/cliff_rand.awk
address@hidden example
-BEGIN @{ _cliff_seed = 0.1 @}
+Finally, if @code{_opti} is either zero or greater than or equal to the
+length of the current command-line argument, it means this element in
+@code{argv} is through being processed, so @code{Optind} is incremented to
+point to the next element in @code{argv}. If neither condition is true, then only
+@code{_opti} is incremented, so that the next option letter can be processed
+on the next call to @code{getopt()}.
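
The option-argument step described above (take the rest of the current word for the -xFOO form, otherwise the next word for the -x FOO form) can be tried in isolation. The following is only a sketch under stated assumptions: `optarg_of` is a hypothetical helper name, the option letter is assumed to sit at position 2, and none of this is the library's getopt() code.

```shell
# Sketch of the option-argument step only; this is not getopt() itself.
# optarg_of CUR NEXT prints the argument that belongs to the option in CUR.
optarg_of() {
    awk -v cur="$1" -v nxt="$2" 'BEGIN {
        _opti = 2                       # assumed: option letter at position 2, after "-"
        rest = substr(cur, _opti + 1)   # whatever follows the letter in the same word
        if (length(rest) > 0)
            print rest                  # -xFOO form: argument is glued on
        else
            print nxt                   # -x FOO form: argument is the next word
    }'
}
```

Calling `optarg_of -xFOO other` and `optarg_of -x FOO` should both print the same argument, which is the behavior the text describes for `Optarg`.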
-function cliff_rand()
address@hidden
- _cliff_seed = (100 * log(_cliff_seed)) % 1
- if (_cliff_seed < 0)
- _cliff_seed = - _cliff_seed
- return _cliff_seed
+The @code{BEGIN} rule initializes both @code{Opterr} and @code{Optind} to one.
+@code{Opterr} is set to one, since the default behavior is for @code{getopt()}
+to print a diagnostic message upon seeing an invalid option. @code{Optind}
+is set to one, since there's no reason to look at the program name, which is
+in @code{ARGV[0]}:
+
address@hidden
address@hidden file eg/lib/getopt.awk
+BEGIN @{
+ Opterr = 1 # default is to diagnose
+ Optind = 1 # skip ARGV[0]
+
+ # test program
+ if (_getopt_test) @{
+ while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
+ printf("c = <%c>, optarg = <%s>\n",
+ _go_c, Optarg)
+ printf("non-option arguments:\n")
+ for (; Optind < ARGC; Optind++)
+ printf("\tARGV[%d] = <%s>\n",
+ Optind, ARGV[Optind])
+ @}
@}
@c endfile
@end example
-This algorithm requires an initial ``seed'' of 0.1. Each new value
-uses the current seed as input for the calculation.
-If the built-in @code{rand()} function
-(@pxref{Numeric Functions})
-isn't random enough, you might try using this function instead.
+The rest of the @code{BEGIN} rule is a simple test program. Here is the
+result of two sample runs of the test program:
address@hidden Ordinal Functions
address@hidden Translating Between Characters and Numbers
address@hidden
+$ @kbd{awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x}
address@hidden c = <a>, optarg = <>
address@hidden c = <c>, optarg = <>
address@hidden c = <b>, optarg = <ARG>
address@hidden non-option arguments:
address@hidden ARGV[3] = <bax>
address@hidden ARGV[4] = <-x>
address@hidden libraries of @command{awk} functions, character values as numbers
address@hidden functions, library, character values as numbers
address@hidden characters, values of as numbers
address@hidden numbers, as values of characters
-One commercial implementation of @command{awk} supplies a built-in function,
-@code{ord()}, which takes a character and returns the numeric value for that
-character in the machine's character set. If the string passed to
-@code{ord()} has more than one character, only the first one is used.
+$ @kbd{awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc}
address@hidden c = <a>, optarg = <>
address@hidden x -- invalid option
address@hidden c = <?>, optarg = <>
address@hidden non-option arguments:
address@hidden ARGV[4] = <xyz>
address@hidden ARGV[5] = <abc>
address@hidden example
-The inverse of this function is @code{chr()} (from the function of the same
-name in Pascal), which takes a number and returns the corresponding character.
-Both functions are written very nicely in @command{awk}; there is no real
-reason to build them into the @command{awk} interpreter:
+In both runs,
+the first @option{--} terminates the arguments to @command{awk}, so that it does
+not try to interpret the @option{-a}, etc., as its own options.
address@hidden @code{ord()} user-defined function
address@hidden @code{chr()} user-defined function
address@hidden
address@hidden file eg/lib/ord.awk
-# ord.awk --- do ord and chr
address@hidden NOTE
+After @code{getopt()} is through, it is the responsibility of the user-level
+code to
+clear out all the elements of @code{ARGV} from 1 up to, but not including,
+@code{Optind}, so that @command{awk} does not try to process the
+command-line options as @value{FN}s.
address@hidden quotation
-# Global identifiers:
-# _ord_: numerical values indexed by characters
-# _ord_init: function to initialize _ord_
+Several of the sample programs presented in
address@hidden Programs},
+use @code{getopt()} to process their arguments.
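
The cleanup that the note above asks of user-level code can be sketched as follows. This is an illustration under assumptions, not code from the library: `remaining_args` is a hypothetical helper, and `Optind` is supplied by hand where a real program would use the value left behind by getopt().

```shell
# Sketch: blank out the already-parsed options so awk will not open them as files.
# Optind is passed in by hand here; in a real program getopt() would have set it.
remaining_args() {
    optind=$1; shift
    awk -v Optind="$optind" -- 'BEGIN {
        for (i = 1; i < Optind; i++)
            ARGV[i] = ""                # awk skips empty ARGV elements
        for (i = 1; i < ARGC; i++)
            if (ARGV[i] != "")
                print ARGV[i]           # what is left for normal file processing
    }' "$@"
}
```

With the arguments from the first sample run, `remaining_args 3 -a -cbARG bax -x` leaves only the two non-option arguments. The loop stops before `Optind` because that index already names the first non-option argument.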
address@hidden ENDOFRANGE libfclo
address@hidden ENDOFRANGE flibclo
address@hidden ENDOFRANGE clop
address@hidden ENDOFRANGE oclp
+
address@hidden Passwd Functions
address@hidden Reading the User Database
+
address@hidden STARTOFRANGE libfudata
address@hidden libraries of @command{awk} functions, user database, reading
address@hidden STARTOFRANGE flibudata
address@hidden functions, library, user database, reading
address@hidden STARTOFRANGE udatar
address@hidden user address@hidden reading
address@hidden STARTOFRANGE dataur
address@hidden database, address@hidden reading
address@hidden @code{PROCINFO} array
+The @code{PROCINFO} array
+(@pxref{Built-in Variables})
+provides access to the current user's real and effective user and group ID
+numbers, and if available, the user's supplementary group set.
+However, because these are numbers, they do not provide very useful
+information to the average user. There needs to be some way to find the
+user information associated with the user and group ID numbers. This
+@value{SECTION} presents a suite of functions for retrieving information from the
+user database. @xref{Group Functions},
+for a similar suite that retrieves information from the group database.
+
address@hidden @code{getpwent()} function (C library)
address@hidden @code{getpwent()} user-defined function
address@hidden users, information about, retrieving
address@hidden login information
address@hidden account information
address@hidden password file
address@hidden files, password
+The POSIX standard does not define the file where user information is
+kept. Instead, it provides the @code{<pwd.h>} header file
+and several C language subroutines for obtaining user information.
+The primary function is @code{getpwent()}, for ``get password entry.''
+The ``password'' comes from the original user database file,
+@file{/etc/passwd}, which stores user information, along with the
+encrypted passwords (hence the name).
+
address@hidden @command{pwcat} program
+While an @command{awk} program could simply read @file{/etc/passwd}
+directly, this file may not contain complete information about the
+system's set of users.@footnote{It is often the case that password
+information is stored in a network database.} To be sure you are able to
+produce a readable and complete version of the user database, it is necessary
+to write a small C program that calls @code{getpwent()}. @code{getpwent()}
+is defined as returning a pointer to a @code{struct passwd}. Each time it
+is called, it returns the next entry in the database. When there are
+no more entries, it returns @code{NULL}, the null pointer. When this
+happens, the C program should call @code{endpwent()} to close the database.
+Following is @command{pwcat}, a C program that ``cats'' the password database:
+
address@hidden Use old style function header for portability to old systems (SunOS, HP/UX).
+
address@hidden
address@hidden file eg/lib/pwcat.c
+/*
+ * pwcat.c
+ *
+ * Generate a printable version of the password database
+ */
@c endfile
@ignore
address@hidden file eg/lib/ord.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# 16 January, 1992
-# 20 July, 1992, revised
address@hidden file eg/lib/pwcat.c
+/*
+ * Arnold Robbins, arnold@@skeeve.com, May 1993
+ * Public Domain
+ * December 2010, move to ANSI C definition for main().
+ */
+
+#if HAVE_CONFIG_H
+#include <config.h>
+#endif
+
@c endfile
@end ignore
address@hidden file eg/lib/ord.awk
address@hidden file eg/lib/pwcat.c
+#include <stdio.h>
+#include <pwd.h>
-BEGIN @{ _ord_init() @}
address@hidden endfile
address@hidden
address@hidden file eg/lib/pwcat.c
+#if defined (STDC_HEADERS)
+#include <stdlib.h>
+#endif
-function _ord_init( low, high, i, t)
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/pwcat.c
+int
+main(int argc, char **argv)
@{
- low = sprintf("%c", 7) # BEL is ascii 7
- if (low == "\a") @{ # regular ascii
- low = 0
- high = 127
- @} else if (sprintf("%c", 128 + 7) == "\a") @{
- # ascii, mark parity
- low = 128
- high = 255
- @} else @{ # ebcdic(!)
- low = 0
- high = 255
- @}
+ struct passwd *p;
+
+ while ((p = getpwent()) != NULL)
address@hidden endfile
address@hidden
address@hidden file eg/lib/pwcat.c
+#ifdef ZOS_USS
+ printf("%s:%ld:%ld:%s:%s\n",
+ p->pw_name, (long) p->pw_uid,
+ (long) p->pw_gid, p->pw_dir, p->pw_shell);
+#else
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/pwcat.c
+ printf("%s:%s:%ld:%ld:%s:%s:%s\n",
+ p->pw_name, p->pw_passwd, (long) p->pw_uid,
+ (long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);
address@hidden endfile
address@hidden
address@hidden file eg/lib/pwcat.c
+#endif
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/pwcat.c
- for (i = low; i <= high; i++) @{
- t = sprintf("%c", i)
- _ord_[t] = i
- @}
+ endpwent();
+ return 0;
@}
@c endfile
@end example
address@hidden character sets (machine character encodings)
address@hidden ASCII
address@hidden EBCDIC
address@hidden mark parity
-Some explanation of the numbers used by @code{chr} is worthwhile.
-The most prominent character set in use today is address@hidden
-is changing; many systems use Unicode, a very large character set
-that includes ASCII as a subset. On systems with full Unicode support,
-a character can occupy up to 32 bits, making simple tests such as
-used here prohibitively expensive.}
-Although an
-8-bit byte can hold 256 distinct values (from 0 to 255), ASCII only
-defines characters that use the values from 0 to address@hidden
-has been extended in many countries to use the values from 128 to 255
-for country-specific characters. If your system uses these extensions,
-you can simplify @code{_ord_init} to loop from 0 to 255.}
-In the now distant past,
-at least one minicomputer manufacturer
address@hidden Pr1me, blech
-used ASCII, but with mark parity, meaning that the leftmost bit in the byte
-is always 1. This means that on those systems, characters
-have numeric values from 128 to 255.
-Finally, large mainframe systems use the EBCDIC character set, which
-uses all 256 values.
-While there are other character sets in use on some older systems,
-they are not really worth worrying about:
+If you don't understand C, don't worry about it.
+The output from @command{pwcat} is the user database, in the traditional
+@file{/etc/passwd} format of colon-separated fields. The fields are:
address@hidden
address@hidden file eg/lib/ord.awk
-function ord(str, c)
address@hidden
- # only first character is of interest
- c = substr(str, 1, 1)
- return _ord_[c]
address@hidden
address@hidden @asis
address@hidden Login name
+The user's login name.
-function chr(c)
address@hidden
- # force c to be numeric by adding 0
- return sprintf("%c", c + 0)
address@hidden
address@hidden endfile
address@hidden Encrypted password
+The user's encrypted password. This may not be available on some systems.
-#### test code ####
-# BEGIN \
-# @{
-# for (;;) @{
-# printf("enter a character: ")
-# if (getline var <= 0)
-# break
-# printf("ord(%s) = %d\n", var, ord(var))
-# @}
-# @}
address@hidden endfile
address@hidden example
address@hidden User-ID
+The user's numeric user ID number.
+(On some systems it's a C @code{long}, and not an @code{int}. Thus
+we cast it to @code{long} for all cases.)
-An obvious improvement to these functions is to move the code for the
-@code{@w{_ord_init}} function into the body of the @code{BEGIN} rule. It was
-written this way initially for ease of development.
-There is a ``test program'' in a @code{BEGIN} rule, to test the
-function. It is commented out for production use.
address@hidden Group-ID
+The user's numeric group ID number.
+(Similar comments about @code{long} vs.@: @code{int} apply here.)
address@hidden Join Function
address@hidden Merging an Array into a String
address@hidden Full name
+The user's full name, and perhaps other information associated with the
+user.
address@hidden libraries of @command{awk} functions, merging arrays into strings
address@hidden functions, library, merging arrays into strings
address@hidden strings, merging arrays into
address@hidden arrays, merging into strings
-When doing string processing, it is often useful to be able to join
-all the strings in an array into one long string. The following function,
address@hidden()}, accomplishes this task. It is used later in several of
-the application programs
-(@pxref{Sample Programs}).
address@hidden Home directory
+The user's login (or ``home'') directory (familiar to shell programmers as
+@code{$HOME}).
-Good function design is important; this function needs to be general but it
-should also have a reasonable default behavior. It is called with an array
-as well as the beginning and ending indices of the elements in the array to be
-merged. This assumes that the array indices are numeric---a reasonable
-assumption since the array was likely created with @code{split()}
-(@pxref{String Functions}):
address@hidden Login shell
+The program that is run when the user logs in. This is usually a
+shell, such as Bash.
address@hidden table
address@hidden @code{join()} user-defined function
+A few lines representative of @command{pwcat}'s output are as follows:
+
address@hidden Jacobs, Andrew
address@hidden Robbins, Arnold
address@hidden Robbins, Miriam
@example
address@hidden file eg/lib/join.awk
-# join.awk --- join an array into a string
+$ @kbd{pwcat}
address@hidden root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
address@hidden nobody:*:65534:65534::/:
address@hidden daemon:*:1:1::/:
address@hidden sys:*:2:2::/:/bin/csh
address@hidden bin:*:3:3::/bin:
address@hidden arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
address@hidden miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
address@hidden andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh
address@hidden
address@hidden example
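
Because the output is plain colon-separated text, individual fields can be picked out with an ordinary field separator. The sketch below assumes the seven-field layout listed above; `passwd_fields` is a hypothetical helper name used only for illustration.

```shell
# Sketch: print login name, numeric UID, and home directory from
# passwd-format lines read on standard input.
passwd_fields() {
    awk -F: '{ printf("%s %d %s\n", $1, $3, $6) }'
}
```

Feeding it one of the sample lines, for instance the `arnold` entry, prints the login name, the user ID, and the home directory separated by spaces.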
+
+With that introduction, following is a group of functions for getting user
+information. There are several functions here, corresponding to the C
+functions of the same names:
+
address@hidden @code{_pw_init()} user-defined function
address@hidden
address@hidden file eg/lib/passwdawk.in
+# passwd.awk --- access password file information
@c endfile
@ignore
address@hidden file eg/lib/join.awk
address@hidden file eg/lib/passwdawk.in
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
+# Revised October 2000
+# Revised December 2010
@c endfile
@end ignore
address@hidden file eg/lib/join.awk
address@hidden file eg/lib/passwdawk.in
-function join(array, start, end, sep, result, i)
+BEGIN @{
+ # tailor this to suit your system
+ _pw_awklib = "/usr/local/libexec/awk/"
address@hidden
+
+function _pw_init( oldfs, oldrs, olddol0, pwcat, using_fw, using_fpat)
@{
- if (sep == "")
- sep = " "
- else if (sep == SUBSEP) # magic value
- sep = ""
- result = array[start]
- for (i = start + 1; i <= end; i++)
- result = result sep array[i]
- return result
+ if (_pw_inited)
+ return
+
+ oldfs = FS
+ oldrs = RS
+ olddol0 = $0
+ using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
+ using_fpat = (PROCINFO["FS"] == "FPAT")
+ FS = ":"
+ RS = "\n"
+
+ pwcat = _pw_awklib "pwcat"
+ while ((pwcat | getline) > 0) @{
+ _pw_byname[$1] = $0
+ _pw_byuid[$3] = $0
+ _pw_bycount[++_pw_total] = $0
+ @}
+ close(pwcat)
+ _pw_count = 0
+ _pw_inited = 1
+ FS = oldfs
+ if (using_fw)
+ FIELDWIDTHS = FIELDWIDTHS
+ else if (using_fpat)
+ FPAT = FPAT
+ RS = oldrs
+ $0 = olddol0
@}
@c endfile
@end example
-An optional additional argument is the separator to use when joining the
-strings back together. If the caller supplies a nonempty value,
address@hidden()} uses it; if it is not supplied, it has a null
-value. In this case, @code{join()} uses a single space as a default
-separator for the strings. If the value is equal to @code{SUBSEP},
-then @code{join()} joins the strings with no separator between them.
address@hidden serves as a ``magic'' value to indicate that there should
-be no separation between the component address@hidden would
-be nice if @command{awk} had an assignment operator for concatenation.
-The lack of an explicit operator for concatenation makes string operations
-more difficult than they really need to be.}
address@hidden @code{BEGIN} pattern, @code{pwcat} program
+The @code{BEGIN} rule sets a private variable to the directory where
+@command{pwcat} is stored. Because it is used to help out an @command{awk} library
+routine, we have chosen to put it in @file{/usr/local/libexec/awk};
+however, you might want it to be in a different directory on your system.
address@hidden Getlocaltime Function
address@hidden Managing the Time of Day
+The function @code{_pw_init()} keeps three copies of the user information
+in three associative arrays. The arrays are indexed by username
+(@code{_pw_byname}), by user ID number (@code{_pw_byuid}), and by order of
+occurrence (@code{_pw_bycount}).
+The variable @code{_pw_inited} is used for efficiency, since @code{_pw_init()}
+needs to be called only once.
address@hidden libraries of @command{awk} functions, managing, time
address@hidden functions, library, managing time
address@hidden timestamps, formatted
address@hidden time, managing
-The @code{systime()} and @code{strftime()} functions described in
address@hidden Functions},
-provide the minimum functionality necessary for dealing with the time of day
-in human readable form. While @code{strftime()} is extensive, the control
-formats are not necessarily easy to remember or intuitively obvious when
-reading a program.
address@hidden @code{getline} command, @code{_pw_init()} function
+Because this function uses @code{getline} to read information from
+@command{pwcat}, it first saves the values of @code{FS}, @code{RS}, and @code{$0}.
+It notes in the variable @code{using_fw} whether field splitting
+with @code{FIELDWIDTHS} is in effect or not.
+Doing so is necessary, since these functions could be called
+from anywhere within a user's program, and the user may have his
+or her
+own way of splitting records and fields.
+
address@hidden @code{PROCINFO} array
+The @code{using_fw} variable checks @code{PROCINFO["FS"]}, which
+is @code{"FIELDWIDTHS"} if field splitting is being done with
+@code{FIELDWIDTHS}. This makes it possible to restore the correct
+field-splitting mechanism later. The test can only be true for
+@command{gawk}. It is false if using @code{FS} or @code{FPAT},
+or on some other @command{awk} implementation.
+
+The check for @code{FPAT}, which uses @code{using_fpat}
+and @code{PROCINFO["FS"]}, is similar.
+
+The main part of the function uses a loop to read database lines, split
+the line into fields, and then store the line into each array as necessary.
+When the loop is done, @code{@w{_pw_init()}} cleans up by closing the pipeline,
+setting @code{@w{_pw_inited}} to one, and restoring @code{FS}
+(and @code{FIELDWIDTHS} or @code{FPAT}
+if necessary), @code{RS}, and @code{$0}.
+The use of @code{@w{_pw_count}} is explained shortly.
-The following function, @code{getlocaltime()}, populates a user-supplied array
-with preformatted time information. It returns a string with the current
-time formatted in the same way as the @command{date} utility:
address@hidden @code{getpwnam()} function (C library)
+The @code{getpwnam()} function takes a username as a string argument. If that
+user is in the database, it returns the appropriate line. Otherwise, it
+relies on the array reference to a nonexistent
+element to create the element with the null string as its value:
address@hidden @code{getlocaltime()} user-defined function
address@hidden @code{getpwnam()} user-defined function
@example
address@hidden file eg/lib/gettime.awk
-# getlocaltime.awk --- get the time of day in a usable format
address@hidden endfile
address@hidden
address@hidden file eg/lib/gettime.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain, May 1993
-#
address@hidden
address@hidden file eg/lib/passwdawk.in
+function getpwnam(name)
address@hidden
+ _pw_init()
+ return _pw_byname[name]
address@hidden
@c endfile
address@hidden ignore
address@hidden file eg/lib/gettime.awk
address@hidden group
address@hidden example
-# Returns a string in the format of output of date(1)
-# Populates the array argument time with individual values:
-# time["second"] -- seconds (0 - 59)
-# time["minute"] -- minutes (0 - 59)
-# time["hour"] -- hours (0 - 23)
-# time["althour"] -- hours (0 - 12)
-# time["monthday"] -- day of month (1 - 31)
-# time["month"] -- month of year (1 - 12)
-# time["monthname"] -- name of the month
-# time["shortmonth"] -- short name of the month
-# time["year"] -- year modulo 100 (0 - 99)
-# time["fullyear"] -- full year
-# time["weekday"] -- day of week (Sunday = 0)
-# time["altweekday"] -- day of week (Monday = 0)
-# time["dayname"] -- name of weekday
-# time["shortdayname"] -- short name of weekday
-# time["yearday"] -- day of year (0 - 365)
-# time["timezone"] -- abbreviation of timezone name
-# time["ampm"] -- AM or PM designation
-# time["weeknum"] -- week number, Sunday first day
-# time["altweeknum"] -- week number, Monday first day
address@hidden @code{getpwuid()} function (C library)
+Similarly,
+the @code{getpwuid()} function takes a user ID number argument. If that
+user number is in the database, it returns the appropriate line. Otherwise, it
+returns the null string:
-function getlocaltime(time, ret, now, i)
address@hidden @code{getpwuid()} user-defined function
address@hidden
address@hidden file eg/lib/passwdawk.in
+function getpwuid(uid)
@{
- # get time once, avoids unnecessary system calls
- now = systime()
-
- # return date(1)-style output
- ret = strftime("%a %b %e %H:%M:%S %Z %Y", now)
-
- # clear out target array
- delete time
+ _pw_init()
+ return _pw_byuid[uid]
address@hidden
address@hidden endfile
address@hidden example
- # fill in values, force numeric values to be
- # numeric by adding 0
- time["second"] = strftime("%S", now) + 0
- time["minute"] = strftime("%M", now) + 0
- time["hour"] = strftime("%H", now) + 0
- time["althour"] = strftime("%I", now) + 0
- time["monthday"] = strftime("%d", now) + 0
- time["month"] = strftime("%m", now) + 0
- time["monthname"] = strftime("%B", now)
- time["shortmonth"] = strftime("%b", now)
- time["year"] = strftime("%y", now) + 0
- time["fullyear"] = strftime("%Y", now) + 0
- time["weekday"] = strftime("%w", now) + 0
- time["altweekday"] = strftime("%u", now) + 0
- time["dayname"] = strftime("%A", now)
- time["shortdayname"] = strftime("%a", now)
- time["yearday"] = strftime("%j", now) + 0
- time["timezone"] = strftime("%Z", now)
- time["ampm"] = strftime("%p", now)
- time["weeknum"] = strftime("%U", now) + 0
- time["altweeknum"] = strftime("%W", now) + 0
address@hidden @code{getpwent()} function (C library)
+The @code{getpwent()} function simply steps through the database, one entry at
+a time. It uses @code{_pw_count} to track its current position in the
+@code{_pw_bycount} array:
- return ret
address@hidden @code{getpwent()} user-defined function
address@hidden
address@hidden file eg/lib/passwdawk.in
+function getpwent()
address@hidden
+ _pw_init()
+ if (_pw_count < _pw_total)
+ return _pw_bycount[++_pw_count]
+ return ""
@}
@c endfile
@end example
-The string indices are easier to use and read than the various formats
-required by @code{strftime()}. The @code{alarm} program presented in
address@hidden Program},
-uses this function.
-A more general design for the @code{getlocaltime()} function would have
-allowed the user to supply an optional timestamp value to use instead
-of the current time.
-
address@hidden Data File Management
address@hidden @value{DDF} Management
address@hidden @code{endpwent()} function (C library)
+The @code{@w{endpwent()}} function resets @code{@w{_pw_count}} to zero, so that
+subsequent calls to @code{getpwent()} start over again:
address@hidden STARTOFRANGE dataf
address@hidden files, managing
address@hidden STARTOFRANGE libfdataf
address@hidden libraries of @command{awk} functions, managing, @value{DF}s
address@hidden STARTOFRANGE flibdataf
address@hidden functions, library, managing @value{DF}s
-This @value{SECTION} presents functions that are useful for managing
-command-line @value{DF}s.
address@hidden @code{endpwent()} user-defined function
address@hidden
address@hidden file eg/lib/passwdawk.in
+function endpwent()
address@hidden
+ _pw_count = 0
address@hidden
address@hidden endfile
address@hidden example
address@hidden
-* Filetrans Function:: A function for handling data file transitions.
-* Rewind Function:: A function for rereading the current file.
-* File Checking:: Checking that data files are readable.
-* Empty Files:: Checking for zero-length files.
-* Ignoring Assigns:: Treating assignments as file names.
address@hidden menu
+A conscious design decision in this suite is that each subroutine calls
+@code{@w{_pw_init()}} to initialize the database arrays.
+The overhead of running
+a separate process to generate the user database, and the I/O to scan it,
+are only incurred if the user's main program actually calls one of these
+functions. If this library file is loaded along with a user's program, but
+none of the routines are ever called, then there is no extra runtime overhead.
+(The alternative is to move the body of @code{@w{_pw_init()}} into a
+@code{BEGIN} rule, which always runs @command{pwcat}. This simplifies the
+code but runs an extra process that may never be needed.)
address@hidden Filetrans Function
address@hidden Noting @value{DDF} Boundaries
+In turn, calling @code{_pw_init()} is not too expensive, because the
+@code{_pw_inited} variable keeps the program from reading the data more than
+once. If you are worried about squeezing every last cycle out of your
+@command{awk} program, the check of @code{_pw_inited} could be moved out of
+@code{_pw_init()} and duplicated in all the other functions. In practice,
+this is not necessary, since most @command{awk} programs are I/O-bound,
+and such a change would clutter up the code.
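
The lazy-initialization pattern discussed above can be sketched in a self-contained form. Everything here is an assumption made for illustration: the `demo_*` and `_demo_*` names are hypothetical, the two records are made-up sample data standing in for the output of the C helper program, and `_demo_loads` exists only to show how often the data is really read.

```shell
# Sketch of the lazy-initialization pattern, using made-up sample records
# in place of real user-database output. All demo names are hypothetical.
demo_lookup() {
    awk -v name="$1" '
    function _demo_init(    n, i, data, lines, fld) {
        if (_demo_inited)               # already loaded: nothing to do
            return
        data = "root:*:0:1:Operator:/:/bin/sh\n"
        data = data "arnold:*:2076:10:Arnold Robbins:/home/arnold:/bin/sh"
        n = split(data, lines, "\n")
        for (i = 1; i <= n; i++) {
            split(lines[i], fld, ":")
            _demo_byname[fld[1]] = lines[i]     # index each record by login name
        }
        _demo_loads++                   # counts how often the data is really read
        _demo_inited = 1
    }
    function demo_getpwnam(user) {
        _demo_init()                    # cheap after the first call
        return _demo_byname[user]
    }
    BEGIN {
        demo_getpwnam(name)                     # first call loads the table
        split(demo_getpwnam(name), rec, ":")    # second call reuses it
        print rec[5] " (loads: " _demo_loads ")"
    }'
}
```

Running `demo_lookup arnold` looks the user up twice but reports a single load, which is the point of the guard variable: every public function may call the initializer freely, and the cost is paid at most once.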
address@hidden files, managing, @value{DF} boundaries
address@hidden files, initialization and cleanup
-The @code{BEGIN} and @code{END} rules are each executed exactly once at
-the beginning and end of your @command{awk} program, respectively
-(@pxref{BEGIN/END}).
-We (the @command{gawk} authors) once had a user who mistakenly thought that the
address@hidden rule is executed at the beginning of each @value{DF} and the
address@hidden rule is executed at the end of each @value{DF}.
+The @command{id} program in @ref{Id Program},
+uses these functions.
address@hidden ENDOFRANGE libfudata
address@hidden ENDOFRANGE flibudata
address@hidden ENDOFRANGE udatar
address@hidden ENDOFRANGE dataur
-When informed
-that this was not the case, the user requested that we add new special
-patterns to @command{gawk}, named @code{BEGIN_FILE} and @code{END_FILE}, that
-would have the desired behavior. He even supplied us the code to do so.
address@hidden Group Functions
address@hidden Reading the Group Database
-Adding these special patterns to @command{gawk} wasn't necessary;
-the job can be done cleanly in @command{awk} itself, as illustrated
-by the following library program.
-It arranges to call two user-supplied functions, @code{beginfile()} and
address@hidden()}, at the beginning and end of each @value{DF}.
-Besides solving the problem in only nine(!) lines of code, it does so
address@hidden; this works with any implementation of @command{awk}:
address@hidden STARTOFRANGE libfgdata
address@hidden libraries of @command{awk} functions, group database, reading
address@hidden STARTOFRANGE flibgdata
address@hidden functions, library, group database, reading
address@hidden STARTOFRANGE gdatar
address@hidden group database, reading
address@hidden STARTOFRANGE datagr
address@hidden database, group, reading
address@hidden @code{PROCINFO} array
address@hidden @code{getgrent()} function (C library)
address@hidden @code{getgrent()} user-defined function
address@hidden address@hidden information about
address@hidden account information
address@hidden group file
address@hidden files, group
+Much of the discussion presented in
address@hidden Functions},
+applies to the group database as well. Although there has traditionally
+been a well-known file (@file{/etc/group}) in a well-known format, the POSIX
+standard only provides a set of C library routines
+(@code{<grp.h>} and @code{getgrent()})
+for accessing the information.
+Even though this file may exist, it may not have
+complete information. Therefore, as with the user database, it is necessary
+to have a small C program that generates the group database as its output.
+@command{grcat}, a C program that ``cats'' the group database,
+is as follows:
address@hidden @command{grcat} program
@example
-# transfile.awk
-#
-# Give the user a hook for filename transitions
-#
-# The user must supply functions beginfile() and endfile()
-# that each take the name of the file being started or
-# finished, respectively.
address@hidden #
address@hidden # Arnold Robbins, arnold@@skeeve.com, Public Domain
address@hidden # January 1992
-
-FILENAME != _oldfilename \
address@hidden
- if (_oldfilename != "")
- endfile(_oldfilename)
- _oldfilename = FILENAME
- beginfile(FILENAME)
address@hidden
-
-END @{ endfile(FILENAME) @}
address@hidden example
address@hidden file eg/lib/grcat.c
+/*
+ * grcat.c
+ *
+ * Generate a printable version of the group database
+ */
address@hidden endfile
address@hidden
address@hidden file eg/lib/grcat.c
+/*
+ * Arnold Robbins, arnold@@skeeve.com, May 1993
+ * Public Domain
+ * December 2010, move to ANSI C definition for main().
+ */
-This file must be loaded before the user's ``main'' program, so that the
-rule it supplies is executed first.
+/* For OS/2, do nothing. */
+#if HAVE_CONFIG_H
+#include <config.h>
+#endif
-This rule relies on @command{awk}'s @code{FILENAME} variable that
-automatically changes for each new @value{DF}. The current @value{FN} is
-saved in a private variable, @code{_oldfilename}. If @code{FILENAME} does
-not equal @code{_oldfilename}, then a new @value{DF} is being processed and
-it is necessary to call @code{endfile()} for the old file. Because
address@hidden()} should only be called if a file has been processed, the
-program first checks to make sure that @code{_oldfilename} is not the null
-string. The program then assigns the current @value{FN} to
address@hidden and calls @code{beginfile()} for the file.
-Because, like all @command{awk} variables, @code{_oldfilename} is
-initialized to the null string, this rule executes correctly even for the
-first @value{DF}.
+#if defined (STDC_HEADERS)
+#include <stdlib.h>
+#endif
-The program also supplies an @code{END} rule to do the final processing for
-the last file. Because this @code{END} rule comes before any @code{END} rules
-supplied in the ``main'' program, @code{endfile()} is called first. Once
-again the value of multiple @code{BEGIN} and @code{END} rules should be clear.
+#ifndef HAVE_GETGRENT
+int main() { return 0; }
+#else
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/grcat.c
+#include <stdio.h>
+#include <grp.h>
address@hidden @code{beginfile()} user-defined function
address@hidden @code{endfile()} user-defined function
-If the same @value{DF} occurs twice in a row on the command line, then
address@hidden()} and @code{beginfile()} are not executed at the end of the
-first pass and at the beginning of the second pass.
-The following version solves the problem:
+int
+main(int argc, char **argv)
address@hidden
+ struct group *g;
+ int i;
address@hidden
address@hidden file eg/lib/ftrans.awk
-# ftrans.awk --- handle data file transitions
-#
-# user supplies beginfile() and endfile() functions
+ while ((g = getgrent()) != NULL) @{
@c endfile
@ignore
address@hidden file eg/lib/ftrans.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# November 1992
address@hidden file eg/lib/grcat.c
+#ifdef ZOS_USS
+ printf("%s:%ld:", g->gr_name, (long) g->gr_gid);
+#else
@c endfile
@end ignore
address@hidden file eg/lib/ftrans.awk
-
-FNR == 1 @{
- if (_filename_ != "")
- endfile(_filename_)
- _filename_ = FILENAME
- beginfile(FILENAME)
address@hidden file eg/lib/grcat.c
+ printf("%s:%s:%ld:", g->gr_name, g->gr_passwd,
+ (long) g->gr_gid);
address@hidden endfile
address@hidden
address@hidden file eg/lib/grcat.c
+#endif
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/grcat.c
+ for (i = 0; g->gr_mem[i] != NULL; i++) @{
+ printf("%s", g->gr_mem[i]);
address@hidden
+ if (g->gr_mem[i+1] != NULL)
+ putchar(',');
+ @}
address@hidden group
+ putchar('\n');
+ @}
+ endgrent();
+ return 0;
@}
-
-END @{ endfile(_filename_) @}
@c endfile
address@hidden
address@hidden file eg/lib/grcat.c
+#endif /* HAVE_GETGRENT */
address@hidden endfile
address@hidden ignore
@end example
address@hidden Program},
-shows how this library function can be used and
-how it simplifies writing the main program.
+Each line in the group database represents one group. The fields are
+separated with colons and represent the following information:
address@hidden fakenode --- for prepinfo
address@hidden Advanced Notes: So Why Does @command{gawk} have @code{BEGINFILE}
and @code{ENDFILE}?
address@hidden @asis
address@hidden Group Name
+The group's name.
-You are probably wondering, if @code{beginfile()} and @code{endfile()}
-functions can do the job, why does @command{gawk} have
address@hidden and @code{ENDFILE} patterns (@pxref{BEGINFILE/ENDFILE})?
address@hidden Group Password
+The group's encrypted password. In practice, this field is never used;
+it is usually empty or set to @samp{*}.
-Good question. Normally, if @command{awk} cannot open a file, this
-causes an immediate fatal error. In this case, there is no way for a
-user-defined function to deal with the problem, since the mechanism for
-calling it relies on the file being open and at the first record. Thus,
-the main reason for @code{BEGINFILE} is to give you a ``hook'' to catch
-files that cannot be processed. @code{ENDFILE} exists for symmetry,
-and because it provides an easy way to do per-file cleanup processing.
address@hidden Group ID Number
+The group's numeric group ID number;
+this number must be unique within the file.
+(On some systems it's a C @code{long}, and not an @code{int}. Thus
+we cast it to @code{long} for all cases.)
address@hidden Rewind Function
address@hidden Rereading the Current File
address@hidden Group Member List
+A comma-separated list of user names. These users are members of the group.
+Modern Unix systems allow users to be members of several groups
+simultaneously. If your system does, then there are elements
+@code{"group1"} through @code{"group@var{N}"} in @code{PROCINFO}
+for those group ID numbers.
+(Note that @code{PROCINFO} is a @command{gawk} extension;
address@hidden Variables}.)
address@hidden table
address@hidden files, reading
-Another request for a new built-in function was for a @code{rewind()}
-function that would make it possible to reread the current file.
-The requesting user didn't want to have to use @code{getline}
-(@pxref{Getline})
-inside a loop.
+Here is what running @command{grcat} might produce:
-However, as long as you are not in the @code{END} rule, it is
-quite easy to arrange to immediately close the current input file
-and then start over with it from the top.
-For lack of a better name, we'll call it @code{rewind()}:
address@hidden
+$ @kbd{grcat}
address@hidden wheel:*:0:arnold
address@hidden nogroup:*:65534:
address@hidden daemon:*:1:
address@hidden kmem:*:2:
address@hidden staff:*:10:arnold,miriam,andy
address@hidden other:*:20:
address@hidden
address@hidden example
address@hidden @code{rewind()} user-defined function
+Here are the functions for obtaining information from the group database.
+There are several, modeled after the C library functions of the same names:
+
address@hidden @code{getline} command, @code{_gr_init()} user-defined function
address@hidden @code{_gr_init()} user-defined function
@example
address@hidden file eg/lib/rewind.awk
-# rewind.awk --- rewind the current file and start over
address@hidden file eg/lib/groupawk.in
+# group.awk --- functions for dealing with the group file
@c endfile
@ignore
address@hidden file eg/lib/rewind.awk
address@hidden file eg/lib/groupawk.in
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# September 2000
+# May 1993
+# Revised October 2000
+# Revised December 2010
@c endfile
@end ignore
address@hidden file eg/lib/rewind.awk
address@hidden line break on _gr_init for smallbook
address@hidden file eg/lib/groupawk.in
-function rewind( i)
+BEGIN \
@{
- # shift remaining arguments up
- for (i = ARGC; i > ARGIND; i--)
- ARGV[i] = ARGV[i-1]
-
- # make sure gawk knows to keep going
- ARGC++
-
- # make current file next to get done
- ARGV[ARGIND+1] = FILENAME
-
- # do it
- nextfile
+ # Change to suit your system
+ _gr_awklib = "/usr/local/libexec/awk/"
@}
address@hidden endfile
address@hidden example
-
-This code relies on the @code{ARGIND} variable
-(@pxref{Auto-set}),
-which is specific to @command{gawk}.
-If you are not using
address@hidden, you can use ideas presented in
address@hidden
-the previous @value{SECTION}
address@hidden ifnotinfo
address@hidden
address@hidden Function},
address@hidden ifinfo
-to either update @code{ARGIND} on your own
-or modify this code as appropriate.
-
-The @code{rewind()} function also relies on the @code{nextfile} keyword
-(@pxref{Nextfile Statement}).
address@hidden File Checking
address@hidden Checking for Readable @value{DDF}s
-
address@hidden troubleshooting, readable @value{DF}s
address@hidden readable @address@hidden checking
address@hidden files, skipping
-Normally, if you give @command{awk} a @value{DF} that isn't readable,
-it stops with a fatal error. There are times when you
-might want to just ignore such files and keep going. You can
-do this by prepending the following program to your @command{awk}
-program:
+function _gr_init( oldfs, oldrs, olddol0, grcat,
+ using_fw, using_fpat, n, a, i)
address@hidden
+ if (_gr_inited)
+ return
address@hidden @code{readable.awk} program
address@hidden
address@hidden file eg/lib/readable.awk
-# readable.awk --- library file to skip over unreadable files
address@hidden endfile
address@hidden
address@hidden file eg/lib/readable.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# October 2000
-# December 2010
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/readable.awk
+ oldfs = FS
+ oldrs = RS
+ olddol0 = $0
+ using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
+ using_fpat = (PROCINFO["FS"] == "FPAT")
+ FS = ":"
+ RS = "\n"
-BEGIN @{
- for (i = 1; i < ARGC; i++) @{
- if (ARGV[i] ~ /^[[:alpha:]_][[:alnum:]_]*=.*/ \
- || ARGV[i] == "-" || ARGV[i] == "/dev/stdin")
- continue # assignment or standard input
- else if ((getline junk < ARGV[i]) < 0) # unreadable
- delete ARGV[i]
+ grcat = _gr_awklib "grcat"
+ while ((grcat | getline) > 0) @{
+ if ($1 in _gr_byname)
+ _gr_byname[$1] = _gr_byname[$1] "," $4
else
- close(ARGV[i])
+ _gr_byname[$1] = $0
+ if ($3 in _gr_bygid)
+ _gr_bygid[$3] = _gr_bygid[$3] "," $4
+ else
+ _gr_bygid[$3] = $0
+
+ n = split($4, a, "[ \t]*,[ \t]*")
+ for (i = 1; i <= n; i++)
+ if (a[i] in _gr_groupsbyuser)
+ _gr_groupsbyuser[a[i]] = \
+ _gr_groupsbyuser[a[i]] " " $1
+ else
+ _gr_groupsbyuser[a[i]] = $1
+
+ _gr_bycount[++_gr_count] = $0
@}
+ close(grcat)
+ _gr_count = 0
+ _gr_inited++
+ FS = oldfs
+ if (using_fw)
+ FIELDWIDTHS = FIELDWIDTHS
+ else if (using_fpat)
+ FPAT = FPAT
+ RS = oldrs
+ $0 = olddol0
@}
@c endfile
@end example
address@hidden troubleshooting, @code{getline} function
-This works, because the @code{getline} won't be fatal.
-Removing the element from @code{ARGV} with @code{delete}
-skips the file (since it's no longer in the list).
-See also @ref{ARGC and ARGV}.
+The @code{BEGIN} rule sets a private variable to the directory where
+@command{grcat} is stored. Because it is used to help out an @command{awk} library
+routine, we have chosen to put it in @file{/usr/local/libexec/awk}. You might
+want it to be in a different directory on your system.
address@hidden Empty Files
address@hidden Checking For Zero-length Files
+These routines follow the same general outline as the user database routines
+(@pxref{Passwd Functions}).
+The @code{@w{_gr_inited}} variable is used to
+ensure that the database is scanned no more than once.
+The @code{@w{_gr_init()}} function first saves @code{FS},
+@code{RS}, and
+@code{$0}, and then sets @code{FS} and @code{RS} to the correct values for
+scanning the group information.
+It also takes care to note whether @code{FIELDWIDTHS} or @code{FPAT}
+is being used, and to restore the appropriate field splitting mechanism.
-All known @command{awk} implementations silently skip over zero-length files.
-This is a by-product of @command{awk}'s implicit
-read-a-record-and-match-against-the-rules loop: when @command{awk}
-tries to read a record from an empty file, it immediately receives an
-end of file indication, closes the file, and proceeds on to the next
-command-line @value{DF}, @emph{without} executing any user-level
address@hidden program code.
+The group information is stored in several associative arrays.
+The arrays are indexed by group name (@code{@w{_gr_byname}}), by group ID number
+(@code{@w{_gr_bygid}}), and by position in the database (@code{@w{_gr_bycount}}).
+There is an additional array indexed by user name (@code{@w{_gr_groupsbyuser}}),
+which is a space-separated list of groups to which each user belongs.
-Using @command{gawk}'s @code{ARGIND} variable
-(@pxref{Built-in Variables}), it is possible to detect when an empty
address@hidden has been skipped. Similar to the library file presented
-in @ref{Filetrans Function}, the following library file calls a function named
address@hidden()} that the user must provide. The arguments passed are
-the @value{FN} and the position in @code{ARGV} where it was found:
+Unlike the user database, it is possible to have multiple records in the
+database for the same group. This is common when a group has a large number
+of members. A pair of such entries might look like the following:
address@hidden @code{zerofile.awk} program
@example
address@hidden file eg/lib/zerofile.awk
-# zerofile.awk --- library file to process empty input files
address@hidden endfile
address@hidden
address@hidden file eg/lib/zerofile.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# June 2003
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/zerofile.awk
+tvpeople:*:101:johnny,jay,arsenio
+tvpeople:*:101:david,conan,tom,joan
address@hidden example
-BEGIN @{ Argind = 0 @}
+For this reason, @code{_gr_init()} looks to see if a group name or
+group ID number has already been seen. If it has, then the user names are
+simply concatenated onto the previous list of users. (There is actually a
+subtle problem with the code just presented. Suppose that
+the first time there were no names. This code adds the names with
+a leading comma. It also doesn't check that there is a @code{$4}.)
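One way to avoid the subtle problem just described is to keep the member list apart from the rest of the record and to append to it only when a record actually carries members. The following is a minimal sketch under that assumption, not the library's code: the shell function name merge_groups and the awk arrays head, members, and order are hypothetical names chosen for illustration. It reads colon-separated group records on standard input and prints one merged record per group.

```shell
# merge_groups: hypothetical helper, not part of group.awk.
# Merges repeated group records without producing a leading comma.
merge_groups() {
    awk -F: '
    {
        name = $1
        if (!(name in head)) {          # first record for this group
            order[++n] = name           # remember input order
            head[name] = $1 ":" $2 ":" $3
            members[name] = ""
        }
        # Append only when this record really has a member list
        if (NF >= 4 && $4 != "") {
            if (members[name] != "")
                members[name] = members[name] "," $4
            else
                members[name] = $4
        }
    }
    END {
        for (i = 1; i <= n; i++)
            print head[order[i]] ":" members[order[i]]
    }'
}

# The first record has no members; the merged list has no leading comma.
printf '%s\n' 'tvpeople:*:101:' 'tvpeople:*:101:david,conan,tom,joan' |
    merge_groups
# prints: tvpeople:*:101:david,conan,tom,joan
```

Storing the header and the members separately costs one extra array, but it means the empty-list case needs no special handling when the record is reassembled.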
-ARGIND > Argind + 1 @{
- for (Argind++; Argind < ARGIND; Argind++)
- zerofile(ARGV[Argind], Argind)
address@hidden
+Finally, @code{_gr_init()} closes the pipeline to @command{grcat}, restores
+@code{FS} (and @code{FIELDWIDTHS} or @code{FPAT} if necessary), @code{RS}, and @code{$0},
+initializes @code{_gr_count} to zero
+(it is used later), and makes @code{_gr_inited} nonzero.
-ARGIND != Argind @{ Argind = ARGIND @}
address@hidden @code{getgrnam()} function (C library)
+The @code{getgrnam()} function takes a group name as its argument, and if that
+group exists, it is returned.
+Otherwise, it
+relies on the array reference to a nonexistent
+element to create the element with the null string as its value:
-END @{
- if (ARGIND > Argind)
- for (Argind++; Argind <= ARGIND; Argind++)
- zerofile(ARGV[Argind], Argind)
address@hidden @code{getgrnam()} user-defined function
address@hidden
address@hidden file eg/lib/groupawk.in
+function getgrnam(group)
address@hidden
+ _gr_init()
+ return _gr_byname[group]
@}
@c endfile
@end example
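Because the lookup arrays store each record as a whole line, a caller usually has to split the returned string itself. The sketch below assumes a record in the colon-separated format shown earlier and prints its members one per line; the function name group_members is hypothetical, and the record is passed in as an argument rather than obtained from the library so that the example stands alone.

```shell
# group_members: hypothetical helper that splits one group record.
# $1 is a record such as "staff:*:10:arnold,miriam,andy".
group_members() {
    awk -v rec="$1" '
    BEGIN {
        split(rec, f, ":")          # f[1]=name f[2]=passwd f[3]=gid f[4]=members
        n = split(f[4], m, ",")     # n is 0 when the member list is empty
        for (i = 1; i <= n; i++)
            print m[i]
    }'
}

group_members 'staff:*:10:arnold,miriam,andy'
```

In a real program the argument would be the string returned by a call such as getgrnam("staff").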
-The user-level variable @code{Argind} allows the @command{awk} program
-to track its progress through @code{ARGV}. Whenever the program detects
-that @code{ARGIND} is greater than @samp{Argind + 1}, it means that one or
-more empty files were skipped. The action then calls @code{zerofile()} for
-each such file, incrementing @code{Argind} along the way.
address@hidden @code{getgrgid()} function (C library)
+The @code{getgrgid()} function is similar; it takes a numeric group ID and
+looks up the information associated with that group ID:
-The @samp{Argind != ARGIND} rule simply keeps @code{Argind} up to date
-in the normal case.
address@hidden @code{getgrgid()} user-defined function
address@hidden
address@hidden file eg/lib/groupawk.in
+function getgrgid(gid)
address@hidden
+ _gr_init()
+ return _gr_bygid[gid]
address@hidden
address@hidden endfile
address@hidden example
-Finally, the @code{END} rule catches the case of any empty files at
-the end of the command-line arguments. Note that the test in the
-condition of the @code{for} loop uses the @samp{<=} operator,
-not @samp{<}.
address@hidden @code{getgruser()} function (C library)
+The @code{getgruser()} function does not have a C counterpart. It takes a
+user name and returns the list of groups that have the user as a member:
-As an exercise, you might consider whether this same problem can
-be solved without relying on @command{gawk}'s @code{ARGIND} variable.
address@hidden @code{getgruser()} function, user-defined
address@hidden
address@hidden file eg/lib/groupawk.in
+function getgruser(user)
address@hidden
+ _gr_init()
+ return _gr_groupsbyuser[user]
address@hidden
address@hidden endfile
address@hidden example
-As a second exercise, revise this code to handle the case where
-an intervening value in @code{ARGV} is a variable assignment.
address@hidden @code{getgrent()} function (C library)
+The @code{getgrent()} function steps through the database one entry at a time.
+It uses @code{_gr_count} to track its position in the list:
address@hidden
-# zerofile2.awk --- same thing, portably
address@hidden @code{getgrent()} user-defined function
address@hidden
address@hidden file eg/lib/groupawk.in
+function getgrent()
address@hidden
+ _gr_init()
+ if (++_gr_count in _gr_bycount)
+ return _gr_bycount[_gr_count]
+ return ""
address@hidden
address@hidden endfile
address@hidden example
address@hidden ENDOFRANGE clibf
-BEGIN @{
- ARGIND = Argind = 0
- for (i = 1; i < ARGC; i++)
- Fnames[ARGV[i]]++
address@hidden @code{endgrent()} function (C library)
+The @code{endgrent()} function resets @code{_gr_count} to zero so that @code{getgrent()} can
+start over again:
address@hidden @code{endgrent()} user-defined function
address@hidden
address@hidden file eg/lib/groupawk.in
+function endgrent()
address@hidden
+ _gr_count = 0
@}
-FNR == 1 @{
- while (ARGV[ARGIND] != FILENAME)
- ARGIND++
- Seen[FILENAME]++
- if (Seen[FILENAME] == Fnames[FILENAME])
- do
- ARGIND++
- while (ARGV[ARGIND] != FILENAME)
address@hidden
-ARGIND > Argind + 1 @{
- for (Argind++; Argind < ARGIND; Argind++)
- zerofile(ARGV[Argind], Argind)
address@hidden
-ARGIND != Argind @{
- Argind = ARGIND
address@hidden
-END @{
- if (ARGIND < ARGC - 1)
- ARGIND = ARGC - 1
- if (ARGIND > Argind)
- for (Argind++; Argind <= ARGIND; Argind++)
- zerofile(ARGV[Argind], Argind)
address@hidden
address@hidden ignore
address@hidden endfile
address@hidden example
address@hidden Ignoring Assigns
address@hidden Treating Assignments as @value{FFN}s
+As with the user database routines, each function calls @code{_gr_init()} to
+initialize the arrays. Doing so only incurs the extra overhead of running
+@command{grcat} if these functions are used (as opposed to moving the body of
+@code{_gr_init()} into a @code{BEGIN} rule).
address@hidden assignments as filenames
address@hidden filenames, assignments as
-Occasionally, you might not want @command{awk} to process command-line
-variable assignments
-(@pxref{Assignment Options}).
-In particular, if you have a @value{FN} that contains an @samp{=} character,
address@hidden treats the @value{FN} as an assignment, and does not process it.
+Most of the work is in scanning the database and building the various
+associative arrays. The functions that the user calls are themselves very
+simple, relying on @command{awk}'s associative arrays to do work.
-Some users have suggested an additional command-line option for @command{gawk}
-to disable command-line assignments. However, some simple programming with
-a library file does the trick:
+The @command{id} program in @ref{Id Program},
+uses these functions.
address@hidden @code{noassign.awk} program
address@hidden
address@hidden file eg/lib/noassign.awk
-# noassign.awk --- library file to avoid the need for a
-# special option that disables command-line assignments
address@hidden endfile
address@hidden
address@hidden file eg/lib/noassign.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# October 1999
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/noassign.awk
address@hidden Walking Arrays
address@hidden Traversing Arrays of Arrays
+
+@ref{Arrays of Arrays}, described how @command{gawk}
+provides arrays of arrays. In particular, any element of
+an array may be either a scalar, or another array. The
+@code{isarray()} function (@pxref{Type Functions})
+lets you distinguish an array
+from a scalar.
+The following function, @code{walk_array()}, recursively traverses
+an array, printing each element's indices and value.
+You call it with the array and a string representing the name
+of the array:
-function disable_assigns(argc, argv, i)
address@hidden @code{walk_array()} user-defined function
address@hidden
address@hidden file eg/lib/walkarray.awk
+function walk_array(arr, name, i)
@{
- for (i = 1; i < argc; i++)
- if (argv[i] ~ /^[[:alpha:]_][[:alnum:]_]*=.*/)
- argv[i] = ("./" argv[i])
+ for (i in arr) @{
+ if (isarray(arr[i]))
+ walk_array(arr[i], (name "[" i "]"))
+ else
+ printf("%s[%s] = %s\n", name, i, arr[i])
+ @}
@}
address@hidden endfile
address@hidden example
+
address@hidden
+It works by looping over each element of the array. If any given
+element is itself an array, the function calls itself recursively,
+passing the subarray and a new string representing the current index.
+Otherwise, the function simply prints the element's name, index, and value.
+Here is a main program to demonstrate:
address@hidden
BEGIN @{
- if (No_command_assign)
- disable_assigns(ARGC, ARGV)
+ a[1] = 1
+ a[2][1] = 21
+ a[2][2] = 22
+ a[3] = 3
+ a[4][1][1] = 411
+ a[4][2] = 42
+
+ walk_array(a, "a")
@}
address@hidden endfile
@end example
-You then run your program this way:
+When run, the program produces the following output:
@example
-awk -v No_command_assign=1 -f noassign.awk -f yourprog.awk *
+$ @kbd{gawk -f walk_array.awk}
address@hidden a[4][1][1] = 411
address@hidden a[4][2] = 42
address@hidden a[1] = 1
address@hidden a[2][1] = 21
address@hidden a[2][2] = 22
address@hidden a[3] = 3
@end example
-The function works by looping through the arguments.
-It prepends @samp{./} to
-any argument that matches the form
-of a variable assignment, turning that argument into a @value{FN}.
address@hidden ENDOFRANGE libfgdata
address@hidden ENDOFRANGE flibgdata
address@hidden ENDOFRANGE gdatar
address@hidden ENDOFRANGE libf
address@hidden ENDOFRANGE flib
address@hidden ENDOFRANGE fudlib
address@hidden ENDOFRANGE datagr
-The use of @code{No_command_assign} allows you to disable command-line
-assignments at invocation time, by giving the variable a true value.
-When not set, it is initially zero (i.e., false), so the command-line arguments
-are left alone.
address@hidden ENDOFRANGE dataf
address@hidden ENDOFRANGE flibdataf
address@hidden ENDOFRANGE libfdataf
address@hidden Sample Programs
address@hidden Practical @command{awk} Programs
address@hidden STARTOFRANGE awkpex
address@hidden @command{awk} programs, examples of
address@hidden Getopt Function
address@hidden Processing Command-Line Options
+@ref{Library Functions},
+presents the idea that reading programs in a language contributes to
+learning that language. This @value{CHAPTER} continues that theme,
+presenting a potpourri of @command{awk} programs for your reading
+enjoyment.
address@hidden
+There are three sections.
+The first describes how to run the programs presented
+in this @value{CHAPTER}.
address@hidden STARTOFRANGE libfclo
address@hidden libraries of @command{awk} functions, command-line options
address@hidden STARTOFRANGE flibclo
address@hidden functions, library, command-line options
address@hidden STARTOFRANGE clop
address@hidden command-line options, processing
address@hidden STARTOFRANGE oclp
address@hidden options, command-line, processing
address@hidden STARTOFRANGE clibf
address@hidden functions, library, C library
address@hidden arguments, processing
-Most utilities on POSIX compatible systems take options on
-the command line that can be used to change the way a program behaves.
address@hidden is an example of such a program
-(@pxref{Options}).
-Often, options take @dfn{arguments}; i.e., data that the program needs to
-correctly obey the command-line option. For example, @command{awk}'s
address@hidden option requires a string to use as the field separator.
-The first occurrence on the command line of either @option{--} or a
-string that does not begin with @samp{-} ends the options.
+The second presents @command{awk}
+versions of several common POSIX utilities.
+These are programs that you are hopefully already familiar with,
+and therefore, whose problems are understood.
+By reimplementing these programs in @command{awk},
+you can focus on the @command{awk}-related aspects of solving
+the programming problem.
address@hidden @code{getopt()} function (C library)
-Modern Unix systems provide a C function named @code{getopt()} for processing
-command-line arguments. The programmer provides a string describing the
-one-letter options. If an option requires an argument, it is followed in the
-string with a colon. @code{getopt()} is also passed the
-count and values of the command-line arguments and is called in a loop.
address@hidden()} processes the command-line arguments for option letters.
-Each time around the loop, it returns a single character representing the
-next option letter that it finds, or @samp{?} if it finds an invalid option.
-When it returns @minus{}1, there are no options left on the command line.
+The third is a grab bag of interesting programs.
+These solve a number of different data-manipulation and management
+problems. Many of the programs are short, which emphasizes @command{awk}'s
+ability to do a lot in just a few lines of code.
address@hidden ifnotinfo
-When using @code{getopt()}, options that do not take arguments can be
-grouped together. Furthermore, options that take arguments require that the
-argument be present. The argument can immediately follow the option letter,
-or it can be a separate command-line argument.
+Many of these programs use library functions presented in
+@ref{Library Functions}.
-Given a hypothetical program that takes
-three command-line options, @option{-a}, @option{-b}, and @option{-c}, where
address@hidden requires an argument, all of the following are valid ways of
-invoking the program:
address@hidden
+* Running Examples:: How to run these examples.
+* Clones:: Clones of common utilities.
+* Miscellaneous Programs:: Some interesting @command{awk} programs.
address@hidden menu
+
address@hidden Running Examples
address@hidden Running the Example Programs
+
+To run a given program, you would typically do something like this:
@example
-prog -a -b foo -c data1 data2 data3
-prog -ac -bfoo -- data1 data2 data3
-prog -acbfoo data1 data2 data3
+awk -f @var{program} -- @var{options} @var{files}
@end example
-Notice that when the argument is grouped with its option, the rest of
-the argument is considered to be the option's argument.
-In this example, @option{-acbfoo} indicates that all of the
address@hidden, @option{-b}, and @option{-c} options were supplied,
-and that @samp{foo} is the argument to the @option{-b} option.
address@hidden
+Here, @var{program} is the name of the @command{awk} program (such as
+@file{cut.awk}), @var{options} are any command-line options for the
+program that start with a @samp{-}, and @var{files} are the actual @value{DF}s.
address@hidden()} provides four external variables that the programmer can use:
+If your system supports the @samp{#!} executable interpreter mechanism
+(@pxref{Executable Scripts}),
+you can instead run your program directly:
address@hidden @code
address@hidden optind
-The index in the argument value array (@code{argv}) where the first
-nonoption command-line argument can be found.
address@hidden
+cut.awk -c1-8 myfiles > results
address@hidden example
address@hidden optarg
-The string value of the argument to an option.
+If your @command{awk} is not @command{gawk}, you may instead need to use this:
address@hidden opterr
-Usually @code{getopt()} prints an error message when it finds an invalid
-option. Setting @code{opterr} to zero disables this feature. (An
-application might want to print its own error message.)
address@hidden
+cut.awk -- -c1-8 myfiles > results
address@hidden example
address@hidden optopt
-The letter representing the command-line option.
address@hidden While not usually documented, most versions supply this variable.
address@hidden table
address@hidden Clones
address@hidden Reinventing Wheels for Fun and Profit
address@hidden STARTOFRANGE posimawk
address@hidden POSIX, address@hidden implementing in @command{awk}
-The following C fragment shows how @code{getopt()} might process command-line
-arguments for @command{awk}:
+This @value{SECTION} presents a number of POSIX utilities implemented in
+@command{awk}. Reinventing these programs in @command{awk} is often enjoyable,
+because the algorithms can be very clearly expressed, and the code is usually
+very concise and simple. This is true because @command{awk} does so much for you.
address@hidden
-int
-main(int argc, char *argv[])
address@hidden
- @dots{}
- /* print our own message */
- opterr = 0;
- while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) @{
- switch (c) @{
- case 'f': /* file */
- @dots{}
- break;
- case 'F': /* field separator */
- @dots{}
- break;
- case 'v': /* variable assignment */
- @dots{}
- break;
- case 'W': /* extension */
- @dots{}
- break;
- case '?':
- default:
- usage();
- break;
- @}
- @}
- @dots{}
address@hidden
+It should be noted that these programs are not necessarily intended to
+replace the installed versions on your system.
+Nor may all of these programs be fully compliant with the most recent
+POSIX standard. This is not a problem; their
+purpose is to illustrate @command{awk} language programming for ``real world''
+tasks.
+
+The programs are presented in alphabetical order.
+
address@hidden
+* Cut Program:: The @command{cut} utility.
+* Egrep Program:: The @command{egrep} utility.
+* Id Program:: The @command{id} utility.
+* Split Program:: The @command{split} utility.
+* Tee Program:: The @command{tee} utility.
+* Uniq Program:: The @command{uniq} utility.
+* Wc Program:: The @command{wc} utility.
address@hidden menu
+
address@hidden Cut Program
address@hidden Cutting out Fields and Columns
+
address@hidden @command{cut} utility
address@hidden STARTOFRANGE cut
address@hidden @command{cut} utility
address@hidden STARTOFRANGE ficut
address@hidden fields, cutting
address@hidden STARTOFRANGE colcut
address@hidden columns, cutting
+The @command{cut} utility selects, or ``cuts,'' characters or fields
+from its standard input and sends them to its standard output.
+Fields are separated by TABs by default,
+but you may supply a command-line option to change the field
+@dfn{delimiter} (i.e., the field-separator character). @command{cut}'s
+definition of fields is less general than @command{awk}'s.
+
+A common use of @command{cut} might be to pull out just the login name of
+logged-on users from the output of @command{who}. For example, the following
+pipeline generates a sorted, unique list of the logged-on users:
+
address@hidden
+who | cut -c1-8 | sort | uniq
@end example
-As a side point, @command{gawk} actually uses the GNU @code{getopt_long()}
-function to process both normal and GNU-style long options
-(@pxref{Options}).
+The options for @command{cut} are:
-The abstraction provided by @code{getopt()} is very useful and is quite
-handy in @command{awk} programs as well. Following is an @command{awk}
-version of @code{getopt()}. This function highlights one of the
-greatest weaknesses in @command{awk}, which is that it is very poor at
-manipulating single characters. Repeated calls to @code{substr()} are
-necessary for accessing individual characters
-(@pxref{String Functions})address@hidden
-function was written before @command{gawk} acquired the ability to
-split strings into single characters using @code{""} as the separator.
-We have left it alone, since using @code{substr()} is more portable.}
address@hidden FIXME: could use split(str, a, "") to do it more easily.
address@hidden @code
address@hidden -c @var{list}
+Use @var{list} as the list of characters to cut out. Items within the list
+may be separated by commas, and ranges of characters can be separated with
+dashes. The list @samp{1-8,15,22-35} specifies characters 1 through
+8, 15, and 22 through 35.
-The discussion that follows walks through the code a bit at a time:
address@hidden -f @var{list}
+Use @var{list} as the list of fields to cut out.
address@hidden @code{getopt()} user-defined function
address@hidden -d @var{delim}
+Use @var{delim} as the field-separator character instead of the TAB
+character.
+
address@hidden -s
+Suppress printing of lines that do not contain the field delimiter.
address@hidden table
+
+The @command{awk} implementation of @command{cut} uses the @code{getopt()} library
+function (@pxref{Getopt Function})
+and the @code{join()} library function
+(@pxref{Join Function}).
+
+The program begins with a comment describing the options, the library
+functions needed, and a @code{usage()} function that prints out a usage
+message and exits. @code{usage()} is called if invalid arguments are
+supplied:
+
address@hidden @code{cut.awk} program
@example
address@hidden file eg/lib/getopt.awk
-# getopt.awk --- Do C library getopt(3) function in awk
address@hidden file eg/prog/cut.awk
+# cut.awk --- implement cut in awk
@c endfile
@ignore
address@hidden file eg/lib/getopt.awk
address@hidden file eg/prog/cut.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
-#
-# Initial version: March, 1991
-# Revised: May, 1993
+# May 1993
@c endfile
@end ignore
address@hidden file eg/lib/getopt.awk
-
-# External variables:
-# Optind -- index in ARGV of first nonoption argument
-# Optarg -- string value of argument to current option
-# Opterr -- if nonzero, print our own diagnostic
-# Optopt -- current option letter
-
-# Returns:
-# -1 at end of options
-# "?" for unrecognized option
-# <c> a character representing the current option
-
-# Private Data:
-# _opti -- index in multi-flag option, e.g., -abc
address@hidden endfile
address@hidden example
-
-The function starts out with comments presenting
-a list of the global variables it uses,
-what the return values are, what they mean, and any global variables that
-are ``private'' to this library function. Such documentation is essential
-for any program, and particularly for library functions.
-
-The @code{getopt()} function first checks that it was indeed called with
-a string of options (the @code{options} parameter). If @code{options}
-has a zero length, @code{getopt()} immediately returns @minus{}1:
address@hidden file eg/prog/cut.awk
address@hidden @code{getopt()} user-defined function
address@hidden
address@hidden file eg/lib/getopt.awk
-function getopt(argc, argv, options, thisopt, i)
address@hidden
- if (length(options) == 0) # no options given
- return -1
+# Options:
+# -f list Cut fields
+# -d c Field delimiter character
+# -c list Cut characters
+#
+# -s Suppress lines without the delimiter
+#
+# Requires getopt() and join() library functions
@group
- if (argv[Optind] == "--") @{ # all done
- Optind++
- _opti = 0
- return -1
+function usage( e1, e2)
address@hidden
+ e1 = "usage: cut [-f list] [-d c] [-s] [files...]"
+ e2 = "usage: cut [-c list] [files...]"
+ print e1 > "/dev/stderr"
+ print e2 > "/dev/stderr"
+ exit 1
address@hidden
@end group
- @} else if (argv[Optind] !~ /^-[^:[:space:]]/) @{
- _opti = 0
- return -1
- @}
address@hidden endfile
address@hidden example
-
-The next thing to check for is the end of the options. A @option{--}
-ends the command-line options, as does any command-line argument that
-does not begin with a @samp{-}. @code{Optind} is used to step through
-the array of command-line arguments; it retains its value across calls
-to @code{getopt()}, because it is a global variable.
-
-The regular expression that is used, @address@hidden/^-[^:[:space:]/}},
-checks for a @samp{-} followed by anything
-that is not whitespace and not a colon.
-If the current command-line argument does not match this pattern,
-it is not an option, and it ends option processing. Continuing on:
-
address@hidden
address@hidden file eg/lib/getopt.awk
- if (_opti == 0)
- _opti = 2
- thisopt = substr(argv[Optind], _opti, 1)
- Optopt = thisopt
- i = index(options, thisopt)
- if (i == 0) @{
- if (Opterr)
- printf("%c -- invalid option\n",
- thisopt) > "/dev/stderr"
- if (_opti >= length(argv[Optind])) @{
- Optind++
- _opti = 0
- @} else
- _opti++
- return "?"
- @}
@c endfile
@end example
-The @code{_opti} variable tracks the position in the current command-line
-argument (@code{argv[Optind]}). If multiple options are
-grouped together with one @samp{-} (e.g., @option{-abx}), it is necessary
-to return them to the user one at a time.
-
-If @code{_opti} is equal to zero, it is set to two, which is the index in
-the string of the next character to look at (we skip the @samp{-}, which
-is at position one). The variable @code{thisopt} holds the character,
-obtained with @code{substr()}. It is saved in @code{Optopt} for the main
-program to use.
-
-If @code{thisopt} is not in the @code{options} string, then it is an
-invalid option. If @code{Opterr} is nonzero, @code{getopt()} prints an error
-message on the standard error that is similar to the message from the C
-version of @code{getopt()}.
-
-Because the option is invalid, it is necessary to skip it and move on to the
-next option character. If @code{_opti} is greater than or equal to the
-length of the current command-line argument, it is necessary to move on
-to the next argument, so @code{Optind} is incremented and @code{_opti} is reset
-to zero. Otherwise, @code{Optind} is left alone and @code{_opti} is merely
-incremented.
address@hidden
+The variables @code{e1} and @code{e2} are used so that the function
+fits nicely on the
address@hidden
+page.
address@hidden ifnotinfo
address@hidden
+screen.
address@hidden ifnottex
-In any case, because the option is invalid, @code{getopt()} returns @code{"?"}.
-The main program can examine @code{Optopt} if it needs to know what the
-invalid option letter actually is. Continuing on:
address@hidden @code{BEGIN} pattern, running @command{awk} programs and
address@hidden @code{FS} variable, running @command{awk} programs and
+Next comes a @code{BEGIN} rule that parses the command-line options.
+It sets @code{FS} to a single TAB character, because that is @command{cut}'s
+default field separator. The rule then sets the output field separator to be the
+same as the input field separator. A loop using @code{getopt()} steps
+through the command-line options. Exactly one of the variables
address@hidden or @code{by_chars} is set to true, to indicate that
+processing should be done by fields or by characters, respectively.
+When cutting by characters, the output field separator is set to the null
+string:
@example
address@hidden file eg/lib/getopt.awk
- if (substr(options, i + 1, 1) == ":") @{
- # get option argument
- if (length(substr(argv[Optind], _opti + 1)) > 0)
- Optarg = substr(argv[Optind], _opti + 1)
address@hidden file eg/prog/cut.awk
+BEGIN \
address@hidden
+ FS = "\t" # default
+ OFS = FS
+ while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) @{
+ if (c == "f") @{
+ by_fields = 1
+ fieldlist = Optarg
+ @} else if (c == "c") @{
+ by_chars = 1
+ fieldlist = Optarg
+ OFS = ""
+ @} else if (c == "d") @{
+ if (length(Optarg) > 1) @{
+ printf("Using first character of %s" \
+ " for delimiter\n", Optarg) > "/dev/stderr"
+ Optarg = substr(Optarg, 1, 1)
+ @}
+ FS = Optarg
+ OFS = FS
+ if (FS == " ") # defeat awk semantics
+ FS = "[ ]"
+ @} else if (c == "s")
+ suppress++
else
- Optarg = argv[++Optind]
- _opti = 0
- @} else
- Optarg = ""
address@hidden endfile
address@hidden example
-
-If the option requires an argument, the option letter is followed by a colon
-in the @code{options} string. If there are remaining characters in the
-current command-line argument (@code{argv[Optind]}), then the rest of that
-string is assigned to @code{Optarg}. Otherwise, the next command-line
-argument is used (@samp{-xFOO} versus @address@hidden FOO}}). In either case,
address@hidden is reset to zero, because there are no more characters left to
-examine in the current command-line argument. Continuing:
+ usage()
+ @}
address@hidden
address@hidden file eg/lib/getopt.awk
- if (_opti == 0 || _opti >= length(argv[Optind])) @{
- Optind++
- _opti = 0
- @} else
- _opti++
- return thisopt
address@hidden
+ # Clear out options
+ for (i = 1; i < Optind; i++)
+ ARGV[i] = ""
@c endfile
@end example
-Finally, if @code{_opti} is either zero or greater than the length of the
-current command-line argument, it means this element in @code{argv} is
-through being processed, so @code{Optind} is incremented to point to the
-next element in @code{argv}. If neither condition is true, then only
address@hidden is incremented, so that the next option letter can be processed
-on the next call to @code{getopt()}.
address@hidden field separators, spaces as
+The code must take
+special care when the field delimiter is a space. Using
+a single space (@address@hidden" "}}) for the value of @code{FS} is
address@hidden would separate fields with runs of spaces,
+TABs, and/or newlines, and we want them to be separated with individual
+spaces. Also remember that after @code{getopt()} is through
+(as described in @ref{Getopt Function}),
+we have to
+clear out all the elements of @code{ARGV} from 1 to @code{Optind},
+so that @command{awk} does not try to process the command-line options
+as @value{FN}s.
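The special case for a space delimiter can be seen in isolation. The following is a minimal sketch, not part of the commit: `count_fields` is a hypothetical shell helper that runs a one-line awk program, and it assumes only POSIX awk behavior. With `FS` set to a single space, runs of blanks collapse into one separator; with the regexp `[ ]`, every individual space separates a field.

```shell
# count_fields is a hypothetical helper for illustration only.
# $1 = value to assign to FS, $2 = input line
count_fields() {
    printf '%s\n' "$2" | awk -v sep="$1" 'BEGIN { FS = sep } { print NF }'
}

# The input line "a  b" contains two consecutive spaces.
count_fields ' '   'a  b'    # default splitting: prints 2
count_fields '[ ]' 'a  b'    # one space per separator: prints 3
```

The second call reports an empty field between the two spaces, which is the behavior `cut -d ' '` needs.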
-The @code{BEGIN} rule initializes both @code{Opterr} and @code{Optind} to one.
address@hidden is set to one, since the default behavior is for @code{getopt()}
-to print a diagnostic message upon seeing an invalid option. @code{Optind}
-is set to one, since there's no reason to look at the program name, which is
-in @code{ARGV[0]}:
+After dealing with the command-line options, the program verifies that the
+options make sense. Only one or the other of @option{-c} and @option{-f}
+should be used, and both require a field list. Then the program calls
+either @code{set_fieldlist()} or @code{set_charlist()} to pull apart the
+list of fields or characters:
@example
address@hidden file eg/lib/getopt.awk
-BEGIN @{
- Opterr = 1 # default is to diagnose
- Optind = 1 # skip ARGV[0]
address@hidden file eg/prog/cut.awk
+ if (by_fields && by_chars)
+ usage()
- # test program
- if (_getopt_test) @{
- while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
- printf("c = <%c>, optarg = <%s>\n",
- _go_c, Optarg)
- printf("non-option arguments:\n")
- for (; Optind < ARGC; Optind++)
- printf("\tARGV[%d] = <%s>\n",
- Optind, ARGV[Optind])
+ if (by_fields == 0 && by_chars == 0)
+ by_fields = 1 # default
+
+ if (fieldlist == "") @{
+ print "cut: needs list for -c or -f" > "/dev/stderr"
+ exit 1
@}
+
+ if (by_fields)
+ set_fieldlist()
+ else
+ set_charlist()
@}
@c endfile
@end example
-The rest of the @code{BEGIN} rule is a simple test program. Here is the
-result of two sample runs of the test program:
address@hidden()} splits the field list apart at the commas
+into an array. Then, for each element of the array, it looks to
+see if the element is actually a range, and if so, splits it apart.
+The function checks the range
+to make sure that the first number is smaller than the second.
+Each number in the list is added to the @code{flist} array, which
+simply lists the fields that will be printed. Normal field splitting
+is used. The program lets @command{awk} handle the job of doing the
+field splitting:
@example
-$ @kbd{awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x}
address@hidden c = <a>, optarg = <>
address@hidden c = <c>, optarg = <>
address@hidden c = <b>, optarg = <ARG>
address@hidden non-option arguments:
address@hidden ARGV[3] = <bax>
address@hidden ARGV[4] = <-x>
-
-$ @kbd{awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc}
address@hidden c = <a>, optarg = <>
address@hidden x -- invalid option
address@hidden c = <?>, optarg = <>
address@hidden non-option arguments:
address@hidden ARGV[4] = <xyz>
address@hidden ARGV[5] = <abc>
address@hidden file eg/prog/cut.awk
+function set_fieldlist( n, m, i, j, k, f, g)
address@hidden
+ n = split(fieldlist, f, ",")
+ j = 1 # index in flist
+ for (i = 1; i <= n; i++) @{
+ if (index(f[i], "-") != 0) @{ # a range
+ m = split(f[i], g, "-")
address@hidden
+ if (m != 2 || g[1] >= g[2]) @{
+ printf("bad field list: %s\n",
+ f[i]) > "/dev/stderr"
+ exit 1
+ @}
address@hidden group
+ for (k = g[1]; k <= g[2]; k++)
+ flist[j++] = k
+ @} else
+ flist[j++] = f[i]
+ @}
+ nfields = j - 1
address@hidden
address@hidden endfile
@end example
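The list expansion performed by set_fieldlist() can be tried on its own. This is a sketch under stated assumptions rather than the program's code: `expand_fields` is a hypothetical shell wrapper, the expanded list is printed instead of being stored in `flist`, and the `+ 0` conversions are an added precaution to force numeric comparison.

```shell
# expand_fields is a hypothetical helper for illustration only.
# $1 = a cut-style list such as 1-3,5,7-8
expand_fields() {
    awk -v list="$1" 'BEGIN {
        n = split(list, f, ",")
        out = ""; sep = ""
        for (i = 1; i <= n; i++) {
            if (index(f[i], "-") != 0) {        # a range
                m = split(f[i], g, "-")
                lo = g[1] + 0; hi = g[2] + 0    # force numeric values
                if (m != 2 || lo >= hi) {
                    print "bad field list: " f[i] > "/dev/stderr"
                    exit 1
                }
                for (k = lo; k <= hi; k++) {
                    out = out sep k
                    sep = " "
                }
            } else {                            # a single field number
                out = out sep f[i]
                sep = " "
            }
        }
        print out
    }'
}

expand_fields '1-3,5,7-8'    # prints: 1 2 3 5 7 8
```

A reversed range such as `5-2` produces the "bad field list" message and a nonzero exit status, mirroring the check in the function above.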
-In both runs,
-the first @option{--} terminates the arguments to @command{awk}, so that it does
-not try to interpret the @option{-a}, etc., as its own options.
-
address@hidden NOTE
-After @code{getopt()} is through, it is the responsibility of the user level
-code to
-clear out all the elements of @code{ARGV} from 1 to @code{Optind},
-so that @command{awk} does not try to process the command-line options
-as @value{FN}s.
address@hidden quotation
-
-Several of the sample programs presented in
address@hidden Programs},
-use @code{getopt()} to process their arguments.
address@hidden ENDOFRANGE libfclo
address@hidden ENDOFRANGE flibclo
address@hidden ENDOFRANGE clop
address@hidden ENDOFRANGE oclp
-
address@hidden Passwd Functions
address@hidden Reading the User Database
-
address@hidden STARTOFRANGE libfudata
address@hidden libraries of @command{awk} functions, user database, reading
address@hidden STARTOFRANGE flibudata
address@hidden functions, library, user database, reading
address@hidden STARTOFRANGE udatar
address@hidden user address@hidden reading
address@hidden STARTOFRANGE dataur
address@hidden database, address@hidden reading
address@hidden @code{PROCINFO} array
-The @code{PROCINFO} array
-(@pxref{Built-in Variables})
-provides access to the current user's real and effective user and group ID
-numbers, and if available, the user's supplementary group set.
-However, because these are numbers, they do not provide very useful
-information to the average user. There needs to be some way to find the
-user information associated with the user and group ID numbers. This
address@hidden presents a suite of functions for retrieving information from the
-user database. @xref{Group Functions},
-for a similar suite that retrieves information from the group database.
-
address@hidden @code{getpwent()} function (C library)
address@hidden @code{getpwent()} user-defined function
address@hidden users, information about, retrieving
address@hidden login information
address@hidden account information
address@hidden password file
address@hidden files, password
-The POSIX standard does not define the file where user information is
-kept. Instead, it provides the @code{<pwd.h>} header file
-and several C language subroutines for obtaining user information.
-The primary function is @code{getpwent()}, for ``get password entry.''
-The ``password'' comes from the original user database file,
address@hidden/etc/passwd}, which stores user information, along with the
-encrypted passwords (hence the name).
-
address@hidden @command{pwcat} program
-While an @command{awk} program could simply read @file{/etc/passwd}
-directly, this file may not contain complete information about the
-system's set of address@hidden is often the case that password
-information is stored in a network database.} To be sure you are able to
-produce a readable and complete version of the user database, it is necessary
-to write a small C program that calls @code{getpwent()}. @code{getpwent()}
-is defined as returning a pointer to a @code{struct passwd}. Each time it
-is called, it returns the next entry in the database. When there are
-no more entries, it returns @code{NULL}, the null pointer. When this
-happens, the C program should call @code{endpwent()} to close the database.
-Following is @command{pwcat}, a C program that ``cats'' the password database:
+The @code{set_charlist()} function is more complicated than
address@hidden()}.
+The idea here is to use @command{gawk}'s @code{FIELDWIDTHS} variable
+(@pxref{Constant Size}),
+which describes constant-width input. When using a character list, that is
+exactly what we have.
address@hidden Use old style function header for portability to old systems
(SunOS, HP/UX).
+Setting up @code{FIELDWIDTHS} is more complicated than simply listing the
+fields that need to be printed. We have to keep track of the fields to
+print and also the intervening characters that have to be skipped.
+For example, suppose you wanted characters 1 through 8, 15, and
+22 through 35. You would use @samp{-c 1-8,15,22-35}. The necessary value
+for @code{FIELDWIDTHS} is @address@hidden"8 6 1 6 14"}}. This yields five
+fields, and the fields to print
+are @code{$1}, @code{$3}, and @code{$5}.
+The intermediate fields are @dfn{filler},
+which is stuff in between the desired data.
address@hidden lists the fields to print, and @code{t} tracks the
+complete field list, including filler fields:
@example
address@hidden file eg/lib/pwcat.c
-/*
- * pwcat.c
- *
- * Generate a printable version of the password database
- */
address@hidden file eg/prog/cut.awk
+function set_charlist( field, i, j, f, g, t,
+ filler, last, len)
address@hidden
+ field = 1 # count total fields
+ n = split(fieldlist, f, ",")
+ j = 1 # index in flist
+ for (i = 1; i <= n; i++) @{
+ if (index(f[i], "-") != 0) @{ # range
+ m = split(f[i], g, "-")
+ if (m != 2 || g[1] >= g[2]) @{
+ printf("bad character list: %s\n",
+ f[i]) > "/dev/stderr"
+ exit 1
+ @}
+ len = g[2] - g[1] + 1
+ if (g[1] > 1) # compute length of filler
+ filler = g[1] - last - 1
+ else
+ filler = 0
address@hidden
+ if (filler)
+ t[field++] = filler
address@hidden group
+ t[field++] = len # length of field
+ last = g[2]
+ flist[j++] = field - 1
+ @} else @{
+ if (f[i] > 1)
+ filler = f[i] - last - 1
+ else
+ filler = 0
+ if (filler)
+ t[field++] = filler
+ t[field++] = 1
+ last = f[i]
+ flist[j++] = field - 1
+ @}
+ @}
+ FIELDWIDTHS = join(t, 1, field - 1)
+ nfields = j - 1
address@hidden
@c endfile
address@hidden
address@hidden file eg/lib/pwcat.c
-/*
- * Arnold Robbins, arnold@@skeeve.com, May 1993
- * Public Domain
- * December 2010, move to ANSI C definition for main().
- */
address@hidden example
+
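The width computation can be checked against the example in the text. The sketch below is an illustration, not the program's code: `char_widths` is a hypothetical helper that only builds the width string (it does not assign `FIELDWIDTHS`, so it runs in any POSIX awk), and it skips the validity checks that set_charlist() performs.

```shell
# char_widths is a hypothetical helper for illustration only.
# $1 = a character list such as 1-8,15,22-35
char_widths() {
    awk -v list="$1" 'BEGIN {
        n = split(list, f, ",")
        last = 0                  # last character position consumed so far
        widths = ""; sep = ""
        for (i = 1; i <= n; i++) {
            if (split(f[i], g, "-") == 2) {
                lo = g[1] + 0; hi = g[2] + 0
            } else {
                lo = f[i] + 0; hi = lo
            }
            filler = lo - last - 1            # skipped characters before this item
            if (filler > 0) {
                widths = widths sep filler
                sep = " "
            }
            len = hi - lo + 1                 # width of the wanted field
            widths = widths sep len
            sep = " "
            last = hi
        }
        print widths
    }'
}

char_widths '1-8,15,22-35'    # prints: 8 6 1 6 14
```

The result matches the value given above: widths 8, 1, and 14 are the wanted fields, and the two 6s are filler.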
+Next is the rule that actually processes the data. If the @option{-s} option
+is given, then @code{suppress} is true. The first @code{if} statement
+makes sure that the input record does have the field separator. If
address@hidden is processing fields, @code{suppress} is true, and the field
+separator character is not in the record, then the record is skipped.
-#if HAVE_CONFIG_H
-#include <config.h>
-#endif
+If the record is valid, then @command{gawk} has split the data
+into fields, either using the character in @code{FS} or using fixed-length
+fields and @code{FIELDWIDTHS}. The loop goes through the list of fields
+that should be printed. The corresponding field is printed if it contains data.
+If the next field also has data, then the separator character is
+written out between the fields:
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/pwcat.c
-#include <stdio.h>
-#include <pwd.h>
address@hidden
address@hidden file eg/prog/cut.awk
address@hidden
+ if (by_fields && suppress && index($0, FS) == 0)
+ next
+ for (i = 1; i <= nfields; i++) @{
+ if ($flist[i] != "") @{
+ printf "%s", $flist[i]
+ if (i < nfields && $flist[i+1] != "")
+ printf "%s", OFS
+ @}
+ @}
+ print ""
address@hidden
@c endfile
address@hidden
address@hidden file eg/lib/pwcat.c
-#if defined (STDC_HEADERS)
-#include <stdlib.h>
-#endif
address@hidden example
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/pwcat.c
-int
-main(int argc, char **argv)
address@hidden
- struct passwd *p;
+This version of @command{cut} relies on @command{gawk}'s @code{FIELDWIDTHS}
+variable to do the character-based cutting. While it is possible in
+other @command{awk} implementations to use @code{substr()}
+(@pxref{String Functions}),
+it is also extremely painful.
+The @code{FIELDWIDTHS} variable supplies an elegant solution to the problem
+of picking the input line apart by characters.
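For comparison, character cutting with substr() might look like the sketch below. It is an illustration only, not the approach the program takes: `cut_chars` is a hypothetical helper, it performs no validation of the list, and it relies solely on POSIX awk.

```shell
# cut_chars is a hypothetical helper for illustration only.
# $1 = a character list such as 1-3,5,8-10; input is read from stdin
cut_chars() {
    awk -v list="$1" 'BEGIN { n = split(list, f, ",") }
    {
        out = ""
        for (i = 1; i <= n; i++) {
            if (split(f[i], g, "-") == 2)       # a range: take g[1]..g[2]
                out = out substr($0, g[1], g[2] - g[1] + 1)
            else                                # a single character
                out = out substr($0, f[i], 1)
        }
        print out
    }'
}

printf 'abcdefghij\n' | cut_chars '1-3,5,8-10'    # prints: abcehij
```

The list is re-scanned and a substr() call is made per item for every input record, whereas `FIELDWIDTHS` lets gawk do the splitting once per record.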
address@hidden ENDOFRANGE cut
address@hidden ENDOFRANGE ficut
address@hidden ENDOFRANGE colcut
- while ((p = getpwent()) != NULL)
address@hidden endfile
address@hidden
address@hidden file eg/lib/pwcat.c
-#ifdef ZOS_USS
- printf("%s:%ld:%ld:%s:%s\n",
- p->pw_name, (long) p->pw_uid,
- (long) p->pw_gid, p->pw_dir, p->pw_shell);
-#else
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/pwcat.c
- printf("%s:%s:%ld:%ld:%s:%s:%s\n",
- p->pw_name, p->pw_passwd, (long) p->pw_uid,
- (long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);
address@hidden endfile
address@hidden
address@hidden file eg/lib/pwcat.c
-#endif
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/pwcat.c
address@hidden Exercise: Rewrite using split with "".
- endpwent();
- return 0;
address@hidden
address@hidden endfile
address@hidden Egrep Program
address@hidden Searching for Regular Expressions in Files
+
address@hidden STARTOFRANGE regexps
address@hidden regular expressions, searching for
address@hidden STARTOFRANGE sfregexp
address@hidden searching, files for regular expressions
address@hidden STARTOFRANGE fsregexp
address@hidden files, searching for regular expressions
address@hidden @command{egrep} utility
+The @command{egrep} utility searches files for patterns. It uses regular
+expressions that are almost identical to those available in @command{awk}
+(@pxref{Regexp}).
+You invoke it as follows:
+
address@hidden
+egrep @r{[} @var{options} @r{]} '@var{pattern}' @var{files} @dots{}
@end example
-If you don't understand C, don't worry about it.
-The output from @command{pwcat} is the user database, in the traditional
address@hidden/etc/passwd} format of colon-separated fields. The fields are:
+The @var{pattern} is a regular expression. In typical usage, the regular
+expression is quoted to prevent the shell from expanding any of the
+special characters as @value{FN} wildcards. Normally, @command{egrep}
+prints the lines that matched. If multiple @value{FN}s are provided on
+the command line, each output line is preceded by the name of the file
+and a colon.
address@hidden @asis
address@hidden Login name
-The user's login name.
+The options to @command{egrep} are as follows:
address@hidden Encrypted password
-The user's encrypted password. This may not be available on some systems.
address@hidden @code
address@hidden -c
+Print out a count of the lines that matched the pattern, instead of the
+lines themselves.
address@hidden User-ID
-The user's numeric user ID number.
-(On some systems it's a C @code{long}, and not an @code{int}. Thus
-we cast it to @code{long} for all cases.)
address@hidden -s
+Be silent. No output is produced and the exit value indicates whether
+the pattern was matched.
address@hidden Group-ID
-The user's numeric group ID number.
-(Similar comments about @code{long} vs.@: @code{int} apply here.)
address@hidden -v
+Invert the sense of the test. @command{egrep} prints the lines that do
address@hidden match the pattern and exits successfully if the pattern is not
+matched.
address@hidden Full name
-The user's full name, and perhaps other information associated with the
-user.
address@hidden -i
+Ignore case distinctions in both the pattern and the input data.
address@hidden Home directory
-The user's login (or ``home'') directory (familiar to shell programmers as
address@hidden).
address@hidden -l
+Only print (list) the names of the files that matched, not the lines that matched.
address@hidden Login shell
-The program that is run when the user logs in. This is usually a
-shell, such as Bash.
address@hidden -e @var{pattern}
+Use @var{pattern} as the regexp to match. The purpose of the @option{-e}
+option is to allow patterns that start with a @samp{-}.
@end table
-A few lines representative of @command{pwcat}'s output are as follows:
-
address@hidden Jacobs, Andrew
address@hidden Robbins, Arnold
address@hidden Robbins, Miriam
address@hidden
-$ @kbd{pwcat}
address@hidden root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
address@hidden nobody:*:65534:65534::/:
address@hidden daemon:*:1:1::/:
address@hidden sys:*:2:2::/:/bin/csh
address@hidden bin:*:3:3::/bin:
address@hidden arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
address@hidden miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
address@hidden andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh
address@hidden
address@hidden example
+This version uses the @code{getopt()} library function
+(@pxref{Getopt Function})
+and the file transition library program
+(@pxref{Filetrans Function}).
-With that introduction, following is a group of functions for getting user
-information. There are several functions here, corresponding to the C
-functions of the same names:
+The program begins with a descriptive comment and then a @code{BEGIN} rule
+that processes the command-line arguments with @code{getopt()}. The @option{-i}
+(ignore case) option is particularly easy with @command{gawk}; we just use the
address@hidden built-in variable
+(@pxref{Built-in Variables}):
address@hidden @code{_pw_init()} user-defined function
address@hidden @code{egrep.awk} program
@example
address@hidden file eg/lib/passwdawk.in
-# passwd.awk --- access password file information
address@hidden file eg/prog/egrep.awk
+# egrep.awk --- simulate egrep in awk
+#
@c endfile
@ignore
address@hidden file eg/lib/passwdawk.in
-#
address@hidden file eg/prog/egrep.awk
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
-# Revised October 2000
-# Revised December 2010
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/passwdawk.in
-
-BEGIN @{
- # tailor this to suit your system
- _pw_awklib = "/usr/local/libexec/awk/"
address@hidden
-
-function _pw_init( oldfs, oldrs, olddol0, pwcat, using_fw, using_fpat)
address@hidden
- if (_pw_inited)
- return
-
- oldfs = FS
- oldrs = RS
- olddol0 = $0
- using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
- using_fpat = (PROCINFO["FS"] == "FPAT")
- FS = ":"
- RS = "\n"
- pwcat = _pw_awklib "pwcat"
- while ((pwcat | getline) > 0) @{
- _pw_byname[$1] = $0
- _pw_byuid[$3] = $0
- _pw_bycount[++_pw_total] = $0
- @}
- close(pwcat)
- _pw_count = 0
- _pw_inited = 1
- FS = oldfs
- if (using_fw)
- FIELDWIDTHS = FIELDWIDTHS
- else if (using_fpat)
- FPAT = FPAT
- RS = oldrs
- $0 = olddol0
address@hidden
@c endfile
address@hidden example
-
address@hidden @code{BEGIN} pattern, @code{pwcat} program
-The @code{BEGIN} rule sets a private variable to the directory where
address@hidden is stored. Because it is used to help out an @command{awk}
library
-routine, we have chosen to put it in @file{/usr/local/libexec/awk};
-however, you might want it to be in a different directory on your system.
-
-The function @code{_pw_init()} keeps three copies of the user information
-in three associative arrays. The arrays are indexed by username
-(@code{_pw_byname}), by user ID number (@code{_pw_byuid}), and by order of
-occurrence (@code{_pw_bycount}).
-The variable @code{_pw_inited} is used for efficiency, since @code{_pw_init()}
-needs to be called only once.
-
address@hidden @code{getline} command, @code{_pw_init()} function
-Because this function uses @code{getline} to read information from
address@hidden, it first saves the values of @code{FS}, @code{RS}, and
@code{$0}.
-It notes in the variable @code{using_fw} whether field splitting
-with @code{FIELDWIDTHS} is in effect or not.
-Doing so is necessary, since these functions could be called
-from anywhere within a user's program, and the user may have his
-or her
-own way of splitting records and fields.
address@hidden ignore
address@hidden file eg/prog/egrep.awk
+# Options:
+# -c count of lines
+# -s silent - use exit value
+# -v invert test, success if no match
+# -i ignore case
+# -l print filenames only
+# -e argument is pattern
+#
+# Requires getopt and file transition library functions
address@hidden @code{PROCINFO} array
-The @code{using_fw} variable checks @code{PROCINFO["FS"]}, which
-is @code{"FIELDWIDTHS"} if field splitting is being done with
address@hidden This makes it possible to restore the correct
-field-splitting mechanism later. The test can only be true for
address@hidden It is false if using @code{FS} or @code{FPAT},
-or on some other @command{awk} implementation.
+BEGIN @{
+ while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) @{
+ if (c == "c")
+ count_only++
+ else if (c == "s")
+ no_print++
+ else if (c == "v")
+ invert++
+ else if (c == "i")
+ IGNORECASE = 1
+ else if (c == "l")
+ filenames_only++
+ else if (c == "e")
+ pattern = Optarg
+ else
+ usage()
+ @}
address@hidden endfile
address@hidden example
-The code that checks for using @code{FPAT}, using @code{using_fpat}
-and @code{PROCINFO["FS"]} is similar.
+Next comes the code that handles the @command{egrep}-specific behavior. If no
+pattern is supplied with @option{-e}, the first nonoption on the
+command line is used. The @command{awk} command-line arguments up to @code{ARGV[Optind]}
+are cleared, so that @command{awk} won't try to process them as files. If no
+files are specified, the standard input is used, and if multiple files are
+specified, we make sure to note this so that the @value{FN}s can precede the
+matched lines in the output:
-The main part of the function uses a loop to read database lines, split
-the line into fields, and then store the line into each array as necessary.
-When the loop is done, @address@hidden()}} cleans up by closing the pipeline,
-setting @address@hidden to one, and restoring @code{FS}
-(and @code{FIELDWIDTHS} or @code{FPAT}
-if necessary), @code{RS}, and @code{$0}.
-The use of @address@hidden is explained shortly.
address@hidden
address@hidden file eg/prog/egrep.awk
+ if (pattern == "")
+ pattern = ARGV[Optind++]
address@hidden @code{getpwnam()} function (C library)
-The @code{getpwnam()} function takes a username as a string argument. If that
-user is in the database, it returns the appropriate line. Otherwise, it
-relies on the array reference to a nonexistent
-element to create the element with the null string as its value:
+ for (i = 1; i < Optind; i++)
+ ARGV[i] = ""
+ if (Optind >= ARGC) @{
+ ARGV[1] = "-"
+ ARGC = 2
+ @} else if (ARGC - Optind > 1)
+ do_filenames++
address@hidden @code{getpwnam()} user-defined function
address@hidden
address@hidden
address@hidden file eg/lib/passwdawk.in
-function getpwnam(name)
address@hidden
- _pw_init()
- return _pw_byname[name]
+# if (IGNORECASE)
+# pattern = tolower(pattern)
@}
@c endfile
address@hidden group
@end example
address@hidden @code{getpwuid()} function (C library)
-Similarly,
-the @code{getpwuid} function takes a user ID number argument. If that
-user number is in the database, it returns the appropriate line. Otherwise, it
-returns the null string:
+The last two lines are commented out, since they are not needed in
address@hidden They should be uncommented if you have to use another version
+of @command{awk}.
+
+The next set of lines should be uncommented if you are not using
address@hidden This rule translates all the characters in the input line
+into lowercase if the @option{-i} option is address@hidden
+also introduces a subtle bug;
+if a match happens, we output the translated line, not the original.}
+The rule is
+commented out since it is not necessary with @command{gawk}:
+
address@hidden Exercise: Fix this, w/array and new line as key to original line
address@hidden @code{getpwuid()} user-defined function
@example
-@c file eg/lib/passwdawk.in
-function getpwuid(uid)
-@{
-    _pw_init()
-    return _pw_byuid[uid]
-@}
+@c file eg/prog/egrep.awk
+#@{
+#    if (IGNORECASE)
+#        $0 = tolower($0)
+#@}
@c endfile
@end example
address@hidden @code{getpwent()} function (C library)
-The @code{getpwent()} function simply steps through the database, one entry at
-a time. It uses @code{_pw_count} to track its current position in the
address@hidden array:
+The @code{beginfile()} function is called by the rule in @file{ftrans.awk}
+when each new file is processed. In this case, it is very simple; all it
+does is initialize a variable @code{fcount} to zero. @code{fcount} tracks
+how many lines in the current file matched the pattern.
+Naming the parameter @code{junk} shows we know that @code{beginfile()}
+is called with a parameter, but that we're not interested in its value:
address@hidden @code{getpwent()} user-defined function
@example
address@hidden file eg/lib/passwdawk.in
-function getpwent()
address@hidden file eg/prog/egrep.awk
+function beginfile(junk)
@{
- _pw_init()
- if (_pw_count < _pw_total)
- return _pw_bycount[++_pw_count]
- return ""
+ fcount = 0
@}
@c endfile
@end example
-@cindex @code{endpwent()} function (C library)
-The @code{@w{endpwent()}} function resets @code{@w{_pw_count}} to zero, so that
-subsequent calls to @code{getpwent()} start over again:
+The @code{endfile()} function is called after each file has been processed.
+It affects the output only when the user wants a count of the number of lines that
+matched. @code{no_print} is true only if the exit status is desired.
+@code{count_only} is true if line counts are desired. @command{egrep}
+therefore only prints line counts if printing and counting are enabled.
+The output format must be adjusted depending upon the number of files to
+process. Finally, @code{fcount} is added to @code{total}, so that we
+know the total number of lines that matched the pattern:
address@hidden @code{endpwent()} user-defined function
@example
address@hidden file eg/lib/passwdawk.in
-function endpwent()
address@hidden file eg/prog/egrep.awk
+function endfile(file)
@{
- _pw_count = 0
+ if (! no_print && count_only) @{
+ if (do_filenames)
+ print file ":" fcount
+ else
+ print fcount
+ @}
+
+ total += fcount
@}
@c endfile
@end example
-A conscious design decision in this suite is that each subroutine calls
-@code{@w{_pw_init()}} to initialize the database arrays.
-The overhead of running
-a separate process to generate the user database, and the I/O to scan it,
-are only incurred if the user's main program actually calls one of these
-functions. If this library file is loaded along with a user's program, but
-none of the routines are ever called, then there is no extra runtime overhead.
-(The alternative is to move the body of @code{@w{_pw_init()}} into a
-@code{BEGIN} rule, which always runs @command{pwcat}. This simplifies the
-code but runs an extra process that may never be needed.)
-
-In turn, calling @code{_pw_init()} is not too expensive, because the
-@code{_pw_inited} variable keeps the program from reading the data more than
-once. If you are worried about squeezing every last cycle out of your
-@command{awk} program, the check of @code{_pw_inited} could be moved out of
-@code{_pw_init()} and duplicated in all the other functions. In practice,
-this is not necessary, since most @command{awk} programs are I/O-bound,
-and such a change would clutter up the code.
-
-The @command{id} program in @ref{Id Program},
-uses these functions.
address@hidden ENDOFRANGE libfudata
address@hidden ENDOFRANGE flibudata
address@hidden ENDOFRANGE udatar
address@hidden ENDOFRANGE dataur
-
-@node Group Functions
-@section Reading the Group Database
+The following rule does most of the work of matching lines. The variable
+@code{matches} is true if the line matched the pattern. If the user
+wants lines that did not match, the sense of @code{matches} is inverted
+using the @samp{!} operator. @code{fcount} is incremented with the value of
+@code{matches}, which is either one or zero, depending upon a
+successful or unsuccessful match. If the line does not match, the
+@code{next} statement just moves on to the next record.
address@hidden STARTOFRANGE libfgdata
address@hidden libraries of @command{awk} functions, group database, reading
address@hidden STARTOFRANGE flibgdata
address@hidden functions, library, group database, reading
address@hidden STARTOFRANGE gdatar
address@hidden group database, reading
address@hidden STARTOFRANGE datagr
address@hidden database, group, reading
address@hidden @code{PROCINFO} array
address@hidden @code{getgrent()} function (C library)
address@hidden @code{getgrent()} user-defined function
address@hidden address@hidden information about
address@hidden account information
address@hidden group file
address@hidden files, group
-Much of the discussion presented in
-@ref{Passwd Functions},
-applies to the group database as well. Although there has traditionally
-been a well-known file (@file{/etc/group}) in a well-known format, the POSIX
-standard only provides a set of C library routines
-(@code{<grp.h>} and @code{getgrent()})
-for accessing the information.
-Even though this file may exist, it may not have
-complete information. Therefore, as with the user database, it is necessary
-to have a small C program that generates the group database as its output.
-@command{grcat}, a C program that ``cats'' the group database,
-is as follows:
+A number of additional tests are made, but they are only done if we
+are not counting lines. First, if the user only wants exit status
+(@code{no_print} is true), then it is enough to know that @emph{one}
+line in this file matched, and we can skip on to the next file with
+@code{nextfile}. Similarly, if we are only printing @value{FN}s, we can
+print the @value{FN}, and then skip to the next file with @code{nextfile}.
+Finally, each line is printed, with a leading @value{FN} and colon
+if necessary:
address@hidden @command{grcat} program
address@hidden @code{!} (exclamation point), @code{!} operator
address@hidden exclamation point (@code{!}), @code{!} operator
@example
address@hidden file eg/lib/grcat.c
-/*
- * grcat.c
- *
- * Generate a printable version of the group database
- */
address@hidden endfile
address@hidden
address@hidden file eg/lib/grcat.c
-/*
- * Arnold Robbins, arnold@@skeeve.com, May 1993
- * Public Domain
- * December 2010, move to ANSI C definition for main().
- */
-
-/* For OS/2, do nothing. */
-#if HAVE_CONFIG_H
-#include <config.h>
-#endif
-
-#if defined (STDC_HEADERS)
-#include <stdlib.h>
-#endif
-
-#ifndef HAVE_GETGRENT
-int main() { return 0; }
-#else
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/grcat.c
-#include <stdio.h>
-#include <grp.h>
-
-int
-main(int argc, char **argv)
address@hidden file eg/prog/egrep.awk
@{
- struct group *g;
- int i;
+ matches = ($0 ~ pattern)
+ if (invert)
+ matches = ! matches
- while ((g = getgrent()) != NULL) @{
address@hidden endfile
address@hidden
address@hidden file eg/lib/grcat.c
-#ifdef ZOS_USS
- printf("%s:%ld:", g->gr_name, (long) g->gr_gid);
-#else
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/grcat.c
- printf("%s:%s:%ld:", g->gr_name, g->gr_passwd,
- (long) g->gr_gid);
address@hidden endfile
address@hidden
address@hidden file eg/lib/grcat.c
-#endif
address@hidden endfile
address@hidden ignore
address@hidden file eg/lib/grcat.c
- for (i = 0; g->gr_mem[i] != NULL; i++) @{
- printf("%s", g->gr_mem[i]);
address@hidden
- if (g->gr_mem[i+1] != NULL)
- putchar(',');
+ fcount += matches # 1 or 0
+
+ if (! matches)
+ next
+
+ if (! count_only) @{
+ if (no_print)
+ nextfile
+
+ if (filenames_only) @{
+ print FILENAME
+ nextfile
@}
address@hidden group
- putchar('\n');
+
+ if (do_filenames)
+ print FILENAME ":" $0
+ else
+ print
@}
- endgrent();
- return 0;
@}
@c endfile
address@hidden
address@hidden file eg/lib/grcat.c
-#endif /* HAVE_GETGRENT */
address@hidden example
+
+The @code{END} rule takes care of producing the correct exit status. If
+there are no matches, the exit status is one; otherwise it is zero:
+
+@example
+@c file eg/prog/egrep.awk
+END \
+@{
+    if (total == 0)
+        exit 1
+    exit 0
+@}
@c endfile
address@hidden ignore
@end example
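The exit-status convention described for the END rule can be checked from the shell. The following is a minimal sketch and not part of egrep.awk: the helper name match_status is hypothetical, and its one-line matching rule stands in for the full rule of the real program.

```shell
# match_status PATTERN: read standard input, print nothing, and exit
# with status 0 if at least one line matched PATTERN, 1 otherwise.
# (Hypothetical helper; it mirrors only the END rule shown above.)
match_status() {
    awk -v pattern="$1" '
        $0 ~ pattern { total++ }
        END {
            if (total == 0)
                exit 1
            exit 0
        }'
}

printf 'alpha\nbeta\n' | match_status 'bet' && echo "matched"
printf 'alpha\nbeta\n' | match_status 'zzz' || echo "no match"
```

This follows the usual shell convention that a zero status means success, so the helper can be used directly in an if or && test.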
-Each line in the group database represents one group. The fields are
-separated with colons and represent the following information:
+The @code{usage()} function prints a usage message in case of invalid options,
+and then exits:
address@hidden @asis
address@hidden Group Name
-The group's name.
+@example
+@c file eg/prog/egrep.awk
+function usage( e)
+@{
+    e = "Usage: egrep [-csvil] [-e pat] [files ...]"
+    e = e "\n\tegrep [-csvil] pat [files ...]"
+    print e > "/dev/stderr"
+    exit 1
+@}
+@c endfile
+@end example
address@hidden Group Password
-The group's encrypted password. In practice, this field is never used;
-it is usually empty or set to @samp{*}.
+The variable @code{e} is used so that the function fits nicely
+on the printed page.
address@hidden Group ID Number
-The group's numeric group ID number;
-this number must be unique within the file.
-(On some systems it's a C @code{long}, and not an @code{int}. Thus
-we cast it to @code{long} for all cases.)
+@cindex @code{END} pattern, backslash continuation and
+@cindex @code{\} (backslash), continuing lines and
+@cindex backslash (@code{\}), continuing lines and
+Just a note on programming style: you may have noticed that the @code{END}
+rule uses backslash continuation, with the open brace on a line by
+itself. This is so that it more closely resembles the way functions
+are written. Many of the examples
+in this @value{CHAPTER}
+use this style. You can decide for yourself if you like writing
+your @code{BEGIN} and @code{END} rules this way
+or not.
address@hidden ENDOFRANGE regexps
address@hidden ENDOFRANGE sfregexp
address@hidden ENDOFRANGE fsregexp
address@hidden Group Member List
-A comma-separated list of user names. These users are members of the group.
-Modern Unix systems allow users to be members of several groups
-simultaneously. If your system does, then there are elements
-@code{"group1"} through @code{"group@var{N}"} in @code{PROCINFO}
-for those group ID numbers.
-(Note that @code{PROCINFO} is a @command{gawk} extension;
-@pxref{Built-in Variables}.)
address@hidden table
+@node Id Program
+@subsection Printing out User Information
-Here is what running @command{grcat} might produce:
address@hidden printing, user information
address@hidden users, information about, printing
address@hidden @command{id} utility
+The @command{id} utility lists a user's real and effective user ID numbers,
+real and effective group ID numbers, and the user's group set, if any.
+@command{id} only prints the effective user ID and group ID if they are
+different from the real ones. If possible, @command{id} also supplies the
+corresponding user and group names. The output might look like this:
@example
-$ @kbd{grcat}
address@hidden wheel:*:0:arnold
address@hidden nogroup:*:65534:
address@hidden daemon:*:1:
address@hidden kmem:*:2:
address@hidden staff:*:10:arnold,miriam,andy
address@hidden other:*:20:
address@hidden
+$ @kbd{id}
address@hidden uid=500(arnold) gid=500(arnold) groups=6(disk),7(lp),19(floppy)
@end example
-Here are the functions for obtaining information from the group database.
-There are several, modeled after the C library functions of the same names:
address@hidden @code{PROCINFO} array
+This information is part of what is provided by @command{gawk}'s
+@code{PROCINFO} array (@pxref{Built-in Variables}).
+However, the @command{id} utility provides a more palatable output than just
+individual numbers.
address@hidden @code{getline} command, @code{_gr_init()} user-defined function
address@hidden @code{_gr_init()} user-defined function
+Here is a simple version of @command{id} written in @command{awk}.
+It uses the user database library functions
+(@pxref{Passwd Functions})
+and the group database library functions
+(@pxref{Group Functions}):
+
+The program is fairly straightforward. All the work is done in the
+@code{BEGIN} rule. The user and group ID numbers are obtained from
+@code{PROCINFO}.
+The code is repetitive. The entry in the user database for the real user ID
+number is split into parts at the @samp{:}. The name is the first field.
+Similar code is used for the effective user ID number and the group
+numbers:
+
address@hidden @code{id.awk} program
@example
address@hidden file eg/lib/groupawk.in
-# group.awk --- functions for dealing with the group file
address@hidden file eg/prog/id.awk
+# id.awk --- implement id in awk
+#
+# Requires user and group library functions
@c endfile
@ignore
address@hidden file eg/lib/groupawk.in
address@hidden file eg/prog/id.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
-# Revised October 2000
-# Revised December 2010
+# Revised February 1996
+
@c endfile
@end ignore
address@hidden line break on _gr_init for smallbook
address@hidden file eg/lib/groupawk.in
address@hidden file eg/prog/id.awk
+# output is:
+# uid=12(foo) euid=34(bar) gid=3(baz) \
+# egid=5(blat) groups=9(nine),2(two),1(one)
address@hidden
BEGIN \
@{
- # Change to suit your system
- _gr_awklib = "/usr/local/libexec/awk/"
address@hidden
+ uid = PROCINFO["uid"]
+ euid = PROCINFO["euid"]
+ gid = PROCINFO["gid"]
+ egid = PROCINFO["egid"]
address@hidden group
-function _gr_init( oldfs, oldrs, olddol0, grcat,
- using_fw, using_fpat, n, a, i)
address@hidden
- if (_gr_inited)
- return
+ printf("uid=%d", uid)
+ pw = getpwuid(uid)
+ if (pw != "") @{
+ split(pw, a, ":")
+ printf("(%s)", a[1])
+ @}
- oldfs = FS
- oldrs = RS
- olddol0 = $0
- using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
- using_fpat = (PROCINFO["FS"] == "FPAT")
- FS = ":"
- RS = "\n"
+ if (euid != uid) @{
+ printf(" euid=%d", euid)
+ pw = getpwuid(euid)
+ if (pw != "") @{
+ split(pw, a, ":")
+ printf("(%s)", a[1])
+ @}
+ @}
- grcat = _gr_awklib "grcat"
- while ((grcat | getline) > 0) @{
- if ($1 in _gr_byname)
- _gr_byname[$1] = _gr_byname[$1] "," $4
- else
- _gr_byname[$1] = $0
- if ($3 in _gr_bygid)
- _gr_bygid[$3] = _gr_bygid[$3] "," $4
- else
- _gr_bygid[$3] = $0
+ printf(" gid=%d", gid)
+ pw = getgrgid(gid)
+ if (pw != "") @{
+ split(pw, a, ":")
+ printf("(%s)", a[1])
+ @}
- n = split($4, a, "[ \t]*,[ \t]*")
- for (i = 1; i <= n; i++)
- if (a[i] in _gr_groupsbyuser)
- _gr_groupsbyuser[a[i]] = \
- _gr_groupsbyuser[a[i]] " " $1
- else
- _gr_groupsbyuser[a[i]] = $1
+ if (egid != gid) @{
+ printf(" egid=%d", egid)
+ pw = getgrgid(egid)
+ if (pw != "") @{
+ split(pw, a, ":")
+ printf("(%s)", a[1])
+ @}
+ @}
- _gr_bycount[++_gr_count] = $0
+ for (i = 1; ("group" i) in PROCINFO; i++) @{
+ if (i == 1)
+ printf(" groups=")
+ group = PROCINFO["group" i]
+ printf("%d", group)
+ pw = getgrgid(group)
+ if (pw != "") @{
+ split(pw, a, ":")
+ printf("(%s)", a[1])
+ @}
+ if (("group" (i+1)) in PROCINFO)
+ printf(",")
@}
- close(grcat)
- _gr_count = 0
- _gr_inited++
- FS = oldfs
- if (using_fw)
- FIELDWIDTHS = FIELDWIDTHS
- else if (using_fpat)
- FPAT = FPAT
- RS = oldrs
- $0 = olddol0
+
+ print ""
@}
@c endfile
@end example
-The @code{BEGIN} rule sets a private variable to the directory where
-@command{grcat} is stored. Because it is used to help out an @command{awk} library
-routine, we have chosen to put it in @file{/usr/local/libexec/awk}. You might
-want it to be in a different directory on your system.
-
-These routines follow the same general outline as the user database routines
-(@pxref{Passwd Functions}).
-The @code{@w{_gr_inited}} variable is used to
-ensure that the database is scanned no more than once.
-The @code{@w{_gr_init()}} function first saves @code{FS},
-@code{RS}, and
-@code{$0}, and then sets @code{FS} and @code{RS} to the correct values for
-scanning the group information.
-It also takes care to note whether @code{FIELDWIDTHS} or @code{FPAT}
-is being used, and to restore the appropriate field splitting mechanism.
address@hidden @code{in} operator
+The test in the @code{for} loop is worth noting.
+Any supplementary groups in the @code{PROCINFO} array have the
+indices @code{"group1"} through @code{"group@var{N}"} for some
+@var{N}, i.e., the total number of supplementary groups.
+However, we don't know in advance how many of these groups
+there are.
-The group information is stored in several associative arrays.
-The arrays are indexed by group name (@code{@w{_gr_byname}}), by group ID number
-(@code{@w{_gr_bygid}}), and by position in the database (@code{@w{_gr_bycount}}).
-There is an additional array indexed by user name (@code{@w{_gr_groupsbyuser}}),
-which is a space-separated list of groups to which each user belongs.
+This loop works by starting at one, concatenating the value with
+@code{"group"}, and then using @code{in} to see if that value is
+in the array. Eventually, @code{i} is incremented past
+the last group in the array and the loop exits.
-Unlike the user database, it is possible to have multiple records in the
-database for the same group. This is common when a group has a large number
-of members. A pair of such entries might look like the following:
+The loop is also correct if there are @emph{no} supplementary
+groups; then the condition is false the first time it's
+tested, and the loop body never executes.
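The loop test can be tried outside of id.awk. The following is a minimal sketch under stated assumptions: the helper name count_groups and the array name info are hypothetical, and info stands in for PROCINFO (which only gawk provides) so that the sketch runs with any POSIX awk.

```shell
# count_groups N: build an array with keys "group1" .. "groupN" and
# count them with the same kind of loop test that id.awk uses.
# (Hypothetical helper; "info" is a stand-in for PROCINFO.)
count_groups() {
    awk -v n="$1" 'BEGIN {
        for (i = 1; i <= n; i++)
            info["group" i] = 100 + i       # made-up group ID numbers
        found = 0
        for (i = 1; ("group" i) in info; i++)
            found++
        print found
    }'
}

count_groups 3    # prints 3
count_groups 0    # prints 0: the test fails at once, the body never runs
```

Because the in operator only tests for membership, probing one index past the last group does not create a new array element.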
address@hidden
-tvpeople:*:101:johnny,jay,arsenio
-tvpeople:*:101:david,conan,tom,joan
address@hidden example
address@hidden exercise!!!
address@hidden
+The POSIX version of @command{id} takes arguments that control which
+information is printed. Modify this version to accept the same
+arguments and perform in the same way.
address@hidden ignore
-For this reason, @code{_gr_init()} looks to see if a group name or
-group ID number is already seen. If it is, then the user names are
-simply concatenated onto the previous list of users. (There is actually a
-subtle problem with the code just presented. Suppose that
-the first time there were no names. This code adds the names with
-a leading comma. It also doesn't check that there is a @code{$4}.)
+@node Split Program
+@subsection Splitting a Large File into Pieces
-Finally, @code{_gr_init()} closes the pipeline to @command{grcat}, restores
-@code{FS} (and @code{FIELDWIDTHS} or @code{FPAT} if necessary), @code{RS}, and @code{$0},
-initializes @code{_gr_count} to zero
-(it is used later), and makes @code{_gr_inited} nonzero.
address@hidden FIXME: One day, update to current POSIX version of split
address@hidden @code{getgrnam()} function (C library)
-The @code{getgrnam()} function takes a group name as its argument, and if that
-group exists, it is returned.
-Otherwise, it
-relies on the array reference to a nonexistent
-element to create the element with the null string as its value:
address@hidden STARTOFRANGE filspl
address@hidden files, splitting
address@hidden @code{split} utility
+The @command{split} program splits large text files into smaller pieces.
+Usage is as follows:@footnote{This is the traditional usage. The
+POSIX usage is different, but not relevant for what the program
+aims to demonstrate.}
address@hidden @code{getgrnam()} user-defined function
@example
address@hidden file eg/lib/groupawk.in
-function getgrnam(group)
address@hidden
- _gr_init()
- return _gr_byname[group]
address@hidden
address@hidden endfile
+split @r{[}@code{-@var{count}}@r{]} file @r{[} @var{prefix} @r{]}
@end example
address@hidden @code{getgrgid()} function (C library)
-The @code{getgrgid()} function is similar; it takes a numeric group ID and
-looks up the information associated with that group ID:
+By default,
+the output files are named @file{xaa}, @file{xab}, and so on. Each file has
+1000 lines in it, with the likely exception of the last file. To change the
+number of lines in each file, supply a number on the command line
+preceded with a minus; e.g., @samp{-500} for files with 500 lines in them
+instead of 1000. To change the name of the output files to something like
+@file{myfileaa}, @file{myfileab}, and so on, supply an additional
+argument that specifies the @value{FN} prefix.
address@hidden @code{getgrgid()} user-defined function
address@hidden
address@hidden file eg/lib/groupawk.in
-function getgrgid(gid)
address@hidden
- _gr_init()
- return _gr_bygid[gid]
address@hidden
address@hidden endfile
address@hidden example
+Here is a version of @command{split} in @command{awk}. It uses the
+@code{ord()} and @code{chr()} functions presented in
+@ref{Ordinal Functions}.
address@hidden @code{getgruser()} function (C library)
-The @code{getgruser()} function does not have a C counterpart. It takes a
-user name and returns the list of groups that have the user as a member:
+The program first sets its defaults, and then tests to make sure there are
+not too many arguments. It then looks at each argument in turn. The
+first argument could be a minus sign followed by a number. If it is, this happens
+to look like a negative number, so it is made positive, and that is the
+count of lines. The data @value{FN} is skipped over and the final argument
+is used as the prefix for the output @value{FN}s:
address@hidden @code{getgruser()} function, user-defined
address@hidden @code{split.awk} program
@example
address@hidden file eg/lib/groupawk.in
-function getgruser(user)
address@hidden
- _gr_init()
- return _gr_groupsbyuser[user]
address@hidden
address@hidden file eg/prog/split.awk
+# split.awk --- do split in awk
+#
+# Requires ord() and chr() library functions
@c endfile
address@hidden example
-
address@hidden @code{getgrent()} function (C library)
-The @code{getgrent()} function steps through the database one entry at a time.
-It uses @code{_gr_count} to track its position in the list:
address@hidden
address@hidden file eg/prog/split.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# May 1993
address@hidden @code{getgrent()} user-defined function
address@hidden
address@hidden file eg/lib/groupawk.in
-function getgrent()
address@hidden
- _gr_init()
- if (++_gr_count in _gr_bycount)
- return _gr_bycount[_gr_count]
- return ""
address@hidden
@c endfile
address@hidden example
address@hidden ENDOFRANGE clibf
address@hidden ignore
address@hidden file eg/prog/split.awk
+# usage: split [-num] [file] [outname]
address@hidden @code{endgrent()} function (C library)
-The @code{endgrent()} function resets @code{_gr_count} to zero so that @code{getgrent()} can
-start over again:
+BEGIN @{
+ outfile = "x" # default
+ count = 1000
+ if (ARGC > 4)
+ usage()
address@hidden @code{endgrent()} user-defined function
address@hidden
address@hidden file eg/lib/groupawk.in
-function endgrent()
address@hidden
- _gr_count = 0
+ i = 1
+ if (ARGV[i] ~ /^-[[:digit:]]+$/) @{
+ count = -ARGV[i]
+ ARGV[i] = ""
+ i++
+ @}
+ # test argv in case reading from stdin instead of file
+ if (i in ARGV)
+ i++ # skip data file name
+ if (i in ARGV) @{
+ outfile = ARGV[i]
+ ARGV[i] = ""
+ @}
+
+ s1 = s2 = "a"
+ out = (outfile s1 s2)
@}
@c endfile
@end example
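The handling of the count argument can be tried in isolation. The following is a minimal sketch, not part of split.awk: the helper name line_count is hypothetical, and it uses [0-9] where split.awk uses [[:digit:]], on the assumption that some older awk versions lack POSIX character classes.

```shell
# line_count ARG: print the line count encoded in ARG ("-500" gives 500),
# or the default of 1000 when ARG is not a minus sign followed by digits.
# (Hypothetical helper that mirrors the first test in the BEGIN rule.)
line_count() {
    awk -v arg="$1" 'BEGIN {
        count = 1000            # default, as in split.awk
        if (arg ~ /^-[0-9]+$/)
            count = -arg        # unary minus makes the value positive
        print count
    }'
}

line_count -500      # prints 500
line_count myfile    # prints 1000
```

Negating the argument works because awk converts a numeric-looking string such as "-500" to a number before applying the unary minus.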
-As with the user database routines, each function calls @code{_gr_init()} to
-initialize the arrays. Doing so only incurs the extra overhead of running
-@command{grcat} if these functions are used (as opposed to moving the body of
-@code{_gr_init()} into a @code{BEGIN} rule).
-
-Most of the work is in scanning the database and building the various
-associative arrays. The functions that the user calls are themselves very
-simple, relying on @command{awk}'s associative arrays to do work.
-
-The @command{id} program in @ref{Id Program},
-uses these functions.
-
-@node Walking Arrays
-@section Traversing Arrays of Arrays
-
-@ref{Arrays of Arrays}, described how @command{gawk}
-provides arrays of arrays. In particular, any element of
-an array may be either a scalar, or another array. The
-@code{isarray()} function (@pxref{Type Functions})
-lets you distinguish an array
-from a scalar.
-The following function, @code{walk_array()}, recursively traverses
-an array, printing each element's indices and value.
-You call it with the array and a string representing the name
-of the array:
+The next rule does most of the work. @code{tcount} (temporary count) tracks
+how many lines have been printed to the output file so far. If it is greater
+than @code{count}, it is time to close the current file and start a new one.
+@code{s1} and @code{s2} track the current suffixes for the @value{FN}. If
+they are both @samp{z}, the file is just too big. Otherwise, @code{s1}
+moves to the next letter in the alphabet and @code{s2} starts over again at
+@samp{a}:
address@hidden @code{walk_array()} user-defined function
address@hidden else on separate line here for page breaking
@example
address@hidden file eg/lib/walkarray.awk
-function walk_array(arr, name, i)
address@hidden file eg/prog/split.awk
@{
- for (i in arr) @{
- if (isarray(arr[i]))
- walk_array(arr[i], (name "[" i "]"))
+ if (++tcount > count) @{
+ close(out)
+ if (s2 == "z") @{
+ if (s1 == "z") @{
+ printf("split: %s is too large to split\n",
+ FILENAME) > "/dev/stderr"
+ exit 1
+ @}
+ s1 = chr(ord(s1) + 1)
+ s2 = "a"
+ @}
+@group
else
- printf("%s[%s] = %s\n", name, i, arr[i])
+ s2 = chr(ord(s2) + 1)
+@end group
+ out = (outfile s1 s2)
+ tcount = 1
@}
+ print > out
@}
@c endfile
@end example
+@c Exercise: do this with just awk builtin functions, index("abc..."), substr, etc.
+
@noindent
-It works by looping over each element of the array. If any given
-element is itself an array, the function calls itself recursively,
-passing the subarray and a new string representing the current index.
-Otherwise, the function simply prints the element's name, index, and value.
-Here is a main program to demonstrate:
+The @code{usage()} function simply prints an error message and exits:
@example
-BEGIN @{
- a[1] = 1
- a[2][1] = 21
- a[2][2] = 22
- a[3] = 3
- a[4][1][1] = 411
- a[4][2] = 42
-
- walk_array(a, "a")
address@hidden file eg/prog/split.awk
+function usage( e)
address@hidden
+ e = "usage: split [-num] [file] [outname]"
+ print e > "/dev/stderr"
+ exit 1
@}
address@hidden endfile
@end example
-When run, the program produces the following output:
-
address@hidden
-$ @kbd{gawk -f walk_array.awk}
address@hidden a[4][1][1] = 411
address@hidden a[4][2] = 42
address@hidden a[1] = 1
address@hidden a[2][1] = 21
address@hidden a[2][2] = 22
address@hidden a[3] = 3
address@hidden example
-
address@hidden ENDOFRANGE libfgdata
address@hidden ENDOFRANGE flibgdata
address@hidden ENDOFRANGE gdatar
address@hidden ENDOFRANGE libf
address@hidden ENDOFRANGE flib
address@hidden ENDOFRANGE fudlib
address@hidden ENDOFRANGE datagr
-
-@node Sample Programs
-@chapter Practical @command{awk} Programs
address@hidden STARTOFRANGE awkpex
address@hidden @command{awk} programs, examples of
-
-@ref{Library Functions},
-presents the idea that reading programs in a language contributes to
-learning that language. This @value{CHAPTER} continues that theme,
-presenting a potpourri of @command{awk} programs for your reading
-enjoyment.
+@noindent
+The variable @code{e} is used so that the function
+fits nicely on the
+@ifinfo
+screen.
+@end ifinfo
@ifnotinfo
-There are three sections.
-The first describes how to run the programs presented
-in this @value{CHAPTER}.
-
-The second presents @command{awk}
-versions of several common POSIX utilities.
-These are programs that you are hopefully already familiar with,
-and therefore, whose problems are understood.
-By reimplementing these programs in @command{awk},
-you can focus on the @command{awk}-related aspects of solving
-the programming problem.
-
-The third is a grab bag of interesting programs.
-These solve a number of different data-manipulation and management
-problems. Many of the programs are short, which emphasizes @command{awk}'s
-ability to do a lot in just a few lines of code.
+page.
@end ifnotinfo
-Many of these programs use library functions presented in
-@ref{Library Functions}.
+This program is a bit sloppy; it relies on @command{awk} to automatically close the last file
+instead of doing it in an @code{END} rule.
+It also assumes that letters are contiguous in the character set,
+which isn't true for EBCDIC systems.
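One way to avoid the contiguity assumption is to look letters up in an explicit alphabet string with index() and substr() instead of calling ord() and chr(). The following is a minimal sketch of that idea and not the manual's own solution: the helper name next_suffix is hypothetical, and it assumes its argument is exactly two lowercase letters.

```shell
# next_suffix SUFFIX: print the two-letter suffix that follows SUFFIX
# ("aa" -> "ab", "az" -> "ba"), or exit with status 1 after "zz".
# (Hypothetical helper; the alphabet string replaces ord() and chr().)
next_suffix() {
    awk -v suffix="$1" 'BEGIN {
        letters = "abcdefghijklmnopqrstuvwxyz"
        p1 = index(letters, substr(suffix, 1, 1))
        p2 = index(letters, substr(suffix, 2, 1))
        if (p2 < 26)
            p2++                # advance the second letter
        else if (p1 < 26) {
            p1++                # carry into the first letter
            p2 = 1
        } else
            exit 1              # "zz" has no successor
        print substr(letters, p1, 1) substr(letters, p2, 1)
    }'
}

next_suffix aa    # prints ab
next_suffix az    # prints ba
```

Since the positions come from the string itself, the result does not depend on how the character set orders its letters.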
address@hidden
-* Running Examples:: How to run these examples.
-* Clones:: Clones of common utilities.
-* Miscellaneous Programs:: Some interesting @command{awk} programs.
address@hidden menu
address@hidden Exercise: Fix these problems.
address@hidden BFD...
address@hidden ENDOFRANGE filspl
-@node Running Examples
-@section Running the Example Programs
+@node Tee Program
+@subsection Duplicating Output into Multiple Files
-To run a given program, you would typically do something like this:
+@cindex files, multiple@comma{} duplicating output into
+@cindex output, duplicating into files
+@cindex @code{tee} utility
+The @code{tee} program is known as a ``pipe fitting.'' @code{tee} copies
+its standard input to its standard output and also duplicates it to the
+files named on the command line. Its usage is as follows:
@example
-awk -f @var{program} -- @var{options} @var{files}
+tee @r{[}@option{-a}@r{]} file @dots{}
@end example
address@hidden
-Here, @var{program} is the name of the @command{awk} program (such as
address@hidden), @var{options} are any command-line options for the
-program that start with a @samp{-}, and @var{files} are the actual @value{DF}s.
+The @option{-a} option tells @code{tee} to append to the named files, instead of
+truncating them and starting over.
-If your system supports the @samp{#!} executable interpreter mechanism
-(@pxref{Executable Scripts}),
-you can instead run your program directly:
+The @code{BEGIN} rule first makes a copy of all the command-line arguments
+into an array named @code{copy}.
+@code{ARGV[0]} is not copied, since it is not needed.
+@code{tee} cannot use @code{ARGV} directly, since @command{awk} attempts to
+process each @value{FN} in @code{ARGV} as input data.
+
+@cindex flag variables
+If the first argument is @option{-a}, then the flag variable
+@code{append} is set to true, and both @code{ARGV[1]} and
+@code{copy[1]} are deleted. If @code{ARGC} is less than two, then no
+@value{FN}s were supplied and @code{tee} prints a usage message and exits.
+Finally, @command{awk} is forced to read the standard input by setting
+@code{ARGV[1]} to @code{"-"} and @code{ARGC} to two:
+@cindex @code{tee.awk} program
@example
-cut.awk -c1-8 myfiles > results
+@c file eg/prog/tee.awk
+# tee.awk --- tee in awk
+#
+# Copy standard input to all named output files.
+# Append content if -a option is supplied.
+#
+@c endfile
+@ignore
+@c file eg/prog/tee.awk
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# May 1993
+# Revised December 1995
+
+@c endfile
+@end ignore
+@c file eg/prog/tee.awk
+BEGIN \
+@{
+    for (i = 1; i < ARGC; i++)
+        copy[i] = ARGV[i]
+
+    if (ARGV[1] == "-a") @{
+        append = 1
+        delete ARGV[1]
+        delete copy[1]
+        ARGC--
+    @}
+    if (ARGC < 2) @{
+        print "usage: tee [-a] file ..." > "/dev/stderr"
+        exit 1
+    @}
+    ARGV[1] = "-"
+    ARGC = 2
+@}
+@c endfile
@end example
-If your @command{awk} is not @command{gawk}, you may instead need to use this:
+The following single rule does all the work. Since there is no pattern, it is
+executed for each line of input. The body of the rule simply prints the
+line into each file on the command line, and then to the standard output:
@example
-cut.awk -- -c1-8 myfiles > results
+@c file eg/prog/tee.awk
+@{
+    # moving the if outside the loop makes it run faster
+    if (append)
+        for (i in copy)
+            print >> copy[i]
+    else
+        for (i in copy)
+            print > copy[i]
+    print
+@}
+@c endfile
@end example
-@node Clones
-@section Reinventing Wheels for Fun and Profit
-@c STARTOFRANGE posimawk
-@cindex POSIX, programs@comma{} implementing in @command{awk}
+@noindent
+It is also possible to write the loop this way:
-This @value{SECTION} presents a number of POSIX utilities implemented in
-@command{awk}. Reinventing these programs in @command{awk} is often enjoyable,
-because the algorithms can be very clearly expressed, and the code is usually
-very concise and simple. This is true because @command{awk} does so much for you.
+@example
+for (i in copy)
+    if (append)
+        print >> copy[i]
+    else
+        print > copy[i]
+@end example
-It should be noted that these programs are not necessarily intended to
-replace the installed versions on your system.
-Nor may all of these programs be fully compliant with the most recent
-POSIX standard. This is not a problem; their
-purpose is to illustrate @command{awk} language programming for ``real world''
-tasks.
+@noindent
+This is more concise but it is also less efficient. The @samp{if} is
+tested for each record and for each output file. By duplicating the loop
+body, the @samp{if} is only tested once for each input record. If there are
+@var{N} input records and @var{M} output files, the first method only
+executes @var{N} @samp{if} statements, while the second executes
+@var{N}@code{*}@var{M} @samp{if} statements.
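The difference in the number of tests can be checked with a small experiment. The following is only a sketch, not part of the patch: the function name count_tests, the variable form, and the two placeholder entries in copy[] are made up for illustration, and a counter stands in for the actual if test and print statements.

```shell
# Count how often the "append" test would run for each loop form.
# N = 3 input records, M = 2 entries in copy[] (placeholder names).
count_tests() {
    printf 'a\nb\nc\n' | awk -v form="$1" '
    BEGIN { copy[1] = "out1"; copy[2] = "out2" }
    {
        if (form == "hoisted") {
            tests++              # tested once per record
        } else {
            for (i in copy)
                tests++          # tested once per record per file
        }
    }
    END { print tests }'
}

count_tests hoisted    # prints 3 (N)
count_tests inner      # prints 6 (N*M)
```

With only a few output files the saving is small; it grows with the number of files named on the command line.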
-The programs are presented in alphabetical order.
+Finally, the @code{END} rule cleans up by closing all the output files:
-@menu
-* Cut Program:: The @command{cut} utility.
-* Egrep Program:: The @command{egrep} utility.
-* Id Program:: The @command{id} utility.
-* Split Program:: The @command{split} utility.
-* Tee Program:: The @command{tee} utility.
-* Uniq Program:: The @command{uniq} utility.
-* Wc Program:: The @command{wc} utility.
-@end menu
+@example
+@c file eg/prog/tee.awk
+END \
+@{
+    for (i in copy)
+        close(copy[i])
+@}
+@c endfile
+@end example
-@node Cut Program
-@subsection Cutting out Fields and Columns
+@node Uniq Program
+@subsection Printing Nonduplicated Lines of Text
-@cindex @command{cut} utility
-@c STARTOFRANGE cut
-@cindex @command{cut} utility
-@c STARTOFRANGE ficut
-@cindex fields, cutting
-@c STARTOFRANGE colcut
-@cindex columns, cutting
-The @command{cut} utility selects, or ``cuts,'' characters or fields
-from its standard input and sends them to its standard output.
-Fields are separated by TABs by default,
-but you may supply a command-line option to change the field
-@dfn{delimiter} (i.e., the field-separator character). @command{cut}'s
-definition of fields is less general than @command{awk}'s.
+@c FIXME: One day, update to current POSIX version of uniq
-A common use of @command{cut} might be to pull out just the login name of
-logged-on users from the output of @command{who}. For example, the following
-pipeline generates a sorted, unique list of the logged-on users:
+@c STARTOFRANGE prunt
+@cindex printing, unduplicated lines of text
+@c STARTOFRANGE tpul
+@cindex text@comma{} printing, unduplicated lines of
+@cindex @command{uniq} utility
+The @command{uniq} utility reads sorted lines of data on its standard
+input, and by default removes duplicate lines. In other words, it only
+prints unique lines---hence the name. @command{uniq} has a number of
+options. The usage is as follows:
@example
-who | cut -c1-8 | sort | uniq
+uniq @r{[}-udc @r{[}-@var{n}@r{]]} @r{[}+@var{n}@r{]} @r{[} @var{input file} @r{[} @var{output file} @r{]]}
@end example
-The options for @command{cut} are:
+The options for @command{uniq} are:
@table @code
-@item -c @var{list}
-Use @var{list} as the list of characters to cut out. Items within the list
-may be separated by commas, and ranges of characters can be separated with
-dashes. The list @samp{1-8,15,22-35} specifies characters 1 through
-8, 15, and 22 through 35.
+@item -d
+Print only repeated lines.
-@item -f @var{list}
-Use @var{list} as the list of fields to cut out.
+@item -u
+Print only nonrepeated lines.
-@item -d @var{delim}
-Use @var{delim} as the field-separator character instead of the TAB
-character.
+@item -c
+Count lines. This option overrides @option{-d} and @option{-u}. Both repeated
+and nonrepeated lines are counted.
-@item -s
-Suppress printing of lines that do not contain the field delimiter.
+@item -@var{n}
+Skip @var{n} fields before comparing lines. The definition of fields
+is similar to @command{awk}'s default: nonwhitespace characters separated
+by runs of spaces and/or TABs.
+
+@item +@var{n}
+Skip @var{n} characters before comparing lines. Any fields specified with
+@option{-@var{n}} are skipped first.
+
+@item @var{input file}
+Data is read from the input file named on the command line, instead of from
+the standard input.
+
+@item @var{output file}
+The generated output is sent to the named output file, instead of to the
+standard output.
@end table
-The @command{awk} implementation of @command{cut} uses the @code{getopt()} library
-function (@pxref{Getopt Function})
+Normally @command{uniq} behaves as if both the @option{-d} and
+@option{-u} options are provided.
+
+@command{uniq} uses the
+@code{getopt()} library function
+(@pxref{Getopt Function})
and the @code{join()} library function
(@pxref{Join Function}).
-The program begins with a comment describing the options, the library
-functions needed, and a @code{usage()} function that prints out a usage
-message and exits. @code{usage()} is called if invalid arguments are
-supplied:
+The program begins with a @code{usage()} function and then a brief outline of
+the options and their meanings in comments.
+The @code{BEGIN} rule deals with the command-line arguments and options. It
+uses a trick to get @code{getopt()} to handle options of the form @samp{-25},
+treating such an option as the option letter @samp{2} with an argument of
+@samp{5}. If indeed two or more digits are supplied (@code{Optarg} looks
+like a number), @code{Optarg} is
+concatenated with the option digit and then the result is added to zero to make
+it into a number. If there is only one digit in the option, then
+@code{Optarg} is not needed. In this case, @code{Optind} must be decremented so that
+@code{getopt()} processes it next time. This code is admittedly a bit
+tricky.
-@cindex @code{cut.awk} program
+If no options are supplied, then the default is taken, to print both
+repeated and nonrepeated lines. The output file, if provided, is assigned
+to @code{outputfile}. Early on, @code{outputfile} is initialized to the
+standard output, @file{/dev/stdout}:
+
+@cindex @code{uniq.awk} program
@example
-@c file eg/prog/cut.awk
-# cut.awk --- implement cut in awk
+@c file eg/prog/uniq.awk
+@group
+# uniq.awk --- do uniq in awk
+#
+# Requires getopt() and join() library functions
+@end group
@c endfile
@ignore
-@c file eg/prog/cut.awk
+@c file eg/prog/uniq.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
@c endfile
@end ignore
-@c file eg/prog/cut.awk
-
-# Options:
-#    -f list     Cut fields
-#    -d c        Field delimiter character
-#    -c list     Cut characters
-#
-#    -s          Suppress lines without the delimiter
-#
-# Requires getopt() and join() library functions
+@c file eg/prog/uniq.awk
+@group
-function usage(    e1, e2)
+function usage(    e)
@{
-    e1 = "usage: cut [-f list] [-d c] [-s] [files...]"
-    e2 = "usage: cut [-c list] [files...]"
-    print e1 > "/dev/stderr"
-    print e2 > "/dev/stderr"
+    e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
+    print e > "/dev/stderr"
     exit 1
@}
+@end group
-@c endfile
-@end example
-
-@noindent
-The variables @code{e1} and @code{e2} are used so that the function
-fits nicely on the
-@ifnotinfo
-page.
-@end ifnotinfo
-@ifnottex
-screen.
-@end ifnottex
-@cindex @code{BEGIN} pattern, running @command{awk} programs and
-@cindex @code{FS} variable, running @command{awk} programs and
-Next comes a @code{BEGIN} rule that parses the command-line options.
-It sets @code{FS} to a single TAB character, because that is @command{cut}'s
-default field separator. The rule then sets the output field separator to be the
-same as the input field separator. A loop using @code{getopt()} steps
-through the command-line options. Exactly one of the variables
-@code{by_fields} or @code{by_chars} is set to true, to indicate that
-processing should be done by fields or by characters, respectively.
-When cutting by characters, the output field separator is set to the null
-string:
+# -c    count lines. overrides -d and -u
+# -d    only repeated lines
+# -u    only nonrepeated lines
+# -n    skip n fields
+# +n    skip n characters, skip fields first
-@example
-@c file eg/prog/cut.awk
-BEGIN \
+BEGIN \
@{
-    FS = "\t"    # default
-    OFS = FS
-    while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) @{
-        if (c == "f") @{
-            by_fields = 1
-            fieldlist = Optarg
-        @} else if (c == "c") @{
-            by_chars = 1
-            fieldlist = Optarg
-            OFS = ""
-        @} else if (c == "d") @{
-            if (length(Optarg) > 1) @{
-                printf("Using first character of %s" \
-                       " for delimiter\n", Optarg) > "/dev/stderr"
-                Optarg = substr(Optarg, 1, 1)
+    count = 1
+    outputfile = "/dev/stdout"
+    opts = "udc0:1:2:3:4:5:6:7:8:9:"
+    while ((c = getopt(ARGC, ARGV, opts)) != -1) @{
+        if (c == "u")
+            non_repeated_only++
+        else if (c == "d")
+            repeated_only++
+        else if (c == "c")
+            do_count++
+        else if (index("0123456789", c) != 0) @{
+            # getopt requires args to options
+            # this messes us up for things like -5
+            if (Optarg ~ /^[[:digit:]]+$/)
+                fcount = (c Optarg) + 0
+            else @{
+                fcount = c + 0
+                Optind--
             @}
-            FS = Optarg
-            OFS = FS
-            if (FS == " ")    # defeat awk semantics
-                FS = "[ ]"
-        @} else if (c == "s")
-            suppress++
-        else
+        @} else
             usage()
     @}
-    # Clear out options
-    for (i = 1; i < Optind; i++)
-        ARGV[i] = ""
-@c endfile
-@end example
-
-@cindex field separators, spaces as
-The code must take
-special care when the field delimiter is a space. Using
-a single space (@code{@w{" "}}) for the value of @code{FS} is
-incorrect---@command{awk} would separate fields with runs of spaces,
-TABs, and/or newlines, and we want them to be separated with individual
-spaces. Also remember that after @code{getopt()} is through
-(as described in @ref{Getopt Function}),
-we have to
-clear out all the elements of @code{ARGV} from 1 to @code{Optind},
-so that @command{awk} does not try to process the command-line options
-as @value{FN}s.
-
-After dealing with the command-line options, the program verifies that the
-options make sense. Only one or the other of @option{-c} and @option{-f}
-should be used, and both require a field list. Then the program calls
-either @code{set_fieldlist()} or @code{set_charlist()} to pull apart the
-list of fields or characters:
-
-@example
-@c file eg/prog/cut.awk
-    if (by_fields && by_chars)
-        usage()
-
-    if (by_fields == 0 && by_chars == 0)
-        by_fields = 1    # default
-
-    if (fieldlist == "") @{
-        print "cut: needs list for -c or -f" > "/dev/stderr"
-        exit 1
+    if (ARGV[Optind] ~ /^\+[[:digit:]]+$/) @{
+        charcount = substr(ARGV[Optind], 2) + 0
+        Optind++
     @}
-    if (by_fields)
-        set_fieldlist()
-    else
-        set_charlist()
-@}
-@c endfile
-@end example
-
-@code{set_fieldlist()} splits the field list apart at the commas
-into an array. Then, for each element of the array, it looks to
-see if the element is actually a range, and if so, splits it apart.
-The function checks the range
-to make sure that the first number is smaller than the second.
-Each number in the list is added to the @code{flist} array, which
-simply lists the fields that will be printed. Normal field splitting
-is used. The program lets @command{awk} handle the job of doing the
-field splitting:
+    for (i = 1; i < Optind; i++)
+        ARGV[i] = ""
-@example
-@c file eg/prog/cut.awk
-function set_fieldlist(        n, m, i, j, k, f, g)
-@{
-    n = split(fieldlist, f, ",")
-    j = 1    # index in flist
-    for (i = 1; i <= n; i++) @{
-        if (index(f[i], "-") != 0) @{ # a range
-            m = split(f[i], g, "-")
-@group
-            if (m != 2 || g[1] >= g[2]) @{
-                printf("bad field list: %s\n",
-                                  f[i]) > "/dev/stderr"
-                exit 1
-            @}
-@end group
-            for (k = g[1]; k <= g[2]; k++)
-                flist[j++] = k
-        @} else
-            flist[j++] = f[i]
+    if (repeated_only == 0 && non_repeated_only == 0)
+        repeated_only = non_repeated_only = 1
+
+    if (ARGC - Optind == 2) @{
+        outputfile = ARGV[ARGC - 1]
+        ARGV[ARGC - 1] = ""
     @}
-    nfields = j - 1
@}
@c endfile
@end example
-The @code{set_charlist()} function is more complicated than
-@code{set_fieldlist()}.
-The idea here is to use @command{gawk}'s @code{FIELDWIDTHS} variable
-(@pxref{Constant Size}),
-which describes constant-width input. When using a character list, that is
-exactly what we have.
-
-Setting up @code{FIELDWIDTHS} is more complicated than simply listing the
-fields that need to be printed. We have to keep track of the fields to
-print and also the intervening characters that have to be skipped.
-For example, suppose you wanted characters 1 through 8, 15, and
-22 through 35. You would use @samp{-c 1-8,15,22-35}. The necessary value
-for @code{FIELDWIDTHS} is @code{@w{"8 6 1 6 14"}}. This yields five
-fields, and the fields to print
-are @code{$1}, @code{$3}, and @code{$5}.
-The intermediate fields are @dfn{filler},
-which is stuff in between the desired data.
-@code{flist} lists the fields to print, and @code{t} tracks the
-complete field list, including filler fields:
+The following function, @code{are_equal()}, compares the current line,
+@code{$0}, to the
+previous line, @code{last}. It handles skipping fields and characters.
+If no field count and no character count are specified, @code{are_equal()}
+simply returns one or zero depending upon the result of a simple string
+comparison of @code{last} and @code{$0}. Otherwise, things get more
+complicated.
+If fields have to be skipped, each line is broken into an array using
+@code{split()}
+(@pxref{String Functions});
+the desired fields are then joined back into a line using @code{join()}.
+The joined lines are stored in @code{clast} and @code{cline}.
+If no fields are skipped, @code{clast} and @code{cline} are set to
+@code{last} and @code{$0}, respectively.
+Finally, if characters are skipped, @code{substr()} is used to strip off the
+leading @code{charcount} characters in @code{clast} and @code{cline}. The
+two strings are then compared and @code{are_equal()} returns the result:
@example
-@c file eg/prog/cut.awk
-function set_charlist(    field, i, j, f, g, t,
-                          filler, last, len)
+@c file eg/prog/uniq.awk
+function are_equal(    n, m, clast, cline, alast, aline)
@{
-    field = 1   # count total fields
-    n = split(fieldlist, f, ",")
-    j = 1       # index in flist
-    for (i = 1; i <= n; i++) @{
-        if (index(f[i], "-") != 0) @{ # range
-            m = split(f[i], g, "-")
-            if (m != 2 || g[1] >= g[2]) @{
-                printf("bad character list: %s\n",
-                               f[i]) > "/dev/stderr"
-                exit 1
-            @}
-            len = g[2] - g[1] + 1
-            if (g[1] > 1)  # compute length of filler
-                filler = g[1] - last - 1
-            else
-                filler = 0
-@group
-            if (filler)
-                t[field++] = filler
-@end group
-            t[field++] = len  # length of field
-            last = g[2]
-            flist[j++] = field - 1
-        @} else @{
-            if (f[i] > 1)
-                filler = f[i] - last - 1
-            else
-                filler = 0
-            if (filler)
-                t[field++] = filler
-            t[field++] = 1
-            last = f[i]
-            flist[j++] = field - 1
-        @}
+    if (fcount == 0 && charcount == 0)
+        return (last == $0)
+
+    if (fcount > 0) @{
+        n = split(last, alast)
+        m = split($0, aline)
+        clast = join(alast, fcount+1, n)
+        cline = join(aline, fcount+1, m)
+    @} else @{
+        clast = last
+        cline = $0
     @}
-    FIELDWIDTHS = join(t, 1, field - 1)
-    nfields = j - 1
+    if (charcount) @{
+        clast = substr(clast, charcount + 1)
+        cline = substr(cline, charcount + 1)
+    @}
+
+    return (clast == cline)
@}
@c endfile
@end example
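To see the field and character skipping in isolation, the comparison can be tried on two literal strings. This is a hedged sketch rather than the code from uniq.awk: the shell wrapper same_key, the helper key(), and the simplified join() (always joining with a single space) are stand-ins written for this illustration, and the real program takes join() from the library file instead.

```shell
# Usage: same_key FCOUNT CHARCOUNT LINE1 LINE2
# Prints "equal" or "different" after skipping FCOUNT fields
# and then CHARCOUNT characters in both lines.
same_key() {
    awk -v fcount="$1" -v charcount="$2" -v a="$3" -v b="$4" '
    # simplified stand-in for the library join() function
    function join(array, start, end,    result, i) {
        result = array[start]
        for (i = start + 1; i <= end; i++)
            result = result " " array[i]
        return result
    }
    # reduce a line to the part that takes part in the comparison
    function key(line,    n, parts) {
        if (fcount > 0) {
            n = split(line, parts)
            line = join(parts, fcount + 1, n)
        }
        if (charcount > 0)
            line = substr(line, charcount + 1)
        return line
    }
    BEGIN {
        # appending "" forces a string comparison
        verdict = ((key(a) "") == (key(b) "")) ? "equal" : "different"
        print verdict
    }'
}

same_key 1 0 "x foo bar" "y foo bar"    # first field ignored
same_key 0 2 "aaHELLO" "bbHELLO"        # first two characters ignored
same_key 0 0 "foo" "bar"                # plain comparison
```

Fields are skipped before characters, matching the order described for the options above.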
-Next is the rule that actually processes the data. If the @option{-s} option
-is given, then @code{suppress} is true. The first @code{if} statement
-makes sure that the input record does have the field separator. If
-@command{cut} is processing fields, @code{suppress} is true, and the field
-separator character is not in the record, then the record is skipped.
+The following two rules are the body of the program. The first one is
+executed only for the very first line of data. It sets @code{last} equal to
+@code{$0}, so that subsequent lines of text have something to be compared to.
-If the record is valid, then @command{gawk} has split the data
-into fields, either using the character in @code{FS} or using fixed-length
-fields and @code{FIELDWIDTHS}. The loop goes through the list of fields
-that should be printed. The corresponding field is printed if it contains data.
-If the next field also has data, then the separator character is
-written out between the fields:
+The second rule does the work. The variable @code{equal} is one or zero,
+depending upon the results of @code{are_equal()}'s comparison. If @command{uniq}
+is counting repeated lines, and the lines are equal, then it increments the @code{count} variable.
+Otherwise, it prints the line and resets @code{count},
+since the two lines are not equal.
+
+If @command{uniq} is not counting, and if the lines are equal, @code{count} is incremented.
+Nothing is printed, since the point is to remove duplicates.
+Otherwise, if @command{uniq} is counting repeated lines and more than
+one line is seen, or if @command{uniq} is counting nonrepeated lines
+and only one line is seen, then the line is printed, and @code{count}
+is reset.
+
+Finally, similar logic is used in the @code{END} rule to print the final
+line of input data:
@example
-@c file eg/prog/cut.awk
+@c file eg/prog/uniq.awk
+NR == 1 @{
+    last = $0
+    next
+@}
+
@{
-    if (by_fields && suppress && index($0, FS) != 0)
-        next
+    equal = are_equal()
-    for (i = 1; i <= nfields; i++) @{
-        if ($flist[i] != "") @{
-            printf "%s", $flist[i]
-            if (i < nfields && $flist[i+1] != "")
-                printf "%s", OFS
+    if (do_count) @{    # overrides -d and -u
+        if (equal)
+            count++
+        else @{
+            printf("%4d %s\n", count, last) > outputfile
+            last = $0
+            count = 1    # reset
         @}
+        next
     @}
-    print ""
+
+    if (equal)
+        count++
+    else @{
+        if ((repeated_only && count > 1) ||
+            (non_repeated_only && count == 1))
+                print last > outputfile
+        last = $0
+        count = 1
+    @}
+@}
+
+END @{
+    if (do_count)
+        printf("%4d %s\n", count, last) > outputfile
+    else if ((repeated_only && count > 1) ||
+            (non_repeated_only && count == 1))
+        print last > outputfile
+    close(outputfile)
@}
@c endfile
@end example
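The counting path (the -c behavior) can be reduced to a few lines when no fields or characters are skipped. The following is a minimal sketch under that assumption, not the program from the patch: count_runs is a made-up wrapper name, the comparison is a plain $0 == last instead of are_equal(), and the END rule checks NR so that empty input prints nothing.

```shell
# Print each run of identical adjacent lines once, preceded by its length.
count_runs() {
    awk '
    NR == 1    { last = $0; count = 1; next }   # first line: nothing to compare yet
    $0 == last { count++; next }                # same as previous line: extend the run
    {
        printf("%4d %s\n", count, last)         # run ended: report it
        last = $0
        count = 1
    }
    END { if (NR > 0) printf("%4d %s\n", count, last) }'
}

printf 'a\na\nb\nc\nc\nc\n' | count_runs
```

As with the real utility, the input is expected to be sorted (or at least grouped), since only adjacent lines are compared.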
+@c ENDOFRANGE prunt
+@c ENDOFRANGE tpul
-This version of @command{cut} relies on @command{gawk}'s @code{FIELDWIDTHS}
-variable to do the character-based cutting. While it is possible in
-other @command{awk} implementations to use @code{substr()}
-(@pxref{String Functions}),
-it is also extremely painful.
-The @code{FIELDWIDTHS} variable supplies an elegant solution to the problem
-of picking the input line apart by characters.
-@c ENDOFRANGE cut
-@c ENDOFRANGE ficut
-@c ENDOFRANGE colcut
-
-@c Exercise: Rewrite using split with "".
+@node Wc Program
+@subsection Counting Things
-@node Egrep Program
-@subsection Searching for Regular Expressions in Files
+@c FIXME: One day, update to current POSIX version of wc
-@c STARTOFRANGE regexps
-@cindex regular expressions, searching for
-@c STARTOFRANGE sfregexp
-@cindex searching, files for regular expressions
-@c STARTOFRANGE fsregexp
-@cindex files, searching for regular expressions
-@cindex @command{egrep} utility
-The @command{egrep} utility searches files for patterns. It uses regular
-expressions that are almost identical to those available in @command{awk}
-(@pxref{Regexp}).
-You invoke it as follows:
+@c STARTOFRANGE count
+@cindex counting
+@c STARTOFRANGE infco
+@cindex input files, counting elements in
+@c STARTOFRANGE woco
+@cindex words, counting
+@c STARTOFRANGE chco
+@cindex characters, counting
+@c STARTOFRANGE lico
+@cindex lines, counting
+@cindex @command{wc} utility
+The @command{wc} (word count) utility counts lines, words, and characters in
+one or more input files. Its usage is as follows:
@example
-egrep @r{[} @var{options} @r{]} '@var{pattern}' @var{files} @dots{}
+wc @r{[}-lwc@r{]} @r{[} @var{files} @dots{} @r{]}
@end example
-The @var{pattern} is a regular expression. In typical usage, the regular
-expression is quoted to prevent the shell from expanding any of the
-special characters as @value{FN} wildcards. Normally, @command{egrep}
-prints the lines that matched. If multiple @value{FN}s are provided on
-the command line, each output line is preceded by the name of the file
-and a colon.
-
-The options to @command{egrep} are as follows:
+If no files are specified on the command line, @command{wc} reads its standard
+input. If there are multiple files, it also prints total counts for all
+the files. The options and their meanings are shown in the following list:
@table @code
-@item -c
-Print out a count of the lines that matched the pattern, instead of the
-lines themselves.
-
-@item -s
-Be silent. No output is produced and the exit value indicates whether
-the pattern was matched.
-
-@item -v
-Invert the sense of the test. @command{egrep} prints the lines that do
-@emph{not} match the pattern and exits successfully if the pattern is not
-matched.
-
-@item -i
-Ignore case distinctions in both the pattern and the input data.
-
@item -l
-Only print (list) the names of the files that matched, not the lines that matched.
+Count only lines.
-@item -e @var{pattern}
-Use @var{pattern} as the regexp to match. The purpose of the @option{-e}
-option is to allow patterns that start with a @samp{-}.
+@item -w
+Count only words.
+A ``word'' is a contiguous sequence of nonwhitespace characters, separated
+by spaces and/or TABs. Luckily, this is the normal way @command{awk} separates
+fields in its input data.
+
+@item -c
+Count only characters.
@end table
-This version uses the @code{getopt()} library function
+Implementing @command{wc} in @command{awk} is particularly elegant,
+since @command{awk} does a lot of the work for us; it splits lines into
+words (i.e., fields) and counts them, it counts lines (i.e., records),
+and it can easily tell us how long a line is.
+
+This program uses the @code{getopt()} library function
(@pxref{Getopt Function})
-and the file transition library program
+and the file-transition functions
(@pxref{Filetrans Function}).
-The program begins with a descriptive comment and then a @code{BEGIN} rule
-that processes the command-line arguments with @code{getopt()}. The @option{-i}
-(ignore case) option is particularly easy with @command{gawk}; we just use the
-@code{IGNORECASE} built-in variable
-(@pxref{Built-in Variables}):
+This version has one notable difference from traditional versions of
+@command{wc}: it always prints the counts in the order lines, words,
+and characters. Traditional versions note the order of the @option{-l},
+@option{-w}, and @option{-c} options on the command line, and print the
+counts in that order.
-@cindex @code{egrep.awk} program
+The @code{BEGIN} rule does the argument processing. The variable
+@code{print_total} is true if more than one file is named on the
+command line:
+
+@cindex @code{wc.awk} program
@example
-@c file eg/prog/egrep.awk
-# egrep.awk --- simulate egrep in awk
-#
+@c file eg/prog/wc.awk
+# wc.awk --- count lines, words, characters
@c endfile
@ignore
-@c file eg/prog/egrep.awk
+@c file eg/prog/wc.awk
+#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
-
@c endfile
@end ignore
-@c file eg/prog/egrep.awk
+@c file eg/prog/wc.awk
+
# Options:
-#    -c    count of lines
-#    -s    silent - use exit value
-#    -v    invert test, success if no match
-#    -i    ignore case
-#    -l    print filenames only
-#    -e    argument is pattern
+#    -l    only count lines
+#    -w    only count words
+#    -c    only count characters
#
-# Requires getopt and file transition library functions
+# Default is to count lines, words, characters
+#
+# Requires getopt() and file transition library functions
BEGIN @{
-    while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) @{
-        if (c == "c")
-            count_only++
-        else if (c == "s")
-            no_print++
-        else if (c == "v")
-            invert++
-        else if (c == "i")
-            IGNORECASE = 1
-        else if (c == "l")
-            filenames_only++
-        else if (c == "e")
-            pattern = Optarg
-        else
-            usage()
+    # let getopt() print a message about
+    # invalid options. we ignore them
+    while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
+        if (c == "l")
+            do_lines = 1
+        else if (c == "w")
+            do_words = 1
+        else if (c == "c")
+            do_chars = 1
     @}
-@c endfile
-@end example
-
-Next comes the code that handles the @command{egrep}-specific behavior. If no
-pattern is supplied with @option{-e}, the first nonoption on the
-command line is used. The @command{awk} command-line arguments up to @code{ARGV[Optind]}
-are cleared, so that @command{awk} won't try to process them as files. If no
-files are specified, the standard input is used, and if multiple files are
-specified, we make sure to note this so that the @value{FN}s can precede the
-matched lines in the output:
-
-@example
-@c file eg/prog/egrep.awk
-    if (pattern == "")
-        pattern = ARGV[Optind++]
-
     for (i = 1; i < Optind; i++)
         ARGV[i] = ""
-    if (Optind >= ARGC) @{
-        ARGV[1] = "-"
-        ARGC = 2
-    @} else if (ARGC - Optind > 1)
-        do_filenames++
-#    if (IGNORECASE)
-#        pattern = tolower(pattern)
+    # if no options, do all
+    if (! do_lines && ! do_words && ! do_chars)
+        do_lines = do_words = do_chars = 1
+
+    print_total = (ARGC - i > 2)
@}
@c endfile
@end example
-The last two lines are commented out, since they are not needed in
-@command{gawk}. They should be uncommented if you have to use another version
-of @command{awk}.
+The @code{beginfile()} function is simple; it just resets the counts of lines,
+words, and characters to zero, and saves the current @value{FN} in
+@code{fname}:
-The next set of lines should be uncommented if you are not using
-@command{gawk}. This rule translates all the characters in the input line
-into lowercase if the @option{-i} option is specified.@footnote{It
-also introduces a subtle bug;
-if a match happens, we output the translated line, not the original.}
-The rule is
-commented out since it is not necessary with @command{gawk}:
+@example
+@c file eg/prog/wc.awk
+function beginfile(file)
+@{
+    lines = words = chars = 0
+    fname = FILENAME
+@}
+@c endfile
+@end example
-@c Exercise: Fix this, w/array and new line as key to original line
+The @code{endfile()} function adds the current file's numbers to the running
+totals of lines, words, and characters.@footnote{@command{wc} can't just use the value of
+@code{FNR} in @code{endfile()}. If you examine
+the code in
+@ref{Filetrans Function},
+you will see that
+@code{FNR} has already been reset by the time
+@code{endfile()} is called.} It then prints out those numbers
+for the file that was just read. It relies on @code{beginfile()} to reset the
+numbers for the following @value{DF}:
+@c FIXME: ONE DAY: make the above footnote an exercise,
+@c instead of giving away the answer.
@example
-@c file eg/prog/egrep.awk
-#@{
-#    if (IGNORECASE)
-#        $0 = tolower($0)
-#@}
+@c file eg/prog/wc.awk
+function endfile(file)
+@{
+    tlines += lines
+    twords += words
+    tchars += chars
+    if (do_lines)
+        printf "\t%d", lines
+@group
+    if (do_words)
+        printf "\t%d", words
+@end group
+    if (do_chars)
+        printf "\t%d", chars
+    printf "\t%s\n", fname
+@}
@c endfile
@end example
-The @code{beginfile()} function is called by the rule in @file{ftrans.awk}
-when each new file is processed. In this case, it is very simple; all it
-does is initialize a variable @code{fcount} to zero. @code{fcount} tracks
-how many lines in the current file matched the pattern.
-Naming the parameter @code{junk} shows we know that @code{beginfile()}
-is called with a parameter, but that we're not interested in its value:
+There is one rule that is executed for each line. It adds the length of
+the record, plus one, to @code{chars}.@footnote{Since @command{gawk}
+understands multibyte locales, this code counts characters, not bytes.}
+Adding one plus the record length
+is needed because the newline character separating records (the value
+of @code{RS}) is not part of the record itself, and thus not included
+in its length. Next, @code{lines} is incremented for each line read,
+and @code{words} is incremented by the value of @code{NF}, which is the
+number of ``words'' on this line:
@example
-@c file eg/prog/egrep.awk
-function beginfile(junk)
+@c file eg/prog/wc.awk
+# do per line
@{
-    fcount = 0
+    chars += length($0) + 1    # get newline
+    lines++
+    words += NF
@}
@c endfile
@end example
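The per-line rule is enough for a working counter when there is a single input stream and no options. The following is a sketch only: count_all is a made-up wrapper name, the totals are printed space-separated in an END rule instead of going through endfile(), and the output format is not meant to match the program above.

```shell
# Count lines, words, and characters on standard input.
count_all() {
    awk '
    {
        chars += length($0) + 1   # +1 for the newline that ended the record
        lines++
        words += NF               # NF is the number of whitespace-separated fields
    }
    END { printf "%d %d %d\n", lines, words, chars }'
}

printf 'one two\nthree\n' | count_all    # prints: 2 3 14
```

Here "one two" contributes 7 + 1 characters and "three" contributes 5 + 1, giving 14 in total.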
-The @code{endfile()} function is called after each file has been processed.
-It affects the output only when the user wants a count of the number of lines that
-matched. @code{no_print} is true only if the exit status is desired.
-@code{count_only} is true if line counts are desired. @command{egrep}
-therefore only prints line counts if printing and counting are enabled.
-The output format must be adjusted depending upon the number of files to
-process. Finally, @code{fcount} is added to @code{total}, so that we
-know the total number of lines that matched the pattern:
+Finally, the @code{END} rule simply prints the totals for all the files:
@example
-@c file eg/prog/egrep.awk
-function endfile(file)
-@{
-    if (! no_print && count_only) @{
-        if (do_filenames)
-            print file ":" fcount
-        else
-            print fcount
+@c file eg/prog/wc.awk
+END @{
+    if (print_total) @{
+        if (do_lines)
+            printf "\t%d", tlines
+        if (do_words)
+            printf "\t%d", twords
+        if (do_chars)
+            printf "\t%d", tchars
+        print "\ttotal"
     @}
-
-    total += fcount
@}
@c endfile
@end example
+@c ENDOFRANGE count
+@c ENDOFRANGE infco
+@c ENDOFRANGE lico
+@c ENDOFRANGE woco
+@c ENDOFRANGE chco
+@c ENDOFRANGE posimawk
-The following rule does most of the work of matching lines. The variable
-@code{matches} is true if the line matched the pattern. If the user
-wants lines that did not match, the sense of @code{matches} is inverted
-using the @samp{!} operator. @code{fcount} is incremented with the value of
-@code{matches}, which is either one or zero, depending upon a
-successful or unsuccessful match. If the line does not match, the
-@code{next} statement just moves on to the next record.
-
-A number of additional tests are made, but they are only done if we
-are not counting lines. First, if the user only wants exit status
-(@code{no_print} is true), then it is enough to know that @emph{one}
-line in this file matched, and we can skip on to the next file with
-@code{nextfile}. Similarly, if we are only printing @value{FN}s, we can
-print the @value{FN}, and then skip to the next file with @code{nextfile}.
-Finally, each line is printed, with a leading @value{FN} and colon
-if necessary:
address@hidden Miscellaneous Programs
address@hidden A Grab Bag of @command{awk} Programs
address@hidden @code{!} (exclamation point), @code{!} operator
address@hidden exclamation point (@code{!}), @code{!} operator
address@hidden
address@hidden file eg/prog/egrep.awk
address@hidden
- matches = ($0 ~ pattern)
- if (invert)
- matches = ! matches
+This @value{SECTION} is a large ``grab bag'' of miscellaneous programs.
+We hope you find them both interesting and enjoyable.
- fcount += matches # 1 or 0
address@hidden
+* Dupword Program:: Finding duplicated words in a document.
+* Alarm Program:: An alarm clock.
+* Translate Program:: A program similar to the @command{tr} utility.
+* Labels Program:: Printing mailing labels.
+* Word Sorting:: A program to produce a word usage count.
+* History Sorting:: Eliminating duplicate entries from a history
+ file.
+* Extract Program:: Pulling out programs from Texinfo source
+ files.
+* Simple Sed:: A Simple Stream Editor.
+* Igawk Program:: A wrapper for @command{awk} that includes
+ files.
+* Anagram Program:: Finding anagrams from a dictionary.
+* Signature Program:: People do amazing things with too much time on
+ their hands.
address@hidden menu
- if (! matches)
- next
address@hidden Dupword Program
address@hidden Finding Duplicated Words in a Document
- if (! count_only) @{
- if (no_print)
- nextfile
address@hidden words, address@hidden searching for
address@hidden searching, for words
address@hidden address@hidden searching
+A common error when writing large amounts of prose is to accidentally
+duplicate words. Typically you will see this in text as something like ``the
+the program does the following@dots{}'' When the text is online, often
+the duplicated words occur at the end of one line and the
address@hidden
+the
address@hidden iftex
+beginning of
+another, making them very difficult to spot.
address@hidden as here!
- if (filenames_only) @{
- print FILENAME
- nextfile
- @}
+This program, @file{dupword.awk}, scans through a file one line at a time
+and looks for adjacent occurrences of the same word. It also saves the last
+word on a line (in the variable @code{prev}) for comparison with the first
+word on the next line.
- if (do_filenames)
- print FILENAME ":" $0
- else
- print
- @}
address@hidden
address@hidden endfile
address@hidden example
address@hidden Texinfo
+The first statement makes sure that the line is all lowercase,
+so that, for example, ``The'' and ``the'' compare equal to each other.
+The next statement replaces nonalphanumeric and nonwhitespace characters
+with spaces, so that punctuation does not affect the comparison either.
+The characters are replaced with spaces so that formatting controls
+don't create nonsense words (e.g., the Texinfo @samp{@@code@{NF@}}
+becomes @samp{codeNF} if punctuation is simply deleted). The record is
+then resplit into fields, yielding just the actual words on the line,
+and ensuring that there are no empty fields.
-The @code{END} rule takes care of producing the correct exit status. If
-there are no matches, the exit status is one; otherwise it is zero:
+If there are no fields left after removing all the punctuation, the
+current record is skipped. Otherwise, the program loops through each
+word, comparing it to the previous one:
address@hidden @code{dupword.awk} program
@example
address@hidden file eg/prog/egrep.awk
-END \
address@hidden
- if (total == 0)
- exit 1
- exit 0
address@hidden
address@hidden file eg/prog/dupword.awk
+# dupword.awk --- find duplicate words in text
@c endfile
address@hidden example
-
-The @code{usage()} function prints a usage message in case of invalid options,
-and then exits:
address@hidden
address@hidden file eg/prog/dupword.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# December 1991
+# Revised October 2000
address@hidden
address@hidden file eg/prog/egrep.awk
-function usage( e)
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/dupword.awk
@{
- e = "Usage: egrep [-csvil] [-e pat] [files ...]"
- e = e "\n\tegrep [-csvil] pat [files ...]"
- print e > "/dev/stderr"
- exit 1
+ $0 = tolower($0)
+ gsub(/[^[:alnum:][:blank:]]/, " ");
+ $0 = $0 # re-split
+ if (NF == 0)
+ next
+ if ($1 == prev)
+ printf("%s:%d: duplicate %s\n",
+ FILENAME, FNR, $1)
+ for (i = 2; i <= NF; i++)
+ if ($i == $(i-1))
+ printf("%s:%d: duplicate %s\n",
+ FILENAME, FNR, $i)
+ prev = $NF
@}
@c endfile
@end example
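The normalization and comparison steps described above can be tried in isolation. The following is only a sketch, written as plain @command{awk} rather than Texinfo-escaped source: the function names @code{normalize()} and @code{dupwords()} are hypothetical and are not part of @file{dupword.awk}, and the POSIX character classes are assumed to be supported by the @command{awk} in use (they are in @command{gawk}).

```awk
# normalize(s) --- hypothetical helper, not part of dupword.awk.
# Lowercase the text and turn every character that is neither
# alphanumeric nor blank into a space, as the rule above does.
function normalize(s)
{
    s = tolower(s)
    gsub(/[^[:alnum:][:blank:]]/, " ", s)
    return s
}

# dupwords(s, last) --- hypothetical helper.
# Return the duplicated words found in s, separated by single
# spaces.  `last' is the final word of the preceding line, playing
# the role of the variable prev in dupword.awk.
function dupwords(s, last,    n, w, i, before, out)
{
    n = split(normalize(s), w, " ")    # " " splits on runs of whitespace
    for (i = 1; i <= n; i++) {
        before = (i == 1) ? last : w[i-1]
        if (w[i] == before)
            out = out (out == "" ? "" : " ") w[i]
    }
    return out
}
```

Because punctuation becomes spaces instead of being deleted, @code{normalize("@@code@{NF@}")} yields the separate words @samp{code} and @samp{nf}, not the nonsense word @samp{codenf}.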
-The variable @code{e} is used so that the function fits nicely
-on the printed page.
-
address@hidden @code{END} pattern, backslash continuation and
address@hidden @code{\} (backslash), continuing lines and
address@hidden backslash (@code{\}), continuing lines and
-Just a note on programming style: you may have noticed that the @code{END}
-rule uses backslash continuation, with the open brace on a line by
-itself. This is so that it more closely resembles the way functions
-are written. Many of the examples
-in this @value{CHAPTER}
-use this style. You can decide for yourself if you like writing
-your @code{BEGIN} and @code{END} rules this way
-or not.
address@hidden ENDOFRANGE regexps
address@hidden ENDOFRANGE sfregexp
address@hidden ENDOFRANGE fsregexp
-
address@hidden Id Program
address@hidden Printing out User Information
-
address@hidden printing, user information
address@hidden users, information about, printing
address@hidden @command{id} utility
-The @command{id} utility lists a user's real and effective user ID numbers,
-real and effective group ID numbers, and the user's group set, if any.
address@hidden only prints the effective user ID and group ID if they are
-different from the real ones. If possible, @command{id} also supplies the
-corresponding user and group names. The output might look like this:
-
address@hidden
-$ @kbd{id}
address@hidden uid=500(arnold) gid=500(arnold) groups=6(disk),7(lp),19(floppy)
address@hidden example
address@hidden Alarm Program
address@hidden An Alarm Clock Program
address@hidden insomnia, cure for
address@hidden Robbins, Arnold
+@quotation
+@i{Nothing cures insomnia like a ringing alarm clock.}@*
+Arnold Robbins
+@end quotation
address@hidden @code{PROCINFO} array
-This information is part of what is provided by @command{gawk}'s
address@hidden array (@pxref{Built-in Variables}).
-However, the @command{id} utility provides a more palatable output than just
-individual numbers.
address@hidden STARTOFRANGE tialarm
address@hidden time, alarm clock example program
address@hidden STARTOFRANGE alaex
address@hidden alarm clock example program
+The following program is a simple ``alarm clock'' program.
+You give it a time of day and an optional message. At the specified time,
+it prints the message on the standard output. In addition, you can give it
+the number of times to repeat the message as well as a delay between
+repetitions.
-Here is a simple version of @command{id} written in @command{awk}.
-It uses the user database library functions
-(@pxref{Passwd Functions})
-and the group database library functions
-(@pxref{Group Functions}):
+This program uses the @code{getlocaltime()} function from
+@ref{Getlocaltime Function}.
-The program is fairly straightforward. All the work is done in the
address@hidden rule. The user and group ID numbers are obtained from
address@hidden
-The code is repetitive. The entry in the user database for the real user ID
-number is split into parts at the @samp{:}. The name is the first field.
-Similar code is used for the effective user ID number and the group
-numbers:
+All the work is done in the @code{BEGIN} rule. The first part is argument
+checking and setting of defaults: the delay, the count, and the message to
+print. If the user supplied a message without the ASCII BEL
+character (known as the ``alert'' character, @code{"\a"}), then it is added to
+the message. (On many systems, printing the ASCII BEL generates an
+audible alert. Thus when the alarm goes off, the system calls attention
+to itself in case the user is not looking at the computer.)
+Just for a change, this program uses a @code{switch} statement
+(@pxref{Switch Statement}), but the processing could be done with a series of
+@code{if}-@code{else} statements instead.
+Here is the program:
address@hidden @code{id.awk} program
address@hidden @code{alarm.awk} program
@example
address@hidden file eg/prog/id.awk
-# id.awk --- implement id in awk
address@hidden file eg/prog/alarm.awk
+# alarm.awk --- set an alarm
#
-# Requires user and group library functions
+# Requires getlocaltime() library function
@c endfile
@ignore
address@hidden file eg/prog/id.awk
address@hidden file eg/prog/alarm.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
-# Revised February 1996
+# Revised December 2010
+
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/alarm.awk
+# usage: alarm time [ "message" [ count [ delay ] ] ]
+
+BEGIN \
address@hidden
+ # Initial argument sanity checking
+ usage1 = "usage: alarm time ['message' [count [delay]]]"
+ usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])
+
+ if (ARGC < 2) @{
+ print usage1 > "/dev/stderr"
+ print usage2 > "/dev/stderr"
+ exit 1
+ @}
+ switch (ARGC) @{
+ case 5:
+ delay = ARGV[4] + 0
+ # fall through
+ case 4:
+ count = ARGV[3] + 0
+ # fall through
+ case 3:
+ message = ARGV[2]
+ break
+ default:
+ if (ARGV[1] !~ /[[:digit:]]?[[:digit:]]:[[:digit:]]@{2@}/) @{
+ print usage1 > "/dev/stderr"
+ print usage2 > "/dev/stderr"
+ exit 1
+ @}
+ break
+ @}
+
+ # set defaults for once we reach the desired time
+ if (delay == 0)
+ delay = 180 # 3 minutes
address@hidden
+ if (count == 0)
+ count = 5
address@hidden group
+ if (message == "")
+ message = sprintf("\aIt is now %s!\a", ARGV[1])
+ else if (index(message, "\a") == 0)
+ message = "\a" message "\a"
address@hidden endfile
address@hidden example
+
+The next @value{SECTION} of code turns the alarm time into hours and minutes,
+converts it (if necessary) to a 24-hour clock, and then turns that
+time into a count of the seconds since midnight. Next it turns the current
+time into a count of seconds since midnight. The difference between the two
+is how long to wait before setting off the alarm:
+
address@hidden
address@hidden file eg/prog/alarm.awk
+ # split up alarm time
+ split(ARGV[1], atime, ":")
+ hour = atime[1] + 0 # force numeric
+ minute = atime[2] + 0 # force numeric
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/id.awk
-# output is:
-# uid=12(foo) euid=34(bar) gid=3(baz) \
-# egid=5(blat) groups=9(nine),2(two),1(one)
+ # get current broken down time
+ getlocaltime(now)
address@hidden
-BEGIN \
address@hidden
- uid = PROCINFO["uid"]
- euid = PROCINFO["euid"]
- gid = PROCINFO["gid"]
- egid = PROCINFO["egid"]
address@hidden group
+ # if time given is 12-hour hours and it's after that
+ # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
+ # then add 12 to real hour
+ if (hour < 12 && now["hour"] > hour)
+ hour += 12
- printf("uid=%d", uid)
- pw = getpwuid(uid)
- if (pw != "") @{
- split(pw, a, ":")
- printf("(%s)", a[1])
- @}
+ # set target time in seconds since midnight
+ target = (hour * 60 * 60) + (minute * 60)
- if (euid != uid) @{
- printf(" euid=%d", euid)
- pw = getpwuid(euid)
- if (pw != "") @{
- split(pw, a, ":")
- printf("(%s)", a[1])
- @}
- @}
+ # get current time in seconds since midnight
+ current = (now["hour"] * 60 * 60) + \
+ (now["minute"] * 60) + now["second"]
- printf(" gid=%d", gid)
- pw = getgrgid(gid)
- if (pw != "") @{
- split(pw, a, ":")
- printf("(%s)", a[1])
+ # how long to sleep for
+ naptime = target - current
+ if (naptime <= 0) @{
+ print "time is in the past!" > "/dev/stderr"
+ exit 1
@}
address@hidden endfile
address@hidden example
- if (egid != gid) @{
- printf(" egid=%d", egid)
- pw = getgrgid(egid)
- if (pw != "") @{
- split(pw, a, ":")
- printf("(%s)", a[1])
- @}
- @}
address@hidden @command{sleep} utility
+Finally, the program uses the @code{system()} function
+(@pxref{I/O Functions})
+to call the @command{sleep} utility. The @command{sleep} utility simply pauses
+for the given number of seconds. If the exit status is not zero,
+the program assumes that @command{sleep} was interrupted and exits. If
+@command{sleep} exited with an OK status (zero), then the program prints the
+message in a loop, again using @command{sleep} to delay for however many
+seconds are necessary:
- for (i = 1; ("group" i) in PROCINFO; i++) @{
- if (i == 1)
- printf(" groups=")
- group = PROCINFO["group" i]
- printf("%d", group)
- pw = getgrgid(group)
- if (pw != "") @{
- split(pw, a, ":")
- printf("(%s)", a[1])
- @}
- if (("group" (i+1)) in PROCINFO)
- printf(",")
address@hidden
address@hidden file eg/prog/alarm.awk
+ # zzzzzz..... go away if interrupted
+ if (system(sprintf("sleep %d", naptime)) != 0)
+ exit 1
+
+ # time to notify!
+ command = sprintf("sleep %d", delay)
+ for (i = 1; i <= count; i++) @{
+ print message
+ # if sleep command interrupted, go away
+ if (system(command) != 0)
+ break
@}
- print ""
+ exit 0
@}
@c endfile
@end example
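The time arithmetic in the middle of the program can be checked without waiting for a real alarm. The sketch below restates it as two functions, in plain @command{awk} rather than Texinfo-escaped source. The names @code{secs_since_midnight()} and @code{naptime()} are hypothetical and do not appear in @file{alarm.awk}, and the current time is passed in as arguments instead of being obtained from @code{getlocaltime()}.

```awk
# secs_since_midnight(h, m, s) --- hypothetical helper.
# Convert a broken-down time of day into seconds since midnight.
function secs_since_midnight(h, m, s)
{
    return (h * 60 * 60) + (m * 60) + s
}

# naptime(alarm, nowh, nowm, nows) --- hypothetical helper.
# alarm is a string of the form "hh:mm"; nowh, nowm and nows are the
# current hour, minute and second.  Returns the number of seconds to
# wait; a result <= 0 means the alarm time is already in the past.
function naptime(alarm, nowh, nowm, nows,    t, hour, minute)
{
    split(alarm, t, ":")
    hour = t[1] + 0      # force numeric
    minute = t[2] + 0    # force numeric

    # same 12-hour adjustment as in alarm.awk
    if (hour < 12 && nowh > hour)
        hour += 12

    return secs_since_midnight(hour, minute, 0) \
           - secs_since_midnight(nowh, nowm, nows)
}
```

For instance, @code{naptime("5:30", 9, 0, 0)} treats the alarm as 17:30 and returns 30600, which is eight and a half hours expressed in seconds.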
address@hidden ENDOFRANGE tialarm
address@hidden ENDOFRANGE alaex
address@hidden @code{in} operator
-The test in the @code{for} loop is worth noting.
-Any supplementary groups in the @code{PROCINFO} array have the
-indices @code{"group1"} through @code{"address@hidden"} for some
address@hidden, i.e., the total number of supplementary groups.
-However, we don't know in advance how many of these groups
-there are.
address@hidden Translate Program
address@hidden Transliterating Characters
-This loop works by starting at one, concatenating the value with
address@hidden"group"}, and then using @code{in} to see if that value is
-in the array. Eventually, @code{i} is incremented past
-the last group in the array and the loop exits.
address@hidden STARTOFRANGE chtra
address@hidden characters, transliterating
address@hidden @command{tr} utility
+The system @command{tr} utility transliterates characters. For example, it is
+often used to map uppercase letters into lowercase for further processing:
-The loop is also correct if there are @emph{no} supplementary
-groups; then the condition is false the first time it's
-tested, and the loop body never executes.
+@example
+@var{generate data} | tr 'A-Z' 'a-z' | @var{process data} @dots{}
+@end example
address@hidden exercise!!!
address@hidden
-The POSIX version of @command{id} takes arguments that control which
-information is printed. Modify this version to accept the same
-arguments and perform in the same way.
address@hidden ignore
+@command{tr} requires two lists of characters.@footnote{On some older
+systems,
+@ifset ORA
+including Solaris,
+@end ifset
+@command{tr} may require that the lists be written as
+range expressions enclosed in square brackets (@samp{[a-z]}) and quoted,
+to prevent the shell from attempting a @value{FN} expansion. This is
+not a feature.} When processing the input, the first character in the
+first list is replaced with the first character in the second list,
+the second character in the first list is replaced with the second
+character in the second list, and so on. If there are more characters
+in the ``from'' list than in the ``to'' list, the last character of the
+``to'' list is used for the remaining characters in the ``from'' list.
address@hidden Split Program
address@hidden Splitting a Large File into Pieces
+Some time ago,
address@hidden early or mid-1989!
+a user proposed that a transliteration function should
+be added to @command{gawk}.
address@hidden Wishing to avoid gratuitous new features,
address@hidden at least theoretically
+The following program was written to
+prove that character transliteration could be done with a user-level
+function. This program is not as complete as the system @command{tr} utility
+but it does most of the job.
address@hidden FIXME: One day, update to current POSIX version of split
+The @command{translate} program demonstrates one of the few weaknesses
+of standard @command{awk}: dealing with individual characters is very
+painful, requiring repeated use of the @code{substr()}, @code{index()},
+and @code{gsub()} built-in functions
+(@pxref{String Functions}).@footnote{This
+program was written before @command{gawk} acquired the ability to
+split each character in a string into separate array elements.}
+@c Exercise: How might you use this new feature to simplify the program?
+There are two functions. The first, @code{stranslate()}, takes three
+arguments:
address@hidden STARTOFRANGE filspl
address@hidden files, splitting
address@hidden @code{split} utility
-The @command{split} program splits large text files into smaller pieces.
-Usage is as follows:@footnote{This is the traditional usage. The
-POSIX usage is different, but not relevant for what the program
-aims to demonstrate.}
address@hidden @code
address@hidden from
+A list of characters from which to translate.
address@hidden
-split @address@hidden@r{]} file @r{[} @var{prefix} @r{]}
address@hidden example
address@hidden to
+A list of characters to which to translate.
-By default,
-the output files are named @file{xaa}, @file{xab}, and so on. Each file has
-1000 lines in it, with the likely exception of the last file. To change the
-number of lines in each file, supply a number on the command line
-preceded with a minus; e.g., @samp{-500} for files with 500 lines in them
-instead of 1000. To change the name of the output files to something like
address@hidden, @file{myfileab}, and so on, supply an additional
-argument that specifies the @value{FN} prefix.
address@hidden target
+The string on which to do the translation.
address@hidden table
-Here is a version of @command{split} in @command{awk}. It uses the
address@hidden()} and @code{chr()} functions presented in
address@hidden Functions}.
+Associative arrays make the translation part fairly easy. @code{t_ar} holds
+the ``to'' characters, indexed by the ``from'' characters. Then a simple
+loop goes through @code{target}, one character at a time. For each character
+in @code{target}, if the character is an index of @code{t_ar} (that is, it
+appears in @code{from}), it is replaced with the corresponding @code{to}
+character.
-The program first sets its defaults, and then tests to make sure there are
-not too many arguments. It then looks at each argument in turn. The
-first argument could be a minus sign followed by a number. If it is, this
happens
-to look like a negative number, so it is made positive, and that is the
-count of lines. The data @value{FN} is skipped over and the final argument
-is used as the prefix for the output @value{FN}s:
+The @code{translate()} function simply calls @code{stranslate()} using @code{$0}
+as the target. The main program sets two global variables, @code{FROM} and
+@code{TO}, from the command line, and then changes @code{ARGV} so that
+@command{awk} reads from the standard input.
address@hidden @code{split.awk} program
+Finally, the processing rule simply calls @code{translate()} for each record:
+
address@hidden @code{translate.awk} program
@example
address@hidden file eg/prog/split.awk
-# split.awk --- do split in awk
-#
-# Requires ord() and chr() library functions
address@hidden file eg/prog/translate.awk
+# translate.awk --- do tr-like stuff
@c endfile
@ignore
address@hidden file eg/prog/split.awk
address@hidden file eg/prog/translate.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# May 1993
-
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/split.awk
-# usage: split [-num] [file] [outname]
-
-BEGIN @{
- outfile = "x" # default
- count = 1000
- if (ARGC > 4)
- usage()
-
- i = 1
- if (ARGV[i] ~ /^-[[:digit:]]+$/) @{
- count = -ARGV[i]
- ARGV[i] = ""
- i++
- @}
- # test argv in case reading from stdin instead of file
- if (i in ARGV)
- i++ # skip data file name
- if (i in ARGV) @{
- outfile = ARGV[i]
- ARGV[i] = ""
- @}
-
- s1 = s2 = "a"
- out = (outfile s1 s2)
address@hidden
address@hidden endfile
address@hidden example
+# August 1989
+# February 2009 - bug fix
-The next rule does most of the work. @code{tcount} (temporary count) tracks
-how many lines have been printed to the output file so far. If it is greater
-than @code{count}, it is time to close the current file and start a new one.
address@hidden and @code{s2} track the current suffixes for the @value{FN}. If
-they are both @samp{z}, the file is just too big. Otherwise, @code{s1}
-moves to the next letter in the alphabet and @code{s2} starts over again at
address@hidden:
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/translate.awk
+# Bugs: does not handle things like: tr A-Z a-z, it has
+# to be spelled out. However, if `to' is shorter than `from',
+# the last character in `to' is used for the rest of `from'.
address@hidden else on separate line here for page breaking
address@hidden
address@hidden file eg/prog/split.awk
+function stranslate(from, to, target, lf, lt, ltarget, t_ar, i, c,
+ result)
@{
- if (++tcount > count) @{
- close(out)
- if (s2 == "z") @{
- if (s1 == "z") @{
- printf("split: %s is too large to split\n",
- FILENAME) > "/dev/stderr"
- exit 1
- @}
- s1 = chr(ord(s1) + 1)
- s2 = "a"
- @}
address@hidden
- else
- s2 = chr(ord(s2) + 1)
address@hidden group
- out = (outfile s1 s2)
- tcount = 1
+ lf = length(from)
+ lt = length(to)
+ ltarget = length(target)
+ for (i = 1; i <= lt; i++)
+ t_ar[substr(from, i, 1)] = substr(to, i, 1)
+ if (lt < lf)
+ for (; i <= lf; i++)
+ t_ar[substr(from, i, 1)] = substr(to, lt, 1)
+ for (i = 1; i <= ltarget; i++) @{
+ c = substr(target, i, 1)
+ if (c in t_ar)
+ c = t_ar[c]
+ result = result c
@}
- print > out
+ return result
@}
address@hidden endfile
address@hidden example
address@hidden Exercise: do this with just awk builtin functions,
index("abc..."), substr, etc.
+function translate(from, to)
address@hidden
+ return $0 = stranslate(from, to, $0)
address@hidden
address@hidden
-The @code{usage()} function simply prints an error message and exits:
+# main program
+BEGIN @{
address@hidden
+ if (ARGC < 3) @{
+ print "usage: translate from to" > "/dev/stderr"
+ exit
+ @}
address@hidden group
+ FROM = ARGV[1]
+ TO = ARGV[2]
+ ARGC = 2
+ ARGV[1] = "-"
address@hidden
address@hidden
address@hidden file eg/prog/split.awk
-function usage( e)
@{
- e = "usage: split [-num] [file] [outname]"
- print e > "/dev/stderr"
- exit 1
+ translate(FROM, TO)
+ print
@}
@c endfile
@end example
address@hidden
-The variable @code{e} is used so that the function
-fits nicely on the
address@hidden
-screen.
address@hidden ifinfo
address@hidden
-page.
address@hidden ifnotinfo
+While it is possible to do character transliteration in a user-level
+function, it is not necessarily efficient, and we (the @command{gawk}
+authors) started to consider adding a built-in function. However,
+shortly after writing this program, we learned that the System V Release 4
+@command{awk} had added the @code{toupper()} and @code{tolower()} functions
+(@pxref{String Functions}).
+These functions handle the vast majority of the
+cases where character transliteration is necessary, and so we chose to
+simply add those functions to @command{gawk} as well and then leave well
+enough alone.
-This program is a bit sloppy; it relies on @command{awk} to automatically
close the last file
-instead of doing it in an @code{END} rule.
-It also assumes that letters are contiguous in the character set,
-which isn't true for EBCDIC systems.
+An obvious improvement to this program would be to set up the
+@code{t_ar} array only once, in a @code{BEGIN} rule. However, this
+assumes that the ``from'' and ``to'' lists
+will never change throughout the lifetime of the program.
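One way that improvement might look is sketched below, in plain @command{awk} rather than Texinfo-escaped source. It is not the code in @file{translate.awk}: the function names @code{build_table()} and @code{apply_table()} are hypothetical, and the sketch assumes that the two lists are fixed for the whole run.

```awk
# build_table(from, to, table) --- hypothetical helper.
# Fill table[] once: each ``from'' character indexes its ``to''
# character.  If `to' is shorter than `from', the last character
# of `to' is used for the rest of `from'.
function build_table(from, to, table,    lf, lt, i)
{
    lf = length(from)
    lt = length(to)
    for (i = 1; i <= lf; i++)
        table[substr(from, i, 1)] = substr(to, (i <= lt ? i : lt), 1)
}

# apply_table(target, table) --- hypothetical helper.
# Translate target one character at a time using a prebuilt table.
function apply_table(target, table,    n, i, c, result)
{
    n = length(target)
    for (i = 1; i <= n; i++) {
        c = substr(target, i, 1)
        result = result ((c in table) ? table[c] : c)
    }
    return result
}
```

With these, the @code{BEGIN} rule would call @code{build_table(FROM, TO, T_AR)} a single time, and the processing rule would reduce to printing @code{apply_table($0, T_AR)}, where @code{T_AR} is an assumed global array name.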
address@hidden ENDOFRANGE chtra
address@hidden Exercise: Fix these problems.
address@hidden BFD...
address@hidden ENDOFRANGE filspl
address@hidden Labels Program
address@hidden Printing Mailing Labels
address@hidden Tee Program
address@hidden Duplicating Output into Multiple Files
address@hidden STARTOFRANGE prml
address@hidden printing, mailing labels
address@hidden STARTOFRANGE mlprint
address@hidden mailing address@hidden printing
+Here is a ``real world''@footnote{``Real world'' is defined as
+``a program actually used to get something done.''}
+program. This
+script reads lists of names and
+addresses and generates mailing labels. Each page of labels has 20 labels
+on it, two across and 10 down. The addresses are guaranteed to be no more
+than five lines of data. Each address is separated from the next by a blank
+line.
address@hidden files, address@hidden duplicating output into
address@hidden output, duplicating into files
address@hidden @code{tee} utility
-The @code{tee} program is known as a ``pipe fitting.'' @code{tee} copies
-its standard input to its standard output and also duplicates it to the
-files named on the command line. Its usage is as follows:
+The basic idea is to read 20 labels worth of data. Each line of each label
+is stored in the @code{line} array. The single rule takes care of filling
+the @code{line} array and printing the page when 20 labels have been read.
+
+The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that
+@command{awk} splits records at blank lines
+(@pxref{Records}).
+It sets @code{MAXLINES} to 100, since 100 is the maximum number
+of lines on the page (20 * 5 = 100).
+
+Most of the work is done in the @code{printpage()} function.
+The label lines are stored sequentially in the @code{line} array. But they
+have to print horizontally; @code{line[1]} next to @code{line[6]},
+@code{line[2]} next to @code{line[7]}, and so on. Two loops are used to
+accomplish this. The outer loop, controlled by @code{i}, steps through
+every 10 lines of data; this is each row of labels. The inner loop,
+controlled by @code{j}, goes through the lines within the row.
+As @code{j} goes from 0 to 4, @samp{i+j} is the @code{j}-th line in
+the row, and @samp{i+j+5} is the entry next to it. The output ends up
+looking something like this:
@example
-tee @address@hidden file @dots{}
+line 1 line 6
+line 2 line 7
+line 3 line 8
+line 4 line 9
+line 5 line 10
address@hidden
@end example
-The @option{-a} option tells @code{tee} to append to the named files, instead
of
-truncating them and starting over.
address@hidden
+The @code{printf} format string @samp{%-41s} left-aligns
+the data and prints it within a fixed-width field.
-The @code{BEGIN} rule first makes a copy of all the command-line arguments
-into an array named @code{copy}.
address@hidden is not copied, since it is not needed.
address@hidden cannot use @code{ARGV} directly, since @command{awk} attempts to
-process each @value{FN} in @code{ARGV} as input data.
+As a final note, an extra blank line is printed at lines 21 and 61, to keep
+the output lined up on the labels. This is dependent on the particular
+brand of labels in use when the program was written. You will also note
+that there are two blank lines at the top and two blank lines at the bottom.
address@hidden flag variables
-If the first argument is @option{-a}, then the flag variable
address@hidden is set to true, and both @code{ARGV[1]} and
address@hidden are deleted. If @code{ARGC} is less than two, then no
address@hidden were supplied and @code{tee} prints a usage message and exits.
-Finally, @command{awk} is forced to read the standard input by setting
address@hidden to @code{"-"} and @code{ARGC} to two:
+The @code{END} rule arranges to flush the final page of labels; there may
+not have been an even multiple of 20 labels in the data:
address@hidden @code{tee.awk} program
address@hidden @code{labels.awk} program
@example
address@hidden file eg/prog/tee.awk
-# tee.awk --- tee in awk
-#
-# Copy standard input to all named output files.
-# Append content if -a option is supplied.
-#
address@hidden file eg/prog/labels.awk
+# labels.awk --- print mailing labels
@c endfile
@ignore
address@hidden file eg/prog/tee.awk
address@hidden file eg/prog/labels.awk
+#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# May 1993
-# Revised December 1995
-
+# June 1992
+# December 2010, minor edits
@c endfile
@end ignore
address@hidden file eg/prog/tee.awk
-BEGIN \
address@hidden
- for (i = 1; i < ARGC; i++)
- copy[i] = ARGV[i]
address@hidden file eg/prog/labels.awk
- if (ARGV[1] == "-a") @{
- append = 1
- delete ARGV[1]
- delete copy[1]
- ARGC--
- @}
- if (ARGC < 2) @{
- print "usage: tee [-a] file ..." > "/dev/stderr"
- exit 1
- @}
- ARGV[1] = "-"
- ARGC = 2
address@hidden
address@hidden endfile
address@hidden example
+# Each label is 5 lines of data that may have blank lines.
+# The label sheets have 2 blank lines at the top and 2 at
+# the bottom.
-The following single rule does all the work. Since there is no pattern, it is
-executed for each line of input. The body of the rule simply prints the
-line into each file on the command line, and then to the standard output:
+BEGIN @{ RS = "" ; MAXLINES = 100 @}
address@hidden
address@hidden file eg/prog/tee.awk
+function printpage( i, j)
@{
- # moving the if outside the loop makes it run faster
- if (append)
- for (i in copy)
- print >> copy[i]
- else
- for (i in copy)
- print > copy[i]
- print
address@hidden
address@hidden endfile
address@hidden example
+ if (Nlines <= 0)
+ return
address@hidden
-It is also possible to write the loop this way:
+ printf "\n\n" # header
+
+ for (i = 1; i <= Nlines; i += 10) @{
+ if (i == 21 || i == 61)
+ print ""
+ for (j = 0; j < 5; j++) @{
+ if (i + j > MAXLINES)
+ break
+ printf " %-41s %s\n", line[i+j], line[i+j+5]
+ @}
+ print ""
+ @}
address@hidden
-for (i in copy)
- if (append)
- print >> copy[i]
- else
- print > copy[i]
address@hidden example
+ printf "\n\n" # footer
address@hidden
-This is more concise but it is also less efficient. The @samp{if} is
-tested for each record and for each output file. By duplicating the loop
-body, the @samp{if} is only tested once for each input record. If there are
address@hidden input records and @var{M} output files, the first method only
-executes @var{N} @samp{if} statements, while the second executes
address@hidden@address@hidden @samp{if} statements.
+ delete line
address@hidden
-Finally, the @code{END} rule cleans up by closing all the output files:
+# main rule
address@hidden
+ if (Count >= 20) @{
+ printpage()
+ Count = 0
+ Nlines = 0
+ @}
+ n = split($0, a, "\n")
+ for (i = 1; i <= n; i++)
+ line[++Nlines] = a[i]
+ for (; i <= 5; i++)
+ line[++Nlines] = ""
+ Count++
address@hidden
address@hidden
address@hidden file eg/prog/tee.awk
END \
@{
- for (i in copy)
- close(copy[i])
+ printpage()
@}
@c endfile
@end example
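The index arithmetic used by @code{printpage()} is easy to get wrong by one, so it can help to look at it separately. The following sketch, in plain @command{awk} rather than Texinfo-escaped source, uses a hypothetical function @code{row_pairs()} that is not part of @file{labels.awk}. For the row of labels starting at @code{line[i]}, it lists which two elements of @code{line} end up side by side on each of the five output lines.

```awk
# row_pairs(i) --- hypothetical helper, not part of labels.awk.
# For the row beginning at line[i], return the five "left-right"
# index pairs that printpage() prints next to each other.
function row_pairs(i,    j, out)
{
    for (j = 0; j < 5; j++)
        out = out sprintf("%s%d-%d", (j ? " " : ""), i + j, i + j + 5)
    return out
}
```

Calling @code{row_pairs(1)} gives @samp{1-6 2-7 3-8 4-9 5-10}, matching the layout shown earlier. The outer loop then advances @code{i} by 10, so the second row starts at @code{line[11]}.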
address@hidden ENDOFRANGE prml
address@hidden ENDOFRANGE mlprint
address@hidden Uniq Program
address@hidden Printing Nonduplicated Lines of Text
-
address@hidden FIXME: One day, update to current POSIX version of uniq
-
address@hidden STARTOFRANGE prunt
address@hidden printing, unduplicated lines of text
address@hidden STARTOFRANGE tpul
address@hidden address@hidden printing, unduplicated lines of
address@hidden @command{uniq} utility
-The @command{uniq} utility reads sorted lines of data on its standard
-input, and by default removes duplicate lines. In other words, it only
-prints unique lines---hence the name. @command{uniq} has a number of
-options. The usage is as follows:
-
address@hidden
-uniq @r{[}-udc @address@hidden@r{]]} @address@hidden@r{]} @r{[} @var{input
file} @r{[} @var{output file} @r{]]}
address@hidden example
address@hidden Word Sorting
address@hidden Generating Word-Usage Counts
-The options for @command{uniq} are:
address@hidden STARTOFRANGE worus
address@hidden words, usage address@hidden generating
-@table @code
-@item -d
-Print only repeated lines.
+When working with large amounts of text, it can be interesting to know
+how often different words appear. For example, an author may overuse
+certain words, in which case she might wish to find synonyms to substitute
+for words that appear too often. This @value{SUBSECTION} develops a
+program for counting words and presenting the frequency information
+in a useful format.
-@item -u
-Print only nonrepeated lines.
+At first glance, a program like this would seem to do the job:
-@item -c
-Count lines. This option overrides @option{-d} and @option{-u}. Both repeated
-and nonrepeated lines are counted.
+@example
+# Print list of word frequencies
-@item -@var{n}
-Skip @var{n} fields before comparing lines. The definition of fields
-is similar to @command{awk}'s default: nonwhitespace characters separated
-by runs of spaces and/or TABs.
+@{
+ for (i = 1; i <= NF; i++)
+ freq[$i]++
+@}
-@item +@var{n}
-Skip @var{n} characters before comparing lines. Any fields specified with
address@hidden@var{n}} are skipped first.
+END @{
+ for (word in freq)
+ printf "%s\t%d\n", word, freq[word]
+@}
+@end example
-@item @var{input file}
-Data is read from the input file named on the command line, instead of from
-the standard input.
+The program relies on @command{awk}'s default field splitting
+mechanism to break each line up into ``words,'' and uses an
+associative array named @code{freq}, indexed by each word, to count
+the number of times the word occurs. In the @code{END} rule,
+it prints the counts.
-@item @var{output file}
-The generated output is sent to the named output file, instead of to the
-standard output.
-@end table
+This program has several problems that prevent it from being
+useful on real text files:
-Normally @command{uniq} behaves as if both the @option{-d} and
-@option{-u} options are provided.
+@itemize @bullet
+@item
+The @command{awk} language considers upper- and lowercase characters to be
+distinct. Therefore, ``bartender'' and ``Bartender'' are not treated
+as the same word. This is undesirable, since in normal text, words
+are capitalized if they begin sentences, and a frequency analyzer should not
+be sensitive to capitalization.
-@command{uniq} uses the
-@code{getopt()} library function
-(@pxref{Getopt Function})
-and the @code{join()} library function
-(@pxref{Join Function}).
+@item
+Words are detected using the @command{awk} convention that fields are
+separated just by whitespace. Other characters in the input (except
+newlines) don't have any special meaning to @command{awk}. This means that
+punctuation characters count as part of words.
-The program begins with a @code{usage()} function and then a brief outline of
-the options and their meanings in comments.
-The @code{BEGIN} rule deals with the command-line arguments and options. It
-uses a trick to get @code{getopt()} to handle options of the form @samp{-25},
-treating such an option as the option letter @samp{2} with an argument of
-@samp{5}. If indeed two or more digits are supplied (@code{Optarg} looks
-like a number), @code{Optarg} is
-concatenated with the option digit and then the result is added to zero to make
-it into a number. If there is only one digit in the option, then
-@code{Optarg} is not needed. In this case, @code{Optind} must be decremented so that
-@code{getopt()} processes it next time. This code is admittedly a bit
-tricky.
+@item
+The output does not come out in any useful order. You're more likely to be
+interested in which words occur most frequently or in having an alphabetized
+table of how frequently each word occurs.
+@end itemize
-If no options are supplied, then the default is taken, to print both
-repeated and nonrepeated lines. The output file, if provided, is assigned
-to @code{outputfile}. Early on, @code{outputfile} is initialized to the
-standard output, @file{/dev/stdout}:
address@hidden @command{sort} utility
+The first problem can be solved by using @code{tolower()} to remove case
+distinctions. The second problem can be solved by using @code{gsub()}
+to remove punctuation characters. Finally, we solve the third problem
+by using the system @command{sort} utility to process the output of the
+@command{awk} script. Here is the new version of the program:
address@hidden @code{uniq.awk} program
address@hidden @code{wordfreq.awk} program
@example
address@hidden file eg/prog/uniq.awk
address@hidden
-# uniq.awk --- do uniq in awk
-#
-# Requires getopt() and join() library functions
address@hidden group
address@hidden endfile
address@hidden
address@hidden file eg/prog/uniq.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# May 1993
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/uniq.awk
address@hidden file eg/prog/wordfreq.awk
+# wordfreq.awk --- print list of word frequencies
-function usage( e)
@{
- e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
- print e > "/dev/stderr"
- exit 1
+ $0 = tolower($0) # remove case distinctions
+ # remove punctuation
+ gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
+ for (i = 1; i <= NF; i++)
+ freq[$i]++
@}
-# -c count lines. overrides -d and -u
-# -d only repeated lines
-# -u only nonrepeated lines
-# -n skip n fields
-# +n skip n characters, skip fields first
-
-BEGIN \
address@hidden
- count = 1
- outputfile = "/dev/stdout"
- opts = "udc0:1:2:3:4:5:6:7:8:9:"
- while ((c = getopt(ARGC, ARGV, opts)) != -1) @{
- if (c == "u")
- non_repeated_only++
- else if (c == "d")
- repeated_only++
- else if (c == "c")
- do_count++
- else if (index("0123456789", c) != 0) @{
- # getopt requires args to options
- # this messes us up for things like -5
- if (Optarg ~ /^[[:digit:]]+$/)
- fcount = (c Optarg) + 0
- else @{
- fcount = c + 0
- Optind--
- @}
- @} else
- usage()
- @}
-
- if (ARGV[Optind] ~ /^\+[[:digit:]]+$/) @{
- charcount = substr(ARGV[Optind], 2) + 0
- Optind++
- @}
-
- for (i = 1; i < Optind; i++)
- ARGV[i] = ""
-
- if (repeated_only == 0 && non_repeated_only == 0)
- repeated_only = non_repeated_only = 1
-
- if (ARGC - Optind == 2) @{
- outputfile = ARGV[ARGC - 1]
- ARGV[ARGC - 1] = ""
- @}
address@hidden
@c endfile
+END @{
+ for (word in freq)
+ printf "%s\t%d\n", word, freq[word]
+@}
@end example
-The following function, @code{are_equal()}, compares the current line,
-@code{$0}, to the
-previous line, @code{last}. It handles skipping fields and characters.
-If no field count and no character count are specified, @code{are_equal()}
-simply returns one or zero depending upon the result of a simple string
-comparison of @code{last} and @code{$0}. Otherwise, things get more
-complicated.
-If fields have to be skipped, each line is broken into an array using
-@code{split()}
-(@pxref{String Functions});
-the desired fields are then joined back into a line using @code{join()}.
-The joined lines are stored in @code{clast} and @code{cline}.
-If no fields are skipped, @code{clast} and @code{cline} are set to
-@code{last} and @code{$0}, respectively.
-Finally, if characters are skipped, @code{substr()} is used to strip off the
-leading @code{charcount} characters in @code{clast} and @code{cline}. The
-two strings are then compared and @code{are_equal()} returns the result:
+Assuming we have saved this program in a file named @file{wordfreq.awk},
+and that the data is in @file{file1}, the following pipeline:
@example
address@hidden file eg/prog/uniq.awk
-function are_equal( n, m, clast, cline, alast, aline)
address@hidden
- if (fcount == 0 && charcount == 0)
- return (last == $0)
+awk -f wordfreq.awk file1 | sort -k 2nr
+@end example
- if (fcount > 0) @{
- n = split(last, alast)
- m = split($0, aline)
- clast = join(alast, fcount+1, n)
- cline = join(aline, fcount+1, m)
- @} else @{
- clast = last
- cline = $0
- @}
- if (charcount) @{
- clast = substr(clast, charcount + 1)
- cline = substr(cline, charcount + 1)
- @}
+@noindent
+produces a table of the words appearing in @file{file1} in order of
+decreasing frequency.
- return (clast == cline)
+The @command{awk} program suitably massages the
+data and produces a word frequency table, which is not ordered.
+The @command{awk} script's output is then sorted by the @command{sort}
+utility and printed on the screen.
+
+The options given to @command{sort}
+specify a sort that uses the second field of each input line (skipping
+one field), that the sort keys should be treated as numeric quantities
+(otherwise @samp{15} would come before @samp{5}), and that the sorting
+should be done in descending (reverse) order.
+
+The @command{sort} could even be done from within the program, by changing
+the @code{END} action to:
+
+@example
+@c file eg/prog/wordfreq.awk
+END @{
+ sort = "sort -k 2nr"
+ for (word in freq)
+ printf "%s\t%d\n", word, freq[word] | sort
+ close(sort)
@}
@c endfile
@end example
-The following two rules are the body of the program. The first one is
-executed only for the very first line of data. It sets @code{last} equal to
-@code{$0}, so that subsequent lines of text have something to be compared to.
+This way of sorting must be used on systems that do not
+have true pipes at the command-line (or batch-file) level.
+See the general operating system documentation for more information on how
+to use the @command{sort} program.
address@hidden ENDOFRANGE worus
-The second rule does the work. The variable @code{equal} is one or zero,
-depending upon the results of @code{are_equal()}'s comparison. If @command{uniq}
-is counting repeated lines, and the lines are equal, then it increments the @code{count} variable.
-Otherwise, it prints the line and resets @code{count},
-since the two lines are not equal.
+@node History Sorting
+@subsection Removing Duplicates from Unsorted Text
-If @command{uniq} is not counting, and if the lines are equal, @code{count} is incremented.
-Nothing is printed, since the point is to remove duplicates.
-Otherwise, if @command{uniq} is counting repeated lines and more than
-one line is seen, or if @command{uniq} is counting nonrepeated lines
-and only one line is seen, then the line is printed, and @code{count}
-is reset.
address@hidden STARTOFRANGE lidu
address@hidden lines, address@hidden removing
+The @command{uniq} program
+(@pxref{Uniq Program}),
+removes duplicate lines from @emph{sorted} data.
-Finally, similar logic is used in the @code{END} rule to print the final
-line of input data:
+Suppose, however, you need to remove duplicate lines from a @value{DF} but
+that you want to preserve the order the lines are in. A good example of
+this might be a shell history file. The history file keeps a copy of all
+the commands you have entered, and it is not unusual to repeat a command
+several times in a row. Occasionally you might want to compact the history
+by removing duplicate entries. Yet it is desirable to maintain the order
+of the original commands.
+This simple program does the job. It uses two arrays. The @code{data}
+array is indexed by the text of each line.
+For each line, @code{data[$0]} is incremented.
+If a particular line has not
+been seen before, then @code{data[$0]} is zero.
+In this case, the text of the line is stored in @code{lines[count]}.
+Each element of @code{lines} is a unique command, and the indices of
+@code{lines} indicate the order in which those lines are encountered.
+The @code{END} rule simply prints out the lines, in order:
+
+@cindex Rakitzis, Byron
+@cindex @code{histsort.awk} program
@example
address@hidden file eg/prog/uniq.awk
-NR == 1 @{
- last = $0
- next
address@hidden
address@hidden file eg/prog/histsort.awk
+# histsort.awk --- compact a shell history file
+# Thanks to Byron Rakitzis for the general idea
address@hidden endfile
address@hidden
address@hidden file eg/prog/histsort.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# May 1993
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/histsort.awk
address@hidden
@{
- equal = are_equal()
-
- if (do_count) @{ # overrides -d and -u
- if (equal)
- count++
- else @{
- printf("%4d %s\n", count, last) > outputfile
- last = $0
- count = 1 # reset
- @}
- next
- @}
-
- if (equal)
- count++
- else @{
- if ((repeated_only && count > 1) ||
- (non_repeated_only && count == 1))
- print last > outputfile
- last = $0
- count = 1
- @}
+ if (data[$0]++ == 0)
+ lines[++count] = $0
@}
address@hidden group
address@hidden
END @{
- if (do_count)
- printf("%4d %s\n", count, last) > outputfile
- else if ((repeated_only && count > 1) ||
- (non_repeated_only && count == 1))
- print last > outputfile
- close(outputfile)
+ for (i = 1; i <= count; i++)
+ print lines[i]
@}
address@hidden group
@c endfile
@end example
address@hidden ENDOFRANGE prunt
address@hidden ENDOFRANGE tpul
-
address@hidden Wc Program
address@hidden Counting Things
-
address@hidden FIXME: One day, update to current POSIX version of wc
address@hidden STARTOFRANGE count
address@hidden counting
address@hidden STARTOFRANGE infco
address@hidden input files, counting elements in
address@hidden STARTOFRANGE woco
address@hidden words, counting
address@hidden STARTOFRANGE chco
address@hidden characters, counting
address@hidden STARTOFRANGE lico
address@hidden lines, counting
address@hidden @command{wc} utility
-The @command{wc} (word count) utility counts lines, words, and characters in
-one or more input files. Its usage is as follows:
+This program also provides a foundation for generating other useful
+information. For example, using the following @code{print} statement in the
+@code{END} rule indicates how often a particular command is used:
@example
-wc @address@hidden @r{[} @var{files} @dots{} @r{]}
+print data[lines[i]], lines[i]
@end example
-If no files are specified on the command line, @command{wc} reads its standard
-input. If there are multiple files, it also prints total counts for all
-the files. The options and their meanings are shown in the following list:
+This works because @code{data[$0]} is incremented each time a line is
+seen.
address@hidden ENDOFRANGE lidu
address@hidden @code
address@hidden -l
-Count only lines.
+@node Extract Program
+@subsection Extracting Programs from Texinfo Source Files
+
address@hidden STARTOFRANGE texse
address@hidden Texinfo, extracting programs from source files
address@hidden STARTOFRANGE fitex
address@hidden files, address@hidden extracting programs from
+@ifnotinfo
+Both this chapter and the previous chapter
+(@ref{Library Functions})
+present a large number of @command{awk} programs.
+@end ifnotinfo
+@ifinfo
+The nodes
+@ref{Library Functions},
+and @ref{Sample Programs},
+are the top level nodes for a large number of @command{awk} programs.
+@end ifinfo
+If you want to experiment with these programs, it is tedious to have to type
+them in by hand. Here we present a program that can extract parts of a
+Texinfo input file into separate files.
+
address@hidden Texinfo
+This @value{DOCUMENT} is written in @uref{http://texinfo.org, Texinfo},
+the GNU project's document formatting language.
+A single Texinfo source file can be used to produce both
+printed and online documentation.
+@ifnotinfo
+Texinfo is fully documented in the book
+@cite{Texinfo---The GNU Documentation Format},
+available from the Free Software Foundation.
+@end ifnotinfo
+@ifinfo
+The Texinfo language is described fully, starting with
address@hidden, , Texinfo, texinfo,Texinfo---The GNU Documentation Format}.
+@end ifinfo
+
+For our purposes, it is enough to know three things about Texinfo input
+files:
+
+@itemize @bullet
+@item
+The ``at'' symbol (@samp{@@}) is special in Texinfo, much as
+the backslash (@samp{\}) is in C
+or @command{awk}. Literal @samp{@@} symbols are represented in Texinfo source
+files as @samp{@@@@}.
+
+@item
+Comments start with either @samp{@@c} or @samp{@@comment}.
+The file-extraction program works by using special comments that start
+at the beginning of a line.
+
+@item
+Lines containing @samp{@@group} and @samp{@@end group} commands bracket
+example text that should not be split across a page boundary.
+(Unfortunately, @TeX{} isn't always smart enough to do things exactly right,
+so we have to give it some help.)
+@end itemize
+
+The following program, @file{extract.awk}, reads through a Texinfo source
+file and does two things, based on the special comments.
+Upon seeing @samp{@w{@@c system @dots{}}},
+it runs a command, by extracting the command text from the
+control line and passing it on to the @code{system()} function
+(@pxref{I/O Functions}).
+Upon seeing @samp{@@c file @var{filename}}, each subsequent line is sent to
+the file @var{filename}, until @samp{@@c endfile} is encountered.
+The rules in @file{extract.awk} match either @samp{@@c} or
+@samp{@@comment} by letting the @samp{omment} part be optional.
+Lines containing @samp{@@group} and @samp{@@end group} are simply removed.
+@file{extract.awk} uses the @code{join()} library function
+(@pxref{Join Function}).
+
+The example programs in the online Texinfo source for @address@hidden
+(@file{gawk.texi}) have all been bracketed inside @samp{file} and
+@samp{endfile} lines. The @command{gawk} distribution uses a copy of
+@file{extract.awk} to extract the sample programs and install many
+of them in a standard directory where @command{gawk} can find them.
+The Texinfo file looks something like this:
address@hidden -w
-Count only words.
-A ``word'' is a contiguous sequence of nonwhitespace characters, separated
-by spaces and/or TABs. Luckily, this is the normal way @command{awk} separates
-fields in its input data.
address@hidden
address@hidden
+This program has a @@code@{BEGIN@} rule,
+that prints a nice message:
address@hidden -c
-Count only characters.
address@hidden table
+@@example
+@@c file examples/messages.awk
+BEGIN @@@{ print "Don't panic!" @@@}
+@@c end file
+@@end example
-Implementing @command{wc} in @command{awk} is particularly elegant,
-since @command{awk} does a lot of the work for us; it splits lines into
-words (i.e., fields) and counts them, it counts lines (i.e., records),
-and it can easily tell us how long a line is.
+It also prints some final advice:
-This program uses the @code{getopt()} library function
-(@pxref{Getopt Function})
-and the file-transition functions
-(@pxref{Filetrans Function}).
+@@example
+@@c file examples/messages.awk
+END @@@{ print "Always avoid bored archeologists!" @@@}
+@@c end file
+@@end example
address@hidden
address@hidden example
-This version has one notable difference from traditional versions of
-@command{wc}: it always prints the counts in the order lines, words,
-and characters. Traditional versions note the order of the @option{-l},
-@option{-w}, and @option{-c} options on the command line, and print the
-counts in that order.
+@file{extract.awk} begins by setting @code{IGNORECASE} to one, so that
+mixed upper- and lowercase letters in the directives won't matter.
-The @code{BEGIN} rule does the argument processing. The variable
-@code{print_total} is true if more than one file is named on the
-command line:
+The first rule handles calling @code{system()}, checking that a command is
+given (@code{NF} is at least three) and also checking that the command
+exits with a zero exit status, signifying OK:
address@hidden @code{wc.awk} program
address@hidden @code{extract.awk} program
@example
address@hidden file eg/prog/wc.awk
-# wc.awk --- count lines, words, characters
address@hidden file eg/prog/extract.awk
+# extract.awk --- extract files and run programs
+# from texinfo files
@c endfile
@ignore
address@hidden file eg/prog/wc.awk
address@hidden file eg/prog/extract.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
+# Revised September 2000
@c endfile
@end ignore
address@hidden file eg/prog/wc.awk
address@hidden file eg/prog/extract.awk
-# Options:
-# -l only count lines
-# -w only count words
-# -c only count characters
-#
-# Default is to count lines, words, characters
-#
-# Requires getopt() and file transition library functions
+BEGIN @{ IGNORECASE = 1 @}
-BEGIN @{
- # let getopt() print a message about
- # invalid options. we ignore them
- while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
- if (c == "l")
- do_lines = 1
- else if (c == "w")
- do_words = 1
- else if (c == "c")
- do_chars = 1
+/^@@c(omment)?[ \t]+system/ \
address@hidden
+ if (NF < 3) @{
+ e = (FILENAME ":" FNR)
+ e = (e ": badly formed `system' line")
+ print e > "/dev/stderr"
+ next
+ @}
+ $1 = ""
+ $2 = ""
+ stat = system($0)
+ if (stat != 0) @{
+ e = (FILENAME ":" FNR)
+ e = (e ": warning: system returned " stat)
+ print e > "/dev/stderr"
@}
- for (i = 1; i < Optind; i++)
- ARGV[i] = ""
-
- # if no options, do all
- if (! do_lines && ! do_words && ! do_chars)
- do_lines = do_words = do_chars = 1
-
- print_total = (ARGC - i > 2)
@}
@c endfile
@end example
-The @code{beginfile()} function is simple; it just resets the counts of lines,
-words, and characters to zero, and saves the current @value{FN} in
-@code{fname}:
+@noindent
+The variable @code{e} is used so that the rule
+fits nicely on the
+@ifnotinfo
+page.
+@end ifnotinfo
+@ifnottex
+screen.
+@end ifnottex
+
+The second rule handles moving data into files. It verifies that a
+@value{FN} is given in the directive. If the file named is not the
+current file, then the current file is closed. Keeping the current file
+open until a new file is encountered allows the use of the @samp{>}
+redirection for printing the contents, keeping open file management
+simple.
+
+The @code{for} loop does the work. It reads lines using @code{getline}
+(@pxref{Getline}).
+For an unexpected end of file, it calls the @code{@w{unexpected_eof()}}
+function. If the line is an ``endfile'' line, then it breaks out of
+the loop.
+If the line is an @samp{@@group} or @samp{@@end group} line, then it
+ignores it and goes on to the next line.
+Similarly, comments within examples are also ignored.
+
+Most of the work is in the following few lines. If the line has no @samp{@@}
+symbols, the program can print it directly.
+Otherwise, each leading @samp{@@} must be stripped off.
+To remove the @samp{@@} symbols, the line is split into separate elements of
+the array @code{a}, using the @code{split()} function
+(@pxref{String Functions}).
+The @samp{@@} symbol is used as the separator character.
+Each element of @code{a} that is empty indicates two successive @samp{@@}
+symbols in the original line. For each two empty elements (@samp{@@@@} in
+the original file), we have to add a single @samp{@@} symbol back
+in.@footnote{This program was written before @command{gawk} had the
+@code{gensub()} function. Consider how you might use it to simplify the code.}
+
+When the processing of the array is finished, @code{join()} is called with the
+value of @code{SUBSEP}, to rejoin the pieces back into a single
+line. That line is then printed to the output file:
@example
address@hidden file eg/prog/wc.awk
-function beginfile(file)
address@hidden file eg/prog/extract.awk
+/^@@c(omment)?[ \t]+file/ \
@{
- lines = words = chars = 0
- fname = FILENAME
+ if (NF != 3) @{
+ e = (FILENAME ":" FNR ": badly formed `file' line")
+ print e > "/dev/stderr"
+ next
+ @}
+ if ($3 != curfile) @{
+ if (curfile != "")
+ close(curfile)
+ curfile = $3
+ @}
+
+ for (;;) @{
+ if ((getline line) <= 0)
+ unexpected_eof()
+ if (line ~ /^@@c(omment)?[ \t]+endfile/)
+ break
+ else if (line ~ /^@@(end[ \t]+)?group/)
+ continue
+ else if (line ~ /^@@c(omment+)?[ \t]+/)
+ continue
+ if (index(line, "@@") == 0) @{
+ print line > curfile
+ continue
+ @}
+ n = split(line, a, "@@")
+ # if a[1] == "", means leading @@,
+ # don't add one back in.
+ for (i = 2; i <= n; i++) @{
+ if (a[i] == "") @{ # was an @@@@
+ a[i] = "@@"
+ if (a[i+1] == "")
+ i++
+ @}
+ @}
+ print join(a, 1, n, SUBSEP) > curfile
+ @}
@}
@c endfile
@end example
-The @code{endfile()} function adds the current file's numbers to the running
-totals of lines, words, and address@hidden@command{wc} can't just use the value of
-@code{FNR} in @code{endfile()}. If you examine
-the code in
-@ref{Filetrans Function},
-you will see that
-@code{FNR} has already been reset by the time
-@code{endfile()} is called.} It then prints out those numbers
-for the file that was just read. It relies on @code{beginfile()} to reset the
-numbers for the following @value{DF}:
address@hidden FIXME: ONE DAY: make the above footnote an exercise,
address@hidden instead of giving away the answer.
+An important thing to note is the use of the @samp{>} redirection.
+Output done with @samp{>} only opens the file once; it stays open and
+subsequent output is appended to the file
+(@pxref{Redirection}).
+This makes it easy to mix program text and explanatory prose for the same
+sample source file (as has been done here!) without any hassle. The file is
+only closed when a new data @value{FN} is encountered or at the end of the
+input file.
+
+Finally, the function @code{@w{unexpected_eof()}} prints an appropriate
+error message and then exits.
+The @code{END} rule handles the final cleanup, closing the open file:
address@hidden function lb put on same line for page breaking. sigh
@example
address@hidden file eg/prog/wc.awk
-function endfile(file)
address@hidden
- tlines += lines
- twords += words
- tchars += chars
- if (do_lines)
- printf "\t%d", lines
address@hidden file eg/prog/extract.awk
@group
- if (do_words)
- printf "\t%d", words
+function unexpected_eof()
address@hidden
+ printf("%s:%d: unexpected EOF or error\n",
+ FILENAME, FNR) > "/dev/stderr"
+ exit 1
address@hidden
@end group
- if (do_chars)
- printf "\t%d", chars
- printf "\t%s\n", fname
+
+END @{
+ if (curfile)
+ close(curfile)
@}
@c endfile
@end example
address@hidden ENDOFRANGE texse
address@hidden ENDOFRANGE fitex
+
+@node Simple Sed
+@subsection A Simple Stream Editor
+
+@cindex @command{sed} utility
+@cindex stream editors
+The @command{sed} utility is a stream editor, a program that reads a
+stream of data, makes changes to it, and passes it on.
+It is often used to make global changes to a large file or to a stream
+of data generated by a pipeline of commands.
+While @command{sed} is a complicated program in its own right, its most common
+use is to perform global substitutions in the middle of a pipeline:
+
+@example
+command1 < orig.data | sed 's/old/new/g' | command2 > result
+@end example
-There is one rule that is executed for each line. It adds the length of
-the record, plus one, to @address@hidden @command{gawk}
-understands multibyte locales, this code counts characters, not bytes.}
-Adding one plus the record length
-is needed because the newline character separating records (the value
-of @code{RS}) is not part of the record itself, and thus not included
-in its length. Next, @code{lines} is incremented for each line read,
-and @code{words} is incremented by the value of @code{NF}, which is the
-number of ``words'' on this line:
+Here, @samp{s/old/new/g} tells @command{sed} to look for the regexp
+@samp{old} on each input line and globally replace it with the text
+@samp{new}, i.e., all the occurrences on a line. This is similar to
+@command{awk}'s @code{gsub()} function
+(@pxref{String Functions}).
+
+The following program, @file{awksed.awk}, accepts at least two command-line
+arguments: the pattern to look for and the text to replace it with. Any
+additional arguments are treated as data @value{FN}s to process. If none
+are provided, the standard input is used:
address@hidden Brennan, Michael
address@hidden @command{awksed.awk} program
address@hidden @cindex simple stream editor
address@hidden @cindex stream editor, simple
@example
address@hidden file eg/prog/wc.awk
-# do per line
address@hidden file eg/prog/awksed.awk
+# awksed.awk --- do s/foo/bar/g using just print
+# Thanks to Michael Brennan for the idea
address@hidden endfile
address@hidden
address@hidden file eg/prog/awksed.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# August 1995
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/awksed.awk
+
+function usage()
@{
- chars += length($0) + 1 # get newline
- lines++
- words += NF
+ print "usage: awksed pat repl [files...]" > "/dev/stderr"
+ exit 1
@}
address@hidden endfile
address@hidden example
-Finally, the @code{END} rule simply prints the totals for all the files:
+BEGIN @{
+ # validate arguments
+ if (ARGC < 3)
+ usage()
address@hidden
address@hidden file eg/prog/wc.awk
-END @{
- if (print_total) @{
- if (do_lines)
- printf "\t%d", tlines
- if (do_words)
- printf "\t%d", twords
- if (do_chars)
- printf "\t%d", tchars
- print "\ttotal"
- @}
+ RS = ARGV[1]
+ ORS = ARGV[2]
+
+ # don't use arguments as files
+ ARGV[1] = ARGV[2] = ""
address@hidden
+
address@hidden
+# look ma, no hands!
address@hidden
+ if (RT == "")
+ printf "%s", $0
+ else
+ print
@}
address@hidden group
@c endfile
@end example
address@hidden ENDOFRANGE count
address@hidden ENDOFRANGE infco
address@hidden ENDOFRANGE lico
address@hidden ENDOFRANGE woco
address@hidden ENDOFRANGE chco
address@hidden ENDOFRANGE posimawk
-@node Miscellaneous Programs
-@section A Grab Bag of @command{awk} Programs
+The program relies on @command{gawk}'s ability to have @code{RS} be a regexp,
+as well as on the setting of @code{RT} to the actual text that terminates the
+record (@pxref{Records}).
-This @value{SECTION} is a large ``grab bag'' of miscellaneous programs.
-We hope you find them both interesting and enjoyable.
+The idea is to have @code{RS} be the pattern to look for. @command{gawk}
+automatically sets @code{$0} to the text between matches of the pattern.
+This is text that we want to keep, unmodified. Then, by setting @code{ORS}
+to the replacement text, a simple @code{print} statement outputs the
+text we want to keep, followed by the replacement text.
-@menu
-* Dupword Program:: Finding duplicated words in a document.
-* Alarm Program:: An alarm clock.
-* Translate Program:: A program similar to the @command{tr} utility.
-* Labels Program:: Printing mailing labels.
-* Word Sorting:: A program to produce a word usage count.
-* History Sorting:: Eliminating duplicate entries from a history
- file.
-* Extract Program:: Pulling out programs from Texinfo source
- files.
-* Simple Sed:: A Simple Stream Editor.
-* Igawk Program:: A wrapper for @command{awk} that includes
- files.
-* Anagram Program:: Finding anagrams from a dictionary.
-* Signature Program:: People do amazing things with too much time on
- their hands.
-@end menu
+There is one wrinkle to this scheme, which is what to do if the last record
+doesn't end with text that matches @code{RS}. Using a @code{print}
+statement unconditionally prints the replacement text, which is not correct.
+However, if the file did not end in text that matches @code{RS}, @code{RT}
+is set to the null string. In this case, we can print @code{$0} using
+@code{printf}
+(@pxref{Printf}).
-@node Dupword Program
-@subsection Finding Duplicated Words in a Document
+The @code{BEGIN} rule handles the setup, checking for the right number
+of arguments and calling @code{usage()} if there is a problem. Then it sets
+@code{RS} and @code{ORS} from the command-line arguments and sets
+@code{ARGV[1]} and @code{ARGV[2]} to the null string, so that they are
+not treated as @value{FN}s
+(@pxref{ARGC and ARGV}).
address@hidden words, address@hidden searching for
address@hidden searching, for words
address@hidden address@hidden searching
-A common error when writing large amounts of prose is to accidentally
-duplicate words. Typically you will see this in text as something like ``the
-the program does the address@hidden'' When the text is online, often
-the duplicated words occur at the end of one line and the
address@hidden
-the
address@hidden iftex
-beginning of
-another, making them very difficult to spot.
address@hidden as here!
+The @code{usage()} function prints an error message and exits.
+Finally, the single rule handles the printing scheme outlined above,
+using @code{print} or @code{printf} as appropriate, depending upon the
+value of @code{RT}.
-This program, @file{dupword.awk}, scans through a file one line at a time
-and looks for adjacent occurrences of the same word. It also saves the last
-word on a line (in the variable @code{prev}) for comparison with the first
-word on the next line.
address@hidden
+Exercise: compare the performance of this version with the more
+straightforward:
address@hidden Texinfo
-The first two statements make sure that the line is all lowercase,
-so that, for example, ``The'' and ``the'' compare equal to each other.
-The next statement replaces nonalphanumeric and nonwhitespace characters
-with spaces, so that punctuation does not affect the comparison either.
-The characters are replaced with spaces so that formatting controls
-don't create nonsense words (e.g., the Texinfo @samp{@@address@hidden@}}
-becomes @samp{codeNF} if punctuation is simply deleted). The record is
-then resplit into fields, yielding just the actual words on the line,
-and ensuring that there are no empty fields.
+BEGIN {
+ pat = ARGV[1]
+ repl = ARGV[2]
+ ARGV[1] = ARGV[2] = ""
+}
-If there are no fields left after removing all the punctuation, the
-current record is skipped. Otherwise, the program loops through each
-word, comparing it to the previous one:
+{ gsub(pat, repl); print }
address@hidden @code{dupword.awk} program
address@hidden
address@hidden file eg/prog/dupword.awk
-# dupword.awk --- find duplicate words in text
address@hidden endfile
address@hidden
address@hidden file eg/prog/dupword.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# December 1991
-# Revised October 2000
+Exercise: what are the advantages and disadvantages of this version versus sed?
+ Advantage: egrep regexps
+ speed (?)
+ Disadvantage: no & in replacement text
address@hidden endfile
+Others?
@end ignore
address@hidden file eg/prog/dupword.awk
address@hidden
- $0 = tolower($0)
- gsub(/[^[:alnum:][:blank:]]/, " ");
- $0 = $0 # re-split
- if (NF == 0)
- next
- if ($1 == prev)
- printf("%s:%d: duplicate %s\n",
- FILENAME, FNR, $1)
- for (i = 2; i <= NF; i++)
- if ($i == $(i-1))
- printf("%s:%d: duplicate %s\n",
- FILENAME, FNR, $i)
- prev = $NF
+
address@hidden Igawk Program
address@hidden An Easy Way to Use Library Functions
+
address@hidden STARTOFRANGE libfex
address@hidden libraries of @command{awk} functions, example program for using
address@hidden STARTOFRANGE flibex
address@hidden functions, library, example program for using
+In @ref{Include Files}, we saw how @command{gawk} provides a built-in
+file-inclusion capability. However, this is a @command{gawk} extension.
+This @value{SECTION} provides the motivation for making file inclusion
+available for standard @command{awk}, and shows how to do it using a
+combination of shell and @command{awk} programming.
+
+Using library functions in @command{awk} can be very beneficial. It
+encourages code reuse and the writing of general functions. Programs are
+smaller and therefore clearer.
+However, using library functions is only easy when writing @command{awk}
+programs; it is painful when running them, requiring multiple @option{-f}
+options. If @command{gawk} is unavailable, then so too is the @env{AWKPATH}
+environment variable and the ability to put @command{awk} functions into a
+library directory (@pxref{Options}).
+It would be nice to be able to write programs in the following manner:
+
address@hidden
+# library functions
+@@include getopt.awk
+@@include join.awk
address@hidden
+
+# main program
+BEGIN @{
+ while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
+ @dots{}
+ @dots{}
@}
address@hidden endfile
@end example
address@hidden Alarm Program
address@hidden An Alarm Clock Program
address@hidden insomnia, cure for
address@hidden Robbins, Arnold
address@hidden
address@hidden cures insomnia like a ringing alarm address@hidden
-Arnold Robbins
address@hidden quotation
+The following program, @file{igawk.sh}, provides this service.
+It simulates @command{gawk}'s searching of the @env{AWKPATH} variable
+and also allows @dfn{nested} includes; i.e., a file that is included
+with @samp{@@include} can contain further @samp{@@include} statements.
+@command{igawk} makes an effort to include files only once, so that nested
+includes don't accidentally include a library function twice.
+
+@command{igawk} should behave just like @command{gawk} externally. This
+means it should accept all of @command{gawk}'s command-line arguments,
+including the ability to have multiple source files specified via
+@option{-f}, and the ability to mix command-line and library source files.
+
+The program is written using the POSIX Shell (@command{sh}) command
+language.@footnote{Fully explaining the @command{sh} language is beyond
+the scope of this book. We provide some minimal explanations, but see
+a good shell programming book if you wish to understand things in more
+depth.} It works as follows:
+
address@hidden
address@hidden
+Loop through the arguments, saving anything that doesn't represent
+@command{awk} source code for later, when the expanded program is run.
address@hidden STARTOFRANGE tialarm
address@hidden time, alarm clock example program
address@hidden STARTOFRANGE alaex
address@hidden alarm clock example program
-The following program is a simple ``alarm clock'' program.
-You give it a time of day and an optional message. At the specified time,
-it prints the message on the standard output. In addition, you can give it
-the number of times to repeat the message as well as a delay between
-repetitions.
address@hidden
+For any arguments that do represent @command{awk} text, put the arguments into
+a shell variable that will be expanded. There are two cases:
-This program uses the @code{getlocaltime()} function from
address@hidden Function}.
address@hidden a
address@hidden
+Literal text, provided with @option{--source} or @option{--source=}. This
+text is just appended directly.
-All the work is done in the @code{BEGIN} rule. The first part is argument
-checking and setting of defaults: the delay, the count, and the message to
-print. If the user supplied a message without the ASCII BEL
-character (known as the ``alert'' character, @code{"\a"}), then it is added to
-the message. (On many systems, printing the ASCII BEL generates an
-audible alert. Thus when the alarm goes off, the system calls attention
-to itself in case the user is not looking at the computer.)
-Just for a change, this program uses a @code{switch} statement
-(@pxref{Switch Statement}), but the processing could be done with a series of
address@hidden@code{else} statements instead.
-Here is the program:
address@hidden
+Source @value{FN}s, provided with @option{-f}. We use a neat trick and append
+@samp{@@include @var{filename}} to the shell variable's contents. Since the file-inclusion
+program works the way @command{gawk} does, this gets the text
+of the file included into the program at the correct point.
address@hidden enumerate
address@hidden @code{alarm.awk} program
address@hidden
address@hidden file eg/prog/alarm.awk
-# alarm.awk --- set an alarm
-#
-# Requires getlocaltime() library function
address@hidden endfile
address@hidden
address@hidden file eg/prog/alarm.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# May 1993
-# Revised December 2010
address@hidden
+Run an @command{awk} program (naturally) over the shell variable's contents to expand
+@samp{@@include} statements. The expanded program is placed in a second
+shell variable.
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/alarm.awk
-# usage: alarm time [ "message" [ count [ delay ] ] ]
address@hidden
+Run the expanded program with @command{gawk} and any other original command-line
+arguments that the user supplied (such as the data @value{FN}s).
address@hidden enumerate
-BEGIN \
address@hidden
- # Initial argument sanity checking
- usage1 = "usage: alarm time ['message' [count [delay]]]"
- usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])
+This program uses shell variables extensively: for storing command-line arguments,
+the text of the @command{awk} program that will expand the user's program, the
+user's original program, and the expanded program. Doing so removes some
+potential problems that might arise were we to use temporary files instead,
+at the cost of making the script somewhat more complicated.
- if (ARGC < 2) @{
- print usage1 > "/dev/stderr"
- print usage2 > "/dev/stderr"
- exit 1
- @}
- switch (ARGC) @{
- case 5:
- delay = ARGV[4] + 0
- # fall through
- case 4:
- count = ARGV[3] + 0
- # fall through
- case 3:
- message = ARGV[2]
- break
- default:
- if (ARGV[1] !~ /[[:digit:]]?[[:digit:]]:[[:digit:address@hidden@}/) @{
- print usage1 > "/dev/stderr"
- print usage2 > "/dev/stderr"
- exit 1
- @}
- break
- @}
+The initial part of the program turns on shell tracing if the first
+argument is @samp{debug}.
- # set defaults for once we reach the desired time
- if (delay == 0)
- delay = 180 # 3 minutes
address@hidden
- if (count == 0)
- count = 5
address@hidden group
- if (message == "")
- message = sprintf("\aIt is now %s!\a", ARGV[1])
- else if (index(message, "\a") == 0)
- message = "\a" message "\a"
address@hidden endfile
address@hidden example
+The next part loops through all the command-line arguments.
+There are several cases of interest:
-The next @value{SECTION} of code turns the alarm time into hours and minutes,
-converts it (if necessary) to a 24-hour clock, and then turns that
-time into a count of the seconds since midnight. Next it turns the current
-time into a count of seconds since midnight. The difference between the two
-is how long to wait before setting off the alarm:
address@hidden @code
address@hidden --
+This ends the arguments to @command{igawk}. Anything else should be passed on
+to the user's @command{awk} program without being evaluated.
address@hidden
address@hidden file eg/prog/alarm.awk
- # split up alarm time
- split(ARGV[1], atime, ":")
- hour = atime[1] + 0 # force numeric
- minute = atime[2] + 0 # force numeric
address@hidden -W
+This indicates that the next option is specific to @command{gawk}. To make
+argument processing easier, the @option{-W} is prepended to the
+remaining arguments and the loop continues. (This is an @command{sh}
+programming trick. Don't worry about it if you are not familiar with
+@command{sh}.)
- # get current broken down time
- getlocaltime(now)
address@hidden address@hidden,} -F
+These are saved and passed on to @command{gawk}.
- # if time given is 12-hour hours and it's after that
- # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
- # then add 12 to real hour
- if (hour < 12 && now["hour"] > hour)
- hour += 12
address@hidden address@hidden,} address@hidden,} address@hidden,} -Wfile=
+The @value{FN} is appended to the shell variable @code{program} with an
+@samp{@@include} statement.
+The @command{expr} utility is used to remove the leading option part of the
+argument (e.g., @samp{--file=}).
+(Typical @command{sh} usage would be to use the @command{echo} and @command{sed}
+utilities to do this work. Unfortunately, some versions of @command{echo} evaluate
+escape sequences in their arguments, possibly mangling the program text.
+Using @command{expr} avoids this problem.)
- # set target time in seconds since midnight
- target = (hour * 60 * 60) + (minute * 60)
address@hidden address@hidden,} address@hidden,} -Wsource=
+The source text is appended to @code{program}.
- # get current time in seconds since midnight
- current = (now["hour"] * 60 * 60) + \
- (now["minute"] * 60) + now["second"]
address@hidden address@hidden,} -Wversion
+@command{igawk} prints its version number, runs @samp{gawk --version}
+to get the @command{gawk} version information, and then exits.
address@hidden table
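The prefix stripping described for the -f and --file= cases can be tried in isolation. The following is a minimal sketch, not part of igawk.sh; the helper name strip_prefix is hypothetical. It relies only on the standard behavior of expr's ":" operator, which anchors the pattern at the start of the string and prints whatever the \(...\) group matched.

```shell
# strip_prefix ARG REGEX: print the part of ARG captured by the \(...\)
# group in REGEX. The match is anchored at the start of ARG.
# (strip_prefix is an illustrative helper, not a function from igawk.sh.)
strip_prefix() {
    expr "$1" : "$2"
}

strip_prefix '-fgetopt.awk' '-f\(.*\)'          # prints: getopt.awk
strip_prefix '--file=join.awk' '-.file=\(.*\)'  # prints: join.awk
```

Because the text never passes through echo, backslash sequences in the argument are not at risk of being interpreted along the way.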
- # how long to sleep for
- naptime = target - current
- if (naptime <= 0) @{
- print "time is in the past!" > "/dev/stderr"
- exit 1
- @}
address@hidden endfile
address@hidden example
+If none of the @option{-f}, @option{--file}, @option{-Wfile}, @option{--source},
+or @option{-Wsource} arguments are supplied, then the first nonoption argument
+should be the @command{awk} program. If there are no command-line
+arguments left, @command{igawk} prints an error message and exits.
+Otherwise, the first argument is appended to @code{program}.
+In any case, after the arguments have been processed,
+@code{program} contains the complete text of the original @command{awk}
+program.
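The -W handling described earlier (gluing a bare -W onto the argument that follows it) can be sketched on its own. This is an illustration under stated assumptions, not the code from igawk.sh: the function name merge_w and the variable first are hypothetical, and the function prints the resulting argument list instead of continuing an option loop.

```shell
# merge_w: if the first argument is a bare -W, join it with its operand,
# so that "-W version" is seen as the single word "-Wversion".
# Prints the resulting arguments, one per line.
# (merge_w and first are illustrative names, not taken from igawk.sh.)
merge_w() {
    if [ "$1" = -W ]
    then
        shift
        # ${1?message} makes the shell report "message" and stop
        # if no operand is left (that is, if $1 is unset).
        first=${1?missing operand}
        shift
        set -- "-W$first" "$@"
    fi
    printf '%s\n' "$@"
}

merge_w -W version data.txt
```

After the merge, a single set of patterns such as -[W-]version can cover both the -W and the long-option spellings.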
address@hidden @command{sleep} utility
-Finally, the program uses the @code{system()} function
-(@pxref{I/O Functions})
-to call the @command{sleep} utility. The @command{sleep} utility simply pauses
-for the given number of seconds. If the exit status is not zero,
-the program assumes that @command{sleep} was interrupted and exits. If
address@hidden exited with an OK status (zero), then the program prints the
-message in a loop, again using @command{sleep} to delay for however many
-seconds are necessary:
+The program is as follows:
address@hidden @code{igawk.sh} program
@example
address@hidden file eg/prog/alarm.awk
- # zzzzzz..... go away if interrupted
- if (system(sprintf("sleep %d", naptime)) != 0)
- exit 1
address@hidden file eg/prog/igawk.sh
+#! /bin/sh
+# igawk --- like gawk but do @@include processing
address@hidden endfile
address@hidden
address@hidden file eg/prog/igawk.sh
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# July 1993
+# December 2010, minor edits
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/igawk.sh
- # time to notify!
- command = sprintf("sleep %d", delay)
- for (i = 1; i <= count; i++) @{
- print message
- # if sleep command interrupted, go away
- if (system(command) != 0)
- break
- @}
+if [ "$1" = debug ]
+then
+ set -x
+ shift
+fi
- exit 0
address@hidden
address@hidden endfile
address@hidden example
address@hidden ENDOFRANGE tialarm
address@hidden ENDOFRANGE alaex
+# A literal newline, so that program text is formatted correctly
+n='
+'
address@hidden Translate Program
address@hidden Transliterating Characters
+# Initialize variables to empty
+program=
+opts=
+
+while [ $# -ne 0 ] # loop over arguments
+do
+ case $1 in
+ --) shift
+ break ;;
+
+ -W) shift
+ # The $@{x?'message here'@} construct prints a
+ # diagnostic if x is unset
+ set -- -W"$@{@@?'missing operand'@}"
+ continue ;;
address@hidden STARTOFRANGE chtra
address@hidden characters, transliterating
address@hidden @command{tr} utility
-The system @command{tr} utility transliterates characters. For example, it is
-often used to map uppercase letters into lowercase for further processing:
+ -[vF]) opts="$opts $1 '$@{2?'missing operand'@}'"
+ shift ;;
address@hidden
address@hidden data} | tr 'A-Z' 'a-z' | @var{process data} @dots{}
address@hidden example
+ -[vF]*) opts="$opts '$1'" ;;
address@hidden requires two lists of address@hidden some older
-systems,
address@hidden ORA
-including Solaris,
address@hidden ifset
address@hidden may require that the lists be written as
-range expressions enclosed in square brackets (@samp{[a-z]}) and quoted,
-to prevent the shell from attempting a @value{FN} expansion. This is
-not a feature.} When processing the input, the first character in the
-first list is replaced with the first character in the second list,
-the second character in the first list is replaced with the second
-character in the second list, and so on. If there are more characters
-in the ``from'' list than in the ``to'' list, the last character of the
-``to'' list is used for the remaining characters in the ``from'' list.
+ -f) program="$program$n@@include $@{2?'missing operand'@}"
+ shift ;;
-Some time ago,
address@hidden early or mid-1989!
-a user proposed that a transliteration function should
-be added to @command{gawk}.
address@hidden Wishing to avoid gratuitous new features,
address@hidden at least theoretically
-The following program was written to
-prove that character transliteration could be done with a user-level
-function. This program is not as complete as the system @command{tr} utility
-but it does most of the job.
+ -f*) f=$(expr "$1" : '-f\(.*\)')
+ program="$program$n@@include $f" ;;
-The @command{translate} program demonstrates one of the few weaknesses
-of standard @command{awk}: dealing with individual characters is very
-painful, requiring repeated use of the @code{substr()}, @code{index()},
-and @code{gsub()} built-in functions
-(@pxref{String Functions})address@hidden
-program was written before @command{gawk} acquired the ability to
-split each character in a string into separate array elements.}
-@c Exercise: How might you use this new feature to simplify the program?
-There are two functions. The first, @code{stranslate()}, takes three
-arguments:
+ -[W-]file=*)
+ f=$(expr "$1" : '-.file=\(.*\)')
+ program="$program$n@@include $f" ;;
address@hidden @code
address@hidden from
-A list of characters from which to translate.
+ -[W-]file)
+ program="$program$n@@include $@{2?'missing operand'@}"
+ shift ;;
address@hidden to
-A list of characters to which to translate.
+ -[W-]source=*)
+ t=$(expr "$1" : '-.source=\(.*\)')
+ program="$program$n$t" ;;
address@hidden target
-The string on which to do the translation.
address@hidden table
+ -[W-]source)
+ program="$program$n$@{2?'missing operand'@}"
+ shift ;;
-Associative arrays make the translation part fairly easy. @code{t_ar} holds
-the ``to'' characters, indexed by the ``from'' characters. Then a simple
-loop goes through @code{from}, one character at a time. For each character
-in @code{from}, if the character appears in @code{target},
-it is replaced with the corresponding @code{to} character.
+ -[W-]version)
+ echo igawk: version 3.0 1>&2
+ gawk --version
+ exit 0 ;;
-The @code{translate()} function simply calls @code{stranslate()} using @code{$0}
-as the target. The main program sets two global variables, @code{FROM} and
address@hidden, from the command line, and then changes @code{ARGV} so that
address@hidden reads from the standard input.
+ -[W-]*) opts="$opts '$1'" ;;
-Finally, the processing rule simply calls @code{translate()} for each record:
+ *) break ;;
+ esac
+ shift
+done
address@hidden @code{translate.awk} program
address@hidden
address@hidden file eg/prog/translate.awk
-# translate.awk --- do tr-like stuff
address@hidden endfile
address@hidden
address@hidden file eg/prog/translate.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# August 1989
-# February 2009 - bug fix
+if [ -z "$program" ]
+then
+ program=$@{1?'missing program'@}
+ shift
+fi
+# At this point, `program' has the program.
@c endfile
address@hidden example
+
+The @command{awk} program to process @samp{@@include} directives
+is stored in the shell variable @code{expand_prog}. Doing this keeps
+the shell script readable. The @command{awk} program
+reads through the user's program, one line at a time, using @code{getline}
+(@pxref{Getline}). The input
+@value{FN}s and @samp{@@include} statements are managed using a stack.
+As each @samp{@@include} is encountered, the current @value{FN} is
+``pushed'' onto the stack and the file named in the @samp{@@include}
+directive becomes the current @value{FN}. As each file is finished,
+the stack is ``popped,'' and the previous input file becomes the current
+input file again. The process is started by making the original file
+the first one on the stack.
+
+The @code{pathto()} function does the work of finding the full path to
+a file. It simulates @command{gawk}'s behavior when searching the
+@env{AWKPATH} environment variable
+(@pxref{AWKPATH Variable}).
+If a @value{FN} has a @samp{/} in it, no path search is done.
+Similarly, if the @value{FN} is @code{"-"}, then that string is
+used as-is. Otherwise,
+the @value{FN} is concatenated with the name of each directory in
+the path, and an attempt is made to open the generated @value{FN}.
+The only way to test if a file can be read in @command{awk} is to go
+ahead and try to read it with @code{getline}; this is what @code{pathto()}
+does.@footnote{On some very old versions of @command{awk}, the test
+@samp{getline junk < t} can loop forever if the file exists but is empty.
+Caveat emptor.} If the file can be read, it is closed and the @value{FN}
+is returned:
+
address@hidden
+An alternative way to test for the file's existence would be to call
address@hidden("test -r " t)}, which uses the @command{test} utility to
+see if the file exists and is readable. The disadvantage to this method
+is that it requires creating an extra process and can thus be slightly
+slower.
@end ignore
address@hidden file eg/prog/translate.awk
-# Bugs: does not handle things like: tr A-Z a-z, it has
-# to be spelled out. However, if `to' is shorter than `from',
-# the last character in `to' is used for the rest of `from'.
-function stranslate(from, to, target, lf, lt, ltarget, t_ar, i, c,
- result)
address@hidden
address@hidden file eg/prog/igawk.sh
+expand_prog='
+
+function pathto(file, i, t, junk)
@{
- lf = length(from)
- lt = length(to)
- ltarget = length(target)
- for (i = 1; i <= lt; i++)
- t_ar[substr(from, i, 1)] = substr(to, i, 1)
- if (lt < lf)
- for (; i <= lf; i++)
- t_ar[substr(from, i, 1)] = substr(to, lt, 1)
- for (i = 1; i <= ltarget; i++) @{
- c = substr(target, i, 1)
- if (c in t_ar)
- c = t_ar[c]
- result = result c
+ if (index(file, "/") != 0)
+ return file
+
+ if (file == "-")
+ return file
+
+ for (i = 1; i <= ndirs; i++) @{
+ t = (pathlist[i] "/" file)
address@hidden
+ if ((getline junk < t) > 0) @{
+ # found it
+ close(t)
+ return t
+ @}
address@hidden group
@}
- return result
+ return ""
@}
address@hidden endfile
address@hidden example
-function translate(from, to)
address@hidden
- return $0 = stranslate(from, to, $0)
address@hidden
+The main program is contained inside one @code{BEGIN} rule. The first thing it
+does is set up the @code{pathlist} array that @code{pathto()} uses. After
+splitting the path on @samp{:}, null elements are replaced with @code{"."},
+which represents the current directory:
-# main program
address@hidden
address@hidden file eg/prog/igawk.sh
BEGIN @{
address@hidden
- if (ARGC < 3) @{
- print "usage: translate from to" > "/dev/stderr"
- exit
+ path = ENVIRON["AWKPATH"]
+ ndirs = split(path, pathlist, ":")
+ for (i = 1; i <= ndirs; i++) @{
+ if (pathlist[i] == "")
+ pathlist[i] = "."
@}
address@hidden endfile
address@hidden example
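The path setup above can be exercised separately. The sketch below is an assumption-laden illustration: the wrapper name list_dirs is hypothetical, and the path is passed in with -v instead of being read from ENVIRON["AWKPATH"] so that the example does not depend on the caller's environment. The split() call and the replacement of null elements follow the rule described in the text.

```shell
# list_dirs PATHSTRING: print one search directory per line, replacing
# empty path elements with "." (the current directory).
# (list_dirs is an illustrative wrapper; igawk.sh reads AWKPATH from
# ENVIRON instead of taking the path as an argument.)
list_dirs() {
    awk -v path="$1" 'BEGIN {
        ndirs = split(path, pathlist, ":")
        for (i = 1; i <= ndirs; i++) {
            if (pathlist[i] == "")
                pathlist[i] = "."
            print pathlist[i]
        }
    }'
}

list_dirs '/usr/share/awk::/opt/awklib'
```

With the sample argument, the empty element between the two colons comes out as a line containing only a period.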
+
+The stack is initialized with @code{ARGV[1]}, which will be @file{/dev/stdin}.
+The main loop comes next. Input lines are read in succession. Lines that
+do not start with @samp{@@include} are printed verbatim.
+If the line does start with @samp{@@include}, the @value{FN} is in @code{$2}.
+@code{pathto()} is called to generate the full path. If it cannot, then the program
+prints an error message and continues.
+
+The next thing to check is if the file is included already. The
+@code{processed} array is indexed by the full @value{FN} of each included
+file and it tracks this information for us. If the file is
+seen again, a warning message is printed. Otherwise, the new @value{FN} is
+pushed onto the stack and processing continues.
+
+Finally, when @code{getline} encounters the end of the input file, the file
+is closed and the stack is popped. When @code{stackptr} is less than zero,
+the program is done:
+
address@hidden
address@hidden file eg/prog/igawk.sh
+ stackptr = 0
+ input[stackptr] = ARGV[1] # ARGV[1] is first file
+
+ for (; stackptr >= 0; stackptr--) @{
+ while ((getline < input[stackptr]) > 0) @{
+ if (tolower($1) != "@@include") @{
+ print
+ continue
+ @}
+ fpath = pathto($2)
address@hidden
+ if (fpath == "") @{
+ printf("igawk:%s:%d: cannot find %s\n",
+ input[stackptr], FNR, $2) > "/dev/stderr"
+ continue
+ @}
@end group
- FROM = ARGV[1]
- TO = ARGV[2]
- ARGC = 2
- ARGV[1] = "-"
address@hidden
+ if (! (fpath in processed)) @{
+ processed[fpath] = input[stackptr]
+ input[++stackptr] = fpath # push onto stack
+ @} else
+ print $2, "included in", input[stackptr],
+ "already included in",
+ processed[fpath] > "/dev/stderr"
+ @}
+ close(input[stackptr])
+ @}
+@}' # close quote ends `expand_prog' variable
address@hidden
- translate(FROM, TO)
- print
address@hidden
+processed_program=$(gawk -- "$expand_prog" /dev/stdin << EOF
+$program
+EOF
+)
@c endfile
@end example
-While it is possible to do character transliteration in a user-level
-function, it is not necessarily efficient, and we (the @command{gawk}
-authors) started to consider adding a built-in function. However,
-shortly after writing this program, we learned that the System V Release 4
address@hidden had added the @code{toupper()} and @code{tolower()} functions
-(@pxref{String Functions}).
-These functions handle the vast majority of the
-cases where character transliteration is necessary, and so we chose to
-simply add those functions to @command{gawk} as well and then leave well
-enough alone.
+The shell construct @samp{@var{command} << @var{marker}} is called a @dfn{here document}.
+Everything in the shell script up to the @var{marker} is fed to @var{command} as input.
+The shell processes the contents of the here document for variable and command substitution
+(and possibly other things as well, depending upon the shell).
-An obvious improvement to this program would be to set up the
address@hidden array only once, in a @code{BEGIN} rule. However, this
-assumes that the ``from'' and ``to'' lists
-will never change throughout the lifetime of the program.
address@hidden ENDOFRANGE chtra
+The shell construct @samp{$(@dots{})} is called @dfn{command substitution}.
+The output of the command inside the parentheses is substituted
+into the command line.
+Because the result is used in a variable assignment,
+it is saved as a single string, even if the results contain whitespace.
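The combination of a here document and command substitution can be seen in a smaller setting. This is a minimal sketch with made-up data; the variable names text and quoted are hypothetical, and sed stands in for the gawk invocation used by igawk.sh.

```shell
# A two-line value, standing in for the user's program text.
text='line one
  line two'

# The here document expands $text and feeds it to sed on standard input.
# $(...) captures sed's output, and the assignment keeps it as one string,
# embedded newline and leading blanks included.
quoted=$(sed 's/^/> /' << EOF
$text
EOF
)

printf '%s\n' "$quoted"
```

Command substitution removes trailing newlines from the captured output, which is harmless for program text but worth knowing when exact output matters.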
address@hidden Labels Program
address@hidden Printing Mailing Labels
+The expanded program is saved in the variable @code{processed_program}.
+It's done in these steps:
address@hidden STARTOFRANGE prml
address@hidden printing, mailing labels
address@hidden STARTOFRANGE mlprint
address@hidden mailing address@hidden printing
-Here is a ``real world''@footnote{``Real world'' is defined as
-``a program actually used to get something done.''}
-program. This
-script reads lists of names and
-addresses and generates mailing labels. Each page of labels has 20 labels
-on it, two across and 10 down. The addresses are guaranteed to be no more
-than five lines of data. Each address is separated from the next by a blank
-line.
address@hidden
address@hidden
+Run @command{gawk} with the @samp{@@include}-processing program (the
+value of the @code{expand_prog} shell variable) on standard input.
-The basic idea is to read 20 labels worth of data. Each line of each label
-is stored in the @code{line} array. The single rule takes care of filling
-the @code{line} array and printing the page when 20 labels have been read.
address@hidden
+Standard input is the contents of the user's program, from the shell variable @code{program}.
+Its contents are fed to @command{gawk} via a here document.
-The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that
address@hidden splits records at blank lines
-(@pxref{Records}).
-It sets @code{MAXLINES} to 100, since 100 is the maximum number
-of lines on the page (20 * 5 = 100).
address@hidden
+The results of this processing are saved in the shell variable @code{processed_program} by using command substitution.
address@hidden enumerate
-Most of the work is done in the @code{printpage()} function.
-The label lines are stored sequentially in the @code{line} array. But they
-have to print horizontally; @code{line[1]} next to @code{line[6]},
address@hidden next to @code{line[7]}, and so on. Two loops are used to
-accomplish this. The outer loop, controlled by @code{i}, steps through
-every 10 lines of data; this is each row of labels. The inner loop,
-controlled by @code{j}, goes through the lines within the row.
-As @code{j} goes from 0 to 4, @samp{i+j} is the @code{j}-th line in
-the row, and @samp{i+j+5} is the entry next to it. The output ends up
-looking something like this:
+The last step is to call @command{gawk} with the expanded program,
+along with the original
+options and command-line arguments that the user supplied.
+
address@hidden this causes more problems than it solves, so leave it out.
address@hidden
+The special file @file{/dev/null} is passed as a @value{DF} to @command{gawk}
+to handle an interesting case. Suppose that the user's program only has
+a @code{BEGIN} rule and there are no @value{DF}s to read.
+The program should exit without reading any @value{DF}s.
+However, suppose that an included library file defines an @code{END}
+rule of its own. In this case, @command{gawk} will hang, reading standard
+input. In order to avoid this, @file{/dev/null} is explicitly added to the
+command-line. Reading from @file{/dev/null} always returns an immediate
+end of file indication.
+
address@hidden Hmm. Add /dev/null if $# is 0? Still messes up ARGV. Sigh.
address@hidden ignore
@example
-line 1 line 6
-line 2 line 7
-line 3 line 8
-line 4 line 9
-line 5 line 10
address@hidden
address@hidden file eg/prog/igawk.sh
+eval gawk $opts -- '"$processed_program"' '"$@@"'
address@hidden endfile
@end example
address@hidden
-The @code{printf} format string @samp{%-41s} left-aligns
-the data and prints it within a fixed-width field.
+The @command{eval} command is a shell construct that reruns the shell's parsing
+process. This keeps things properly quoted.
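The effect of eval on quoting can be shown with a small experiment. This is a sketch only; count_args is a hypothetical helper and the option string is made up. The opts variable holds quote characters as ordinary text, in the same way that igawk.sh builds up its opts variable.

```shell
# count_args prints how many arguments it received.
# (count_args is an illustrative helper, not part of igawk.sh.)
count_args() { echo $#; }

# The single quotes here are part of the variable's value.
opts="-v 'msg=hello world'"

# Without eval, the stored quotes are not special: the value is split
# on whitespace into -v, 'msg=hello and world'.
count_args $opts          # prints: 3

# With eval, the shell parses the text a second time, so the stored
# quotes group words again: -v and msg=hello world.
eval count_args $opts     # prints: 2
```

This is also why the igawk.sh command line wraps "$processed_program" and "$@" in an extra layer of single quotes: the inner double quotes must survive until the second parse.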
-As a final note, an extra blank line is printed at lines 21 and 61, to keep
-the output lined up on the labels. This is dependent on the particular
-brand of labels in use when the program was written. You will also note
-that there are two blank lines at the top and two blank lines at the bottom.
+This version of @command{igawk} represents my fifth version of this program.
+There are four key simplifications that make the program work better:
-The @code{END} rule arranges to flush the final page of labels; there may
-not have been an even multiple of 20 labels in the data:
address@hidden @bullet
address@hidden
+Using @samp{@@include} even for the files named with @option{-f} makes building
+the initial collected @command{awk} program much simpler; all the
address@hidden@@include} processing can be done once.
address@hidden @code{labels.awk} program
address@hidden
address@hidden file eg/prog/labels.awk
-# labels.awk --- print mailing labels
address@hidden endfile
address@hidden
+Not trying to save the line read with @code{getline}
+in the @code{pathto()} function when testing for the
+file's accessibility for use with the main program simplifies things
+considerably.
address@hidden what problem does this engender though - exercise
address@hidden answer, reading from "-" or /dev/stdin
+
address@hidden
+Using a @code{getline} loop in the @code{BEGIN} rule does it all in one
+place. It is not necessary to call out to a separate loop for processing
+nested @samp{@@include} statements.
+
address@hidden
+Instead of saving the expanded program in a temporary file, putting it in a shell variable
+avoids some potential security problems.
+This has the disadvantage that the script relies upon more features
+of the @command{sh} language, making it harder to follow for those who
+aren't familiar with @command{sh}.
address@hidden itemize
+
+Also, this program illustrates that it is often worthwhile to combine
+@command{sh} and @command{awk} programming. You can usually
+accomplish quite a lot, without having to resort to low-level programming
+in C or C++, and it is frequently easier to do certain kinds of string
+and argument manipulation using the shell than it is in @command{awk}.
+
+Finally, @command{igawk} shows that it is not always necessary to add new
+features to a program; they can often be layered on top.
@ignore
address@hidden file eg/prog/labels.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# June 1992
-# December 2010, minor edits
address@hidden endfile
+With @command{igawk},
+there is no real reason to build @samp{@@include} processing into
+@command{gawk} itself.
@end ignore
address@hidden file eg/prog/labels.awk
-# Each label is 5 lines of data that may have blank lines.
-# The label sheets have 2 blank lines at the top and 2 at
-# the bottom.
address@hidden search paths
address@hidden search paths, for source files
address@hidden source address@hidden search path for
address@hidden files, address@hidden search path for
address@hidden directories, searching
+As an additional example of this, consider the idea of having two
+files in a directory in the search path:
-BEGIN @{ RS = "" ; MAXLINES = 100 @}
address@hidden @file
address@hidden default.awk
+This file contains a set of default library functions, such
+as @code{getopt()} and @code{assert()}.
-function printpage( i, j)
address@hidden
- if (Nlines <= 0)
- return
address@hidden site.awk
+This file contains library functions that are specific to a site or
+installation; i.e., locally developed functions.
+Having a separate file allows @file{default.awk} to change with
+new @command{gawk} releases, without requiring the system administrator to
+update it each time by adding the local functions.
address@hidden table
- printf "\n\n" # header
+One user
address@hidden Karl Berry, address@hidden, 10/95
+suggested that @command{gawk} be modified to automatically read these files
+upon startup. Instead, it would be very simple to modify @command{igawk}
+to do this. Since @command{igawk} can process nested @samp{@@include}
+directives, @file{default.awk} could simply contain @samp{@@include}
+statements for the desired library functions.
- for (i = 1; i <= Nlines; i += 10) @{
- if (i == 21 || i == 61)
- print ""
- for (j = 0; j < 5; j++) @{
- if (i + j > MAXLINES)
- break
- printf " %-41s %s\n", line[i+j], line[i+j+5]
- @}
- print ""
- @}
address@hidden Exercise: make this change
address@hidden ENDOFRANGE libfex
address@hidden ENDOFRANGE flibex
address@hidden ENDOFRANGE awkpex
- printf "\n\n" # footer
address@hidden Anagram Program
address@hidden Finding Anagrams From A Dictionary
+
+An interesting programming challenge is to
+search for @dfn{anagrams} in a
+word list (such as
+@file{/usr/share/dict/words} on many GNU/Linux systems).
+One word is an anagram of another if both words contain
+the same letters
+(for example, ``babbling'' and ``blabbing'').
+
+An elegant algorithm is presented in Column 2, Problem C of
+Jon Bentley's @cite{Programming Pearls}, second edition.
+The idea is to give words that are anagrams a common signature,
+sort all the words together by their signature, and then print them.
+Dr.@: Bentley observes that taking the letters in each word and
+sorting them produces that common signature.
+
+The following program uses arrays of arrays to bring together
+words with the same signature and array sorting to print the words
+in sorted order.
+
address@hidden @code{anagram.awk} program
address@hidden
address@hidden file eg/prog/anagram.awk
+# anagram.awk --- An implementation of the anagram finding algorithm
+# from Jon Bentley's "Programming Pearls", 2nd edition.
+# Addison Wesley, 2000, ISBN 0-201-65788-0.
+# Column 2, Problem C, section 2.8, pp 18-20.
address@hidden endfile
address@hidden
address@hidden file eg/prog/anagram.awk
+#
+# This program requires gawk 4.0 or newer.
+# Required gawk-specific features:
+# - True multidimensional arrays
+# - split() with "" as separator splits out individual characters
+# - asort() and asorti() functions
+#
+# See http://savannah.gnu.org/projects/gawk.
+#
+# Arnold Robbins
+# arnold@@skeeve.com
+# Public Domain
+# January, 2011
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/anagram.awk
- delete line
address@hidden
+/'s$/ @{ next @} # Skip possessives
address@hidden endfile
address@hidden example
-# main rule
address@hidden
- if (Count >= 20) @{
- printpage()
- Count = 0
- Nlines = 0
- @}
- n = split($0, a, "\n")
- for (i = 1; i <= n; i++)
- line[++Nlines] = a[i]
- for (; i <= 5; i++)
- line[++Nlines] = ""
- Count++
address@hidden
+The program starts with a header, and then a rule to skip
+possessives in the dictionary file. The next rule builds
+up the data structure. The first dimension of the array
+is indexed by the signature; the second dimension is the word
+itself:
-END \
address@hidden
address@hidden file eg/prog/anagram.awk
@{
- printpage()
+ key = word2key($1) # Build signature
+ data[key][$1] = $1 # Store word with signature
@}
@c endfile
@end example
address@hidden ENDOFRANGE prml
address@hidden ENDOFRANGE mlprint
-
address@hidden Word Sorting
address@hidden Generating Word-Usage Counts
-
address@hidden STARTOFRANGE worus
address@hidden words, usage address@hidden generating
-
-When working with large amounts of text, it can be interesting to know
-how often different words appear. For example, an author may overuse
-certain words, in which case she might wish to find synonyms to substitute
-for words that appear too often. This @value{SUBSECTION} develops a
-program for counting words and presenting the frequency information
-in a useful format.
-At first glance, a program like this would seem to do the job:
+The @code{word2key()} function creates the signature.
+It splits the word apart into individual letters,
+sorts the letters, and then joins them back together:
@example
-# Print list of word frequencies
address@hidden file eg/prog/anagram.awk
+# word2key --- split word apart into letters, sort, joining back together
+function word2key(word, a, i, n, result)
@{
- for (i = 1; i <= NF; i++)
- freq[$i]++
+ n = split(word, a, "")
+ asort(a)
+
+ for (i = 1; i <= n; i++)
+ result = result a[i]
+
+ return result
@}
address@hidden endfile
address@hidden example
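The signature idea can also be tried without @command{gawk} 4.0. What follows is a minimal sketch, not part of @file{anagram.awk}: the shell function name @code{word_signature} is hypothetical, the letters are split with @code{substr()} instead of @code{split()} with an empty separator, and a hand-written insertion sort stands in for @code{asort()}, so that only POSIX @command{awk} features are assumed.

```shell
# word_signature WORD: print the letters of WORD in sorted order.
# Hypothetical helper for illustration; it is not part of anagram.awk.
word_signature() {
    printf '%s\n' "$1" | awk '{
        n = length($0)

        # Split the word into single letters without gawk-specific split()
        for (i = 1; i <= n; i++)
            a[i] = substr($0, i, 1)

        # Insertion sort of the letters (stands in for asort())
        for (i = 2; i <= n; i++) {
            c = a[i]
            for (j = i - 1; j >= 1 && a[j] > c; j--)
                a[j + 1] = a[j]
            a[j + 1] = c
        }

        # Join the sorted letters back together
        sig = ""
        for (i = 1; i <= n; i++)
            sig = sig a[i]
        print sig
    }'
}

word_signature babbling
word_signature blabbing
```

Both calls print the same string, @samp{abbbgiln}, which is why the two words end up under the same first-level index of the @code{data} array.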
+
+Finally, the @code{END} rule traverses the array
+and prints out the anagram lists. It sends the output
+to the system @command{sort} command, since otherwise
+the anagrams would appear in arbitrary order:
address@hidden
address@hidden file eg/prog/anagram.awk
END @{
- for (word in freq)
- printf "%s\t%d\n", word, freq[word]
+ sort = "sort"
+ for (key in data) @{
+ # Sort words with same key
+ nwords = asorti(data[key], words)
+ if (nwords == 1)
+ continue
+
+ # And print. Minor glitch: trailing space at end of each line
+ for (j = 1; j <= nwords; j++)
+ printf("%s ", words[j]) | sort
+ print "" | sort
+ @}
+ close(sort)
@}
address@hidden endfile
@end example
-The program relies on @command{awk}'s default field splitting
-mechanism to break each line up into ``words,'' and uses an
-associative array named @code{freq}, indexed by each word, to count
-the number of times the word occurs. In the @code{END} rule,
-it prints the counts.
+Here is some partial output when the program is run:
-This program has several problems that prevent it from being
-useful on real text files:
address@hidden
+$ @kbd{gawk -f anagram.awk /usr/share/dict/words | grep '^b'}
address@hidden
+babbled blabbed
+babbler blabber brabble
+babblers blabbers brabbles
+babbling blabbing
+babbly blabby
+babel bable
+babels beslab
+babery yabber
address@hidden
address@hidden example
address@hidden @bullet
address@hidden
-The @command{awk} language considers upper- and lowercase characters to be
-distinct. Therefore, ``bartender'' and ``Bartender'' are not treated
-as the same word. This is undesirable, since in normal text, words
-are capitalized if they begin sentences, and a frequency analyzer should not
-be sensitive to capitalization.
address@hidden Signature Program
address@hidden And Now For Something Completely Different
address@hidden
-Words are detected using the @command{awk} convention that fields are
-separated just by whitespace. Other characters in the input (except
-newlines) don't have any special meaning to @command{awk}. This means that
-punctuation characters count as part of words.
+The following program was written by Davide Brini
address@hidden (@email{dave_br@@gmx.com})
+and is published on @uref{http://backreference.org/2011/02/03/obfuscated-awk/,
+his website}.
+It serves as his signature in the Usenet group @code{comp.lang.awk}.
+He supplies the following copyright terms:
address@hidden
-The output does not come out in any useful order. You're more likely to be
-interested in which words occur most frequently or in having an alphabetized
-table of how frequently each word occurs.
address@hidden itemize
address@hidden
+Copyright @copyright{} 2008 Davide Brini
address@hidden @command{sort} utility
-The first problem can be solved by using @code{tolower()} to remove case
-distinctions. The second problem can be solved by using @code{gsub()}
-to remove punctuation characters. Finally, we solve the third problem
-by using the system @command{sort} utility to process the output of the
address@hidden script. Here is the new version of the program:
+Copying and distribution of the code published in this page, with or without
+modification, are permitted in any medium without royalty provided the copyright
+notice and this notice are preserved.
address@hidden quotation
+
+Here is the program:
address@hidden @code{wordfreq.awk} program
@example
address@hidden file eg/prog/wordfreq.awk
-# wordfreq.awk --- print list of word frequencies
+awk 'address@hidden"~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c";
+printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O,
+X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O,
+O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),address@hidden'
address@hidden example
address@hidden
- $0 = tolower($0) # remove case distinctions
- # remove punctuation
- gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
- for (i = 1; i <= NF; i++)
- freq[$i]++
address@hidden
+We leave it to you to determine what the program does.
address@hidden endfile
-END @{
- for (word in freq)
- printf "%s\t%d\n", word, freq[word]
address@hidden
address@hidden example
address@hidden
+To: "Arnold Robbins" <address@hidden>
+Date: Sat, 20 Aug 2011 13:50:46 -0400
+Subject: The GNU Awk User's Guide, Section 13.3.11
+From: "Chris Johansen" <address@hidden>
+Message-ID: <address@hidden>
-Assuming we have saved this program in a file named @file{wordfreq.awk},
-and that the data is in @file{file1}, the following pipeline:
+Arnold, you don't know me, but we have a tenuous connection. My wife is
+Barbara A. Field, FAIA, GIT '65 (B. Arch.).
address@hidden
-awk -f wordfreq.awk file1 | sort -k 2nr
address@hidden example
+I have had a couple of paper copies of "Effective Awk Programming" for
+years, and now I'm going through a Kindle version of "The GNU Awk User's
+Guide" again. When I got to section 13.3.11, I reformatted and lightly
+commented Davide Brini's signature script to understand its workings.
address@hidden
-produces a table of the words appearing in @file{file1} in order of
-decreasing frequency.
+It occurs to me that this might have pedagogical value as an example
+(although imperfect) of the value of whitespace and comments, and a
+starting point for that discussion. It certainly helped _me_ understand
+what's going on. You are welcome to it, as-is or modified (subject to
+Davide's constraints, of course, which I think I have met).
-The @command{awk} program suitably massages the
-data and produces a word frequency table, which is not ordered.
-The @command{awk} script's output is then sorted by the @command{sort}
-utility and printed on the screen.
+If I were to include it in a future edition, I would put it at some
+distance from section 13.3.11, say, as a note or an appendix, so as not to
+be a "spoiler" to the puzzle.
-The options given to @command{sort}
-specify a sort that uses the second field of each input line (skipping
-one field), that the sort keys should be treated as numeric quantities
-(otherwise @samp{15} would come before @samp{5}), and that the sorting
-should be done in descending (reverse) order.
+Best regards,
+--
+Chris Johansen {johansen at main dot nc dot us}
+ . . . collapsing the probability wave function, sending ripples of
+certainty through the space-time continuum.
-The @command{sort} could even be done from within the program, by changing
-the @code{END} action to:
address@hidden
address@hidden file eg/prog/wordfreq.awk
-END @{
- sort = "sort -k 2nr"
- for (word in freq)
- printf "%s\t%d\n", word, freq[word] | sort
- close(sort)
address@hidden
address@hidden endfile
address@hidden example
+#! /usr/bin/gawk -f
-This way of sorting must be used on systems that do not
-have true pipes at the command-line (or batch-file) level.
-See the general operating system documentation for more information on how
-to use the @command{sort} program.
address@hidden ENDOFRANGE worus
+# From "13.3.11 And Now For Something Completely Different"
+# http://www.gnu.org/software/gawk/manual/html_node/Signature-Program.html#Signature-Program
address@hidden History Sorting
address@hidden Removing Duplicates from Unsorted Text
+# Copyright © 2008 Davide Brini
address@hidden STARTOFRANGE lidu
address@hidden lines, address@hidden removing
-The @command{uniq} program
-(@pxref{Uniq Program}),
-removes duplicate lines from @emph{sorted} data.
+# Copying and distribution of the code published in this page, with
+# or without modification, are permitted in any medium without
+# royalty provided the copyright notice and this notice are preserved.
+
+BEGIN {
+ O = "~" ~ "~"; # 1
+ o = "==" == "=="; # 1
+ o += +o; # 2
+ x = O "" O; # 11
-Suppose, however, you need to remove duplicate lines from a @value{DF} but
-that you want to preserve the order the lines are in. A good example of
-this might be a shell history file. The history file keeps a copy of all
-the commands you have entered, and it is not unusual to repeat a command
-several times in a row. Occasionally you might want to compact the history
-by removing duplicate entries. Yet it is desirable to maintain the order
-of the original commands.
-This simple program does the job. It uses two arrays. The @code{data}
-array is indexed by the text of each line.
-For each line, @code{data[$0]} is incremented.
-If a particular line has not
-been seen before, then @code{data[$0]} is zero.
-In this case, the text of the line is stored in @code{lines[count]}.
-Each element of @code{lines} is a unique command, and the indices of
address@hidden indicate the order in which those lines are encountered.
-The @code{END} rule simply prints out the lines, in order:
+ while ( X++ <= x + o + o ) c = c "%c";
address@hidden Rakitzis, Byron
address@hidden @code{histsort.awk} program
address@hidden
address@hidden file eg/prog/histsort.awk
-# histsort.awk --- compact a shell history file
-# Thanks to Byron Rakitzis for the general idea
address@hidden endfile
address@hidden
address@hidden file eg/prog/histsort.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# May 1993
address@hidden endfile
+ # O is 1
+ # o is 2
+ # x is 11
+ # X is 17
+ # c is "%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c"
+
+ printf c,
+ ( x - O )*( x - O), # 100 d
+ x*( x - o ) - o, # 97 a
+ x*( x - O ) + x - O - o, # 118 v
+ +x*( x - O ) - x + o, # 101 e
+ X*( o*o + O ) + x - O, # 95 _
+ X*( X - x ) - o*o, # 98 b
+ ( x + X )*o*o + o, # 114 r
+ x*( X - x ) - O - O, # 64 @
+ x - O + ( O + o + X + x )*( o + O ), # 103 g
+ X*X - X*( x - O ) - x + O, # 109 m
+ O + X*( o*( o + O ) + O ), # 120 x
+ +x + O + X*o, # 46 .
+ x*( x - o), # 99 c
+ ( o + X + x )*o*o - ( x - O - O ), # 111 o
+ O + ( X - x )*( X + O ), # 109 m
+ x - O # 10 \n
+}
@end ignore
address@hidden file eg/prog/histsort.awk
address@hidden
address@hidden
- if (data[$0]++ == 0)
- lines[++count] = $0
address@hidden
address@hidden group
address@hidden The original text for this chapter was contributed by Efraim Yawitz.
address@hidden FIXME: Add more indexing.
address@hidden
-END @{
- for (i = 1; i <= count; i++)
- print lines[i]
address@hidden
address@hidden group
address@hidden endfile
address@hidden example
address@hidden Debugger
address@hidden Debugging @command{awk} Programs
address@hidden debugging @command{awk} programs
-This program also provides a foundation for generating other useful
-information. For example, using the following @code{print} statement in the
address@hidden rule indicates how often a particular command is used:
+It would be nice if computer programs worked perfectly the first time they
+were run, but in real life, this rarely happens for programs of
+any complexity. Thus, most programming languages have facilities available
+for ``debugging'' programs, and now @command{awk} is no exception.
address@hidden
-print data[lines[i]], lines[i]
address@hidden example
+The @command{gawk} debugger is purposely modeled after
+@uref{http://www.gnu.org/software/gdb/, the GNU Debugger (GDB)}
+command-line debugger. If you are familiar with GDB, learning
+how to use @command{gawk} for debugging your program is easy.
-This works because @code{data[$0]} is incremented each time a line is
-seen.
address@hidden ENDOFRANGE lidu
address@hidden
+* Debugging:: Introduction to @command{gawk} debugger.
+* Sample Debugging Session:: Sample debugging session.
+* List of Debugger Commands:: Main debugger commands.
+* Readline Support:: Readline support.
+* Limitations:: Limitations and future plans.
address@hidden menu
address@hidden Extract Program
address@hidden Extracting Programs from Texinfo Source Files
address@hidden Debugging
address@hidden Introduction to @command{gawk} Debugger
address@hidden STARTOFRANGE texse
address@hidden Texinfo, extracting programs from source files
address@hidden STARTOFRANGE fitex
address@hidden files, address@hidden extracting programs from
address@hidden
-Both this chapter and the previous chapter
-(@ref{Library Functions})
-present a large number of @command{awk} programs.
address@hidden ifnotinfo
address@hidden
-The nodes
address@hidden Functions},
-and @ref{Sample Programs},
-are the top level nodes for a large number of @command{awk} programs.
address@hidden ifinfo
-If you want to experiment with these programs, it is tedious to have to type
-them in by hand. Here we present a program that can extract parts of a
-Texinfo input file into separate files.
+This @value{SECTION} introduces debugging in general and begins
+the discussion of debugging in @command{gawk}.
address@hidden Texinfo
-This @value{DOCUMENT} is written in @uref{http://texinfo.org, Texinfo},
-the GNU project's document formatting language.
-A single Texinfo source file can be used to produce both
-printed and online documentation.
address@hidden
-Texinfo is fully documented in the book
address@hidden GNU Documentation Format},
-available from the Free Software Foundation.
address@hidden ifnotinfo
address@hidden
-The Texinfo language is described fully, starting with
address@hidden, , Texinfo, texinfo,Texinfo---The GNU Documentation Format}.
address@hidden ifinfo
address@hidden
+* Debugging Concepts:: Debugging in General.
+* Debugging Terms:: Additional Debugging Concepts.
+* Awk Debugging:: Awk Debugging.
address@hidden menu
-For our purposes, it is enough to know three things about Texinfo input
-files:
address@hidden Debugging Concepts
address@hidden Debugging in General
+
+(If you have used debuggers in other languages, you may want to skip
+ahead to the next section on the specific features of the @command{awk}
+debugger.)
+
+Of course, a debugging program cannot remove bugs for you, since it has
+no way of knowing what you or your users consider a ``bug'' and what is a
+``feature.'' (Sometimes, we humans have a hard time with this ourselves.)
+In that case, what can you expect from such a tool? The answer to that
+depends on the language being debugged, but in general, you can expect at
+least the following:
@itemize @bullet
@item
-The ``at'' symbol (@samp{@@}) is special in Texinfo, much as
-the backslash (@samp{\}) is in C
-or @command{awk}. Literal @samp{@@} symbols are represented in Texinfo source
-files as @samp{@@@@}.
+The ability to watch a program execute its instructions one by one,
+giving you, the programmer, the opportunity to think about what is happening
+on a time scale of seconds, minutes, or hours, rather than the nanosecond
+time scale at which the code usually runs.
@item
-Comments start with either @samp{@@c} or @samp{@@comment}.
-The file-extraction program works by using special comments that start
-at the beginning of a line.
+The opportunity to not only passively observe the operation of your
+program, but to control it and try different paths of execution, without
+having to change your source files.
@item
-Lines containing @samp{@@group} and @samp{@@end group} commands bracket
-example text that should not be split across a page boundary.
-(Unfortunately, @TeX{} isn't always smart enough to do things exactly right,
-so we have to give it some help.)
address@hidden itemize
-
-The following program, @file{extract.awk}, reads through a Texinfo source
-file and does two things, based on the special comments.
-Upon seeing @address@hidden@@c system @dots{}}},
-it runs a command, by extracting the command text from the
-control line and passing it on to the @code{system()} function
-(@pxref{I/O Functions}).
-Upon seeing @samp{@@c file @var{filename}}, each subsequent line is sent to
-the file @var{filename}, until @samp{@@c endfile} is encountered.
-The rules in @file{extract.awk} match either @samp{@@c} or
address@hidden@@comment} by letting the @samp{omment} part be optional.
-Lines containing @samp{@@group} and @samp{@@end group} are simply removed.
address@hidden uses the @code{join()} library function
-(@pxref{Join Function}).
-
-The example programs in the online Texinfo source for @address@hidden
-(@file{gawk.texi}) have all been bracketed inside @samp{file} and
address@hidden lines. The @command{gawk} distribution uses a copy of
address@hidden to extract the sample programs and install many
-of them in a standard directory where @command{gawk} can find them.
-The Texinfo file looks something like this:
-
address@hidden
address@hidden
-This program has a @@address@hidden@} rule,
-that prints a nice message:
-
-@@example
-@@c file examples/messages.awk
-BEGIN @@@{ print "Don't panic!" @@@}
-@@c end file
-@@end example
+The chance to see the values of data in the program at any point in
+execution, and also to change that data on the fly, to see how that
+affects what happens afterwards. (This often includes the ability
+to look at internal data structures besides the variables you actually
+defined in your code.)
-It also prints some final advice:
address@hidden
+The ability to obtain additional information about your program's state
+or even its internal structure.
address@hidden itemize
-@@example
-@@c file examples/messages.awk
-END @@@{ print "Always avoid bored archeologists!" @@@}
-@@c end file
-@@end example
address@hidden
address@hidden example
+All of these tools provide a great amount of help in using your own
+skills and understanding of the goals of your program to find where it
+is going wrong (or, for that matter, to better comprehend a perfectly
+functional program that you or someone else wrote).
address@hidden begins by setting @code{IGNORECASE} to one, so that
-mixed upper- and lowercase letters in the directives won't matter.
address@hidden Debugging Terms
address@hidden Additional Debugging Concepts
-The first rule handles calling @code{system()}, checking that a command is
-given (@code{NF} is at least three) and also checking that the command
-exits with a zero exit status, signifying OK:
+Before diving into the details, we need to introduce several
+important concepts that apply to just about all debuggers.
+The following list defines terms used throughout the rest of
+this @value{CHAPTER}.
address@hidden @code{extract.awk} program
address@hidden
address@hidden file eg/prog/extract.awk
-# extract.awk --- extract files and run programs
-# from texinfo files
address@hidden endfile
address@hidden
address@hidden file eg/prog/extract.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# May 1993
-# Revised September 2000
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/extract.awk
address@hidden @dfn
address@hidden Stack Frame
+Programs generally call functions during the course of their execution.
+One function can call another, or a function can call itself (recursion).
+You can view the chain of called functions (main program calls A, which
+calls B, which calls C), as a stack of executing functions: the currently
+running function is the topmost one on the stack, and when it finishes
+(returns), the next one down then becomes the active function.
+Such a stack is termed a @dfn{call stack}.
-BEGIN @{ IGNORECASE = 1 @}
+For each function on the call stack, the system maintains a data area
+that contains the function's parameters, local variables, and return value,
+as well as any other ``bookkeeping'' information needed to manage the
+call stack. This data area is termed a @dfn{stack frame}.
-/^@@c(omment)?[ \t]+system/ \
address@hidden
- if (NF < 3) @{
- e = (FILENAME ":" FNR)
- e = (e ": badly formed `system' line")
- print e > "/dev/stderr"
- next
- @}
- $1 = ""
- $2 = ""
- stat = system($0)
- if (stat != 0) @{
- e = (FILENAME ":" FNR)
- e = (e ": warning: system returned " stat)
- print e > "/dev/stderr"
- @}
address@hidden
address@hidden endfile
address@hidden example
+@command{gawk} also follows this model, and gives you
+access to the call stack and to each stack frame. You can see the
+call stack, as well as from where each function on the stack was
+invoked. Commands that print the call stack print information about
+each stack frame (as detailed later on).
address@hidden
-The variable @code{e} is used so that the rule
-fits nicely on the
address@hidden
-page.
address@hidden ifnotinfo
address@hidden
-screen.
address@hidden ifnottex
address@hidden Breakpoint
+During debugging, you often wish to let the program run until it
+reaches a certain point, and then continue execution from there one
+statement (or instruction) at a time. The way to do this is to set
+a @dfn{breakpoint} within the program. A breakpoint is where the
+execution of the program should break off (stop), so that you can
+take over control of the program's execution. You can add and remove
+as many breakpoints as you like.
-The second rule handles moving data into files. It verifies that a
address@hidden is given in the directive. If the file named is not the
-current file, then the current file is closed. Keeping the current file
-open until a new file is encountered allows the use of the @samp{>}
-redirection for printing the contents, keeping open file management
-simple.
address@hidden Watchpoint
+A watchpoint is similar to a breakpoint. The difference is that
+breakpoints are oriented around the code: stop when a certain point in the
+code is reached. A watchpoint, however, specifies that program execution
+should stop when a @emph{data value} is changed. This is useful, since
+sometimes it happens that a variable receives an erroneous value, and it's
+hard to track down where this happens just by looking at the code.
+By using a watchpoint, you can stop whenever a variable is assigned to,
+and usually find the errant code quite quickly.
address@hidden table
-The @code{for} loop does the work. It reads lines using @code{getline}
-(@pxref{Getline}).
-For an unexpected end of file, it calls the @address@hidden()}}
-function. If the line is an ``endfile'' line, then it breaks out of
-the loop.
-If the line is an @samp{@@group} or @samp{@@end group} line, then it
-ignores it and goes on to the next line.
-Similarly, comments within examples are also ignored.
address@hidden Awk Debugging
address@hidden Awk Debugging
-Most of the work is in the following few lines. If the line has no @samp{@@}
-symbols, the program can print it directly.
-Otherwise, each leading @samp{@@} must be stripped off.
-To remove the @samp{@@} symbols, the line is split into separate elements of
-the array @code{a}, using the @code{split()} function
-(@pxref{String Functions}).
-The @samp{@@} symbol is used as the separator character.
-Each element of @code{a} that is empty indicates two successive @samp{@@}
-symbols in the original line. For each two empty elements (@samp{@@@@} in
-the original file), we have to add a single @samp{@@} symbol back
address@hidden program was written before @command{gawk} had the
address@hidden()} function. Consider how you might use it to simplify the code.}
+Debugging an @command{awk} program has some specific aspects that are
+not shared with other programming languages.
-When the processing of the array is finished, @code{join()} is called with the
-value of @code{SUBSEP}, to rejoin the pieces back into a single
-line. That line is then printed to the output file:
+First of all, the fact that @command{awk} programs usually take input
+line-by-line from a file or files and operate on those lines using specific
+rules makes it especially useful to organize viewing the execution of
+the program in terms of these rules. As we will see, each @command{awk}
+rule is treated almost like a function call, with its own specific block
+of instructions.
address@hidden
address@hidden file eg/prog/extract.awk
-/^@@c(omment)?[ \t]+file/ \
address@hidden
- if (NF != 3) @{
- e = (FILENAME ":" FNR ": badly formed `file' line")
- print e > "/dev/stderr"
- next
- @}
- if ($3 != curfile) @{
- if (curfile != "")
- close(curfile)
- curfile = $3
- @}
+In addition, since @command{awk} is by design a very concise language,
+it is easy to lose sight of everything that is going on ``inside''
+each line of @command{awk} code. The debugger provides the opportunity
+to look at the individual primitive instructions carried out
+by the higher-level @command{awk} commands.
- for (;;) @{
- if ((getline line) <= 0)
- unexpected_eof()
- if (line ~ /^@@c(omment)?[ \t]+endfile/)
- break
- else if (line ~ /^@@(end[ \t]+)?group/)
- continue
- else if (line ~ /^@@c(omment+)?[ \t]+/)
- continue
- if (index(line, "@@") == 0) @{
- print line > curfile
- continue
- @}
- n = split(line, a, "@@")
- # if a[1] == "", means leading @@,
- # don't add one back in.
- for (i = 2; i <= n; i++) @{
- if (a[i] == "") @{ # was an @@@@
- a[i] = "@@"
- if (a[i+1] == "")
- i++
- @}
- @}
- print join(a, 1, n, SUBSEP) > curfile
- @}
address@hidden
address@hidden endfile
address@hidden example
address@hidden Sample Debugging Session
address@hidden Sample Debugging Session
-An important thing to note is the use of the @samp{>} redirection.
-Output done with @samp{>} only opens the file once; it stays open and
-subsequent output is appended to the file
-(@pxref{Redirection}).
-This makes it easy to mix program text and explanatory prose for the same
-sample source file (as has been done here!) without any hassle. The file is
-only closed when a new data @value{FN} is encountered or at the end of the
-input file.
+In order to illustrate the use of @command{gawk} as a debugger, let's look at a sample
+debugging session. We will use the @command{awk} implementation of the
+POSIX @command{uniq} command described earlier (@pxref{Uniq Program})
+as our example.
-Finally, the function @address@hidden()}} prints an appropriate
-error message and then exits.
-The @code{END} rule handles the final cleanup, closing the open file:
address@hidden
+* Debugger Invocation:: How to Start the Debugger.
+* Finding The Bug:: Finding the Bug.
address@hidden menu
address@hidden function lb put on same line for page breaking. sigh
address@hidden
address@hidden file eg/prog/extract.awk
address@hidden
-function unexpected_eof()
address@hidden
- printf("%s:%d: unexpected EOF or error\n",
- FILENAME, FNR) > "/dev/stderr"
- exit 1
address@hidden
address@hidden group
address@hidden Debugger Invocation
address@hidden How to Start the Debugger
-END @{
- if (curfile)
- close(curfile)
address@hidden
address@hidden endfile
+Starting the debugger is almost exactly like running @command{awk}, except you have to
+pass an additional option @option{--debug} or the corresponding short option @option{-D}.
+The file(s) containing the program and any supporting code are given on the command
+line as arguments to one or more @option{-f} options. (@command{gawk} is not designed
+to debug command-line programs, only programs contained in files.) In our case,
+we invoke the debugger like this:
+
address@hidden
+$ @kbd{gawk -D -f getopt.awk -f join.awk -f uniq.awk inputfile}
@end example
address@hidden ENDOFRANGE texse
address@hidden ENDOFRANGE fitex
address@hidden Simple Sed
address@hidden A Simple Stream Editor
address@hidden
+where both @file{getopt.awk} and @file{uniq.awk} are in @env{$AWKPATH}.
+(Experienced users of GDB or similar debuggers should note that
+this syntax is slightly different from what they are used to.
+With the @command{gawk} debugger, the arguments for running the program are given
+in the command line to the debugger rather than as part of the @code{run}
+command at the debugger prompt.)
address@hidden @command{sed} utility
address@hidden stream editors
-The @command{sed} utility is a stream editor, a program that reads a
-stream of data, makes changes to it, and passes it on.
-It is often used to make global changes to a large file or to a stream
-of data generated by a pipeline of commands.
-While @command{sed} is a complicated program in its own right, its most common
-use is to perform global substitutions in the middle of a pipeline:
+Instead of immediately running the program on @file{inputfile}, as
address@hidden would ordinarily do, the debugger merely loads all
+the program source files, compiles them internally, and then gives
+us a prompt:
@example
-command1 < orig.data | sed 's/old/new/g' | command2 > result
+gawk>
@end example
-Here, @samp{s/old/new/g} tells @command{sed} to look for the regexp
address@hidden on each input line and globally replace it with the text
address@hidden, i.e., all the occurrences on a line. This is similar to
address@hidden's @code{gsub()} function
-(@pxref{String Functions}).
address@hidden
+from which we can issue commands to the debugger. At this point, no
+code has been executed.
-The following program, @file{awksed.awk}, accepts at least two command-line
-arguments: the pattern to look for and the text to replace it with. Any
-additional arguments are treated as data @value{FN}s to process. If none
-are provided, the standard input is used:
address@hidden Finding The Bug
address@hidden Finding the Bug
address@hidden Brennan, Michael
address@hidden @command{awksed.awk} program
address@hidden @cindex simple stream editor
address@hidden @cindex stream editor, simple
address@hidden
address@hidden file eg/prog/awksed.awk
-# awksed.awk --- do s/foo/bar/g using just print
-# Thanks to Michael Brennan for the idea
address@hidden endfile
address@hidden
address@hidden file eg/prog/awksed.awk
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# August 1995
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/awksed.awk
+Let's say that we are having a problem using (a faulty version of)
address@hidden in the ``field-skipping'' mode, and it doesn't seem to be
+catching lines which should be identical when skipping the first field,
+such as:
-function usage()
address@hidden
- print "usage: awksed pat repl [files...]" > "/dev/stderr"
- exit 1
address@hidden
address@hidden
+awk is a wonderful program!
+gawk is a wonderful program!
address@hidden example
-BEGIN @{
- # validate arguments
- if (ARGC < 3)
- usage()
+This could happen if we were thinking (C-like) of the fields in a record
+as being numbered in a zero-based fashion, so instead of the lines:
- RS = ARGV[1]
- ORS = ARGV[2]
address@hidden
+clast = join(alast, fcount+1, n)
+cline = join(aline, fcount+1, m)
address@hidden example
- # don't use arguments as files
- ARGV[1] = ARGV[2] = ""
address@hidden
address@hidden
+we wrote:
address@hidden
-# look ma, no hands!
address@hidden
- if (RT == "")
- printf "%s", $0
- else
- print
address@hidden
address@hidden group
address@hidden endfile
address@hidden
+clast = join(alast, fcount, n)
+cline = join(aline, fcount, m)
@end example
-The program relies on @command{gawk}'s ability to have @code{RS} be a regexp,
-as well as on the setting of @code{RT} to the actual text that terminates the
-record (@pxref{Records}).
+The first thing we usually want to do when trying to investigate a
+problem like this is to put a breakpoint in the program so that we can
+watch it at work and catch what it is doing wrong. A reasonable spot for
+a breakpoint in @file{uniq.awk} is at the beginning of the function
address@hidden()}, which compares the current line with the previous one. To set
+the breakpoint, use the @code{b} (breakpoint) command:
-The idea is to have @code{RS} be the pattern to look for. @command{gawk}
-automatically sets @code{$0} to the text between matches of the pattern.
-This is text that we want to keep, unmodified. Then, by setting @code{ORS}
-to the replacement text, a simple @code{print} statement outputs the
-text we want to keep, followed by the replacement text.
address@hidden
+gawk> @kbd{b are_equal}
address@hidden Breakpoint 1 set at file `awklib/eg/prog/uniq.awk', line 64
address@hidden example
-There is one wrinkle to this scheme, which is what to do if the last record
-doesn't end with text that matches @code{RS}. Using a @code{print}
-statement unconditionally prints the replacement text, which is not correct.
-However, if the file did not end in text that matches @code{RS}, @code{RT}
-is set to the null string. In this case, we can print @code{$0} using
address@hidden
-(@pxref{Printf}).
+The debugger tells us the file and line number where the breakpoint is.
+Now type @samp{r} or @samp{run} and the program runs until it hits
+the breakpoint for the first time:
-The @code{BEGIN} rule handles the setup, checking for the right number
-of arguments and calling @code{usage()} if there is a problem. Then it sets
address@hidden and @code{ORS} from the command-line arguments and sets
address@hidden and @code{ARGV[2]} to the null string, so that they are
-not treated as @value{FN}s
-(@pxref{ARGC and ARGV}).
address@hidden
+gawk> @kbd{r}
address@hidden Starting program:
address@hidden Stopping in Rule ...
address@hidden Breakpoint 1, are_equal(n, m, clast, cline, alast, aline)
+ at `awklib/eg/prog/uniq.awk':64
address@hidden 64 if (fcount == 0 && charcount == 0)
+gawk>
address@hidden example
-The @code{usage()} function prints an error message and exits.
-Finally, the single rule handles the printing scheme outlined above,
-using @code{print} or @code{printf} as appropriate, depending upon the
-value of @code{RT}.
+Now we can look at what's going on inside our program. First of all,
+let's see how we got to where we are. At the prompt, we type @samp{bt}
+(short for ``backtrace''), and the debugger responds with a
+listing of the current stack frames:
address@hidden
-Exercise, compare the performance of this version with the more
-straightforward:
address@hidden
+gawk> @kbd{bt}
address@hidden #0 are_equal(n, m, clast, cline, alast, aline)
+ at `awklib/eg/prog/uniq.awk':69
address@hidden #1 in main() at `awklib/eg/prog/uniq.awk':89
address@hidden example
-BEGIN {
- pat = ARGV[1]
- repl = ARGV[2]
- ARGV[1] = ARGV[2] = ""
-}
+This tells us that @code{are_equal()} was called by the main program at
+line 89 of @file{uniq.awk}. (This is not a big surprise, since this
+is the only call to @code{are_equal()} in the program, but in more complex
+programs, knowing who called a function and with what parameters can be
+the key to finding the source of the problem.)
-{ gsub(pat, repl); print }
+Now that we're in @code{are_equal()}, we can start looking at the values
+of some variables. Let's say we type @samp{p n}
+(@code{p} is short for ``print''). We would expect to see the value of
address@hidden, a parameter to @code{are_equal()}. Actually, the debugger
+gives us:
-Exercise: what are the advantages and disadvantages of this version versus sed?
- Advantage: egrep regexps
- speed (?)
- Disadvantage: no & in replacement text
address@hidden
+gawk> @kbd{p n}
address@hidden n = untyped variable
address@hidden example
-Others?
address@hidden ignore
address@hidden
+In this case, @code{n} is an uninitialized local variable, since the
+function was called without arguments (@pxref{Function Calls}).
address@hidden Igawk Program
address@hidden An Easy Way to Use Library Functions
+A more useful variable to display might be the current record:
address@hidden STARTOFRANGE libfex
address@hidden libraries of @command{awk} functions, example program for using
address@hidden STARTOFRANGE flibex
address@hidden functions, library, example program for using
-In @ref{Include Files}, we saw how @command{gawk} provides a built-in
-file-inclusion capability. However, this is a @command{gawk} extension.
-This @value{SECTION} provides the motivation for making file inclusion
-available for standard @command{awk}, and shows how to do it using a
-combination of shell and @command{awk} programming.
address@hidden
+gawk> @kbd{p $0}
address@hidden $0 = string ("gawk is a wonderful program!")
address@hidden example
-Using library functions in @command{awk} can be very beneficial. It
-encourages code reuse and the writing of general functions. Programs are
-smaller and therefore clearer.
-However, using library functions is only easy when writing @command{awk}
-programs; it is painful when running them, requiring multiple @option{-f}
-options. If @command{gawk} is unavailable, then so too is the @env{AWKPATH}
-environment variable and the ability to put @command{awk} functions into a
-library directory (@pxref{Options}).
-It would be nice to be able to write programs in the following manner:
address@hidden
+This might be a bit puzzling at first since this is the second line of
+our test input above. Let's look at @code{NR}:
@example
-# library functions
-@@include getopt.awk
-@@include join.awk
address@hidden
+gawk> @kbd{p NR}
address@hidden NR = number (2)
address@hidden example
-# main program
-BEGIN @{
- while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
- @dots{}
- @dots{}
address@hidden
+So we can see that @code{are_equal()} was only called for the second record
+of the file. Of course, this is because our program contained a rule for
address@hidden == 1}:
+
address@hidden
+NR == 1 @{
+ last = $0
+ next
@}
@end example
-The following program, @file{igawk.sh}, provides this service.
-It simulates @command{gawk}'s searching of the @env{AWKPATH} variable
-and also allows @dfn{nested} includes; i.e., a file that is included
-with @samp{@@include} can contain further @samp{@@include} statements.
address@hidden makes an effort to only include files once, so that nested
-includes don't accidentally include a library function twice.
-
address@hidden should behave just like @command{gawk} externally. This
-means it should accept all of @command{gawk}'s command-line arguments,
-including the ability to have multiple source files specified via
address@hidden, and the ability to mix command-line and library source files.
+OK, let's just check that that rule worked correctly:
-The program is written using the POSIX Shell (@command{sh}) command
address@hidden explaining the @command{sh} language is beyond
-the scope of this book. We provide some minimal explanations, but see
-a good shell programming book if you wish to understand things in more
-depth.} It works as follows:
address@hidden
+gawk> @kbd{p last}
address@hidden last = string ("awk is a wonderful program!")
address@hidden example
address@hidden
address@hidden
-Loop through the arguments, saving anything that doesn't represent
address@hidden source code for later, when the expanded program is run.
+Everything we have done so far has verified that the program has worked as
+planned, up to and including the call to @code{are_equal()}, so the problem must
+be inside this function. To investigate further, we must begin
+``stepping through'' the lines of @code{are_equal()}. We start by typing
address@hidden (for ``next''):
address@hidden
-For any arguments that do represent @command{awk} text, put the arguments into
-a shell variable that will be expanded. There are two cases:
address@hidden
+gawk> @kbd{n}
address@hidden 67 if (fcount > 0) @{
address@hidden example
address@hidden a
address@hidden
-Literal text, provided with @option{--source} or @option{--source=}. This
-text is just appended directly.
+This tells us that @command{gawk} is now ready to execute line 67, which
+decides whether to give the lines the special ``field skipping'' treatment
+indicated by the @option{-f} command-line option. (Notice that we skipped
+from where we were before at line 64 to here, since the condition in line 64
address@hidden
-Source @value{FN}s, provided with @option{-f}. We use a neat trick and append
address@hidden@@include @var{filename}} to the shell variable's contents. Since the file-inclusion
-program works the way @command{gawk} does, this gets the text
-of the file included into the program at the correct point.
address@hidden enumerate
address@hidden
+if (fcount == 0 && charcount == 0)
address@hidden example
address@hidden
-Run an @command{awk} program (naturally) over the shell variable's contents to expand
address@hidden@@include} statements. The expanded program is placed in a second
-shell variable.
address@hidden
+was false.)
address@hidden
-Run the expanded program with @command{gawk} and any other original command-line
-arguments that the user supplied (such as the data @value{FN}s).
address@hidden enumerate
+Continuing to step, we now get to the splitting of the current and
+last records:
-This program uses shell variables extensively: for storing command-line arguments,
-the text of the @command{awk} program that will expand the user's program, for the
-user's original program, and for the expanded program. Doing so removes some
-potential problems that might arise were we to use temporary files instead,
-at the cost of making the script somewhat more complicated.
address@hidden
+gawk> @kbd{n}
address@hidden 68 n = split(last, alast)
+gawk> @kbd{n}
address@hidden 69 m = split($0, aline)
address@hidden example
-The initial part of the program turns on shell tracing if the first
-argument is @samp{debug}.
+At this point, we should be curious to see what our records were split
+into, so we try to look:
-The next part loops through all the command-line arguments.
-There are several cases of interest:
address@hidden
+gawk> @kbd{p n m alast aline}
address@hidden n = number (5)
address@hidden m = number (5)
address@hidden alast = array, 5 elements
address@hidden aline = array, 5 elements
address@hidden example
address@hidden @code
address@hidden --
-This ends the arguments to @command{igawk}. Anything else should be passed on
-to the user's @command{awk} program without being evaluated.
address@hidden
+(The @code{p} command can take more than one argument, similar to
address@hidden's @code{print} statement.)
address@hidden -W
-This indicates that the next option is specific to @command{gawk}. To make
-argument processing easier, the @option{-W} is appended to the front of the
-remaining arguments and the loop continues. (This is an @command{sh}
-programming trick. Don't worry about it if you are not familiar with
address@hidden)
+This is kind of disappointing, though. All we found out is that there
+are five elements in each of our arrays. Useful enough (we now know that
+none of the words were accidentally left out), but what if we want to see
+inside the array?
address@hidden address@hidden,} -F
-These are saved and passed on to @command{gawk}.
+The first choice would be to use subscripts:
address@hidden address@hidden,} address@hidden,} address@hidden,} -Wfile=
-The @value{FN} is appended to the shell variable @code{program} with an
address@hidden@@include} statement.
-The @command{expr} utility is used to remove the leading option part of the
-argument (e.g., @samp{--file=}).
-(Typical @command{sh} usage would be to use the @command{echo} and @command{sed}
-utilities to do this work. Unfortunately, some versions of @command{echo} evaluate
-escape sequences in their arguments, possibly mangling the program text.
-Using @command{expr} avoids this problem.)
address@hidden
+gawk> @kbd{p alast[0]}
address@hidden "0" not in array `alast'
address@hidden example
address@hidden address@hidden,} address@hidden,} -Wsource=
-The source text is appended to @code{program}.
address@hidden
+Oops!
address@hidden address@hidden,} -Wversion
address@hidden prints its version number, runs @samp{gawk --version}
-to get the @command{gawk} version information, and then exits.
address@hidden table
address@hidden
+gawk> @kbd{p alast[1]}
address@hidden alast["1"] = string ("awk")
address@hidden example
-If none of the @option{-f}, @option{--file}, @option{-Wfile}, @option{--source},
-or @option{-Wsource} arguments are supplied, then the first nonoption argument
-should be the @command{awk} program. If there are no command-line
-arguments left, @command{igawk} prints an error message and exits.
-Otherwise, the first argument is appended to @code{program}.
-In any case, after the arguments have been processed,
address@hidden contains the complete text of the original @command{awk}
-program.
+This would be kind of slow for a 100-member array, though, so
address@hidden provides a shortcut (reminiscent of another language
+not to be mentioned):
-The program is as follows:
address@hidden
+gawk> @kbd{p @@alast}
address@hidden alast["1"] = string ("awk")
address@hidden alast["2"] = string ("is")
address@hidden alast["3"] = string ("a")
address@hidden alast["4"] = string ("wonderful")
address@hidden alast["5"] = string ("program!")
address@hidden example
+
+It looks like we got this far OK. Let's take another step
+or two:
address@hidden @code{igawk.sh} program
@example
address@hidden file eg/prog/igawk.sh
-#! /bin/sh
-# igawk --- like gawk but do @@include processing
address@hidden endfile
address@hidden
address@hidden file eg/prog/igawk.sh
-#
-# Arnold Robbins, arnold@@skeeve.com, Public Domain
-# July 1993
-# December 2010, minor edits
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/igawk.sh
+gawk> @kbd{n}
address@hidden 70 clast = join(alast, fcount, n)
+gawk> @kbd{n}
address@hidden 71 cline = join(aline, fcount, m)
address@hidden example
-if [ "$1" = debug ]
-then
- set -x
- shift
-fi
+Well, here we are at our error (sorry to spoil the suspense). What we
+had in mind was to join the fields starting from the second one to make
+the virtual record to compare, and if the first field was numbered zero,
+this would work. Let's look at what we've got:
-# A literal newline, so that program text is formatted correctly
-n='
-'
address@hidden
+gawk> @kbd{p cline clast}
address@hidden cline = string ("gawk is a wonderful program!")
address@hidden clast = string ("awk is a wonderful program!")
address@hidden example
-# Initialize variables to empty
-program=
-opts=
+Hey, those look pretty familiar! They're just our original, unaltered,
+input records. A little thinking (the human brain is still the best
+debugging tool), and we realize that we were off by one!
-while [ $# -ne 0 ] # loop over arguments
-do
- case $1 in
- --) shift
- break ;;
+We get out of the debugger:
- -W) shift
- # The address@hidden'message here'@} construct prints a
- # diagnostic if $x is the null string
- set -- -W"address@hidden@@?'missing operand'@}"
- continue ;;
address@hidden
+gawk> @kbd{q}
address@hidden The program is running. Exit anyway (y/n)? @kbd{y}
address@hidden example
- -[vF]) opts="$opts $1 'address@hidden'missing operand'@}'"
- shift ;;
address@hidden
+Then we get into an editor:
- -[vF]*) opts="$opts '$1'" ;;
address@hidden
+clast = join(alast, fcount+1, n)
+cline = join(aline, fcount+1, m)
address@hidden example
- -f) program="$program$n@@include address@hidden'missing operand'@}"
- shift ;;
address@hidden
+and problem solved!
- -f*) f=$(expr "$1" : '-f\(.*\)')
- program="$program$n@@include $f" ;;
address@hidden List of Debugger Commands
address@hidden Main Debugger Commands
- -[W-]file=*)
- f=$(expr "$1" : '-.file=\(.*\)')
- program="$program$n@@include $f" ;;
+The @command{gawk} debugger command set can be divided into the
+following categories:
- -[W-]file)
- program="$program$n@@include address@hidden'missing operand'@}"
- shift ;;
address@hidden @bullet{}
- -[W-]source=*)
- t=$(expr "$1" : '-.source=\(.*\)')
- program="$program$n$t" ;;
address@hidden
+Breakpoint control
- -[W-]source)
- program="address@hidden'missing operand'@}"
- shift ;;
address@hidden
+Execution control
- -[W-]version)
- echo igawk: version 3.0 1>&2
- gawk --version
- exit 0 ;;
address@hidden
+Viewing and changing data
- -[W-]*) opts="$opts '$1'" ;;
address@hidden
+Working with the stack
- *) break ;;
- esac
- shift
-done
address@hidden
+Getting information
-if [ -z "$program" ]
-then
- address@hidden'missing program'@}
- shift
-fi
address@hidden
+Miscellaneous
address@hidden itemize
-# At this point, `program' has the program.
address@hidden endfile
address@hidden example
+Each of these is discussed in the following subsections.
+In the following descriptions, commands which may be abbreviated
+show the abbreviation on a second description line.
+A debugger command name may also be truncated if that partial
+name is unambiguous. The debugger has the built-in capability to
+automatically repeat the previous command when just hitting @key{Enter}.
+This works for the commands @code{list}, @code{next}, @code{nexti}, @code{step}, @code{stepi}
+and @code{continue} executed without any argument.
-The @command{awk} program to process @samp{@@include} directives
-is stored in the shell variable @code{expand_prog}. Doing this keeps
-the shell script readable. The @command{awk} program
-reads through the user's program, one line at a time, using @code{getline}
-(@pxref{Getline}). The input
address@hidden and @samp{@@include} statements are managed using a stack.
-As each @samp{@@include} is encountered, the current @value{FN} is
-``pushed'' onto the stack and the file named in the @samp{@@include}
-directive becomes the current @value{FN}. As each file is finished,
-the stack is ``popped,'' and the previous input file becomes the current
-input file again. The process is started by making the original file
-the first one on the stack.
address@hidden
+* Breakpoint Control:: Control of Breakpoints.
+* Debugger Execution Control:: Control of Execution.
+* Viewing And Changing Data:: Viewing and Changing Data.
+* Execution Stack:: Dealing with the Stack.
+* Debugger Info:: Obtaining Information about the Program and
+ the Debugger State.
+* Miscellaneous Debugger Commands:: Miscellaneous Commands.
address@hidden menu
-The @code{pathto()} function does the work of finding the full path to
-a file. It simulates @command{gawk}'s behavior when searching the
address@hidden environment variable
-(@pxref{AWKPATH Variable}).
-If a @value{FN} has a @samp{/} in it, no path search is done.
-Similarly, if the @value{FN} is @code{"-"}, then that string is
-used as-is. Otherwise,
-the @value{FN} is concatenated with the name of each directory in
-the path, and an attempt is made to open the generated @value{FN}.
-The only way to test if a file can be read in @command{awk} is to go
-ahead and try to read it with @code{getline}; this is what @code{pathto()}
address@hidden some very old versions of @command{awk}, the test
address@hidden junk < t} can loop forever if the file exists but is empty.
-Caveat emptor.} If the file can be read, it is closed and the @value{FN}
-is returned:
address@hidden Breakpoint Control
address@hidden Control of Breakpoints
address@hidden
-An alternative way to test for the file's existence would be to call
address@hidden("test -r " t)}, which uses the @command{test} utility to
-see if the file exists and is readable. The disadvantage to this method
-is that it requires creating an extra process and can thus be slightly
-slower.
address@hidden ignore
+As we saw above, the first thing you probably want to do in a debugging
+session is to get your breakpoints set up, since otherwise your program
+will just run as if it were not under the debugger. The commands for
+controlling breakpoints are:
address@hidden
address@hidden file eg/prog/igawk.sh
-expand_prog='
address@hidden @asis
address@hidden debugger commands, @code{b} (@code{break})
address@hidden debugger commands, @code{break}
address@hidden @code{break} debugger command
address@hidden @code{b} debugger command (alias for @code{break})
address@hidden @code{break} address@hidden@code{:address@hidden | @var{function}] address@hidden"@var{expression}"}]
address@hidden @code{b} address@hidden@code{:address@hidden | @var{function}] address@hidden"@var{expression}"}]
+Without any argument, set a breakpoint at the next instruction
+to be executed in the selected stack frame.
+Arguments can be one of the following:
-function pathto(file, i, t, junk)
address@hidden
- if (index(file, "/") != 0)
- return file
address@hidden nested table
address@hidden @var
address@hidden n
+Set a breakpoint at line number @var{n} in the current source file.
- if (file == "-")
- return file
address@hidden address@hidden:}n
+Set a breakpoint at line number @var{n} in source file @var{filename}.
- for (i = 1; i <= ndirs; i++) @{
- t = (pathlist[i] "/" file)
address@hidden
- if ((getline junk < t) > 0) @{
- # found it
- close(t)
- return t
- @}
address@hidden group
- @}
- return ""
address@hidden
address@hidden endfile
address@hidden example
address@hidden function
+Set a breakpoint at entry to (the first instruction of)
+function @var{function}.
address@hidden table
-The main program is contained inside one @code{BEGIN} rule. The first thing it
-does is set up the @code{pathlist} array that @code{pathto()} uses. After
-splitting the path on @samp{:}, null elements are replaced with @code{"."},
-which represents the current directory:
+Each breakpoint is assigned a number which can be used to delete it from
+the breakpoint list using the @code{delete} command.
address@hidden
address@hidden file eg/prog/igawk.sh
-BEGIN @{
- path = ENVIRON["AWKPATH"]
- ndirs = split(path, pathlist, ":")
- for (i = 1; i <= ndirs; i++) @{
- if (pathlist[i] == "")
- pathlist[i] = "."
- @}
address@hidden endfile
address@hidden example
+With a breakpoint, you may also supply a condition. This is an
address@hidden expression (enclosed in double quotes) that the debugger
+evaluates whenever the breakpoint is reached. If the condition is true,
+then the debugger stops execution and prompts for a command. Otherwise,
+it continues executing the program.
-The stack is initialized with @code{ARGV[1]}, which will be @file{/dev/stdin}.
-The main loop comes next. Input lines are read in succession. Lines that
-do not start with @samp{@@include} are printed verbatim.
-If the line does start with @samp{@@include}, the @value{FN} is in @code{$2}.
address@hidden()} is called to generate the full path. If it cannot, then the program
-prints an error message and continues.
address@hidden debugger commands, @code{clear}
address@hidden @code{clear} debugger command
address@hidden @code{clear} address@hidden@code{:address@hidden | @var{function}]
+Without any argument, delete any breakpoint at the next instruction
+to be executed in the selected stack frame. If the program stops at
+a breakpoint, this deletes that breakpoint so that the program
+does not stop at that location again. Arguments can be one of the following:
-The next thing to check is if the file is included already. The
address@hidden array is indexed by the full @value{FN} of each included
-file and it tracks this information for us. If the file is
-seen again, a warning message is printed. Otherwise, the new @value{FN} is
-pushed onto the stack and processing continues.
address@hidden nested table
address@hidden @var
address@hidden n
+Delete breakpoint(s) set at line number @var{n} in the current source file.
-Finally, when @code{getline} encounters the end of the input file, the file
-is closed and the stack is popped. When @code{stackptr} is less than zero,
-the program is done:
address@hidden address@hidden:}n
+Delete breakpoint(s) set at line number @var{n} in source file @var{filename}.
address@hidden
address@hidden file eg/prog/igawk.sh
- stackptr = 0
- input[stackptr] = ARGV[1] # ARGV[1] is first file
address@hidden function
+Delete breakpoint(s) set at entry to function @var{function}.
address@hidden table
- for (; stackptr >= 0; stackptr--) @{
- while ((getline < input[stackptr]) > 0) @{
- if (tolower($1) != "@@include") @{
- print
- continue
- @}
- fpath = pathto($2)
address@hidden
- if (fpath == "") @{
- printf("igawk:%s:%d: cannot find %s\n",
- input[stackptr], FNR, $2) > "/dev/stderr"
- continue
- @}
address@hidden group
- if (! (fpath in processed)) @{
- processed[fpath] = input[stackptr]
- input[++stackptr] = fpath # push onto stack
- @} else
- print $2, "included in", input[stackptr],
- "already included in",
- processed[fpath] > "/dev/stderr"
- @}
- close(input[stackptr])
- @}
address@hidden' # close quote ends `expand_prog' variable
address@hidden debugger commands, @code{condition}
address@hidden @code{condition} debugger command
address@hidden @code{condition} @var{n} @code{"@var{expression}"}
+Add a condition to existing breakpoint or watchpoint @var{n}. The
+condition is an @command{awk} expression that the debugger evaluates
+whenever the breakpoint or watchpoint is reached. If the condition is true, then
+the debugger stops execution and prompts for a command. Otherwise,
+the debugger continues executing the program. If the condition expression is
+not specified, any existing condition is removed; i.e., the breakpoint or
+watchpoint is made unconditional.
-processed_program=$(gawk -- "$expand_prog" /dev/stdin << EOF
-$program
-EOF
-)
address@hidden endfile
address@hidden example
address@hidden debugger commands, @code{d} (@code{delete})
address@hidden debugger commands, @code{delete}
address@hidden @code{delete} debugger command
address@hidden @code{d} debugger command (alias for @code{delete})
address@hidden @code{delete} address@hidden n2} @dots{}] address@hidden@var{m}]
address@hidden @code{d} address@hidden n2} @dots{}] address@hidden@var{m}]
+Delete specified breakpoints or a range of breakpoints. Deletes
+all defined breakpoints if no argument is supplied.
-The shell construct @address@hidden << @var{marker}} is called a @dfn{here document}.
-Everything in the shell script up to the @var{marker} is fed to @var{command} as input.
-The shell processes the contents of the here document for variable and command substitution
-(and possibly other things as well, depending upon the shell).
address@hidden debugger commands, @code{disable}
address@hidden @code{disable} debugger command
address@hidden @code{disable} address@hidden n2} @dots{} | @address@hidden
+Disable specified breakpoints or a range of breakpoints. Without
+any argument, disables all breakpoints.
-The shell construct @samp{$(@dots{})} is called @dfn{command substitution}.
-The output of the command inside the parentheses is substituted
-into the command line.
-Because the result is used in a variable assignment,
-it is saved as a single string, even if the results contain whitespace.
address@hidden debugger commands, @code{e} (@code{enable})
address@hidden debugger commands, @code{enable}
address@hidden @code{enable} debugger command
address@hidden @code{e} debugger command (alias for @code{enable})
address@hidden @code{enable} address@hidden | @code{once}] address@hidden n2} @dots{}] address@hidden@var{m}]
address@hidden @code{e} address@hidden | @code{once}] address@hidden n2} @dots{}] address@hidden@var{m}]
+Enable specified breakpoints or a range of breakpoints. Without
+any argument, enables all breakpoints.
+Optionally, you can specify how to enable the breakpoint:
-The expanded program is saved in the variable @code{processed_program}.
-It's done in these steps:
address@hidden nested table
address@hidden @code
address@hidden del
+Enable the breakpoint(s) temporarily, then delete it when
+the program stops at the breakpoint.
address@hidden
address@hidden
-Run @command{gawk} with the @samp{@@include}-processing program (the
-value of the @code{expand_prog} shell variable) on standard input.
address@hidden once
+Enable the breakpoint(s) temporarily, then disable it when
+the program stops at the breakpoint.
address@hidden table
address@hidden
-Standard input is the contents of the user's program, from the shell variable @code{program}.
-Its contents are fed to @command{gawk} via a here document.
address@hidden debugger commands, @code{ignore}
address@hidden @code{ignore} debugger command
address@hidden @code{ignore} @var{n} @var{count}
+Ignore breakpoint number @var{n} the next @var{count} times it is
+hit.
address@hidden
-The results of this processing are saved in the shell variable @code{processed_program} by using command substitution.
address@hidden enumerate
address@hidden debugger commands, @code{t} (@code{tbreak})
address@hidden debugger commands, @code{tbreak}
address@hidden @code{tbreak} debugger command
address@hidden @code{t} debugger command (alias for @code{tbreak})
address@hidden @code{tbreak} address@hidden@code{:address@hidden |
@var{function}]
address@hidden @code{t} address@hidden@code{:address@hidden | @var{function}]
+Set a temporary breakpoint (enabled for only one stop).
+The arguments are the same as for @code{break}.
address@hidden table
-The last step is to call @command{gawk} with the expanded program,
-along with the original
-options and command-line arguments that the user supplied.
address@hidden Debugger Execution Control
address@hidden Control of Execution
address@hidden this causes more problems than it solves, so leave it out.
address@hidden
-The special file @file{/dev/null} is passed as a @value{DF} to @command{gawk}
-to handle an interesting case. Suppose that the user's program only has
-a @code{BEGIN} rule and there are no @value{DF}s to read.
-The program should exit without reading any @value{DF}s.
-However, suppose that an included library file defines an @code{END}
-rule of its own. In this case, @command{gawk} will hang, reading standard
-input. In order to avoid this, @file{/dev/null} is explicitly added to the
-command-line. Reading from @file{/dev/null} always returns an immediate
-end of file indication.
+Now that your breakpoints are ready, you can start running the program
+and observing its behavior. There are more commands for controlling
+execution of the program than we saw in our earlier example:
address@hidden Hmm. Add /dev/null if $# is 0? Still messes up ARGV. Sigh.
address@hidden ignore
address@hidden @asis
address@hidden debugger commands, @code{commands}
address@hidden @code{commands} debugger command
address@hidden debugger commands, @code{silent}
address@hidden @code{silent} debugger command
address@hidden debugger commands, @code{end}
address@hidden @code{end} debugger command
address@hidden @code{commands} address@hidden
address@hidden @code{silent}
address@hidden @dots{}
address@hidden @code{end}
+Set a list of commands to be executed upon stopping at
+a breakpoint or watchpoint. @var{n} is the breakpoint or watchpoint number.
+Without a number, the last one set is used. The actual commands follow,
+starting on the next line, and terminated by the @code{end} command.
+If the command @code{silent} is in the list, the usual messages about
+stopping at a breakpoint and the source line are not printed. Any command
+in the list that resumes execution (e.g., @code{continue}) terminates the list
+(an implicit @code{end}), and subsequent commands are ignored.
+For example:
@example
address@hidden file eg/prog/igawk.sh
-eval gawk $opts -- '"$processed_program"' '"$@@"'
address@hidden endfile
+gawk> @kbd{commands}
+> @kbd{silent}
+> @kbd{printf "A silent breakpoint; i = %d\n", i}
+> @kbd{info locals}
+> @kbd{set i = 10}
+> @kbd{continue}
+> @kbd{end}
+gawk>
@end example
-The @command{eval} command is a shell construct that reruns the shell's parsing
-process. This keeps things properly quoted.
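The effect of the second parsing pass can be sketched as follows. This is an illustration only: @code{count_args} is a hypothetical stand-in for @command{gawk} that reports how many arguments it received, and the option string and file names are invented:

```shell
#!/bin/sh
# count_args is a hypothetical stand-in for gawk: it prints the number
# of arguments it was called with.
count_args() {
    echo "$#"
}

opts='-v x=1'                        # options gathered into one string
processed_program='BEGIN { print "hi" }'
set -- 'file one' 'file two'         # pretend user arguments with spaces

# First pass: $opts is expanded and split; the single-quoted words reach
# eval as the literal text "$processed_program" and "$@".
# Second pass (inside eval): those are expanded with their double quotes
# intact, so each value stays a single argument.
n=$(eval count_args $opts -- '"$processed_program"' '"$@"')
echo "$n"
```

The stand-in sees six arguments: the two words from @code{opts}, the @samp{--}, the program text as one argument, and the two file names, each kept whole despite the embedded space.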
address@hidden debugger commands, @code{c} (@code{continue})
address@hidden debugger commands, @code{continue}
address@hidden @code{continue} address@hidden
address@hidden @code{c} address@hidden
+Resume program execution. If continued from a breakpoint and @var{count} is
+specified, ignores the breakpoint at that location the next @var{count} times
+before stopping.
-This version of @command{igawk} represents my fifth version of this program.
-There are four key simplifications that make the program work better:
address@hidden debugger commands, @code{finish}
address@hidden @code{finish} debugger command
address@hidden @code{finish}
+Execute until the selected stack frame returns.
+Print the returned value.
address@hidden @bullet
address@hidden
-Using @samp{@@include} even for the files named with @option{-f} makes building
-the initial collected @command{awk} program much simpler; all the
address@hidden@@include} processing can be done once.
address@hidden debugger commands, @code{n} (@code{next})
address@hidden debugger commands, @code{next}
address@hidden @code{next} debugger command
address@hidden @code{n} debugger command (alias for @code{next})
address@hidden @code{next} address@hidden
address@hidden @code{n} address@hidden
+Continue execution to the next source line, stepping over function calls.
+The argument @var{count} controls how many times to repeat the action, as
+in @code{step}.
address@hidden
-Not trying to save the line read with @code{getline}
-in the @code{pathto()} function when testing for the
-file's accessibility for use with the main program simplifies things
-considerably.
address@hidden what problem does this engender though - exercise
address@hidden answer, reading from "-" or /dev/stdin
address@hidden debugger commands, @code{ni} (@code{nexti})
address@hidden debugger commands, @code{nexti}
address@hidden @code{nexti} debugger command
address@hidden @code{ni} debugger command (alias for @code{nexti})
address@hidden @code{nexti} address@hidden
address@hidden @code{ni} address@hidden
+Execute one (or @var{count}) instruction(s), stepping over function calls.
address@hidden
-Using a @code{getline} loop in the @code{BEGIN} rule does it all in one
-place. It is not necessary to call out to a separate loop for processing
-nested @samp{@@include} statements.
address@hidden debugger commands, @code{return}
address@hidden @code{return} debugger command
address@hidden @code{return} address@hidden
+Cancel execution of a function call. If @var{value} (either a string or a
+number) is specified, it is used as the function's return value. If used in a
+frame other than the innermost one (the currently executing function, i.e.,
+frame number 0), discard all inner frames in addition to the selected one,
+and the caller of that frame becomes the innermost frame.
address@hidden
-Instead of saving the expanded program in a temporary file, putting it in a shell variable
-avoids some potential security problems.
-This has the disadvantage that the script relies upon more features
-of the @command{sh} language, making it harder to follow for those who
-aren't familiar with @command{sh}.
address@hidden itemize
address@hidden debugger commands, @code{r} (@code{run})
address@hidden debugger commands, @code{run}
address@hidden @code{run} debugger command
address@hidden @code{r} debugger command (alias for @code{run})
address@hidden @code{run}
address@hidden @code{r}
+Start/restart execution of the program. When restarting, the debugger
+retains the current breakpoints, watchpoints, command history,
+automatic display variables, and debugger options.
-Also, this program illustrates that it is often worthwhile to combine
address@hidden and @command{awk} programming together. You can usually
-accomplish quite a lot, without having to resort to low-level programming
-in C or C++, and it is frequently easier to do certain kinds of string
-and argument manipulation using the shell than it is in @command{awk}.
address@hidden debugger commands, @code{s} (@code{step})
address@hidden debugger commands, @code{step}
address@hidden @code{step} debugger command
address@hidden @code{s} debugger command (alias for @code{step})
address@hidden @code{step} address@hidden
address@hidden @code{s} address@hidden
+Continue execution until control reaches a different source line in the
+current stack frame. @code{step} steps inside any function called within
+the line. If the argument @var{count} is supplied, steps that many times before
+stopping, unless it encounters a breakpoint or watchpoint.
-Finally, @command{igawk} shows that it is not always necessary to add new
-features to a program; they can often be layered on top.
address@hidden
-With @command{igawk},
-there is no real reason to build @samp{@@include} processing into
address@hidden itself.
address@hidden ignore
address@hidden debugger commands, @code{si} (@code{stepi})
address@hidden debugger commands, @code{stepi}
address@hidden @code{stepi} debugger command
address@hidden @code{si} debugger command (alias for @code{stepi})
address@hidden @code{stepi} address@hidden
address@hidden @code{si} address@hidden
+Execute one (or @var{count}) instruction(s), stepping inside function calls.
+(For illustration of what is meant by an ``instruction'' in @command{gawk},
+see the output shown under @code{dump} in @ref{Miscellaneous Debugger Commands}.)
address@hidden search paths
address@hidden search paths, for source files
address@hidden source address@hidden search path for
address@hidden files, address@hidden search path for
address@hidden directories, searching
-As an additional example of this, consider the idea of having two
-files in a directory in the search path:
address@hidden debugger commands, @code{u} (@code{until})
address@hidden debugger commands, @code{until}
address@hidden @code{until} debugger command
address@hidden @code{u} debugger command (alias for @code{until})
address@hidden @code{until} address@hidden@code{:address@hidden |
@var{function}]
address@hidden @code{u} address@hidden@code{:address@hidden | @var{function}]
+Without any argument, continue execution until a line past the current
+line in the current stack frame is reached. With an argument,
+continue execution until the specified location is reached, or the current
+stack frame returns.
address@hidden table
address@hidden @file
address@hidden default.awk
-This file contains a set of default library functions, such
-as @code{getopt()} and @code{assert()}.
address@hidden Viewing And Changing Data
address@hidden Viewing and Changing Data
address@hidden site.awk
-This file contains library functions that are specific to a site or
-installation; i.e., locally developed functions.
-Having a separate file allows @file{default.awk} to change with
-new @command{gawk} releases, without requiring the system administrator to
-update it each time by adding the local functions.
address@hidden table
+The commands for viewing and changing variables inside of @command{gawk} are:
-One user
address@hidden Karl Berry, address@hidden, 10/95
-suggested that @command{gawk} be modified to automatically read these files
-upon startup. Instead, it would be very simple to modify @command{igawk}
-to do this. Since @command{igawk} can process nested @samp{@@include}
-directives, @file{default.awk} could simply contain @samp{@@include}
-statements for the desired library functions.
address@hidden @asis
address@hidden debugger commands, @code{display}
address@hidden @code{display} debugger command
address@hidden @code{display} address@hidden | @address@hidden
+Add variable @var{var} (or field @code{$}@var{n}) to the display list.
+The value of the variable or field is displayed each time the program stops.
+Each variable added to the list is identified by a unique number:
address@hidden Exercise: make this change
address@hidden ENDOFRANGE libfex
address@hidden ENDOFRANGE flibex
address@hidden ENDOFRANGE awkpex
address@hidden
+gawk> @kbd{display x}
address@hidden 10: x = 1
address@hidden example
address@hidden Anagram Program
address@hidden Finding Anagrams From A Dictionary
address@hidden
+displays the assigned item number, the variable name and its current value.
+If the display variable refers to a function parameter, it is silently
+deleted from the list as soon as the execution reaches a context where
+no such variable of the given name exists.
+Without an argument, @code{display} displays the current values of
+items on the list.
-An interesting programming challenge is to
-search for @dfn{anagrams} in a
-word list (such as
address@hidden/usr/share/dict/words} on many GNU/Linux systems).
-One word is an anagram of another if both words contain
-the same letters
-(for example, ``babbling'' and ``blabbing'').
address@hidden debugger commands, @code{eval}
address@hidden @code{eval} debugger command
address@hidden @code{eval "@var{awk statements}"}
+Evaluate @var{awk statements} in the context of the running program.
+You can do anything that an @command{awk} program would do: assign
+values to variables, call functions, and so on.
-An elegant algorithm is presented in Column 2, Problem C of
-Jon Bentley's @cite{Programming Pearls}, second edition.
-The idea is to give words that are anagrams a common signature,
-sort all the words together by their signature, and then print them.
-Dr.@: Bentley observes that taking the letters in each word and
-sorting them produces that common signature.
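The signature idea can be tried on its own before reading the full program. The following is a sketch using standard shell utilities rather than the @command{gawk} code from this section; the function name @code{signature} is chosen here for illustration:

```shell
#!/bin/sh
# signature WORD: print the letters of WORD in sorted order.
signature() {
    # fold -w 1 puts each character on its own line, sort orders the
    # lines, and tr removes the newlines to rejoin the letters.
    printf '%s\n' "$1" | fold -w 1 | LC_ALL=C sort | tr -d '\n'
    echo
}

signature babbling
signature blabbing
```

Both calls print @samp{abbbgiln}, which is why the two words end up grouped together.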
address@hidden @code{eval} @var{param}, @dots{}
address@hidden @var{awk statements}
address@hidden @code{end}
+This form of @code{eval} is similar, but it allows you to define
+``local variables'' that exist in the context of the
+@var{awk statements}, instead of using variables or function
+parameters defined by the program.
-The following program uses arrays of arrays to bring together
-words with the same signature and array sorting to print the words
-in sorted order.
address@hidden debugger commands, @code{p} (@code{print})
address@hidden debugger commands, @code{print}
address@hidden @code{print} debugger command
address@hidden @code{p} debugger command (alias for @code{print})
address@hidden @code{print} @address@hidden,} @var{var2} @dots{}]
address@hidden @code{p} @address@hidden,} @var{var2} @dots{}]
+Print the value of a @command{gawk} variable or field.
+Fields must be referenced by constants:
address@hidden @code{anagram.awk} program
@example
address@hidden file eg/prog/anagram.awk
-# anagram.awk --- An implementation of the anagram finding algorithm
-# from Jon Bentley's "Programming Pearls", 2nd edition.
-# Addison Wesley, 2000, ISBN 0-201-65788-0.
-# Column 2, Problem C, section 2.8, pp 18-20.
address@hidden endfile
address@hidden
address@hidden file eg/prog/anagram.awk
-#
-# This program requires gawk 4.0 or newer.
-# Required gawk-specific features:
-# - True multidimensional arrays
-# - split() with "" as separator splits out individual characters
-# - asort() and asorti() functions
-#
-# See http://savannah.gnu.org/projects/gawk.
-#
-# Arnold Robbins
-# arnold@@skeeve.com
-# Public Domain
-# January, 2011
address@hidden endfile
address@hidden ignore
address@hidden file eg/prog/anagram.awk
-
-/'s$/ @{ next @} # Skip possessives
address@hidden endfile
+gawk> @kbd{print $3}
@end example
-The program starts with a header, and then a rule to skip
-possessives in the dictionary file. The next rule builds
-up the data structure. The first dimension of the array
-is indexed by the signature; the second dimension is the word
-itself:
address@hidden
+This prints the third field in the input record (if the specified field does not
+exist, it prints @samp{Null field}). A variable can be an array element, with
+the subscripts being constant values. To print the contents of an array,
+prefix the name of the array with the @samp{@@} symbol:
@example
address@hidden file eg/prog/anagram.awk
address@hidden
- key = word2key($1) # Build signature
- data[key][$1] = $1 # Store word with signature
address@hidden
address@hidden endfile
+gawk> @kbd{print @@a}
@end example
-The @code{word2key()} function creates the signature.
-It splits the word apart into individual letters,
-sorts the letters, and then joins them back together:
address@hidden
+This prints the indices and the corresponding values for all elements in
+the array @code{a}.
address@hidden
address@hidden file eg/prog/anagram.awk
-# word2key --- split word apart into letters, sort, joining back together
address@hidden debugger commands, @code{printf}
address@hidden @code{printf} debugger command
address@hidden @code{printf} @var{format} address@hidden,} @var{arg} @dots{}]
+Print formatted text. The @var{format} may include escape sequences,
+such as @samp{\n}
+(@pxref{Escape Sequences}).
+No newline is printed unless one is specified.
-function word2key(word, a, i, n, result)
address@hidden
- n = split(word, a, "")
- asort(a)
address@hidden debugger commands, @code{set}
address@hidden @code{set} debugger command
address@hidden @code{set} @address@hidden@var{value}
+Assign a constant (number or string) value to an @command{awk} variable
+or field.
+String values must be enclosed between double quotes (@code{"@dots{}"}).
- for (i = 1; i <= n; i++)
- result = result a[i]
+You can also set special @command{awk} variables, such as @code{FS},
+@code{NF}, @code{NR}, etc.
- return result
address@hidden
address@hidden endfile
address@hidden example
address@hidden debugger commands, @code{w} (@code{watch})
address@hidden debugger commands, @code{watch}
address@hidden @code{watch} debugger command
address@hidden @code{w} debugger command (alias for @code{watch})
address@hidden @code{watch} @var{var} | @address@hidden
address@hidden"@var{expression}"}]
address@hidden @code{w} @var{var} | @address@hidden
address@hidden"@var{expression}"}]
+Add variable @var{var} (or field @code{$}@var{n}) to the watch list.
+The debugger then stops whenever
+the value of the variable or field changes. Each watched item is assigned a
+number which can be used to delete it from the watch list using the
+@code{unwatch} command.
-Finally, the @code{END} rule traverses the array
-and prints out the anagram lists. It sends the output
-to the system @command{sort} command, since otherwise
-the anagrams would appear in arbitrary order:
+With a watchpoint, you may also supply a condition. This is an
+@command{awk} expression (enclosed in double quotes) that the debugger
+evaluates whenever the watchpoint is reached. If the condition is true,
+then the debugger stops execution and prompts for a command. Otherwise,
+@command{gawk} continues executing the program.
address@hidden
address@hidden file eg/prog/anagram.awk
-END @{
- sort = "sort"
- for (key in data) @{
- # Sort words with same key
- nwords = asorti(data[key], words)
- if (nwords == 1)
- continue
address@hidden debugger commands, @code{undisplay}
address@hidden @code{undisplay} debugger command
address@hidden @code{undisplay} address@hidden
+Remove item number @var{n} (or all items, if no argument) from the
+automatic display list.
- # And print. Minor glitch: trailing space at end of each line
- for (j = 1; j <= nwords; j++)
- printf("%s ", words[j]) | sort
- print "" | sort
- @}
- close(sort)
address@hidden
address@hidden endfile
address@hidden example
address@hidden debugger commands, @code{unwatch}
address@hidden @code{unwatch} debugger command
address@hidden @code{unwatch} address@hidden
+Remove item number @var{n} (or all items, if no argument) from the
+watch list.
-Here is some partial output when the program is run:
address@hidden table
address@hidden
-$ @kbd{gawk -f anagram.awk /usr/share/dict/words | grep '^b'}
address@hidden
-babbled blabbed
-babbler blabber brabble
-babblers blabbers brabbles
-babbling blabbing
-babbly blabby
-babel bable
-babels beslab
-babery yabber
address@hidden
address@hidden example
address@hidden Execution Stack
address@hidden Dealing with the Stack
address@hidden Signature Program
address@hidden And Now For Something Completely Different
+Whenever you run a program which contains any function calls,
+@command{gawk} maintains a stack of all of the function calls leading up
+to where the program is right now. You can see how you got to where you are,
+and also move around in the stack to see what the state of things was in the
+functions which called the one you are in. The commands for doing this are:
-The following program was written by Davide Brini
address@hidden (@email{dave_br@@gmx.com})
-and is published on @uref{http://backreference.org/2011/02/03/obfuscated-awk/,
-his website}.
-It serves as his signature in the Usenet group @code{comp.lang.awk}.
-He supplies the following copyright terms:
address@hidden @asis
address@hidden debugger commands, @code{bt} (@code{backtrace})
address@hidden debugger commands, @code{backtrace}
address@hidden @code{backtrace} debugger command
address@hidden @code{bt} debugger command (alias for @code{backtrace})
address@hidden @code{backtrace} address@hidden
address@hidden @code{bt} address@hidden
+Print a backtrace of all function calls (stack frames), or innermost @var{count}
+frames if @var{count} > 0. Print the outermost @var{count} frames if
+@var{count} < 0. The backtrace displays the name and arguments to each
+function, the source @value{FN}, and the line number.
address@hidden
-Copyright @copyright{} 2008 Davide Brini
address@hidden debugger commands, @code{down}
address@hidden @code{down} debugger command
address@hidden @code{down} address@hidden
+Move @var{count} (default 1) frames down the stack toward the innermost frame.
+Then select and print the frame.
-Copying and distribution of the code published in this page, with or without
-modification, are permitted in any medium without royalty provided the copyright
-notice and this notice are preserved.
address@hidden quotation
address@hidden debugger commands, @code{f} (@code{frame})
address@hidden debugger commands, @code{frame}
address@hidden @code{frame} debugger command
address@hidden @code{f} debugger command (alias for @code{frame})
address@hidden @code{frame} address@hidden
address@hidden @code{f} address@hidden
+Select and print (frame number, function and argument names, source file,
+and the source line) stack frame @var{n}. Frame 0 is the currently executing,
+or @dfn{innermost}, frame (function call); frame 1 is the frame that called the
+innermost one. The highest numbered frame is the one for the main program.
-Here is the program:
address@hidden debugger commands, @code{up}
address@hidden @code{up} debugger command
address@hidden @code{up} address@hidden
+Move @var{count} (default 1) frames up the stack toward the outermost frame.
+Then select and print the frame.
address@hidden table
address@hidden
-awk 'address@hidden"~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c";
-printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O,
-X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O,
-O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),address@hidden'
address@hidden example
address@hidden Debugger Info
address@hidden Obtaining Information about the Program and the Debugger State
-We leave it to you to determine what the program does.
+Besides looking at the values of variables, there is often a need to get
+other sorts of information about the state of your program and of the
+debugging environment itself. The @command{gawk} debugger has one command which
+provides this information, appropriately called @code{info}. @code{info}
+is used with one of a number of arguments that tell it exactly what
+you want to know:
address@hidden
-To: "Arnold Robbins" <address@hidden>
-Date: Sat, 20 Aug 2011 13:50:46 -0400
-Subject: The GNU Awk User's Guide, Section 13.3.11
-From: "Chris Johansen" <address@hidden>
-Message-ID: <address@hidden>
address@hidden @asis
address@hidden debugger commands, @code{i} (@code{info})
address@hidden debugger commands, @code{info}
address@hidden @code{info} debugger command
address@hidden @code{i} debugger command (alias for @code{info})
address@hidden @code{info} @var{what}
address@hidden @code{i} @var{what}
+The value for @var{what} should be one of the following:
-Arnold, you don't know me, but we have a tenuous connection. My wife is
-Barbara A. Field, FAIA, GIT '65 (B. Arch.).
address@hidden nested table
address@hidden @code
address@hidden args
+Arguments of the selected frame.
-I have had a couple of paper copies of "Effective Awk Programming" for
-years, and now I'm going through a Kindle version of "The GNU Awk User's
-Guide" again. When I got to section 13.3.11, I reformatted and lightly
-commented Davide Brini's signature script to understand its workings.
address@hidden break
+List all currently set breakpoints.
-It occurs to me that this might have pedagogical value as an example
-(although imperfect) of the value of whitespace and comments, and a
-starting point for that discussion. It certainly helped _me_ understand
-what's going on. You are welcome to it, as-is or modified (subject to
-Davide's constraints, of course, which I think I have met).
address@hidden display
+List all items in the automatic display list.
-If I were to include it in a future edition, I would put it at some
-distance from section 13.3.11, say, as a note or an appendix, so as not to
-be a "spoiler" to the puzzle.
address@hidden frame
+Description of the selected stack frame.
-Best regards,
---
-Chris Johansen {johansen at main dot nc dot us}
- . . . collapsing the probability wave function, sending ripples of
-certainty through the space-time continuum.
address@hidden functions
+List all function definitions including source file names and
+line numbers.
address@hidden locals
+Local variables of the selected frame.
-#! /usr/bin/gawk -f
address@hidden source
+The name of the current source file. Each time the program stops, the
+current source file is the file containing the current instruction.
+When the debugger first starts, the current source file is the first file
+included via the @option{-f} option. The
+@samp{list @var{filename}:@var{lineno}} command can
+be used at any time to change the current source.
-# From "13.3.11 And Now For Something Completely Different"
-# http://www.gnu.org/software/gawk/manual/html_node/Signature-Program.html#Signature-Program
address@hidden sources
+List all program sources.
-# Copyright © 2008 Davide Brini
address@hidden variables
+List all global variables.
-# Copying and distribution of the code published in this page, with
-# or without modification, are permitted in any medium without
-# royalty provided the copyright notice and this notice are preserved.
address@hidden watch
+List all items in the watch list.
address@hidden table
address@hidden table
-BEGIN {
- O = "~" ~ "~"; # 1
- o = "==" == "=="; # 1
- o += +o; # 2
- x = O "" O; # 11
+Additional commands give you control over the debugger, the ability to
+save the debugger's state, and the ability to run debugger commands
+from a file. The commands are:
address@hidden @asis
address@hidden debugger commands, @code{o} (@code{option})
address@hidden debugger commands, @code{option}
address@hidden @code{option} debugger command
address@hidden @code{o} debugger command (alias for @code{option})
address@hidden @code{option} address@hidden@address@hidden
address@hidden @code{o} address@hidden@address@hidden
+Without an argument, display the available debugger options
+and their current values. @samp{option @var{name}} shows the current
+value of the named option. @samp{option @var{name}=@var{value}} assigns
+a new value to the named option.
+The available options are:
- while ( X++ <= x + o + o ) c = c "%c";
address@hidden nested table
address@hidden @code
address@hidden history_size
+The maximum number of lines to keep in the history file @file{./.gawk_history}.
+The default is 100.
- # O is 1
- # o is 2
- # x is 11
- # X is 17
- # c is "%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c"
address@hidden listsize
+The number of lines that @code{list} prints. The default is 15.
- printf c,
- ( x - O )*( x - O), # 100 d
- x*( x - o ) - o, # 97 a
- x*( x - O ) + x - O - o, # 118 v
- +x*( x - O ) - x + o, # 101 e
- X*( o*o + O ) + x - O, # 95 _
- X*( X - x ) - o*o, # 98 b
- ( x + X )*o*o + o, # 114 r
- x*( X - x ) - O - O, # 64 @
- x - O + ( O + o + X + x )*( o + O ), # 103 g
- X*X - X*( x - O ) - x + O, # 109 m
- O + X*( o*( o + O ) + O ), # 120 x
- +x + O + X*o, # 46 .
- x*( x - o), # 99 c
- ( o + X + x )*o*o - ( x - O - O ), # 111 o
- O + ( X - x )*( X + O ), # 109 m
- x - O # 10 \n
-}
address@hidden ignore
address@hidden outfile
+Send @command{gawk} output to a file; debugger output still goes
+to standard output. An empty string (@code{""}) resets output to
+standard output.
address@hidden The original text for this chapter was contributed by Efraim Yawitz.
address@hidden FIXME: Add more indexing.
address@hidden prompt
+The debugger prompt. The default is @samp{@w{gawk> }}.
address@hidden Debugger
address@hidden Debugging @command{awk} Programs
address@hidden debugging @command{awk} programs
address@hidden save_history @r{[}on @r{|} address@hidden
+Save command history to file @file{./.gawk_history}.
+The default is @code{on}.
-It would be nice if computer programs worked perfectly the first time they
-were run, but in real life, this rarely happens for programs of
-any complexity. Thus, most programming languages have facilities available
-for ``debugging'' programs, and now @command{awk} is no exception.
address@hidden save_options @r{[}on @r{|} address@hidden
+Save current options to file @file{./.gawkrc} upon exit.
+The default is @code{on}.
+Options are read back in to the next session upon startup.
-The @command{gawk} debugger is purposely modeled after
address@hidden://www.gnu.org/software/gdb/, the GNU Debugger (GDB)}
-command-line debugger. If you are familiar with GDB, learning
-how to use @command{gawk} for debugging your program is easy.
address@hidden trace @r{[}on @r{|} address@hidden
+Turn instruction tracing on or off. The default is @code{off}.
address@hidden table
address@hidden
-* Debugging:: Introduction to @command{gawk} debugger.
-* Sample Debugging Session:: Sample debugging session.
-* List of Debugger Commands:: Main debugger commands.
-* Readline Support:: Readline support.
-* Limitations:: Limitations and future plans.
address@hidden menu
address@hidden @code{save} @var{filename}
+Save the commands from the current session to the given @value{FN},
+so that they can be replayed using the @command{source} command.
address@hidden Debugging
address@hidden Introduction to @command{gawk} Debugger
address@hidden @code{source} @var{filename}
+Run command(s) from a file; an error in any command does not
+terminate execution of subsequent commands. Comments (lines starting
+with @samp{#}) are allowed in a command file.
+Empty lines are ignored; they do @emph{not}
+repeat the last command.
+You can't restart the program by having more than one @code{run}
+command in the file. Also, the list of commands may include additional
+@code{source} commands; however, the @command{gawk} debugger will not source the
+same file more than once in order to avoid infinite recursion.
-This @value{SECTION} introduces debugging in general and begins
-the discussion of debugging in @command{gawk}.
+In addition to, or instead of, the @code{source} command, you can use
+the @option{-D @var{file}} or @address@hidden command-line
+options to execute commands from a file non-interactively
+(@pxref{Options}).
address@hidden table
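As a rough illustration of the command-file rules described above, here is a minimal sketch of a file that could be passed to the source command. The file name uniq.cmds is hypothetical, and the breakpoint target are_equal is borrowed from the sample debugging session on uniq.awk; only commands documented in this chapter are used.

```text
# uniq.cmds (hypothetical name): comment lines start with '#'

# Empty lines such as the one above are ignored.
break are_equal
run
backtrace
print NR
```

At the debugger prompt this would be replayed with "source uniq.cmds". Only a single run command appears, since a command file cannot restart the program.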
address@hidden
-* Debugging Concepts:: Debugging in General.
-* Debugging Terms:: Additional Debugging Concepts.
-* Awk Debugging:: Awk Debugging.
address@hidden menu
address@hidden Miscellaneous Debugger Commands
address@hidden Miscellaneous Commands
address@hidden Debugging Concepts
address@hidden Debugging in General
+There are a few more commands which do not fit into the
+previous categories, as follows:
-(If you have used debuggers in other languages, you may want to skip
-ahead to the next section on the specific features of the @command{awk}
-debugger.)
address@hidden @asis
address@hidden debugger commands, @code{dump}
address@hidden @code{dump} debugger command
address@hidden @code{dump} address@hidden
+Dump bytecode of the program to standard output or to the file
+named in @var{filename}. This prints a representation of the internal
+instructions which @command{gawk} executes to implement the @command{awk}
+commands in a program. This can be very enlightening, as the following
+partial dump of Davide Brini's obfuscated code
+(@pxref{Signature Program}) demonstrates:
-Of course, a debugging program cannot remove bugs for you, since it has
-no way of knowing what you or your users consider a ``bug'' and what is a
-``feature.'' (Sometimes, we humans have a hard time with this ourselves.)
-In that case, what can you expect from such a tool? The answer to that
-depends on the language being debugged, but in general, you can expect at
-least the following:
address@hidden
+gawk> @kbd{dump}
address@hidden # BEGIN
address@hidden
address@hidden [ 2:0x89faef4] Op_rule : [in_rule = BEGIN] [source_file = brini.awk]
address@hidden [ 3:0x89fa428] Op_push_i : "~" [PERM|STRING|STRCUR]
address@hidden [ 3:0x89fa464] Op_push_i : "~" [PERM|STRING|STRCUR]
address@hidden [ 3:0x89fa450] Op_match :
address@hidden [ 3:0x89fa3ec] Op_store_var : O [do_reference = FALSE]
address@hidden [ 4:0x89fa48c] Op_push_i : "==" [PERM|STRING|STRCUR]
address@hidden [ 4:0x89fa4c8] Op_push_i : "==" [PERM|STRING|STRCUR]
address@hidden [ 4:0x89fa4b4] Op_equal :
address@hidden [ 4:0x89fa400] Op_store_var : o [do_reference = FALSE]
address@hidden [ 5:0x89fa4f0] Op_push : o
address@hidden [ 5:0x89fa4dc] Op_plus_i : 0 [PERM|NUMCUR|NUMBER]
address@hidden [ 5:0x89fa414] Op_push_lhs : o [do_reference = TRUE]
address@hidden [ 5:0x89fa4a0] Op_assign_plus :
address@hidden [ :0x89fa478] Op_pop :
address@hidden [ 6:0x89fa540] Op_push : O
address@hidden [ 6:0x89fa554] Op_push_i : "" [PERM|STRING|STRCUR]
address@hidden [ :0x89fa5a4] Op_no_op :
address@hidden [ 6:0x89fa590] Op_push : O
address@hidden [ :0x89fa5b8] Op_concat : [expr_count = 3] [concat_flag = 0]
address@hidden [ 6:0x89fa518] Op_store_var : x [do_reference = FALSE]
address@hidden [ 7:0x89fa504] Op_push_loop : [target_continue = 0x89fa568] [target_break = 0x89fa680]
address@hidden [ 7:0x89fa568] Op_push_lhs : X [do_reference = TRUE]
address@hidden [ 7:0x89fa52c] Op_postincrement :
address@hidden [ 7:0x89fa5e0] Op_push : x
address@hidden [ 7:0x89fa61c] Op_push : o
address@hidden [ 7:0x89fa5f4] Op_plus :
address@hidden [ 7:0x89fa644] Op_push : o
address@hidden [ 7:0x89fa630] Op_plus :
address@hidden [ 7:0x89fa5cc] Op_leq :
address@hidden [ :0x89fa57c] Op_jmp_false : [target_jmp = 0x89fa680]
address@hidden [ 7:0x89fa694] Op_push_i : "%c" [PERM|STRING|STRCUR]
address@hidden [ :0x89fa6d0] Op_no_op :
address@hidden [ 7:0x89fa608] Op_assign_concat : c
address@hidden [ :0x89fa6a8] Op_jmp : [target_jmp = 0x89fa568]
address@hidden [ :0x89fa680] Op_pop_loop :
address@hidden
address@hidden
address@hidden
address@hidden [ 8:0x89fa658] Op_K_printf : [expr_count = 17] [redir_type = ""]
address@hidden [ :0x89fa374] Op_no_op :
address@hidden [ :0x89fa3d8] Op_atexit :
address@hidden [ :0x89fa6bc] Op_stop :
address@hidden [ :0x89fa39c] Op_no_op :
address@hidden [ :0x89fa3b0] Op_after_beginfile :
address@hidden [ :0x89fa388] Op_no_op :
address@hidden [ :0x89fa3c4] Op_after_endfile :
+gawk>
address@hidden smallexample
address@hidden @bullet
address@hidden
-The ability to watch a program execute its instructions one by one,
-giving you, the programmer, the opportunity to think about what is happening
-on a time scale of seconds, minutes, or hours, rather than the nanosecond
-time scale at which the code usually runs.
address@hidden debugger commands, @code{h} (@code{help})
address@hidden debugger commands, @code{help}
address@hidden @code{help} debugger command
address@hidden @code{h} debugger command (alias for @code{help})
address@hidden @code{help}
address@hidden @code{h}
+Print a list of all of the @command{gawk} debugger commands with a short
+summary of their usage. @samp{help @var{command}} prints the information
+about the command @var{command}.
address@hidden
-The opportunity to not only passively observe the operation of your
-program, but to control it and try different paths of execution, without
-having to change your source files.
address@hidden debugger commands, @code{l} (@code{list})
address@hidden debugger commands, @code{list}
address@hidden @code{list} debugger command
address@hidden @code{l} debugger command (alias for @code{list})
address@hidden @code{list} address@hidden | @code{+} | @var{n} | @address@hidden:}n} | @address@hidden | @var{function}]
address@hidden @code{l} address@hidden | @code{+} | @var{n} | @address@hidden:}n} | @address@hidden | @var{function}]
+Print the specified lines (default 15) from the current source file
+or the file named @var{filename}. The possible arguments to @code{list}
+are as follows:
address@hidden
-The chance to see the values of data in the program at any point in
-execution, and also to change that data on the fly, to see how that
-affects what happens afterwards. (This often includes the ability
-to look at internal data structures besides the variables you actually
-defined in your code.)
address@hidden nested table
address@hidden @asis
address@hidden @code{-}
+Print lines before the lines last printed.
address@hidden
-The ability to obtain additional information about your program's state
-or even its internal structure.
address@hidden itemize
address@hidden @code{+}
+Print lines after the lines last printed.
address@hidden without any argument does the same thing.
-All of these tools provide a great amount of help in using your own
-skills and understanding of the goals of your program to find where it
-is going wrong (or, for that matter, to better comprehend a perfectly
-functional program that you or someone else wrote).
address@hidden @var{n}
+Print lines centered around line number @var{n}.
address@hidden Debugging Terms
address@hidden Additional Debugging Concepts
address@hidden @address@hidden
+Print lines from @var{n} to @var{m}.
-Before diving in to the details, we need to introduce several
-important concepts that apply to just about all debuggers.
-The following list defines terms used throughout the rest of
-this @value{CHAPTER}.
address@hidden @address@hidden:}n}
+Print lines centered around line number @var{n} in
+source file @var{filename}. This command may change the current source file.
address@hidden @dfn
address@hidden Stack Frame
-Programs generally call functions during the course of their execution.
-One function can call another, or a function can call itself (recursion).
-You can view the chain of called functions (main program calls A, which
-calls B, which calls C), as a stack of executing functions: the currently
-running function is the topmost one on the stack, and when it finishes
-(returns), the next one down then becomes the active function.
-Such a stack is termed a @dfn{call stack}.
address@hidden @var{function}
+Print lines centered around beginning of the
+function @var{function}. This command may change the current source file.
address@hidden table
-For each function on the call stack, the system maintains a data area
-that contains the function's parameters, local variables, and return value,
-as well as any other ``bookkeeping'' information needed to manage the
-call stack. This data area is termed a @dfn{stack frame}.
address@hidden debugger commands, @code{q} (@code{quit})
address@hidden debugger commands, @code{quit}
address@hidden @code{quit} debugger command
address@hidden @code{q} debugger command (alias for @code{quit})
address@hidden @code{quit}
address@hidden @code{q}
+Exit the debugger. Debugging is great fun, but sometimes we all have
+to tend to other obligations in life, and sometimes we find the bug,
+and are free to go on to the next one! As we saw above, if you are
+running a program, the debugger warns you if you accidentally type
address@hidden or @samp{quit}, to make sure you really want to quit.
address@hidden also follows this model, and gives you
-access to the call stack and to each stack frame. You can see the
-call stack, as well as from where each function on the stack was
-invoked. Commands that print the call stack print information about
-each stack frame (as detailed later on).
address@hidden debugger commands, @code{trace}
address@hidden @code{trace} debugger command
address@hidden @code{trace} @code{on} @r{|} @code{off}
+Turn on or off a continuous printing of instructions which are about to
+be executed, along with printing the @command{awk} line which they
+implement. The default is @code{off}.
address@hidden Breakpoint
-During debugging, you often wish to let the program run until it
-reaches a certain point, and then continue execution from there one
-statement (or instruction) at a time. The way to do this is to set
-a @dfn{breakpoint} within the program. A breakpoint is where the
-execution of the program should break off (stop), so that you can
-take over control of the program's execution. You can add and remove
-as many breakpoints as you like.
+It is to be hoped that most of the ``opcodes'' in these instructions are
+fairly self-explanatory, and using @code{stepi} and @code{nexti} while
address@hidden is on will make them into familiar friends.
address@hidden Watchpoint
-A watchpoint is similar to a breakpoint. The difference is that
-breakpoints are oriented around the code: stop when a certain point in the
-code is reached. A watchpoint, however, specifies that program execution
-should stop when a @emph{data value} is changed. This is useful, since
-sometimes it happens that a variable receives an erroneous value, and it's
-hard to track down where this happens just by looking at the code.
-By using a watchpoint, you can stop whenever a variable is assigned to,
-and usually find the errant code quite quickly.
@end table
address@hidden Awk Debugging
address@hidden Awk Debugging
address@hidden Readline Support
address@hidden Readline Support
-Debugging an @command{awk} program has some specific aspects that are
-not shared with other programming languages.
+If @command{gawk} is compiled with the @code{readline} library, you
+can take advantage of that library's command completion and history expansion
+features. The following types of completion are available:
-First of all, the fact that @command{awk} programs usually take input
-line-by-line from a file or files and operate on those lines using specific
-rules makes it especially useful to organize viewing the execution of
-the program in terms of these rules. As we will see, each @command{awk}
-rule is treated almost like a function call, with its own specific block
-of instructions.
address@hidden @asis
address@hidden Command completion
+Command names.
-In addition, since @command{awk} is by design a very concise language,
-it is easy to lose sight of everything that is going on ``inside''
-each line of @command{awk} code. The debugger provides the opportunity
-to look at the individual primitive instructions carried out
-by the higher-level @command{awk} commands.
address@hidden Source @value{FN} completion
+Source @value{FN}s. Relevant commands are
address@hidden,
address@hidden,
address@hidden,
address@hidden,
+and
address@hidden
address@hidden Sample Debugging Session
address@hidden Sample Debugging Session
address@hidden Argument completion
+Non-numeric arguments to a command.
+Relevant commands are @code{enable} and @code{info}.
-In order to illustrate the use of @command{gawk} as a debugger, let's look at a sample
-debugging session. We will use the @command{awk} implementation of the
-POSIX @command{uniq} command described earlier (@pxref{Uniq Program})
-as our example.
address@hidden Variable name completion
+Global variable names, and function arguments in the current context
+if the program is running. Relevant commands are
address@hidden,
address@hidden,
address@hidden,
+and
address@hidden
address@hidden
-* Debugger Invocation:: How to Start the Debugger.
-* Finding The Bug:: Finding the Bug.
address@hidden menu
address@hidden table
address@hidden Debugger Invocation
address@hidden How to Start the Debugger
address@hidden Limitations
address@hidden Limitations and Future Plans
-Starting the debugger is almost exactly like running @command{awk}, except you have to
-pass an additional option @option{--debug} or the corresponding short option @option{-D}.
-The file(s) containing the program and any supporting code are given on the command
-line as arguments to one or more @option{-f} options. (@command{gawk} is not designed
-to debug command-line programs, only programs contained in files.) In our case,
-we invoke the debugger like this:
+We hope you find the @command{gawk} debugger useful and enjoyable to work with,
+but as with any program, especially in its early releases, it still has
+some limitations. A few which are worth being aware of are:
address@hidden
-$ @kbd{gawk -D -f getopt.awk -f join.awk -f uniq.awk inputfile}
address@hidden example
address@hidden @bullet{}
address@hidden
+At this point, the debugger does not give a detailed explanation of
+what you did wrong when you type in something it doesn't like. Rather, it just
+responds @samp{syntax error}. When you do figure out what your mistake was,
+though, you'll feel like a real guru.
address@hidden
-where both @file{getopt.awk} and @file{uniq.awk} are in @env{$AWKPATH}.
-(Experienced users of GDB or similar debuggers should note that
-this syntax is slightly different from what they are used to.
-With @command{gawk} debugger, the arguments for running the program are given
-in the command line to the debugger rather than as part of the @code{run}
-command at the debugger prompt.)
address@hidden
+If you perused the dump of opcodes in @ref{Miscellaneous Debugger Commands},
+(or if you are already familiar with @command{gawk} internals),
+you will realize that much of the internal manipulation of data
+in @command{gawk}, as in many interpreters, is done on a stack.
address@hidden, @code{Op_pop}, etc., are the ``bread and butter'' of
+most @command{gawk} code. Unfortunately, as of now, the @command{gawk}
+debugger does not allow you to examine the stack's contents.
-Instead of immediately running the program on @file{inputfile}, as
address@hidden would ordinarily do, the debugger merely loads all
-the program source files, compiles them internally, and then gives
-us a prompt:
+That is, the intermediate results of expression evaluation are on the
+stack, but cannot be printed. Rather, only variables which are defined
+in the program can be printed. Of course, a workaround for
+this is to use more explicit variables at the debugging stage and then
+change back to obscure, perhaps more optimal code later.
address@hidden
-gawk>
address@hidden example
address@hidden
+There is no way to look ``inside'' the process of compiling
+regular expressions to see if you got it right. As an @command{awk}
+programmer, you are expected to know what @code{/[^[:alnum:][:blank:]]/}
+means.
address@hidden
-from which we can issue commands to the debugger. At this point, no
-code has been executed.
address@hidden
+The @command{gawk} debugger is designed to be used by running a program (with all its
+parameters) on the command line, as described in @ref{Debugger Invocation}.
+There is no way (as of now) to attach or ``break in'' to a running program.
+This seems reasonable for a language which is used mainly for short,
+quickly executing programs.
address@hidden Finding The Bug
address@hidden Finding the Bug
address@hidden
+The @command{gawk} debugger only accepts source supplied with the @option{-f} option.
address@hidden itemize
-Let's say that we are having a problem using (a faulty version of)
address@hidden in the ``field-skipping'' mode, and it doesn't seem to be
-catching lines which should be identical when skipping the first field,
-such as:
+Look forward to a future release when these and other missing features may
+be added, and of course feel free to try to add them yourself!
address@hidden
-awk is a wonderful program!
-gawk is a wonderful program!
address@hidden example
address@hidden Arbitrary Precision Arithmetic
address@hidden Arithmetic and Arbitrary Precision Arithmetic with @command{gawk}
address@hidden arbitrary precision
address@hidden multiple precision
address@hidden infinite precision
address@hidden floating-point numbers, arbitrary precision
address@hidden MPFR
address@hidden GMP
-This could happen if we were thinking (C-like) of the fields in a record
-as being numbered in a zero-based fashion, so instead of the lines:
address@hidden Knuth, Donald
address@hidden
address@hidden's a credibility gap: We don't know how much of the computer's
answers
+to believe. Novice computer users solve this problem by implicitly trusting
+in the computer as an infallible authority; they tend to believe that all
+digits of a printed answer are significant. Disillusioned computer users have
+just the opposite approach; they are constantly afraid that their answers
+are almost address@hidden
+Donald address@hidden E.@: Knuth.
address@hidden Art of Computer Programming}. Volume 2,
address@hidden Algorithms}, third edition,
+1998, ISBN 0-201-89683-4, p.@: 229.}
address@hidden quotation
address@hidden
-clast = join(alast, fcount+1, n)
-cline = join(aline, fcount+1, m)
address@hidden example
+This @value{CHAPTER} discusses issues that you may encounter
+when performing arithmetic. It begins by discussing some of
+the general attributes of computer arithmetic, along with how
+this can influence what you see when running @command{awk} programs.
+This discussion applies to all versions of @command{awk}.
address@hidden
-we wrote:
+Then the discussion moves on to @dfn{arbitrary precision
+arithmetic}, a feature which is specific to @command{gawk}.
address@hidden
-clast = join(alast, fcount, n)
-cline = join(aline, fcount, m)
address@hidden example
address@hidden
+* General Arithmetic:: An introduction to computer arithmetic.
+* Floating-point Programming:: Effective Floating-point Programming.
+* Gawk and MPFR:: How @command{gawk} provides
+ arbitrary-precision arithmetic.
+* Arbitrary Precision Floats:: Arbitrary Precision Floating-point Arithmetic
+ with @command{gawk}.
+* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic with
+ @command{gawk}.
address@hidden menu
-The first thing we usually want to do when trying to investigate a
-problem like this is to put a breakpoint in the program so that we can
-watch it at work and catch what it is doing wrong. A reasonable spot for
-a breakpoint in @file{uniq.awk} is at the beginning of the function
address@hidden()}, which compares the current line with the previous one. To set
-the breakpoint, use the @code{b} (breakpoint) command:
address@hidden General Arithmetic
address@hidden A General Description of Computer Arithmetic
address@hidden
-gawk> @kbd{b are_equal}
address@hidden Breakpoint 1 set at file `awklib/eg/prog/uniq.awk', line 64
address@hidden example
address@hidden integers
address@hidden floating-point, numbers
address@hidden numbers, floating-point
+Within computers, there are two kinds of numeric values: @dfn{integers}
+and @dfn{floating-point}.
+In school, integer values were referred to as ``whole'' numbers---that is,
+numbers without any fractional part, such as 1, 42, or @minus{}17.
+The advantage to integer numbers is that they represent values exactly.
+The disadvantage is that their range is limited. On most systems,
+this range is @minus{}2,147,483,648 to 2,147,483,647.
+However, many systems now support a range from
address@hidden,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
+
address@hidden unsigned integers
address@hidden integers, unsigned
+Integer values come in two flavors: @dfn{signed} and @dfn{unsigned}.
+Signed values may be negative or positive, with the range of values just
+described.
+Unsigned values are always positive. On most systems,
+the range is from 0 to 4,294,967,295.
+However, many systems now support a range from
+0 to 18,446,744,073,709,551,615.
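The 32-bit limits quoted above are simply powers of two. The following is a small sketch that recomputes them, assuming a POSIX awk is available as "awk"; the shell variable name int_ranges is only for illustration.

```shell
# Print the 32-bit integer limits quoted above, computed from powers of two.
# "%.0f" is used instead of "%d" so the result does not depend on how a
# particular awk implementation converts large values for "%d".
int_ranges=$(awk 'BEGIN {
    printf "%.0f %.0f\n", -(2^31), 2^31 - 1    # signed 32-bit range
    printf "%.0f %.0f\n", 0, 2^32 - 1          # unsigned 32-bit range
}')
echo "$int_ranges"
```

The largest 64-bit limits are left out of the sketch on purpose: awk stores numbers as double precision floating-point values, which cannot hold them exactly.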
+
address@hidden double precision floating-point
address@hidden single precision floating-point
+Floating-point numbers represent what are called ``real'' numbers; i.e.,
+those that do have a fractional part, such as 3.1415927.
+The advantage to floating-point numbers is that they
+can represent a much larger range of values.
+The disadvantage is that there are numbers that they cannot represent
+exactly.
address@hidden uses @dfn{double precision} floating-point numbers, which
+can hold more digits than @dfn{single precision}
+floating-point numbers.
address@hidden Floating-point issues are discussed more fully in
address@hidden @ref{Floating Point Issues}.
-The debugger tells us the file and line number where the breakpoint is.
-Now type @samp{r} or @samp{run} and the program runs until it hits
-the breakpoint for the first time:
+There are several important issues to be aware of, described next.
address@hidden
-gawk> @kbd{r}
address@hidden Starting program:
address@hidden Stopping in Rule ...
address@hidden Breakpoint 1, are_equal(n, m, clast, cline, alast, aline)
- at `awklib/eg/prog/uniq.awk':64
address@hidden 64 if (fcount == 0 && charcount == 0)
-gawk>
address@hidden example
address@hidden
+* Floating Point Issues:: Stuff to know about floating-point numbers.
+* Integer Programming:: Effective integer programming.
address@hidden menu
-Now we can look at what's going on inside our program. First of all,
-let's see how we got to where we are. At the prompt, we type @samp{bt}
-(short for ``backtrace''), and the debugger responds with a
-listing of the current stack frames:
address@hidden Floating Point Issues
address@hidden Floating-Point Number Caveats
address@hidden
-gawk> @kbd{bt}
address@hidden #0 are_equal(n, m, clast, cline, alast, aline)
- at `awklib/eg/prog/uniq.awk':69
address@hidden #1 in main() at `awklib/eg/prog/uniq.awk':89
address@hidden example
+As mentioned earlier, floating-point numbers represent what are called
+``real'' numbers, i.e., those that have a fractional part. @command{awk}
+uses double precision floating-point numbers to represent all
+numeric values. This @value{SECTION} describes some of the issues
+involved in using floating-point numbers.
-This tells us that @code{are_equal()} was called by the main program at
-line 89 of @file{uniq.awk}. (This is not a big surprise, since this
-is the only call to @code{are_equal()} in the program, but in more complex
-programs, knowing who called a function and with what parameters can be
-the key to finding the source of the problem.)
+There is a very nice
address@hidden://www.validlab.com/goldberg/paper.pdf, paper on floating-point arithmetic}
+by David Goldberg,
+``What Every Computer Scientist Should Know About Floating-point Arithmetic,''
address@hidden Computing Surveys} @strong{23}, 1 (1991-03), 5-48.
+This is worth reading if you are interested in the details,
+but it does require a background in computer science.
-Now that we're in @code{are_equal()}, we can start looking at the values
-of some variables. Let's say we type @samp{p n}
-(@code{p} is short for ``print''). We would expect to see the value of
address@hidden, a parameter to @code{are_equal()}. Actually, the debugger
-gives us:
address@hidden
+* String Conversion Precision:: The String Value Can Lie.
+* Unexpected Results:: Floating Point Numbers Are Not Abstract
+ Numbers.
+* POSIX Floating Point Problems:: Standards Versus Existing Practice.
address@hidden menu
address@hidden
-gawk> @kbd{p n}
address@hidden n = untyped variable
address@hidden example
address@hidden String Conversion Precision
address@hidden The String Value Can Lie
address@hidden
-In this case, @code{n} is an uninitialized local variable, since the
-function was called without arguments (@pxref{Function Calls}).
+Internally, @command{awk} keeps both the numeric value
+(double precision floating-point) and the string value for a variable.
+Separately, @command{awk} keeps
+track of what type the variable has
+(@pxref{Typing and Comparison}),
+which plays a role in how variables are used in comparisons.
-A more useful variable to display might be the current record:
+It is important to note that the string value for a number may not
+reflect the full value (all the digits) that the numeric value
+actually contains.
+The following program (@file{values.awk}) illustrates this:
@example
-gawk> @kbd{p $0}
address@hidden $0 = string ("gawk is a wonderful program!")
address@hidden
+ sum = $1 + $2
+ # see it for what it is
+ printf("sum = %.12g\n", sum)
+ # use CONVFMT
+ a = "<" sum ">"
+ print "a =", a
+ # use OFMT
+ print "sum =", sum
address@hidden
@end example
@noindent
-This might be a bit puzzling at first since this is the second line of
-our test input above. Let's look at @code{NR}:
-
address@hidden
-gawk> @kbd{p NR}
address@hidden NR = number (2)
address@hidden example
+This program shows the full value of the sum of @code{$1} and @code{$2}
+using @code{printf}, and then prints the string values obtained
+from both automatic conversion (via @code{CONVFMT}) and
+from printing (via @code{OFMT}).
address@hidden
-So we can see that @code{are_equal()} was only called for the second record
-of the file. Of course, this is because our program contained a rule for
address@hidden == 1}:
+Here is what happens when the program is run:
@example
-NR == 1 @{
- last = $0
- next
address@hidden
+$ @kbd{echo 3.654321 1.2345678 | awk -f values.awk}
address@hidden sum = 4.8888888
address@hidden a = <4.88889>
address@hidden sum = 4.88889
@end example
-OK, let's just check that that rule worked correctly:
-
address@hidden
-gawk> @kbd{p last}
address@hidden last = string ("awk is a wonderful program!")
address@hidden example
+This makes it clear that the full numeric value is different from
+what the default string representations show.
-Everything we have done so far has verified that the program has worked as
-planned, up to and including the call to @code{are_equal()}, so the problem must
-be inside this function. To investigate further, we must begin
-``stepping through'' the lines of @code{are_equal()}. We start by typing
address@hidden (for ``next''):
address@hidden's default value is @code{"%.6g"}, which yields a value with
+at most six significant digits. For some applications, you might want to
+change it to specify more precision.
+On most modern machines, most of the time,
+17 digits is enough to capture a floating-point number's
+value address@hidden cases can require up to
+752 digits (!), but we doubt that you need to worry about this.}
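To see the effect of a larger CONVFMT, here is a minimal sketch that converts the same sum to a string under two formats. It assumes a POSIX awk available as "awk"; the helper name to_string and the shell variables are illustrative and not part of the manual.

```shell
# to_string FORMAT: add the two numbers read from stdin and convert the
# sum to a string by concatenation, with CONVFMT set to FORMAT.
to_string() {
    awk -v CONVFMT="$1" '{ sum = $1 + $2; s = "" sum; print s }'
}

# Default precision ("%.6g") versus 17 significant digits.
default_str=$(echo 3.654321 1.2345678 | to_string '%.6g')
precise_str=$(echo 3.654321 1.2345678 | to_string '%.17g')

echo "CONVFMT=%.6g  -> $default_str"
echo "CONVFMT=%.17g -> $precise_str"
```

With the default format the string is 4.88889, as in the run shown earlier. The 17-digit form exposes the digits that the default conversion drops; its exact trailing digits depend on the platform's floating-point arithmetic.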
address@hidden
-gawk> @kbd{n}
address@hidden 67 if (fcount > 0) @{
address@hidden example
address@hidden Unexpected Results
address@hidden Floating Point Numbers Are Not Abstract Numbers
-This tells us that @command{gawk} is now ready to execute line 67, which
-decides whether to give the lines the special ``field skipping'' treatment
-indicated by the @option{-f} command-line option. (Notice that we skipped
-from where we were before at line 64 to here, since the condition in line 64
address@hidden floating-point, numbers
+Unlike numbers in the abstract sense (such as what you studied in high school
+or college arithmetic), numbers stored in computers are limited in certain ways.
+They cannot represent an infinite number of digits, nor can they always
+represent things exactly.
+In particular,
+floating-point numbers cannot
+always represent values exactly. Here is an example:
@example
-if (fcount == 0 && charcount == 0)
+$ @kbd{awk '@{ printf("%010d\n", $1 * 100) @}'}
+515.79
address@hidden 0000051579
+515.80
address@hidden 0000051579
+515.81
address@hidden 0000051580
+515.82
address@hidden 0000051582
address@hidden@value{CTL}-d}
@end example
@noindent
-was false.)
+This shows that some values can be represented exactly,
+whereas others are only approximated. This is not a ``bug''
+in @command{awk}, but simply an artifact of how computers
+represent numbers.
-Continuing to step, we now get to the splitting of the current and
-last records:
address@hidden NOTE
+It cannot be emphasized enough that the behavior just
+described is fundamental to modern computers. You will
+see this kind of thing happen in @emph{any} programming
+language using hardware floating-point numbers. It is @emph{not}
+a bug in @command{gawk}, nor is it something that can be ``just
+fixed.''
address@hidden quotation
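The surprising results above come from truncation: the product is stored as a value slightly below the whole number, and "%d" discards the fractional part. As a sketch only (rounding with "%.0f" is one possible workaround and is not something this section prescribes), the two conversions can be compared side by side, assuming a POSIX awk on a system with IEEE 754 double precision:

```shell
# For each amount, print the scaled value truncated ("%d") and rounded
# to the nearest whole number ("%.0f").
scaled=$(printf '515.80\n515.81\n' |
    awk '{ printf("%d %.0f\n", $1 * 100, $1 * 100) }')
echo "$scaled"
```

On such a system this should print "51579 51580" and then "51580 51581": the first column reproduces the truncated values from the example above, and the second shows the values one would expect on paper.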
address@hidden
-gawk> @kbd{n}
address@hidden 68 n = split(last, alast)
-gawk> @kbd{n}
address@hidden 69 m = split($0, aline)
address@hidden example
address@hidden negative zero
address@hidden positive zero
address@hidden address@hidden negative vs.@: positive
+Another peculiarity of floating-point numbers on modern systems
+is that they often have more than one representation for the number zero!
+In particular, it is possible to represent ``minus zero'' as well as
+regular, or ``positive'' zero.
-At this point, we should be curious to see what our records were split
-into, so we try to look:
+This example shows that negative and positive zero are distinct values
+when stored internally, but that they are in fact equal to each other,
+as well as to ``regular'' zero:
@example
-gawk> @kbd{p n m alast aline}
address@hidden n = number (5)
address@hidden m = number (5)
address@hidden alast = array, 5 elements
address@hidden aline = array, 5 elements
+$ @kbd{gawk 'BEGIN @{ mz = -0 ; pz = 0}
+> @kbd{printf "-0 = %g, +0 = %g, (-0 == +0) -> %d\n", mz, pz, mz == pz}
+> @kbd{printf "mz == 0 -> %d, pz == 0 -> %d\n", mz == 0, pz == 0}
+> @address@hidden'}
address@hidden -0 = -0, +0 = 0, (-0 == +0) -> 1
address@hidden mz == 0 -> 1, pz == 0 -> 1
@end example
address@hidden
-(The @code{p} command can take more than one argument, similar to
address@hidden's @code{print} statement.)
-
-This is kind of disappointing, though. All we found out is that there
-are five elements in each of our arrays. Useful enough (we now know that
-none of the words were accidentally left out), but what if we want to see
-inside the array?
+It helps to keep this in mind should you process numeric data
+that contains negative zero values; the fact that the zero is negative
+is noted and can affect comparisons.
-The first choice would be to use subscripts:
address@hidden POSIX Floating Point Problems
address@hidden Standards Versus Existing Practice
address@hidden
-gawk> @kbd{p alast[0]}
address@hidden "0" not in array `alast'
address@hidden example
+Historically, @command{awk} has converted any non-numeric-looking string
+to the numeric value zero, when required. Furthermore, the original
+definition of the language and the original POSIX standards specified that
address@hidden only understands decimal numbers (base 10), and not octal
+(base 8) or hexadecimal numbers (base 16).
address@hidden
-Oops!
+Changes in the language of the
+2001 and 2004 POSIX standards can be interpreted to imply that @command{awk}
+should support additional features. These features are:
address@hidden
-gawk> @kbd{p alast[1]}
address@hidden alast["1"] = string ("awk")
address@hidden example
address@hidden @bullet
address@hidden
+Interpretation of floating point data values specified in hexadecimal
+notation (@samp{0xDEADBEEF}). (Note: data values, @emph{not}
+source code constants.)
-This would be kind of slow for a 100-member array, though, so
address@hidden provides a shortcut (reminiscent of another language
-not to be mentioned):
address@hidden
+Support for the special IEEE 754 floating point values ``Not A Number''
+(NaN), positive Infinity (``inf'') and negative Infinity (address@hidden'').
+In particular, the format for these values is as specified by the ISO 1999
+C standard, which ignores case and can allow machine-dependent additional
+characters after the @samp{nan} and allow either @samp{inf} or @samp{infinity}.
address@hidden itemize
address@hidden
-gawk> @kbd{p @@alast}
address@hidden alast["1"] = string ("awk")
address@hidden alast["2"] = string ("is")
address@hidden alast["3"] = string ("a")
address@hidden alast["4"] = string ("wonderful")
address@hidden alast["5"] = string ("program!")
address@hidden example
+The first problem is that both of these are clear changes to historical
+practice:
-It looks like we got this far OK. Let's take another step
-or two:
address@hidden @bullet
address@hidden
+The @command{gawk} maintainer feels that supporting hexadecimal floating
+point values, in particular, is ugly, and was never intended by the
+original designers to be part of the language.
address@hidden
-gawk> @kbd{n}
address@hidden 70 clast = join(alast, fcount, n)
-gawk> @kbd{n}
address@hidden 71 cline = join(aline, fcount, m)
address@hidden example
address@hidden
+Allowing completely alphabetic strings to have valid numeric
+values is also a very severe departure from historical practice.
address@hidden itemize
-Well, here we are at our error (sorry to spoil the suspense). What we
-had in mind was to join the fields starting from the second one to make
-the virtual record to compare, and if the first field was numbered zero,
-this would work. Let's look at what we've got:
+The second problem is that the @command{gawk} maintainer feels that this
+interpretation of the standard, which requires a certain amount of
+``language lawyering'' to arrive at in the first place, was not even
+intended by the standard developers. In other words, ``we see how you
+got where you are, but we don't think that that's where you want to be.''
address@hidden
-gawk> @kbd{p cline clast}
address@hidden cline = string ("gawk is a wonderful program!")
address@hidden clast = string ("awk is a wonderful program!")
address@hidden example
+Recognizing the above issues, but attempting to provide compatibility
+with the earlier versions of the standard,
+the 2008 POSIX standard added explicit wording to allow, but not require,
+that @command{awk} support hexadecimal floating point values and
+special values for ``Not A Number'' and infinity.
-Hey, those look pretty familiar! They're just our original, unaltered,
-input records. A little thinking (the human brain is still the best
-debugging tool), and we realize that we were off by one!
+Although the @command{gawk} maintainer continues to feel that
+providing those features is inadvisable,
+nevertheless, on systems that support IEEE floating point, it seems
+reasonable to provide @emph{some} way to support NaN and Infinity values.
+The solution implemented in @command{gawk} is as follows:
-We get out of the debugger:
address@hidden @bullet
address@hidden
+With the @option{--posix} command-line option, @command{gawk} becomes
+``hands off.'' String values are passed directly to the system library's
address@hidden()} function, and if it successfully returns a numeric value,
+that is what's address@hidden asked for it, you got it.}
+By definition, the results are not portable across
+different systems. They are also a little surprising:
@example
-gawk> @kbd{q}
address@hidden The program is running. Exit anyway (y/n)? @kbd{y}
+$ @kbd{echo nanny | gawk --posix '@{ print $1 + 0 @}'}
address@hidden nan
+$ @kbd{echo 0xDeadBeef | gawk --posix '@{ print $1 + 0 @}'}
address@hidden 3735928559
@end example
address@hidden
-Then we get into an editor:
address@hidden
+Without @option{--posix}, @command{gawk} interprets the four strings
address@hidden,
address@hidden,
address@hidden,
+and
address@hidden
+specially, producing the corresponding special numeric values.
+The leading sign acts as a signal to @command{gawk} (and the user)
+that the value is really numeric. Hexadecimal floating point is
+not supported (unless you also use @option{--non-decimal-data},
+which is @emph{not} recommended). For example:
@example
-clast = join(alast, fcount+1, n)
-cline = join(aline, fcount+1, m)
+$ @kbd{echo nanny | gawk '@{ print $1 + 0 @}'}
address@hidden 0
+$ @kbd{echo +nan | gawk '@{ print $1 + 0 @}'}
address@hidden nan
+$ @kbd{echo 0xDeadBeef | gawk '@{ print $1 + 0 @}'}
address@hidden 0
@end example
address@hidden
-and problem solved!
-
address@hidden List of Debugger Commands
address@hidden Main Debugger Commands
address@hidden does ignore case in the four special values.
+Thus @samp{+nan} and @samp{+NaN} are the same.
address@hidden itemize
-The @command{gawk} debugger command set can be divided into the
-following categories:
address@hidden Integer Programming
address@hidden Mixing Integers And Floating-point
address@hidden @bullet{}
+As has been mentioned already, @command{gawk} ordinarily uses hardware double
+precision with 64-bit IEEE binary floating-point representation
+for numbers on most systems. A large integer like 9007199254740997
+has a binary representation that, although finite, is more than 53 bits long;
+it must also be rounded to 53 bits.
+The biggest integer that can be stored in a C @code{double} is usually the same
+as the largest possible value of a @code{double}. If your system @code{double}
+is an IEEE 64-bit @code{double}, this largest possible value is an integer and
+can be represented precisely. What more should one know about integers?
address@hidden
-Breakpoint control
+If you want to know what is the largest integer, such that it and
+all smaller integers can be stored in 64-bit doubles without losing precision,
+then the answer is
address@hidden
address@hidden
address@hidden iftex
address@hidden
+2^53.
address@hidden ifnottex
+The next representable number is the even number
address@hidden
address@hidden + 2},
address@hidden iftex
address@hidden
+2^53 + 2,
address@hidden ifnottex
+meaning it is unlikely that you will be able to make
address@hidden print
address@hidden
address@hidden + 1}
address@hidden iftex
address@hidden
+2^53 + 1
address@hidden ifnottex
+in integer format.
+The range of integers exactly representable by a 64-bit double
+is
address@hidden
address@hidden, 2^{53}]}.
address@hidden iftex
address@hidden
address@hidden, 2^53].
address@hidden ifnottex
+If you ever see an integer outside this range in @command{gawk}
+using 64-bit doubles, you have reason to be very suspicious about
+the accuracy of the output. Here is a simple program with erroneous output:
address@hidden
-Execution control
address@hidden
+$ @kbd{gawk 'BEGIN @{ i = 2^53 - 1; for (j = 0; j < 4; j++) print i + j @}'}
address@hidden 9007199254740991
address@hidden 9007199254740992
address@hidden 9007199254740992
address@hidden 9007199254740994
address@hidden example
address@hidden
-Viewing and changing data
+The lesson is to not assume that any large integer printed by @command{gawk}
+represents an exact result from your computation, especially if it wraps
+around on your screen.
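As a rough illustration of the 2^53 limit described above, the following sketch can be run with any awk that stores numbers as IEEE 64-bit doubles (an assumption, though it holds for gawk on most systems). The variable names are made up for the example, and printf with %.0f is used so the output does not depend on how a particular implementation prints large integers:

```shell
# Sketch only: assumes awk numbers are IEEE 64-bit doubles.
# The names limit, next_odd, and out are illustrative.
out=$(awk 'BEGIN {
    limit = 2^53             # 9007199254740992
    next_odd = limit + 1     # 2^53 + 1 is not representable; it rounds to a neighbor
    printf "%.0f %.0f %d\n", limit, next_odd, (next_odd == limit)
}')
echo "$out"
```

A final field of 1 means that adding one did not change the stored value, which is exactly the situation the text warns about.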
address@hidden
-Working with the stack
address@hidden Floating-point Programming
address@hidden Understanding Floating-point Programming
address@hidden
-Getting information
+Numerical programming is an extensive area; if you need to develop
+sophisticated numerical algorithms then @command{gawk} may not be
+the ideal tool, and this documentation may not be sufficient.
address@hidden FIXME: JOHN: Do you want to cite some actual books?
+It might require digesting a book or two to really internalize how to compute
+with ideal accuracy and precision
+and the result often depends on the particular application.
address@hidden
-Miscellaneous
address@hidden itemize
address@hidden NOTE
+A floating-point calculation's @dfn{accuracy} is how close it comes
+to the real value. This is as opposed to the @dfn{precision}, which
+usually refers to the number of bits used to represent the number
+(see @uref{http://en.wikipedia.org/wiki/Accuracy_and_precision,
+the Wikipedia article} for more information).
address@hidden quotation
-Each of these are discussed in the following subsections.
-In the following descriptions, commands which may be abbreviated
-show the abbreviation on a second description line.
-A debugger command name may also be truncated if that partial
-name is unambiguous. The debugger has the built-in capability to
-automatically repeat the previous command when just hitting @key{Enter}.
-This works for the commands @code{list}, @code{next}, @code{nexti}, @code{step}, @code{stepi}
-and @code{continue} executed without any argument.
+There are two options for doing floating-point calculations:
+hardware floating-point (as used by standard @command{awk} and
+the default for @command{gawk}), and @dfn{arbitrary-precision}
+floating-point, which is software based. This @value{CHAPTER}
+aims to provide enough information to understand both, and then
+will focus on @command{gawk}'s facilities for the address@hidden you
+are interested in other tools that perform arbitrary precision arithmetic,
+you may want to investigate the POSIX @command{bc} tool. See
address@hidden://pubs.opengroup.org/onlinepubs/009695399/utilities/bc.html,
+the POSIX specification for it}, for more information.}
address@hidden
-* Breakpoint Control:: Control of Breakpoints.
-* Debugger Execution Control:: Control of Execution.
-* Viewing And Changing Data:: Viewing and Changing Data.
-* Execution Stack:: Dealing with the Stack.
-* Debugger Info:: Obtaining Information about the Program and
- the Debugger State.
-* Miscellaneous Debugger Commands:: Miscellaneous Commands.
address@hidden menu
+Binary floating-point representations and arithmetic are inexact.
+Simple values like 0.1 cannot be precisely represented using
+binary floating-point numbers, and the limited precision of
+floating-point numbers means that slight changes in
+the order of operations or the precision of intermediate storage
+can change the result. To make matters worse, with arbitrary precision
+floating-point, you can set the precision before starting a computation,
+but then you cannot be sure of the number of significant decimal places
+in the final result.
address@hidden Breakpoint Control
address@hidden Control of Breakpoints
+Sometimes, before you start to write any code, you should think more
+about what you really want and what's really happening. Consider the
+two numbers in the following example:
-As we saw above, the first thing you probably want to do in a debugging
-session is to get your breakpoints set up, since otherwise your program
-will just run as if it was not under the debugger. The commands for
-controlling breakpoints are:
address@hidden
+x = 0.875 # 1/2 + 1/4 + 1/8
+y = 0.425
address@hidden example
address@hidden @asis
address@hidden debugger commands, @code{b} (@code{break})
address@hidden debugger commands, @code{break}
address@hidden @code{break} debugger command
address@hidden @code{b} debugger command (alias for @code{break})
address@hidden @code{break} address@hidden@code{:address@hidden |
@var{function}] address@hidden"@var{expression}"}]
address@hidden @code{b} address@hidden@code{:address@hidden | @var{function}]
address@hidden"@var{expression}"}]
-Without any argument, set a breakpoint at the next instruction
-to be executed in the selected stack frame.
-Arguments can be one of the following:
+Unlike the number in @code{y}, the number stored in @code{x}
+is exactly representable
+in binary since it can be written as a finite sum of one or
+more fractions whose denominators are all powers of two.
+When @command{gawk} reads a floating-point number from
+program source, it automatically rounds that number to whatever
+precision your machine supports. If you try to print the numeric
+content of a variable using an output format string of @code{"%.17g"},
+it may not produce the same number as you assigned to it:
address@hidden nested table
address@hidden @var
address@hidden n
-Set a breakpoint at line number @var{n} in the current source file.
address@hidden
+$ @kbd{gawk 'BEGIN @{ x = 0.875; y = 0.425}
+> @kbd{ printf("%0.17g, %0.17g\n", x, y) @}'}
address@hidden 0.875, 0.42499999999999999
address@hidden example
address@hidden address@hidden:}n
-Set a breakpoint at line number @var{n} in source file @var{filename}.
+Often the error is so small you do not even notice it, and if you do,
+you can always specify how much precision you would like in your output.
+Usually this is a format string like @code{"%.15g"}, which when
+used in the previous example, produces an output identical to the input.
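The difference between the two format strings can be seen side by side in a small sketch. It assumes IEEE 64-bit doubles and a C library printf that converts correctly, which is true of the common ones but not guaranteed everywhere; the shell variable name is illustrative:

```shell
# Sketch only: assumes IEEE 64-bit doubles and a correctly rounding printf.
# Print the same stored value with 17 and then 15 significant digits.
shown=$(awk 'BEGIN { y = 0.425; printf "%.17g %.15g\n", y, y }')
echo "$shown"
```

With 17 digits the stored approximation becomes visible; with 15 digits the output matches the constant as it was typed.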
address@hidden function
-Set a breakpoint at entry to (the first instruction of)
-function @var{function}.
address@hidden table
+Because the underlying representation can be a little bit off from the exact value,
+comparing floating-point values to see if they are equal is generally not a good idea.
+Here is an example where it does not work as you might expect:
-Each breakpoint is assigned a number which can be used to delete it from
-the breakpoint list using the @code{delete} command.
address@hidden
+$ @kbd{gawk 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'}
address@hidden 0
address@hidden example
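One common workaround, sketched here rather than taken from the manual, is to compare with a tolerance instead of testing for exact equality. The function names abs and approx_equal and the tolerance 1e-12 are assumptions chosen for illustration; a suitable tolerance depends on the application:

```shell
# Sketch only: approx_equal and its tolerance are hypothetical, not part of gawk.
close=$(awk '
function abs(v) { return v < 0 ? -v : v }
# True when a and b differ by at most eps relative to the larger magnitude.
function approx_equal(a, b, eps) {
    return abs(a - b) <= eps * (abs(a) > abs(b) ? abs(a) : abs(b))
}
BEGIN {
    printf "%d %d\n", (0.1 + 12.2 == 12.3), approx_equal(0.1 + 12.2, 12.3, 1e-12)
}')
echo "$close"
```

The first field is the exact comparison from the example above; the second is the tolerant comparison.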
-With a breakpoint, you may also supply a condition. This is an
address@hidden expression (enclosed in double quotes) that the debugger
-evaluates whenever the breakpoint is reached. If the condition is true,
-then the debugger stops execution and prompts for a command. Otherwise,
-it continues executing the program.
+The loss of accuracy during a single computation with floating-point numbers
+usually isn't enough to worry about. However, if you compute a value
+which is the result of a sequence of floating point operations,
+the error can accumulate and greatly affect the computation itself.
+Here is an attempt to compute the value of the constant
address@hidden using one of its many series representations:
address@hidden debugger commands, @code{clear}
address@hidden @code{clear} debugger command
address@hidden @code{clear} address@hidden@code{:address@hidden |
@var{function}]
-Without any argument, delete any breakpoint at the next instruction
-to be executed in the selected stack frame. If the program stops at
-a breakpoint, this deletes that breakpoint so that the program
-does not stop at that location again. Arguments can be one of the following:
address@hidden
+BEGIN @{
+ x = 1.0 / sqrt(3.0)
+ n = 6
+ for (i = 1; i < 30; i++) @{
+ n = n * 2.0
+ x = (sqrt(x * x + 1) - 1) / x
+ printf("%.15f\n", n * x)
+ @}
address@hidden
address@hidden example
address@hidden nested table
address@hidden @var
address@hidden n
-Delete breakpoint(s) set at line number @var{n} in the current source file.
+When run, the early errors propagating through later computations
+cause the loop to terminate prematurely after an attempt to divide by zero.
address@hidden address@hidden:}n
-Delete breakpoint(s) set at line number @var{n} in source file @var{filename}.
address@hidden
+$ @kbd{gawk -f pi.awk}
address@hidden 3.215390309173475
address@hidden 3.159659942097510
address@hidden 3.146086215131467
address@hidden 3.142714599645573
address@hidden
address@hidden 3.224515243534819
address@hidden 2.791117213058638
address@hidden 0.000000000000000
address@hidden gawk: pi.awk:6: fatal: division by zero attempted
address@hidden example
address@hidden function
-Delete breakpoint(s) set at entry to function @var{function}.
address@hidden table
+Here is one more example where the inaccuracies in internal representations
+yield an unexpected result:
address@hidden debugger commands, @code{condition}
address@hidden @code{condition} debugger command
address@hidden @code{condition} @var{n} @code{"@var{expression}"}
-Add a condition to existing breakpoint or watchpoint @var{n}. The
-condition is an @command{awk} expression that the debugger evaluates
-whenever the breakpoint or watchpoint is reached. If the condition is true, then
-the debugger stops execution and prompts for a command. Otherwise,
-the debugger continues executing the program. If the condition expression is
-not specified, any existing condition is removed; i.e., the breakpoint or
-watchpoint is made unconditional.
address@hidden
+$ @kbd{gawk 'BEGIN @{}
+> @kbd{for (d = 1.1; d <= 1.5; d += 0.1)}
+> @kbd{i++}
+> @kbd{print i}
+> @address@hidden'}
address@hidden 4
address@hidden example
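A possible way to avoid the surprise in the loop above, given as a sketch under the assumption of IEEE 64-bit doubles, is to count with an integer and derive the fractional value from it, so that rounding error does not accumulate in the loop variable. The variable names are illustrative:

```shell
# Sketch only: contrasts a floating-point loop variable with an integer counter.
counts=$(awk 'BEGIN {
    drift = 0
    for (d = 1.1; d <= 1.5; d += 0.1)    # rounding error accumulates in d
        drift++
    exact = 0
    for (k = 11; k <= 15; k++) {         # integer counter, no accumulated error
        d = k / 10                       # derive the fractional value when needed
        exact++
    }
    printf "%d %d\n", drift, exact
}')
echo "$counts"
```

The first count reproduces the result shown above; the second is the five iterations one would expect on paper.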
address@hidden debugger commands, @code{d} (@code{delete})
address@hidden debugger commands, @code{delete}
address@hidden @code{delete} debugger command
address@hidden @code{d} debugger command (alias for @code{delete})
address@hidden @code{delete} address@hidden n2} @dots{}] address@hidden@var{m}]
address@hidden @code{d} address@hidden n2} @dots{}] address@hidden@var{m}]
-Delete specified breakpoints or a range of breakpoints. Deletes
-all defined breakpoints if no argument is supplied.
+Can computation using arbitrary precision help with the previous examples?
+If you are impatient to know, see
address@hidden Arithmetic}.
address@hidden debugger commands, @code{disable}
address@hidden @code{disable} debugger command
address@hidden @code{disable} address@hidden n2} @dots{} | @address@hidden
-Disable specified breakpoints or a range of breakpoints. Without
-any argument, disables all breakpoints.
+Instead of arbitrary precision floating-point arithmetic,
+often all you need is an adjustment of your logic
+or a different order for the operations in your calculation.
+The stability and the accuracy of the computation of the constant @value{PI}
+in the previous example can be enhanced by using the following
+simple algebraic transformation:
address@hidden debugger commands, @code{e} (@code{enable})
address@hidden debugger commands, @code{enable}
address@hidden @code{enable} debugger command
address@hidden @code{e} debugger command (alias for @code{enable})
address@hidden @code{enable} address@hidden | @code{once}] address@hidden n2}
@dots{}] address@hidden@var{m}]
address@hidden @code{e} address@hidden | @code{once}] address@hidden n2}
@dots{}] address@hidden@var{m}]
-Enable specified breakpoints or a range of breakpoints. Without
-any argument, enables all breakpoints.
-Optionally, you can specify how to enable the breakpoint:
address@hidden
+(sqrt(x * x + 1) - 1) / x = x / (sqrt(x * x + 1) + 1)
address@hidden example
address@hidden nested table
address@hidden @code
address@hidden del
-Enable the breakpoint(s) temporarily, then delete it when
-the program stops at the breakpoint.
address@hidden
+After making this change, the program does converge to
address@hidden in under 30 iterations:
address@hidden once
-Enable the breakpoint(s) temporarily, then disable it when
-the program stops at the breakpoint.
address@hidden table
address@hidden
+$ @kbd{gawk -f /tmp/pi2.awk}
address@hidden 3.215390309173473
address@hidden 3.159659942097501
address@hidden 3.146086215131436
address@hidden 3.142714599645370
address@hidden 3.141873049979825
address@hidden
address@hidden 3.141592653589797
address@hidden 3.141592653589797
address@hidden example
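The revised program itself is not shown in the text. A guess at what it might look like, applying the transformation to the earlier loop, is sketched below. It prints only the final estimate plus a flag telling whether that estimate lies within 1e-9 of a reference constant; the reference constant, the tolerance, and the variable names are assumptions added for the sketch:

```shell
# Sketch only: a reconstruction, not the manual's own file.
pi_check=$(awk 'BEGIN {
    x = 1.0 / sqrt(3.0)
    n = 6
    for (i = 1; i < 30; i++) {
        n = n * 2.0
        x = x / (sqrt(x * x + 1) + 1)    # rewritten form avoids subtracting nearly equal values
    }
    err = n * x - 3.141592653589793      # reference value, an assumption of this sketch
    if (err < 0) err = -err
    printf "%.12f %d\n", n * x, (err < 1e-9)
}')
echo "$pi_check"
```

The last field should be 1, since no iteration divides by a value that has collapsed to zero.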
address@hidden debugger commands, @code{ignore}
address@hidden @code{ignore} debugger command
address@hidden @code{ignore} @var{n} @var{count}
-Ignore breakpoint number @var{n} the next @var{count} times it is
-hit.
+There is no need to be unduly suspicious about the results from
+floating-point arithmetic. The lesson to remember is that
+floating-point arithmetic is always more complex than the arithmetic using
+pencil and paper. In order to take advantage of the power
+of computer floating-point, you need to know its limitations
+and work within them. For most casual use of floating-point arithmetic,
+you will often get the expected result in the end if you simply round
+the display of your final results to the correct number of significant
+decimal digits. And, avoid presenting numerical data in a manner that
+implies better precision than is actually the case.
address@hidden debugger commands, @code{t} (@code{tbreak})
address@hidden debugger commands, @code{tbreak}
address@hidden @code{tbreak} debugger command
address@hidden @code{t} debugger command (alias for @code{tbreak})
address@hidden @code{tbreak} address@hidden@code{:address@hidden |
@var{function}]
address@hidden @code{t} address@hidden@code{:address@hidden | @var{function}]
-Set a temporary breakpoint (enabled for only one stop).
-The arguments are the same as for @code{break}.
address@hidden table
address@hidden
+* Floating-point Representation:: Binary floating-point representation.
+* Floating-point Context:: Floating-point context.
+* Rounding Mode:: Floating-point rounding mode.
address@hidden menu
address@hidden Debugger Execution Control
address@hidden Control of Execution
address@hidden Floating-point Representation
address@hidden Binary Floating-point Representation
address@hidden IEEE-754 format
-Now that your breakpoints are ready, you can start running the program
-and observing its behavior. There are more commands for controlling
-execution of the program than we saw in our earlier example:
+Although floating-point representations vary from machine to machine,
+the most commonly encountered representation is that defined by the
+IEEE 754 Standard. An IEEE-754 format value has three components:
address@hidden @asis
address@hidden debugger commands, @code{commands}
address@hidden @code{commands} debugger command
address@hidden debugger commands, @code{silent}
address@hidden @code{silent} debugger command
address@hidden debugger commands, @code{end}
address@hidden @code{end} debugger command
address@hidden @code{commands} address@hidden
address@hidden @code{silent}
address@hidden @dots{}
address@hidden @code{end}
-Set a list of commands to be executed upon stopping at
-a breakpoint or watchpoint. @var{n} is the breakpoint or watchpoint number.
-Without a number, the last one set is used. The actual commands follow,
-starting on the next line, and terminated by the @code{end} command.
-If the command @code{silent} is in the list, the usual messages about
-stopping at a breakpoint and the source line are not printed. Any command
-in the list that resumes execution (e.g., @code{continue}) terminates the list
-(an implicit @code{end}), and subsequent commands are ignored.
-For example:
address@hidden @bullet
address@hidden
+A sign bit telling whether the number is positive or negative.
address@hidden
-gawk> @kbd{commands}
-> @kbd{silent}
-> @kbd{printf "A silent breakpoint; i = %d\n", i}
-> @kbd{info locals}
-> @kbd{set i = 10}
-> @kbd{continue}
-> @kbd{end}
-gawk>
address@hidden example
address@hidden
+An @dfn{exponent} giving its order of magnitude, @var{e}.
address@hidden debugger commands, @code{c} (@code{continue})
address@hidden debugger commands, @code{continue}
address@hidden @code{continue} address@hidden
address@hidden @code{c} address@hidden
-Resume program execution. If continued from a breakpoint and @var{count} is
-specified, ignores the breakpoint at that location the next @var{count} times
-before stopping.
address@hidden
+A @dfn{significand}, @var{s},
+specifying the actual digits of the number.
address@hidden itemize
+
+The value of the
+number is then
address@hidden
address@hidden @cdot 2^e}.
address@hidden iftex
address@hidden
address@hidden * 2^e}.
address@hidden ifnottex
+The first bit of a non-zero binary significand
+is always one, so the significand in an IEEE-754 format only includes the
+fractional part, leaving the leading one implicit.
address@hidden debugger commands, @code{finish}
address@hidden @code{finish} debugger command
address@hidden @code{finish}
-Execute until the selected stack frame returns.
-Print the returned value.
+Three of the standard IEEE-754 types are 32-bit single precision,
+64-bit double precision and 128-bit quadruple precision.
+The standard also specifies extended precision formats
+to allow greater precisions and larger exponent ranges.
address@hidden debugger commands, @code{n} (@code{next})
address@hidden debugger commands, @code{next}
address@hidden @code{next} debugger command
address@hidden @code{n} debugger command (alias for @code{next})
address@hidden @code{next} address@hidden
address@hidden @code{n} address@hidden
-Continue execution to the next source line, stepping over function calls.
-The argument @var{count} controls how many times to repeat the action, as
-in @code{step}.
+The significand is stored in @dfn{normalized} format,
+which means that the first bit is always a one.
address@hidden debugger commands, @code{ni} (@code{nexti})
address@hidden debugger commands, @code{nexti}
address@hidden @code{nexti} debugger command
address@hidden @code{ni} debugger command (alias for @code{nexti})
address@hidden @code{nexti} address@hidden
address@hidden @code{ni} address@hidden
-Execute one (or @var{count}) instruction(s), stepping over function calls.
address@hidden Floating-point Context
address@hidden Floating-point Context
address@hidden context, floating-point
address@hidden debugger commands, @code{return}
address@hidden @code{return} debugger command
address@hidden @code{return} address@hidden
-Cancel execution of a function call. If @var{value} (either a string or a
-number) is specified, it is used as the function's return value. If used in a
-frame other than the innermost one (the currently executing function, i.e.,
-frame number 0), discard all inner frames in addition to the selected one,
-and the caller of that frame becomes the innermost frame.
+A floating-point @dfn{context} defines the environment for arithmetic operations.
+It governs precision, sets rules for rounding, and limits the range for exponents.
+The context has the following primary components:
address@hidden debugger commands, @code{r} (@code{run})
address@hidden debugger commands, @code{run}
address@hidden @code{run} debugger command
address@hidden @code{r} debugger command (alias for @code{run})
address@hidden @code{run}
address@hidden @code{r}
-Start/restart execution of the program. When restarting, the debugger
-retains the current breakpoints, watchpoints, command history,
-automatic display variables, and debugger options.
address@hidden @dfn
address@hidden Precision
+Precision of the floating-point format in bits.
address@hidden emax
+Maximum exponent allowed for this format.
address@hidden emin
+Minimum exponent allowed for this format.
address@hidden Underflow behavior
+The format may or may not support gradual underflow.
address@hidden Rounding
+The rounding mode of this context.
address@hidden table
address@hidden debugger commands, @code{s} (@code{step})
address@hidden debugger commands, @code{step}
address@hidden @code{step} debugger command
address@hidden @code{s} debugger command (alias for @code{step})
address@hidden @code{step} address@hidden
address@hidden @code{s} address@hidden
-Continue execution until control reaches a different source line in the
-current stack frame. @code{step} steps inside any function called within
-the line. If the argument @var{count} is supplied, steps that many times before
-stopping, unless it encounters a breakpoint or watchpoint.
address@hidden lists the precision and exponent
+field values for the basic IEEE-754 binary formats:
address@hidden debugger commands, @code{si} (@code{stepi})
address@hidden debugger commands, @code{stepi}
address@hidden @code{stepi} debugger command
address@hidden @code{si} debugger command (alias for @code{stepi})
address@hidden @code{stepi} address@hidden
address@hidden @code{si} address@hidden
-Execute one (or @var{count}) instruction(s), stepping inside function calls.
-(For illustration of what is meant by an ``instruction'' in @command{gawk},
-see the output shown under @code{dump} in @ref{Miscellaneous Debugger Commands}.)
address@hidden Table,table-ieee-formats
address@hidden IEEE Format Context Values}
address@hidden @columnfractions .20 .20 .20 .20 .20
address@hidden Name @tab Total bits @tab Precision @tab emin @tab emax
address@hidden Single @tab 32 @tab 24 @tab @minus{}126 @tab +127
address@hidden Double @tab 64 @tab 53 @tab @minus{}1022 @tab +1023
address@hidden Quadruple @tab 128 @tab 113 @tab @minus{}16382 @tab +16383
address@hidden multitable
address@hidden float
address@hidden debugger commands, @code{u} (@code{until})
address@hidden debugger commands, @code{until}
address@hidden @code{until} debugger command
address@hidden @code{u} debugger command (alias for @code{until})
address@hidden @code{until} address@hidden@code{:address@hidden |
@var{function}]
address@hidden @code{u} address@hidden@code{:address@hidden | @var{function}]
-Without any argument, continue execution until a line past the current
-line in current stack frame is reached. With an argument,
-continue execution until the specified location is reached, or the current
-stack frame returns.
address@hidden table
address@hidden NOTE
+The precision numbers include the implied leading one that gives them
+one extra bit of significand.
address@hidden quotation
address@hidden Viewing And Changing Data
address@hidden Viewing and Changing Data
+A floating-point context can also determine which signals are treated
+as exceptions, and can set rules for arithmetic with special values.
+Please consult the IEEE-754 standard or other resources for details.
-The commands for viewing and changing variables inside of @command{gawk} are:
address@hidden ordinarily uses the hardware double precision
+representation for numbers. On most systems, this is IEEE-754
+floating-point format, corresponding to 64-bit binary with 53 bits
+of precision.
address@hidden @asis
address@hidden debugger commands, @code{display}
address@hidden @code{display} debugger command
address@hidden @code{display} address@hidden | @address@hidden
-Add variable @var{var} (or field @address@hidden) to the display list.
-The value of the variable or field is displayed each time the program stops.
-Each variable added to the list is identified by a unique number:
address@hidden NOTE
+In case an underflow occurs, the standard allows, but does not require,
+the result from an arithmetic operation to be a number smaller than
+the smallest nonzero normalized number. Such numbers do
+not have as many significant digits as normal numbers, and are called
address@hidden or @dfn{subnormals}. The alternative, simply returning a zero,
+is called @dfn{flush to zero}. The basic IEEE-754 binary formats
+support subnormal numbers.
address@hidden quotation
address@hidden
-gawk> @kbd{display x}
address@hidden 10: x = 1
address@hidden example
address@hidden Rounding Mode
address@hidden Floating-point Rounding Mode
address@hidden rounding mode, floating-point
address@hidden
-displays the assigned item number, the variable name and its current value.
-If the display variable refers to a function parameter, it is silently
-deleted from the list as soon as the execution reaches a context where
-no such variable of the given name exists.
-Without argument, @code{display} displays the current values of
-items on the list.
+The @dfn{rounding mode} specifies the behavior for the results of numerical
+operations when discarding extra precision. Each rounding mode indicates
+how the least significant returned digit of a rounded result is to
+be calculated.
address@hidden lists the IEEE-754 defined
+rounding modes:
address@hidden debugger commands, @code{eval}
address@hidden @code{eval} debugger command
address@hidden @code{eval "@var{awk statements}"}
-Evaluate @var{awk statements} in the context of the running program.
-You can do anything that an @command{awk} program would do: assign
-values to variables, call functions, and so on.
address@hidden Table,table-rounding-modes
address@hidden 754 Rounding Modes}
+@multitable @columnfractions .45 .55
+@headitem Rounding Mode @tab IEEE Name
+@item Round to nearest, ties to even @tab @code{roundTiesToEven}
+@item Round toward plus Infinity @tab @code{roundTowardPositive}
+@item Round toward negative Infinity @tab @code{roundTowardNegative}
+@item Round toward zero @tab @code{roundTowardZero}
+@item Round to nearest, ties away from zero @tab @code{roundTiesToAway}
+@end multitable
address@hidden float
address@hidden @code{eval} @var{param}, @dots{}
address@hidden @var{awk statements}
address@hidden @code{end}
-This form of @code{eval} is similar, but it allows you to define
-``local variables'' that exist in the context of the
address@hidden statements}, instead of using variables or function
-parameters defined by the program.
+The default mode @code{roundTiesToEven} is the most preferred,
+but the least intuitive. This method does the obvious thing for most values,
+by rounding them up or down to the nearest digit.
+For example, rounding 1.132 to two digits yields 1.13,
+and rounding 1.157 yields 1.16.
address@hidden debugger commands, @code{p} (@code{print})
address@hidden debugger commands, @code{print}
address@hidden @code{print} debugger command
address@hidden @code{p} debugger command (alias for @code{print})
address@hidden @code{print} @address@hidden,} @var{var2} @dots{}]
address@hidden @code{p} @address@hidden,} @var{var2} @dots{}]
-Print the value of a @command{gawk} variable or field.
-Fields must be referenced by constants:
+However, when it comes to rounding a value that is exactly halfway between,
+things do not work the way you probably learned in school.
+In this case, the number is rounded to the nearest even digit.
+So rounding 0.125 to two digits rounds down to 0.12,
+but rounding 0.6875 to three digits rounds up to 0.688.
+You probably have already encountered this rounding mode when
+using the @code{printf} routine to format floating-point numbers.
+For example:
@example
-gawk> @kbd{print $3}
+BEGIN @{
+ x = -4.5
+ for (i = 1; i < 10; i++) @{
+ x += 1.0
+ printf("%4.1f => %2.0f\n", x, x)
+ @}
+@}
@end example
@noindent
-This prints the third field in the input record (if the specified field does not
-exist, it prints @samp{Null field}). A variable can be an array element, with
-the subscripts being constant values. To print the contents of an array,
-prefix the name of the array with the @samp{@@} symbol:
+produces the following output when run:@footnote{It
+is possible for the output to be completely different if the
+C library in your system does not use the IEEE-754 even-rounding
+rule to round halfway cases for @code{printf()}.}
@example
-gawk> @kbd{print @@a}
+-3.5 => -4
+-2.5 => -2
+-1.5 => -2
+-0.5 => 0
+ 0.5 => 0
+ 1.5 => 2
+ 2.5 => 2
+ 3.5 => 4
+ 4.5 => 4
@end example
address@hidden
-This prints the indices and the corresponding values for all elements in
-the array @code{a}.
-
address@hidden debugger commands, @code{printf}
address@hidden @code{printf} debugger command
address@hidden @code{printf} @var{format} address@hidden,} @var{arg} @dots{}]
-Print formatted text. The @var{format} may include escape sequences,
-such as @samp{\n}
-(@pxref{Escape Sequences}).
-No newline is printed unless one is specified.
-
address@hidden debugger commands, @code{set}
address@hidden @code{set} debugger command
address@hidden @code{set} @address@hidden@var{value}
-Assign a constant (number or string) value to an @command{awk} variable
-or field.
-String values must be enclosed between double quotes (@code{"@dots{}"}).
+The theory behind the rounding mode @code{roundTiesToEven} is that
+it more or less evenly distributes upward and downward rounds
+of exact halves, which might cause the round-off error
+to cancel itself out. This is the default rounding mode used
+in IEEE-754 computing functions and operators.
-You can also set special @command{awk} variables, such as @code{FS},
address@hidden, @code{NR}, etc.
+The other rounding modes are rarely used.
+Round toward positive infinity (@code{roundTowardPositive})
+and round toward negative infinity (@code{roundTowardNegative})
+are often used to implement interval arithmetic,
+where you adjust the rounding mode to calculate upper and lower bounds
+for the range of output. The @code{roundTowardZero}
+mode can be used for converting floating-point numbers to integers.
+The rounding mode @code{roundTiesToAway} rounds the result to the
+nearest number and selects the number with the larger magnitude
+if a tie occurs.
address@hidden debugger commands, @code{w} (@code{watch})
address@hidden debugger commands, @code{watch}
address@hidden @code{watch} debugger command
address@hidden @code{w} debugger command (alias for @code{watch})
address@hidden @code{watch} @var{var} | @address@hidden
address@hidden"@var{expression}"}]
address@hidden @code{w} @var{var} | @address@hidden
address@hidden"@var{expression}"}]
-Add variable @var{var} (or field @address@hidden) to the watch list.
-The debugger then stops whenever
-the value of the variable or field changes. Each watched item is assigned a
-number which can be used to delete it from the watch list using the
address@hidden command.
+Some numerical analysts will tell you that your choice of rounding style
+has tremendous impact on the final outcome, and advise you to wait until
+final output for any rounding. Instead, you can often avoid round-off error problems by
+setting the precision initially to some value sufficiently larger than
+the final desired precision, so that the accumulation of round-off error
+does not influence the outcome.
+If you suspect that results from your computation are
+sensitive to accumulation of round-off error,
+one way to be sure is to look for a significant difference in output
+when you change the rounding mode.
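The sensitivity check described above can be sketched as a small shell function. This is only an illustration: the function name `roundmode_probe`, the choice of summing 0.1 repeatedly, and the loop count are assumptions, not part of the manual. It relies on the `-M` option and the `PREC` and `ROUNDMODE` variables described in this chapter, and skips itself when `gawk` with MPFR support is not available.

```shell
# Run the same computation under several rounding modes and print the results.
# A visible spread between the lines suggests sensitivity to round-off error.
roundmode_probe() {
    # Skip quietly if gawk is missing or was built without MPFR support.
    if ! gawk -M 'BEGIN { exit !("mpfr_version" in PROCINFO) }' 2>/dev/null; then
        echo "gawk -M not available; skipped"
        return 0
    fi
    for mode in N U D Z; do
        gawk -M -v PREC=53 -v ROUNDMODE="$mode" 'BEGIN {
            s = 0
            for (i = 1; i <= 100000; i++)
                s += 0.1              # accumulates round-off error
            printf "%s %.17g\n", ROUNDMODE, s
        }'
    done
}

roundmode_probe
```

If the four printed sums agree to the number of digits you care about, round-off accumulation is probably not affecting your result at that precision.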
-With a watchpoint, you may also supply a condition. This is an
address@hidden expression (enclosed in double quotes) that the debugger
-evaluates whenever the watchpoint is reached. If the condition is true,
-then the debugger stops execution and prompts for a command. Otherwise,
address@hidden continues executing the program.
address@hidden Gawk and MPFR
address@hidden @command{gawk} + MPFR = Powerful Arithmetic
address@hidden debugger commands, @code{undisplay}
address@hidden @code{undisplay} debugger command
address@hidden @code{undisplay} address@hidden
-Remove item number @var{n} (or all items, if no argument) from the
-automatic display list.
+The rest of this @value{CHAPTER} describes how to use the arbitrary precision
+(also known as @dfn{multiple precision} or @dfn{infinite precision}) numeric
+capabilities in @command{gawk} to produce maximally accurate results
+when you need them.
address@hidden debugger commands, @code{unwatch}
address@hidden @code{unwatch} debugger command
address@hidden @code{unwatch} address@hidden
-Remove item number @var{n} (or all items, if no argument) from the
-watch list.
+But first you should check if your version of
+@command{gawk} supports arbitrary precision arithmetic.
+The easiest way to find out is to look at the output of
+the following command:
address@hidden table
address@hidden
+$ @kbd{gawk --version}
address@hidden GNU Awk 4.1.0 (GNU MPFR 3.1.0, GNU MP 5.0.3)
address@hidden Copyright (C) 1989, 1991-2012 Free Software Foundation.
address@hidden
address@hidden example
address@hidden Execution Stack
address@hidden Dealing with the Stack
+@command{gawk} uses the
+@uref{http://www.mpfr.org, GNU MPFR}
+and
+@uref{http://gmplib.org, GNU MP} (GMP)
+libraries for arbitrary precision
+arithmetic on numbers. So if you do not see the names of these libraries
+in the output, then your version of @command{gawk} does not support
+arbitrary precision arithmetic.
-Whenever you run a program which contains any function calls,
address@hidden maintains a stack of all of the function calls leading up
-to where the program is right now. You can see how you got to where you are,
-and also move around in the stack to see what the state of things was in the
-functions which called the one you are in. The commands for doing this are:
+Additionally,
+there are a few elements available in the @code{PROCINFO} array
+to provide information about the MPFR and GMP libraries.
address@hidden, for more information.
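As a rough sketch of the `PROCINFO` check mentioned above, the following shell function reports the library versions when they are present. The function name `mpfr_report` is hypothetical, and the element names `"mpfr_version"` and `"gmp_version"` are assumed here; consult the `PROCINFO` documentation for your version of `gawk` to confirm them.

```shell
# Report MPFR/GMP support as seen through PROCINFO.
# Uses gawk when installed, otherwise falls back to awk, where PROCINFO
# is simply empty and the "not reported" branch is taken.
mpfr_report() {
    "$(command -v gawk || echo awk)" 'BEGIN {
        if ("mpfr_version" in PROCINFO)
            print PROCINFO["mpfr_version"] ", " PROCINFO["gmp_version"]
        else
            print "no arbitrary precision support reported"
    }'
}

mpfr_report
```

Unlike parsing the output of `gawk --version`, this test can be made from inside a program before it relies on `-M` behavior.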
address@hidden @asis
address@hidden debugger commands, @code{bt} (@code{backtrace})
address@hidden debugger commands, @code{backtrace}
address@hidden @code{backtrace} debugger command
address@hidden @code{bt} debugger command (alias for @code{backtrace})
address@hidden @code{backtrace} address@hidden
address@hidden @code{bt} address@hidden
-Print a backtrace of all function calls (stack frames), or innermost @var{count}
-frames if @var{count} > 0. Print the outermost @var{count} frames if
address@hidden < 0. The backtrace displays the name and arguments to each
-function, the source @value{FN}, and the line number.
address@hidden
+Even if you aren't interested in arbitrary precision arithmetic, you
+may still benefit from knowing about how @command{gawk} handles numbers
+in general, and the limitations of doing arithmetic with ordinary
address@hidden numbers.
address@hidden ignore
address@hidden debugger commands, @code{down}
address@hidden @code{down} debugger command
address@hidden @code{down} address@hidden
-Move @var{count} (default 1) frames down the stack toward the innermost frame.
-Then select and print the frame.
address@hidden debugger commands, @code{f} (@code{frame})
address@hidden debugger commands, @code{frame}
address@hidden @code{frame} debugger command
address@hidden @code{f} debugger command (alias for @code{frame})
address@hidden @code{frame} address@hidden
address@hidden @code{f} address@hidden
-Select and print (frame number, function and argument names, source file,
-and the source line) stack frame @var{n}. Frame 0 is the currently executing,
-or @dfn{innermost}, frame (function call), frame 1 is the frame that called the
-innermost one. The highest numbered frame is the one for the main program.
address@hidden Arbitrary Precision Floats
address@hidden Arbitrary Precision Floating-point Arithmetic with @command{gawk}
address@hidden debugger commands, @code{up}
address@hidden @code{up} debugger command
address@hidden @code{up} address@hidden
-Move @var{count} (default 1) frames up the stack toward the outermost frame.
-Then select and print the frame.
address@hidden table
+@command{gawk} uses the GNU MPFR library
+for arbitrary precision floating-point arithmetic. The MPFR library
+provides precise control over precisions and rounding modes, and gives
+correctly rounded reproducible platform-independent results. With the
+command-line option @option{--bignum} or @option{-M},
+all floating-point arithmetic operators and numeric functions can yield
+results to any desired precision level supported by MPFR.
+Two built-in
+variables @code{PREC}
+(@pxref{Setting Precision})
+and @code{ROUNDMODE}
+(@pxref{Setting Rounding Mode})
+provide control over the working precision and the rounding mode.
+The precision and the rounding mode are set globally for every operation
+to follow.
address@hidden Debugger Info
address@hidden Obtaining Information about the Program and the Debugger State
+The default working precision for arbitrary precision floating-point values is 53,
+and the default value for @code{ROUNDMODE} is @code{"N"},
+which selects the IEEE-754
+@code{roundTiesToEven} (@pxref{Rounding Mode}) rounding mode.@footnote{The
+default precision is 53, since according to the MPFR documentation,
+the library should be able to exactly reproduce all computations with
+double-precision machine floating-point numbers (@code{double} type
+in C), except the default exponent range is much wider and subnormal
+numbers are not implemented.}
+@command{gawk} uses the default exponent range in MPFR
address@hidden
+(@math{emax = 2^{30} - 1, emin = -emax})
address@hidden iftex
address@hidden
+(@var{emax} = 2^30 @minus{} 1, @var{emin} = @address@hidden)
address@hidden ifnottex
+for all floating-point contexts.
+There is no explicit mechanism to adjust the exponent range.
+MPFR does not implement subnormal numbers by default,
+and this behavior cannot be changed in @command{gawk}.
-Besides looking at the values of variables, there is often a need to get
-other sorts of information about the state of your program and of the
-debugging environment itself. The @command{gawk} debugger has one command which
-provides this information, appropriately called @code{info}. @code{info}
-is used with one of a number of arguments that tell it exactly what
-you want to know:
address@hidden NOTE
+When emulating an IEEE-754 format (@pxref{Setting Precision}),
+@command{gawk} internally adjusts the exponent range
+to the value defined for the format and also performs computations needed for
+gradual underflow (subnormal numbers).
address@hidden quotation
address@hidden @asis
address@hidden debugger commands, @code{i} (@code{info})
address@hidden debugger commands, @code{info}
address@hidden @code{info} debugger command
address@hidden @code{i} debugger command (alias for @code{info})
address@hidden @code{info} @var{what}
address@hidden @code{i} @var{what}
-The value for @var{what} should be one of the following:
address@hidden NOTE
+MPFR numbers are variable-size entities, consuming only as much space as
+needed to store the significant digits. Since the performance using MPFR
+numbers pales in comparison to doing arithmetic using the underlying machine
+types, you should consider using only as much precision as needed by
+your program.
address@hidden quotation
address@hidden nested table
address@hidden @code
address@hidden args
-Arguments of the selected frame.
address@hidden
+* Setting Precision:: Setting the working precision.
+* Setting Rounding Mode:: Setting the rounding mode.
+* Floating-point Constants:: Representing floating-point constants.
+* Changing Precision:: Changing the precision of a number.
+* Exact Arithmetic:: Exact arithmetic with floating-point numbers.
address@hidden menu
address@hidden break
-List all currently set breakpoints.
address@hidden Setting Precision
address@hidden Setting the Working Precision
address@hidden @code{PREC} variable
address@hidden display
-List all items in the automatic display list.
+@command{gawk} uses a global working precision; it does not keep track of
+the precision or accuracy of individual numbers. Performing an arithmetic
+operation or calling a built-in function rounds the result to the current
+working precision. The default working precision is 53 bits, which can be
+modified using the built-in variable @code{PREC}. You can also set the
+value to one of the following pre-defined case-insensitive strings
+to emulate an IEEE-754 binary format:
address@hidden frame
-Description of the selected stack frame.
address@hidden address@hidden"double"}} {12345678901234567890123456789012345}
address@hidden @code{PREC} @tab IEEE-754 Binary Format
address@hidden @code{"half"} @tab 16-bit half-precision.
address@hidden @code{"single"} @tab Basic 32-bit single precision.
address@hidden @code{"double"} @tab Basic 64-bit double precision.
address@hidden @code{"quad"} @tab Basic 128-bit quadruple precision.
address@hidden @code{"oct"} @tab 256-bit octuple precision.
address@hidden multitable
address@hidden functions
-List all function definitions including source file names and
-line numbers.
+The following example illustrates the effects of changing precision
+on arithmetic operations:
address@hidden locals
-Local variables of the selected frame.
address@hidden
+$ @kbd{gawk -M -vPREC=100 'BEGIN @{ x = 1.0e-400; print x + 0; \}
+> @kbd{PREC = "double"; print x + 0 @}'}
address@hidden 1e-400
address@hidden 0
address@hidden example
address@hidden source
-The name of the current source file. Each time the program stops, the
-current source file is the file containing the current instruction.
-When the debugger first starts, the current source file is the first file
-included via the @option{-f} option. The
address@hidden @var{filename}:@var{lineno}} command can
-be used at any time to change the current source.
+Binary and decimal precisions are related approximately according to the
+formula:
address@hidden sources
-List all program sources.
address@hidden
address@hidden = 3.322 @cdot dps}
address@hidden iftex
address@hidden
address@hidden = 3.322 * @var{dps}
address@hidden ifnottex
address@hidden variables
-List all global variables.
address@hidden
+Here, @var{prec} denotes the binary precision
+(measured in bits) and @var{dps} (short for decimal places)
+is the number of decimal digits. We can easily calculate how many decimal
+digits the 53-bit significand of an IEEE double is equivalent to:
+53 / 3.322, which is about 15.95.
+But what does 15.95 digits actually mean? It depends whether you are
+concerned about how many digits you can rely on, or how many digits
+you need.
address@hidden watch
-List all items in the watch list.
address@hidden table
address@hidden table
+It is important to know how many bits it takes to uniquely identify
+a double-precision value (the C type @code{double}). If you want to
+convert from @code{double} to decimal and back to @code{double} (e.g.,
+saving a @code{double} representing an intermediate result to a file, and
+later reading it back to restart the computation), then a few more decimal
+digits are required. 17 digits is generally enough for a @code{double}.
-Additional commands give you control over the debugger, the ability to
-save the debugger's state, and the ability to run debugger commands
-from a file. The commands are:
+It can also be important to know what decimal numbers can be uniquely
+represented with a @code{double}. If you want to convert
+from decimal to @code{double} and back again, 15 digits is the most that
+you can get. Stated differently, you should not present
+the numbers from your floating-point computations with more than 15
+significant digits in them.
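The difference between 15 and 17 digits can be seen with ordinary double-precision arithmetic, without `-M`. The sketch below is illustrative only: the function name `roundtrip_check` and the test value `0.1 + 0.2` are assumptions chosen because the sum is not exactly 0.3 in binary.

```shell
# Format a double with 15 and with 17 significant digits, convert each
# string back to a number, and report whether the original value survives.
roundtrip_check() {
    awk 'BEGIN {
        x = 0.1 + 0.2                 # not exactly 0.3 in binary
        s15 = sprintf("%.15g", x)     # 15 significant digits
        s17 = sprintf("%.17g", x)     # 17 significant digits
        # The trailing 1 or 0 says whether the string converts back to x.
        print "15 digits:", s15, (s15 + 0 == x)
        print "17 digits:", s17, (s17 + 0 == x)
    }'
}

roundtrip_check
```

With IEEE double arithmetic, the 15-digit string reads back as a different value while the 17-digit string reads back as the original, which is why 17 digits are used when saving intermediate results.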
address@hidden @asis
address@hidden debugger commands, @code{o} (@code{option})
address@hidden debugger commands, @code{option}
address@hidden @code{option} debugger command
address@hidden @code{o} debugger command (alias for @code{option})
address@hidden @code{option} address@hidden@address@hidden
address@hidden @code{o} address@hidden@address@hidden
-Without an argument, display the available debugger options
-and their current values. @samp{option @var{name}} shows the current
-value of the named option. @samp{option @address@hidden assigns
-a new value to the named option.
-The available options are:
+Conversely, it takes a precision of 332 bits to hold an approximation
+of the constant @value{PI} that is accurate to 100 decimal places.
+You should always add some extra bits in order to avoid the confusing round-off
+issues that occur because numbers are stored internally in binary.
address@hidden nested table
address@hidden @code
address@hidden history_size
-The maximum number of lines to keep in the history file @file{./.gawk_history}.
-The default is 100.
address@hidden Setting Rounding Mode
address@hidden Setting the Rounding Mode
address@hidden @code{ROUNDMODE} variable
address@hidden listsize
-The number of lines that @code{list} prints. The default is 15.
+The @code{ROUNDMODE} variable provides
+program level control over the rounding mode.
+The correspondence between @code{ROUNDMODE} and the IEEE
+rounding modes is shown in @ref{table-gawk-rounding-modes}.
address@hidden outfile
-Send @command{gawk} output to a file; debugger output still goes
-to standard output. An empty string (@code{""}) resets output to
-standard output.
+@float Table,table-gawk-rounding-modes
+@caption{@command{gawk} Rounding Modes}
+@multitable @columnfractions .45 .30 .25
+@headitem Rounding Mode @tab IEEE Name @tab @code{ROUNDMODE}
+@item Round to nearest, ties to even @tab @code{roundTiesToEven} @tab @code{"N"} or @code{"n"}
+@item Round toward plus Infinity @tab @code{roundTowardPositive} @tab @code{"U"} or @code{"u"}
+@item Round toward negative Infinity @tab @code{roundTowardNegative} @tab @code{"D"} or @code{"d"}
+@item Round toward zero @tab @code{roundTowardZero} @tab @code{"Z"} or @code{"z"}
+@item Round to nearest, ties away from zero @tab @code{roundTiesToAway} @tab @code{"A"} or @code{"a"}
+@end multitable
+@end float
address@hidden prompt
-The debugger prompt. The default is @address@hidden> }}.
+@code{ROUNDMODE} has the default value @code{"N"},
+which selects the IEEE-754 rounding mode @code{roundTiesToEven}.
+The value @code{"A"} in @ref{table-gawk-rounding-modes}
+selects the IEEE-754 mode @code{roundTiesToAway}
+only if your version of the MPFR library supports it; otherwise setting
+@code{ROUNDMODE} to this value has no effect. @xref{Rounding Mode},
+for the meanings of the various rounding modes.
address@hidden save_history @r{[}on @r{|} address@hidden
-Save command history to file @file{./.gawk_history}.
-The default is @code{on}.
+Here is an example of how to change the default rounding behavior of
+@command{gawk}'s output:
address@hidden save_options @r{[}on @r{|} address@hidden
-Save current options to file @file{./.gawkrc} upon exit.
-The default is @code{on}.
-Options are read back in to the next session upon startup.
address@hidden
+$ @kbd{gawk -M -vROUNDMODE="Z" 'BEGIN @{ printf("%.2f\n", 1.378) @}'}
address@hidden 1.37
address@hidden example
address@hidden trace @r{[}on @r{|} address@hidden
-Turn instruction tracing on or off. The default is @code{off}.
address@hidden table
address@hidden Floating-point Constants
address@hidden Representing Floating-point Constants
address@hidden constants, floating-point
address@hidden @code{save} @var{filename}
-Save the commands from the current session to the given @value{FN},
-so that they can be replayed using the @command{source} command.
+Be wary of floating-point constants! When reading a floating-point constant
+from program source code, @command{gawk} uses the default precision,
+unless overridden
+by an assignment to the special variable @code{PREC} on the command
+line, to store it internally as a MPFR number.
+Changing the precision using @code{PREC} in the program text does
+not change the precision of a constant. If you need to
+represent a floating-point constant at a higher precision than the
+default and cannot use a command line assignment to @code{PREC},
+you should either specify the constant as a string, or
+as a rational number whenever possible. The following example
+illustrates the differences among various ways to
+print a floating-point constant:
address@hidden @code{source} @var{filename}
-Run command(s) from a file; an error in any command does not
-terminate execution of subsequent commands. Comments (lines starting
-with @samp{#}) are allowed in a command file.
-Empty lines are ignored; they do @emph{not}
-repeat the last command.
-You can't restart the program by having more than one @code{run}
-command in the file. Also, the list of commands may include additional
address@hidden commands; however, the @command{gawk} debugger will not source
the
-same file more than once in order to avoid infinite recursion.
address@hidden
+$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 0.1) @}'}
address@hidden 0.1000000000000000055511151
+$ @kbd{gawk -M -vPREC=113 'BEGIN @{ printf("%0.25f\n", 0.1) @}'}
address@hidden 0.1000000000000000000000000
+$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", "0.1") @}'}
address@hidden 0.1000000000000000000000000
+$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 1/10) @}'}
address@hidden 0.1000000000000000000000000
address@hidden example
-In addition to, or instead of the @code{source} command, you can use
-the @option{-D @var{file}} or @address@hidden command-line
-options to execute commands from a file non-interactively
-(@pxref{Options}.
address@hidden table
+In the first case, the number is stored with the default precision of 53 bits.
address@hidden Miscellaneous Debugger Commands
address@hidden Miscellaneous Commands
address@hidden Changing Precision
address@hidden Changing the Precision of a Number
-There are a few more commands which do not fit into the
-previous categories, as follows:
address@hidden Laurie, Dirk
address@hidden
address@hidden point is that in any variable-precision package,
+a decision is made on how to treat numbers given as data,
+or arising in intermediate results, which are represented in
+floating-point format to a precision lower than working precision.
+Do we promote them to full membership of the high-precision club,
+or do we treat them and all their associates as second-class citizens?
+Sometimes the first course is proper, sometimes the second, and it takes
+careful analysis to tell which.}
address@hidden @asis
address@hidden debugger commands, @code{dump}
address@hidden @code{dump} debugger command
address@hidden @code{dump} address@hidden
-Dump bytecode of the program to standard output or to the file
-named in @var{filename}. This prints a representation of the internal
-instructions which @command{gawk} executes to implement the @command{awk}
-commands in a program. This can be very enlightening, as the following
-partial dump of Davide Brini's obfuscated code
-(@pxref{Signature Program}) demonstrates:
+Dirk address@hidden Laurie.
address@hidden Arithmetic Considered Perilous --- A Detective Story}.
+Electronic Transactions on Numerical Analysis. Volume 28, pp. 168-173, 2008.}
address@hidden quotation
address@hidden
-gawk> @kbd{dump}
address@hidden # BEGIN
address@hidden
address@hidden [ 2:0x89faef4] Op_rule : [in_rule = BEGIN]
[source_file = brini.awk]
address@hidden [ 3:0x89fa428] Op_push_i : "~" [PERM|STRING|STRCUR]
address@hidden [ 3:0x89fa464] Op_push_i : "~" [PERM|STRING|STRCUR]
address@hidden [ 3:0x89fa450] Op_match :
address@hidden [ 3:0x89fa3ec] Op_store_var : O [do_reference = FALSE]
address@hidden [ 4:0x89fa48c] Op_push_i : "=="
[PERM|STRING|STRCUR]
address@hidden [ 4:0x89fa4c8] Op_push_i : "=="
[PERM|STRING|STRCUR]
address@hidden [ 4:0x89fa4b4] Op_equal :
address@hidden [ 4:0x89fa400] Op_store_var : o [do_reference = FALSE]
address@hidden [ 5:0x89fa4f0] Op_push : o
address@hidden [ 5:0x89fa4dc] Op_plus_i : 0 [PERM|NUMCUR|NUMBER]
address@hidden [ 5:0x89fa414] Op_push_lhs : o [do_reference = TRUE]
address@hidden [ 5:0x89fa4a0] Op_assign_plus :
address@hidden [ :0x89fa478] Op_pop :
address@hidden [ 6:0x89fa540] Op_push : O
address@hidden [ 6:0x89fa554] Op_push_i : "" [PERM|STRING|STRCUR]
address@hidden [ :0x89fa5a4] Op_no_op :
address@hidden [ 6:0x89fa590] Op_push : O
address@hidden [ :0x89fa5b8] Op_concat : [expr_count = 3]
[concat_flag = 0]
address@hidden [ 6:0x89fa518] Op_store_var : x [do_reference = FALSE]
address@hidden [ 7:0x89fa504] Op_push_loop : [target_continue =
0x89fa568] [target_break = 0x89fa680]
address@hidden [ 7:0x89fa568] Op_push_lhs : X [do_reference = TRUE]
address@hidden [ 7:0x89fa52c] Op_postincrement :
address@hidden [ 7:0x89fa5e0] Op_push : x
address@hidden [ 7:0x89fa61c] Op_push : o
address@hidden [ 7:0x89fa5f4] Op_plus :
address@hidden [ 7:0x89fa644] Op_push : o
address@hidden [ 7:0x89fa630] Op_plus :
address@hidden [ 7:0x89fa5cc] Op_leq :
address@hidden [ :0x89fa57c] Op_jmp_false : [target_jmp = 0x89fa680]
address@hidden [ 7:0x89fa694] Op_push_i : "%c"
[PERM|STRING|STRCUR]
address@hidden [ :0x89fa6d0] Op_no_op :
address@hidden [ 7:0x89fa608] Op_assign_concat : c
address@hidden [ :0x89fa6a8] Op_jmp : [target_jmp = 0x89fa568]
address@hidden [ :0x89fa680] Op_pop_loop :
address@hidden
address@hidden
address@hidden
address@hidden [ 8:0x89fa658] Op_K_printf : [expr_count = 17]
[redir_type = ""]
address@hidden [ :0x89fa374] Op_no_op :
address@hidden [ :0x89fa3d8] Op_atexit :
address@hidden [ :0x89fa6bc] Op_stop :
address@hidden [ :0x89fa39c] Op_no_op :
address@hidden [ :0x89fa3b0] Op_after_beginfile :
address@hidden [ :0x89fa388] Op_no_op :
address@hidden [ :0x89fa3c4] Op_after_endfile :
-gawk>
address@hidden smallexample
+@command{gawk} does not implicitly modify the precision of any previously
+computed results when the working precision is changed with an assignment
+to @code{PREC}. The precision of a number is always the one that was
+used at the time of its creation, and there is no way for the user
+to explicitly change it afterwards. However, since the result of a
+floating-point arithmetic operation is always an arbitrary precision
+floating-point value---with a precision set by the value of @code{PREC}---one of the
+following workarounds effectively accomplishes the desired behavior:
address@hidden debugger commands, @code{h} (@code{help})
address@hidden debugger commands, @code{help}
address@hidden @code{help} debugger command
address@hidden @code{h} debugger command (alias for @code{help})
address@hidden @code{help}
address@hidden @code{h}
-Print a list of all of the @command{gawk} debugger commands with a short
-summary of their usage. @samp{help @var{command}} prints the information
-about the command @var{command}.
address@hidden
+x = x + 0.0
address@hidden example
address@hidden debugger commands, @code{l} (@code{list})
address@hidden debugger commands, @code{list}
address@hidden @code{list} debugger command
address@hidden @code{l} debugger command (alias for @code{list})
address@hidden @code{list} address@hidden | @code{+} | @var{n} |
@address@hidden:}n} | @address@hidden | @var{function}]
address@hidden @code{l} address@hidden | @code{+} | @var{n} |
@address@hidden:}n} | @address@hidden | @var{function}]
-Print the specified lines (default 15) from the current source file
-or the file named @var{filename}. The possible arguments to @code{list}
-are as follows:
address@hidden
+or:
address@hidden nested table
address@hidden @asis
address@hidden @code{-}
-Print lines before the lines last printed.
address@hidden
+x += 0.0
address@hidden example
address@hidden @code{+}
-Print lines after the lines last printed.
address@hidden without any argument does the same thing.
address@hidden Exact Arithmetic
address@hidden Exact Arithmetic with Floating-point Numbers
address@hidden @var{n}
-Print lines centered around line number @var{n}.
address@hidden CAUTION
+Never depend on the exactness of floating-point arithmetic,
+even for apparently simple expressions!
address@hidden quotation
address@hidden @address@hidden
-Print lines from @var{n} to @var{m}.
+Can arbitrary precision arithmetic give exact results? There are
+no easy answers. The standard rules of algebra often do not apply
+when using floating-point arithmetic.
+Among other things, the distributive and associative laws
+do not hold completely, and order of operation may be important
+for your computation. Rounding error, cumulative precision loss
+and underflow are often troublesome.
address@hidden @address@hidden:}n}
-Print lines centered around line number @var{n} in
-source file @var{filename}. This command may change the current source file.
+When @command{gawk} tests the expressions @samp{0.1 + 12.2} and @samp{12.3}
+for equality
+using the machine double precision arithmetic, it decides that they
+are not equal!
+(@xref{Floating-point Programming}.)
+You can get the result you want by increasing the precision;
+56 in this case will get the job done:
address@hidden @var{function}
-Print lines centered around beginning of the
-function @var{function}. This command may change the current source file.
address@hidden table
address@hidden
+$ @kbd{gawk -M -vPREC=56 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'}
address@hidden 1
address@hidden example
address@hidden debugger commands, @code{q} (@code{quit})
address@hidden debugger commands, @code{quit}
address@hidden @code{quit} debugger command
address@hidden @code{q} debugger command (alias for @code{quit})
address@hidden @code{quit}
address@hidden @code{q}
-Exit the debugger. Debugging is great fun, but sometimes we all have
-to tend to other obligations in life, and sometimes we find the bug,
-and are free to go on to the next one! As we saw above, if you are
-running a program, the debugger warns you if you accidentally type
address@hidden or @samp{quit}, to make sure you really want to quit.
+If adding more bits is good, perhaps adding even more bits of
+precision is better?
+Here is what happens if we use an even larger value of @code{PREC}:
address@hidden debugger commands, @code{trace}
address@hidden @code{trace} debugger command
address@hidden @code{trace} @code{on} @r{|} @code{off}
-Turn on or off a continuous printing of instructions which are about to
-be executed, along with printing the @command{awk} line which they
-implement. The default is @code{off}.
address@hidden
+$ @kbd{gawk -M -vPREC=201 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'}
address@hidden 0
address@hidden example
-It is to be hoped that most of the ``opcodes'' in these instructions are
-fairly self-explanatory, and using @code{stepi} and @code{nexti} while
address@hidden is on will make them into familiar friends.
+This is not a bug in @command{gawk} or in the MPFR library.
+It is easy to forget that the finite number of bits used to store the value
+is often just an approximation after proper rounding.
+The test for equality succeeds if and only if @emph{all} bits in the two operands
+are exactly the same. Since this is not necessarily true after floating-point
+computations with a particular precision and effective rounding rule,
+a straight test for equality may not work.
address@hidden table
+So, don't assume that floating-point values can be compared for equality.
+You should also exercise caution when using other forms of comparisons.
+The standard way to compare between floating-point numbers is to determine
+how much error (or @dfn{tolerance}) you will allow in a comparison and
+check to see if one value is within this error range of the other.
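A tolerance comparison of the kind described above can be sketched as follows. This is a minimal illustration, not part of the commit: the helper name approx_equal and the tolerance 1e-9 are assumptions chosen for the example, and a suitable tolerance depends on the computation at hand.

```shell
# Hypothetical helper (not a gawk built-in): prints 1 if the absolute
# difference between the first two arguments is within the tolerance
# given as the third argument, and 0 otherwise.
approx_equal() {
    awk -v a="$1" -v b="$2" -v tol="$3" 'BEGIN {
        diff = a - b
        if (diff < 0)
            diff = -diff          # absolute value of the difference
        print (diff <= tol)       # 1 when within tolerance, else 0
    }'
}

# The double precision sum 0.1 + 12.2 is slightly below 12.3, yet it is
# well within a tolerance of 1e-9 of it.
approx_equal 12.299999999999999 12.3 1e-9
```

An absolute tolerance is the simplest choice; for values of widely varying magnitude, a relative tolerance (scaled by the size of the operands) may be more appropriate.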
address@hidden Readline Support
address@hidden Readline Support
+In applications where 15 or fewer decimal places suffice,
+hardware double precision arithmetic can be adequate, and is usually much faster.
+But you do need to keep in mind that every floating-point operation
+can suffer a new rounding error with catastrophic consequences as illustrated
+by our attempt to compute the value of the constant @value{PI}
+(@pxref{Floating-point Programming}).
+Extra precision can greatly enhance the stability and the accuracy
+of your computation in such cases.
-If @command{gawk} is compiled with the @code{readline} library, you
-can take advantage of that library's command completion and history expansion
-features. The following types of completion are available:
+Repeated addition is not necessarily equivalent to multiplication
+in floating-point arithmetic. In the example in
+@ref{Floating-point Programming}:
address@hidden @asis
address@hidden Command completion
-Command names.
address@hidden
+$ @kbd{gawk 'BEGIN @{}
+> @kbd{for (d = 1.1; d <= 1.5; d += 0.1)}
+> @kbd{i++}
+> @kbd{print i}
+> @address@hidden'}
address@hidden 4
address@hidden example
address@hidden Source @value{FN} completion
-Source @value{FN}s. Relevant commands are
address@hidden,
address@hidden,
address@hidden,
address@hidden,
-and
address@hidden
address@hidden
+you may or may not succeed in getting the correct result by choosing
+an arbitrarily large value for @code{PREC}. Reformulation of
+the problem at hand is often the correct approach in such situations.
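One possible reformulation of the loop above, given here as a sketch rather than as the manual's own recommendation, is to drive the loop with an integer counter and derive the fractional value from it, so that no rounding error accumulates in the loop variable. The function name count_steps and the scaling by 10 are assumptions made for this example.

```shell
# Hypothetical reformulation: count the steps from 1.1 to 1.5 in
# increments of 0.1 using an integer loop variable.
count_steps() {
    awk 'BEGIN {
        for (i = 11; i <= 15; i++) {
            d = i / 10    # value for this step; not used as loop control
            n++
        }
        print n
    }'
}

count_steps
```

Because the loop condition compares small integers, which double precision represents exactly, the loop runs five times regardless of how 0.1 is rounded.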
address@hidden Argument completion
-Non-numeric arguments to a command.
-Relevant commands are @code{enable} and @code{info}.
address@hidden Arbitrary Precision Integers
address@hidden Arbitrary Precision Integer Arithmetic with @command{gawk}
address@hidden integer, arbitrary precision
address@hidden Variable name completion
-Global variable names, and function arguments in the current context
-if the program is running. Relevant commands are
address@hidden,
address@hidden,
address@hidden,
-and
address@hidden
+If the option @option{--bignum} or @option{-M} is specified,
+@command{gawk} performs all
+integer arithmetic using GMP arbitrary precision integers.
+Any number that looks like an integer in a program source or data file
+is stored as an arbitrary precision integer.
+The size of the integer is limited only by your computer's memory.
+The current floating-point context has no effect on operations involving integers.
+For example, the following computes
address@hidden
address@hidden,
address@hidden iftex
address@hidden
+5^4^3^2,
address@hidden ifnottex
+the result of which is beyond the
+limits of ordinary @command{gawk} numbers:
address@hidden table
address@hidden
+$ @kbd{gawk -M 'BEGIN @{}
+> @kbd{x = 5^4^3^2}
+> @kbd{print "# of digits =", length(x)}
+> @kbd{print substr(x, 1, 20), "...", substr(x, length(x) - 19, 20)}
+> @address@hidden'}
address@hidden # of digits = 183231
address@hidden 62060698786608744707 ... 92256259918212890625
address@hidden example
address@hidden Limitations
address@hidden Limitations and Future Plans
+If you were to compute the same value using arbitrary precision
+floating-point values instead, the precision needed for correct output
+(using the formula
address@hidden
address@hidden = 3.322 @cdot dps}),
+would be @math{3.322 @cdot 183231},
address@hidden iftex
address@hidden
address@hidden = 3.322 * dps}),
+would be 3.322 x 183231,
address@hidden ifnottex
+or 608693.
+(Thus, the floating-point representation needs about 3.3 bits of
+precision for every decimal digit of the result!)
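The arithmetic behind the 608693 figure can be checked with a short sketch. The function name bits_for_digits is hypothetical, and truncating to an integer is an assumption made to match the figure quoted in the text; rounding up would be the more cautious choice when actually setting PREC.

```shell
# Hypothetical helper: estimate the binary precision needed to hold a
# given number of decimal digits, using prec = 3.322 * dps.
bits_for_digits() {
    awk -v dps="$1" 'BEGIN {
        bits = 3.322 * dps    # 3.322 approximates log2(10)
        print int(bits)       # truncated, to match the figure in the text
    }'
}

bits_for_digits 183231
```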
-We hope you find the @command{gawk} debugger useful and enjoyable to work with,
-but as with any program, especially in its early releases, it still has
-some limitations. A few which are worth being aware of are:
+The result from an arithmetic operation with an integer and a floating-point value
+is a floating-point value with a precision equal to the working precision.
+The following program calculates the eighth term in
+Sylvester's address@hidden, Eric W.
address@hidden's Sequence}. From MathWorld---A Wolfram Web Resource.
address@hidden://mathworld.wolfram.com/SylvestersSequence.html}}
+using a recurrence:
address@hidden @bullet{}
address@hidden
-At this point, the debugger does not give a detailed explanation of
-what you did wrong when you type in something it doesn't like. Rather, it just
-responds @samp{syntax error}. When you do figure out what your mistake was,
-though, you'll feel like a real guru.
address@hidden
+$ @kbd{gawk -M 'BEGIN @{}
+> @kbd{s = 2.0}
+> @kbd{for (i = 1; i <= 7; i++)}
+> @kbd{s = s * (s - 1) + 1}
+> @kbd{print s}
+> @address@hidden'}
address@hidden 113423713055421845118910464
address@hidden example
address@hidden
-If you perused the dump of opcodes in @ref{Miscellaneous Debugger Commands},
-(or if you are already familiar with @command{gawk} internals),
-you will realize that much of the internal manipulation of data
-in @command{gawk}, as in many interpreters, is done on a stack.
address@hidden, @code{Op_pop}, etc., are the ``bread and butter'' of
-most @command{gawk} code. Unfortunately, as of now, the @command{gawk}
-debugger does not allow you to examine the stack's contents.
+The output differs from the actual number, 113423713055421844361000443,
+because the default precision of 53 is not enough to represent the
+floating-point results exactly. You can either increase the precision
+(100 is enough in this case), or replace the floating-point constant
+@samp{2.0} with an integer, to perform all computations using integer
+arithmetic to get the correct output.
-That is, the intermediate results of expression evaluation are on the
-stack, but cannot be printed. Rather, only variables which are defined
-in the program can be printed. Of course, a workaround for
-this is to use more explicit variables at the debugging stage and then
-change back to obscure, perhaps more optimal code later.
+It will sometimes be necessary for @command{gawk} to implicitly convert an
+arbitrary precision integer into an arbitrary precision floating-point value.
+This is primarily because the MPFR library does not always provide the
+relevant interface to process arbitrary precision integers or mixed-mode
+numbers as needed by an operation or function.
+In such a case, the precision is set to the minimum value necessary
+for exact conversion, and the working precision is not used for this purpose.
+If this is not what you need or want, you can employ a subterfuge
+like this:
address@hidden
-There is no way to look ``inside'' the process of compiling
-regular expressions to see if you got it right. As an @command{awk}
-programmer, you are expected to know what @code{/[^[:alnum:][:blank:]]/}
-means.
address@hidden
+gawk -M 'BEGIN @{ n = 13; print (n + 0.0) % 2.0 @}'
address@hidden example
address@hidden
-The @command{gawk} debugger is designed to be used by running a program (with all its
-parameters) on the command line, as described in @ref{Debugger Invocation}.
-There is no way (as of now) to attach or ``break in'' to a running program.
-This seems reasonable for a language which is used mainly for quickly
-executing, short programs.
+You can avoid this issue altogether by specifying the number as a floating-point value
+to begin with:
address@hidden
-The @command{gawk} debugger only accepts source supplied with the @option{-f} option.
address@hidden itemize
address@hidden
+gawk -M 'BEGIN @{ n = 13.0; print n % 2.0 @}'
address@hidden example
-Look forward to a future release when these and other missing features may
-be added, and of course feel free to try to add them yourself!
+Note that for the particular example above, it is likely best
+to just use the following:
+
address@hidden
+gawk -M 'BEGIN @{ n = 13; print n % 2 @}'
address@hidden example
@node Dynamic Extensions
@chapter Writing Extensions for @command{gawk}
-----------------------------------------------------------------------
Summary of changes:
doc/ChangeLog | 2 +
doc/gawk.info |13103 ++++++++++++++++++++++++------------------------
doc/gawk.texi |15603 +++++++++++++++++++++++++++++----------------------------
3 files changed, 14360 insertions(+), 14348 deletions(-)
hooks/post-receive
--
gawk