From: George Zarkadas
Subject: Ver. 3.1.4 & 3.1.3 Windows ports: Chopped record count at large files
Date: Sun, 27 Aug 2006 13:19:08 +0300

gawk reports a much smaller record count than the actual one when processing a large (~275 MB) text file.

This behavior exists in:

  - version 3.1.4, xmlgawk Windows port (downloaded from http://lml.ls.fi.upm.es/~mcollado/xmlgawk/xmlgawk-3.1.4_20040920_mingw.zip)
  - version 3.1.3, gnuwin32 Windows port (downloaded from http://sourceforge.net/projects/gnuwin32/)

but not in version 3.0.4 (mingw Windows port), which gives the correct results (as verified by independent checks).

As a consequence, and consistent with the remark above, gawk fails to extract a subset of records that are located near the end of the file.

Attached are:

1. Results (copied and pasted from the command line) from (a) running the count scripts and (b) extracting the subset [files: count_results.txt and subset_results.txt]
2. The awk scripts in question

Additional information

-- The file upon which the scripts operated contains bibliographic records in BibTeX format (converted from the XML file supplied by the DBLP project, as downloaded from www.vldb.org <http://www.vldb.org/>).

-- The scripts were run on two machines with identical results. Configurations:

     OS:  Windows XP SP2 (EL) on both
     CPU: Pentium M 1.7 GHz | Pentium 4 HT 3.0 GHz
     RAM: 1 GB | 2 GB
     HDD: 80 GB | 400 GB

-- A bug report has also been submitted to the gnuwin32 project (no contact information was found for the 3.1.4 port). However, I have the feeling that this is not Windows-port-specific behavior; hence this bug report.

Kind Regards
George Zarkadas

PS: The original file upon which the scripts acted is not included because of its size (~55 MB zipped), but it will be happily supplied if requested.

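
An independent check of the record count, of the kind mentioned above, can be sketched without awk. The Python sketch below is not one of the attached scripts. It assumes that every BibTeX record starts on a line whose first non-blank character is `@`, which may not hold for every file, and the function name `count_bibtex_records` is hypothetical. The file is opened in binary mode so that the count does not depend on any platform text-mode translation.

```python
def count_bibtex_records(path):
    """Count lines whose first non-blank byte is '@'.

    Assumption: each BibTeX record begins on such a line.
    """
    count = 0
    # Binary mode: bytes are read exactly as stored on disk.
    with open(path, "rb") as handle:
        # Iterating the file object reads one line at a time,
        # so a file of a few hundred MB is never held in memory.
        for line in handle:
            if line.lstrip().startswith(b"@"):
                count += 1
    return count
```

Calling `count_bibtex_records` on the same input file gives a number that can be compared with the totals printed by the count scripts under each gawk version.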
count_results.txt
Description: Text document
count_dblp_bib2.awk
Description: Binary data
count_dblp_bib.awk
Description: Binary data
subset_results.txt
Description: Text document
get_vldb_subset.awk
Description: Binary data
get_vldb_subset2.awk
Description: Binary data