Re: [Bug-wget] How to ensure data completeness/integrity for the file downloaded using wget
Tue, 28 Jul 2009 14:21:34 -0400
On Tue, Jul 28, 2009 at 7:19 AM, Anamika Jindal<address@hidden> wrote:
> We have an open audit issue regarding the files that are pulled from
> external interfaces. We download these files using wget utility. wget
> commands are being called from Pro*C batches e.g. for reference, code is
> something like
> << sprintf (WGET, "%s%s%s/%s.%s", "wget -P ", FEEDFILE_PATH,
> " ftp://username:address@hidden", FileName, "Z"); >>
> Now, the audit issue is to ensure the data integrity and data completeness
> for the file that has been downloaded using wget.
> Option 1 -> Recommended option is of course the checksum approach, in which we
> can get the checksum (any checksum, e.g. MD5, SHA1) of the file on the remote
> server. After that, we can get the checksum of the file on the local server (just
> downloaded using wget). Then we can compare checksums to ensure the file
> has been successfully (and completely) downloaded. I checked on google/wget
> manual. wget does not provide any option to get the checksum, but there
> were functions like gnu_md5.c, don't know why these are used.
> Option 2 -> is to check the File size on remote FTP server. After
> retrieving the file (using wget), our application can compare this file
> size with the file size of retrieved file. If file size does not match,
> an error will be raised. Now wget does not provide any direct option for
> getting the file size. But it gives that information in the output message.
> Now, my requirement is very simple. To ensure the data
> completeness/integrity. Can somebody please suggest which options I should
> use or I can use?? My first preference is to compare checksum.
as you know, a matching file size proves nothing about integrity or
matching checksums; all it tells you is that if the file sizes differ,
the checksums can't match...
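a minimal sketch of that comparison in Python, assuming the expected checksum (and optionally the size) has already been obtained from the remote side by some other means. the helper names file_digest and verify_download are made up for illustration and are not part of wget:

```python
import hashlib
import os


def file_digest(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_download(path, expected_hex, expected_size=None, algorithm="md5"):
    """Hypothetical helper: cheap size check first, then the checksum."""
    # A size mismatch already proves the checksums cannot match.
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    # A matching size proves nothing, so the digest is always compared.
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

the size check is only a shortcut for rejecting truncated files early; the digest comparison is what actually establishes integrity.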
the easiest solution if you're in control of the server would probably
be to use the Content-MD5 header and a download program that supports
it. I don't know if wget does; probably not.
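for reference, the Content-MD5 header carries the base64 encoding of the raw 16-byte MD5 digest of the body (not the hex string). a rough sketch of the check in Python, assuming the header value and the downloaded bytes are already in hand; content_md5_matches is a made-up name:

```python
import base64
import binascii
import hashlib


def content_md5_matches(header_value, body):
    """Hypothetical helper: compare a Content-MD5 header value to the body."""
    try:
        # The header is base64 of the binary digest, so decode it first.
        expected = base64.b64decode(header_value.strip())
    except (binascii.Error, ValueError):
        # A malformed header cannot confirm anything.
        return False
    return hashlib.md5(body).digest() == expected
```

this only helps if the server actually sends the header, which is why it needs control of the server side.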
another (biased) solution is to use metalinks, which are XML files
that list mirrors, checksums, & signatures. metalink clients (wget
does not support it yet) are numerous, & there are GUI and lightweight
command line clients like metalink-checker (python), mulk (libcurl
based) and aria2.
here's an example metalink:
<?xml version="1.0" encoding="UTF-8"?>
<metalink version="3.0" xmlns="http://www.metalinker.org">
<url type="ftp" location="uk"
<url type="http" location="us"
more info at http://en.wikipedia.org/wiki/Metalink
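a small sketch of how a client could pull the checksums and mirror URLs out of a metalink, in Python. the SAMPLE document is a hypothetical one laid out the way I understand Metalink 3.0 (files / file / verification / hash and resources / url); the file name, hash value and example.com URLs are placeholders, and parse_metalink is a made-up name:

```python
import xml.etree.ElementTree as ET

# Hypothetical metalink; every name, URL and hash here is a placeholder.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<metalink version="3.0" xmlns="http://www.metalinker.org/">
  <files>
    <file name="example.Z">
      <verification>
        <hash type="md5">5eb63bbbe01eeed093cb22bb8f5acdc3</hash>
      </verification>
      <resources>
        <url type="ftp" location="uk">ftp://ftp.example.com/example.Z</url>
        <url type="http" location="us">http://www.example.com/example.Z</url>
      </resources>
    </file>
  </files>
</metalink>
"""


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_metalink(text):
    """Return {file name: {"hashes": {type: hex}, "urls": [...]}}."""
    root = ET.fromstring(text.encode("utf-8"))
    files = {}
    for elem in root.iter():
        if local(elem.tag) != "file":
            continue
        hashes, urls = {}, []
        for child in elem.iter():
            name = local(child.tag)
            if name == "hash":
                hashes[child.get("type")] = (child.text or "").strip().lower()
            elif name == "url":
                urls.append((child.text or "").strip())
        files[elem.get("name")] = {"hashes": hashes, "urls": urls}
    return files
```

the hashes returned for a file can then be compared against a digest of the downloaded copy, whichever mirror it came from.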
(( Anthony Bryan ... Metalink [ http://www.metalinker.org ]
)) Easier, More Reliable, Self Healing Downloads