From: Anamika Jindal
Subject: [Bug-wget] How to ensure data completeness/integrity for the file downloaded using wget
Date: Tue, 28 Jul 2009 16:49:00 +0530


We have an open audit issue regarding files that are pulled from 
external interfaces. We download these files with the wget utility, and 
the wget commands are invoked from Pro*C batches. For reference, the code 
looks something like this:
<< sprintf (WGET, "%s%s%s/%s.%s", "wget -P  ", FEEDFILE_PATH,
            " ftp://username:address@hidden", FileName, "Z"); >>
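
Whichever verification is added, the batch can at least check whether wget itself reported a failure. wget exits with status 0 on success and non-zero on failure. The following is a minimal sketch, assuming the command string is run through system(); the helper name run_command is hypothetical and not part of our batch code. A zero status does not by itself prove the file is complete, so this is only a first check.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical helper: run a shell command and return its exit status,
   or -1 if it could not be started or did not exit normally
   (for example, because it was killed by a signal). */
static int run_command(const char *cmd)
{
    int status;

    assert(cmd != NULL);
    status = system(cmd);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

In the batch, the string built by sprintf above would be passed as run_command(WGET), and any non-zero result treated as a failed transfer.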

The audit issue is to ensure data integrity and data completeness 
for a file that has been downloaded using wget. 
Option 1 -> The recommended option is of course the checksum approach: we 
obtain a checksum (any checksum, e.g. MD5 or SHA-1) of the file on the 
remote server, compute the checksum of the local file just downloaded 
with wget, and compare the two to confirm that the file was downloaded 
successfully and completely. I checked Google and the wget manual. wget 
does not provide any option to get the checksum, although I found files 
such as gnu_md5.c; I do not know what these are used for.

 Option 2 -> Check the file size on the remote FTP server. After 
retrieving the file with wget, our application compares that size 
with the size of the retrieved file and raises an error if the two do not 
match. wget does not provide any direct option for getting the file size, 
but it does report it in its output messages:
--2009-07-28 09:52:41--  ftp://....
Resolving http-proxy.gslb.db.com...
Connecting to http-proxy.gslb.db.com||:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 22774 (22K)
Saving to: `C090725.eod'

22,774      --.-K/s   in 0.004s

Last-modified header missing -- time-stamps turned off.
2009-07-28 09:52:43 (5.09 MB/s) - `C090725.eod' saved [22774/22774]


The issue is that I cannot automate this reliably. If my batch parses this 
output, e.g. with grep on the file size or on "100%", it depends on text 
that is not guaranteed to stay the same across wget versions. The output 
text can change in a new version of wget. 
Even with the same version, the output differs when I fetch a different 
file from a different server, so I do not want to rely on it:

Connecting to connected.
Logging in as pardev ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD not needed.
==> SIZE CLO_090722.csv_22-07-2009 ... 5147277786087434
==> PASV ... done.    ==> RETR CLO_090722.csv_22-07-2009 ... done.
Length: 5147277786087434 (4.6P)

 0% [               ] 1,198,444   --.-K/s   in 0.1s

2009-07-28 10:17:37 (8.89 MB/s) - `CLO_090722.csv_22-07-2009' saved 
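
If the expected size can be obtained from a source other than wget's messages, the local side of the check does not need to parse any output. The following is a minimal sketch, assuming the expected size arrives some other way (for example in a control file from the provider, which is hypothetical here); the helper name size_matches is also hypothetical. Note that in the second log above the SIZE reply (5147277786087434) does not match the 1,198,444 bytes actually received, so a server-reported size may itself be unreliable, and a matching size still says nothing about corrupted content.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Return 1 if the size of the local file equals expected_size,
   0 if it differs, and -1 if the file cannot be examined. */
static int size_matches(const char *path, long long expected_size)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0) {
        perror(path);           /* e.g. the download never created the file */
        return -1;
    }
    return (long long)st.st_size == expected_size ? 1 : 0;
}
```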

 Option 3 -> I checked the other options and found this one:
When running Wget with -N, with or without -r, the decision as to whether 
or not to download a newer copy of a file depends on the local and remote 
timestamp and size of the file. 
So we thought that, after downloading the file using wget, we could 
run wget -N again. If that command reports that the file is unchanged, it 
would imply that (timestamp, size) on the local server is the same as 
(timestamp, size) on the remote server. But when I tried this in my 
Production environment, I got this message:
<<Proxy request sent, awaiting response... 400 Bad Request
2009-07-28 09:55:39 ERROR 400: Bad Request.>>

This was working fine with a sample file in the test environment.

My requirement is very simple: to ensure data completeness and 
integrity. Can somebody please suggest which options I should use or 
can use? My first preference is to compare checksums. 

Thanks & Regards,
Anamika Jindal
