Re: [Lynx-dev] circumventing blocking sites
From: Stefan Caunter
Subject: Re: [Lynx-dev] circumventing blocking sites
Date: Sat, 4 Feb 2017 12:06:39 -0500
On Sat, Feb 4, 2017 at 11:28 AM, Nelson H. F. Beebe <address@hidden> wrote:
> For several years, I have used lynx (and also wget, and rarely, curl)
> to access publisher Web pages for new journal issues. Recently, I
> noticed that a lynx pull of a page from Elsevier ScienceDirect would
> never complete:
>
> % lynx -source -accept_all_cookies -cookies --trace \
>     http://www.sciencedirect.com/science/journal/00978493/62 > foo.62
>
> parse_arg(arg_name=http://www.sciencedirect.com/science/journal/00978493/62,
> mask=1, count=5)
> parse_arg
> startfile:http://www.sciencedirect.com/science/journal/00978493/62
> ... no further output, and no job completion ...
>
> Similarly, wget and curl fail to complete.
>
> This new behavior suggests that the publisher site has thrown up
> user-agent-specific, rather than IP-address-specific, blocks, because
> accessing the same URL in a GUI browser on the SAME machine gets an
> immediate return of the expected journal issue contents.
>
> If I add the --debug option to wget, I find that it reports
>
> ---request begin---
> GET /science/journal/00978493/62 HTTP/1.1
> User-Agent: Wget/1.14 (linux-gnu)
> Accept: */*
> Host: www.sciencedirect.com
> Connection: Keep-Alive
>
> ---request end---
>
> Thus, it identifies itself as wget, and I assume that lynx probably
> self-identifies as well.
>
> Does anyone on this list have an idea how to circumvent these apparent
> blocks?
>
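It does: lynx sends its own User-Agent string by default (something like
"Lynx/2.8.x libwww-FM/..."). A quick way to see exactly what any of these
tools send is to point them at an echo service; httpbin.org is used here
purely as a convenient example:

lynx -dump https://httpbin.org/user-agent
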
Put -useragent="Googlebot" or -useragent="Mozilla" on your command line:

lynx -useragent="Mozilla" -accept_all_cookies -dump \
  http://www.sciencedirect.com/science/journal/00978493/62

which gets me a long list of links in the HTML result.
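
If the block really is keyed on the User-Agent header, the same approach
should work for wget and curl, both of which have standard options to
override it (a sketch; "Mozilla" is just an illustrative value, and some
sites want a fuller browser-like string):

# wget: replace the default "Wget/1.x (linux-gnu)" identification
wget --user-agent="Mozilla" -O foo.62 \
  http://www.sciencedirect.com/science/journal/00978493/62

# curl: -A (long form --user-agent) does the same
curl -A "Mozilla" -o foo.62 \
  http://www.sciencedirect.com/science/journal/00978493/62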