lynx-dev



From: Bob
Subject: examples Re: lynx-dev making lynx traversal crawl download html, not text
Date: Fri, 22 Mar 2002 23:05:55 -0500

I don't think lynx perceives what yahoo is doing as a redirect that
could be ignored via CERN rules. What I want is this
http://groups.yahoo.com/group/AkhaWeeklyJournal/message/120
and on every fourth message (i.e. 120, 124, 128) yahoo puts me here instead
http://groups.yahoo.com/group/AkhaWeeklyJournal/interrupt?st=2&ln=AkhaWeeklyJournal&m=1&done=%2Fgroup%2FAkhaWeeklyJournal%2Fmessage%2F120

with an href on the page which is
http://groups.yahoo.com/group/AkhaWeeklyJournal/message/120
so effectively you have to request the same URL twice.

Here's an example of a page that won't download until it's requested
twice in the same session. "Continue to message" is just a link to
the same URL, followed a second time in the same session once the
cookies have been accepted.

http://groups.yahoo.com/group/AkhaWeeklyJournal/message/120

Between requesting that the first time (plus cookies) and hitting the
link, which is a second request for the same file, yahoo gives you a
bunch of crap like the ad page
http://groups.yahoo.com/group/AkhaWeeklyJournal/interrupt?st=2&ln=AkhaWeeklyJournal&m=1&done=%2Fgroup%2FAkhaWeeklyJournal%2Fmessage%2F120

and then if you're coming from there and request the same URL
via the href link, you can have it. Conceivably a stack of CERN
rules written by a script could say: if yahoo gives you that crap,
give yahoo that crap back; if yahoo then does this, do that; all
in the "redirect" language, tailored to yahoo's "interrupt?" page.
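
One thing that might help a script here: the interrupt URL already
carries the wanted page in its done= parameter, percent-encoded. A
minimal sketch in C, not Lynx code (the helper name done_target is
made up for illustration, and it assumes the parameter is always
spelled "done="), that pulls the value out and decodes it so the
original message can be requested again:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = tolower(c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Hypothetical helper (not part of Lynx): find the "done=" query
 * parameter in url and percent-decode it into out.
 * Returns 0 on success, -1 if the parameter is absent or out is
 * too small.
 */
int done_target(const char *url, char *out, size_t outlen)
{
    const char *p = strchr(url, '?');
    size_t n = 0;

    if (p == NULL || outlen == 0)
        return -1;
    p++;
    /* Walk the query string one parameter at a time. */
    while (strncmp(p, "done=", 5) != 0) {
        p = strchr(p, '&');
        if (p == NULL)
            return -1;
        p++;
    }
    p += 5;
    /* Copy the value, turning each %XX into the byte it encodes. */
    while (*p != '\0' && *p != '&') {
        int hi, lo;
        if (n + 1 >= outlen)
            return -1;
        if (*p == '%' && (hi = hexval((unsigned char)p[1])) >= 0
                      && (lo = hexval((unsigned char)p[2])) >= 0) {
            out[n++] = (char)(hi * 16 + lo);
            p += 3;
        } else {
            out[n++] = *p++;
        }
    }
    out[n] = '\0';
    return 0;
}
```

Fed the interrupt URL above, this should leave
/group/AkhaWeeklyJournal/message/120 in out, which is the path to
request the second time.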

OR, a code way to get -traversal -crawl -source:


> I don't find anywhere -traversal or -crawl use srcmode_for_next_retrieval,
> so that we could get html instead of text by srcmode_for_next_retrieval(1)
> instead of (0) or (-1). I'm looking elsewhere now.
>
> OR
>
> Since all I need to do is have lynx try to open a URL, satisfy the
> cookie demands, then request the same URL a second time to go around
> yahoo's ad page with its "Continue to message" link (which just
> requests the same URL a second time), could I send a GET for the URL
> twice on stdin, or once on the command line and then GET again?
>
> OR
>
> If view mode were set to default to "source" rather than "presentation"
> text mode, -traversal -crawl might download html.
>
> OR
>
> If -source was changed in the following way, -traversal -crawl -source
> might not quit on the first link like -dump, and might keep on going in
> source mode download to the *.dat files.
>
> The way it is now, -source will make lynx quit on the first download:
>
> /* -source */
> PRIVATE int source_fun ARGS1(
>  char *,   next_arg GCC_UNUSED)
> {
>     dump_output_immediately = TRUE;
>     HTOutputFormat = (LYPrependBase ?
>         HTAtom_for("www/download") : HTAtom_for("www/dump"));
>     LYcols = MAX_COLS;
>     return 0;
> }
>
> could be
>
> /* -source */
> PRIVATE int source_fun ARGS1(
>  char *,   next_arg GCC_UNUSED) {
>     /* dump and quit only when neither -traversal nor -crawl is set */
>     dump_output_immediately = FALSE;
>     if ( traversal != TRUE && crawl != TRUE ) {
>       dump_output_immediately = TRUE;
>     }
>     HTOutputFormat = (LYPrependBase ?
>         HTAtom_for("www/download") : HTAtom_for("www/dump"));
>     LYcols = MAX_COLS;
>     return 0;
> }
>
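
One catch with testing traversal and crawl inside the -source handler:
if options are handled in command-line order (an assumption here, not
checked against the Lynx source), the test only sees flags that came
before -source. A minimal stand-alone sketch of deciding after all
options are read; struct opts and parse_opts are made-up names, not
Lynx's:

```c
#include <string.h>

/* Stand-ins for Lynx's globals; names and layout are illustrative only. */
struct opts {
    int traversal;
    int crawl;
    int source;
    int dump_output_immediately;
};

/* Record the flags while scanning; decide nothing until the end. */
void parse_opts(struct opts *o, int argc, char **argv)
{
    int i;

    memset(o, 0, sizeof *o);
    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-traversal") == 0)
            o->traversal = 1;
        else if (strcmp(argv[i], "-crawl") == 0)
            o->crawl = 1;
        else if (strcmp(argv[i], "-source") == 0)
            o->source = 1;
    }
    /* Every flag is known now: dump and quit only when -source is
       used without -traversal or -crawl, whatever the order was. */
    o->dump_output_immediately =
        o->source && !o->traversal && !o->crawl;
}
```

With this shape, "-source -traversal -crawl" and "-traversal -crawl
-source" come out the same.
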
> That's not enough, though, since -traversal and -crawl would
> be downloading files, not just sending to stdout as -source.
>
> -traversal and -crawl build a links table
>
>         links[curdoc.link].lname
>            add_to_table(curdoc.address)
>
> which they download into *.dat files via
>
>         sprintf(cfile,"lnk%08d.dat",ccount);
>
> Are the curdocs referenced in that table in source format?
> Not since they are sprintfable?
>
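
For what it's worth, the sprintf call only builds the output file
name from a counter; by itself it says nothing about whether the
contents are source or rendered text. A small sketch, assuming the
intended format is "lnk%08d.dat" (crawl_filename is a made-up
wrapper, and snprintf is used in place of sprintf to bound the write):

```c
#include <stdio.h>

/*
 * Build the crawl output name for a given counter value.
 * Returns 0 on success, -1 if buf is too small or formatting fails.
 */
int crawl_filename(char *buf, size_t buflen, int ccount)
{
    int n = snprintf(buf, buflen, "lnk%08d.dat", ccount);

    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

A counter of 120 gives lnk00000120.dat: the number is zero-padded to
eight digits, so the files sort in crawl order.
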
> -Bob
>
> Thomas Dickey wrote:
>
> > On Fri, Mar 22, 2002 at 07:30:19PM -0500, Bob wrote:
> > > Either -dump or -source restrict the download to one file
> > > only, correct?
> > >
> > > I was hoping to iterate the crawl with downloading in
> > > html format.
> > >
> > > Perhaps there is a mode=1 set somewhere, instead of
> > > mode=0, if srcmode_for_next_retrieval() is called from
> > > somewhere? Or?
> >
> > I only see srcmode_for_next_retrieval() called with constant parameters:
>
> So, in one of those places where the call is made with parameter
> (0) or (-1) it might be nice if that was in a process under -traversal
> or -crawl. Then I would put (1) there instead. I'll start looking at--
>
> src/LYMainLoop.c:3819:  srcmode_for_next_retrieval(0);
> src/LYMainLoop.c:4380:  srcmode_for_next_retrieval(-1);
> src/LYMainLoop.c:4407:  srcmode_for_next_retrieval(0);
> src/LYMainLoop.c:4472:  srcmode_for_next_retrieval(0);
> src/LYOptions.c:3039:      srcmode_for_next_retrieval(0);
> src/LYOptions.c:3049:      srcmode_for_next_retrieval(0);
>
> -Bob
>
> > src/LYGetFile.c:1118:PUBLIC void srcmode_for_next_retrieval ARGS1(
> > src/LYGetFile.h:11:extern void srcmode_for_next_retrieval PARAMS((int));
> > src/LYMainLoop.c:3802:              srcmode_for_next_retrieval(1);
> > src/LYMainLoop.c:3819:                  srcmode_for_next_retrieval(0);
> > src/LYMainLoop.c:4236:  srcmode_for_next_retrieval(1);
> > src/LYMainLoop.c:4380:  srcmode_for_next_retrieval(-1);
> > src/LYMainLoop.c:4385:  srcmode_for_next_retrieval(1);
> > src/LYMainLoop.c:4407:  srcmode_for_next_retrieval(0);
> > src/LYMainLoop.c:4447:          srcmode_for_next_retrieval(1);
> > src/LYMainLoop.c:4469:      srcmode_for_next_retrieval(1);
> > src/LYMainLoop.c:4472:      srcmode_for_next_retrieval(0);
> > src/LYOptions.c:3032:       srcmode_for_next_retrieval(1);
> > src/LYOptions.c:3039:               srcmode_for_next_retrieval(0);
> > src/LYOptions.c:3049:           srcmode_for_next_retrieval(0);
> >
> > --
> > Thomas E. Dickey <address@hidden>
> > http://invisible-island.net
> > ftp://invisible-island.net
>
> ; To UNSUBSCRIBE: Send "unsubscribe lynx-dev" to address@hidden

