lynx-dev

[Lynx-dev] Issues with -dont_wrap_pre and -nomargins


From: Claus Strommer
Subject: [Lynx-dev] Issues with -dont_wrap_pre and -nomargins
Date: Wed, 16 Sep 2009 03:17:23 -0400

Hello all. I am using lynx to convert an archive of html files into plaintext for information retrieval. The command that I use is

lynx -nounderline -notitle -nocolor -nomargins -nolist -nobold -nonumbers -force_html -dump -dont_wrap_pre <file>

It works almost perfectly, except for one minor issue, and I am not sure whether it is a bug or something I am doing wrong. When I parse the attached a.html file, some of the words are printed without a whitespace separator:

"...However, in order to fully develop our vision of the next version of Twingle, we needed more control over the fine nuances of searching through email. And, asthe next phase of the Twingle development is to include a downloadable versionof the software, we needed it to make it easier for people to install - when the lead developer gave up after 6 hours of trying to get it all working on his own machine at home we knew we had a problem!..."

'asthe' should be 'as the', 'versionof' should be 'version of', and so on. AFAIK, this is not an input error: the words are separated correctly when I skip either the -dont_wrap_pre or the -nomargins option. As these errors occur near every 80th character (n*80) in a paragraph, I can only assume that some part of the parsing goes awry there. The errors occur in the 1.8.6-rel5 (macports), 1.8.6-rel4 (ubuntu) and latest 1.8.8 builds.
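
One way to check where the separators go missing is to compare the expected paragraph with the dumped one and record the offset of every dropped space. The following is only a minimal Python sketch: the function name `find_dropped_spaces` is made up for illustration, and it assumes both texts are available as single-line strings that differ only by missing spaces.

```python
from typing import List


def find_dropped_spaces(expected: str, dumped: str) -> List[int]:
    """Return the 0-based offsets in `expected` of spaces missing from `dumped`.

    Assumption: `dumped` equals `expected` except that some spaces
    have been dropped; any other difference raises ValueError.
    """
    offsets = []
    i = j = 0
    while i < len(expected) and j < len(dumped):
        if expected[i] == dumped[j]:
            # Characters agree, advance in both strings.
            i += 1
            j += 1
        elif expected[i] == " ":
            # A space in the expected text has no counterpart in the dump.
            offsets.append(i)
            i += 1
        else:
            raise ValueError(
                "texts differ by more than dropped spaces at offset %d" % i
            )
    return offsets
```

Called as `find_dropped_spaces("as the next version of", "asthe next versionof")`, it returns `[2, 19]`. Running it on a whole paragraph would show whether the offsets really cluster around multiples of 80.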

So my question is: Is there anything I can do to work around this? I would very much like to keep using these two options, as it is important to me to be able to distinguish between lines and paragraphs. I am even willing to use other tools, if you can suggest any - but as far as I've seen, lynx is the only one which gives the desired options. Also, I'd like to stay away from the -width option (it does not allow me to specify infinite width, AND it breaks with tables - the attached b.html, for example).


Plucene

# posted by Tony
Mon, February 2, 2004

We have been working for some time on Twingle, a tool for finding information in your email. The original, server-based version of this used the Jakarta Lucene engine, through the Perl-Java bridge of Inline::Java.

However, in order to fully develop our vision of the next version of Twingle, we needed more control over the fine nuances of searching through email. And, as the next phase of the Twingle development is to include a downloadable version of the software, we needed it to make it easier for people to install - when the lead developer gave up after 6 hours of trying to get it all working on his own machine at home we knew we had a problem!

For the last year we have employed renowned Perl guru Simon Cozens to work with us on Twingle, and as his final project we asked him to port Lucene to Perl.

And so, Kasei is proud to announce the release of Plucene - a Perl port of Lucene. As with much of the software produced by Kasei it is released as open source (In the past year Simon, Tony, Marty and Marc have released over 60 Perl modules).

We'd like to publicly thank Marc and Simon for all their hard work on Plucene, and wish Simon all the best on his future projects.


Unhappy Birthdays

# posted by Karen
Sat, December 6, 2003

Although we originally promised to use this weblog to rant about the sorry state of ecommerce, it's still my preferred means of shopping. At Christmas this is even truer, as traditional shopping becomes an absolute nightmare. More than ever, shop assistants seem to resent being interrupted from their job by customers, as they have so many more important things to do like stacking the shelves, answering the phones and chatting to their friends.

On Friday, needing a decorative gift bag for a bottle of vodka, I tried Birthdays, where I noticed that the one I liked was in a special 3 for 2 offer. Knowing that I could easily use the others I decided to take the 3. I fought my way through the narrow aisle to the counter, only to discover that there was only one shop assistant serving, and that I had to fight my way half-way back down the other narrow aisle to the end of the queue. Three other employees hovered by the stock room door, presumably so they could make a quick getaway if any one approached to ask a question.

When it was finally my time to be served, I handed over my three bags, and prepared to pay my £4.98. When I was instead asked for £7.47, I explained that the bags were in the offer. The shop assistant retorted that as there weren't any promotional stickers on the bags they could not be on offer. When I tried to explain that I had deliberately searched for bags without the stickers as I didn't think I would be able to remove them without damaging the bags, she called for the manager, who, without checking into the matter, reiterated that if they had no stickers they couldn't be in the offer. When I again pointed out that other identical bags had the stickers, she snippily stated that "someone must have just stuck those on".

As the manager disappeared back into the stock room, the assistant asked if I still wanted to buy the bags, and looked really put out when I told her that I would only buy one bag. As this meant cancelling the other items, she needed to get the manager to come back out again to change the details on the till. By now the queue behind me stretched the whole length of the shop, but of course none of the loitering staff deigned to open up another till.

When the manager finally returned to adjust the till, she discovered that the bags actually were on offer, that the till had automatically taken the discount, and what was my problem? Losing patience I asked what made her think that £2.49 times two was £7.47? Looking puzzled she checked the till again, and discovered that the assistant had actually rung through four bags. They removed one, and with the matter now "solved", perfunctorily took my £5 and moved on to the next customer.

Like most customers I won't make any sort of complaint about this. Instead, I'll do what Sam Walton always took great pains to warn his staff about:

There is only one boss: the customer. And he can fire everybody in the company from the chairman on down, simply by spending his money somewhere else.


The Value of Tiny Tasks

# posted by Tony
Fri, October 24, 2003

IT Week this week references the Web Effectiveness Report 2003, which, amongst other things, reveals that only 19% of web site managers surveyed review their log files to look for problems with their site.

When we originally built BlackStar we were fairly novice Perl programmers. We knew that all Perl should really have "strict" mode and "warnings" turned on - so we made sure we did that. But as long as everything ran correctly, we didn't really pay much attention to the warnings that were emitted.

But after a while we realised that the profusion of verbiage cluttering up our logs was highly distracting, and would ensure that nobody really paid them much attention - and the useful information about real problems would be obscured.

So we decided to clean this up, to ensure that anything appearing in the log would most likely be a symptom of a real-life actual problem, of which someone should probably be notified straightaway.

This was much easier to decide than to actually implement, however. We were introducing all manner of new quality procedures at the time, so it was relatively easy to at least decree that any new code, or any alteration of existing code, should be free from warnings. This way we could at least halt the growth of new warnings (although with site visits growing 30-40% per month, the volume of messages was still growing quickly). We even scheduled into the programming schedule a little time devoted to removing some of the most egregious offenders, which probably removed about 50% of the warnings.

The slow clean up of having the old code gradually tidied in passing wasn't going to get us anywhere fast enough though. So we introduced a new system that was disarmingly simple, but remarkably effective. Every night a process collated the error logs from across the various web servers, and generated a report of the ten most common problems. This report would be emailed to the entire programming team, and the next day one person would take responsibility for ensuring that the top item on that list vanished.
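
The nightly report described above can be sketched in a few lines. This is only an illustration in Python, not the system BlackStar actually ran: the log line format (a bracketed timestamp followed by the message) and the names `normalise` and `top_problems` are assumptions made for the example.

```python
import re
from collections import Counter


def normalise(line):
    """Drop the leading [timestamp] so repeated problems group together.

    Assumption: each log line looks like "[timestamp] message".
    """
    return re.sub(r"^\[[^\]]*\]\s*", "", line.strip())


def top_problems(log_lines, n=10):
    """Return the n most common messages as (message, count) pairs."""
    counts = Counter(normalise(line) for line in log_lines if line.strip())
    return counts.most_common(n)
```

Feeding it the collated lines from every web server and mailing the result gives the team a ranked list, and the first entry is the item someone takes responsibility for the next day.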

This approach proved very effective and within a few months the volume of warnings had decreased dramatically. It was so successful that we started to apply the approach to many other areas of the business. Any problem that was monitored over time was a candidate for it - and with an entirely web based business we had a lot of data to monitor. Someone became responsible for checking the list of the most common search terms that returned no results, and adding a re-mapping so that that search would automatically be transformed into what the person probably wanted. Someone else would ensure that the most visited DVD page that didn't provide cross-reference to the VHS of the same item was given that linkage. Someone in the warehouse would upload the cover of the most visited item in stock that hadn't previously been scanned.

Most of these tasks took less than 5 minutes. They would rarely be a top priority in amongst the constant fire fighting, growth-pains, and breakneck pace of developing new features and systems in the crazy world of exponential growth. In normal circumstances none of these things would even have appeared in someone's list of top 10 priorities. But the simple action of ensuring that each day the top occurrence of each problem was removed created a staggering cumulative effect.

The art of time management is usually a matter of ranking your items by importance and urgency, and prioritising according to how high things appear on each axis. But most books, articles and seminars on the topic stop there. Spending a few minutes each day doing something that is neither particularly important, nor particularly urgent, but that has a beneficial outcome, has value. When everyone in an organisation is doing likewise, and those tasks are automatically selected based on their potential benefit, that value can be enormous.


They Blinded Them With Science

# posted by Marc
Thu, September 25, 2003

Recently, we were told that Mars was closer to the earth than it has been at any time in the past 60,000 years. This gained much media attention, and sparked a wider interest in astronomy amongst the general public.

This was followed by the revelation that the Earth was in danger of being hit by an asteroid. As this seemed to be of considerable significance, rather than just a general curiosity, all the data was re-examined, and the recalculations indicated that we should be safe after all, as there was only a 1 in 909,000 chance of it hitting.

One might think that, in the light of such a dramatic reinterpretation of the data, someone might similarly return to the Mars data to see if the figures there were incorrect too. But although both cases are essentially the same mathematically, the approach and resulting presentation were very different.

It seems that someone, somewhere, noticed that Mars seemed to be getting closer to the Earth. Curious as to whether or not this was so, they gathered the data from recent observations, and traced its orbit back in time to estimate its more historic position. Out of this popped the media-friendly statistic that the last time Mars was so close was 60,000 years ago. A sensible margin of error here might be 1%, but "60,000 +/- 600 years" doesn't make for as nice headlines.

So why is one portrayed as much more accurate than the other? Of course, there are differences in the calculations, but essentially it comes down to scientists presenting their information in a way to make it more palatable to the public. But should science be something which can be played in a particular way in order to make it more acceptable to certain audiences? Science is as much value-judgement based as anything else, and the universe is not the large clockwork instrument the Victorians believed it to be.

What is really at stake is the different way that issues can be presented to the public. And especially when it comes to anything technical, be it physics, chemistry or computer science, the general public are all too willing to believe what they are being told. The gurus all tell us about the Next Big Thing. The scientists all tell us How It Is. All the critical analysis is done for us, and as a result we accept sloppy reporting without question.

We all want to believe that the latest methodology or technology will make us better people, more productive workers, and even rich. But the Latest Thing may just be the Same Old Thing from a different angle. It is important not to forget any lessons we have already learnt, not to instantly abandon what we know to be right just to keep up with the all-singing, all-dancing bandwagon.


All I Really Need to Know About Programming I Learned From Fairy Tales

# posted by Marc
Mon, August 4, 2003

Software development methodologies are designed to help us produce cleaner, better and more maintainable code. Books and journals are produced at a staggering rate, filled with the latest answers to all our coding and maintenance woes. As programmers we are all by now expected to be hyper-productive and error-free.

The names change, but the ideologies remain similar. Beck and Fowler may have displaced Weinberg and Constantine, who in turn displaced Knuth and Kernighan, but these are all mere pretenders. As Isaac Newton said, albeit in sarcasm, we all stand on the shoulders of giants. In order to see where all this wisdom originated, we must travel back further still - back to the tales we were told as children, the stories we heard at our mother's knee. We need to rediscover the truths that are in our folklore.

I believe the time is ripe for significantly better documentation of programs [where the programmer] chooses the names of variables carefully and explains what each variable means. - Donald Knuth

A fine sentiment from the ever-wise Knuth. Of course, it is common sense to name your variables wisely. It adds an extra layer of meaning to your code, which can greatly ease future maintenance. But doesn't that sound like a familiar lesson? Where did we first come across the power of naming? Rumpelstiltskin, of course. Think back to the difficulties the Queen faced in puzzling out the naming convention there!

Or consider the interface to a software library. This is notoriously hard to get right, and the price of getting it wrong is even higher than badly named variables. So software developers learn about the principle of encapsulation. Rather than displaying the innards for all to see, we should hide away all the workings, providing a clean and usable front end. This is, of course, a great piece of advice. But again it is neither new, nor original. In fact, this information has been around for a very long time. Consider the tale of Hansel and Gretel. If the children had been able to see the cauldrons, potions, jail cells and oven hidden in the witch's house, they would never have gone near it.

But instead, the witch encapsulated all the horrible things, and put all the things that children like on display. (Of course, the witch had a bug in her system, which allowed the children to escape, but that is a different matter entirely!)

Even the issues involved in choosing a software library were implanted when we were youngsters. I spend most of my time programming in Perl, which has one of the largest public resources of any language: the Comprehensive Perl Archive Network (CPAN). One of the benefits of these sorts of libraries is that any problem you are trying to solve may already have been solved by someone before you who has released their code for free use. In fact this has probably happened more than once, in many different ways. So it is always good to choose amongst these libraries carefully, in order to find the one that will solve your own particular problem, in the best manner. Of course we learnt this in our youth from Goldilocks. She didn't like the big bowl of porridge, so she tried the middle sized one, which still wasn't to her liking, but the littlest one was exactly what she was looking for. (Of course, there are lessons to be learnt in this story about hacking and computer misuse, but that's for another day!)

Or consider testing. The recent rise of the agile methodologies, such as XP, has re-awakened developers to the power of testing. A good test suite, with full regression tests, can save a project. In fact, some go as far as to recommend you write your tests before you write any code.

But again this is hardly a new concept. Remember Chicken Licken? Unable to correctly identify a problem, he believed his entire system was failing, and led his entire project to disaster. With only a simple test suite he could have been much more confident in his environment, and quickly recognised the problem solely as untested external input.

I could go on and on. There are many other tales which also show that far from being a new science, Computer Science has just tapped into our universal stories and dressed them up with new terminologies. But hopefully this small taster demonstrates that there may be alternative sources when you're looking for further information on a new concept you've discovered in your favourite buzzword-compliant journal.




Monday

Monday November 28th 2005, 11:13 pm
Filed under: General

Yes, Ben, I should have done something. Anything would have been better than nothing. Another perhaps missed opportunity. This pit in my stomach will go away and soon enough be replaced by some other, yet to be known girl. But, honestly, who could have expected any other outcome? Therein lies my greatest disappointment, that this is the expected behavior.

Cole, I added a link to the photoblog over there —>. Hope you enjoy it. I will take any other suggestions on further tweaking of the link, if you have any.

Well, Thanksgiving went well. No missed flights going. Woke up on time and all that. Good to see the fam, and I was pleasantly surprised to have KB and Clay meet me at the airport along with the expected parents. Being in Canyon was altogether uneventful and restful, as it should be.

Getting back, however, involved snow in Canyon and Amarillo, leading to an hour delay in leaving and missing my connection by probably about 20 minutes. Which of course led to my fun with standby and the pretty girl whom I’ll never see again. (Boo hoo me. Insert world’s tiniest violin, playing blah blah blah).

I would upload a pic, but my reader isn’t being recognized by my computer, has locked it up once already (due to what I am assuming is dust in the usb ports), and to make matters worse, when I restarted, I had to reboot again to go to runlevel 1 to deal with my misconfigured resin server and manually load modules in the correct order to get my ethX devices in the proper order. I suck, I know. I will take suggestions on long term solutions to either of the aforementioned problems. So, if I get it figured out (I guess I can get the laptop out, although I doubt that I will have much more success), then you might get to see a pic, else it will happen some other time. USB SUCKS! Firewire rules.

That is all.



Little Thing

Monday November 28th 2005, 1:30 am
Filed under: Women

So here I am chowing down twin apple pies from McDonalds writing to you straight from A-town (Atlanta that is). Finally got back from Canyon today after fun with missed connections, et al. But that is not the reason that I stream my conscious to you right before I crash so I can get up to go to work tomorrow. No, the reason that I blog to you tonight is women. Hmm, let me check that little box to make this post go into the women’s category. Ah, much better.

Ok, so most of you should have seen this one coming, especially since I apparently have the worst luck with weather whilst trying to get back to Atlanta from the panhandle (it snowed today in the panhandle for those of you at home!). With poor weather comes lots of down time at the airport. So too much time on your hands and plenty of pretty women chilling at the airport with you makes me wonder when I will ever grow some balls? Seriously. There was this one girl in particular. While I was waiting to try to fly on standby so that I wouldn’t have to stay overnight and fly out at 1pm tomorrow, there happened to be a very pretty girl waiting in the same little area. So, she was hot and I didn’t say anything to her, and that was that. As it turned out she was flying standby to Atlanta as well, and she got to fly out on that flight and I got the opportunity to try and catch the next one. How fortuitous and crappy at the same time. Anyway, if any of you have ever heard the song “Little Thing” by Dave Matthews, then you will know something analogous to what I’m currently feeling. Anyhoo, I do this every once in a while and I always kick myself.

After I got on the plane that finally brought me home, I was trying to get comfortable, but I couldn’t really, because I’m too dad gummed tall. I thought to myself, “self, don’t you hate being tall?”. Which led me to the following pickup line:

Excuse me, but don’t you hate being so beautiful, because every time you’re at an airport random guys walk up to you and strike up a conversation as if they know you, when all you really want to do is to get out of this frickin’ airport… get back to your home, to your boyfriend, and to a nice, hot bubblebath.

I should take bets on who thinks that next time I fly home I could find some similarly hot girl and say that to her.

In other news I’ll post some pics to the photoblog from the past couple of days. I probably have a couple or 3 to post…

And if anyone would like to submit a 1500–2000 word essay on why I have no cojones, then feel free.



Mauri

Friday November 25th 2005, 10:58 pm
Filed under: General

NOBODY TOLD ME MAURI HAD A BLOG!

Check out her new blog at Xanga.



Lights

Tuesday November 22nd 2005, 12:17 am
Filed under: General

Everyone must download this now! It is glorious! Whoever did this has way too much time and money on their hands…



Poll

Monday November 21st 2005, 8:56 pm
Filed under: General

I’m sure that none of you would ever expect me to pose this question, but why are strapless wedding dresses in now?



Punch Drunk Love

Thursday November 17th 2005, 11:21 pm
Filed under: General

What the hell was that about? Seriously… I’m so confused, but I liked it.

Oh, BTW, guess who just got offered an upgrade from contractor, bitches? :D



WC

Wednesday November 16th 2005, 12:05 am
Filed under: General

Funniest bathroom sign, ever.
http://www.csd.net/~seawall/sunday.jpg



Firefly

Monday November 07th 2005, 11:17 pm
Filed under: General

Firefly deserved 6 more seasons and to have gone out on top.



Pulp Fiction

Sunday November 06th 2005, 1:04 am
Filed under: General

I watched it again for the first time tonight. I tried to watch it some time ago (during high school, I think), but for whatever reason, I couldn’t watch it all the way through. I guess since then my tastes have changed a lot, and I have a greater appreciation for these kinds of movies and Mr. Tarantino as well. Anyway, I liked it a lot. Good movie.



I want to know…

Saturday November 05th 2005, 1:25 pm
Filed under: General

the whats that lie ahead.
the whys of the path that has brought me here thus far.
the whos of my future.


 


Copyright © Thomas, All Rights Reserved
Conestoga Street Wordpress Theme by Theron Parlin

