[aspell-devel] Checking of word-marginal specials


From: Ciarán Ó Duibhín
Subject: [aspell-devel] Checking of word-marginal specials
Date: Thu, 20 Jun 2013 18:08:40 +0100

This is the second part (change #2) of my consideration of apostrophes and hyphens in aspell.  The first part (change #1) was "Tokenization of word-initial specials" dated June 14 2013.
 
Currently, when *.dat marks the apostrophe as valid word-initially, the dictionary form well validates the token 'well (in addition to the token well).  Likewise, when *.dat marks the apostrophe as valid word-finally, the dictionary form well also validates the token well' .  However, neither 'well nor well' should ever be validated by the form well; each should be accepted only if that exact form is present in the dictionary.
 
There are two cases.  When the apostrophe is encountered in a token in a position, initial or final, where it IS NOT valid in *.dat (and note that this applies to en.dat), it is immediately dropped from the token, and only the token without the apostrophe is checked against the dictionary.  (Before change #1, even a valid initial apostrophe was dropped from the token, but not a valid final apostrophe.)  So if the purpose of "trying the token without the special" is to accept an English token which has picked up a neighbouring quotation mark, that situation never reaches the comparison, and removing the retry will have no effect on it.
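To make the first case concrete, here is a minimal C++ sketch of the tokenization behaviour described above, as it stands after change #1.  It is not aspell's tokenizer: SpecialRule and strip_invalid_margins are hypothetical names invented for illustration, and the two flags merely stand in for the initial and final validity that *.dat records for a special.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for what *.dat says about one special character.
struct SpecialRule {
    char ch;           // the special, e.g. the apostrophe (ASCII hex 27)
    bool valid_begin;  // may the special start a word?
    bool valid_end;    // may the special end a word?
};

// Drop the special from either margin of the token wherever the rule
// does not allow it there; a valid marginal special is left in place.
std::string strip_invalid_margins(std::string token, const SpecialRule& rule) {
    while (!token.empty() && token.front() == rule.ch && !rule.valid_begin)
        token.erase(token.begin());
    while (!token.empty() && token.back() == rule.ch && !rule.valid_end)
        token.pop_back();
    return token;
}

// The cases discussed in the text, written as checks.
void check_strip_examples() {
    const SpecialRule neither{'\'', false, false};  // as described for en.dat
    const SpecialRule final_only{'\'', false, true};
    assert(strip_invalid_margins("'well'", neither) == "well");
    assert(strip_invalid_margins("'well'", final_only) == "well'");
}
```

With neither margin valid, the comparison stage only ever sees well, which is why the retry can never be what rescues a quoted English word.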
 
In the other case, when the apostrophe is encountered in a token in a position, initial or final, where it IS valid in *.dat, the token should be accepted only if the dictionary contains the word including the apostrophe.  The current practice of accepting the token merely because the corresponding form without the apostrophe is in the dictionary amounts to accepting an invalid word, possibly one resulting from a mistaken use of the apostrophe (ASCII hex 27) as a quotation mark.  (Remember that languages which accept valid word-marginal apostrophes in *.dat do not use ASCII hex 27 as a quotation mark.)
 
The code for "trying the token with and without any initial or final special" is found in procedure SensitiveCompare in modules/speller/default/language.cpp at around line 428.  The suggested change #2 is to remove the code which, when the token begins or ends with a valid special and has failed to match the dictionary, compares the token minus the special to the dictionary.  (Note again that a token will never be found to begin or end with an INVALID special, as that special will have been dropped during tokenization.)  Specifically, I suggest removal of the four separate lines which use the special() function.  Having no previous experience of C++ programming, I cannot say that everything has been done which ought to be done, but the concept has been tried and shown to work.  I do not at present see any reason to make the change conditional, i.e. I cannot see any situation where the present behaviour is preferable.
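The difference between the present behaviour and the suggested one can be sketched as follows.  This is an illustration only and not the code of SensitiveCompare: MarginRule, Dictionary, accepts_current and accepts_proposed are hypothetical names, the dictionary is modelled as a plain std::set, and case handling (which the real comparison also has to deal with) is left out entirely.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical model of one special's marginal validity in *.dat.
struct MarginRule {
    char ch;
    bool valid_begin;
    bool valid_end;
};

using Dictionary = std::set<std::string>;

// Present behaviour as described in the text: try the exact token, then
// retry with a valid initial or final special removed.
bool accepts_current(const std::string& token, const Dictionary& dict,
                     const MarginRule& rule) {
    if (dict.count(token) > 0) return true;
    if (token.size() > 1 && rule.valid_begin && token.front() == rule.ch &&
        dict.count(token.substr(1)) > 0)
        return true;
    if (token.size() > 1 && rule.valid_end && token.back() == rule.ch &&
        dict.count(token.substr(0, token.size() - 1)) > 0)
        return true;
    return false;
}

// Suggested behaviour (change #2): only the exact form is accepted.
bool accepts_proposed(const std::string& token, const Dictionary& dict) {
    return dict.count(token) > 0;
}

// The examples from the text, written as checks.
void check_acceptance_examples() {
    const Dictionary dict{"well", "anch'"};
    const MarginRule apostrophe{'\'', true, true};
    assert(accepts_current("'well", dict, apostrophe));  // accepted today
    assert(!accepts_proposed("'well", dict));            // rejected after change #2
    assert(accepts_proposed("anch'", dict));             // exact form is listed
}
```

In this model, a dictionary that lists anch' but not anch accepts only the form with the apostrophe under the suggested behaviour, while 'well and well' are no longer validated by well.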
 
This suggestion will enable a language like Italian, for example, to have a new it.dat in which word-final apostrophe is allowed, and non-words like anch may be replaced in the dictionary by anch' .  Even for English, a new en.dat allowing marginal apostrophes and a new dictionary (with, for example, 'twas and 'twill in place of twas and twill, and adding 'tis and 'twould) could produce an improvement, but only with English texts in which an encoding distinction has been made between apostrophe and quotation mark.  The main beneficiaries of the suggestion will be among languages other than English.
 
As before, my experiments have been conducted using the Hatier port of aspell for Windows at http://www.niversoft.com/downloads/aspell-0.60.5-msvc.tar.bz2 .
Third and final part to follow.
 
Ciarán Ó Duibhín

 

