I have written a Javascript spellchecker/corrector based on Peter Norvig’s Python 21-liner.
Click here to view my code. It is still not perfect. Comments, bug reports welcome.
Thanks to V for speeding this up.
Updated : Suggestions made by Dante (see comments) have been incorporated and the new code has been uploaded.
August 31, 2008 at 12:33 pm |
Some pointers …
1)Wordexists function should be applied to resultset before
returning the resultset in edits1.Thats most probably affecting
the performance of edits2.
2)Rewriting train(features)
for(f in features)
{
if(model[features[f]])
model[features[f]]+=1;
else
model[features[f]]=1;
}
3)I dont think its a particularly good idea to import the corpus
as I hope this is a part of web application. The simplest
brown corpus is about 7 MB. Instead you can import the
words and their frequency that have been already
preprocessed. Here is the frequency list from brown corpus
that I have developed for my other projects
http://www.mediafire.com/?sharekey=5cacc0942f2ed052d2db6fb9a8902bda
August 31, 2008 at 2:41 pm |
Hi Dante
Yes, (2) is an optimization that should be done. i will change the code and re-upload.
(3) is a good idea. Perhaps train method can be executed on the server side and the pre-processed corpus can be used on the client (browser).
Regarding (1), though it is true that calling wordexists on the result set of edits1 will improve the performance, I am not sure, if it will eliminate certain possible states from the overall results. Is there a way to ascertain that the reduced result set of edits1 would still provide the same set that Known_edits2() does when edits1 is not reduced (using wordexists). Maybe I will test it with brown corpus and analyze the impact of a reduced edits1.
Also, thanks for the pre-processed brown corpus file.
August 31, 2008 at 9:45 pm |
Hi Intensity!
On thinking again I get why the reduced resultset should not
be returned in edits1. Lets take ‘we’ as the input. the
transition we->wre->were would not happen if wordexists
was applied in edits1.There are other consequences as well.
Why not try Levenshtein Distance Algorithm
for each word in corpus
if(len[word]-len[input] is not between -2 and 2)
continue;
find edit distance
if(edit distance > 2)
continue;
result[word]=1
output word with max probability
September 1, 2008 at 1:34 pm |
Will try and post the results.
Just a thought … if Levenshtein would anyway account for edit distances <=2, then there is no need to call the edits1( ) function anymore.
ok … Implemented LD, there is tremendous improvement in performance. Thanks for the pointers Dante.
I have re-uploaded the new source. I have retained both approaches for comparison.