How does a computer spelling checker work?

You probably know that spell checkers work by checking words against a dictionary of words known to be correct. A word that is not found in the dictionary is reported as a misspelling; a word that is found is skipped without being reported.

Two key measures of a spell checker's accuracy are its detection rate and its false-positive rate. The detection rate is the number of misspelled words reported divided by the number of words actually misspelled. The false-positive rate is the number of valid words incorrectly reported as misspelled divided by the number of words checked. A high detection rate and a low false-positive rate are desirable.

The number of words in the dictionary has a strong bearing on both measures. If the dictionary contains too many words, a misspelled word becomes more likely to match one of them and go unreported, which lowers the detection rate. If the dictionary contains too few words, more valid words are reported because they are missing from the dictionary, which raises the false-positive rate.

The ideal dictionary for you would contain every word in your vocabulary and no other words. It would yield an excellent detection rate and a false-positive rate of 0%. The detection rate would still fall short of 100% because you could misspell one word as a different valid word: you might accidentally leave the e off "stare" and match "star," for example.

Unfortunately, a spell-checker dictionary that is ideal for you would likely be less than ideal for someone else, since different people have different vocabularies. Moreover, creating a separate dictionary for each person's vocabulary would be prohibitively expensive. A cost-effective dictionary contains the words most commonly used by the population of its users. To maintain a high detection rate, the dictionary should contain only words common to a large portion of that population.
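
The two measures defined above can be made concrete with a small sketch. Everything in it is illustrative rather than taken from any real spell checker: the function names `check` and `rates`, the sample sentence, and both word sets are assumptions, and the sketch presumes we know what the writer intended to type so that true misspellings can be counted.

```python
def check(words, dictionary):
    """Return the positions of words that are not in the dictionary."""
    return {i for i, w in enumerate(words) if w.lower() not in dictionary}


def rates(typed, intended, dictionary):
    """Return (detection rate, false-positive rate) for one checked text.

    `typed` is what the writer produced and `intended` is what the writer
    meant, word for word; both are hypothetical inputs for illustration.
    """
    flagged = check(typed, dictionary)
    misspelled = {i for i, (t, n) in enumerate(zip(typed, intended)) if t != n}
    detected = len(flagged & misspelled)       # misspellings that were reported
    false_positives = len(flagged - misspelled)  # valid words that were reported
    detection_rate = detected / len(misspelled) if misspelled else 1.0
    false_positive_rate = false_positives / len(typed) if typed else 0.0
    return detection_rate, false_positive_rate


intended = "mark the calendar and stare at the stratigraphy".split()
typed = "mark the calender and star at the stratigraphy".split()

# A small dictionary of common words, and a larger one that also holds
# two rarer but valid words ("calender" is a kind of rolling machine).
small_dict = {"mark", "the", "calendar", "and", "stare", "star", "at"}
large_dict = small_dict | {"calender", "stratigraphy"}

print(rates(typed, intended, small_dict))  # (0.5, 0.125)
print(rates(typed, intended, large_dict))  # (0.0, 0.0)
```

With the small dictionary, "calender" is caught but the valid word "stratigraphy" is wrongly flagged. With the large dictionary the false positive disappears, but so does the detection of "calender". In both cases "star" slips through, which is why the detection rate cannot reach 100% here.
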

If the dictionary contains technical terms used only by a small portion of the population, such as archaeologists, there is an increased chance that a misspelling made by an average user will match one of these specialized terms and go unreported. To maintain a low false-positive rate, the dictionary should contain most of the words used by the population. If it lacks a word that the population commonly uses, people will be frustrated when the spell checker reports that word as a misspelling.

Incidentally, a spell checker's dictionary is not like a print dictionary. Print dictionaries have an obligation to include as many words as possible within their limits, no matter how obscure. A spell checker has a different job: flagging a valid word as a misspelling may be annoying, but letting a misspelled word pass without report is a failure to do that job. For this reason, the spell checker dictionary should contain as many common words as are needed to maintain a reasonable false-positive rate, but no more. Put another way, the dictionary should contain the minimum number of words needed to avoid incorrectly reporting common valid words.

Companies that build dictionaries for spell checkers often do so by statistically analyzing vast amounts of text from many sources to ensure that the most common words are included. These range from "the," "a," and "of" to less common but still far from obscure words like "plenipotentiary" and "disenfranchisement." Specialized terms are best handled by supplemental spell checker dictionaries.
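
As a rough sketch of the approach described above, the following keeps only words that appear in several sources and leaves specialized terms to a supplemental word list. The selection rule (a minimum number of source documents), the function names `build_dictionary` and `misspellings`, and the tiny sample texts are all assumptions made for illustration; real dictionary builders analyze far larger corpora and may use different statistics.

```python
import re
from collections import Counter


def build_dictionary(documents, min_docs=2):
    """Keep words that occur in at least `min_docs` of the source documents."""
    doc_counts = Counter()
    for text in documents:
        # Count each word once per document, so one source cannot dominate.
        doc_counts.update(set(re.findall(r"[a-z]+", text.lower())))
    return {word for word, n in doc_counts.items() if n >= min_docs}


def misspellings(text, base, supplemental=frozenset()):
    """Report words found in neither the base nor the supplemental dictionary."""
    known = base | supplemental
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in known]


documents = [
    "the dig began at the site",
    "the site of the dig was a hill",
    "stratigraphy of the site",
]
base = build_dictionary(documents)
print(sorted(base))  # ['dig', 'of', 'site', 'the']

sentence = "the stratigraphy of the site"
print(misspellings(sentence, base))                    # ['stratigraphy']
print(misspellings(sentence, base, {"stratigraphy"}))  # []
```

The specialized term is reported for a general user, but an archaeologist who loads a supplemental dictionary containing it sees no report, and the base dictionary stays small for everyone else.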

