Spell Checker 4: Python Implementation

Merge all components into Python

Posted by Qing on August 13, 2022

Index

  1. Introduction
  2. Hunspell’s Python Implementation
  3. Cospell
  4. Summary

Introduction

I have found a good NER model and extended the dictionary on demand. They are all finished by using python. In this article, I will introduce how to implement python files that combine all these functions.

Hunspell’s Python Implementation

I planned to use the original hunspell python package at first, but this package stopped all updates in 2017, which makes it unstable to use. Then I want to implement a simple spell checker by using Norvig’s method because Hunspell is also implemented with the help of it. But later I found another well-re-implemented Hunspell version in python that is called Spylls. This project implements two main features of Hunspell: lookup (whether a word is in the dictionary) and suggest (for misspelled words).

This implementation uses Norvig’s insert-delete-swap-replace method, n-gram method, and phonetic method to correct the word. The detailed explanation can be found in author’s blog.

In view of performance, it is a little slower than the C++ version which was used in Red Hen Lab. On the author’s laptop ThinkPad Edge E330 with i5-3210M CPU and 8 GiB of RAM: Dictionary reading for en_US takes around 1.2s. The lookup takes microseconds. Suggest takes ~0.05s in a good case, and up to 0.5s in a bad one (n-gram suggest, which includes the whole dictionary iteration).

Cospell

I named the new python files as cospell, which means that it is a spell checker which combines several functions.

When a sentence is fed in, cospell will first check if it includes symbols ‘CCO’, ‘CC1’, which indicates if this sentence belongs to the contents extracted by CCExtractor. Then the whole sentence will be checked by NER function, all special entities will be masked out. The rest of words will go through Spanish checker and English checker, if it is not recognized as a right Spanish or English word, the three methods that were mentioned above will be used to correct it.

Summary

The whole process is 2-3 times slower than the original Hunspell version in Red Hen Lab, because it uses both Spanish and English spell checkers if necessary. Except for this disadvantage, cospell can remove 85% of wrong corrected words than the original Hunspell. For example, the original Hunspell will generate a list of 120 words as output, 100 of them are actually right, the 20 words left are real wrong words. By using cospell, it will only give back a list of 35 words on average, of those, only 15 words are right words treated as wrong (compared to the 100 of the original Hunspell) and the rest 20 are correctly identified wrong words. This performance can also be improved in the future (and we plan to do it) by running Cospell in more CCExtractor files used by the Red Hen Lab and adding the right words included in the output file to the dictionary. I did it for the test dataset, thus in the test dataset, it removes 97% of wrong corrected words.