Index
This page describes the key points of the spell checker implemented in this project.
Background
For Google Summer of Code 2022, I am going to build tools for improving subtitle and caption quality for Red Hen Lab, so I first needed to evaluate how well the current tools check subtitle correctness. As my mentor explained, some Red Hen Lab members have been using Hunspell to check the correctness of files generated by CCExtractor.
Hunspell is a spell checker designed for languages with rich morphology and complex word compounding, originally created for Hungarian. It is widely used in applications such as LibreOffice, OpenOffice.org, Mozilla Firefox, Thunderbird, and Google Chrome. It is a dictionary-based spell checker, which means Hunspell's performance for a given language depends on the dictionary and lexical rules we supply. To run Hunspell correctly, you must provide a .dic dictionary file and an .aff affix file with the corresponding lexical rules. You can find Hunspell dictionary and affix files for Spanish here
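As a toy illustration of these two file formats (not the real Spanish dictionaries), a minimal .dic file lists an approximate word count followed by one entry per line, where an entry may reference affix flags after a slash:

```
2
casa/S
decir
```

The matching .aff file defines what each flag means; here the hypothetical flag S is a suffix rule that appends a plural "s" (so "casa" also accepts "casas"):

```
SET UTF-8
SFX S Y 1
SFX S 0 s .
```

This flag mechanism is what lets one dictionary entry cover many inflected forms instead of listing each form separately.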
Problem Analysis
This part lists three problems that need to be solved in my project.
Red Hen Lab mainly uses Hunspell to process English and Spanish files. Hunspell performs well on English files, but many problems arise when processing Spanish files.
Here are some errors in one Spanish file found by Hunspell:
TVE
Decirlo
cualqueir
[Standards
and
Poor
s
Christine
Tokio
PSOE
Moscú
Yan
contrarrestarle
Antidopaje
[s
recapitalizar
Most of the words in the list above have been incorrectly marked as errors. The problems fall into three types:
- Wrong Predictions
This is the biggest problem Red Hen Lab faces now. Some correct Spanish words, such as "Decirlo" and "recapitalizar", are flagged as errors even though they are valid Spanish expressions. "Decirlo" consists of two sub-words, the verb "decir" plus the pronoun "lo", and "recapitalizar" is the combination of the prefix "re-" and "capitalizar". Clearly, Hunspell cannot handle some prefixes and suffixes correctly with the current dictionaries.
- Mismarked English Words
In today's world, with deepening communication between countries, loanwords (especially English loanwords) appear in many languages. Words like "and" and "Poor" belong to this type: they are not standard Spanish words, but Red Hen Lab researchers want to treat them as correct in practice.
- Special Entities
Dictionary-based spell checkers struggle with special entities such as people's names and locations. This is hard to avoid: you cannot put every entity into the dictionary, and entities cannot be distinguished by lexical rules. Even if you tried, the resulting dictionary would be huge and would slow down spell checking. "Tokio" and "Moscú" are clearly locations, and "Yan" is a person's name.
Solution
This part briefly introduces the final solutions; for the full details, please see the detailed pages.
I tried different methods to solve the three problems above; here are the final solutions.
- Wrong Predictions
As I said before, the quality of the dictionary affects Hunspell's performance. Spanish is spoken in many countries, so Hunspell actually has multiple Spanish dictionaries. Since Red Hen Lab focuses on files from Spain and Mexico, I first merged the original Spanish (Spain) and Spanish (Mexico) dictionaries and their affix files. Then, working with my mentor Rosa, I added extra affix rules to the merged affix file. With these two steps, the number of wrong predictions decreased dramatically.
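The word-list half of that merge can be sketched as follows. This is a minimal illustration, not the actual merge script: the file names are hypothetical, and it assumes the two affix files have already been reconciled so that a given flag means the same thing in both dictionaries.

```python
# Sketch of merging two Hunspell .dic word lists (file names hypothetical).
# A .dic file starts with an approximate word count, then one entry per line;
# entries may carry affix flags after a slash, e.g. "casa/S".

def merge_dic(path_a: str, path_b: str, out_path: str) -> int:
    entries: dict[str, str] = {}
    for path in (path_a, path_b):
        with open(path, encoding="utf-8") as f:
            next(f)  # skip the leading word-count line
            for line in f:
                line = line.strip()
                if not line:
                    continue
                word, _, flags = line.partition("/")
                # keep the union of affix flags when a word appears in both files
                entries[word] = "".join(sorted(set(entries.get(word, "") + flags)))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f"{len(entries)}\n")
        for word in sorted(entries):
            flags = entries[word]
            f.write(f"{word}/{flags}\n" if flags else f"{word}\n")
    return len(entries)  # new approximate word count
```

Keeping the union of flags means a word keeps every affix behavior it had in either source dictionary.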
- Mismarked English Words
This problem is easy to solve. After comparing the two methods, I added an English Hunspell checker after the original Spanish one. The Spanish checker checks every word, but the English checker is triggered only when a word is judged wrong in Spanish, so the pipeline does not take much longer than before.
- Special Entities
All sentences go through a small NER model before Hunspell; tokens tagged as person (PER), organization (ORG), location (LOC), or miscellaneous (MISC) are removed and never fed into the spell checker. The NER model needs only one CPU and processes about 160 sentences per second. That is not fast enough for very large files, but it works for the daily jobs in Red Hen Lab.
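The filtering step can be sketched as below. A stub tagger stands in for the real NER model (its capitalization heuristic is purely illustrative, not how the actual model works); only the label names PER, ORG, LOC, and MISC come from the description above:

```python
# Sketch of removing named entities before spell checking.
# stub_ner is a hypothetical stand-in for the real NER model.

ENTITY_LABELS = {"PER", "ORG", "LOC", "MISC"}

def stub_ner(tokens):
    # Toy heuristic: treat capitalized tokens as entities (LOC), rest as O.
    return ["LOC" if t[:1].isupper() else "O" for t in tokens]

def filter_entities(tokens):
    labels = stub_ner(tokens)
    # Only non-entity tokens are passed on to the spell checker.
    return [t for t, lab in zip(tokens, labels) if lab not in ENTITY_LABELS]

print(filter_entities(["visitó", "Moscú", "ayer"]))  # → ['visitó', 'ayer']
```

Dropping entity tokens up front means words like "Moscú" or "Yan" never reach Hunspell, so they can no longer be flagged as spelling errors.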