Index
This page describes the key points of the spell checker implemented in this project.
Background
For Google Summer of Code 2022, I am going to build tools for improving subtitle and caption quality for Red Hen Lab, so I first needed to evaluate how well the current tools check subtitle correctness. As my mentor explained, some Red Hen Lab members have been using Hunspell to check the correctness of files generated by CCExtractor.
Hunspell is a spell checker designed for languages with rich morphology and complex word compounding, originally created for Hungarian. It is widely used in applications such as LibreOffice, OpenOffice.org, Mozilla Firefox, Thunderbird, and Google Chrome. It is a dictionary-based spell checker, which means Hunspell's performance for a given language depends on the dictionary and lexical rules we supply. To run Hunspell correctly, you must provide a .dic dictionary file and an .aff affix file with the corresponding lexical rules. You can find Hunspell dictionary and affix files for Spanish here
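As a toy illustration of these two file formats (not the real Spanish dictionaries), a minimal .dic file lists an approximate word count followed by one entry per line, where an entry may reference affix flags after a slash:

```
2
casa/S
decir
```

The matching .aff file defines what each flag means; here the hypothetical flag S is a suffix rule that appends a plural "s" (so "casa" also accepts "casas"):

```
SET UTF-8
SFX S Y 1
SFX S 0 s .
```

This flag mechanism is what lets one dictionary entry cover many inflected forms instead of listing each form separately.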
Problem Analysis
This part lists three problems that need to be solved in my project.
Red Hen Lab mainly uses Hunspell to process English and Spanish files. Hunspell performs well on English files, but many problems arise when processing Spanish files.
Here are some errors in one Spanish file found by Hunspell:
TVE
Decirlo
cualqueir
[Standards
and
Poor
s
Christine
Tokio
PSOE
Moscú
Yan
contrarrestarle
Antidopaje
[s
recapitalizar
Most of the words in the list above have been incorrectly marked as errors. The problems fall into three types:
- Wrong Predictions
This is the biggest problem Red Hen Lab faces now. Some correct Spanish words, such as "Decirlo" and "recapitalizar", are flagged as errors even though they are valid Spanish expressions. "Decirlo" consists of two sub-words, the verb "decir" plus the pronoun "lo", and "recapitalizar" is the combination of the prefix "re-" and "capitalizar". Clearly, Hunspell cannot handle some prefixes and suffixes correctly with the current dictionaries.
- Mismarked English Words
In today's world, with deepening communication between countries, loanwords (especially English loanwords) appear in many languages. Words like "and" and "Poor" belong to this type: they are not standard Spanish words, but Red Hen Lab researchers want to treat them as correct in practice.
- Special Entities
Dictionary-based spell checkers struggle with special entities such as people's names and locations. This is hard to avoid: you cannot put every entity into the dictionary, and entities cannot be distinguished by lexical rules. Even if you tried, the resulting dictionary would be huge and would slow down spell checking. "Tokio" and "Moscú" are clearly locations, and "Yan" is a person's name.
Solution
This part briefly introduces the final solutions; for the full details, please see the detailed pages.
I tried different methods to solve the three problems above; here are the final solutions.
- Wrong Predictions
As I said before, the quality of the dictionary affects Hunspell's performance. Spanish is spoken in many countries, so Hunspell actually has multiple Spanish dictionaries. Since Red Hen Lab focuses on files from Spain and Mexico, I first merged the original Spanish (Spain) and Spanish (Mexico) dictionaries and their affix files. Then, working with my mentor Rosa, I added extra affix rules to the merged affix file. With these two steps, the number of wrong predictions decreased dramatically.
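The word-list half of that merge can be sketched as follows. This is a minimal illustration, not the actual merge script: the file names are hypothetical, and it assumes the two affix files have already been reconciled so that a given flag means the same thing in both dictionaries.

```python
# Sketch of merging two Hunspell .dic word lists (file names hypothetical).
# A .dic file starts with an approximate word count, then one entry per line;
# entries may carry affix flags after a slash, e.g. "casa/S".

def merge_dic(path_a: str, path_b: str, out_path: str) -> int:
    entries: dict[str, str] = {}
    for path in (path_a, path_b):
        with open(path, encoding="utf-8") as f:
            next(f)  # skip the leading word-count line
            for line in f:
                line = line.strip()
                if not line:
                    continue
                word, _, flags = line.partition("/")
                # keep the union of affix flags when a word appears in both files
                entries[word] = "".join(sorted(set(entries.get(word, "") + flags)))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f"{len(entries)}\n")
        for word in sorted(entries):
            flags = entries[word]
            f.write(f"{word}/{flags}\n" if flags else f"{word}\n")
    return len(entries)  # new approximate word count
```

Keeping the union of flags means a word keeps every affix behavior it had in either source dictionary.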
- Mismarked English Words
This problem is easy to solve. After comparing the two methods, I added an English Hunspell checker after the original Spanish one. The Spanish checker checks every word, but the English checker is triggered only when a word is judged wrong in Spanish, so the pipeline does not take much longer than before.
- Special Entities
All sentences go through a small NER model before Hunspell; tokens tagged as person (PER), organization (ORG), location (LOC), or miscellaneous (MISC) are removed and never fed into the spell checker. The NER model needs only one CPU and processes about 160 sentences per second. That is not fast enough for very large files, but it works for the daily jobs in Red Hen Lab.
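The filtering step can be sketched as below. A stub tagger stands in for the real NER model (its capitalization heuristic is purely illustrative, not how the actual model works); only the label names PER, ORG, LOC, and MISC come from the description above:

```python
# Sketch of removing named entities before spell checking.
# stub_ner is a hypothetical stand-in for the real NER model.

ENTITY_LABELS = {"PER", "ORG", "LOC", "MISC"}

def stub_ner(tokens):
    # Toy heuristic: treat capitalized tokens as entities (LOC), rest as O.
    return ["LOC" if t[:1].isupper() else "O" for t in tokens]

def filter_entities(tokens):
    labels = stub_ner(tokens)
    # Only non-entity tokens are passed on to the spell checker.
    return [t for t, lab in zip(tokens, labels) if lab not in ENTITY_LABELS]

print(filter_entities(["visitó", "Moscú", "ayer"]))  # → ['visitó', 'ayer']
```

Dropping entity tokens up front means words like "Moscú" or "Yan" never reach Hunspell, so they can no longer be flagged as spelling errors.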