Spell Checker 2: NER Model

Choosing the right NER model for the project

Posted by Qing on June 28, 2022

Index

  1. Dataset
  2. Model Choice
    1. BERT
    2. LSTM-CRF
    3. SpaCy
  3. Summary

NER is the abbreviation of Named Entity Recognition, a standard Natural Language Processing task that deals with information extraction.

Dataset

For the data used in the model, I chose the CoNLL 2002 Spanish dataset and the WikiANN dataset. CoNLL 2002 Spanish is a traditional NER dataset for Spanish, and WikiANN is a newer multilingual NER dataset; the version I used supports 176 of the 282 languages from the original WikiANN corpus. Both datasets use the BIO tagging format, a common format for tagging tokens in chunking tasks in computational linguistics and a widely used tagging scheme for NER. This scheme labels each token as “B-X”, “I-X”, or “O”, where “B-X” means the token begins a fragment of type X, “I-X” means the token is inside a fragment of type X (but not at its beginning), and “O” means the token does not belong to any type. Here are the complete tags:

“O”, “B-PER”, “I-PER”, “B-ORG”, “I-ORG”, “B-LOC”, “I-LOC”, “B-MISC”, “I-MISC”
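
As a quick illustration (a hand-labeled example sentence of my own, not taken from either dataset), BIO-tagged tokens look like this:

```python
# Each token is paired with its BIO tag.
tagged = [
    ("Gabriel", "B-PER"),    # beginning of a person name
    ("García", "I-PER"),     # inside the same person name
    ("Márquez", "I-PER"),
    ("nació", "O"),          # outside any entity
    ("en", "O"),
    ("Aracataca", "B-LOC"),  # beginning of a location name
]
```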

Model Choice

This section lists three models that were considered in my project.

  • BERT

When thinking about building an NLP model, BERT comes to mind first. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model released by Google AI in October 2018. It performed surprisingly well on machine-reading comprehension, outperforming humans on both of the benchmark's measures, set state-of-the-art results on 11 different NLP tasks, and became a landmark achievement in the history of NLP. But BERT is not a good choice for this project, because it needs too many computational resources. I checked around 10 Spanish BERT models published on Hugging Face: they reach F1 scores of around 86 to 88, but the models take around 440MB, which corresponds to about 110 million parameters. Loading such a model and computing tags takes a long time and would dramatically slow down Hunspell's processing, so BERT is not suitable for the project.
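
As a sketch of how such a size check can be done with the Hugging Face transformers library (the checkpoint name below is a placeholder, not one of the specific models I tested):

```python
from transformers import AutoModelForTokenClassification

# Placeholder name: substitute any Spanish BERT NER checkpoint from the Hub.
model = AutoModelForTokenClassification.from_pretrained("<spanish-bert-ner-checkpoint>")

# A ~440MB float32 checkpoint corresponds to roughly 110 million parameters.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")
```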

ALBERT might have been a good choice for the NER task in this project, but I only realized that after several weeks, so I did not check ALBERT's performance.

  • LSTM-CRF

A traditional method for NER is to use an LSTM or Bi-LSTM as the feature extractor and add a CRF layer as the output layer. LSTM is a good choice for feature extraction over medium-to-long sequences, and a bidirectional LSTM can use information from both directions of the sequence, which improves the model's feature extraction ability.

CRF (Conditional Random Fields) is a probabilistic undirected discriminative model that addresses the limited-dependency problem of HMMs (hidden Markov models) in sequence tagging. In an HMM, the hidden state y_i is affected only by y_{i-1}, and the output x_i is affected only by y_i. In a CRF, the prediction for x_i can be affected by contextual information {…, y_{i-1}, y_i, y_{i+1}, …} through feature functions. This makes it suitable for tasks like NER, where the surrounding states (words) affect the current prediction.
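
To make this concrete, here is a minimal sketch (my own notation, not code from any of the models below) of the linear-chain CRF scoring that the decoder maximizes: a candidate tag sequence is scored by per-token emission scores plus tag-to-tag transition scores.

```python
def crf_score(emissions, transitions, tags):
    """Score one candidate tag sequence under a linear-chain CRF.

    emissions:   emissions[i][t] = score of tag t at position i
    transitions: transitions[prev][curr] = score of moving from tag prev to curr
    tags:        candidate tag sequence (tag indices)
    """
    score = emissions[0][tags[0]]
    for i in range(1, len(tags)):
        score += transitions[tags[i - 1]][tags[i]] + emissions[i][tags[i]]
    return score
```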

The whole process can be separated into three parts:

  1. An embedding layer converts the text into character vectors, obtaining the input embeddings.
  2. The embeddings are fed into the Bi-LSTM layer, which performs feature extraction (encoding) to produce a feature representation of the sequence; this yields the logits.
  3. The logits are decoded to obtain the annotation sequence: they are fed into the CRF decoding layer, which outputs the tag for each word (a code sketch of the whole pipeline follows this list).
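
Putting the three steps together, here is a minimal PyTorch sketch of such a model. This is my own reconstruction under stated assumptions, not the code of the models discussed below; it uses the third-party pytorch-crf package for the CRF layer, and all dimensions are illustrative.

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf


class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        self.hidden2tags = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, tokens, tags=None, mask=None):
        emb = self.embedding(tokens)           # step 1: input embeddings
        feats, _ = self.lstm(emb)              # step 2: Bi-LSTM encoding
        emissions = self.hidden2tags(feats)    # per-token logits
        if tags is not None:                   # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # step 3: CRF (Viterbi) decoding
```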

I tried two LSTM-CRF-based methods. Model 1 is a simple model with one BiLSTM as the encoder and one CRF as the decoder. It is relatively small, only 60MB, but it achieves an F1 score of only 0.83 with a precision of 0.85, which is not very good. Model 2 is an improved version of Model 1 and is a little more complex: besides the BiLSTM and CRF, the author adds an extra layer that implements a char-level LSTM. The outputs of the word-level BiLSTM and the char-level LSTM are combined as the encoder output and then fed into the CRF decoder (sketched below). Model 2 performs better than Model 1: the F1 score increases by about 3 points and the precision by about 1 point. However, Model 2 takes 2.1 GB of space, which means it takes much longer than Model 1 to process a sentence. On top of that, it is written in PyTorch 0.3.0, which is far too outdated in 2022. So it is not a good choice either.
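
The char-level idea in Model 2 can be sketched roughly like this (my reconstruction of the general technique, not the author's actual code; dimensions are illustrative):

```python
import torch
import torch.nn as nn


class CharWordEncoder(nn.Module):
    """Concatenate a char-level LSTM summary of each word with its word
    embedding, then run the word-level Bi-LSTM over the combined vectors."""

    def __init__(self, n_chars, n_words, char_dim=25, word_dim=100, hidden=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, char_dim,
                                 bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim, padding_idx=0)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_dim, hidden // 2,
                                 bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch * seq_len, max_word_len)
        _, (h, _) = self.char_lstm(self.char_emb(char_ids))
        char_feat = torch.cat([h[0], h[1]], dim=-1)      # final fwd/bwd states
        char_feat = char_feat.view(*word_ids.shape, -1)  # (batch, seq_len, 2*char_dim)
        combined = torch.cat([self.word_emb(word_ids), char_feat], dim=-1)
        feats, _ = self.word_lstm(combined)
        return feats  # fed to the CRF layer as in the previous sketch
```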

  • SpaCy

As described above, the two LSTM-CRF-based models are not great choices, so I looked for alternatives in spaCy. According to the official website, spaCy is an open-source software library for advanced NLP that offers industrial-strength NLP tools and models. I found two pretrained Spanish NER models there, “es_core_news_md” and “es_core_news_sm”, both trained on a news-domain corpus. After a small comparison, I chose “es_core_news_md” as Model 3. Although “es_core_news_sm” is much smaller (only 12MB), it performs worse than Model 1 and Model 2, whereas Model 3 reaches values higher than 0.89 for F1, precision, and recall. Moreover, after disabling the pipeline components the spell checker does not need (“tok2vec”, “tagger”, “parser”, “attribute_ruler”, “lemmatizer”), Model 3 became 1.6 times faster on CPU and can process 160 sentences per second.
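
For completeness, this is how Model 3 can be loaded with the unused components disabled (the component names are the ones listed above; the example sentence is my own):

```python
import spacy

# Per the comparison above, NER still runs with these components disabled.
nlp = spacy.load(
    "es_core_news_md",
    disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"],
)

doc = nlp("Gabriel García Márquez nació en Aracataca, Colombia.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. PER and LOC entities
```

When many sentences need to be processed, nlp.pipe(sentences) batches the work and is usually faster than calling nlp() in a loop.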

Summary

I chose Model 3 as the final NER model and merged it into the spell checker. Here is the performance summary:

Metrics     Model 1   Model 2   Model 3
F1          0.8306    0.8625    0.8949
Precision   0.8497    0.8624    0.8942
Recall      0.8124    0.8626    0.8956
Size        60MB      2.1GB     41MB