Text classification is one of the most commonly discussed topics. Below is the flowchart from Google's text classification tutorial, which serves as a guide for choosing an approach at the start. This note focuses only on models, methods and tricky questions.

TextClassificationFlowchart.png

Basic info:

data: text samples, each belonging to one or several topics.

input: the text is first converted to vector form, for example a TF-IDF representation, word2vec vectors, or embeddings from a language model. Embeddings can be computed for the whole text, for each sentence, or for each word.
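To make the TF-IDF option concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization and the textbook weighting tf * ln(N / df). The function name `tfidf_vectors` and the toy texts are made up for illustration. Library implementations usually add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return (vocabulary, vectors) with weights tf * ln(N / df)."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: the number of texts in which each token occurs.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty texts
        vectors.append([
            (counts[token] / total) * math.log(n_docs / doc_freq[token])
            for token in vocabulary
        ])
    return vocabulary, vectors

# Toy corpus (hypothetical data, only for illustration).
texts = ["the cat sat", "the dog barked", "the cat purred"]
vocabulary, vectors = tfidf_vectors(texts)
```

Each text becomes one vector with one entry per vocabulary word. A word that occurs in every text, such as "the" here, gets weight 0, while rarer words get larger weights.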

output: a predicted label for each text (or several labels when a text covers more than one topic).
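A rough sketch of how raw model scores could be turned into labels, both when a text has exactly one topic and when it may have several. The label names, the scores and the 0.5 threshold are assumptions for illustration, not something the tutorial prescribes.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Convert one raw score into an independent probability."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_single_label(scores, labels):
    """Single-label case: pick the most probable class."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best]

def predict_multi_label(scores, labels, threshold=0.5):
    """Multi-label case: keep every class whose probability passes the threshold."""
    return [label for label, s in zip(labels, scores) if sigmoid(s) >= threshold]

labels = ["sports", "politics", "tech"]  # hypothetical topic names
scores = [2.0, -1.0, 0.5]                # hypothetical raw model outputs

single = predict_single_label(scores, labels)
multi = predict_multi_label(scores, labels)
```

In the single-label case the classes compete with each other through softmax. In the multi-label case each class is decided on its own, so a text can get zero, one or several labels.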


Some ways to solve this task:

  1. SVM
  2. MLP
  3. GBDT
  4. CNN
  5. sepCNN
  6. CNN - RNN
  7. BERT with numerical features

Understand every approach: