A summary of short text classification

As information technology develops, the scarcest resource is no longer information itself but the ability to process it. Most information exists as text, so extracting the most useful content from large volumes of complex text is a major goal of information processing. Text classification helps users locate the information they need accurately and route information to where it belongs. At the same time, the rapid growth of the Internet has produced a large number of short texts such as book reviews, film reviews, online chats and product descriptions. These contain a great deal of valuable hidden information, so automated tools for classifying short texts are urgently needed.

A text classification system based on artificial intelligence can automatically classify large numbers of texts according to their semantics. A variety of statistical theories and machine learning methods are used for automatic text classification, but the biggest problems are the high dimensionality of the feature space and the sparsity of the document representation vectors. The Chinese vocabulary alone exceeds 200,000 entries, and a feature space of that size is too large for classification algorithms to handle well. An effective feature extraction method is urgently needed to reduce the dimensionality of the feature space and improve the efficiency and accuracy of classification.

Text classification methods fall into two main categories: those based on traditional machine learning and those based on deep learning. The traditional approach preprocesses the text, extracts features, vectorizes the processed text, and finally fits a commonly used machine learning classifier to the training set. With these methods, the quality of feature extraction has a great influence on classification accuracy. The deep learning approach trains models such as CNNs directly on the data, with no need to extract features manually; classification accuracy then depends more on the amount of data and the number of training iterations.
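The traditional pipeline described above (preprocess, extract features, vectorize, train a classifier) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses whitespace tokenization, bag-of-words counts, and a multinomial Naive Bayes classifier with add-one smoothing, none of which the article prescribes, and the four training sentences are made up for the example.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Preprocessing: lowercase and split on whitespace. Real systems also
    # remove stop words and, for Chinese, run a word segmenter first.
    return text.lower().split()

def train_naive_bayes(samples):
    """samples: list of (text, label). Returns the counts the model needs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(model, text):
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihood of each known word (add-one smoothing).
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # words never seen in training carry no evidence
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (hypothetical data, for illustration only).
train = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring plot awful acting", "neg"),
]
model = train_naive_bayes(train)
```

Calling `predict(model, "great wonderful plot")` classifies a new short text. Note that every distinct word becomes its own feature, which is exactly where the dimensionality and sparsity problems come from.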

Compared with long texts, short texts are sparse and irregular because they contain few words and carry little descriptive information. With traditional machine learning methods, the text representation is high-dimensional and sparse and its feature expressiveness is weak, and neural networks do not handle this kind of data well. In addition, feature engineering has to be done manually at very high cost, so these methods cannot meet the needs of short text classification well. An important reason why deep learning first succeeded in image and speech processing is that raw image and speech data are continuous, dense and locally correlated. To apply deep learning to large-scale text classification, the most important step is to solve the text representation problem, and then to use network structures such as CNN and RNN to learn feature representations automatically and do away with complex manual feature engineering.

Short text classification algorithms are widely used across industries, for example in news classification, human-versus-machine authorship detection, spam identification, user sentiment classification, intelligent copywriting generation and intelligent product recommendation.

Scenario 1: Intelligent product recommendation. The names of the goods a user has purchased are used as prediction samples and classified to obtain the user's transaction categories. Combined with other data, these build a user profile; the user's next purchase is then predicted from the characteristics of that profile, and goods and services are recommended accordingly.

Scenario 2: Intelligent copywriting generation. A text classification model is trained on a set of high-quality copywriting. When users enter keywords, the system recommends suitable copywriting.

Scenario 3: Automatically classify or tag news, with multiple tags.

Scenario 4: Judge whether an article was written by a human or a machine.

Scenario 5: Judge whether the sentiment in film reviews is positive, negative or neutral. Similar application scenarios are very common.

Word vector techniques from deep learning transform text from a high-dimensional, highly sparse form that neural networks handle poorly into continuous, dense data similar to images and speech: each word is mapped to a dense vector, which solves the text representation problem. Because word vectors serve as the input features of both machine learning and deep learning models, they have a great influence on the final model.
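The difference between the sparse and the dense representation can be made concrete with a small sketch. The vocabulary, the embedding dimension `EMBED_DIM`, and the randomly initialized table are all assumptions made for illustration; in a real model the embedding table is learned or loaded from pretrained vectors.

```python
import random

# Hypothetical five-word vocabulary.
vocab = ["cheap", "phone", "case", "red", "dress"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # Sparse representation: one dimension per vocabulary entry,
    # so the vector length grows with the vocabulary size.
    vec = [0.0] * len(vocab)
    vec[word_to_id[word]] = 1.0
    return vec

EMBED_DIM = 3  # assumed for the example; real models often use 100 to 300
rng = random.Random(0)
# Embedding table: one dense row per word. Random here, learned in practice.
embeddings = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]

def embed(word):
    # Dense representation: a fixed-length vector regardless of vocabulary size.
    return embeddings[word_to_id[word]]

def embed_text(text):
    # A short text becomes a (number of known words x EMBED_DIM) dense matrix.
    return [embed(w) for w in text.split() if w in word_to_id]
```

With a 200,000-word vocabulary, `one_hot` would return 200,000 numbers per word with a single non-zero entry, while `embed` would still return only `EMBED_DIM` values.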

At the same time, deep learning networks such as CNN and RNN and their variants are used to solve the problem of automatic feature extraction (that is, feature representation). The corresponding text classification models are as follows:

1) FastText

FastText is an open-source word vector and text classification tool from Facebook, with a simple model and fast training. Its principle is to average all the word vectors in the short text, feed the result directly into a softmax layer, and add some n-gram features to capture local word-order information. Compared with other text classification models such as SVM, logistic regression and neural networks, FastText greatly shortens training time while maintaining classification quality, and it supports multiple languages. However, the model is a bag-of-words method designed around English text, in which the words of a sentence are separated by spaces. To apply it to Chinese text, the text must first be word-segmented and its punctuation handled so that it matches the data format the model requires.
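The FastText principle (average the word and n-gram vectors, then apply softmax) can be sketched as a forward pass. This is not the FastText library's code or API: the class name `FastTextSketch`, the bucket count, the CRC32 hashing of features, and the random weights are all assumptions for illustration, and training is omitted. The input is assumed to be already tokenized, which for Chinese means it has already been word-segmented.

```python
import math
import random
import zlib

def ngram_features(tokens, n=2):
    # Unigrams plus word n-grams, which add some local word-order information.
    grams = list(tokens)
    for i in range(len(tokens) - n + 1):
        grams.append("_".join(tokens[i:i + n]))
    return grams

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class FastTextSketch:
    """Hypothetical, untrained model: averaged embeddings -> linear -> softmax."""

    def __init__(self, dim, num_classes, buckets=1000, seed=0):
        rng = random.Random(seed)
        self.buckets = buckets
        self.emb = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(buckets)]
        self.out = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_classes)]

    def feature_id(self, feature):
        # Hash each word or n-gram into a fixed number of embedding rows.
        return zlib.crc32(feature.encode("utf-8")) % self.buckets

    def predict_proba(self, tokens):
        feats = ngram_features(tokens)
        dim = len(self.emb[0])
        hidden = [0.0] * dim
        for f in feats:
            row = self.emb[self.feature_id(f)]
            for d in range(dim):
                hidden[d] += row[d] / len(feats)  # running average of vectors
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in self.out]
        return softmax(logits)

model = FastTextSketch(dim=8, num_classes=3)
probs = model.predict_proba(["good", "phone", "case"])
```

Because the hidden layer is just an average, the cost per text is linear in its length, which is one reason training is fast.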

2) TextCNN

Compared with FastText, TextCNN uses a convolutional neural network (CNN) to extract n-gram-like key information from sentences. Its structure is simple and it performs well.
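To show how a convolution filter acts like an n-gram detector, here is a minimal sketch of the convolution and max-over-time pooling step. The ReLU activation, the helper names, and the hand-picked filter values are assumptions for the example; a real TextCNN learns many filters per size and feeds the pooled features into a softmax layer.

```python
def relu(x):
    return max(0.0, x)

def conv_max_pool(embedded, filt, bias=0.0):
    """embedded: list of word vectors (seq_len x dim).
    filt: one filter of shape (filter_size x dim).
    Slides the filter over every window of filter_size consecutive words
    and keeps the strongest activation (max-over-time pooling)."""
    size = len(filt)
    activations = []
    for start in range(len(embedded) - size + 1):
        window = embedded[start:start + size]
        total = bias
        for word_vec, filt_row in zip(window, filt):
            total += sum(x * w for x, w in zip(word_vec, filt_row))
        activations.append(relu(total))
    # A text shorter than the filter yields no windows.
    return max(activations) if activations else 0.0

def textcnn_features(embedded, filters):
    # One pooled value per filter; filters may have different sizes.
    return [conv_max_pool(embedded, f) for f in filters]

# Three words with 2-dimensional vectors (toy values).
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],              # size 2, like a bigram detector
    [[-1.0, -1.0]],                        # size 1
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # size 3
]
features = textcnn_features(embedded, filters)
```

Each pooled value says how strongly the best-matching window in the text responded to that filter, independent of where in the text the window occurred.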

3) TextRNN

Although TextCNN performs well on many tasks, the biggest limitation of CNN is the fixed receptive field set by filter_size: on the one hand it cannot model longer sequence information, and on the other hand tuning the filter_size hyperparameter is tedious. In essence, the CNN is there to produce a feature representation of the text, and for that purpose the recurrent neural network (RNN) is more commonly used in natural language processing because it captures contextual information better. Specifically, in text classification a bidirectional RNN (in practice a bidirectional LSTM) can be understood, in a sense, as capturing variable-length, bidirectional "n-gram" information.
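The bidirectional idea can be sketched as follows. As a deliberate simplification, this uses a plain tanh RNN cell in place of the LSTM mentioned above, with small hand-picked weights; the function names and weight values are assumptions for illustration, and no training is shown.

```python
import math

def rnn_pass(inputs, w_in, w_rec):
    """Simple tanh RNN, used here as a stand-in for an LSTM.
    inputs: seq_len x in_dim, w_in: hid x in_dim, w_rec: hid x hid.
    Returns the hidden state after each time step."""
    hid = len(w_in)
    h = [0.0] * hid
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(wi * xi for wi, xi in zip(w_in[j], x))
                + sum(wr * hp for wr, hp in zip(w_rec[j], h))
            )
            for j in range(hid)
        ]
        states.append(h)
    return states

def bidirectional_encode(inputs, fwd_weights, bwd_weights):
    forward = rnn_pass(inputs, *fwd_weights)
    # Run the second RNN over the reversed text, then restore word order.
    backward = rnn_pass(inputs[::-1], *bwd_weights)[::-1]
    # Per-word representation: left-to-right context plus right-to-left context.
    per_word = [f + b for f, b in zip(forward, backward)]
    # Sentence vector: the last state of each direction.
    sentence = forward[-1] + backward[0]
    return per_word, sentence

# Toy weights (hypothetical values) for input dim 2 and hidden dim 2.
fwd = ([[0.5, -0.5], [0.3, 0.8]], [[0.1, 0.0], [0.0, 0.1]])
bwd = ([[-0.2, 0.4], [0.7, 0.1]], [[0.1, 0.0], [0.0, 0.1]])
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
per_word, sentence = bidirectional_encode(words, fwd, bwd)
```

Unlike a convolution filter, the hidden state at each position depends on all preceding (or following) words, so the amount of context is not tied to a fixed window size.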

4) TextRNN + Attention

Although CNN and RNN are effective for text classification, they share a shortcoming: they are not intuitive and are hard to interpret. The attention mechanism is a common way of modeling long-range dependencies in natural language processing; it shows directly how much each word contributes to the result, and it is a standard component of Seq2Seq models. In fact, text classification can be understood, in a sense, as a special kind of Seq2Seq task, so an attention mechanism is worth considering.

The core idea of attention is that the context used when translating each target word (or when predicting the category of a product title) is different, which is more reasonable than using one fixed context. After adding attention, the importance of each sentence and word to the predicted category can be explained intuitively.
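A minimal sketch of attention used as a pooling step over RNN hidden states is shown below. The dot-product scoring against a single `query` vector is one common choice and an assumption here, as are the toy numbers; the article does not specify the scoring function.

```python
import math

def attention_pool(hidden_states, query):
    """hidden_states: one vector per word (seq_len x dim).
    query: a vector of length dim (learned in a real model).
    Returns the weighted sum of the states and the per-word weights."""
    # Score each word by its dot product with the query vector.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    # Softmax turns the scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted average of the hidden states.
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return context, weights

# Toy hidden states for a three-word text (hypothetical values).
states = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
context, weights = attention_pool(states, query=[1.0, 1.0])
```

The `weights` list is what gives the model its interpretability: here the second word receives the largest weight, so it contributes most to the vector passed on to the classifier.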

5) TextRCNN (TextRNN + CNN)

A forward and a backward RNN produce the left and right context representation of each word, so that each word is represented by the concatenation of its word vector with the forward and backward context vectors. The same convolution and pooling layers as in TextCNN then follow; the only difference is that the convolution layer uses filter_size = 1.
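The concatenation and the filter_size = 1 convolution can be sketched as below. The left and right context vectors are taken as given inputs (in a real model they come from the forward and backward RNN), and the tanh activation, function name, and toy values are assumptions for illustration.

```python
import math

def rcnn_pool(left_ctx, embedded, right_ctx, weights, bias):
    """left_ctx, embedded, right_ctx: one vector per word each.
    weights: out_dim x (left_dim + emb_dim + right_dim), bias: out_dim.
    A convolution with filter_size = 1 is a linear layer applied to each
    word position separately, followed here by max pooling over positions."""
    pooled = [-math.inf] * len(weights)
    for l, e, r in zip(left_ctx, embedded, right_ctx):
        x = l + e + r  # concatenate [left context; word vector; right context]
        for j, (w_row, b) in enumerate(zip(weights, bias)):
            y = math.tanh(sum(w * xi for w, xi in zip(w_row, x)) + b)
            pooled[j] = max(pooled[j], y)
    return pooled

# Two words; context vectors of size 1 and word vectors of size 2 (toy values).
left = [[0.0], [0.5]]
emb = [[1.0, 0.0], [0.0, 1.0]]
right = [[0.5], [0.0]]
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
bias = [0.0, 0.0]
pooled = rcnn_pool(left, emb, right, weights, bias)
```

Because the context vectors already summarize the neighbouring words, a window of a single position is enough, which removes the need to tune several filter sizes.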

Summary: In practice, CNN models already work very well for Chinese text classification. Research shows that TextRCNN improves accuracy by about 1%, which is not very significant. A good approach is to first tune the overall task with the TextCNN model until it performs as well as possible, and only then try to improve the model.

Reference: Overview of Text Categorization Solutions