How to segment Chinese words with Python?

Chinese word segmentation in Python is mainly done with the following tools: jieba (literally "stutter"), NLTK, and THULAC.

1. fxsjy/jieba

jieba's slogan is: to be the best Python Chinese word segmentation component. It may or may not still be the best, but it is by far the most widely used. There is plenty of learning material and there are many use cases for jieba online, and it is relatively easy to use and fast.

Advantages of jieba:

Supports three segmentation modes

Supports traditional Chinese segmentation

Supports custom dictionaries

MIT license

2. THULAC: an efficient Chinese lexical analysis toolkit

A couple of days ago I was classifying user feedback for * * * bike sharing. Segmentation with jieba was too fragmented and the classification results were poor. Later, Brother Jiang recommended THULAC, a Chinese lexical analysis toolkit developed by the Natural Language Processing and Computational Social Science & Humanities Lab at Tsinghua University. THULAC's interface documentation is very detailed and easy to use.

Advantages of THULAC word segmentation:

Strong tagging ability. It is trained on the largest available corpus of Chinese word segmentation and part-of-speech tagging data (about 58 million words), so the model's tagging ability is strong.

High accuracy. On the standard Chinese Treebank dataset (CTB5), the F1 score reaches 97.3% for word segmentation and 92.9% for part-of-speech tagging.

Fast. Running word segmentation and part-of-speech tagging together, it processes 300KB/s, about 150,000 characters per second. Doing word segmentation alone, it reaches 1.3MB/s, which is faster than jieba.

Python can basically solve Chinese encoding problems with the following logic:

utf8 (input) -> unicode (processing) -> utf8 (output)

Strings processed inside Python should be unicode, so the solution to encoding problems is to decode the input text (whatever its encoding) into unicode, and encode it into the required encoding on output.

Since you usually process txt documents, the simplest approach is to save the txt document with utf-8 encoding, decode it into unicode when reading it in Python (some_text.decode('utf8')), and encode the result back into utf8 (some_text.encode('utf8')) when writing it back out to txt.
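The decode-process-encode pipeline above can be sketched as follows (a Python 3 version, where file bytes are decoded to `str` for processing and encoded back to utf-8 on output; the sample text and the `replace` step are just illustrative):

```python
# Stand-in for the bytes read from a utf-8 encoded txt file
raw = "中文分词示例".encode("utf8")

# Decode: utf-8 bytes -> unicode str for processing
text = raw.decode("utf8")

# Process at the unicode level (illustrative step)
processed = text.replace("示例", "")

# Encode: unicode str -> utf-8 bytes before writing back to txt
out = processed.encode("utf8")
print(out.decode("utf8"))  # → 中文分词
```

In practice the decode/encode steps disappear if you open the files with an explicit encoding, e.g. `open(path, encoding="utf8")`.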