Joke Collection Website - News headlines - How to segment Chinese words with Python?
How to segment Chinese words with Python?
1、fxsjy/jieba
The slogan of stuttering is: be the best Python Chinese word segmentation component. Maybe it is not the best from now on, but it has been used by the most people. There are many learning materials and use cases of stuttering word segmentation on the Internet, which are relatively easy to use and faster.
Advantages of stuttering:
Support three word segmentation modes
Support traditional word segmentation
Support custom dictionaries
Massachusetts institute of technology licensing agreement
2.THULAC: an efficient Chinese lexical analysis toolkit.
Two days ago, I was doing a user feedback classification about * * * enjoying bicycles. The word segmentation with street fighter is too fragmented and the classification result is not good. Later, Brother Jiang recommended THULAC, a Chinese lexical analysis toolkit developed by Tsinghua University Natural Language Processing and Social Humanities Computing Laboratory. THULAC's interface document is very detailed and easy to use.
Advantages of THULAC word segmentation:
Strong ability. The largest corpus of Chinese word segmentation and part-of-speech tagging (about 58 million words) is used for training, and the model tagging ability is strong.
High accuracy. The F 1 value of word segmentation and the F 1 value of part-of-speech tagging can reach 97.3% and 92.9%, respectively, on the standard data set Chinese Tree Database (CTB5).
Faster. At the same time, the speed of word segmentation and part-of-speech tagging is 300KB/s, and it can process about1.5000 words per second. The word segmentation speed only reaches 1.3MB/s, which is slower than that of Jieba.
Python can basically use the following logic to solve the Chinese coding problem:
Utf8 (input) -> Unicode (processing)->; (output) utf8
The characters processed in Python are all encoded in unicode, so the solution to the encoding problem is to decode the input text (no matter what encoding) and encode it into the required encoding when it is output.
Because txt documents are usually processed, the simplest method is to save txt documents as utf-8 encoding, and then decode them into unicode(sometexts.decode('utf8')) when Python is used, and then encode them into utf8 when the output results return to txt (just use the str () function directly).
- Previous article:What is the development course of China People's Liberation Army?
- Next article:Where is the hometown of Chinese poetry?
- Related articles
- Selected 10 speeches of the motherland in my heart
- How to ensure fairness and justice?
- Professional training for security guards
- Beijing Mentougou slogan
- How to write a slogan when WeChat sells fruit?
- Anti-epidemic photographic works
- What are the slogans for cheering for the marathon?
- Inspirational slogans on college campuses
- How many kilometers is it from Taonan, Jilin to Huangxian, Shandong?
- Domineering slogans scolding littering