Similarities and differences between English and Chinese in telling jokes

First, Chinese and English have different word segmentation methods.

Word segmentation is the best-known difference between Chinese and English NLP. English words are naturally separated by spaces, so splitting English text into words is very easy. For example, the English sentence:

DataGrand is a Chinese company.

can be easily divided into DataGrand/is/a/Chinese/company (here "/" marks the boundary between words).
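The whitespace splitting described above can be sketched in a few lines of Python. This is only an illustration: the function name tokenize_english and the set of punctuation characters it strips are assumptions, not something the article specifies.

```python
def tokenize_english(sentence):
    """Split on whitespace, then strip surrounding punctuation from each token."""
    tokens = [tok.strip(".,!?;:\"'()") for tok in sentence.split()]
    # Drop tokens that consisted of punctuation only.
    return [tok for tok in tokens if tok]

print("/".join(tokenize_english("DataGrand is a Chinese company.")))
```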

In Chinese, there is no separator between the words of a sentence; a sentence is simply a continuous string of Chinese characters. The basic unit of meaning in modern Chinese is the word, not the individual character. For example, 自然 ("nature") cannot be understood by splitting it into the characters 自 (zi) and 然 (ran); only the word formed by combining the two characters has a precise meaning, which corresponds to the English word "nature". Therefore, when we use computers for automatic semantic analysis of Chinese, word segmentation is usually the first step. Chinese word segmentation means dividing a continuous string of characters into words that can each express a meaning independently, in the way a human reader would understand the text. For example, the Chinese sentence:

达观数据是一家中国公司。 ("Daguan Data is a Chinese company.")

For a computer to process it, the sentence must first be segmented into 达观数据/是/一家/中国/公司 (Daguan Data/is/a/Chinese/company); only then can further understanding and processing take place.

Segmenting Chinese correctly according to its meaning is a challenging task. Once segmentation goes wrong, the error propagates through all subsequent text processing and hinders correct semantic understanding. To segment Chinese quickly and accurately, researchers have studied the problem for more than 50 years and proposed many methods. Commonly used ones include classical mechanical (dictionary-based) segmentation, such as forward/reverse maximum matching and bidirectional maximum matching; statistical segmentation, which generally performs better, such as hidden Markov models (HMM) and conditional random fields (CRF); and, in recent years, deep neural network methods such as RNN and LSTM.
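As a rough illustration of the simplest of these methods, forward maximum matching, here is a minimal sketch. The toy dictionary, the max_len limit of 4 characters, and the Chinese sentence 达观数据是一家中国公司 (taken as the original of "Daguan Data is a Chinese company") are assumptions made for the example; real segmenters use far larger dictionaries.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical toy dictionary.
vocab = {"达观数据", "是", "一家", "中国", "公司"}
print("/".join(forward_max_match("达观数据是一家中国公司", vocab)))
```

Reverse maximum matching works the same way but scans from the end of the sentence, and bidirectional matching runs both and compares the results.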

Because Chinese grammar is extremely flexible, semantic ambiguity is common, which creates many obstacles to correct segmentation. Take the sentence 严守一把手机关了 ("Yan Shouyi turned off the mobile phone"). Read for meaning, the correct segmentation is 严守一/把/手机/关了 (Yan Shouyi/ba [object marker]/mobile phone/turned off). When the algorithm gets it wrong, the sentence is easily split into 严守/一把手/机关/了 (strictly observe/top leader/government office/le [particle]).
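The failure on this sentence can be reproduced with dictionary-based maximum matching. The sketch below assumes a toy dictionary that does not contain the personal name 严守一 (names are often missing from dictionaries); the function max_match and the dictionary contents are illustrative only. Forward and reverse scans disagree, which is one simple signal that the sentence is ambiguous.

```python
def max_match(text, vocab, max_len=4, reverse=False):
    """Maximum matching; scans right-to-left when reverse=True."""
    words = []
    remaining = text
    while remaining:
        for size in range(min(max_len, len(remaining)), 0, -1):
            piece = remaining[-size:] if reverse else remaining[:size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                words.append(piece)
                remaining = remaining[:-size] if reverse else remaining[size:]
                break
    return words[::-1] if reverse else words

# Hypothetical dictionary without the name 严守一.
vocab = {"严守", "一把手", "把", "手机", "机关", "关了", "了"}
sentence = "严守一把手机关了"
forward = max_match(sentence, vocab)
backward = max_match(sentence, vocab, reverse=True)
print("/".join(forward))   # 严守/一把手/机关/了
print("/".join(backward))  # 严守/一/把/手机/关了
print("ambiguous:", forward != backward)
```

Adding 严守一 to the dictionary makes both directions return 严守一/把/手机/关了, which shows how much this family of methods depends on dictionary coverage.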

Worse still, sometimes two different segmentations are both meaningful. For example, 乒乓球拍卖完了 can be read as 乒乓/球拍/卖/完了 ("the ping-pong rackets are sold out") or as 乒乓球/拍卖/完了 ("the ping-pong ball auction is over"), so more context is needed to choose the right segmentation. Other examples include 南京市长江大桥 ("Nanjing Yangtze River Bridge") and 吉林省长春药店 ("Changchun Pharmacy in Jilin Province"): if 市长 ("mayor") or 省长 ("governor") is wrongly cut out as a word, the meaning of the whole phrase is badly distorted. Common types of ambiguity include overlapping (cross) ambiguity and combinational ambiguity. In recent years, scholars in China and abroad have proposed new solutions to these problems, which are specific to languages of the Sino-Tibetan family.
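To see why a dictionary alone cannot settle such cases, one can enumerate every way of covering the sentence with dictionary words. The sketch below is a simple recursive enumeration; the function name all_segmentations and the toy dictionary are assumptions for illustration, and the approach is exponential in the worst case, so it is only suitable for short strings.

```python
def all_segmentations(text, vocab):
    """Return every way to split text so that each piece is a dictionary word."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            # Combine this first word with every segmentation of the rest.
            for tail in all_segmentations(text[end:], vocab):
                results.append([head] + tail)
    return results

# Hypothetical toy dictionary.
vocab = {"乒乓", "乒乓球", "球拍", "拍卖", "卖", "完了"}
for words in all_segmentations("乒乓球拍卖完了", vocab):
    print("/".join(words))
```

With this dictionary the enumeration yields both 乒乓/球拍/卖/完了 and 乒乓球/拍卖/完了; choosing between them requires context or a statistical model rather than the dictionary itself.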

Incidentally, Japanese sentences, like Chinese ones, lack natural separators, so Japanese also needs word segmentation. Japanese has been deeply influenced by Chinese, but also by the grammar of phonetically written languages. During the Meiji era there was even a movement to abandon Chinese characters in favor of phonetic writing. In written Japanese, Chinese characters (kanji) and kana are mixed together, somewhat like a hybrid of Chinese and English. MeCab is a well-known Japanese word segmenter in the industry, and its core algorithm is the conditional random field (CRF). In fact, if MeCab's training corpus is changed from Japanese to Chinese, it can also be used to segment Chinese.

With the successful application of deep learning in NLP in recent years, some seq2seq models no longer need word segmentation; they take individual characters directly as the input sequence and let the neural network learn the features automatically. In some end-to-end applications (such as automatic summarization, machine translation, and text classification), Chinese word segmentation can indeed be omitted. However, on the one hand, many NLP applications still depend on segmentation results, such as keyword extraction, named entity recognition, and search engines. On the other hand, segmented words can be used as input features alongside single characters to improve results. Word segmentation therefore remains an important technique in practical Chinese processing.
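One common way to feed segmentation results to a character-level model is to attach a word-boundary tag to each character. The sketch below uses the B/M/E/S scheme (begin, middle, end, single), which is an assumption for illustration; the article does not say which feature encoding is meant.

```python
def bmes_tags(words):
    """Return one (character, tag) pair per character of a segmented sentence."""
    pairs = []
    for word in words:
        if len(word) == 1:
            tags = ["S"]  # single-character word
        else:
            # First character begins the word, last ends it, the rest are in the middle.
            tags = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        pairs.extend(zip(word, tags))
    return pairs

print(bmes_tags(["达观数据", "是", "一家", "中国", "公司"]))
```

Each character keeps its own identity as an input symbol, while the tag carries what the segmenter knows about word boundaries.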

Second, the use of English morphemes and Chinese radicals

Extracting English words is much simpler than extracting Chinese ones, since words can be obtained by splitting on spaces. What is distinctive about English, however, is that words have rich inflections and derived forms. To cope with these variations, English NLP has some processing steps that Chinese does not need, namely lemmatization and stemming.

Lemmatization is needed because English words vary in number (singular and plural), voice (active and passive), and tense (16 in total), so for semantic understanding they must be "restored" to their base forms to make subsequent processing easier. For example, the words "does, done, doing, did" all need to be reduced to the lemma "do". Similarly, the plural nouns "potatoes, cities, children, teeth" need to be converted to the base forms "potato, city, child, tooth", and "were, beginning, driven" should be converted to "be, begin, drive".
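A minimal sketch of dictionary-based lemmatization follows. The exception table and the two suffix rules are hypothetical and cover only a handful of forms; production lemmatizers rely on much larger lexicons.

```python
# Hypothetical exception table for irregular or otherwise tricky forms.
EXCEPTIONS = {
    "does": "do", "done": "do", "doing": "do", "did": "do",
    "children": "child", "teeth": "tooth", "potatoes": "potato",
    "were": "be", "beginning": "begin", "driven": "drive",
}

def lemmatize(word):
    """Look the word up in the exception table, then fall back to simple plural rules."""
    w = word.lower()
    if w in EXCEPTIONS:
        return EXCEPTIONS[w]
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"   # cities -> city
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]         # companies of the form "words" -> "word"
    return w

for w in ["does", "did", "cities", "children", "teeth", "driven"]:
    print(w, "->", lemmatize(w))
```

The suffix rules are deliberately crude and will mishandle many words, which is why real systems combine large lexicons with part-of-speech information.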

Note that lemmatization usually needs to be combined with part-of-speech tagging to ensure accuracy and avoid ambiguity, because some English word forms are ambiguous. For example, "calves" can be the plural of "calf" (noun, a young cow) or the third-person singular of "calve" (verb, to give birth to a calf). There are therefore two candidate lemmas, and the appropriate one must be chosen according to the actual part of speech.
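The dependence on part of speech can be sketched by keying the lemma table on (word, tag) pairs. The table, the tag names NOUN and VERB, and the function lemmatize_with_pos are assumptions for illustration; in practice the tag would come from a part-of-speech tagger run on the full sentence.

```python
# Hypothetical lemma table keyed by (word form, part-of-speech tag).
LEMMAS = {
    ("calves", "NOUN"): "calf",
    ("calves", "VERB"): "calve",
}

def lemmatize_with_pos(word, pos):
    """Return the lemma for this word form under the given part of speech."""
    w = word.lower()
    # Fall back to the lowercased form when the pair is not in the table.
    return LEMMAS.get((w, pos), w)

print(lemmatize_with_pos("calves", "NOUN"))  # calf
print(lemmatize_with_pos("calves", "VERB"))  # calve
```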