
Algorithm for judging spam messages

We are all plagued by spam messages, junk mail, and sales calls, but there are mobile assistant apps that can help us intercept them. What algorithms lie behind this software?

An app like 360 Mobile Guardian keeps a blacklist of phone numbers locally or in the cloud, and checks each incoming call against it to decide whether the call is harassment. Storing a large blacklist locally can consume a lot of memory, which is exactly the problem the Bloom filter data structure introduced earlier solves. A Bloom filter, however, can misjudge: a caller who is not on the blacklist may be treated as blacklisted, which is a serious problem for interception software. So in practice several methods are combined; for example, a number that the Bloom filter flags as blacklisted can be verified with a single query to the cloud, though this requires a network connection and adds some delay. On the other hand, if the Bloom filter judges a caller to be a normal user, that caller is definitely a normal user, so no further network lookup or other check is needed.
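To make this concrete, here is a minimal Bloom filter sketch in Python; the bit-array size, the hash count, and the MD5-based hashing are illustrative assumptions, not the scheme any real interception app uses:

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: no false negatives, but possible false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted MD5 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False -> definitely not in the set; True -> probably in the set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Usage: screen an incoming call against the blacklist.
blacklist = BloomFilter()
blacklist.add("13800000000")

incoming = "13900000000"
if blacklist.might_contain(incoming):
    print("possibly blacklisted: confirm with a cloud lookup")
else:
    print("definitely not blacklisted: let the call through")
```

The key property used here is that might_contain never returns False for a number that was added, so a negative answer needs no further checking.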

Like virus scanners, or IDS and WAF software, a spam filter can also maintain its own rule base and judge messages against it. And just as IDS software misjudges, rule-based spam detection also has a certain false-positive rate, so it should generally be combined with other methods.

Rule-based detection has an obvious drawback: it is rigid. It cannot detect anything that is not covered by the rules, and attackers can deliberately craft messages to circumvent them.

Intuitively, spam messages and junk mail usually contain specific words or links. So we can count the occurrences of those specific words in a message and judge it as spam once the count reaches a certain threshold, as in the sketch below.
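A minimal keyword-counting rule might look like the following sketch in Python; the word list and the threshold are made-up examples, not a real rule base:

```python
# Hypothetical spam keywords and threshold; a real rule base would be far larger.
SPAM_WORDS = {"prize", "winner", "free", "click", "http://"}
THRESHOLD = 2

def looks_like_spam(message: str) -> bool:
    """Judge a message as spam if it contains enough suspicious words."""
    text = message.lower()
    hits = sum(text.count(word) for word in SPAM_WORDS)
    return hits >= THRESHOLD

print(looks_like_spam("Congratulations, winner! Click for your free prize"))  # True
print(looks_like_spam("Meeting moved to 3pm tomorrow"))                       # False
```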

At present, this kind of spam judgment is generally solved with machine learning. In machine learning terms, judging spam is a binary classification problem, which many algorithms can solve: decision trees, Bayesian methods, SVMs, neural networks, and so on. Among them, the Bayesian algorithm is a statistics-based approach, and it is the one introduced here.

The Bayesian algorithm solves the problem of "inverse probability". A simple example: suppose a bag holds 10 red balls and 8 white balls, and we randomly draw one ball. What is the probability that it is red? This is a simple forward probability problem, and the answer is 10/(10+8). Now turn it around: suppose we only know the bag contains red and white balls, but not how many of each. After drawing from the bag a number of times and observing the colors, can we infer the proportion of the two colors in the bag?
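The inverse problem can be illustrated with a quick simulation: draw repeatedly from a bag whose composition the code pretends not to know, and estimate the proportion from the observed frequencies (the bag contents below are just the example above):

```python
import random

# The "unknown" bag: in reality we would not know this composition.
bag = ["red"] * 10 + ["white"] * 8

# Draw with replacement many times and record the colors we see.
draws = [random.choice(bag) for _ in range(10_000)]
red_ratio = draws.count("red") / len(draws)

# The observed frequency approaches the true proportion 10/18 ≈ 0.556.
print(f"estimated proportion of red balls: {red_ratio:.3f}")
```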

In the Bayesian algorithm, probabilities summarized from past experience are called prior probabilities; they can be understood as probabilities known in advance, before observing any evidence, hence the name. For example, it usually rains around Tomb-Sweeping Day, and the probability of rain is about 70%, a figure summed up from past experience.

The posterior probability is the probability of a cause, inferred after observing an outcome. For example, if Xiao Ming is late, the probability that the reason is oversleeping, say 30%, is a posterior probability. The conditional probability is the probability that event B happens given that event A has happened, written P(B|A) and read as "the probability of B given A"; for example, the probability that Xiao Ming is late given that he got up late.

To sum up: the prior probability is a summary of experience, the posterior probability reasons from effect back to cause, and the conditional probability reasons from cause to effect.

From the definition of conditional probability, the Bayesian formula can be derived. The derivation given in the encyclopedia entry goes as follows:

P(A|B) = P(AB) / P(B)

P(AB) = P(A|B) * P(B) = P(B|A) * P(A)

P(A|B) = P(B|A) * P(A) / P(B)

Description:

1) P(A|B) = probability that A and B occur together / probability that B occurs. Intuitively, the probability that B occurs must be at least as large as the probability that A and B occur together, so the division measures what share of B's occurrences also have A occurring; this is exactly the definition of conditional probability.

2) Turning the division into a multiplication gives the joint-probability formula P(AB) = P(A|B)P(B) = P(B|A)P(A), and rearranging it yields the Bayesian formula.

This may still feel abstract, so let's take the example from the wiki:

We calculate it in two ways: one by intuitive counting, the other with naive Bayes.

Suppose the school has U people, of whom 40% are girls and 60% are boys. All the boys wear pants, while half of the girls wear pants and half wear skirts. If we meet someone wearing pants, what is the probability that this person is a girl? The intuitive counting goes as follows:

P(is a girl | wearing pants) = number of girls wearing pants / number of all people wearing pants

= U*0.4 (the girls) * 0.5 (half of whom wear pants) / (U*0.4*0.5 + U*0.6*1)

= 0.2*U / (0.8*U) = 25%

If the naive Bayes formula is used instead:

P(is a girl | wearing pants) = P(wearing pants | is a girl) * P(is a girl) / P(wearing pants)

= (0.5 * 0.4) / (0.6*1 + 0.4*0.5)

= 0.2 / 0.8

= 25%

Note: P(wearing pants) = number of people wearing pants / total number of people = (U*0.6*1 + U*0.4*0.5) / U = 80%.
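As a quick sanity check of the arithmetic, here is the same computation in Python (U is an arbitrary school size; it cancels out of the ratio):

```python
U = 1000          # arbitrary school size; it cancels out of the ratio
girls = U * 0.4
boys = U * 0.6

girls_in_pants = girls * 0.5   # half of the girls wear pants
boys_in_pants = boys * 1.0     # all of the boys wear pants

# P(is a girl | wearing pants) by direct counting
p = girls_in_pants / (girls_in_pants + boys_in_pants)
print(p)  # 0.25
```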

Seen this way, the naive Bayes formula is not very difficult.

Now to the concrete case of spam classification. Let D denote an email, made up of many words; let f+ denote spam and f- denote normal mail. With the Bayesian formula, the problem can be formalized as:

P(f+|D) = P(D|f+) * P(f+) / P(D)

P(f-|D) = P(D|f-) * P(f-) / P(D)

Among these, P(f+) and P(f-) are relatively easy to obtain: just count how many emails in a mailbox are spam and how many are normal. It is best to sample many mailboxes and average the results. These are the so-called prior probabilities.

P(D|f+) is the probability, given that an email is spam, that its words appear:

P(D|f+) = P(d1, d2, d3, ..., dn | f+)

P(d1, d2, d3, ..., dn | f+) is the probability that all these words appear together in a spam email, and this cannot be estimated directly. But if we assume the words are mutually independent, then P(d1, d2, d3, ..., dn | f+) expands to P(d1|f+) * P(d2|f+) * ... * P(dn|f+), and the probability of a single word appearing in spam is easy to calculate.

In words:

P(spam | words d1, d2, ..., dn appear together) = [P(words d1, d2, ..., dn appear together | spam) * P(spam)] / P(words d1, d2, ..., dn appear together)

According to the independence assumption:

P(spam | words d1, d2, ..., dn appear together) = [P(d1 appears | spam) * P(d2 appears | spam) * ... * P(dn appears | spam) * P(spam)] / P(words d1, d2, ..., dn appear together)

In fact, to judge whether an email is spam we would have to calculate P(words d1, d2, ..., dn appear together in an email), which cannot be computed accurately. But since we only need to compare the probability of spam with the probability of non-spam and take the larger one, the shared denominator can be dropped, and it is enough to compute:

P(spam | words d1, d2, ..., dn appear together) ∝ P(d1 appears | spam) * P(d2 appears | spam) * ... * P(dn appears | spam) * P(spam)

P(normal mail | words d1, d2, ..., dn appear together) ∝ P(d1 appears | normal mail) * P(d2 appears | normal mail) * ... * P(dn appears | normal mail) * P(normal mail)

The overall procedure is:

1. Collect n emails and label each one as spam or non-spam.

2. Remove stop words from the n emails, then tokenize them with a word-segmentation algorithm.

3. For each word, calculate the proportion of spam emails and of normal emails in which it appears.

4. Plug these values into the formulas above; whichever probability is larger gives the classification, as in the sketch below.
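Here is a minimal sketch of the whole pipeline in Python. It assumes English text split on whitespace (Chinese text would need a real word-segmentation library), a toy stop-word list, and Laplace smoothing so that a word unseen in training does not zero out the product; none of these details come from the original article:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "to", "is", "of"}  # toy stop-word list

def tokenize(text):
    """Steps 1-2: lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def train(labeled_emails):
    """Step 3: per-class word counts and prior probabilities."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in labeled_emails:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Step 4: pick the class with the larger score
    P(d1|class) * P(d2|class) * ... * P(class)."""
    total = sum(class_counts.values())
    vocab = len(set(word_counts["spam"]) | set(word_counts["ham"]))
    scores = {}
    for label in ("spam", "ham"):
        counts = word_counts[label]
        n_words = sum(counts.values())
        score = class_counts[label] / total          # prior P(class)
        for word in tokenize(text):
            # Laplace smoothing so unseen words don't zero the product.
            score *= (counts[word] + 1) / (n_words + vocab)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training set; real systems train on thousands of labeled emails.
emails = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting schedule for tomorrow", "ham"),
    ("please review the project report", "ham"),
]
wc, cc = train(emails)
print(classify("free prize click", wc, cc))          # spam
print(classify("project meeting tomorrow", wc, cc))  # ham
```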

This summary is fairly brief; there is much more to say about the Bayesian algorithm, and my own understanding is still not deep enough, so let's settle for this simple account first. The next article will work through a hands-on example of the naive Bayes algorithm.

Reference:

1. The Beauty of Data Structures and Algorithms: Probability and Statistics

2. Data Analysis in Action: Naive Bayes

3. The Ordinary yet Magical Bayesian Method