How to remove noise from a verification code with Python

The verification code doesn't look that difficult: no interference lines, no touching characters and no distortion. But I still can't recognize it directly with pytesser because of the speckle noise and the noisy background. My job is to get rid of these annoying things.

Let me introduce our tools:

1. pytesser: a Python wrapper around the Tesseract recognition engine, which is written in C/C++. It's a pity that it is rather dumb: it can only do the simplest recognition, and it doesn't know Chinese characters.

2. Requests: the favorite of those of us who like writing crawlers. It provides a user-friendly interface at the expense of a little efficiency (don't think about efficiency when writing Python).

3. BeautifulSoup: it and Requests make a good pair, and together they make it easy to extract the required content from a document.

4. PIL: the protagonist today. PIL is a dedicated image processing library, and a very powerful one. Skilled people can even draw pictures with it.

How to write a crawler that simulates a login is not covered here. Let's first talk about how to recognize the verification code.

The solution is as follows:

1. First, use PIL to enhance the image, because in the original image the edges of the digits and the noise in the background are not clearly distinguished. After enhancement they can be separated. If they are not separated, parts of the digits may be removed along with the noise.

from PIL import Image, ImageEnhance

im = Image.open("randomimage/randomimage11.jpg")
im = ImageEnhance.Sharpness(im).enhance(3)

The parameter 3 is the value that worked best in my experiments. Too strong is bad, and too weak is bad too.

2. After this preprocessing, remove the background noise. Background noise here means the various color blocks in the background, which the naked eye may not notice but which still affect recognition. My first approach was to convert the image to black and white, which naturally turned those color blocks into speckle noise.
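The post does not show how the black-and-white conversion was done, so the following is only a sketch of one common way to do it with PIL: convert to grayscale, then cut at a fixed gray level. The function name to_black_and_white and the threshold of 128 are assumptions for illustration. A fixed cutoff is used here because Image.convert("1") applies dithering by default, which adds speckles of its own.

```python
from PIL import Image


def to_black_and_white(image, threshold=128):
    """Return a grayscale copy where every pixel is either 0 (black) or 255 (white).

    threshold is a placeholder value, not one taken from the original post.
    """
    gray = image.convert("L")
    # point() applies the mapping to every gray level 0..255
    return gray.point(lambda p: 255 if p >= threshold else 0)


if __name__ == "__main__":
    # Tiny in-memory image: one dark pixel on a white background
    demo = Image.new("RGB", (2, 1), (255, 255, 255))
    demo.putpixel((0, 0), (10, 10, 10))
    bw = to_black_and_white(demo)
    print(bw.getpixel((0, 0)), bw.getpixel((1, 0)))
```

Any light background block that falls below the cutoff ends up as isolated black dots, which is the speckle noise the next steps try to remove.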

The result looks like this:

But I want to get rid of the noise so that it looks like this:

The first thing that comes to mind is the seed fill method. What is seed fill? Please look at this link.

In case the link goes dead, here is an excerpt.

The seed fill method is called Flood Fill in English, and the English name is actually the more fitting one: when applied to a node of a graph, it "floods" the other nodes connected to it and spreads outward in the same way. The method is usually used to find the maximal connected subgraphs (connected components) of a graph, where "graph" is meant in the graph-theory sense. Imagine an undirected graph. We start from an unlabeled node and assign the same label (the same color) to this node and to every node reachable from it. The labeled nodes then form one maximal connected subgraph. By moving on to the next unlabeled node and repeating the process, we find all maximal connected subgraphs. The coloring can be implemented with DFS or BFS. If the graph has V nodes and E edges, finding all maximal connected subgraphs takes O(V+E) time, because the flood fill visits each node and each edge only a constant number of times.

An example from Wikipedia:

Assuming that each white square is a node in the graph, and that adjacent squares (up, down, left and right) are connected by edges, the graph has three maximal connected subgraphs. The figure demonstrates Flood Fill finding one of them.

In this post, the area of each block is computed with the seed fill method, and then the small blocks are removed as noise.
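As a rough sketch of this idea (not the code used in this post), the function below runs a BFS flood fill over a black-and-white pixel grid, measures the area of each connected block of black pixels, and whitens the blocks that are too small. The names remove_small_blocks and min_area are made up for illustration, and the grid is a plain list of rows rather than a PIL image so that the example stays self-contained.

```python
from collections import deque


def remove_small_blocks(grid, min_area):
    """Return a copy of grid in which black blocks smaller than min_area are whitened.

    grid is a list of rows; 0 means black, any other value is background.
    Blocks are 8-connected (diagonal neighbours count as connected).
    """
    h = len(grid)
    w = len(grid[0]) if h else 0
    out = [row[:] for row in grid]
    seen = [[False] * w for _ in range(h)]
    steps = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

    for y in range(h):
        for x in range(w):
            if grid[y][x] != 0 or seen[y][x]:
                continue
            # Flood fill from this seed and collect every pixel of the block
            block = [(y, x)]
            seen[y][x] = True
            queue = deque(block)
            while queue:
                cy, cx = queue.popleft()
                for dy, dx in steps:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and grid[ny][nx] == 0):
                        seen[ny][nx] = True
                        block.append((ny, nx))
                        queue.append((ny, nx))
            # The block's area is its pixel count; small blocks are treated as noise
            if len(block) < min_area:
                for by, bx in block:
                    out[by][bx] = 255
    return out


if __name__ == "__main__":
    picture = [
        [0, 255, 255, 255],
        [255, 255, 0, 0],
        [255, 255, 0, 0],
    ]
    for row in remove_small_blocks(picture, min_area=3):
        print(row)
```

In the demo the single black pixel in the corner has area 1 and is whitened, while the 2x2 block has area 4 and is kept. Choosing min_area is the hard part: too small leaves noise behind, too large starts eating thin strokes.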

Here is the code:

# Expected to exist already: im (the black-and-white image), im2 (the color
# image, mode RGB) and l (the color-distance threshold; its value is not given).
pix = im.load()
w, h = im.size
matrix = [[-1] * h for _ in range(w)]  # block label of each pixel, -1 = unlabeled
maps = {}                              # block label -> number of pixels in the block
n = 0                                  # last block label handed out

def check(j, i):
    # The body of check() is truncated in the source; this version is an
    # assumption: True if (j, i) is inside the image, black, and already labeled.
    try:
        return j >= 0 and i >= 0 and pix[j, i] == 0 and matrix[j][i] != -1
    except IndexError:
        return False

def juli(r, s):
    # The body of juli() ("distance") is not shown in the source; the sum of
    # absolute channel differences is assumed here.
    return abs(r[0] - s[0]) + abs(r[1] - s[1]) + abs(r[2] - s[2])

# Neighbors, in the order the original branches test them
NEIGHBORS = [(-1, 0), (-1, -1), (0, -1), (1, 1), (0, 1), (-1, 1), (1, -1), (1, 0)]

for j in range(w):
    for i in range(h):
        if pix[j, i] != 0:
            continue  # only black pixels are labeled
        r = im2.getpixel((j, i))
        for dj, di in NEIGHBORS:
            if check(j + dj, i + di):
                s = im2.getpixel((j + dj, i + di))
                if juli(r, s) <= l:
                    # Similar color: join the neighbor's block
                    matrix[j][i] = matrix[j + dj][i + di]
                    maps[str(matrix[j][i])] += 1
                    break
        else:
            # No labeled neighbor with a similar color: start a new block
            n += 1
            maps[str(n)] = 1
            matrix[j][i] = n

# Whiten every block of at most 2 pixels
for j in range(w):
    for i in range(h):
        if matrix[j][i] != -1 and maps[str(matrix[j][i])] <= 2:
            im.putpixel((j, i), 255)

Result: because the area parameter was set small, the noise was not cleaned up completely, so the outcome was not ideal. If the parameter is set larger, small parts of the digits may be removed as well. Above all, the size of the noise here is not very regular, so it is difficult to find a good area parameter.

Failure is only temporary. On closer inspection, the color of the background noise is much lighter than that of the digits, which means its RGB values are much larger than those of the digits. By analyzing the RGB values, most of the noise can be removed, and the remaining noise can be handled by seed fill. In other words, information is taken from two pictures (the black-and-white one and the color one), the processing is done on one picture, and finally that picture is recognized.
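The post does not say how the RGB values were analyzed. One simple way to look for a usable cutoff, sketched here under that caveat, is to tally the pixels by the sum of their three channels and see where the dark digits end and the light background begins. The function name channel_sum_histogram, the bucket width of 50 and the sample pixels are all made up for illustration; in practice the tuples would be read from the color image.

```python
from collections import Counter


def channel_sum_histogram(pixels, bucket=50):
    """Count pixels per brightness bucket, where brightness is R + G + B (0..765)."""
    counts = Counter()
    for r, g, b in pixels:
        # Round the channel sum down to the start of its bucket
        counts[(r + g + b) // bucket * bucket] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical sample: two dark "digit" pixels and three light "background" pixels
    sample = [
        (20, 30, 40),
        (35, 25, 45),
        (200, 210, 220),
        (190, 205, 215),
        (255, 255, 255),
    ]
    for start, count in channel_sum_histogram(sample).items():
        print(start, count)
```

A wide empty range between the low and high buckets suggests that a single cutoff on the channel sum can separate most of the background from the digits.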

The core code is here:

r = im2.getpixel((j, i))
# Whiten pixels that are light overall or nearly saturated in any channel
if r[0] + r[1] + r[2] >= 400 or r[0] >= 250 or r[1] >= 250 or r[2] >= 250:
    im2.putpixel((j, i), (255, 255, 255))

So far, the problems found this time have been solved. The success rate is above 50%, which basically meets the requirements of the interface.