How to write a web crawler in Python

1) First of all, you have to understand how a crawler works.

Imagine you are a spider that has just been dropped onto the Internet, and you need to read every web page. What do you do? No problem: just start somewhere, for example the home page of People's Daily. This is called the initial page; let's denote it by $.

On the home page of People's Daily you see the various links that page points to, so you happily crawl over to the "domestic news" page. Great, you have now crawled two pages (the home page and domestic news)! Never mind for now how a crawled page gets processed; just imagine that you copy the whole page as HTML and carry it with you.

Suddenly you notice that the domestic news page links back to the home page. As a clever spider, you know you don't have to crawl back, because you have already seen it. So you need to use your brain to remember the addresses of the pages you have seen. Then, every time you find a new link you might need to crawl, you first check whether you have already been to that address. If you have, don't go.

In theory, if every page can be reached from the initial page, then you are guaranteed to crawl all of them.

So how do you implement this in Python?

Very simple:

    import queue

    initial_page = "initial page"  # placeholder for the start URL

    url_queue = queue.Queue()
    seen = set()

    seen.add(initial_page)
    url_queue.put(initial_page)

    while True:  # go on until the seas run dry and the rocks crumble
        if url_queue.qsize() > 0:
            current_url = url_queue.get()  # take the first URL out of the queue
            store(current_url)  # store the web page this URL points to
            for next_url in extract_urls(current_url):  # the URLs this page links to
                if next_url not in seen:
                    seen.add(next_url)
                    url_queue.put(next_url)
        else:
            break

This is already very close to pseudocode: the helpers that store a page and extract its links are left undefined.
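To make the loop above concrete, here is a minimal runnable sketch. It assumes a tiny in-memory link graph (the hypothetical `LINKS` dictionary) in place of the real web, so `extract_urls` is just a lookup and "storing" a page means appending its name to a list. It illustrates the breadth-first idea only and is not a real downloader.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
LINKS = {
    "home": ["domestic", "world"],
    "domestic": ["home", "sports"],
    "world": ["home"],
    "sports": [],
}

def extract_urls(url):
    """Return the links found on a page (here, a lookup in LINKS)."""
    return LINKS.get(url, [])

def crawl(initial_page):
    """Breadth-first crawl; returns pages in the order they were visited."""
    seen = {initial_page}
    url_queue = deque([initial_page])
    visited = []
    while url_queue:
        current_url = url_queue.popleft()
        visited.append(current_url)  # stands in for store(current_url)
        for next_url in extract_urls(current_url):
            if next_url not in seen:  # skip addresses we already know
                seen.add(next_url)
                url_queue.append(next_url)
    return visited

order = crawl("home")
print(order)
```

Each page is visited exactly once, even though "domestic" and "world" both link back to "home".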

That is the backbone of every crawler. Now let's look at why a crawler is actually a very complicated thing: search engine companies usually have a whole team to maintain and develop theirs.

2) Efficiency

If you tidy up the code above and run it as is, it will take you a whole year to crawl all of Douban's content. Not to mention that a search engine like Google needs to crawl the entire web.

What's the problem? There are too many pages to crawl, and the code above is too slow. Suppose there are N pages in the whole web. Then the cost of duplicate checking is N*log(N), because every page has to be visited once and each membership check against the set costs log(N). OK, OK, I know Python's set is implemented as a hash table (so a lookup is O(1) on average, not log(N)), but it is still too slow, or at least not memory-efficient.

What is the usual approach to duplicate checking? A Bloom filter. Simply put, it is still a hashing method, but its distinguishing feature is that it uses a fixed amount of memory (which does not grow with the number of URLs) to decide in O(1) time whether a URL is already in the set. Unfortunately, there is no such thing as a free lunch. Its one drawback is that it is probabilistic: if the filter says a URL is not in the set, you can be 100% sure the URL has not been seen; but if it says the URL is in the set, what it is really telling you is "this URL has probably appeared already, but I have 2% uncertainty". Note that when you allocate enough memory, this uncertainty becomes very small. A simple tutorial: Bloom Filters by Example.
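As a rough illustration of the idea, here is a minimal Bloom filter sketch. The class name, the use of SHA-256 with double hashing, and the `example.com` URLs are all assumptions made for this example; the sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A production crawler would more likely use an existing library.

```python
import hashlib
import math

class BloomFilter:
    """Fixed-size Bloom filter sized for an expected item count and error rate."""

    def __init__(self, expected_items, false_positive_rate):
        # m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions
        self.num_bits = math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two halves of one SHA-256 digest
        # (double hashing); an illustrative choice, not the only one.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )

bf = BloomFilter(expected_items=1000, false_positive_rate=0.02)
for i in range(1000):
    bf.add(f"http://example.com/page/{i}")

# Every added URL is reported as present: there are no false negatives.
print(all(f"http://example.com/page/{i}" in bf for i in range(1000)))
# URLs never added are usually, but not always, reported as absent.
false_hits = sum(f"http://example.com/other/{i}" in bf for i in range(1000))
print(false_hits, "false positives out of 1000 unseen URLs")
```

The memory used is fixed at construction time, which is the property the paragraph above relies on; the price is the small share of unseen URLs that are wrongly reported as seen.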

Note this property: if a URL has been seen, there is a small probability that it will be crawled again (no harm done; looking at a page a few extra times won't kill you). But if it has not been seen, it will definitely be crawled (this is very important, otherwise we would miss some pages!). [IMPORTANT: there is a problem with this paragraph; please skip it for now]

OK, we are now close to the fastest way of doing duplicate checking. On to the next bottleneck: you only have one machine. No matter how much bandwidth you have, as long as the speed at which your machine downloads pages is the bottleneck, that speed is what you have to improve. If one machine is not enough, use many! Of course, we assume each machine is already running at full efficiency by using multithreading (in Python, multiple processes).
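As a small sketch of getting more out of a single machine, the following overlaps several downloads with a thread pool from the standard library. The `fetch` function is a hypothetical stand-in that only sleeps to imitate network latency, and the URLs are made up; `ProcessPoolExecutor` from the same module offers the same `map` interface if processes are preferred.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Hypothetical downloader: sleeps to imitate network latency."""
    time.sleep(0.05)
    return url, f"<html>content of {url}</html>"

urls = [f"http://example.com/page/{i}" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() yields results in input order; the downloads themselves overlap.
    pages = dict(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

print(len(pages), "pages fetched")
print(f"{elapsed:.2f}s concurrently vs about {0.05 * len(urls):.2f}s one by one")
```

Because the workers spend almost all their time waiting, ten of them finish the batch in a fraction of the sequential time; real speedups depend on bandwidth and on how politely the target site must be treated.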

3) Crawling with a cluster

When I crawled Douban, I used 100 machines in total, running around the clock for a month. Imagine using only one machine: it would have to run for 100 months. ...

So, suppose you now have 100 machines available. How do you implement a distributed crawling algorithm in Python?

Call the 99 machines with less computing power the slaves, and the one remaining machine with more computing power the master. Now look back at url_queue in the code above: if this queue is placed on the master, all the slaves can talk to the master over the network. Each time a slave finishes downloading a page, it asks the master for a new URL to crawl. And each time a slave fetches a new page, it sends all the links on that page to the master's queue. Likewise, the Bloom filter lives on the master, and the master now only hands out URLs that have not been visited. The Bloom filter is kept in the master's memory, while the visited URLs are stored in Redis running on the master, which keeps every operation O(1). (At least amortized O(1); for the access efficiency of Redis, see LINSERT – Redis.)

Now consider how to implement this in Python:

Install scrapy on every slave, so that each machine becomes a slave capable of crawling, and install Redis and rq on the master to serve as the distributed queue.

Then write the code.

    # slave.py
    current_url = request_from_master()
    to_send = []
    for next_url in extract_urls(current_url):
        to_send.append(next_url)

    store(current_url)
    send_to_master(to_send)

    # master.py
    distributed_queue = DistributedQueue()
    bf = BloomFilter()

    initial_pages = "www.renmingribao.com"

    while True:
        if request == 'GET':
            if distributed_queue.size() > 0:
                send(distributed_queue.get())
            else:
                break
        elif request == 'POST':
            bf.put(request.url)
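The master/slave exchange can be tried out in a single process before any networking is involved. The sketch below is an assumption-laden toy: the `Master` class, `slave_step`, and the `LINKS` graph are hypothetical names, a plain `set` stands in for the Bloom filter, and method calls stand in for the GET and POST requests.

```python
from collections import deque

# Hypothetical link graph standing in for the web.
LINKS = {
    "home": ["domestic", "world"],
    "domestic": ["home", "sports"],
    "world": ["home", "sports"],
    "sports": [],
}

class Master:
    """Owns the URL queue and the record of seen URLs."""

    def __init__(self, initial_page):
        self.queue = deque([initial_page])
        self.seen = {initial_page}  # a plain set stands in for the Bloom filter

    def get(self):
        """Handle a slave's GET: hand out the next URL, or None if idle."""
        return self.queue.popleft() if self.queue else None

    def post(self, urls):
        """Handle a slave's POST: enqueue only URLs not seen before."""
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.queue.append(url)

def slave_step(master, name, log):
    """One slave iteration: ask for a URL, 'download' it, report its links."""
    url = master.get()
    if url is None:
        return False
    log.append((name, url))  # stands in for store(current_url)
    master.post(LINKS.get(url, []))
    return True

master = Master("home")
log = []
slaves = ["slave-1", "slave-2"]
# Round-robin the slaves until a full round finds no work left.
while any([slave_step(master, name, log) for name in slaves]):
    pass

print(log)
```

Because the master filters links before queueing them, no page is handed out twice even though two slaves report overlapping links. In a truly concurrent setup an empty queue does not mean the crawl is finished, since other slaves may still be about to post new links, so the stopping rule used here only works for this sequential simulation.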

Well, as you might expect, someone has already written what you need: darkrho/scrapy-redis on GitHub.

4) Outlook and post-processing

Although the word "simple" was used a lot above, actually building a commercial-scale crawler is not easy. The code above is good enough to crawl an entire website without much trouble.

But once you add the follow-up processing that is also needed, such as:

Effective storage (how should the database be organized?)

Effective deduplication (here meaning deduplication of page content: we don't want to crawl both People's Daily and the "Great People's Daily" that copies it)

Effective information extraction (for example, how to extract every address on a page, such as "China Road, Fenjin Road, Chaoyang District"); search engines usually don't need to store all the information, for example, why would I save the images? ...

Timely updates (predicting how often a page will be updated)
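To give a flavor of just one of these points, page-level deduplication, here is a minimal sketch that fingerprints pages by hashing their normalized text. The `fingerprint` helper, the crude regex tag stripping, and the sample pages are assumptions for illustration. It only catches copies whose text is identical after normalization; near-duplicates with small edits need fuzzier techniques such as shingling or SimHash.

```python
import hashlib
import re

def fingerprint(html):
    """Hash of the page text with tags stripped and whitespace collapsed."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag removal
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# Hypothetical crawled pages: the second copies the text of the first.
pages = {
    "http://a.example/news": "<h1>Big News</h1><p>Something happened.</p>",
    "http://b.example/copy": "<div>Big   News</div>\n<div>Something happened.</div>",
    "http://c.example/other": "<h1>Other News</h1><p>Nothing happened.</p>",
}

seen_fingerprints = {}  # fingerprint -> first URL seen with that content
duplicates = []         # (duplicate URL, original URL)
for url, html in pages.items():
    fp = fingerprint(html)
    if fp in seen_fingerprints:
        duplicates.append((url, seen_fingerprints[fp]))
    else:
        seen_fingerprints[fp] = url

print(duplicates)
```

Here the copy is detected despite different markup and spacing, while the unrelated page is kept.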

As you can imagine, each of these points could keep many researchers busy for ten years or more. Even so,

"The road ahead is long and far; I shall search high and low."

So, don't ask how to get started, just hit the road :)