
How to write a Python crawler?

There are actually many Python crawler libraries, such as urllib, requests, bs4, lxml, and so on. If you are a beginner, you can start with two libraries, requests and bs4 (BeautifulSoup), which are relatively easy to learn: requests is used to request pages, and BeautifulSoup is used to parse them. Based on these two libraries, I will briefly introduce how Python scrapes static and dynamic web page data. The experimental environment is Win10 + Python 3.6 + PyCharm 5.0. The main contents are as follows:
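As a quick check that the two libraries are installed and work together, here is a minimal sketch (the pip command and the placeholder URL are my own assumptions, not part of the original setup):

# pip install requests beautifulsoup4 lxml
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://example.com', timeout=10)   # requests fetches the page
soup = BeautifulSoup(resp.text, 'html.parser')            # BeautifulSoup parses it
print(soup.title.get_text())                               # prints the page <title>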

Scraping static web page data with Python.

This is very simple: just request the page directly by its URL. Here we take scraping content from Qiushibaike (a jokes site) as an example:

1. Suppose the text we want to scrape is as follows; it mainly includes four fields: nickname, content, number of votes, and number of comments:

Open the page source, and the corresponding web page structure is as follows. It is very simple, and all of the fields can be found directly in the HTML:

2. Based on the above page structure, we can write code to scrape the data. It is simple: first request the page by its URL with requests, then parse the returned HTML with BeautifulSoup (by tag and attribute), as shown in the sketch below:
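Here is a minimal sketch of that step. The listing URL, the User-Agent header, and every class name are assumptions standing in for the page structure seen in the source view above; adjust them to whatever the real HTML uses:

import requests
from bs4 import BeautifulSoup

url = 'https://www.qiushibaike.com/text/'   # assumed listing URL
headers = {'User-Agent': 'Mozilla/5.0'}     # many sites reject the default UA

resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()
resp.encoding = resp.apparent_encoding      # guard against mis-detected encoding
soup = BeautifulSoup(resp.text, 'lxml')     # or 'html.parser' if lxml is missing

# 'article', 'content' and 'number' are assumed class names for each joke block
for item in soup.find_all('div', class_='article'):
    author = item.find('h2')
    nickname = author.get_text(strip=True) if author else ''
    body = item.find('div', class_='content')
    content = body.get_text(strip=True) if body else ''
    numbers = [i.get_text(strip=True) for i in item.find_all('i', class_='number')]
    votes = numbers[0] if numbers else ''                 # number of votes
    comments = numbers[1] if len(numbers) > 1 else ''     # number of comments
    print(nickname, content, votes, comments)

Once the selectors match the real page, each loop iteration prints one record with the four fields described above.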

A screenshot of the program running is as follows; the data has been crawled successfully:

Scraping dynamic web page data with Python.

In many cases, web page data is loaded dynamically, and crawling the page HTML directly will not yield any data. In that case we need to capture and analyze the network requests to find the dynamically loaded data, usually a JSON file (of course, it may also be another format, such as XML), and then request and parse that JSON file to get the data we need. Here is an example of scraping the scattered loan listings on Renrendai (Renren Loan):

1. Suppose the data we want to crawl is as follows, mainly including five fields: annual interest rate, loan title, term, amount, and progress:

2. Press F12 to open the browser developer tools and click Network -> XHR. Refresh the page with F5 and you can find the dynamically loaded JSON request. The details are as follows:

3. Based on the above analysis, we can write code to scrape the data. The basic idea is similar to the static page above: first request the JSON URL with requests, then parse the response with Python's built-in JSON handling, as shown in the sketch below:
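Here is a minimal sketch of that step. The JSON URL and every key name are placeholders standing in for whatever the developer tools actually showed in the Network panel; substitute the real request URL and response keys:

import requests

json_url = 'https://www.renrendai.com/loan/list/loanList?currPage=1'  # assumed endpoint
headers = {'User-Agent': 'Mozilla/5.0'}

resp = requests.get(json_url, headers=headers, timeout=10)
resp.raise_for_status()
data = resp.json()   # requests decodes the body with Python's built-in json module

# 'data', 'loans' and the field names are assumed keys; map them to the real response
for loan in data.get('data', {}).get('loans', []):
    print(loan.get('interestRate'),   # annual interest rate
          loan.get('title'),          # loan title
          loan.get('months'),         # term
          loan.get('amount'),         # amount
          loan.get('progress'))       # progress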

A screenshot of the program running is as follows; the data has been obtained successfully:

At this point, we have finished using Python to scrape web page data. Overall, the whole process is very simple. For beginners, requests and BeautifulSoup are very easy to learn and master, and you can pick them up quickly. Once you are familiar with them, you can learn the Scrapy crawler framework, which can noticeably improve development efficiency. Of course, if the pages involve encryption or CAPTCHAs, you will need to study the countermeasures yourself; there are related tutorials and materials online that you can look up if you are interested.