How to scrape website data with Python?

Here is a brief introduction, using both static and dynamic website data as examples. The environment is Windows 10 + Python 3.6 + PyCharm 5.0. The main contents are as follows:

Scraping static data (data that appears directly in the page's HTML source): using a joke-collection website as an example.

1. Suppose we want to scrape the following data, mainly the user's nickname, the joke content, the number of laughs, and the number of comments, as shown below:

The corresponding page source is shown below and contains the data we need:

2. Based on the page structure, the main code is as follows and is quite simple. It mainly uses requests + BeautifulSoup: requests fetches the page and BeautifulSoup parses it. A sketch of the idea follows.
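A minimal sketch of this step is shown below, assuming a hypothetical listing URL and CSS classes; the real site's URL and HTML structure will differ, so the selectors need to be adjusted after inspecting the actual page source:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical listing URL; replace with the real joke-site page you are scraping.
URL = 'https://example.com/jokes/page/1/'
HEADERS = {'User-Agent': 'Mozilla/5.0'}  # present ourselves as a normal browser


def crawl_static(url):
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    # Assumed CSS classes; inspect the real page and adjust these selectors.
    for item in soup.select('div.article'):
        nick_tag = item.select_one('h2')
        nickname = nick_tag.get_text(strip=True) if nick_tag else 'anonymous'
        content_tag = item.select_one('div.content span')
        content = content_tag.get_text(strip=True) if content_tag else ''
        stats = item.select('i.number')  # e.g. [laugh count, comment count]
        laughs = stats[0].get_text(strip=True) if len(stats) > 0 else '0'
        comments = stats[1].get_text(strip=True) if len(stats) > 1 else '0'
        print(nickname, laughs, comments)
        print(content)


if __name__ == '__main__':
    crawl_static(URL)
```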

A screenshot of the program running is shown below; the data has been scraped successfully:

Scraping dynamic data (data that is not in the page source but is loaded from JSON or similar files): using the Renren Loan (Renrendai) lending website as an example.

1. Suppose we want to scrape the loan listing data, which mainly includes five fields: annual interest rate, loan title, term, amount, and funding progress. A screenshot is shown below:

If you view the page source, you will find that these data are not there. Pressing F12 and inspecting the network requests in the browser's developer tools shows that the data is loaded from a JSON file, as shown below:

2. Once we have the URL of the JSON file, we can scrape the corresponding data. The packages used are similar to the ones above; because the response is JSON, we also use the json package to parse it. The main code is as follows.
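Below is a minimal sketch of this step. The endpoint URL and the field names (`title`, `interestRate`, `months`, `amount`, `progress`) are placeholders, not the real Renrendai API; adjust them after inspecting the actual JSON response in the developer tools:

```python
import json
import requests

# Hypothetical JSON endpoint found via F12 -> Network; the real URL will differ.
JSON_URL = 'https://example.com/loan/list?page=1'
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def crawl_dynamic(url):
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    data = json.loads(resp.text)  # parse the JSON body
    # Assumed structure: a list of loan records under data -> loans.
    for loan in data.get('data', {}).get('loans', []):
        print(loan.get('title'),         # loan title
              loan.get('interestRate'),  # annual interest rate
              loan.get('months'),        # term
              loan.get('amount'),        # amount
              loan.get('progress'))      # funding progress


if __name__ == '__main__':
    crawl_dynamic(JSON_URL)
```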

A screenshot of the program running is shown below; the data has been scraped successfully:

That covers scraping both types of data: static and dynamic. Overall, neither example is difficult; they are both entry-level crawlers, and the page structures are fairly simple. The most important part is analyzing the page and extracting the data. Once you are familiar with this, you can use Scrapy to scrape data, which is more convenient and efficient (a small sketch follows below). Of course, if the target pages are more complicated, for example with CAPTCHAs or encrypted parameters, they need careful analysis, and there are tutorials online for those cases.
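For reference, here is a minimal Scrapy spider sketch for the same kind of static listing page. The start URL and CSS selectors are placeholders and would need to be adapted to the real site:

```python
import scrapy


class JokeSpider(scrapy.Spider):
    """Minimal spider sketch; the URL and selectors are hypothetical."""
    name = 'jokes'
    start_urls = ['https://example.com/jokes/page/1/']

    def parse(self, response):
        # Assumed CSS classes; adjust to the real page structure.
        for item in response.css('div.article'):
            yield {
                'nickname': item.css('h2::text').get(default='').strip(),
                'content': ' '.join(item.css('div.content span::text').getall()).strip(),
            }
        # Follow the "next page" link if one exists.
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

It can be run directly with `scrapy runspider joke_spider.py -o jokes.json`, which writes the scraped items to a JSON file.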