
What data can Python grab?

First, crawl the first batch of links we need.

channel_extract.py

The first links here are what we call the large-category links:

from bs4 import BeautifulSoup
import requests

start_url = '/wu/'
host_url = '/'

def get_channel_urls(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('.fenlei > dt > a')
    # print(links)
    for link in links:
        page_url = host_url + link.get('href')
        print(page_url)

# get_channel_urls(start_url)

channel_urls = '''

/jiaju/
/rirongbaihuo/
/shouji/
/bangong/
/nongyongpin/
/jiadian/
/ershoubijibendiannao/
/ruanjiantushu/
/yingyouyunfu/
/diannao/
/xianzhilipin/
/fushixiaobaxuemao/
/meironghuazhuang/
/shuma/
/laonianyongpin/
/xuniwupin/

'''

Take my crawl of 58.com as an example: what I crawled here are the links to all of the categories in the second-hand market, which is what I described above.

Find the common features of these links, print them out with the function, and store them as multi-line text.
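A note on why storing them as multi-line text is convenient: calling split() on the string with no arguments splits on any whitespace and ignores the blank lines, so it turns straight back into a list of channel paths (this is exactly what main.py does later). A minimal sketch:

channel_urls = '''
/jiaju/
/rirongbaihuo/
'''

# split() with no separator splits on runs of whitespace and drops
# the leading/trailing blank lines, yielding a clean list of paths
print(channel_urls.split())  # ['/jiaju/', '/rirongbaihuo/']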

Second, crawl the links to the detail pages we need, along with their details.

page_parsing.py

1. First, a word about our database:

Look at the code first:

from bs4 import BeautifulSoup
import requests
import pymongo  # the Python library for working with MongoDB
import re
import time

# Connect to MongoDB and create the database
client = pymongo.MongoClient('localhost', 27017)
ceshi = client['ceshi']                   # create the ceshi (test) database
ganji_url_list = ceshi['ganji_url_list']  # collection holding the detail-page links
ganji_url_info = ceshi['ganji_url_info']  # collection holding the detail-page info
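To check that the connection works, here is a minimal sketch; the inserted record is just a placeholder for illustration:

import pymongo

client = pymongo.MongoClient('localhost', 27017)
ceshi = client['ceshi']
ganji_url_list = ceshi['ganji_url_list']

# insert a placeholder record, then read it back
ganji_url_list.insert_one({'url': '/jiaju/example.htm'})
print(ganji_url_list.find_one({'url': '/jiaju/example.htm'}))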

2. Check whether the page structure matches the structure we want; for example, some pages turn out to be 404 pages (a status-code sketch appears after the detail-page code below);

3. Extract the links we want from these pages, that is, the link to each detail page;

One technique worth pointing out here:

item_link = link.get('href').split('?')[0]

What kind of object is link here, and what exactly is this get method?

Later, it turned out that its type is:

<class 'bs4.element.Tag'>

If we want to get a single attribute, we can do it like this; for example, to get its class:

print(soup.p['class'])
# ['title']

You can also use the get method and pass in the attribute name; the two are equivalent.

print(soup.p.get('class'))  # ['title']
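To tie these together, here is a self-contained sketch; the HTML snippet is made up for illustration:

from bs4 import BeautifulSoup

html = '<p class="title"><a href="/item/123.htm?from=list">demo</a></p>'
soup = BeautifulSoup(html, 'lxml')

link = soup.select('p > a')[0]
print(type(link))                      # <class 'bs4.element.Tag'>
print(soup.p['class'])                 # ['title']
print(soup.p.get('class'))             # ['title'] - the equivalent get form
print(link.get('href').split('?')[0])  # /item/123.htm - query string stripped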

Let me paste the following code:

# Get the links to all of the product detail pages on one listing page:
def get_type_links(channel, num):
    list_view = '{0}o{1}/'.format(channel, str(num))
    # print(list_view)
    wb_data = requests.get(list_view)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    linkOn = soup.select('.pageBox')  # marker used to judge whether this is a page we need
    # If the selector you copied looks like div.pageBox > ul > li:nth-child(1) > a > span,
    # the trailing span:nth-child(1) part should be deleted.
    # print(linkOn)
    if linkOn:
        link = soup.select('.zz > .zz-til > a')
        link_2 = soup.select('.js-item > a')
        link = link + link_2
        # print(len(link))
        for linkc in link:
            linkc = linkc.get('href')
            ganji_url_list.insert_one({'url': linkc})
            print(linkc)
    else:
        pass
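To make the list_view construction concrete, here is what the format string produces; the channel path /jiaju/ and the page numbers are just examples:

# '{0}o{1}/' appends the letter o plus the page number to a channel path,
# which is how the site's listing pages appear to be paginated
channel = '/jiaju/'
for num in range(1, 4):
    print('{0}o{1}/'.format(channel, str(num)))
# /jiaju/o1/
# /jiaju/o2/
# /jiaju/o3/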

4. Grab the information we need from the detail pages.

Here is the code:

# Grab the information we need from a Ganji detail page:
def get_url_info_ganji(url):
    time.sleep(1)
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    try:
        title = soup.select('head > title')[0].text
        timec = soup.select('.pr-5')[0].text.strip()
        type = soup.select('.det-infor > li > span > a')[0].text
        price = soup.select('.det-infor > li > i')[0].text
        place = soup.select('.det-infor > li > a')[1:]
        placeb = []
        for placec in place:
            placeb.append(placec.text)
        tag = soup.select('.second-dt-bewrite > ul > li')[0].text
        tag = ''.join(tag.split())
        # print(timec.split())
        data = {
            'url': url,
            'title': title,
            'time': timec.split(),
            'type': type,
            'price': price,
            'place': placeb,
            'new': tag,
        }
        ganji_url_info.insert_one(data)  # insert one record into the database
        print(data)
    except IndexError:
        pass
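Returning to point 2 above: the code relies on the .pageBox marker and the IndexError catch to skip pages that don't match. An alternative, shown here as a sketch with a hypothetical helper, is to check the HTTP status code before parsing:

import requests

def page_exists(url):
    # a 404 (or any non-200) response means this is not a page we want to parse
    wb_data = requests.get(url)
    return wb_data.status_code == 200

# if page_exists(url):
#     ...parse with BeautifulSoup as above...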

Fourth, how do we write our main function?

main.py

Look at the code:

# First import the functions and data from the other files:
from multiprocessing import Pool
from page_parsing import get_type_links, get_url_info_ganji, ganji_url_list
from channel_extract import channel_urls

# Function that crawls all of the listing links for one channel:
def get_all_links_from(channel):
    for i in range(1, 100):
        get_type_links(channel, i)

# Afterwards, run this block to grab the info from every detail page:
# if __name__ == '__main__':
#     pool = Pool()
#     pool.map(get_url_info_ganji, [url['url'] for url in ganji_url_list.find()])
#     pool.close()
#     pool.join()

# First, run this block to grab all of the links:
if __name__ == '__main__':
    pool = Pool()
    pool.map(get_all_links_from, channel_urls.split())
    pool.close()
    pool.join()

Fifth, the counting program

count.py

Used to display how much data has been crawled so far:

import time
from page_parsing import ganji_url_list, ganji_url_info

while True:
    # print(ganji_url_list.find().count())
    # time.sleep(5)
    print(ganji_url_info.find().count())
    time.sleep(5)
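Note that in recent versions of pymongo, Cursor.count() has been removed; count_documents is the current way to do the same thing. A sketch, assuming the same collections as above:

import time
from page_parsing import ganji_url_list, ganji_url_info

while True:
    # count_documents({}) counts every record in the collection
    print(ganji_url_info.count_documents({}))
    time.sleep(5)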