What data can Python grab?
channel_extract.py
The first kind of link is what I call the main category link:
from bs4 import BeautifulSoup
import requests

start_url = '/wu/'
host_url = '/'

def get_channel_urls(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('.fenlei > dt > a')
    # print(links)
    for link in links:
        page_url = host_url + link.get('href')
        print(page_url)

# get_channel_urls(start_url)

channel_urls = '''
/jiaju/
/rirongbaihuo/
/shouji/
/bangong/
/nongyongpin/
/jiadian/
/ershoubijibendiannao/
/ruanjiantushu/
/yingyouyunfu/
/diannao/
/xianzhilipin/
/fushixiaobaxuemao/
/meironghuazhuang/
/shuma/
/laonianyongpin/
/xuniwupin/
'''
Take the site I crawled, 58.com's second-hand market, as an example: the function above grabs the links of all its categories, which are the main category links I mentioned. Find the common pattern of these links, print them from the function, and then save them as a multi-line string.
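That multi-line string is read back later by calling .split() on it (as main.py does with channel_urls.split()), which turns it into a list of paths and ignores the surrounding blank lines. A minimal sketch with a shortened stand-in list:

```python
# Shortened stand-in for the real channel_urls string:
channel_urls = '''
/jiaju/
/jiadian/
/diannao/
'''

# str.split() with no arguments splits on any whitespace and drops
# empty strings, so the leading/trailing newlines are ignored.
channels = channel_urls.split()
print(channels)  # ['/jiaju/', '/jiadian/', '/diannao/']
```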
Second, grab the link and the details of each detail page we need.
page_parsing.py
1. First, set up the database:
Look at the code first:
from bs4 import BeautifulSoup
import requests
import pymongo   # library for operating MongoDB from Python
import re
import time

# Connect to MongoDB and create the database:
client = pymongo.MongoClient('localhost', 27017)
ceshi = client['ceshi']                    # create the 'ceshi' (test) database
ganji_url_list = ceshi['ganji_url_list']   # collection for detail-page URLs
ganji_url_info = ceshi['ganji_url_info']   # collection for detail-page data
2. Check whether the page structure matches the structure we expect; sometimes, for example, the site returns a 404 page.
3. Extract the links we want from the list pages, i.e. the link of each detail page.
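The check in step 2 boils down to two questions: did the request succeed, and does the page contain the marker element we rely on (the code below uses the .pageBox pager as that marker)? A hypothetical helper, not from the original code, that captures this logic:

```python
def is_valid_list_page(status_code, pager_elements):
    # A page is worth parsing only if the HTTP request succeeded
    # and the structural marker (e.g. the pager) was found.
    return status_code == 200 and len(pager_elements) > 0

print(is_valid_list_page(404, []))                          # False: a 404 page
print(is_valid_list_page(200, []))                          # False: structure doesn't match
print(is_valid_list_page(200, ['<div class="pageBox">']))   # True
```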
One technique worth mentioning here:

item_link = link.get('href').split('?')[0]
So what kind of object is link here, and what is this get method? It turns out that its type is

<class 'bs4.element.Tag'>
If we want to fetch a single attribute of a tag, we can do it like this, for example to get its class:

print(soup.p['class'])
# ['title']

You can also use the get method and pass in the attribute name; the two are equivalent:

print(soup.p.get('class'))
# ['title']
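The split('?')[0] idiom shown above simply drops the query string after the first question mark. The standard library's urllib.parse does the same job more explicitly; a sketch using a made-up URL:

```python
from urllib.parse import urlsplit, urlunsplit

link = 'https://example.com/item/12345.htm?from=list'  # hypothetical link

# The quick way used in this post:
item_link = link.split('?')[0]
print(item_link)  # https://example.com/item/12345.htm

# The same result via urllib.parse, which also handles #fragments:
parts = urlsplit(link)
clean = urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
print(clean)  # https://example.com/item/12345.htm
```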
Let me paste the following code:
# Get the links to all product detail pages of one channel:
def get_type_links(channel, num):
    list_view = '{0}o{1}/'.format(channel, str(num))
    # print(list_view)
    wb_data = requests.get(list_view)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    linkon = soup.select('.pageBox')  # marker that tells us this is the kind of page we want
    # If the selector copied from the browser is:
    #   div.pageBox > ul > li:nth-child(1) > a > span:nth-child(1)
    # the span:nth-child(1) part should be deleted.
    # print(linkon)
    if linkon:
        link = soup.select('.zz > .zz-til > a')
        link_2 = soup.select('.js-item > a')
        link = link + link_2
        # print(len(link))
        for linkc in link:
            linkc = linkc.get('href')
            ganji_url_list.insert_one({'url': linkc})
            print(linkc)
    else:
        pass
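The '{0}o{1}/'.format(channel, str(num)) line builds the paginated list URLs: on Ganji, page n of a channel lives at the channel path followed by o and the page number. A small sketch of that URL pattern (the channel path is just an example):

```python
def build_list_view_url(channel, num):
    # Page num of a channel is at <channel>o<num>/ on Ganji,
    # e.g. /jiaju/o3/ is page 3 of the furniture channel.
    return '{0}o{1}/'.format(channel, str(num))

urls = [build_list_view_url('/jiaju/', n) for n in range(1, 4)]
print(urls)  # ['/jiaju/o1/', '/jiaju/o2/', '/jiaju/o3/']
```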
4. Grab the information we need from the detail pages.
Here is the code:
# Grab the information on a Ganji.com detail page:
def get_url_info_ganji(url):
    time.sleep(1)
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    try:
        title = soup.select('head > title')[0].text
        timec = soup.select('.pr-5')[0].text.strip()
        type = soup.select('.det-infor > li > span > a')[0].text
        price = soup.select('.det-infor > li > i')[0].text
        place = soup.select('.det-infor > li > a')[1:]
        placeb = []
        for placec in place:
            placeb.append(placec.text)
        tag = soup.select('.second-dt-bewrite > ul > li')[0].text
        tag = ''.join(tag.split())
        # print(tag)
        data = {
            'url': url,
            'title': title,
            'time': timec.split(),
            'type': type,
            'price': price,
            'place': placeb,
            'new': tag
        }
        ganji_url_info.insert_one(data)  # insert one record into the database
        print(data)
    except IndexError:
        pass
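The ''.join(tag.split()) line is a compact way to remove all whitespace (spaces, tabs, newlines) from scraped text, which works well here because Chinese text has no spaces between characters anyway. A minimal demonstration with made-up scraped text:

```python
# What text scraped from an HTML list item often looks like:
tag = '\n        9成新\n        '

# split() breaks on any whitespace; joining with '' removes it all.
cleaned = ''.join(tag.split())
print(cleaned)  # 9成新
```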
Fourth, how do we write our main function?
main.py
Look at the code:
# First import the functions and data from the other files:
from multiprocessing import Pool
from page_parsing import get_type_links, get_url_info_ganji, ganji_url_list
from channel_extract import channel_urls

# Crawl all detail-page links of one channel:
def get_all_links_from(channel):
    for i in range(1, 100):
        get_type_links(channel, i)

# Step two: after the links are collected, run this to grab every detail page:
# if __name__ == '__main__':
#     pool = Pool()
#     pool.map(get_url_info_ganji, [url['url'] for url in ganji_url_list.find()])
#     pool.close()
#     pool.join()

# Step one: run this first to grab all the detail-page links:
if __name__ == '__main__':
    pool = Pool()
    pool.map(get_all_links_from, channel_urls.split())
    pool.close()
    pool.join()
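Pool.map fans the channel list out across worker processes. The same API exists as a thread pool in multiprocessing.dummy, which makes for an easy self-contained sketch (and threads are often a reasonable fit for I/O-bound crawling); here a stand-in function just uppercases each "channel" instead of crawling it:

```python
from multiprocessing.dummy import Pool  # thread pool with the same API as multiprocessing.Pool

def fake_crawl(channel):
    # Stand-in for get_all_links_from(channel)
    return channel.upper()

channels = '/jiaju/ /jiadian/ /diannao/'.split()

pool = Pool(4)
results = pool.map(fake_crawl, channels)  # results keep the input order
pool.close()
pool.join()
print(results)  # ['/JIAJU/', '/JIADIAN/', '/DIANNAO/']
```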
Fifth, the counting program
count.py
Used to display the amount of crawled data:
import time
from page_parsing import ganji_url_list, ganji_url_info

while True:
    # print(ganji_url_list.find().count())
    # time.sleep(5)
    print(ganji_url_info.find().count())
    time.sleep(5)