
What is a search engine crawler?

A search engine crawler (also known as a web spider or web robot) is a program or script that automatically crawls information on the World Wide Web according to certain rules.

1. First, a set of carefully selected webpages is chosen, and the link addresses of these pages serve as seed URLs, which are placed into the queue of URLs to be crawled. The crawler reads URLs from this queue in turn, resolves each one through DNS, and converts the hostname in the link address into the IP address of the corresponding web server.
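This first step can be sketched in a few lines. The seed URLs below are hypothetical placeholders, and the FIFO queue is a simple stand-in for a real crawler's scheduling structure; only the DNS lookup uses a real system call.

```python
import socket
from collections import deque

def resolve(hostname):
    """Resolve a hostname to an IPv4 address, as the crawler must do
    before it can connect to the web server. Returns None on failure."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

# Hypothetical seed hostnames; the to-crawl queue is a plain FIFO.
seeds = ["example.com", "example.org"]
to_crawl = deque(seeds)

# Take the next URL from the queue and resolve it through DNS.
host = to_crawl.popleft()
ip = resolve(host)  # an IP string, or None if the lookup fails
```

In a production crawler the DNS step is usually cached aggressively, since many URLs share a hostname and repeated lookups would dominate crawl time.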

2. The path of the page is then handed to the page downloader, which fetches the page content. Each downloaded page is, on the one hand, stored in the page library, where it awaits subsequent processing such as indexing; on the other hand, its URL is added to the crawled-URL queue, which records every URL the crawler system has already downloaded, so that the same page is not crawled repeatedly.
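The store-and-record step above can be sketched as follows. The fetch function, page store, and crawled set here are hypothetical in-memory stand-ins for a real downloader, page library, and crawled-URL record.

```python
def crawl_step(url, fetch, page_store, crawled):
    """One download step: fetch the page, store its content in the
    page library for later indexing, and record the URL as crawled
    so it is never downloaded a second time."""
    if url in crawled:
        return None  # already downloaded; skip
    content = fetch(url)
    page_store[url] = content  # page library, awaiting indexing
    crawled.add(url)           # crawled-URL record
    return content

# Hypothetical stand-ins: a dict as the page library, a set as the
# crawled-URL record, and a fake downloader that invents content.
page_store, crawled = {}, set()
fake_fetch = lambda u: "<html>content of %s</html>" % u

crawl_step("http://example.com/a", fake_fetch, page_store, crawled)
crawl_step("http://example.com/a", fake_fetch, page_store, crawled)  # skipped
```

A set (or, at scale, a Bloom filter) makes the "have I crawled this?" check constant-time, which matters when the record holds billions of URLs.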

3. From each newly downloaded page, all link information is extracted and checked against the crawled-URL queue. If a link has not yet been crawled, its URL is appended to the end of the queue of URLs to be crawled, and the corresponding page will be downloaded in a later round of the crawling schedule. This forms a cycle that continues until the queue of URLs to be crawled is empty, which means the crawler system has fetched every crawlable page, and one complete crawl is finished.
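Putting the three steps together gives the classic breadth-first crawl loop. This is a minimal sketch: the link graph below is a hypothetical static table standing in for real pages, whereas a real crawler would download each page and parse its links out of the HTML.

```python
from collections import deque

# Hypothetical link graph: each "page" maps to the links it contains.
LINKS = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

def crawl(seeds):
    """Breadth-first crawl: pop a URL from the front of the queue,
    'download' it, append its not-yet-crawled out-links to the back,
    and repeat until the queue is empty."""
    to_crawl = deque(seeds)
    crawled = set()
    order = []  # the sequence in which pages were downloaded
    while to_crawl:
        url = to_crawl.popleft()
        if url in crawled:
            continue  # already downloaded in an earlier round
        crawled.add(url)
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in crawled:
                to_crawl.append(link)  # schedule for a later round
    return order
```

Starting from the single seed "http://a.example/", the loop visits a, then b, then c, each exactly once, and terminates when the queue empties, mirroring the cycle described above.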