How to deal with a website's anti-crawler strategy? How to capture a large amount of data efficiently?
IP proxy
For IP proxies, the native HTTP request API of each language already supports setting a proxy, so the main problem to solve is where to source the IPs.
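As a minimal sketch of routing a request through a proxy, assuming Python's requests library and a placeholder proxy address (not a real endpoint):

```python
import requests

# Hypothetical proxy address; substitute an IP taken from your own pool.
proxy = "http://123.45.67.89:8080"

resp = requests.get(
    "https://example.com/page",
    proxies={"http": proxy, "https": proxy},  # route both schemes through the proxy
    timeout=10,  # proxies are slow and unreliable, so always set a timeout
)
print(resp.status_code)
```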
There are very cheap proxy IP packages for sale online. I did a simple test: out of 100 IPs, on average around 40-60 were usable, and access latency was above 200 ms.
Higher-quality proxy IPs are also sold online, provided you have the right channels.
Because using IP proxies increases latency and failure rate, the crawler framework should make requests asynchronous: add request tasks to a queue (RabbitMQ, Kafka, Redis), invoke a callback on success, and re-enqueue the task on failure. Each request takes an IP from the IP pool, and if the request fails, the invalid IP is removed from the pool.
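A minimal sketch of that retry loop, assuming Redis holds both the task queue (a list I've called `task_queue`) and the IP pool (a set I've called `ip_pool`); the names and error handling are illustrative only:

```python
import requests
import redis

r = redis.Redis()  # assumes a local Redis instance

def get_proxy():
    """Pick a random proxy IP from the pool (Redis set 'ip_pool')."""
    ip = r.srandmember("ip_pool")
    return ip.decode() if ip else None

def handle_success(url, html):
    """Success callback: replace with your own parsing/storage logic."""
    print("fetched", url, len(html), "bytes")

def worker():
    while True:
        raw = r.lpop("task_queue")           # take the next request task
        if raw is None:
            break
        url = raw.decode()
        proxy = get_proxy()
        try:
            resp = requests.get(
                url,
                proxies={"http": proxy, "https": proxy} if proxy else None,
                timeout=10,
            )
            resp.raise_for_status()
            handle_success(url, resp.text)   # callback on success
        except requests.RequestException:
            if proxy:
                r.srem("ip_pool", proxy)     # drop the invalid IP from the pool
            r.rpush("task_queue", url)       # re-enqueue the failed task
```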
Cookies
Some websites use Cookies as an anti-crawler measure. As @Zhu Tianyi said, the basic approach is to maintain a cookie pool.
Keep an eye on the target website's cookie expiration time; you can simulate a browser and regenerate cookies periodically.
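A minimal sketch of a cookie pool, assuming a hypothetical `fetch_fresh_cookies` helper that you would back with a simulated browser (e.g. Selenium); the TTL is an assumption and should match the target site's real expiration:

```python
import random
import time

COOKIE_TTL = 30 * 60  # assumed lifetime in seconds; match the target site's expiration

def fetch_fresh_cookies():
    """Hypothetical placeholder: in practice, drive a real browser through the
    login/landing page and return its cookies as a dict."""
    return {"session": str(random.random())}  # dummy value so the sketch runs

class CookiePool:
    def __init__(self, size=5):
        # each entry is (cookie_dict, creation_timestamp)
        self.entries = [(fetch_fresh_cookies(), time.time()) for _ in range(size)]

    def get(self):
        """Return a random cookie set, regenerating it if it has expired."""
        i = random.randrange(len(self.entries))
        cookies, created = self.entries[i]
        if time.time() - created > COOKIE_TTL:
            cookies = fetch_fresh_cookies()
            self.entries[i] = (cookies, time.time())
        return cookies

# usage: pass pool.get() as the cookies= argument of each request
pool = CookiePool()
print(pool.get())
```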
Rate-limited access
Brute-force multi-threaded scraping will get your IP blocked within minutes. Rate-limited access is quite simple to implement (through a task queue), so don't worry too much about efficiency; generally, combined with IP proxies, the target content can still be captured quickly.
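A minimal sketch of queue-based pacing, assuming a single worker thread and an arbitrary rate of 2 requests per second; multiple workers would need a shared limiter instead:

```python
import queue
import threading
import time

REQUESTS_PER_SECOND = 2        # assumed limit; tune to what the target tolerates
task_queue = queue.Queue()

def rate_limited_worker(fetch):
    """Pull URLs off the queue and fetch them no faster than the configured rate."""
    interval = 1.0 / REQUESTS_PER_SECOND
    while True:
        url = task_queue.get()
        if url is None:        # sentinel value stops the worker
            break
        fetch(url)
        time.sleep(interval)   # crude pacing between requests

# usage: enqueue URLs, then start one worker (print stands in for a real fetch function)
for u in ["https://example.com/1", "https://example.com/2"]:
    task_queue.put(u)
task_queue.put(None)
threading.Thread(target=rate_limited_worker, args=(print,)).start()
```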
Some pitfalls
When you grab a lot of content from a target website, you will inevitably cross a line and trigger its anti-crawler mechanism, so you need appropriate alerting to tell you that the crawler has stopped working.
Typically, once anti-crawling kicks in, requests come back with HTTP 403; some websites instead return a page asking for a verification code (Douban, for example). So raise an alarm when requests start failing with 403. You can combine this with a monitoring framework such as Metrics: set a threshold on the number of failures within a short window, and have it send you emails, text messages, and so on when the threshold is exceeded.
Of course, simply detecting 403 errors does not cover every case. Some websites are peculiar: the page returned after anti-crawling still has status 200 (Qunar, for example). The crawler task then proceeds to the parsing stage, where parsing inevitably fails. For these cases we can only raise an alarm when parsing fails, and trigger a notification when such alarms exceed a threshold within a short window.
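A hand-rolled sketch of that threshold logic covering both 403 responses and parse failures (a real setup would hang this off a monitoring framework like Metrics); the window size, threshold, and `send_alert` hook are all assumptions:

```python
import time
from collections import deque

WINDOW_SECONDS = 60      # assumed sliding window
ALARM_THRESHOLD = 20     # assumed: this many failures per window triggers a notification

_failures = deque()      # timestamps of recent failures (403s or parse errors)

def record_failure(reason):
    """Call this on a 403 response or a parse failure; alerts past the threshold."""
    now = time.time()
    _failures.append(now)
    # drop events that have fallen out of the window
    while _failures and now - _failures[0] > WINDOW_SECONDS:
        _failures.popleft()
    if len(_failures) >= ALARM_THRESHOLD:
        send_alert(f"{len(_failures)} crawler failures ({reason}) in the last {WINDOW_SECONDS}s")
        _failures.clear()  # reset so repeated failures don't spam alerts

def send_alert(message):
    """Hypothetical notification hook: replace with email/SMS/webhook of your choice."""
    print("ALERT:", message)
```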
Of course, this part of the solution is not perfect: sometimes parsing fails because the website's structure has changed, and that also triggers the alarm, so you cannot easily tell which cause is behind it.