How to deal with a website's anti-crawler strategy? How to capture a large amount of data efficiently?
IP proxy
For IP proxies, the native HTTP request API of each language already supports setting a proxy, so the main problem to solve is where to source the IPs.
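As a minimal sketch of routing a request through a proxy, assuming Python's requests library and a placeholder proxy address (not a real endpoint):

```python
import requests

# Hypothetical proxy address; substitute an IP taken from your own pool.
proxy = "http://123.45.67.89:8080"

resp = requests.get(
    "https://example.com/page",
    proxies={"http": proxy, "https": proxy},  # route both schemes through the proxy
    timeout=10,  # proxies are slow and unreliable, so always set a timeout
)
print(resp.status_code)
```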
There are very cheap proxy IP packages for sale online. I did a simple test: out of 100 IPs, on average around 40-60 were usable, and access latency was above 200 ms.
Higher-quality proxy IPs are also sold online, provided you have the right channels.
Because using IP proxies increases latency and failure rate, the crawler framework should make requests asynchronous: add request tasks to a queue (RabbitMQ, Kafka, Redis), invoke a callback on success, and re-enqueue the task on failure. Each request takes an IP from the IP pool, and if the request fails, the invalid IP is removed from the pool.
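A minimal sketch of that retry loop, assuming Redis holds both the task queue (a list I've called `task_queue`) and the IP pool (a set I've called `ip_pool`); the names and error handling are illustrative only:

```python
import requests
import redis

r = redis.Redis()  # assumes a local Redis instance

def get_proxy():
    """Pick a random proxy IP from the pool (Redis set 'ip_pool')."""
    ip = r.srandmember("ip_pool")
    return ip.decode() if ip else None

def handle_success(url, html):
    """Success callback: replace with your own parsing/storage logic."""
    print("fetched", url, len(html), "bytes")

def worker():
    while True:
        raw = r.lpop("task_queue")           # take the next request task
        if raw is None:
            break
        url = raw.decode()
        proxy = get_proxy()
        try:
            resp = requests.get(
                url,
                proxies={"http": proxy, "https": proxy} if proxy else None,
                timeout=10,
            )
            resp.raise_for_status()
            handle_success(url, resp.text)   # callback on success
        except requests.RequestException:
            if proxy:
                r.srem("ip_pool", proxy)     # drop the invalid IP from the pool
            r.rpush("task_queue", url)       # re-enqueue the failed task
```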
Cookies
Some websites use Cookies as an anti-crawler measure. As @Zhu Tianyi said, the basic approach is to maintain a cookie pool.
Keep an eye on the target website's cookie expiration time; you can simulate a browser and regenerate cookies periodically.
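A minimal sketch of a cookie pool, assuming a hypothetical `fetch_fresh_cookies` helper that you would back with a simulated browser (e.g. Selenium); the TTL is an assumption and should match the target site's real expiration:

```python
import random
import time

COOKIE_TTL = 30 * 60  # assumed lifetime in seconds; match the target site's expiration

def fetch_fresh_cookies():
    """Hypothetical placeholder: in practice, drive a real browser through the
    login/landing page and return its cookies as a dict."""
    return {"session": str(random.random())}  # dummy value so the sketch runs

class CookiePool:
    def __init__(self, size=5):
        # each entry is (cookie_dict, creation_timestamp)
        self.entries = [(fetch_fresh_cookies(), time.time()) for _ in range(size)]

    def get(self):
        """Return a random cookie set, regenerating it if it has expired."""
        i = random.randrange(len(self.entries))
        cookies, created = self.entries[i]
        if time.time() - created > COOKIE_TTL:
            cookies = fetch_fresh_cookies()
            self.entries[i] = (cookies, time.time())
        return cookies

# usage: pass pool.get() as the cookies= argument of each request
pool = CookiePool()
print(pool.get())
```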
Rate-limited access
Brute-force multi-threaded scraping will get your IP blocked within minutes. Rate-limited access is quite simple to implement (through a task queue), so don't worry too much about efficiency; generally, combined with IP proxies, the target content can still be captured quickly.
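A minimal sketch of queue-based pacing, assuming a single worker thread and an arbitrary rate of 2 requests per second; multiple workers would need a shared limiter instead:

```python
import queue
import threading
import time

REQUESTS_PER_SECOND = 2        # assumed limit; tune to what the target tolerates
task_queue = queue.Queue()

def rate_limited_worker(fetch):
    """Pull URLs off the queue and fetch them no faster than the configured rate."""
    interval = 1.0 / REQUESTS_PER_SECOND
    while True:
        url = task_queue.get()
        if url is None:        # sentinel value stops the worker
            break
        fetch(url)
        time.sleep(interval)   # crude pacing between requests

# usage: enqueue URLs, then start one worker (print stands in for a real fetch function)
for u in ["https://example.com/1", "https://example.com/2"]:
    task_queue.put(u)
task_queue.put(None)
threading.Thread(target=rate_limited_worker, args=(print,)).start()
```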
Some pitfalls
When you grab a lot of content from a target website, you will inevitably cross a line and trigger its anti-crawler mechanism, so you need appropriate alerting to tell you that the crawler has stopped working.
Typically, once anti-crawling kicks in, requests come back with HTTP 403; some websites instead return a page asking for a verification code (Douban, for example). So raise an alarm when requests start failing with 403. You can combine this with a monitoring framework such as Metrics: set a threshold on the number of failures within a short window, and have it send you emails, text messages, and so on when the threshold is exceeded.
Of course, simply detecting 403 errors does not cover every case. Some websites are peculiar: the page returned after anti-crawling still has status 200 (Qunar, for example). The crawler task then proceeds to the parsing stage, where parsing inevitably fails. For these cases we can only raise an alarm when parsing fails, and trigger a notification when such alarms exceed a threshold within a short window.
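A hand-rolled sketch of that threshold logic covering both 403 responses and parse failures (a real setup would hang this off a monitoring framework like Metrics); the window size, threshold, and `send_alert` hook are all assumptions:

```python
import time
from collections import deque

WINDOW_SECONDS = 60      # assumed sliding window
ALARM_THRESHOLD = 20     # assumed: this many failures per window triggers a notification

_failures = deque()      # timestamps of recent failures (403s or parse errors)

def record_failure(reason):
    """Call this on a 403 response or a parse failure; alerts past the threshold."""
    now = time.time()
    _failures.append(now)
    # drop events that have fallen out of the window
    while _failures and now - _failures[0] > WINDOW_SECONDS:
        _failures.popleft()
    if len(_failures) >= ALARM_THRESHOLD:
        send_alert(f"{len(_failures)} crawler failures ({reason}) in the last {WINDOW_SECONDS}s")
        _failures.clear()  # reset so repeated failures don't spam alerts

def send_alert(message):
    """Hypothetical notification hook: replace with email/SMS/webhook of your choice."""
    print("ALERT:", message)
```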
Of course, this part of the solution is not perfect: sometimes parsing fails because the website's structure has changed, and that also triggers the alarm, so you cannot easily tell which cause is behind it.