Bilibili (1): Scraping video information for data analysis

Thanks to the help of @Xiongge and @Xunge, I was able to finish this article; otherwise I would not have known how to deal with the IP restrictions.

Project address: /UranusLee/ Bili Bili _spider

Using Chrome's developer tools, we can see that Bilibili's video statistics are loaded through JavaScript. The data comes from a JSON request of the form stat?aid=..., where aid is the video ID.

After analyzing how the JSON file is loaded, we can work out the request headers that are needed.

Because I have previously scraped sites that require special headers, such as Douban, Zhihu and Lagou, I simply added all of the headers to save trouble.

I ran into problems while crawling. Repeated testing showed that there is an IP access limit: up to roughly 150 requests per minute the IP is not blocked, and beyond that it is blocked for 5 minutes at a time. So I switched to proxy IPs, adding the stable paid proxies I had bought so that I could crawl without inserting delays.
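The limit described above (roughly 150 requests per minute per IP) can be sketched as a sliding-window counter kept for each proxy. This is only a minimal sketch, not the code from the project: the class name ProxyPool, the proxy labels, and the default limit values are assumptions for illustration, and the actual HTTP request is left out.

```python
import time
from collections import deque


class ProxyPool:
    """Hand out proxies so that none exceeds `limit` requests per `window` seconds.

    Hypothetical helper: the defaults mirror the limit observed in the text,
    not any documented Bilibili rule.
    """

    def __init__(self, proxies, limit=150, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        # One deque of recent request timestamps per proxy.
        self.history = {proxy: deque() for proxy in proxies}

    def acquire(self):
        """Return a proxy that still has budget, or None if all are exhausted."""
        now = self.clock()
        for proxy, stamps in self.history.items():
            # Forget requests that have fallen out of the time window.
            while stamps and now - stamps[0] >= self.window:
                stamps.popleft()
            if len(stamps) < self.limit:
                stamps.append(now)
                return proxy
        return None
```

A crawler loop would call acquire() before each request and sleep briefly whenever it returns None, so that no single proxy crosses the threshold that triggers the 5-minute block.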

In total I crawled more than 7.1 million records, which took three or four days; network dropouts and expired proxy IPs along the way dragged it out. I don't plan to continue, because the data is analyzed by year, so I chose aid=11883351 as the cutoff.

Some records had a view count of -1. They made up about 2% of the data, so I deleted them.
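The cleaning step could look like the sketch below. The field name "view" and the function name are assumptions, since the article does not show the record layout.

```python
def drop_invalid(records, field="view"):
    """Remove records whose view count is -1 and report the share removed.

    `field` is a hypothetical key name for the view count.
    """
    valid = [r for r in records if r[field] != -1]
    dropped_share = 1 - len(valid) / len(records) if records else 0.0
    return valid, dropped_share
```

Returning the dropped share alongside the cleaned list makes it easy to confirm that the removed fraction is small (about 2% in the article) before discarding it.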

Views, danmaku (bullet comments), replies, favorites, coins and shares are all long-tailed: most values are small, but the overall mean is heavily affected by extreme values. Of these, the view count is the most worthwhile to study.

1. View count analysis

The videos are divided into five groups by view count: <500, 500-1000, 1000-5000, 5000-20000 and >20000.

Overall, low-view videos dominate. Videos with fewer than 500 views account for 48.8% of all videos, while those with more than 20,000 views account for only 5.4%. By the 80/20 rule (the "28 principle"), a video needs more than 3,338 views to cross the threshold of practical usefulness.
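The grouping and the 80/20 threshold above can be sketched as follows. This is an illustration under assumptions: boundary values are placed in the upper bucket, and the 80/20 rule is read as "the lowest view count among the top 20% of videos", which is only one possible reading of how the 3,338 figure was obtained.

```python
from bisect import bisect_right

EDGES = [500, 1000, 5000, 20000]
LABELS = ["<500", "500-1000", "1000-5000", "5000-20000", ">20000"]


def bucket_shares(views):
    """Return the share of videos in each view-count bucket.

    A value equal to an edge goes to the upper bucket (an assumption).
    """
    counts = [0] * len(LABELS)
    for v in views:
        counts[bisect_right(EDGES, v)] += 1
    total = len(views)
    return {label: c / total for label, c in zip(LABELS, counts)}


def top_share_threshold(views, top_share=0.2):
    """Return the smallest view count among the top `top_share` of videos."""
    ordered = sorted(views, reverse=True)
    k = max(1, int(len(ordered) * top_share))
    return ordered[k - 1]
```

With the full data set, bucket_shares would give the distribution reported in the text, and top_share_threshold would give the cutoff under the stated reading of the rule.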

2. Analyze the annual video growth rate.

The analysis uses July as the yearly cutoff.

From the aid we can look up the upload time of the scraped records, and then estimate the number of videos as of July each year.

Overall, the number of videos roughly doubles every year. Leaving out 2010-2011, the growth curve is smooth.

Between 2010 and 2011, something must have happened to cause a sudden surge: the total number of videos grew by more than 800%. Looking it up confirmed what I had assumed. In 2010, after several serious danmaku conflicts at AcFun (Station A), Station A shut down its danmaku system, many people posted the slogan "ACG get out of AC", and a large number of Station A uploaders (UP owners) moved to Bilibili, which is where Bilibili's counterattack began.

The only dip was in 2014, when the annual growth rate dropped to 94%. Because of anime copyright issues that year, private uploads of anime were banned, and the number of videos fell by about 80,000. This year is even more striking: July 2018 has not yet arrived, and the total number of videos has already reached about 22 million.
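One way to compute the annual growth figures is to express each year's July count as a percentage of the previous year's. The sketch below assumes that reading (the article does not define the rate precisely), and the function name is hypothetical.

```python
def yearly_ratios(counts_by_year):
    """Map each year to its count as a percentage of the previous year's count.

    `counts_by_year` is a dict such as {year: video_count_as_of_July}.
    The first year has no predecessor and is omitted from the result.
    """
    years = sorted(counts_by_year)
    return {
        year: 100.0 * counts_by_year[year] / counts_by_year[prev]
        for prev, year in zip(years, years[1:])
    }
```

Under this definition, a value near 200 means the count doubled, and a value below 100 (such as the 94% reported for 2014) means it shrank relative to the year before.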

3. Analysis of user activity participation rate

Danmaku has the lowest cost: on average, one danmaku is sent for every 27.8 views. This figure includes views by non-members, who cannot send danmaku, which raises the apparent cost. Sharing is open not only to members but also to visitors who are not logged in. Even so, at 42.58 views per share, its cost is second only to danmaku, which suggests that video styles overall are quite diverse. Coins cost 121.58 views each. This is a result of Bilibili's coin system: B coins are scarce and hard to earn, so the cost of a coin is much higher than that of the other actions.
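The "cost" of an action can be read as the number of views needed, on average, to produce one such action. The sketch below computes it as total views divided by total actions; this is an assumption, since the article could equally have averaged per-video ratios, and the field names are hypothetical.

```python
def views_per_action(records, action, view_field="view"):
    """Return total views divided by the total count of `action`.

    `action` and `view_field` are hypothetical keys of each record,
    e.g. "danmaku", "share" or "coin".
    """
    total_views = sum(r[view_field] for r in records)
    total_actions = sum(r[action] for r in records)
    if total_actions == 0:
        return float("inf")  # the action never happened
    return total_views / total_actions
```

A lower result means the action is cheaper for users, which matches the ordering in the text: danmaku first, then shares, then coins.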

4. Coin analysis

There is an unwritten rule for giving coins: "no coin unless it's awesome, no coin unless you're convinced". The number of coins a video receives usually reflects its quality and the current trends.

First place is the 2017 New Year Festival, with 941,000 coins.

Second place is "Guzheng Senbonzakura: have you ever seen such a fierce etude?", with 796,000 coins.

Third place is the 2016 New Year Festival, with 772,000 coins.

Then come two in a row from Director Ao.

Director Ao's "The FC games that will make your ears pregnant", with 746,000 coins.

"Director Ao gets proven wrong! Contra's underwater eight levels really exist", with 730,000 coins.

Two of the top three are New Year Festival videos. Users giving coins to them spontaneously has become part of Bilibili's culture, and the festival is a highlight of every year. There is also the conscientious Director Ao, who explains and introduces old games in every video. And there are the multi-talented fellow users: otaku but sincere, highly skilled but very thoughtful.

Cultural diversity is what fundamentally holds the whole site up. I have been hooked by "Blissful Pure Land", I have seen foreigners become internet celebrities in China, and I have heard Contra and Japanese electronic music played on the guzheng. This is a melting pot where everyone can find something they like. I suddenly remembered a highly rated video in the advertising section with only 300 danmaku but more than 20 million views. I don't know whether the operators cleared some of the danmaku and comments, but this is an advertisement, something countless young people would normally never watch, and it was played more than 20 million times, almost once per user. All I can say is that they really understand their audience.

-dividing line.

I haven't finished the data analysis, but I'm really sleepy today. Later today or tomorrow I will keep digging into Bilibili's various modules, including semantic analysis of danmaku, guessing the plot from danmaku, what kinds of videos become popular, the influence of uploaders, video quality modeling and so on.