Only one-tenth of the data is needed to handle the four major visual tasks, and it is even open source.
GitHub link: /opengvlab
Folks, have you ever had this kind of trouble?
Every time you move, you have to change your furniture, because the expensive, heavy pieces are too hard to carry along.
Buying furniture again the next time costs money all over again. The worst part is doing the same thing back and forth: each piece has barely been used before it is replaced, so its utilization rate stays low.
This trouble with moving is a lot like the AI field. To handle several tasks you have to develop several highly customized models, each of which requires collecting large amounts of data and annotating it from scratch every time. This does nothing to improve the learning efficiency of the data, and it incurs huge data-acquisition costs.
Frontier AI research alone consumes this much effort, to say nothing of the tens of thousands of long-tail tasks in real application scenarios.
So what can be done?
Building a general-purpose deep learning model is the key.
1. Universality is the basis of technology
At present, general language models (GLM) such as BERT, T5 and GPT-3 have made remarkable progress and can handle a wide range of downstream language tasks with ease.
In contrast, research on general vision models (GVM) has yet to produce an equally satisfactory answer.
In the past, most GVM research relied mainly on a single source of supervision signal: ViT-G/14 uses label supervision, SEER uses contrastive learning between different augmentations of the same sample, and CLIP uses paired image-text supervision. Pre-trained under a single supervision signal, these paradigms can indeed produce models that perform well in fixed scenarios, but they struggle in downstream settings with diverse scenes and tasks.
Take autonomous driving, currently the most talked-about example. The car is in motion, and the model has to read the road, the traffic lights and the pedestrians all at once. With the rise of the intelligent cockpit, it must also work with language technology and LBS scene services. So much perception data, so many collaborative tasks and so many randomly appearing new tasks raise the requirements on vision models dramatically, in both volume and dimensionality.
Only by building a general vision model and lowering the barrier to research and development, especially the time and capital costs borne by academia, can downstream applications deliver the best possible scene experience.
In November last year, Shanghai Artificial Intelligence Laboratory, together with SenseTime, The Chinese University of Hong Kong and Shanghai Jiao Tong University, released the general vision technology system "INTERN", a continuous learning framework designed to systematically solve a series of bottlenecks in the vision field of artificial intelligence, such as task generality, scene generalization and data efficiency.
Not long ago, Shanghai Artificial Intelligence Laboratory and SenseTime released OpenGVLab, an open-source platform for general vision, opening up its ultra-efficient pre-trained models, an ultra-large-scale public dataset and the industry's first evaluation benchmark for general vision models.
So what is the magic of these open-source technologies?
2. Brute force works miracles: building a general vision model
"INTERN" is the foundational technology for realizing general visual capability.
In terms of technical implementation, the "Scholar" (INTERN) technical system consists of seven modules: three infrastructure modules and four training stages.
Structure diagram of "Scholar" (INTERN)
First, the general vision data system.
This is an extremely large, precisely labeled dataset with 1 billion samples and multiple types of supervision signals. Four data subsets are set up for the four major visual tasks: the multi-modal GV-D-1B, the classification-labeled GV-Dc-36M, the detection-labeled GV-Dd-3M, and the segmentation-labeled GV-Ds-143K.
In addition, the dataset comes with a label system of about 119,000 tags, which not only covers many fields of the natural world and almost all the labels used in current computer vision research, but also adds a large number of fine-grained labels describing attributes and states of objects in the images.
And this scale is the clearest illustration of "Scholar's" "brute force works miracles" approach.
Second, the general vision model architecture.
It is built from a unified search space containing both CNN and Transformer components.
Why build such a hybrid structure? For many years, convolutional neural networks (CNN) dominated visual representation learning and showed stable transferability on downstream tasks such as image classification, object detection and semantic segmentation. In recent years, however, the Vision Transformer (ViT) has matched CNN performance on ImageNet-1k using only a plain Transformer architecture, and ViT shows even greater potential than CNNs on large-scale datasets.
Although ViT has the performance advantage, a pure Transformer network lacks some of the inductive biases of convolutional networks and therefore needs more data and more compute. Moreover, the cost of self-attention grows quadratically with the number of inputs, which limits its use at high input resolutions. Combining CNN, Transformer and MLP components to balance efficiency and effectiveness is therefore key to the model's universality.
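To make this trade-off concrete, here is a minimal PyTorch sketch of a hybrid backbone that applies convolutions in the shallow, high-resolution stages and a Transformer block only on the downsampled deep features. It is purely illustrative: the module names, channel sizes and depths are assumptions, not the actual MetaNet code.

```python
# Minimal, illustrative sketch of a CNN + Transformer hybrid backbone.
# NOT the MetaNet implementation; sizes and layer choices are arbitrary.
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """Convolutional stage: cheap at high resolution, strong inductive bias."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class TransformerStage(nn.Module):
    """Transformer stage, applied only to the downsampled feature map,
    since self-attention cost grows quadratically with the token count."""
    def __init__(self, dim, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = self.encoder(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class HybridBackbone(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.shallow = nn.Sequential(ConvStage(3, 64), ConvStage(64, 128),
                                     ConvStage(128, 256))
        self.deep = TransformerStage(dim=256)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(256, num_classes))

    def forward(self, x):
        return self.head(self.deep(self.shallow(x)))

# Usage: a 224x224 image is reduced to a 28x28 grid before attention runs.
model = HybridBackbone()
logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```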
This model structure, with stronger generalization ability and larger model capacity, is called MetaNet. Searching over the MetaNet family of network structures yields an optimal architecture for model training.
Unified-search MetaNet architecture: Conv and Trans denote convolution and Transformer respectively; C and s are the number of output channels and the stride of each stage.
Specifically, MetaNet not only proposes a unified search of the architecture space based on the reinforcement-learning PPO algorithm, but also introduces Context-Aware Down-sampling Modules (DSM), including the local-global DSM (LG-DSM) and the global DSM (G-DSM), to replace the original downsampling modules and prevent traditional downsampling from becoming the bottleneck of model performance.
As a result, the shallow layers of the model still use convolution to extract features, while the deep layers combine Transformer modules with LG-DSM to better capture global information.
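The paper's exact DSM design is not reproduced here, but the general idea of a downsampling step that fuses a local (convolutional) path with a global (attention) path can be sketched as follows. Everything in this snippet is an assumption made for illustration; it is not the LG-DSM implementation.

```python
# Rough illustration of a "local + global" downsampling module.
# An assumption-based sketch, not the LG-DSM from the paper.
import torch
import torch.nn as nn

class LocalGlobalDownsample(nn.Module):
    def __init__(self, in_ch, out_ch, heads=4):
        super().__init__()
        # Local path: a strided convolution preserves neighbourhood structure.
        self.local = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        # Global path: attention lets every output position see the whole map.
        self.query_pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.attn = nn.MultiheadAttention(embed_dim=in_ch, num_heads=heads,
                                          batch_first=True)
        self.proj = nn.Linear(in_ch, out_ch)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        local = self.local(x)                              # (B, out_ch, H/2, W/2)

        q = self.query_pool(x).flatten(2).transpose(1, 2)  # (B, HW/4, C)
        kv = x.flatten(2).transpose(1, 2)                  # (B, HW, C)
        g, _ = self.attn(q, kv, kv)                        # (B, HW/4, C)
        g = self.proj(g).transpose(1, 2).reshape(b, -1, h // 2, w // 2)

        return local + g                                   # fuse the two paths

# Usage: halves the spatial resolution while mixing local and global context.
y = LocalGlobalDownsample(64, 128)(torch.randn(2, 64, 32, 32))
print(y.shape)  # torch.Size([2, 128, 16, 16])
```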
At the same time, based on the largest model, MetaNet-B15, "Scholar" distilled as many as 13 different model structures with 24 different sets of model weights, all of which are now open source.
These structures cover most of the mainstream backbones in use today. They can easily be transferred into whatever algorithm framework is needed as the initialization for pre-training a new network, and they reach better results than the original networks with shorter training time.
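In practice, reusing such an open-source backbone usually comes down to loading the released weights into your own network before fine-tuning. The snippet below shows only the generic PyTorch pattern; the checkpoint filename is a placeholder and the ResNet-50 stand-in is not an OpenGVLab model definition, so consult the OpenGVLab release for the real model code and weight files.

```python
# Generic pattern for initializing a new model from released backbone weights.
# The checkpoint filename is a placeholder; the real model definition and
# weight files come from the backbone's own release.
import torch
import torchvision

# Stand-in backbone; in real use this would be the matching model definition.
model = torchvision.models.resnet50(weights=None)

checkpoint = torch.load("pretrained_backbone.pth", map_location="cpu")
state_dict = checkpoint.get("state_dict", checkpoint)

# strict=False keeps the load working when the task head differs from the
# pre-training head (e.g. a new classification layer for your own dataset).
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")

# Fine-tune as usual, typically with a smaller learning rate for the backbone.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```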
Compared with other models, MetaNet performs as follows:
Grouping models into convolution-based, Transformer-based and hybrid structures (denoted C, T and H respectively), the MetaNet models MN-B1, MN-B4 and MN-B7 achieve higher image-classification accuracy than other SOTA models while using fewer FLOPs and parameters.
Beyond classification, MetaNet was also used as the backbone for detection and segmentation, trained on the COCO dataset within the Mask R-CNN framework. The results show that MN-B4 is 2 to 4 points more accurate than Swin-T while using fewer parameters. On semantic segmentation with the ADE20K dataset, the mIoU of MN-B4 is five points higher than that of Swin-T.
These two sets of experiments show that the MetaNet family reaches a new SOTA in the trade-off between model accuracy and computational complexity.
Finally, the general vision evaluation benchmark.
The vision evaluation benchmark GV-B acts as a kind of proving ground where models are put to the test.
As shown in the table below, the benchmark collects 26 downstream-task datasets spanning four types of vision tasks: classification, detection, segmentation and depth estimation.
In its setup, the benchmark introduces a percentage-shot setting: only a fraction of the full dataset, such as 1% or 2%, is used, so that model performance can be compared after the downstream training data has been reduced.
Compared with the traditional few-shot setting, this percentage-based setting preserves properties of the original dataset such as its long-tail distribution, and it reduces sensitivity to how samples are chosen. For datasets with imbalanced class distributions, such as VOC07+12 in the table below, the percentage split inherits that distribution.
The three columns on the right, avg, min and max, give the average, minimum and maximum number of samples per class in the 1% split.
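A percentage-shot split of this kind can be approximated with per-class stratified sampling, which keeps each class's share of the data, and therefore the long-tail shape, roughly intact. The sketch below is a simplified illustration of the idea, not GV-B's official splitting code.

```python
# Simplified illustration of a "percentage-shot" split: sample the same
# fraction from every class so the class distribution (including its long
# tail) is preserved. Not the official GV-B splitting code.
import math
import random
from collections import defaultdict

def percentage_shot_split(labels, percent=0.01, seed=0):
    """Return indices of a `percent` subset, stratified by class label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    subset = []
    for indices in by_class.values():
        k = max(1, math.ceil(len(indices) * percent))  # keep at least 1 per class
        subset.extend(rng.sample(indices, k))
    return sorted(subset)

# Usage: a toy long-tailed label list with a 1,000-sample head class and a
# 20-sample tail class; both remain represented in the 1% split.
labels = [0] * 1000 + [1] * 200 + [2] * 20
subset = percentage_shot_split(labels, percent=0.01)
print(len(subset), {c: sum(labels[i] == c for i in subset) for c in (0, 1, 2)})
```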
Based on these datasets and task types, the paper selects a number of representative models for evaluation and comparison. For fairness, the comparison uses the official pre-trained weights of these models.
With the ultra-large precisely labeled dataset, the model architecture and the evaluation benchmark in place, everything is ready; all that remains is training.
The scholar is a classic figure in ancient China: a personified role who, through continuous learning and growth, acquires talents of all kinds, moving from basic knowledge and skills, to a range of professional specialties, and finally to being a generalist with broad knowledge. In the same spirit, the "INTERN" system gradually attains mastery of general vision through continuous learning, ultimately enabling flexible and efficient model deployment.
Let's look at how, through training, this system grows step by step from a novice to an expert and then to a generalist, finally showing its talents across a variety of tasks.
The first stage trains basic ability and is called the "Amateur" stage.
The natural starting point here is CLIP-style contrastive pre-training on image-text pairs. However, CLIP needs 400 million image-text pairs for pre-training, and such a huge data requirement makes it hard to push further. "Scholar" therefore proposes a new training paradigm, DeCLIP (Data-efficient CLIP), which pre-trains the model with supervision signals from image-text, image-image and text-text pairs simultaneously, achieving generality more efficiently.
In addition, to take full advantage of large-scale multi-modal data in building the foundation model, this stage proposes Upstream-Amateur (Up-A), a vision-language pre-training framework that mines both intra-modal and cross-modal knowledge.
This training framework is divided into two pre-training phases: Upstream-Amateur for Global Representation (Up-A-G) and Upstream-Amateur for Local Representation (Up-A-L).
Up-A-G (left) uses group supervision to learn from richer supervision signals, while Up-A-L (right) applies local self-supervised learning to adapt the trained vision-language model and improve its performance on dense-prediction CV tasks.
Framework of Upstream-Amateur
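The spirit of this multi-source supervision can be sketched as a sum of contrastive terms over image-text, image-image (two augmented views) and text-text pairs. The snippet below is a heavily simplified, assumption-based illustration; the real DeCLIP objective differs in its details, for instance in how the self-supervised terms are formulated.

```python
# Heavily simplified sketch of multi-source supervision in the spirit of
# DeCLIP: image-text, image-image and text-text pairs all contribute a
# contrastive term. Encoders are stand-ins; not the DeCLIP implementation.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of aligned embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multi_supervision_loss(img_v1, img_v2, txt_v1, txt_v2):
    """img_v1/img_v2: embeddings of two augmented views of the same images;
    txt_v1/txt_v2: embeddings of two text variants of the same captions."""
    loss_it = info_nce(img_v1, txt_v1)  # image-text supervision (CLIP-style)
    loss_ii = info_nce(img_v1, img_v2)  # image-image self-supervision
    loss_tt = info_nce(txt_v1, txt_v2)  # text-text self-supervision
    return loss_it + loss_ii + loss_tt

# Usage with random stand-in embeddings (batch of 8, dimension 128).
emb = lambda: torch.randn(8, 128)
print(multi_supervision_loss(emb(), emb(), emb(), emb()).item())
```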
Thanks to these internal supervision signals, DeCLIP-ResNet50 achieves 60.4% zero-shot accuracy on ImageNet, 0.8% higher than CLIP-ResNet50, while using 81% less data. When transferred to downstream tasks, DeCLIP-ResNet50 outperforms CLIP on 8 of 11 visual datasets.
More importantly, the completed Upstream-Amateur stage provides a high starting point for the subsequent training stages.
The second stage trains professional ability and is called the "Expert" stage.
The foundation model obtained in the Up-A stage performs well on general visual recognition problems. But mastering more specific tasks such as detection and segmentation requires more specialized pre-training on each task, which is where the second stage, the expert models, comes in.
For each expert, "Scholar" adopts a simple multi-head design: each head is a dataset-specific sub-network that branches off from a common, shared "backbone". For example, Up-E (C), Up-E (D) and Up-E (S) handle image classification, object detection and semantic segmentation respectively.
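A shared-backbone, multi-head layout of this kind is easy to picture in code. The sketch below is illustrative only: the toy backbone and head definitions are assumptions, not the Up-E implementation.

```python
# Illustrative shared-backbone / multi-head layout in the spirit of the
# expert stage: one shared trunk, one lightweight head per dataset.
# Toy-sized modules; not the Up-E implementation.
import torch
import torch.nn as nn

class MultiHeadExpert(nn.Module):
    def __init__(self, feat_dim=256, classes_per_dataset=None):
        super().__init__()
        classes_per_dataset = classes_per_dataset or {"imagenet": 1000,
                                                      "cifar": 100}
        # Shared "backbone" (stand-in for a real pre-trained trunk).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=4, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # One head (sub-network) per dataset, branching from the shared trunk.
        self.heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, n_cls)
            for name, n_cls in classes_per_dataset.items()
        })

    def forward(self, x, dataset):
        features = self.backbone(x)           # shared computation
        return self.heads[dataset](features)  # dataset-specific prediction

# Usage: the same backbone serves different datasets through different heads.
model = MultiHeadExpert()
x = torch.randn(2, 3, 224, 224)
print(model(x, "imagenet").shape)  # torch.Size([2, 1000])
print(model(x, "cifar").shape)     # torch.Size([2, 100])
```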
The third stage trains combined ability and is called the "Generalist" stage.
The multi-task setting mentioned above means either one visual problem (such as classification) across different datasets (such as ImageNet and CIFAR), or multiple visual problems (such as classification and detection) on one dataset. The key question is how to merge the experts into a unified model and obtain a more general vision model. Therefore, after the "expert" pre-training stage, the "generalist" serves as a third pre-training stage that further unifies the feature representation.
"Scholar" put forward a new paradigm,