[ECCV2020] Paper Translation: Character Region Attention For Text Spotting (CRAFTS)

A scene text spotter is composed of text detection and recognition modules. Much research has been devoted to unifying these modules into an end-to-end trainable model to achieve better performance. A typical architecture places the detection and recognition modules in separate branches, and RoI pooling is commonly used to let the branches share visual features. However, there is still a chance of establishing a more complementary connection between the modules when a recognizer with an attention-based decoder and a detector that represents the spatial information of character regions are adopted. This is possible because the two modules share a common subtask, which is finding the location of character regions. Based on these insights, we construct a tightly coupled single-pipeline model. This architecture is formed by using the detection output as the recognizer's input and by propagating the recognition loss through the detection stage. The use of the character score map helps the recognizer attend better to the character center points, and the recognition loss propagated to the detector module enhances the localization of character regions. In addition, a strengthened sharing stage allows feature rectification and boundary localization of arbitrarily shaped text regions. Extensive experiments demonstrate state-of-the-art performance on publicly available benchmarks of straight and curved texts.

Scene text spotting, which covers text detection and recognition, has recently attracted extensive attention because of its variety of applications in instant translation, image retrieval, and scene parsing. Although existing text detectors and recognizers work well on horizontal texts, spotting curved text instances in scene images remains a challenge.

To spot curved texts in an image, a classic method is to cascade existing detection and recognition models to handle text instances of every shape. The detectors [32, 31, 2] attempt to capture the geometric attributes of curved texts by applying complicated post-processing techniques, while the recognizers apply multi-directional encoding [6] or adopt rectification modules [37, 46, 11] to enhance recognition accuracy on curved texts.

With the development of deep learning, research has been conducted on combining detectors and recognizers into a jointly trainable end-to-end network [14, 29]. Having a unified model not only improves the size efficiency and speed of the model, but also helps the model learn shared features, thereby improving overall performance. To benefit from this property, attempts have also been made to handle curved text instances with end-to-end models [32, 34, 10, 44]. However, most existing works only use an RoI-pooling-like operation to share low-level features between the detection and recognition branches. In the training stage, instead of training the whole network, only the shared feature layers are trained using both the detection and recognition losses.

As shown in Figure 1, we propose a novel end-to-end text spotting model with character region attention, called CRAFTS. Instead of isolating the detection and recognition modules in two separate branches, we establish a single pipeline with complementary connections between the modules. We observe that a recognizer [1] using an attention-based decoder and a detector [2] encapsulating character spatial information share a common subtask of localizing character regions. By tightly integrating the two modules, the output of the detection stage helps the recognizer attend better to the character center points, and the loss propagated from the recognizer to the detector stage enhances the localization of character regions. Furthermore, the network can maximize the quality of the feature representation used in the common subtask. To the best of our knowledge, this is the first end-to-end work that establishes such a tightly coupled loss.

Our contributions are summarized as follows:

(1) We propose an end-to-end network that can detect and recognize text of arbitrary shapes.

(2) By utilizing spatial character information from the detector in the rectification and recognition modules, we construct a complementary relationship between the modules.

(3) We establish a single pipeline by propagating the recognition loss through all the features of the whole network.

(4) We achieve state-of-the-art performance on the IC13, IC15, IC19-MLT, and TotalText [20, 19, 33, 7] datasets, which contain numerous horizontal, curved, and multilingual texts.

Text detection and recognition methods

Detection networks use regression-based [16, 24, 25, 48] or segmentation-based [9, 31, 43, 45] methods to generate text bounding boxes. Some recent methods, such as [17, 26, 47], take Mask R-CNN [13] as the base network and gain advantages from both regression and segmentation methods through multi-task learning. In terms of the unit of text detection, all methods can also be sub-classified into word-level or character-level [16, 2] prediction.

Text recognizers usually adopt a CNN-based feature extractor and an RNN-based sequence generator, and are classified by their sequence generators: connectionist temporal classification (CTC) [35] and attention-based sequential decoders [21, 36]. The detection model provides information about the text region, but it is still a challenge for the recognizer to extract useful information from arbitrarily shaped texts. To help recognition networks handle irregular texts, some studies [36, 28, 37] use a spatial transformer network (STN) [18]. Moreover, the papers [11, 46] further extend the use of STN by iterative rectification. These studies show that running STN recursively helps the recognizer extract useful features from extremely curved texts. In [27], a recurrent RoIWarp layer is proposed to crop individual characters before recognizing them. This work proves that the task of finding a character region is closely related to the attention mechanism used in attention-based decoders.

One way to build a text spotting model is to place the detection and recognition networks sequentially. A well-known two-stage architecture couples the TextBoxes++ [24] detector with the CRNN [35] recognizer. Simple as it is, this approach achieves good results.

End-to-end methods using RNN-based recognizers

EAA [14] and FOTS [29] are end-to-end models based on the EAST detector [49]. The difference between the two networks lies in the recognizer: the FOTS model uses a CTC decoder [35], while the EAA model uses an attention decoder [36]. Both works implement an affine transformation layer for pooling shared features. The proposed affine transformation works well on horizontal texts, but it shows limitations when handling arbitrarily shaped texts. TextNet [42] proposes a spatial-aware text recognizer with perspective RoI transformation in the feature pooling layer. The network keeps an RNN layer to recognize text sequences in the 2D feature map, but due to the lack of expressiveness of quadrilaterals, the network still shows limitations in detecting curved texts.

Qin et al. [34] proposed an end-to-end network based on Mask R-CNN [13]. Given box proposals, features are pooled from the shared layer, and an RoI masking layer is used to filter out background clutter. The proposed method improves performance by ensuring that attention is placed only on the text region. Busta et al. proposed the Deep TextSpotter network and extended their work in E2E-MLT. The network consists of an FPN-based detector and a CTC-based recognizer, and the model predicts multiple languages in an end-to-end manner.

End-to-end methods using CNN-based recognizers

Most CNN-based models that recognize text at the character level have advantages when handling arbitrarily shaped texts. Mask TextSpotter [32] is a model that recognizes text with a segmentation approach. Although it has strengths in detecting and recognizing individual characters, it is difficult to train the network because character-level annotations are usually not provided in public datasets. CharNet [44] is another segmentation-based method that makes character-level predictions. The model is trained in a weakly supervised manner to overcome the lack of character-level annotations. During training, the method performs iterative character detection to create pseudo ground truths.

Although segmentation-based recognizers have achieved great success, the approach suffers when the number of target characters increases. As the character set grows, segmentation-based models require more output channels, which increases memory requirements. A later version of Mask TextSpotter [23] expands the character set to handle multiple languages, but the authors add an RNN-based decoder instead of using their original CNN-based recognizer. Another limitation of segmentation-based recognizers is the lack of contextual information in the recognition branch. Due to the absence of sequential modeling such as an RNN, the accuracy of the model drops on noisy images.

TextDragon [10] is another segmentation-based method for localizing and recognizing text instances. However, it does not guarantee that a predicted character segment covers a single character region. To solve this issue, the model incorporates CTC to remove overlapping characters. The network shows good detection performance but exhibits limitations in the recognizer due to the lack of sequential modeling.

The CRAFT detector [2] is chosen as the base network because of its capability of representing semantic information of character regions. The outputs of the CRAFT network represent the center probability of character regions and the linkage between them. Since the goal of the two modules is to localize the center positions of characters, we conjecture that this character center information can be used to support the attention module in the recognizer. In this work, we make three modifications to the original CRAFT model: backbone replacement, link representation, and orientation estimation.

Backbone replacement

Recent studies show that using ResNet50 allows capturing well-defined feature representations for both the detector and the recognizer [30, 1]. Therefore, we change the backbone network from VGG-16 [40] to ResNet50 [15].

Link representation

Vertical texts are not common in Latin scripts, but they frequently appear in East Asian languages (such as Chinese, Japanese, and Korean). In this work, a binary center line is used to connect consecutive character regions. The reason for this change is that using the original affinity map on vertical texts often produces ill-posed perspective transformations that generate invalid box coordinates. To generate the ground-truth link map, a line segment with thickness t is drawn between adjacent characters. Here, t = max((d1 + d2) / 2 * α, 1), where d1 and d2 are the diagonal lengths of the adjacent character boxes and α is a scaling coefficient. Using this equation makes the width of the center line proportional to the size of the characters. In our implementation, α is set to 0.1.
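As a rough illustration, the thickness of the ground-truth center line between two adjacent character boxes could be computed as in the following sketch (the corner ordering and helper name are assumptions for illustration, not part of the paper):

```python
import numpy as np

def link_thickness(box1, box2, alpha=0.1):
    """Sketch of t = max(alpha * (d1 + d2) / 2, 1) for the center-line width.

    box1, box2: (4, 2) arrays of corner points of adjacent character boxes,
    assumed to be ordered top-left, top-right, bottom-right, bottom-left.
    """
    d1 = np.linalg.norm(box1[2] - box1[0])  # diagonal length of the first box
    d2 = np.linalg.norm(box2[2] - box2[0])  # diagonal length of the second box
    return max(alpha * (d1 + d2) / 2.0, 1.0)
```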

Orientation estimation

Obtaining the correct orientation of a text box is important because the box coordinates must be well defined for the recognition stage to recognize the text correctly. To this end, we add a two-channel output to the detection stage; the channels are used to predict the angles of characters along the x-axis and the y-axis, and the ground truth of the orientation map is generated from the angles of the ground-truth character boxes.
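Conceptually, the detection stage therefore produces four per-pixel outputs: the character region score, the link score, and the two orientation channels. The sketch below shows one possible shape of such a head (layer sizes, channel layout, and activations are assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Minimal sketch of a detection head emitting a character region map,
    a link (center-line) map, and two orientation channels."""

    def __init__(self, in_channels=32):
        super().__init__()
        # 4 output channels: region score, link score, x-angle, y-angle
        self.conv = nn.Conv2d(in_channels, 4, kernel_size=1)

    def forward(self, features):
        out = self.conv(features)
        region = torch.sigmoid(out[:, 0:1])       # character region score map
        link = torch.sigmoid(out[:, 1:2])         # center-line / link score map
        orientation = torch.sigmoid(out[:, 2:4])  # angles along x- and y-axes
        return region, link, orientation
```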

The sharing stage includes two modules: the text rectification module and the character region attention (CRA) module. A thin-plate spline (TPS) [37] transformation is used to rectify arbitrarily shaped text regions. Inspired by [46], our rectification module incorporates iterative TPS to obtain a better representation of the text region. By updating the control points iteratively, the curved geometry of the text in the image is improved. Through empirical study, we find that three TPS iterations are sufficient for rectification.
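At a high level, the iterative rectification loop could be sketched as follows; `tps_predictor` and `tps_warp` are hypothetical callables standing in for the control-point regressor and the TPS warping operation, which the paper does not spell out at this level:

```python
def iterative_rectify(region_map, link_map, tps_predictor, tps_warp, num_iters=3):
    """High-level sketch of iterative TPS rectification (three iterations in
    the paper): re-estimate control points and re-warp the maps each round.
    tps_warp is assumed to return the warped (region_map, link_map) pair."""
    maps = (region_map, link_map)
    control_points = None
    for _ in range(num_iters):
        control_points = tps_predictor(*maps)   # predict TPS control points
        maps = tps_warp(*maps, control_points)  # warp the maps with TPS
    return maps, control_points
```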

A typical TPS module takes a word image as input, but we feed the character region map and the link map because they encapsulate the geometric information of the text region. We use twenty control points to tightly cover the curved text region. To use these control points as the detection result, they are transformed into the original input image coordinates. Optionally, 2D polynomial fitting is performed to smooth the boundary polygon. Examples of iterative TPS and the final smoothed polygon output are shown in Figure 4.
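The optional boundary smoothing could look like the sketch below, fitting a low-degree polynomial to one boundary of the polygon (the degree, the resampling, and the function name are illustrative assumptions):

```python
import numpy as np

def smooth_boundary(points, degree=3):
    """Fit y as a polynomial of x over one polygon boundary and resample it.

    points: (N, 2) array of boundary control points in image coordinates.
    Returns N points lying on the fitted curve.
    """
    x, y = points[:, 0], points[:, 1]
    coeffs = np.polyfit(x, y, deg=min(degree, len(x) - 1))
    x_new = np.linspace(x.min(), x.max(), len(x))
    y_new = np.polyval(coeffs, x_new)
    return np.stack([x_new, y_new], axis=1)
```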

The modules in the recognition stage are formed following the results reported in [1]. The recognition stage consists of three parts: feature extraction, sequence modeling, and prediction. Because the feature extraction module takes high-level semantic features as input, it is lighter than that of a standalone recognizer.

The detailed architecture of the feature extraction module is shown in Table 1. After feature extraction, a bidirectional LSTM is applied for sequence modeling, followed by an attention-based decoder that makes the final text prediction.
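As a rough sketch of this pipeline, the snippet below applies a bidirectional LSTM over column features and a single attention step over its outputs; hidden sizes, the class count, and the one-step (non-autoregressive) decoding are simplifying assumptions, not the paper's exact recognizer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecognizerHead(nn.Module):
    """Minimal sketch: BiLSTM sequence modeling + one attention-weighted
    glimpse + a linear classifier over the character set."""

    def __init__(self, feat_dim=256, hidden=256, num_classes=95):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.attn_score = nn.Linear(2 * hidden, 1)      # scores each time step
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, feats):
        # feats: (B, T, feat_dim) column features from the extraction module
        ctx, _ = self.rnn(feats)                        # (B, T, 2 * hidden)
        attn = F.softmax(self.attn_score(ctx), dim=1)   # (B, T, 1) attention weights
        glimpse = (attn * ctx).sum(dim=1)               # (B, 2 * hidden)
        return self.classifier(glimpse)                 # logits for one character
```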

At each time step, the attention-based recognizer decodes text information by masking the attention output onto the features. Although the attention module works well in most cases, it fails to predict characters when the attention points are misaligned or vanish [5, 14]. Figure 5 shows the effect of using the CRA module: properly placed attention points enable robust text prediction.

The final training loss L is composed of the detection loss and the recognition loss: L = L_det + L_reg. The overall flow of the recognition loss is shown in Figure 6; it flows through the weights in the recognition stage and propagates to the detection stage through the character region attention module.

The detection loss, on the other hand, is used as an intermediate loss, so the weights before the detection stage are updated with both the detection and recognition losses.
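In code, the joint objective is simply the sum of the two terms; because the model is a single pipeline, backpropagating this sum also sends recognition gradients through the character region attention module into the detector (tensor names below are placeholders):

```python
import torch

def total_loss(det_loss: torch.Tensor, rec_loss: torch.Tensor) -> torch.Tensor:
    """L = L_det + L_reg. Calling .backward() on this sum updates the weights
    before the detection stage with both losses, while the recognition-stage
    weights receive gradients from the recognition loss."""
    return det_loss + rec_loss
```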

English datasets. IC13 [20] consists of high-resolution images, with 229 images for training and 233 images for testing. Word-level text instances are annotated with rectangular boxes. IC15 [19] contains 1000 training images and 500 test images; word-level text instances are annotated with quadrilateral boxes. TotalText [7] has 1255 training images and 300 test images. Unlike the IC13 and IC15 datasets, it contains curved text instances, which are annotated with polygon points.

Multilingual dataset. IC19 [33] contains 10,000 training images and 10,000 test images. The dataset contains texts in seven different languages, annotated with quadrilateral points.

We train the detector and recognizer in the CRAFTS model jointly. To train the detection stage, we follow the weakly supervised training method described in [2]. The recognition loss is computed from word features that are randomly sampled and cropped within each image in a batch. The maximum number of words per image is set to 16 to prevent out-of-memory errors. Data augmentation such as cropping, rotation, and color variation is applied for the detector. For the recognizer, the corners of the ground-truth boxes are perturbed within a range of 0% to 10% of the shorter side of the box.
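The recognizer-side corner perturbation could be implemented roughly as below; the uniform noise model and the corner ordering are assumptions made for illustration:

```python
import numpy as np

def jitter_box_corners(box, rng=None):
    """Perturb each corner of a ground-truth word box by up to 10% of the
    box's shorter side. box: (4, 2) array ordered TL, TR, BR, BL."""
    if rng is None:
        rng = np.random.default_rng()
    width = np.linalg.norm(box[1] - box[0])
    height = np.linalg.norm(box[3] - box[0])
    max_shift = 0.10 * min(width, height)
    noise = rng.uniform(-max_shift, max_shift, size=box.shape)
    return box + noise
```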

First, the model is trained on the SynthText dataset [12] for 50k iterations, and then we further train the network on the target datasets. The Adam optimizer is used, and online hard negative mining (OHEM) [39] is applied to enforce a 1:3 ratio of positive to negative pixels in the detection loss. When fine-tuning the model, the SynthText dataset is mixed in at a 1:5 ratio. We use 94 characters to cover alphanumerics and special characters, and 4267 characters for the multilingual datasets.
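A minimal sketch of the 1:3 hard-negative selection over a per-pixel detection loss is shown below; tensor names, shapes, and the averaging scheme are assumptions rather than the paper's exact formulation:

```python
import torch

def ohem_detection_loss(per_pixel_loss, pos_mask, neg_ratio=3):
    """Keep all positive pixels and only the hardest negatives at a 1:3
    positive-to-negative ratio, then average over the selected pixels.

    per_pixel_loss: 1-D tensor of per-pixel losses.
    pos_mask: boolean tensor of the same shape marking positive pixels.
    """
    pos_loss = per_pixel_loss[pos_mask]
    neg_loss = per_pixel_loss[~pos_mask]
    num_neg = min(neg_loss.numel(), max(1, neg_ratio * pos_loss.numel()))
    hard_neg, _ = torch.topk(neg_loss, k=num_neg)  # hardest negative pixels
    return (pos_loss.sum() + hard_neg.sum()) / (pos_loss.numel() + num_neg)
```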

Horizontal datasets (IC13, IC15)

For the IC13 benchmark, we take the model trained on the SynthText dataset and fine-tune it on the IC13 and IC19 datasets. During inference, we resize the longer side of the input to 1280.

The results show a significant performance improvement compared with the previous state of the art.

The model trained on the IC13 dataset is then fine-tuned on the IC15 dataset. During evaluation, the input size of the model is set to 2560 x 1440. Note that we perform the generic evaluation without a vocabulary set. The quantitative results on the IC13 and IC15 datasets are listed in Table 2.

Heat maps are used to illustrate the character region map and the link map, and the weighted pixel-wise angle values are visualized in HSV color space.

As shown in the figure, the network successfully localizes the polygonal regions and recognizes the characters in curved text regions. The two figures in the upper-left corner show successful recognition of fully rotated and highly curved text instances.

Attention assisted by character region attention

In this section, we study how character region attention (CRA) affects the recognizer's performance by training a separate network without CRA.

Table 5 shows the effect of using CRA on the benchmark datasets. Without CRA, we observe performance drops on all datasets. In particular, on the perspective dataset (IC15) and the curved dataset (TotalText), we observe a larger gap than on the horizontal dataset (IC13). This implies that feeding character attention information improves the recognizer's performance when handling irregular texts. (Translator's note: the experimental data in the table seem more favorable for long texts; I wonder how this conclusion was reached.)

Importance of orientation estimation

Orientation estimation is important because scene text images contain many multi-oriented texts. Our pixel-wise averaging scheme is very useful for the recognizer to receive well-defined features. We compare the results of the model when the orientation information is not used: on the IC15 dataset, the performance drops from 74.9% to 74.1% (-0.8%), and on the TotalText dataset, the H-mean drops from 78.7% to 77.5% (-1.2%). The results show that using correct angle information improves performance on rotated texts.

Inference speed

Since inference speed varies with the input image size, we measure FPS at different input resolutions, whose longer sides are 960, 1280, 1600, and 2560, respectively. The measured FPS are 9.9, 8.3, 6.8, and 5.4, respectively. For all experiments, we use an Nvidia P40 GPU and an Intel Xeon CPU. Compared with the 8.6 FPS of the VGG-based CRAFT detector, the ResNet-based CRAFTS network achieves higher FPS on inputs of the same size. Moreover, directly using the control points from the rectification module alleviates the need for post-processing to generate polygons.
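For reference, FPS at a given input resolution could be measured with a sketch like the following; the aspect ratio, the absence of warm-up runs, and the run count are arbitrary assumptions:

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, long_side, n_runs=50, device="cuda"):
    """Time repeated forward passes on a dummy input whose longer side is
    `long_side` pixels and return frames per second."""
    h, w = long_side, int(long_side * 0.75)            # assumed 4:3-ish aspect ratio
    dummy = torch.randn(1, 3, h, w, device=device)
    model.eval().to(device)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        model(dummy)
    torch.cuda.synchronize()
    return n_runs / (time.time() - start)
```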

Granularity difference problem

We conjecture that the granularity difference between the ground truth and the predicted boxes leads to the relatively low detection performance on the IC15 dataset. Character-level segmentation methods tend to generalize character connectivity based on spatial and color cues, rather than capturing all the features of word instances. As a result, the output box style does not conform to the annotation style required by the benchmark. Figure 9 shows failure cases on the IC15 dataset, demonstrating that detection results marked as incorrect can nonetheless look qualitatively acceptable.

In this paper, we propose an end-to-end trainable single-pipeline model that tightly couples the detection and recognition modules. The character region attention in the sharing stage fully exploits the character region map to help the recognizer rectify and attend better to text regions. In addition, we design the recognition loss to propagate into the detection stage, which enhances the character localization capability of the detector. Moreover, the rectification module in the sharing stage enables fine localization of curved texts without hand-crafted post-processing. The experimental results validate the state-of-the-art performance of CRAFTS on various datasets.