Xpeng Motors voice command did not respond
The second is the fast response speed of voice instructions, meaning the time from the end of the user's speech to the moment Xiao P begins executing the instruction. Comparing the videos, we can see that the current Extreme Edition reduces the voice-control response delay from the original 1.5 s to about 0.9 s. For in-car voice products, 0.9 s is a strong number: such products today generally sit around 1.5 s, and a good one reaches 1.2 s.
In addition, each video emphasizes the ability to understand multi-intent instructions, but this is an existing P7 feature. A better experience would be for the TTS reply to a multi-intent instruction to be a single consolidated reply, rather than announcing the execution of each instruction one by one.
Full-time dialogue
After the full-time dialogue switch is turned on, Xiao P keeps listening, and users can say instructions directly at any time without waking it up (no need to say "Hello Xiao P"). At present only some commands are supported, presumably mainly vehicle-control commands. During full-time dialogue the car does not respond to unsupported instructions, but the user can add "Xiao P" within 5 seconds, after which Xiao P will recognize and execute the instruction that was just unsupported. This product design neatly resolves the fragmented experience of full-time dialogue supporting only some domains, and it only requires "Xiao P" rather than "Hello Xiao P". Personally, I think this is the most eye-catching feature update of the G9. It is like asking someone to do something for you: if he doesn't move, you call him by name again, and shortening "Hello Xiao P" to just "Xiao P" is the more natural way to do that.
In the video demonstration, we can see that the oneshot interaction mode on the G9 shortens the four-word wake word "Hello Xiao P" to the two-word "Xiao P", a big step in halving the wake-word length. Two-word wake-word technology is still very immature: used alone it produces many false wakes, and coupling the wake word with an instruction, introducing the two-word wake word in oneshot form, alleviates this problem well. A two-word wake word is more natural and convenient to use than four words, which eases the awkwardness users can feel. The same design has been applied in Baidu's smart fitness mirror, and Apple is reportedly going to shorten "Hey Siri" to "Siri" in the same way.
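As a thought experiment on why coupling the short wake word with an instruction helps, here is a minimal sketch of oneshot-style gating; the thresholds, `wake_score`, and `parse_command` are all hypothetical placeholders, not Xpeng's implementation.

```python
# Sketch: a bare two-word wake word needs strict confidence to avoid false
# wakes, but when it arrives fused with a recognizable command (oneshot),
# the command itself corroborates the wake, so a looser threshold suffices.

WAKE_THRESHOLD_STANDALONE = 0.95   # "Xiao P" alone fires by accident easily
WAKE_THRESHOLD_ONESHOT = 0.80      # trailing command validates the wake

def parse_command(text: str):
    """Hypothetical NLU stub: return an intent if the text is a known command."""
    known = {"open the window": "window.open", "turn on seat heating": "seat.heat_on"}
    return known.get(text.strip().lower())

def should_respond(wake_score: float, trailing_text: str) -> bool:
    if parse_command(trailing_text) is not None:
        # Oneshot: wake word + command in one utterance.
        return wake_score >= WAKE_THRESHOLD_ONESHOT
    # Bare two-word wake word: demand much higher confidence.
    return wake_score >= WAKE_THRESHOLD_STANDALONE

print(should_respond(0.85, "open the window"))  # True: command corroborates
print(should_respond(0.85, ""))                 # False: bare wake, low score
```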
When the full-time dialogue switch is on, by default only the driver's seat supports full-time dialogue. Xiao P's eye animation also changes here, a product-design detail that makes for a better user experience.
Multi-person dialogue
When multi-person dialogue and full-time dialogue are turned on at the same time, full-time dialogue becomes available in all four seats, and users in the four positions can speak alternately or simultaneously without interfering with one another, which is exactly what multi-person dialogue requires.
The G9 realizes multi-round dialogue across sound zones, with the different zones sharing the same multi-round state. When the driver says "turn on seat heating", the co-driver only needs to say "me too" to turn on seat heating for the co-driver's seat. The multi-round inheritance optimization mainly targets function points whose scope is bound to a seat.
The ASR results of the four positions are displayed in the four corners of the screen, the reply content is shown on screen, and the reply is locked to the originating sound zone (sometimes no TTS reply is played). The video emphasizes these product-detail designs.
Figure 2 Four-seat full-time dialogue screen display
Functional analysis
Extreme-speed dialogue
Simply put, the eternal pursuit of voice interaction technology can be condensed into two words: fast and accurate. Fast, accurate voice interaction technology is a necessary condition for building voice products that genuinely satisfy users. The goal of extreme-speed dialogue is to make voice interaction "fast".
Figure 3 Voice interaction data flow diagram
Figure 3 shows a simplified flow from the user speaking to the car giving a reply. The yellow recording module is responsible for data acquisition; the blue part processes the collected audio to understand the user's intent; the purple part replies to the user according to the understood instruction; and the orange part is executed by the car. What users perceive as voice speed is essentially the time from recording to instruction execution, which involves hardware, algorithms, and other modules. In reality, the internal modules and interaction logic of a complete voice interaction product are far more complicated than shown here. Optimizing voice interaction speed can be analyzed from three aspects: the interaction link, the algorithms, and the system and hardware.
1. Interaction link
Interaction-link optimization means shortening the data-transmission path or speeding up data transmission when designing the interaction logic, so that feedback reaches the user faster. Possible schemes include (see the sketch after this list):
Use an offline scheme, or optimize the online-offline fusion logic.
Use streaming processing to reduce the absolute waiting time of each algorithm module.
Process algorithm modules in parallel and find the shortest path for data transmission.
Merge algorithm modules to shorten the data-transmission link.
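As a concrete illustration of the streaming and parallel items above, here is a minimal sketch in which partial ASR hypotheses feed an NLU consumer running concurrently, so intent parsing overlaps with recognition instead of waiting for the final transcript; `fake_asr_stream` and `incremental_nlu` are toy stand-ins, not any vendor's API.

```python
# Sketch: pipeline ASR partials into NLU through a queue so the two modules
# run in parallel; by the time speech ends, the intent is usually ready.
import queue
import threading

def fake_asr_stream(out_q: queue.Queue):
    """Emit growing partial transcripts, then a sentinel when speech ends."""
    for partial in ["open", "open the", "open the window"]:
        out_q.put(partial)
    out_q.put(None)  # end of utterance

def incremental_nlu(in_q: queue.Queue):
    """Re-parse every partial; intent is available as soon as speech ends."""
    intent = None
    while (text := in_q.get()) is not None:
        if "open" in text and "window" in text:
            intent = "window.open"  # toy rule standing in for a real parser
    print("intent at end of speech:", intent)

q: queue.Queue = queue.Queue()
nlu_thread = threading.Thread(target=incremental_nlu, args=(q,))
nlu_thread.start()
fake_asr_stream(q)   # ASR producer runs concurrently with the NLU consumer
nlu_thread.join()
```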
2. Algorithm
There are many modules in the voice interaction technology chain. If each algorithm module introduces a delay of tens of milliseconds, the total can accumulate to hundreds of milliseconds, so optimizing and polishing every algorithm module is essential to improving interaction speed. For algorithm engineers who build products, the ultimate problem is how to simplify the algorithm and increase speed as much as possible without degrading algorithm performance or increasing compute usage (CPU/NPU). Being a dancer on the knife's edge may be the highest demand placed on product-facing algorithm engineers. Algorithm-module optimization is not only closely tied to product experience; a simplified algorithm can also directly reduce hardware cost. In the voice technology chain, the modules with the most direct impact on interaction speed are (a toy latency budget follows the list):
Signal processing: includes three core computation modules, AEC, separation, and noise reduction, plus sound-zone localization and voice isolation.
VAD: the delay of the VAD algorithm itself is generally small; the core delay comes from the post-processing strategy, which is tied to product design and must be weighed against other aspects of the experience.
ASR: the parts that introduce delay include accumulating data for model scoring, dependence on future context, the peak shift of CTC-style algorithms, and the pruning/search strategy.
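To make the accumulation argument concrete, here is a toy latency budget with purely illustrative numbers (assumptions, not measured G9 values):

```python
# Sketch: per-module delays along the chain add up to the user-perceived
# end-of-speech -> action latency. All figures below are invented for
# illustration only.
budget_ms = {
    "signal_processing": 30,      # AEC + separation + noise reduction
    "vad_postprocess": 300,       # hangover after speech end often dominates
    "asr_finalize": 150,          # right context, CTC peak shift, beam pruning
    "nlu": 100,
    "dispatch_and_actuate": 100,  # instruction interpretation, hardware start
}
total = sum(budget_ms.values())
print(f"end-of-speech -> action: {total} ms")  # 680 ms: tens of ms per module add up
```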
3. System and hardware
Hardware is the foundation and the system is the support; a smooth underlying system is a necessary condition for an excellent software product. The voice interaction system not only depends on the hardware and operating system but also controls body hardware and the system. If the in-vehicle system itself stutters, optimizing the voice interaction algorithms is pointless. Hardware and system factors affecting the voice experience include:
Recording hardware and recording drivers
The priority given to voice-related processes in the system's resource-allocation policy.
The response speed of the controlled body hardware.
The response speed of the vehicle system.
The G9's extreme-speed dialogue function reduces voice-control delay from 1.5 s to about 0.9 s. For such a large improvement, two reasons are emphasized in each experience video:
The cloud voice solution is replaced by a fully offline solution, eliminating the upload and download legs of the cloud round trip and shortening the interaction time.
Streaming understanding is supported: ASR and NLU can run in parallel, shortening the time spent waiting for NLU.
But this is the 5G era; is network delay really that large? Skeptical, the author did a detailed analysis of the experience videos, gathering statistics on three key intervals: end of speech to the first word on screen, end of speech to the full recognition result on screen, and recognition result to the start of the car's response. The conclusions:
With extreme-speed dialogue on, the full recognition result arrives about 0.15 s earlier, but the first word appears on screen later. The improvement here is most likely related to the offline ASR algorithm scheme, with network delay accounting for only a small share.
The large improvement in extreme-speed dialogue comes from improvements to the VAD post-processing strategy and the offline NLU algorithm's streaming understanding.
Because released experience videos may be post-processed, they can differ from the real experience; the analysis will be revisited and corrected against a real car. Readers interested in speed optimization can jump to the appendix for the analysis process.
Full-time dialogue
Full-time dialogue is a disruptive interaction mode: it breaks the convention, in place since the iPhone 4S launched Siri, that a voice interaction system must carry a wake word. Following the development of voice interaction logic, the evolution toward full-time dialogue can be deduced from two directions; its essence is to improve interaction efficiency and make human-machine voice interaction more natural, more convenient, and closer to the logic of conversation between people.
Figure 4 Evolution diagram of full-time dialogue
As we all know, the wake word is the switch of the voice system: when it fires, recording starts; otherwise recording stops. Removing the wake word for full-time dialogue means the speech recognition system listens continuously, and once that switch is gone, the privacy and security of the voice interaction system deserve far more attention. To do full-time dialogue well, the following are required:
1. Use an offline voice scheme.
The offline voice scheme has the following advantages:
All data is processed locally, protecting user privacy. This data is not only the audio, which contains biometric features; the recognized text also contains a great deal of private user information.
Data does not need to be uploaded to the cloud, saving traffic costs.
All the work is done locally, saving the cost of cloud services.
The carefully polished offline voice solution on the G9 makes the full-time dialogue function feasible.
2. Do voice separation and isolation well.
The goal of voice separation is to separate the target speaker from other voices; the goal of voice isolation is to eliminate non-target voices so that only the target voice is sent to the speech recognition engine. The G9's distributed four-microphone hardware configuration reduces the difficulty of separation and isolation, but the algorithms still need to perform well on both, especially the missing-audio problem when the target seat is silent while other seats are speaking. A minimal sketch follows.
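Here is one naive zone-isolation heuristic over a distributed four-mic array; real systems use beamforming and learned separation, so this energy-gating rule is only illustrative, and `route_zones` and its margin are assumptions.

```python
# Sketch: each seat owns the mic channel it dominates; a channel is forwarded
# to its ASR only when it is within a margin of the loudest channel, which
# suppresses cross-talk leaking into quiet seats. Too large a margin re-admits
# leakage; too small a margin clips soft talkers (the missing-audio risk).
import numpy as np

def route_zones(frames: np.ndarray, margin_db: float = 6.0) -> list[int]:
    """frames: (4, N), one audio frame per seat mic. Returns active zone ids."""
    power_db = 10 * np.log10(np.mean(frames**2, axis=1) + 1e-12)
    loudest = power_db.max()
    return [z for z, p in enumerate(power_db) if loudest - p < margin_db]

# Toy input: the driver (zone 0) speaks, the other seats pick up faint leakage.
frames = np.random.randn(4, 1600) * np.array([[1.0], [0.05], [0.05], [0.05]])
print(route_zones(frames))  # -> [0]: only the driver's zone is forwarded to ASR
```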
3. Control false alarms well.
False-alarm control is the hardest and most critical part of full-time dialogue, and it directly determines the user experience of the feature. Anyone who works on voice knows that wake words also produce false alarms; perhaps 80% of the bad cases a wake-word practitioner has to resolve are false-alarm optimizations. Full-time dialogue false alarms and wake-word false alarms are both, in essence, audio the in-car system should not have responded to, but they differ in two ways.

First, the impact on the user differs. The wake word is just a switch: on a false alarm, Xiao P merely answers and turns to look at you. But every sentence in full-time dialogue is a voice-control instruction with a real action. Imagine driving in the rain, calling your wife to say you'll be home late because of traffic, when the sunroof inexplicably opens. Would you be happy about that? If you knew full-time dialogue caused it, you would turn it off immediately and never turn it on again; if you didn't know, you might be puzzled the first time and driving to the 4S shop for service the second.

Second, the frequency of false alarms and the difficulty of controlling them differ. A wake word is a fixed phrase with a relatively definite target, and controlling its false alarms is already very hard; full-time dialogue covers hundreds of function points and thousands of sentence patterns. This kind of false alarm also exists in today's wake-free continued-listening window, but because that window generally lasts only tens of seconds, the time dimension greatly reduces the chance of a false alarm.

False alarms in full-time dialogue fall into two categories. The first is instruction misrecognition caused by algorithm errors, such as ASR recognizing irrelevant speech as a valid instruction, or NLU parsing irrelevant text into a valid instruction; the cure is to keep improving algorithm performance and to detect and shield such wrong instructions through strategies (a sketch of one such strategy follows). The second is the gap between human-machine dialogue and human-human dialogue: a sentence you say while chatting with a friend may happen to be an instruction that can trigger a car action, even though you were talking to your friend, not the car. This may be the hardest problem in full-time dialogue.
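As one example of the strategy-based shielding mentioned for the first category, here is a minimal sketch that accepts a command only when ASR and NLU are both confident and the utterance looks machine-directed; all thresholds and the `is_machine_directed` heuristic are assumptions, not any production system's logic.

```python
# Sketch: two-stage false-alarm filter for wake-word-free dialogue. Chatty
# person-to-person speech tends to be long and non-imperative; command speech
# is short and verb-initial. A real system would use a trained classifier.

def is_machine_directed(text: str) -> bool:
    words = text.lower().split()
    return 0 < len(words) <= 6 and words[0] in {"open", "close", "turn", "set", "play"}

def accept_command(asr_conf: float, nlu_conf: float, text: str) -> bool:
    # Execute only if recognition, parsing, and addressee checks all pass.
    return asr_conf > 0.90 and nlu_conf > 0.85 and is_machine_directed(text)

print(accept_command(0.95, 0.90, "open the window"))                          # True
print(accept_command(0.95, 0.90, "he said he would open the window for me"))  # False
```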
4. Avoid a fragmented user experience.
Given safety-oriented design and the maturity of current technology, the function points supported by full-time dialogue will for a long time be only a subset of all voice function points. This raises users' learning cost, because they don't know which functions are supported and which are not, producing a fragmented experience. The author thinks the Xpeng G9 handles this problem well: its product people and engineers solved it gracefully with the post-positioned wake word (saying "Xiao P" after the instruction), and my personal guess is that "Xiao P" is realized through ASR rather than a dedicated two-word wake-up system. At present, two other cars besides the G9 support full-time dialogue. The first is Geely's Xingyue L, where full-time dialogue can be used after enabling geek mode in the settings; but the experience is very poor, basically unusable, because once enabled, saying almost anything triggers the voice function. The second is the Chery Tiggo 8 Pro, where full-time dialogue is on by default and is marketed as a full-time wake-free function. That solution is supplied by Horizon, is the industry's first full-time dialogue system based on a fully offline scheme, and is currently the best on the market. I hope to experience the G9's full-time dialogue as soon as possible, and I hope the G9 can catch up and push full-time dialogue further forward.
Multi-person dialogue
Multi-person dialogue on the G9 mainly covers two functions: first, people in different seats can use voice simultaneously, independently and without interfering with each other; second, dialogues between people in different seats can inherit from each other. Technically, multi-person dialogue is simpler than extreme-speed dialogue and full-time dialogue.
1. Multi-person parallel use
To realize parallel use by multiple people, two things must be done well. The first is powerful signal processing, especially voice separation and isolation; front-end signal schemes based on a distributed four-mic array are relatively mature and good solutions exist, though some difficult scenarios remain. The second is sufficient compute to support four concurrent voice interaction pipelines, at the core four concurrent ASRs and four concurrent NLUs (see the sketch below).
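Here is a minimal sketch of four independent per-seat pipelines running concurrently; `recognize_and_execute` is a hypothetical stub for a zone-local ASR → NLU → execution chain, not a real API.

```python
# Sketch: one isolated pipeline per seat, run in parallel, so occupants can
# speak simultaneously without sharing any per-utterance state.
from concurrent.futures import ThreadPoolExecutor

def recognize_and_execute(zone: int, audio_chunk: bytes) -> str:
    # Stand-in for zone-local ASR -> NLU -> execution; each zone keeps its
    # own state so the four streams never interfere.
    return f"zone {zone}: processed {len(audio_chunk)} bytes"

zone_audio = {z: b"\x00" * 3200 for z in range(4)}  # one audio chunk per seat

with ThreadPoolExecutor(max_workers=4) as pool:  # 4 ASR + 4 NLU worth of compute
    for result in pool.map(lambda kv: recognize_and_execute(*kv), zone_audio.items()):
        print(result)
```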
2. Multi-person multi-round dialogue
The core of this function is inheriting multi-round state across sound zones, which belongs to dialogue management; mature solutions exist in the industry. A minimal sketch follows.
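Here is a minimal sketch of cross-zone multi-round inheritance, assuming a shared store of the last executed intent that a follow-up like "me too" rebinds to the speaker's own seat; all names and the rule-matching are illustrative.

```python
# Sketch: the last executed intent is shared across sound zones, so "me too"
# from another zone re-executes it with the scope rebound to the new speaker.
last_intent = None  # shared dialogue state across all four zones

def handle(zone: int, utterance: str) -> str:
    global last_intent
    if utterance == "me too" and last_intent is not None:
        intent, _origin_zone = last_intent
        return f"execute {intent} for zone {zone}"  # rebind scope to speaker
    if utterance == "turn on seat heating":
        last_intent = ("seat.heat_on", zone)
        return f"execute seat.heat_on for zone {zone}"
    return "unsupported"

print(handle(0, "turn on seat heating"))  # driver: execute seat.heat_on for zone 0
print(handle(1, "me too"))                # co-driver inherits: ... for zone 1
```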
Summary
Based on the experience videos, the author summarizes two kinds of interaction logic on the G9 (personal guesses only):
Figure 5 Schematic of the internal algorithm modules when "Hello Xiao P" initiates voice interaction
Figure 6 Logic diagram of the internal algorithm modules for full-time dialogue voice interaction
The launch of the Xpeng P7 pushed car voice assistants to a new height and became a benchmark for many car manufacturers. I hope the G9 can push car voice to another new height, bring users more convenience, and create more opportunities and room for growth for voice practitioners. Finally, I hope to experience all of the G9's functions as soon as possible.
Appendix: Delay Analysis
From the experience videos, the author picked the "open the window" example and determined the time of each key event by stepping through the video, comparing the on-screen text and the instruction execution state in the audio and video.
Figure 2-1 Key event times with extreme-speed dialogue off
Figure 2-2 Key event times with extreme-speed dialogue on
Based on the recognition results, the voice interaction delay can be roughly divided into two parts, TD1 and TD2; detailed definitions and explanations are in the table below. In addition, because the real-time display of recognition results also affects how users feel, the delay from the end of speech until the first word appears on screen is recorded as TD3.
| Name | Description | Modules involved | Extreme-speed dialogue off | Extreme-speed dialogue on (improvement) |
|---|---|---|---|---|
| TD1 | Delay from end of speech to the full recognition result on screen | 1. recording delay; 2. front-end signal processing delay; 3. VAD algorithm delay; 4. network transmission delay (cloud solution); 5. ASR algorithm delay | 0.608 s (9.732 s ~ 10.340 s) | 0.467 s (21.000 s ~ 21.467 s), 23.2% faster |
| TD2 | Delay from the full instruction text on screen to the start of vehicle execution | 1. VAD strategy delay; 2. NLU algorithm delay; 3. system delays such as instruction interpretation and hardware startup | 0.947 s (10.340 s ~ 11.287 s) | 0.407 s (21.467 s ~ 21.874 s), 57.0% faster |
| TD3 | Delay from end of speech to the first recognized word on screen | 1. recording delay; 2. front-end signal processing delay; 3. VAD algorithm delay (data accumulation); 4. network transmission delay (cloud solution); 5. ASR algorithm delay | 0.335 s (9.732 s ~ 10.067 s) | 0.367 s (21.000 s ~ 21.367 s), 9.5% slower |
Note: a single utterance has only limited reference value; more data would be needed to prove validity. From these statistics, the inferred reasons for the speed-up in extreme-speed dialogue are:
| Module | Optimized in extreme-speed dialogue? |
|---|---|
| Recording delay | Recording happens at the bottom layer; there should be no change when extreme-speed dialogue is toggled. |
| Signal processing delay | Signal processing itself runs on-device; estimated unchanged. |
| VAD algorithm delay | The VAD algorithm itself runs on-device; the data accumulation for model scoring and the dependence on future information are estimated unchanged. |
| ASR algorithm delay | Changed: the TD1 improvement is most likely tied to the offline ASR scheme, through model-level optimization on one hand and a smaller search space with faster decoding on the other. ASR delay covers data accumulation for model scoring, dependence on future information, decoding delay, and CTC peak shift. |
| Network transmission delay | From the TD3 result, uploading audio and returning recognition results in the cloud solution contribute little delay. |
| VAD post-processing strategy delay | Large influence: VAD post-processing normally extends a certain time beyond the raw algorithm output, and extreme-speed dialogue truncates the instruction earlier. |
| NLU algorithm delay | For an instruction like "open the window", whether in the cloud or on-device, it is most likely realized by a rule engine, so the speed difference between the two should be small; combined with streaming semantic understanding there is an improvement. |
| System delays (instruction interpretation, hardware startup) | Unchanged; hardware and system are identical before and after. |
In the traditional voice interaction flow, to ensure speech recognition is not truncated too early (for example when the user merely pauses, or the VAD algorithm is not robust), a post-processing strategy is added after the VAD output that generally extends it backward by a certain time, which introduces significant delay in many scenarios. As shown in the figure below, although the complete recognition result is available at t3, the VAD segment is not sent to NLU for text parsing until t4. With streaming semantic understanding, ASR text is sent to NLU in real time, and NLU's parse is available by t7; whether the result is confirmed at t4 or already at t7, the delay drops substantially. Interestingly, with extreme-speed dialogue off, it takes 0.947 s from t3 to t6. Assuming the system's VAD post-processing extends backward by 0.6 s and hardware execution consumes 0.1 s, NLU actually consumes 0.247 s, which is incredible for an instruction as simple as "open the window". One can only say that the big improvement owes a lot to how slow the previous generation was.
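Reproducing the back-of-envelope decomposition above, under its stated assumptions (0.6 s VAD hangover, 0.1 s hardware startup):

```python
# Sketch: imply the NLU time from the measured TD2 by subtracting the two
# assumed components; the 0.6 s and 0.1 s figures are the text's assumptions.
td2_measured = 0.947     # s, text on screen -> execution, extreme-speed off
vad_hangover = 0.600     # s, assumed backward extension after speech end
hardware_start = 0.100   # s, assumed actuation startup
nlu_implied = td2_measured - vad_hangover - hardware_start
print(f"implied NLU time: {nlu_implied:.3f} s")  # 0.247 s, large for "open the window"
```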