Since deep learning was introduced into speech recognition, word error rates have dropped rapidly. However, despite what you may have read, speech recognition has not yet reached human parity. It still has many failure modes. The only way to move ASR (Automatic Speech Recognition) from working for some people most of the time to working for anyone all of the time is to acknowledge these failures and take steps to fix them.
Big Data
Progress in word error rate on Switchboard, the standard benchmark for conversational speech recognition. The data set was collected in 2000 and consists of forty telephone calls between pairs of randomly selected native English speakers.
Claiming human-level speech recognition based only on Switchboard results is like claiming human-level driving after testing only in a sunny town with no traffic. The recent progress in speech recognition is genuinely impressive, but claims of reaching human parity are far too broad. Here are some areas that still need improvement.
Accents and noise
One of the most obvious shortcomings of speech recognition is its handling of accents [1] and background noise. The most direct reason is that most training data consists of American-accented English with a high signal-to-noise ratio. The Switchboard training and test sets, for example, contain only native English speakers (mostly Americans) recorded with very little background noise.
More training data alone will not solve this problem, however. Many languages have a large number of dialects and accents, and we cannot collect enough data for every case. Building a high-quality speech recognizer just for American-accented English already requires more than 5,000 hours of transcribed audio.
Comparison of human transcribers and Baidu's Deep Speech 2 model on various types of speech [2]. Note that the human transcribers perform worse on non-American accents, which may be due to an American bias in the pool of transcribers. I would expect region-specific transcribers to achieve much lower error rates on their own regional accents.
As for background noise, a signal-to-noise ratio (SNR) as low as -5 dB is not uncommon in a moving car. People have little trouble communicating in such environments, but speech recognizers degrade rapidly in noise. Going from high to low SNR, the gap between human and model error rates widens dramatically.
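To make the SNR figures concrete, here is a minimal sketch (mine, not from the article; it assumes NumPy and uses arbitrary stand-in signals) that mixes noise into clean speech at a target SNR using the definition SNR_dB = 10 * log10(P_speech / P_noise). At -5 dB the noise power is roughly three times the speech power.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise signal into clean speech at a target SNR (in dB).

    SNR_dB = 10 * log10(P_speech / P_noise), so the noise is scaled
    so that P_noise = P_speech / 10**(snr_db / 10).
    """
    # Match lengths by tiling/truncating the noise (illustrative choice).
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: simulate in-car conditions at -5 dB SNR with synthetic signals.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16000))  # stand-in for speech
noise = rng.normal(size=16000)                               # stand-in for car noise
noisy = mix_at_snr(speech, noise, snr_db=-5.0)
```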
Semantic errors
Word error rate is usually not the actual objective of a speech recognition system. What we care about more is the semantic error rate: the fraction of utterances whose meaning is misinterpreted.
An example of a semantic error is someone saying "let's meet up Tuesday" and the recognizer predicting "let's meet up today." A word error can also leave the semantics intact: if the recognizer drops "up" and predicts "let's meet Tuesday", the meaning of the utterance is unchanged.
You have to be careful when using word error rate as a proxy. To give a worst-case example, a 5% word error rate corresponds to roughly one mistake every 20 words. If each sentence has 20 words (about the English average), the sentence error rate could be as high as 100%. The hope is that the misrecognized words do not change the meaning of the sentence; otherwise even a 5% word error rate could leave every sentence misinterpreted.
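As a rough illustration of the arithmetic above (a minimal sketch with made-up hypotheses, not the article's evaluation code), word error rate is the edit distance between reference and hypothesis word sequences divided by the reference length, while sentence error rate counts any sentence containing at least one error:

```python
def word_error_rate(ref, hyp):
    """Levenshtein (edit) distance between word sequences, divided by reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(r)][len(h)] / len(r)

refs = ["let's meet up tuesday", "see you at noon"]
hyps = ["let's meet up today",   "see you at noon"]

total_words = sum(len(r.split()) for r in refs)
total_errors = sum(word_error_rate(r, h) * len(r.split()) for r, h in zip(refs, hyps))
wer = total_errors / total_words
ser = sum(r != h for r, h in zip(refs, hyps)) / len(refs)  # sentence error rate
print(f"WER: {wer:.2%}, sentence error rate: {ser:.2%}")   # WER: 12.50%, SER: 50.00%
```

A single wrong word in one of the two sentences gives a modest 12.5% word error rate but already a 50% sentence error rate, which is why the per-word number can be misleading.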
When comparing models with humans, the key is to look at the nature of the errors, not just the error rate as a single decisive number. In my experience, human transcribers make fewer truly egregious semantic errors than speech recognizers do.
Recently, Microsoft researchers compared the errors made by their human-level speech recognizer with those made by humans [3]. One difference they found was that the model confuses "uh" with "uh huh" much more frequently than humans do. The semantics of the two terms are quite different: "uh" is just a filler, while "uh huh" is a back-channel acknowledgment. That said, the model and the humans made many errors of the same types.
Single-channel, multi-speaker conversations
The Switchboard task is also easier because each caller is recorded with a separate microphone, so there is no overlap of multiple speakers in the same audio stream. Humans, on the other hand, can understand a conversation in which several speakers sometimes talk at the same time.
A good conversational speech recognizer must be able to segment the audio by who is speaking, and must also be able to make sense of overlapping speech (source separation). It should do this not only when a microphone sits right at each speaker's mouth, but for conversations happening anywhere.
Domain variation
Accents and background noise are just two of the dimensions along which speech recognizers need to become more robust. Here are a few others:
Reverberation from varying acoustic environments
Hardware artifacts
Audio codec and compression artifacts
Sampling rate
Speaker age
Most listeners would not even notice the difference between an mp3 and a wav file. But before claiming human-level performance, speech recognizers need to be robust to these sources of variability as well.
Context
You will notice that the human-level error rate on a benchmark like Switchboard is actually quite high. If a friend misunderstood one out of every 20 words you said, communication would be very difficult.
One reason is that these evaluations are done without context. In real life we use many other cues to help us understand what someone is saying. Some examples of context that humans use and speech recognizers do not:
The history of the conversation and the topic being discussed
Speaker's visual cues, including facial expressions and lip movements
Prior knowledge about the person we are talking with
Currently, Android's speech recognizer incorporates your contact list so that it can recognize your friends' names [4]. Voice search in maps products uses your geo-location to narrow down the points of interest you might want to navigate to [5].
Adding these kinds of signals clearly helps ASR systems. However, we have only begun to scratch the surface of the types of context available and how to use them.
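As a rough illustration of one simple way such a signal could be used (a hypothetical sketch, not how Android or any production system actually works), an n-best list from the recognizer can be rescored so that hypotheses containing words from the user's contact list receive a small score boost:

```python
def rescore_with_contacts(nbest, contacts, bonus=2.0):
    """Rescore (hypothesis, log_score) pairs, boosting hypotheses that mention a contact.

    `nbest`: list of (text, log_score) pairs from the recognizer.
    `contacts`: set of lowercase contact names.
    `bonus`: log-score boost per matched contact word (a made-up tuning knob).
    """
    rescored = []
    for text, score in nbest:
        matches = sum(1 for w in text.lower().split() if w in contacts)
        rescored.append((text, score + bonus * matches))
    return max(rescored, key=lambda pair: pair[1])

# The acoustics slightly prefer the wrong name, but context flips the decision.
nbest = [("call ann marie", -12.1), ("call anne-marie", -12.4)]
contacts = {"anne-marie", "tobias"}
print(rescore_with_contacts(nbest, contacts))  # ('call anne-marie', -10.4)
```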
Deployment
The latest advances in conversational speech recognition are not yet deployable. When thinking about how to deploy a new speech algorithm, you need to consider both latency and compute. The two are related: algorithms that require more computation usually also increase latency. But for simplicity, I will discuss them separately.
Latency: By latency I mean the time from when the user finishes speaking to when the transcription is complete. Low latency is a common product constraint in ASR, and it significantly affects the user experience. A latency requirement of 10 milliseconds is not uncommon for an ASR system. This may sound extreme, but remember that the transcript is usually just the first step in a series of expensive computations. In voice search, for example, the actual web search can only happen after the speech has been recognized.
Bidirectional recurrent layers are one example of a design that is hard to reconcile with low latency, yet they are used by all the most advanced conversational speech recognizers today. The problem is that nothing in the first layer can be computed until the user has finished speaking, so the latency grows with the length of the utterance.
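To make the latency problem concrete, here is a minimal sketch (assuming PyTorch, with arbitrary dimensions; it is not taken from any particular system) contrasting a unidirectional recurrent layer, which can emit output frame by frame as audio arrives, with a bidirectional one, which cannot produce anything until the whole utterance is available:

```python
import torch
import torch.nn as nn

feat_dim, hidden = 80, 256

uni = nn.LSTM(feat_dim, hidden, batch_first=True)                      # streaming-friendly
bi  = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)  # needs the full utterance

# Unidirectional: process frames one at a time as they arrive (e.g. every 10 ms)
# and emit output incrementally.
state = None
for t in range(100):
    frame = torch.randn(1, 1, feat_dim)
    out, state = uni(frame, state)

# Bidirectional: the backward direction starts from the *last* frame, so the layer
# can only run once the whole utterance has been collected; latency grows with length.
utterance = torch.randn(1, 100, feat_dim)
out_bi, _ = bi(utterance)
```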
How to effectively incorporate future context in speech recognition remains an open question.
Compute: The amount of computation required to transcribe speech is an economic constraint. We have to consider the cost-effectiveness of every accuracy improvement to the recognizer. If an improvement does not clear the economic threshold, it cannot be deployed.
A classic example of a consistent improvement that never gets deployed is ensembling. A 1% or 2% reduction in error rate is rarely worth a 2x to 8x increase in computation. Modern RNN language models also fall into this category, because they are very expensive to apply in a beam search, though I expect this to change in the future.
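As a back-of-the-envelope illustration of this trade-off (with entirely made-up numbers, not figures from the article), one can compare the marginal compute cost per point of error reduction:

```python
def cost_per_point(base_cost, base_wer, new_cost, new_wer):
    """Extra compute cost per absolute point of WER reduction (an illustrative metric)."""
    return (new_cost - base_cost) / (base_wer - new_wer)

# Hypothetical numbers: a single model vs. an 8-model ensemble.
single = {"cost": 1.0, "wer": 8.0}     # relative compute, WER in %
ensemble = {"cost": 8.0, "wer": 6.5}   # 8x the compute for a 1.5-point gain

print(cost_per_point(single["cost"], single["wer"],
                     ensemble["cost"], ensemble["wer"]))  # ~4.7x extra compute per point
```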
To be clear, I do not think research that improves accuracy at great computational cost is useless. We have seen the "first make it slow but accurate, then speed it up" pattern succeed before. My point is simply that an improvement is not usable until it is fast enough.
The next five years
There are still many open and challenging issues in the field of speech recognition:
Expanding capabilities to new domains, accents, far-field, and low signal-to-noise-ratio speech
Incorporating more context into the recognition process
Diarization and source separation
Semantic error rates and innovative methods for evaluating recognizers
Super-low-latency and efficient inference
