Automatic Speech Recognition in Telephonic Speech

What is Automatic Speech Recognition (ASR)?

Telephonic conversations are one of the most effective ways of communicating with customers, as attention during calls is usually higher than on other channels.

Any customer-centric company generates a lot of audio data; however, audio is difficult to analyze and derive insights from. This is where Automatic Speech Recognition (ASR) comes in. ASR is a technology that enables computers to understand speech and convert it into text.

Multilingual ASR

convozen.AI serves customers at multiple locations across the world, and we interact with them in their native languages. This poses another big challenge: building an understanding of different languages and accents.

How do ASR systems work?

ASR systems usually perform speech-to-text processing in two major steps:

  1. Feature extraction from the audio
  2. Mapping the learned features to possible text sequences

1. Feature extraction 

Raw audio data consists of a signal sampled at a predefined frequency, usually 16 kHz, i.e. 16,000 samples per second. Human speech averages roughly 12 characters per second, so compressing this information is vital to building a quality ASR system. There are several ways to extract features from raw audio.
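Before going through those, it helps to make the sampling arithmetic concrete. The 25 ms window and 10 ms hop below are assumed typical values, not settings from any particular system:

```python
sr = 16000                  # sampling rate from the text: 16 kHz
frame_ms, hop_ms = 25, 10   # assumed typical analysis window and hop

samples_per_frame = sr * frame_ms // 1000   # raw samples per analysis frame
frames_per_second = 1000 // hop_ms          # feature vectors emitted per second
chars_per_second = 12                       # average speech rate from the text

print(samples_per_frame, frames_per_second,
      round(frames_per_second / chars_per_second, 1))  # → 400 100 8.3
```

So a feature extractor emits roughly 8 feature frames for every character actually spoken, which is the redundancy the later sequence-mapping step has to collapse.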

  • Mel frequency cepstral coefficients (MFCC): Raw audio can be represented as the power present in different frequency bands within a short time frame. These bands are usually defined on the Mel scale, a frequency scale spaced so that equal steps correspond to equally perceptible pitch changes for the human auditory system. MFCCs are a set of coefficients that capture the shape of the signal's power spectrum expressed on this scale.
  • Transformer encoders: Transformers have shown unique capabilities in modelling text data, and current state-of-the-art ASR systems are adopting them for audio representation as well. Raw audio is processed through several convolution blocks (pointwise and 1-D depthwise convolutions) to obtain latent representations; some of these latents are masked and the sequence is sent through transformer layers to obtain contextualized representations, which are trained to predict the masked latents using a contrastive objective. Once training is complete, the contextualized representations serve as the audio embeddings.
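As a concrete illustration of the MFCC bullet, here is a minimal from-scratch sketch in NumPy. The frame length, hop, filter count and coefficient count are common defaults assumed for illustration, not the settings of any particular production pipeline:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_mels=26, n_mfcc=13):
    # Slice the waveform into overlapping 25 ms frames with a 10 ms hop.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Per-frame power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the Mel scale from 0 Hz to sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies; keep the lowest n_mfcc.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2.0 * n_mels)))
    return log_mel @ dct.T
```

For one second of 16 kHz audio this yields 98 frames of 13 coefficients each; production systems typically use a tuned library implementation (e.g. librosa or torchaudio) rather than hand-rolled filterbanks.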

2. Text Sequence mapping

After training on a large corpus of audio data, we obtain speech representations that capture voice patterns. These can be further fine-tuned for final objectives such as language identification, emotion recognition, and ASR itself.
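For intuition, the masked-prediction objective from the transformer-encoder bullet above can be written as a toy InfoNCE-style loss in NumPy. The cosine scoring and the choice of distractors (the latents at the other masked positions) are simplifications; real systems such as wav2vec 2.0 additionally quantize the targets and batch everything on GPU:

```python
import numpy as np

def info_nce_loss(context, targets, mask_idx, temperature=0.1):
    """For each masked position, score its contextualized vector against the
    true latent (positive) and the latents at the other masked positions
    (distractors), then take the negative log-probability of the positive."""
    losses = []
    for i in mask_idx:
        sims = np.array([
            context[i] @ targets[j]
            / (np.linalg.norm(context[i]) * np.linalg.norm(targets[j]))
            for j in mask_idx
        ]) / temperature
        log_probs = sims - np.log(np.exp(sims).sum())   # log-softmax
        losses.append(-log_probs[mask_idx.index(i)])    # -log p(positive)
    return float(np.mean(losses))
```

Minimizing this loss pushes each contextualized representation toward its own masked latent and away from the distractors, which is what makes the learned embeddings useful for downstream fine-tuning.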

Feature embeddings are usually generated over short time frames, such as 20 ms to 50 ms. A single phoneme can repeat across several of these frames, depending on the speed of speech.

Example: “god” could appear in the frame outputs as g-g-o-o-o-o-d-d-d; collapsing the repeated symbols recovers “god”. To keep a genuine double letter, as in “good”, a special blank token is inserted between the repeats (g-o-o-blank-o-d collapses to “good”). The algorithm that learns to collapse these frames into meaningful sequences is known as Connectionist Temporal Classification (CTC).

Connectionist Temporal Classification (CTC)

CTC is an algorithm that assigns a probability score to an output sequence Y given an input X. Its main advantage is that the lengths of X and Y do not have to match.

It maximizes the total probability of the frame-level paths that actually collapse to the target word, against all possible paths. Example: suppose our vocabulary has only 2 characters, g and o, and there are 4 time frames. The total number of possible paths is 2^4 = 16, but only 3 of them collapse to “go”: g-o-o-o, g-g-o-o and g-g-g-o. CTC will try to maximize the combined probability of these 3 paths against all 16.
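Both the collapsing rule and the counting argument above can be checked directly in a few lines of Python (the choice of "-" as the blank symbol is purely illustrative):

```python
from itertools import product

def ctc_collapse(path, blank="-"):
    # CTC decoding rule: merge consecutive repeats, then drop blanks.
    merged = []
    for sym in path:
        if not merged or sym != merged[-1]:
            merged.append(sym)
    return "".join(s for s in merged if s != blank)

# The blank token preserves a genuine double letter:
# "ggoo-odd" collapses to "good", while "ggooodd" collapses to "god".
assert ctc_collapse("ggoo-odd") == "good"
assert ctc_collapse("ggooodd") == "god"

# Counting the example from the text: over vocab {g, o} and 4 frames,
# exactly 3 of the 16 possible paths collapse to "go".
paths = ["".join(p) for p in product("go", repeat=4)]
print(sorted(p for p in paths if ctc_collapse(p) == "go"))
# → ['gggo', 'ggoo', 'gooo']
```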

Advanced ASR systems usually add a few more modules at the end of decoding to improve accuracy, such as beam search, language models and word boosting. At convozen we employ word boosting to cater to domain-specific words occurring across our diverse client base spanning the healthcare, education and finance sectors.
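Word boosting can be sketched as a simple rescoring step. The names boost_score and pick_best, the bonus value and the example hypotheses below are all hypothetical, and real decoders usually apply the boost inside beam search rather than after it:

```python
def boost_score(text, acoustic_logprob, boost_words, bonus=2.0):
    """Hypothetical word-boosting rescorer: add a fixed log-prob bonus for
    every domain-specific word that appears in the hypothesis."""
    hits = sum(word in boost_words for word in text.lower().split())
    return acoustic_logprob + bonus * hits

def pick_best(hypotheses, boost_words):
    # hypotheses: (text, log-prob) pairs, e.g. from a beam-search decoder.
    return max(hypotheses, key=lambda h: boost_score(h[0], h[1], boost_words))[0]

hyps = [("the patient has diet ease", -3.5),
        ("the patient has diabetes", -4.0)]
print(pick_best(hyps, boost_words={"diabetes", "insulin"}))
# → the patient has diabetes
```

Even though the boosted hypothesis scored lower acoustically, the domain bonus lets it win, which is the effect word boosting is meant to achieve for rare, client-specific vocabulary.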

Ethical considerations 

Since ASR systems work with customers’ audio data and domain-specific information, it becomes of utmost importance to ensure the data is not misused in nefarious ways, such as voice cloning, leakage into the public domain, or exposure of pitches and personalized discounts.

convozen prioritizes data privacy above all else. Client data is solely utilized to enhance process quality. We guarantee the segregation of data sources in the cloud and implement role-based authentication to restrict access appropriately.


Conclusion

Automatic speech recognition is a complex, constantly evolving field, driven by the growing availability of open-source audio datasets and better algorithms.

With language-understanding models showing rapid growth in extracting meaningful insights, ASR systems coupled with LLMs open up a new future of automation, enabling companies to understand their customers, serve them effectively, and grow rapidly.

