Google’s digital voice assistant to sound more like humans
Google is making its voice assistant sound more like a human than a robot. The company is working on a new text-to-speech system called Tacotron 2, a neural network architecture that synthesises speech directly from text. “The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesise time-domain waveforms from those spectrograms,” Google explains in a study titled ‘Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions’.
“Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further ...
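
The “mel-scale spectrograms” mentioned in the study describe audio on a perceptual frequency scale, where bands are narrow at low frequencies and wider at high ones. The snippet below is a minimal sketch of that idea, not Google's implementation: it assumes the widely used HTK-style conversion formula, and the function names and the example frequency range are hypothetical choices made only for illustration.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(low_hz: float, high_hz: float, n_bands: int) -> list[float]:
    """Return n_bands + 2 edge frequencies (Hz) spaced evenly on the mel scale.

    Evenly spaced mel points map to unevenly spaced Hz points, which is why
    mel bands get wider as frequency increases.
    """
    low_mel = hz_to_mel(low_hz)
    high_mel = hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_bands + 1)
    return [mel_to_hz(low_mel + i * step) for i in range(n_bands + 2)]


if __name__ == "__main__":
    # Illustrative values only; the article does not give the range used.
    for edge in mel_band_edges(0.0, 8000.0, 8):
        print(f"{edge:8.1f} Hz")
```

Running the script prints band edges that crowd together at the low end and spread out towards 8,000 Hz, which is the spacing a mel spectrogram uses to summarise the audio before a vocoder such as WaveNet turns it back into a waveform.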