Comes amid ongoing efforts by Mozilla to develop an open source speech-to-text engine
Mozilla’s “Common Voice” project, which asks users to donate their voices in order to create a bank of speech data for training machine learning algorithms, has gone multilingual.
It is now accepting donations in German, French and Welsh, with acceptance of speech samples in languages ranging from Cornish to Tamil, Uzbek to Sakha all pending.
The open source giant wants the project to be “a tool for any community to make speech technology available in their own language”.
Multiplicity of Data + Machine Learning = Better Speech Technology
Mozilla is building out a bank of data by asking users from around the globe to donate their voices via its voice contribution platform. The firm knows that the more speech data it has, the more sophisticated the speech-powered applications that can be built.
The team, which is rumoured to be working on a speech-powered browser, said on its flagship site: “We believe that large and publicly available voice datasets foster innovation and healthy commercial competition in machine-learning based speech technology”.
The Innovation Penalty
The Common Voice project comes as it gets ever simpler to create production-quality speech-to-text (STT) and text-to-speech (TTS) engines.
As Mozilla’s Kelly Davis put it in an earlier blog post, powerful tools like artificial intelligence and machine learning, combined with today’s more advanced speech algorithms, have changed the traditional approach to development.
“Programmers no longer need to build phoneme dictionaries or hand-design processing pipelines or custom components. Instead, speech engines can use deep learning techniques to handle varied speech patterns, accents and background noise – and deliver better-than-ever accuracy.”
Yet as Davis emphasised, there are barriers to innovation in the sector: developers who want to implement STT on the web are working with a fractured set of APIs and support. Creating a speech interface for a web application that works across all browsers requires developers to write code against discrete browser APIs (they are starkly different for Chrome, Safari etc.).
Alternatively they can purchase access to a non-browser-based API from Google, IBM or Nuance. Davis notes: “Fees for this can cost roughly one cent per invocation. If you go this route, then you get one stable API to write to. But at one cent per utterance, those fees can add up quickly, especially if your app is wildly popular and millions of people want to use it. This option has a success penalty built into it, so it’s not a solid foundation for any business that wants to grow and scale.”
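To put rough numbers on that success penalty, here is a minimal back-of-the-envelope sketch. Only the one-cent-per-utterance price comes from Davis; the user count, the utterances per user per day and the 30-day month are hypothetical figures chosen purely for illustration.

```python
def monthly_stt_cost_cents(users, utterances_per_user_per_day,
                           price_cents_per_utterance=1, days=30):
    """Estimate a month's bill, in cents, for a pay-per-utterance STT API.

    The default price of 1 cent per utterance is the figure Davis quotes;
    every other input is an assumption supplied by the caller.
    """
    total_utterances = users * utterances_per_user_per_day * days
    return total_utterances * price_cents_per_utterance


# Hypothetical app: one million users, five voice queries each per day.
cost_cents = monthly_stt_cost_cents(users=1_000_000,
                                    utterances_per_user_per_day=5)
print(f"Estimated monthly bill: {cost_cents // 100:,} (in units of 100 cents)")
```

Under those assumed figures the bill comes to 150 million cents a month, and it rises in direct proportion to usage, which is the point Davis is making.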
Why Do They Need Donated Voice Samples Again?
That is the context for Mozilla’s efforts to develop an open source STT engine, which will make STT available in the Firefox browser and hand the toolkit over to the speech developer community, with no access or usage fees. But why the voice samples?
Language is incredibly complex: people ask about something as simple as the weather in over 10,000 ways (“our favourite: ‘Will it be cats and dogs today?’”), as Google noted in a blog post published earlier this year, when it stunned observers with the capabilities of its AI voice assistant, Duplex. Duplex has been programmed to match expectations around latency, and sounds eerily human.
Those wanting to democratise this process and help others gain access to similar capabilities could do worse than visit https://voice.mozilla.org/en/speak and read out one of the samples, which include: “Vermicelli A trio, or musical piece for three voices or instruments.”
Hopefully that trips off the tongue.