Thursday, 24 September 2015
Google voice search: faster and more accurate
Back in 2012, we announced that Google voice search had taken a new turn by adopting Deep Neural Networks (DNNs) as the core technology used to model the sounds of a language. These replaced the 30-year-old standard in the industry: the Gaussian Mixture Model (GMM). DNNs were better able to assess which sound a user is producing at every instant in time, and with this they delivered greatly increased speech recognition accuracy.
Today, we’re happy to announce we built even better neural network acoustic models using Connectionist Temporal Classification (CTC) and sequence discriminative training techniques. These models are a special extension of recurrent neural networks (RNNs) that are more accurate, especially in noisy environments, and they are blazingly fast!
In a traditional speech recognizer, the waveform spoken by a user is split into small consecutive slices or “frames” of 10 milliseconds of audio. Each frame is analyzed for its frequency content, and the resulting feature vector is passed through an acoustic model such as a DNN that outputs a probability distribution over all the phonemes (sounds) in the model. A Hidden Markov Model (HMM) helps to impose some temporal structure on this sequence of probability distributions. This is then combined with other knowledge sources such as a Pronunciation Model that links sequences of sounds to valid words in the target language and a Language Model that expresses how likely given word sequences are in that language. The recognizer then reconciles all this information to determine the sentence the user is speaking. If the user speaks the word “museum” for example - /m j u z i @ m/ in phonetic notation - it may be hard to tell where the /j/ sound ends and where the /u/ starts, but in truth the recognizer doesn’t care where exactly that transition happens: All it cares about is that these sounds were spoken.
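To make the framing step concrete, here is a minimal Python sketch (not Google's production pipeline) that slices a waveform into consecutive 10 ms frames and computes a log-spectrum feature vector for each one; the sample rate, window, and feature choice are illustrative assumptions.

```python
import numpy as np

def frame_features(waveform, sample_rate=16000, frame_ms=10):
    """Split a 1-D waveform into consecutive 10 ms frames and
    return one log-magnitude-spectrum feature vector per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)          # samples per frame
    n_frames = len(waveform) // frame_len
    frames = waveform[:n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)                           # reduce spectral leakage
    spectra = np.abs(np.fft.rfft(frames * window, axis=1))   # frequency content per frame
    return np.log(spectra + 1e-8)                            # log compression, as is typical

# Example: one second of random audio -> 100 feature vectors
features = frame_features(np.random.randn(16000))
print(features.shape)  # (100, 81)
```

In a real recognizer, each of these feature vectors is then passed through the acoustic model to obtain a per-frame probability distribution over phonemes.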
Our improved acoustic models rely on Recurrent Neural Networks (RNNs). RNNs have feedback loops in their topology, allowing them to model temporal dependencies: when the user speaks /u/ in the previous example, their articulatory apparatus is coming from a /j/ sound and from an /m/ sound before. Try saying it out loud - “museum” - it flows very naturally in one breath, and RNNs can capture that. The type of RNN used here is a Long Short-Term Memory (LSTM) RNN which, through memory cells and a sophisticated gating mechanism, memorizes information better than other RNNs. Adopting such models already improved the quality of our recognizer significantly.
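To show the gating mechanism in the abstract, here is a toy numpy sketch of a single LSTM step; the dimensions and random weights are purely illustrative, and production acoustic models are far larger and, of course, trained rather than random.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step: gates decide what to forget, what to
    write into the memory cell, and what to expose as output."""
    z = W @ np.concatenate([x, h_prev]) + b        # all four gate pre-activations at once
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c_prev + i * g                         # memory cell carries long-term context
    h = o * np.tanh(c)                             # hidden state passed to the next step
    return h, c

# Toy example: 3-dim input frames, 4-dim hidden state
rng = np.random.default_rng(0)
W, b = rng.standard_normal((16, 7)), np.zeros(16)
h = c = np.zeros(4)
for frame in rng.standard_normal((5, 3)):          # five feature frames
    h, c = lstm_step(frame, h, c, W, b)
print(h)
```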
The next step was to train the models to recognize phonemes in an utterance without requiring them to make a prediction for each time instant. With Connectionist Temporal Classification, the models are trained to output a sequence of “spikes” that reveals the sequence of sounds in the waveform. They can do this in any way as long as the sequence is correct.
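The key property of CTC is that the model can emit a “blank” symbol at most time steps and a phoneme spike only occasionally; the label sequence is recovered by merging repeated labels and dropping blanks. A minimal sketch of that collapsing rule, with made-up frame-level output for “museum”, looks like this:

```python
BLANK = "-"

def ctc_collapse(frame_labels):
    """Collapse a CTC frame-level output into a phoneme sequence:
    merge consecutive repeats, then drop the blank symbol."""
    collapsed = []
    prev = None
    for label in frame_labels:
        if label != prev:
            collapsed.append(label)
        prev = label
    return [l for l in collapsed if l != BLANK]

# Frame-by-frame output: mostly blanks with occasional phoneme spikes
frames = ["-", "m", "-", "-", "j", "j", "-", "u", "-", "z", "-", "i", "-", "@", "-", "m", "-"]
print(ctc_collapse(frames))  # ['m', 'j', 'u', 'z', 'i', '@', 'm']
```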
The tricky part though was how to make this happen in real time. After many iterations, we managed to train streaming, unidirectional models that consume the incoming audio in larger chunks than conventional models, but do actual computations less often. With this, we drastically reduced computations and made the recognizer much faster. We also added artificial noise and reverberation to the training data, making the recognizer more robust to ambient noise. You can watch a model learning a sentence here.
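One common way to consume audio in larger chunks while running the network less often is to stack several consecutive feature frames and advance with a stride; the sketch below illustrates the idea, though the stack size and stride shown are assumptions, not the production values.

```python
import numpy as np

def stack_frames(features, stack=8, stride=3):
    """Concatenate `stack` consecutive frames into one wide input vector,
    emitting a stacked vector only every `stride` frames, so the acoustic
    model is evaluated roughly `stride` times less often."""
    stacked = []
    for start in range(0, len(features) - stack + 1, stride):
        stacked.append(features[start:start + stack].reshape(-1))
    return np.array(stacked)

features = np.random.randn(100, 40)        # 1 second of 10 ms frames, 40-dim each
chunks = stack_frames(features)
print(features.shape, "->", chunks.shape)  # (100, 40) -> (31, 320)
```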
We now had a faster and more accurate acoustic model and were excited to launch it on real voice traffic. However, we had to solve another problem - the model was delaying its phoneme predictions by about 300 milliseconds: it had just learned it could make better predictions by listening further ahead in the speech signal! This was smart, but it would mean extra latency for our users, which was not acceptable. We solved this problem by training the model to output phoneme predictions much closer to the ground-truth timing of the speech.
Our new acoustic models are now used for voice searches and commands in the Google app (on Android and iOS), and for dictation on Android devices. In addition to requiring much lower computational resources, the new models are more accurate, robust to noise, and faster to respond to voice search queries - so give it a try, and happy (voice) searching!
Tuesday, 22 September 2015
Top 5 ways to amplify the impact of TV dollars with digital
Today’s consumers hop from screen to screen according to their needs-of-the-moment. They don’t give a thought to what “channel” they are using to interact with your brand — they simply expect brands to keep up.
In last week’s post, we discussed the advent of TV Attribution and the new opportunity marketers have to drive more ROI in a multi-screen world. This week, we’ll discuss 5 key ways that TV Attribution can help you get more from mass media investments with digital insights.
If you want more details on any of our top tips, take a look at our recent white paper or register for our upcoming webinar.
1. Align creative across channels. If a friend were always chummy on the phone but cold in person, wouldn’t you be confused? Don’t let a choppy brand presentation put off interested consumers who experience TV ads, search online, and visit your sites and apps. Keep your online and offline presence consistent so your message comes through clearly.
2. Empower mobile search. Knowing that TV ads inspire mobile searches, make sure digital copy aligns with verbal and on-screen messages in TV ads to ensure consumers find you online. Use mobile context — include click-to-call, highlight nearby stores, show relevant hours — to move consumers from search to purchase.
3. Connect the data. Connecting TV airings data with digital signals like search query and site traffic offers a new level of granularity and immediacy of reporting. With better insights, you can fine-tune your next TV campaign and align digital strategies to capture incremental opportunity.
4. Find your best audiences. Take the guesswork out of demographic targeting with digital insights. Search and site data reveal who is really responding to TV messages by taking online actions — so you can confirm your best audiences by behavior.
5. Understand your consumer. Analyze digital signals to understand what parts of your message consumers are retaining — or not retaining. The keywords consumers search after being exposed to your TV ad offer insights that can drive faster campaign optimization, saving time and money over traditional surveys or studies.
More insight, more opportunity
TV Attribution not only offers a new, immediate, and granular view of mass media impact — it allows you to create more cross-channel synergy. Today’s consumers want immediate gratification and have high expectations for the brands they pursue. Join us for a webinar October 28th to discuss more tips and tricks for meeting new consumer expectations, and hear how top brands are leveraging minute-by-minute TV Attribution analysis to improve cross-channel marketing. If you’re ready to dive in, register here.
Posted by Natasha Moonka, Google Analytics team
A Beginner’s Guide to Deep Neural Networks
Posted by Natalie Hammel and Lorraine Yurshansky, creators of Nat & Lo’s 20% Project
Last year, we (a couple of people who knew nothing about how voice search works) set out to make a video about the research that’s gone into teaching computers to recognize speech and understand language.
Making the video was eye-opening and brain-opening. It introduced us to concepts we’d never heard of – like machine learning and artificial neural networks – and ever since, we’ve been kind of fascinated by them. Machine learning, in particular, is a very active area of Computer Science research, with far-ranging applications beyond voice search – like machine translation, image recognition and description, and Google Voice transcription.
So... still curious to know more (and having just started this project) we found Google researchers Greg Corrado and Christopher Olah and ambushed them with our machine learning questions.
This video is our attempt to distill what we learned from talking with them, but if anything in it piques your curiosity, or you have other questions, you’re in luck! On Friday, September 25, at 1 PM PDT / 4 PM EDT, Greg and Chris will be doing an Ask Me Anything on Reddit (see the calendar here) to answer your deep learning questions.
Everyone who’s curious is welcome to join, ask questions, and hopefully gain a better understanding of the world of machine learning and deep neural networks. (And we’ll be hanging out with them, too...in case you have any questions about video making or dogs.) We hope to see you this Friday!
Thursday, 17 September 2015
Information sharing for more efficient network utilization and management
Andreas Terzis, Software Engineer
As Internet traffic has grown and changed, Google and other content and application providers have worked cooperatively with Internet service providers (ISPs) so that services can be delivered quickly, efficiently and cost-effectively. For example, rather than content having to traverse a long distance and many different networks to reach an Internet access provider’s network, a content provider might store (cache) the data close by and interconnect (‘peer’) directly with the access provider. Google has invested billions of dollars in the network and infrastructure necessary to bring our services as close to your Internet access provider’s front door as possible, for free – which both reduces ISPs’ costs and improves the user experience.
Content and application providers can also tune their services for congested and/or lower bandwidth environments. For instance, YouTube detects how smoothly a video is playing and adjusts the quality to account for temporary fluctuations in bandwidth or congestion. In the Google Video Quality Report, we transparently reveal the speeds YouTube is experiencing on different networks.
As more Internet traffic becomes encrypted, some network operators have expressed concern about the effect encryption might have on their ability to manage their networks. We don’t think there has to be a trade-off here – there are ways to do effective network management of encrypted traffic today, and, through further cooperation between content and application providers and ISPs, we believe this could be made easier while still respecting encryption.
To spur discussion and collaboration on this front, we recently submitted a paper to a workshop organized by the Internet Architecture Board outlining some ideas. We advocate for a model where ISPs selectively share network state with content and application providers, enabling them to adapt to available network resources.
For example, we recently proposed to the Internet Engineering Task Force the concept of Throughput Guidance (TG), whereby mobile network operators could share information about the throughput of a radio downlink. Preliminary field tests in a production LTE network showed that TG reduces YouTube join latency, defined as the amount of time until the video starts playing, by 8% on average, rebuffering time by 20% on average, and rebuffer count by 2% on average. In addition to improving quality of experience for users, this mechanism improves the utilization of providers’ networks. Encryption of traffic would have no impact on the efficacy of this approach; it works equally well with encrypted and unencrypted traffic.
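As a purely hypothetical illustration of how a video client might act on such a hint, the sketch below picks the highest rung of a bitrate ladder that fits within a throughput value advertised by the network; the ladder, headroom factor, and function shape are assumptions and do not reflect the actual Throughput Guidance specification.

```python
# Hypothetical sketch: choose a video bitrate given a throughput hint (kbit/s)
# advertised by the network. Nothing here reflects the real TG wire format.
BITRATE_LADDER_KBPS = [235, 375, 560, 750, 1050, 1750, 2350, 3000]

def select_bitrate(throughput_hint_kbps, headroom=0.8):
    """Pick the highest rung of the bitrate ladder that fits within
    a safety margin of the advertised radio downlink throughput."""
    budget = throughput_hint_kbps * headroom
    candidates = [b for b in BITRATE_LADDER_KBPS if b <= budget]
    return candidates[-1] if candidates else BITRATE_LADDER_KBPS[0]

print(select_bitrate(2000))  # 1050 -> fits inside the 1600 kbit/s budget
print(select_bitrate(400))   # 235
```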
Throughput Guidance is one possible solution and many questions remain unanswered. It’s still relatively early days in our exploration of this and the other measures in our short paper, and we’re looking forward to getting feedback and collaborating with network operators and others.
Wednesday, 16 September 2015
Programmatic helps brands make the most of micro-moments
The following was originally posted to the DoubleClick Advertiser Blog.
Every day, your audience is filling their days with hundreds if not thousands of micro-moments—intent-rich moments when preferences are shaped and decisions are made. As consumers spread their attention across more and more screens and channels, those moments can happen almost anywhere, anytime. People search on their smartphones while in front of the TV. They watch YouTube videos on their tablets while texting their friends. They open a mobile app to shop for the perfect gift, then head to the store to buy it. With mobile devices never more than an arm’s length away, people can find and buy anything, anytime.
For marketers, this means the purchase funnel is wildly more complicated than it was just a few years ago.
“Brands can use programmatic to assemble a consumer’s micro-moments in just the right way—like joining puzzle pieces together—to see a detailed blueprint of consumer intent.”
It’s hard to plan for nonlinear purchase paths, but programmatic advertising can help, enabling brands to reach the right person with the right message in the moment of opportunity. Brands can use programmatic to assemble a consumer’s micro-moments in just the right way—like joining puzzle pieces together—to see a detailed blueprint of consumer intent. That’s a powerful proposition, and it’s why programmatic advertising spend is projected to grow by more than 77% this year.1
In this article, we share four tips for using programmatic to win these micro-moments and examples of brands that are doing it right.
Visit DoubleClick.com to read the full article.
Posted by Kelly Cox, Product Marketing Manager, DoubleClick
1. IDC, Worldwide Programmatic Display Forecast, 2015.
Monday, 14 September 2015
How can you get more ROI in a multi-screen world?
We live in a world of instant gratification. Wherever we are, and whatever we may be doing, when we want to know, to do, or to buy, we pull out our phones and search for satisfaction.
For marketers, a multi-screen world offers new opportunities for ROI. While TV accounts for 42% of all ad spending, or $78.8 billion annually, we also know that 90% of consumers engage with a second screen* — think tablets and mobile phones — while watching TV.
This means that in a multi-screen world, executing separate television and digital campaigns is a strategic miss. If that’s the case, why are so many of us still doing it?
The old TV measurement problem
In the past, channel-centric thinking, competing objectives, and data silos often stopped marketers from true cross-channel measurement. Even with the advent of marketing measurement best practices like marketing mix modeling, we lived with a significant blind spot around the true impact of TV advertising.
TV airings data was hard to come by, and traditional Marketing Mix Modeling reports are often too high-level — and too slow — to offer actionable insights. So, while we’ve known for a long time that TV drives consumers online, we had no way to accurately attribute digital activity to granular TV investments.
The new TV attribution solution
Now, TV attribution makes it possible to connect the dots between TV airings data and digital activity. The resulting insights from TV attribution enable marketers to improve campaign strategies across both mass media and digital channels.
At a high level, TV attribution carefully analyzes typical search query and site activity to establish a baseline. Then, minute-by-minute TV airings data is correlated with search and site data to detect — and accurately attribute — traffic driven by each TV ad spot.
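A greatly simplified sketch of that idea (not the actual product methodology) would estimate a per-minute baseline of search volume and attribute to each spot the searches above baseline in the minutes right after it airs; the window and toy numbers below are illustrative assumptions.

```python
import statistics

def attribute_lift(minute_counts, airing_minutes, window=5):
    """Rough sketch: baseline = median per-minute search volume;
    lift for a spot = searches above baseline in the minutes after it airs."""
    baseline = statistics.median(minute_counts)
    lift = {}
    for airing in airing_minutes:
        after = minute_counts[airing:airing + window]
        lift[airing] = sum(max(c - baseline, 0) for c in after)
    return baseline, lift

# Toy data: a steady ~100 searches/minute, with a bump after a spot at minute 10
counts = [100, 98, 102, 99, 101, 100, 97, 103, 100, 99,
          100, 160, 150, 130, 110, 100, 101, 99, 100, 102]
print(attribute_lift(counts, airing_minutes=[10]))
```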
We’ve seen great results for marketers that have embraced this new marketing measurement best practice. For example, Nest assessed and improved cross-channel campaigning with TV attribution, achieving a 2.5x lift in search volumes and 5x increase in search and website responses by acting on resulting insights.
For more details, read our new infographic to learn:
- How TV attribution reveals TV-to-digital behaviors
- How TV attribution insights help marketers quantify TV’s business value, optimize media buys, and empower creative teams
- How deeper understanding of consumers can lead to more effective cross-channel strategies
Time to improve your ROI?
Now that TV and digital data can be analyzed to reveal cross-channel behaviors, marketers have a new opportunity to improve both mass media and digital strategies. Next week, we’ll post our top 5 tips on amplifying TV dollars with digital. If you’re ready to get going on maximizing TV ROI, stay tuned.
Posted by: Natasha Moonka, Google Analytics team
*Source: Neal Mohan, Google, “Video Ads and Moments That Matter,” Consumer Electronics Show 2015.
Tuesday, 8 September 2015
Crowdsourcing a Text-to-Speech voice for low resource languages (episode 1)
Posted by Linne Ha, Senior Program Manager, Google Research for Low Resource Languages
Building a decent text-to-speech (TTS) voice for any language can be challenging, but creating one – a good, intelligible one – for a low resource language can be downright impossible. By definition, working with low resource languages can feel like a losing proposition – from the get-go, there is not enough audio data, and the data that exists may be questionable in quality. High quality audio data, and lots of it, is key to developing a high quality machine learning model. To make matters worse, most of the world’s oldest, richest spoken languages fall into this category. There are currently over 300 languages, each spoken by at least one million people, and most will be overlooked by technologists for various reasons. One important reason is that there is not enough data to conduct meaningful research and development.
Project Unison is an on-going Google research effort, in collaboration with the Speech team, to explore innovative approaches to building a TTS voice for low resource languages – quickly, inexpensively and efficiently. This blog post will be one of several to track progress of this experiment and to share our experience with the research community at large – our successes and failures in a trial and error, iterative approach – as our adventure plays out.
One of the most critical aspects of building a TTS system is acquiring audio data. The traditional way to do this is in a professional recording studio with a voice talent, sound engineer and a voice director. The process can take considerable time and can be quite expensive. People often assume that voice talent work is similar to that of a news reader, but it is highly specialized and the work can be very difficult.
Such investments in time and money may yield great audio, but the catch is that even if you’ve created the best TTS voice from these recordings, at best it will still sound exactly like the voice talent - the person who provided the raw audio data. (We’ve read the articles about people who have fallen for their GPS voice, only to discover that it belongs to a real person with a real name.) So the interesting problem here from a research perspective is how to create a voice that sounds human but is not identifiable as a singular person.
Crowd-sourcing projects for automatic speech recognition (ASR) for Google Voice Search had been successful in the past, with public volunteers eager to participate by providing voice samples. For ASR, the objective is to collect from a diversity of speakers and environments, capturing varying regional accents. The polar opposite is true of TTS, where the basic requirement is a single, consistent speaker with a standard accent, recorded in a soundproof studio.
Many years ago, Yannis Agiomyrgiannakis, Digital Signal Processing researcher on the TTS team in Google London, wrote a “manifesto” for acoustic data collection for 2000 languages. In his document, he gave technical specifications on how to convert an average room into a recording studio. Knot Pipatsrisawat, software engineer in Google Research for Low Resource Languages, built a tool that we call “ChitChat”, a portable recording studio, using Yannis’ specifications. This web app allows users to read the prompt, play back the recording and even assess the noise level of the room.
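A room-noise check along the lines of what ChitChat reports can be approximated by measuring the RMS level of a short “silent” recording in dBFS; the sketch below is illustrative only, and the acceptance threshold is an assumption rather than the tool’s actual criterion.

```python
import numpy as np

def room_noise_dbfs(samples):
    """Estimate the noise floor of a silent recording as RMS level in dBFS
    (0 dBFS = full scale; quieter rooms give more negative numbers)."""
    rms = np.sqrt(np.mean(np.square(samples.astype(np.float64))))
    return 20 * np.log10(max(rms, 1e-12))

# Toy example: low-level room tone recorded with full scale = 1.0
room_tone = 0.001 * np.random.randn(16000)
level = room_noise_dbfs(room_tone)
print(f"{level:.1f} dBFS", "OK" if level < -50 else "room too noisy")  # assumed threshold
```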
From other past research in ASR, we knew that the right tool could solve the crowd sourcing problem. ChitChat allowed us to experiment in different environments to get an idea of what kind of office space would work and what kind of problems we might encounter. After experimenting with several different laptops and tablets, we were able to find a computer that recognized the necessary peripherals (the microphone, USB converter, and preamp) for under $2,000 – much cheaper than a recording studio!
Now we needed multiple speakers of a single language. For us, it was a no-brainer to pilot Project Unison with Bangladeshi Googlers, all of whom are passionate about getting Google products to their home country (the success of Android products in Bangladesh is an example of this). Googlers by and large are passionate about their work and many offer their 20% time as a way to help, to improve or to experiment on something that may or may not work because they care. The Bangladeshi Googlers are no exception. They embodied our objectives for a crowdsourcing innovation: out of many, we could achieve (literally) one voice.
With multiple speakers, we would target speakers of similar vocal profiles and adapt them to create a blended voice. Statistical parametric synthesis is not new, but the advances in recent technology have improved quality and proved to be a lightweight solution for a project like ours.
In May of this year, we auditioned 15 Bangladeshi Googlers in Mountain View. From these recordings, the broader Bangladeshi Google community voted blindly for their preferred voice. Zakaria Haque, software engineer in Machine Intelligence, was chosen as our reference for the Bangla voice. We then narrowed down the group to five speakers based on these criteria: Dhaka accent, male (to match Zakaria’s), similarity in pitch and tone, and availability for recordings. The original plan of a spectral analysis using PRAAT proved to be unnecessary with our limited pool of candidates.
All 5 software engineers – Ahmed Chowdury, Mohammad Hossain, Syeed Faiz, Md. Arifuzzaman Arif, Sabbir Yousuf Sanny – plus Zakaria Haque recorded over 3 days in the anechoic chamber, a makeshift sound-proofed room at the Mountain View campus just before Ramadan. HyunJeong Choe, who had helped with the Korean TTS recordings, directed our volunteers.
In this session, we discovered an issue: a sudden drop in amplitude at high frequencies in a few recordings. We were worried that all the recordings might have to be scrapped.
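A quick way to screen for that kind of defect is to compare each recording’s spectral energy above some cutoff frequency with its total energy and flag outliers; the cutoff and flagging rule in this sketch are illustrative assumptions, not what we actually used.

```python
import numpy as np

def high_freq_ratio(samples, sample_rate=48000, cutoff_hz=8000):
    """Fraction of spectral energy above `cutoff_hz`; recordings with a
    high-frequency roll-off show an unusually small ratio."""
    spectrum = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return spectrum[freqs >= cutoff_hz].sum() / spectrum.sum()

# Flag recordings whose high-frequency share is far below the batch median
recordings = [np.random.randn(48000) for _ in range(3)]      # stand-ins for real takes
ratios = [high_freq_ratio(r) for r in recordings]
median = np.median(ratios)
flags = [ratio < 0.5 * median for ratio in ratios]            # assumed screening rule
print(ratios, flags)
```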
We considered using the publicly available TTS data from the Indian Institute of Information Technology, but it represented the variant of Bangla spoken in West Bengal (India), which differs from the speech we had recorded. Our internally designed pronunciation rules for Bengali were also aimed at West Bengal and would need to be revised later.
Deciding to proceed anyway, Alexander Gutkin, Speech software engineer and lead for TTS for Low Resource Languages in Google London, built an initial prototype voice. Using the preliminary text normalization rules created by Richard Sproat, Speech and Language Processing researcher, the first voice we attempted proved to be surprisingly good. The problem in the high frequencies we had seen in the recordings was undetectable in the parametric voice.
When we return to the sound studio to record an additional 200 longer sentences, we plan to try an upgrade of the USB converter. Meanwhile, Martin Jansche, Natural Language Understanding software engineer, has worked with a team of native speakers on a pronunciation lexicon and model that better match the phonology of colloquial Bangladeshi Bangla. Alexander will use the additional recordings and the new pronunciation dictionary to build the second version.
NEXT UP: Building a parametric voice with multiple speaker data (Ep.2)