{"id":1293,"date":"2017-11-15T11:50:14","date_gmt":"2017-11-15T11:50:14","guid":{"rendered":"http:\/\/blogs.kent.ac.uk\/unikentcomp-news\/?p=1293"},"modified":"2018-05-10T14:45:59","modified_gmt":"2018-05-10T13:45:59","slug":"explainer-how-the-latest-earphones-translate-languages","status":"publish","type":"post","link":"https:\/\/blogs.kent.ac.uk\/unikentcomp-news\/2017\/11\/15\/explainer-how-the-latest-earphones-translate-languages\/","title":{"rendered":"Explainer: how the latest earphones translate languages"},"content":{"rendered":"<figure><figcaption><span class=\"attribution\"><a class=\"source\" href=\"https:\/\/www.shutterstock.com\/image-photo\/woman-holds-her-hand-near-ear-454016317?src=gnBoj8EPd-JVtMfNYi_MOA-1-2\">Shutterstock<\/a><\/span><\/figcaption><\/figure>\n<p><a href=\"https:\/\/theconversation.com\/profiles\/ian-mcloughlin-217755\">Ian McLoughlin<\/a>, <em><a href=\"http:\/\/theconversation.com\/institutions\/university-of-kent-1248\">University of Kent<\/a><\/em><\/p>\n<p>In the <a href=\"http:\/\/www.bbc.co.uk\/programmes\/b03v379k\">Hitchhiker\u2019s Guide to The Galaxy<\/a>, Douglas Adams\u2019s seminal 1978 BBC broadcast (then book, feature film and now cultural icon), one of the many technology predictions was the <a href=\"http:\/\/www.bbc.co.uk\/cult\/hitchhikers\/guide\/babelfish.shtml\">Babel Fish<\/a>. This tiny yellow life-form, inserted into the human ear and fed by brain energy, was able to translate to and from any language.<\/p>\n<p>Web giant Google have now seemingly <a href=\"http:\/\/www.telegraph.co.uk\/technology\/2017\/10\/04\/googles-new-headphones-can-translate-foreign-languages-real\/\">developed their own version<\/a> of the Babel Fish, called Pixel Buds. These wireless earbuds make use of <a href=\"https:\/\/assistant.google.com\/\">Google Assistant<\/a>, a smart application which can speak to, understand and assist the wearer. 
One of the headline abilities is support for Google Translate, which is said to be able to translate up to 40 different languages. Impressive technology for under US$200.<\/p>\n<p>So how does it work?<\/p>\n<p>Real-time speech translation consists of a chain of several distinct technologies \u2013 each of which has experienced rapid improvement over recent years. The chain, from input to output, goes like this:<\/p>\n<ol>\n<li><strong>Input conditioning<\/strong>: the earbuds pick up background noise and interference, effectively recording a mixture of the user\u2019s voice and other sounds. \u201c<a href=\"http:\/\/acousticsresearchcentre.no\/speech-enhancement-with-deep-learning\">Denoising<\/a>\u201d removes background sounds while a <a href=\"https:\/\/link.springer.com\/article\/10.1186\/s13634-015-0277-z#Sec5\">voice activity detector<\/a> (VAD) is used to turn the system on only when the correct person is speaking (and not someone standing behind you in a queue saying \u201cOK Google\u201d very loudly). Touch control is used to improve the VAD accuracy.<\/li>\n<li><strong>Language identification (LID)<\/strong>: this system uses machine learning to identify what <a href=\"https:\/\/doi.org\/10.1109\/TASLP.2017.2766023\">language is being spoken<\/a> within a couple of seconds. This is important because everything that follows is language specific. 
For language identification, phonetic characteristics alone are insufficient to distinguish languages (language pairs such as Ukrainian and Russian, or Urdu and Hindi, are virtually identical in their units of sound, or \u201cphonemes\u201d), so completely new acoustic representations <a href=\"https:\/\/pdfs.semanticscholar.org\/8665\/8be322dfb3d2a0fa5262b095ba6c5a6c31a2.pdf\">had to be developed<\/a>.<\/li>\n<li><strong>Automatic speech recognition (ASR)<\/strong>: <a href=\"http:\/\/www.cs.columbia.edu\/%7Emcollins\/6864\/slides\/asr.pdf\">ASR<\/a> uses an acoustic model to convert the recorded speech into a string of phonemes, and then language modelling is used to convert the phonetic information into words. By using the rules of spoken grammar, context, probability and a pronunciation dictionary, ASR systems fill in gaps of missing information and correct mistakenly recognised phonemes to infer a textual representation of what the speaker said.<\/li>\n<li><strong>Natural language processing<\/strong>: <a href=\"https:\/\/blog.algorithmia.com\/introduction-natural-language-processing-nlp\">NLP<\/a> performs machine translation from one language to another. This is not as simple as substituting nouns and verbs, but includes <a href=\"https:\/\/codeburst.io\/a-guide-to-nlp-a-confluence-of-ai-and-linguistics-2786c56c0749\">decoding the <em>meaning<\/em> of the input speech<\/a>, and then re-encoding that meaning as output speech in a different language &#8211; with all the nuances and complexities that make second languages so hard for us to learn.<\/li>\n<li><strong>Speech synthesis<\/strong> or text-to-speech (TTS): almost the opposite of ASR, this synthesises natural-sounding speech from a string of words (or phonetic information). Older systems used concatenative synthesis, which effectively meant joining together lots of short recordings of someone speaking different phonemes into the correct sequence. 
More modern systems use <a href=\"http:\/\/www.cstr.ed.ac.uk\/downloads\/publications\/2010\/king_hmm_tutorial.pdf\">complex statistical speech models<\/a> to recreate a natural-sounding voice.<\/li>\n<\/ol>\n<figure><div class=\"kent-video-wrapper\"><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text\/html' width='1140' height='672' src='https:\/\/www.youtube.com\/embed\/dZojo2yxzVA?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0' allowfullscreen='true'><\/iframe><\/span><\/div><\/figure>\n<h2>Putting it all together<\/h2>\n<p>Now that we have the five blocks of technology in the chain, let\u2019s see how the system would work in practice to translate between languages such as Chinese and English.<\/p>\n<p>Once ready to translate, the earbuds first record an utterance, using a VAD to identify when the speech starts and ends. Background noise can be partially removed within the earbuds themselves, or once the recording has been transferred by Bluetooth to a smartphone. It is then compressed to occupy a much smaller amount of data, then conveyed over WiFi, 3G or 4G to Google\u2019s speech servers.<\/p>\n<p>Google\u2019s servers, operating as a cloud, will accept the recording, decompress it, and use LID technology to determine whether the speech is in Chinese or in English.<\/p>\n<p>The speech will then be passed to an ASR system for Chinese, then to an NLP machine translator set up to map from Chinese to English. The output of this will finally be sent to TTS software for English, producing a compressed recording of the output. This is sent back in the reverse direction to be replayed through the earbuds.<\/p>\n<p>This might seem like a lot of stages of communication, but it takes <a href=\"https:\/\/www.youtube.com\/watch?v=dZojo2yxzVA\">just seconds to happen<\/a>. 
And it is necessary \u2013 firstly, because the processor in the earbuds is not powerful enough to do translation by itself, and secondly, because their memory storage is insufficient to contain the language and acoustic models. Even if a powerful enough processor with enough memory could be squeezed into the earbuds, the complex computer processing would deplete the earbud batteries in a couple of seconds.<\/p>\n<p>Furthermore, companies with these kinds of products (Google, <a href=\"http:\/\/www.iflytek.com\/en\">iFlytek<\/a> and <a href=\"https:\/\/www.ibm.com\/watson\/services\/language-translator\">IBM<\/a>) rely on continuous improvement to correct, refine and improve their translation models. Updating a model is easy on their own cloud servers. It is much more difficult to do when installed in an earbud.<\/p>\n<figure><div class=\"kent-video-wrapper\"><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text\/html' width='1140' height='672' src='https:\/\/www.youtube.com\/embed\/6i5hho2aD-E?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0' allowfullscreen='true'><\/iframe><\/span><\/div><\/figure>\n<p><img loading=\"lazy\" src=\"https:\/\/counter.theconversation.com\/content\/87136\/count.gif?distributor=republish-lightbox-basic\" alt=\"The Conversation\" width=\"1\" height=\"1\" \/>The late Douglas Adams would surely have found the technology behind these real-life translating machines amazing \u2013 which it is. But computer scientists and engineers will not stop here. The next wave of speech-enabled computing could even be inspired by another fictional device, such as Iron Man\u2019s smart computer, <a href=\"https:\/\/futurism.com\/this-new-ai-is-like-having-iron-mans-jarvis-living-on-your-wall\">J.A.R.V.I.S<\/a> (Just Another Rather Very Intelligent System) from the Marvel series. 
This system would go way beyond translation: it would be able to converse with us, understand what we are feeling and thinking, and anticipate our needs.<\/p>\n<p><a href=\"https:\/\/theconversation.com\/profiles\/ian-mcloughlin-217755\">Ian McLoughlin<\/a>, Professor of Computing, Head of School (Medway), <em><a href=\"http:\/\/theconversation.com\/institutions\/university-of-kent-1248\">University of Kent<\/a><\/em><\/p>\n<p>This article was originally published on <a href=\"http:\/\/theconversation.com\">The Conversation<\/a>. Read the <a href=\"https:\/\/theconversation.com\/explainer-how-the-latest-earphones-translate-languages-87136\">original article<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Shutterstock Ian McLoughlin, University of Kent In the Hitchhiker\u2019s Guide to The Galaxy, Douglas Adams\u2019s seminal 1978 BBC broadcast (then book, feature film and now &hellip; <a 
href=\"https:\/\/blogs.kent.ac.uk\/unikentcomp-news\/2017\/11\/15\/explainer-how-the-latest-earphones-translate-languages\/\">Read&nbsp;more<\/a><\/p>\n","protected":false},"author":5321,"featured_media":1294,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1479,124,57908],"tags":[178022],"_links":{"self":[{"href":"https:\/\/blogs.kent.ac.uk\/unikentcomp-news\/wp-json\/wp\/v2\/posts\/1293"}],"collection":[{"href":"https:\/\/blogs.kent.ac.uk\/unikentcomp-news\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.kent.ac.uk\/unikentcomp-news\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.kent.ac.uk\/unikentcomp-news\/wp-json\/wp\/v2\/users\/5321"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.kent.ac.uk\/unikentcomp-news\/wp-json\/wp\/v2\/comments?post=1293"}],"version-history":[{"count":2,"href":"https:\/\/blogs.kent.ac.uk\/unikentcomp-news\/wp-json\/wp\/v2\/posts\/1293\/revisions"}],"predecessor-version":[{"id":1296,"href":"https:\/\/blogs.kent.ac.uk\/unikentcomp-news\/wp-json\/wp\/v2\/posts\/1293\/revisions\/1296"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blogs.kent.ac.uk\/unikentcomp-news\/wp-json\/wp\/v2\/media\/1294"}],"wp:attachment":[{"href":"https:\/\/blogs.kent.ac.uk\/unikentcomp-news\/wp-json\/wp\/v2\/media?parent=1293"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.kent.ac.uk\/unikentcomp-news\/wp-json\/wp\/v2\/categories?post=1293"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.kent.ac.uk\/unikentcomp-news\/wp-json\/wp\/v2\/tags?post=1293"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}