{"id":250,"date":"2018-10-30T21:31:03","date_gmt":"2018-10-30T21:31:03","guid":{"rendered":"http:\/\/blogs.kent.ac.uk\/datascience\/?p=250"},"modified":"2018-11-02T12:01:15","modified_gmt":"2018-11-02T12:01:15","slug":"how-to-teach-ai-to-speak-welsh-and-other-minority-languages","status":"publish","type":"post","link":"https:\/\/blogs.kent.ac.uk\/aida\/2018\/10\/30\/how-to-teach-ai-to-speak-welsh-and-other-minority-languages\/","title":{"rendered":"How to teach AI to speak Welsh (and other minority languages)"},"content":{"rendered":"<p class=\"lead\">Pioneering smart home technologies and voice assistants don\u2019t, as a rule, speak Welsh \u2013 although the Welsh government now aims to change that through their Welsh Language Technology Action Plan. But is their aim feasible, is it necessary, and how can it be done? Professor McLoughlin investigates.<\/p>\n<p>AI speech tools (like <a href=\"https:\/\/theconversation.com\/explainer-how-the-latest-earphones-translate-languages-87136\">Google\u2019s Pixelbuds<\/a>) are heavily reliant on the use of big data sets to learn a language, its pronunciation, grammar and semantics. The ability or quality of the resulting tools is mainly limited by how much data is available (and how \u201cgood\u201d it is). This means that, in theory at least, tools for a minority language like Welsh cannot become as capable as those for a mainstream language.<\/p>\n<p>Languages with limited amounts of good training data available are termed \u201clow resource\u201d languages. 
Compared to English, <a href=\"https:\/\/www.ethnologue.com\/language\/cym\">Welsh<\/a> resources are sparse, but there are <a href=\"https:\/\/www.ethnologue.com\/browse\/names\">several thousand languages<\/a> with <a href=\"https:\/\/www.ethnologue.com\/statistics\/size\">fewer speakers<\/a>, and most likely much poorer resources, than <a href=\"https:\/\/www.ethnologue.com\/language\/cym\">Welsh<\/a>.<\/p>\n<p>Fortunately, good research is being done on a machine learning technique called \u201c<a href=\"https:\/\/ieeexplore.ieee.org\/document\/5288526\">transfer learning<\/a>\u201d. This allows a system to learn from one set of data and then apply that knowledge to another. In China it is being used for <a href=\"https:\/\/ieeexplore.ieee.org\/abstract\/document\/8282215\">automatic speech recognition (ASR) of Tibetan<\/a>, which has virtually no data available for training. The ASR system learned Chinese \u2013 which is linguistically very different to Tibetan \u2013 and was then retrained, or fine-tuned, to \u201cunderstand\u201d Tibetan.
There is actually a lot of commonality between many languages \u2013 shared or borrowed words and pronunciation patterns \u2013 that helps this kind of technique.<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-medium wp-image-253\" src=\"http:\/\/blogs.kent.ac.uk\/datascience\/files\/2018\/10\/Screen-Shot-2018-10-30-at-21.28.51-300x205.png\" alt=\"\" width=\"300\" height=\"205\" srcset=\"https:\/\/blogs.kent.ac.uk\/aida\/files\/2018\/10\/Screen-Shot-2018-10-30-at-21.28.51-300x205.png 300w, https:\/\/blogs.kent.ac.uk\/aida\/files\/2018\/10\/Screen-Shot-2018-10-30-at-21.28.51-768x524.png 768w, https:\/\/blogs.kent.ac.uk\/aida\/files\/2018\/10\/Screen-Shot-2018-10-30-at-21.28.51-1024x698.png 1024w, https:\/\/blogs.kent.ac.uk\/aida\/files\/2018\/10\/Screen-Shot-2018-10-30-at-21.28.51-1920x1309.png 1920w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<h2>Retraining AI in Welsh<\/h2>\n<p>So there is no reason why AI systems cannot be produced to converse in Welsh or other minority languages. But is there any reason why they should be? All of the speech technology, smart home and voice interaction systems used today are the products of commercial research. To put it bluntly, they exist to make money from your data, to sell you more goods and services, or to influence your thinking. None of this AI <a href=\"https:\/\/www.theguardian.com\/technology\/2017\/jan\/22\/home-battleground-amazon-google-voice-technology\">exists for the public good<\/a>.<\/p>\n<p>Making a system that works well with Welsh may not be as easy as engineering everything in English. With current technology, speech AI experts will be needed (and we are expensive). There will be a need for Welsh training and testing material, and Welsh-speaking testers must be involved.
The dangers of not having Welsh speakers involved in the translation process have been <a href=\"http:\/\/news.bbc.co.uk\/1\/hi\/7702913.stm\">amply demonstrated in the past<\/a>, when an out-of-office email reply ended up on a road sign.<\/p>\n<p>Unless there is a strong enough economic argument, don\u2019t expect big companies to rush into producing Welsh, Gaelic or Cornish speech systems. Even tech giant Samsung hasn\u2019t yet managed to produce a UK-English-speaking version of its Bixby assistant (international English speakers need to <a href=\"https:\/\/eu.community.samsung.com\/t5\/Smartphones-Tablets-Wearables\/Bixby-Voice-Command-UK-Music\/td-p\/252275\">speak to it in fake American accents<\/a> to get it to work). The US-English version itself was <a href=\"https:\/\/www.phonearena.com\/news\/Samsungs-Bixby-Voice-delayed-due-to-lack-of-resources_id95414\">delayed due to a lack of resources<\/a>.<\/p>\n<p>And as long as Welsh speakers are happy to make use of English-language AI systems, there may not be an economic argument \u2013 unless the Welsh government decides to pay to make it happen, which it has so far not done (the action plan is a \u201ccommitment\u201d at this stage).<\/p>\n<h2>AI to the rescue<\/h2>\n<p>Technology marches on, and techniques such as transfer learning are becoming more capable every day. This has allowed previous research on <a href=\"http:\/\/www.cs.cmu.edu\/%7Etanja\/Papers\/SchultzSpecomOrigPublication.pdf\">language adaptation<\/a> to be refreshed and extended into the development of <a href=\"http:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.368.5160&amp;rep=rep1&amp;type=pdf\">multi-language deep learning techniques<\/a>. Meanwhile, growing use of other kinds of digital technology by Welsh speakers has improved the collection of resources in the language, as have Welsh TV and radio.
These advances mean that the cost of localising systems for Welsh (and other minority languages) is falling.<\/p>\n<p>Research on brain-like learning algorithms may just hold the key here. This is technology that can continually learn during use, just as humans learn to speak a new language. It is unlike most current AI systems, which are trained in the lab before being let loose in the wild \u2013 apart from a few exceptions, some of them, like <a href=\"https:\/\/www.theguardian.com\/technology\/2016\/mar\/24\/tay-microsofts-ai-chatbot-gets-a-crash-course-in-racism-from-twitter\">Microsoft\u2019s Tay<\/a>, notable for their <a href=\"https:\/\/blogs.microsoft.com\/blog\/2016\/03\/25\/learning-tays-introduction\/\">spectacular failures<\/a>. Future systems will be able to gradually acquire skills in a second language just by having users introduce more and more of that language in their daily interactions. Rather than funding research into Welsh speech AI, the Welsh government may well do better by backing research into <a href=\"https:\/\/www.cs.uic.edu\/%7Eliub\/lifelong-machine-learning.html\">this new kind of adaptive learning technology<\/a>.<\/p>\n<p>Because all current speech AI systems handle the speech centrally (it\u2019s not done on the device, but <a href=\"https:\/\/techxplore.com\/news\/2017-11-google-pixel-buds-earphones-languages.html\">in a remote server farm<\/a>), these systems could gather data from hundreds of users worldwide (or all over Wales) to learn rapidly. So the message to Welsh speakers today may be not to buy that English-language Google Home or Amazon Alexa if you want Google or Amazon to produce a system that works in Welsh. But if you do have one, as its software develops over the next few years, try speaking Welsh to it as much as possible.
It may just surprise you and <a href=\"https:\/\/translate.google.co.uk\/#cy\/en\/Siaradwch%20%C3%A2%20chi%20yn%20Gymraeg\">Siaradwch \u00e2 chi yn Gymraeg<\/a> (speak to you in Welsh).<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-medium wp-image-254\" src=\"http:\/\/blogs.kent.ac.uk\/datascience\/files\/2018\/10\/830px-Flag_of_Wales_1959\u2013present.svg_-300x180.png\" alt=\"\" width=\"300\" height=\"180\" srcset=\"https:\/\/blogs.kent.ac.uk\/aida\/files\/2018\/10\/830px-Flag_of_Wales_1959\u2013present.svg_-300x180.png 300w, https:\/\/blogs.kent.ac.uk\/aida\/files\/2018\/10\/830px-Flag_of_Wales_1959\u2013present.svg_-768x461.png 768w, https:\/\/blogs.kent.ac.uk\/aida\/files\/2018\/10\/830px-Flag_of_Wales_1959\u2013present.svg_.png 830w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p>This article appears in full in <a href=\"https:\/\/theconversation.com\/how-to-teach-ai-to-speak-welsh-and-other-minority-languages-105675\">The Conversation<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Pioneering smart home technologies and voice assistants don\u2019t, as a rule, speak Welsh \u2013 although the Welsh government now aims to change that through their &hellip; <a
href=\"https:\/\/blogs.kent.ac.uk\/aida\/2018\/10\/30\/how-to-teach-ai-to-speak-welsh-and-other-minority-languages\/\">Read&nbsp;more<\/a><\/p>\n","protected":false},"author":55472,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[124],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.kent.ac.uk\/aida\/wp-json\/wp\/v2\/posts\/250"}],"collection":[{"href":"https:\/\/blogs.kent.ac.uk\/aida\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.kent.ac.uk\/aida\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.kent.ac.uk\/aida\/wp-json\/wp\/v2\/users\/55472"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.kent.ac.uk\/aida\/wp-json\/wp\/v2\/comments?post=250"}],"version-history":[{"count":3,"href":"https:\/\/blogs.kent.ac.uk\/aida\/wp-json\/wp\/v2\/posts\/250\/revisions"}],"predecessor-version":[{"id":263,"href":"https:\/\/blogs.kent.ac.uk\/aida\/wp-json\/wp\/v2\/posts\/250\/revisions\/263"}],"wp:attachment":[{"href":"https:\/\/blogs.kent.ac.uk\/aida\/wp-json\/wp\/v2\/media?parent=250"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.kent.ac.uk\/aida\/wp-json\/wp\/v2\/categories?post=250"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.kent.ac.uk\/aida\/wp-json\/wp\/v2\/tags?post=250"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}