Tag: Malagasy

  • Optimised Malagasy Keyboard (version 3.0)

    This post is a follow-up to an older post from 2016: Finding an optimised keyboard for Malagasy

    I wrote that post back in November 2016 about how inefficient the AZERTY keyboard, currently used by most Malagasy people, is. It performs abysmally and may even lead to finger joint problems after years of extensive use.

    Feedback from first versions

    After spending a few days iterating over layouts, I came up with a first version that scored really well on the test corpus, and I even used it for a couple of years. Besides not being popular at all, it suffered from some flaws:

    • Accented characters in Malagasy were OK, but accented letters in French, which is also used by most Malagasy people using a computer, were severely lacking.
    • Programming was hard, as some characters such as the backslash were not present.
    • Currency symbols like the Euro or the Pound were absent. While not a major inconvenience, their absence can sometimes be felt when writing about the UK (which uses the Pound Sterling) or France (which uses the Euro), for instance.
    • The characters “<” and “>” could not be typed on certain keyboards, including the laptop I was using back in 2017.

    After spending a couple of years getting used to the first iteration and noting its flaws, I have come up with another version, which incorporates some improvements suggested by Ian Douglas (see comments on the older post).

    Version 3

    This keyboard is a major change compared to the previous iteration, as several keys have been moved or swapped. Most notably:

    • The U key has moved to the right-hand side of the keyboard. U is not used in native words.
    • The apostrophe and double quote have moved to the left side of the keyboard. The most common word using the apostrophe is amin’ny, which can now be typed by alternating left and right hands.
    • The accented O has its own key. Like accented letters in French, Ô is not considered a separate letter, but it’s often used.

    Analysis Results

    When accessing the analysis results, we have the following winners:

    The heatmap for the Version 3 is as follows:

    The row usage is as follows

    Below is the hand usage for our sample text, based on Sarasara Tsy Ambaka. It heavily favours the left hand over the right hand, as Malagasy uses a lot of vowels, which are all on the left-hand side of the home row, right below the user’s fingers.

    The pie chart above is obtained by having the left thumb hit the spacebar. We can swap that with the right thumb and get the result below for the Malagasy v3:

    Hand usage is not a lot more balanced. Space bar accounted for roughly 13% of all keyboard hits in the sample text I used.
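
    As a rough illustration of how moving the spacebar from one thumb to the other shifts the balance, here is a minimal sketch. The key-to-hand assignment below is hypothetical and is not the real Malagasy v3 layout; it only keeps the one property mentioned above, namely that the vowels sit on the left hand. The function name and the sample sentence are also made up for the example.

```python
from collections import Counter

# Hypothetical hand assignment, NOT the real Malagasy v3 layout:
# vowels on the left hand (as described in the post), common consonants on the right.
LEFT_KEYS = set("aeioy")
RIGHT_KEYS = set("nmtrskhdlfvpbgjz")


def hand_usage(text, space_hand="left"):
    """Return the share of key hits per hand as fractions that sum to 1."""
    hits = Counter()
    for ch in text.lower():
        if ch == " ":
            # The spacebar is credited to whichever thumb is chosen.
            hits[space_hand] += 1
        elif ch in LEFT_KEYS:
            hits["left"] += 1
        elif ch in RIGHT_KEYS:
            hits["right"] += 1
        # Characters outside both sets are ignored in this sketch.
    total = hits["left"] + hits["right"]
    if total == 0:
        return {"left": 0.0, "right": 0.0}
    return {"left": hits["left"] / total, "right": hits["right"] / total}


sample = "manao ahoana ianao"
print(hand_usage(sample, space_hand="left"))
print(hand_usage(sample, space_hand="right"))
```

    Running it on a real corpus with the real layout would give the actual figures; the point here is only that every space moves one hit from one hand to the other.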

    On multilingual typing

    The most used language pair in Madagascar when it comes to multilingual typing is Malagasy and French, or more likely French and Malagasy. Office workers most often use French as their working language, and Malagasy for other everyday communication. Here is how the Malagasy v3.0 keyboard performs in bilingual usage. The tests were made with a 5,000-character text in French followed by another 5,000 characters in Malagasy. Information density per character is higher in French than in Malagasy: French averages 6 characters per word, whereas Malagasy is closer to 10. Nevertheless, both passages were truncated to the same length.

    Here are the detailed results. I will present the most interesting parts here.

    The v3 is still the winner here, but as you can see, the difference between the winner and the runner-up no longer seems to be significant, so let’s use another metric:

    In the table above, we have the distance covered by our fingers dancing on the keyboard, in centimetres; the lower, the better.

    Let’s start with the loser here, the AZERTY layout (will this AZERTY-bashing post ever stop?), with over 33,000 centimetres for ten thousand characters, where the left pinky and the index fingers travel a lot. If these were metres and not centimetres, that would be more than three quarters of a marathon.

    A surprising-but-not-so-surprising contender here is the BEPO layout, which already has some notoriety and a nice total distance of 17,241 centimetres. That makes writing 10,000 characters look less like a marathon and more like 40% of a marathon. Good runners could run 40% of a marathon on a weekday after a day of work.

    Malagasy v1.0 also gets away with 16,765 cm.

    Malagasy v2.2 and v3 are quite close to each other, with 15,901 cm and 15,299 cm respectively. Version 3 has some nice key mappings that let it type characters that were absent in version 2.
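
    For readers who want to redo the marathon arithmetic, here is a small sketch using the distances quoted in this section. The AZERTY value is an assumption on my side: the text only says "over 33,000", so 33,000 is used as a lower bound. The function names are made up for the example.

```python
MARATHON_M = 42_195  # official marathon distance in metres

# Distances quoted in the post, in centimetres per 10,000 characters.
# AZERTY is only given as "over 33,000", so 33,000 is a lower bound (assumption).
distances_cm = {
    "AZERTY": 33_000,
    "BEPO": 17_241,
    "Malagasy v1.0": 16_765,
    "Malagasy v2.2": 15_901,
    "Malagasy v3": 15_299,
}


def marathon_share(distance):
    """Share of a marathon covered if the figure were metres instead of centimetres."""
    return distance / MARATHON_M


def saving_vs(baseline, other):
    """Relative reduction in finger travel compared with a baseline layout."""
    return 1 - other / baseline


for name, cm in distances_cm.items():
    print(
        f"{name}: {cm / 100:.1f} m of real finger travel, "
        f"{marathon_share(cm):.0%} of a marathon if read as metres, "
        f"{saving_vs(distances_cm['AZERTY'], cm):.0%} less travel than AZERTY"
    )
```

    Since the AZERTY figure is a lower bound, the savings printed for the other layouts are lower bounds too.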

    On shortcuts

    We office workers like to use shortcuts. The best known are Ctrl+A (select all), Ctrl+C (copy), Ctrl+F (search in file), Ctrl+K (cut line after cursor), Ctrl+N (new file), Ctrl+S (save), Ctrl+U (cut line before cursor), Ctrl+V (paste), Ctrl+X (cut) and Ctrl+Z (undo the last action).

    Where do we stand on these with the Malagasy Keyboard v3?

    Well, here we have to use both Control keys, or use two hands if we don’t want to do that.

    Conclusion

    Finding the optimal combination is very much a work in progress, but version 3 has already come a long way. I especially need to find a way to rebalance right-hand and left-hand usage, but that won’t be easy given how often we use vowels.

  • Using GPT-2 for Malagasy

    Long ago I became interested in natural language processing. From 2010 until 2014 I had been actively developing various programs to increase content coverage of the Malagasy Wiktionary. The result now is 5.9 million words in 4,100 languages.

    From 2014 to this day, I have been researching ways to improve the quality of the translations provided by the bot. In 2019, OpenAI released GPT-2, a language model that can generate news-like articles. The generated articles were so believable that OpenAI refrained from releasing the full model until the end of 2019, as there were fears that fine-tuning it could lead to fake news or dangerous propaganda being published en masse. The full model was only released once detection techniques were accurate enough to tell generated and non-generated articles apart.

    Once the full model was released, I began fine-tuning it on Malagasy language text. The target was to generate news-like articles from an existing corpus scraped from four major news websites, resulting in 49 MB of training data. In comparison, the English language model was trained using 40 GB of data.

    Scraping Malagasy language sources

    On the internet, data sources for Malagasy are relatively scarce and not very diverse compared to English or any other European language. The main reason is that most Malagasy sites use French as their publishing language. As a consequence, the sources used were daily newspapers such as NewsMada, Madagascar Tribune, Aoraha and la Gazette de la Grande Ile. It is worth noting that two of these newspapers are bilingual, so articles had to be filtered.

    Filtering out French articles

    The next task was to detect and remove French language articles since we are training the model to generate Malagasy and not French.

    How?

    Since both languages use the Latin alphabet, using Unicode to our advantage won’t do the job. Language detection using machine learning, while attractive, is clearly overkill and would further divert us from our goal.

    Instead, to keep things simple, I relied on the single biggest difference between written Malagasy and French. Our version of the Latin alphabet has no C, Q, U, W or X, nor accented characters like É or È. In other words, no native Malagasy word contains any of these.

    I also fetched all French words and inflections to be spot on every single time. And in less than 100 lines, I could filter out anything French.
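
    The letter-based idea can be sketched as follows. This is not the actual filter from the repository, only a minimal illustration of the principle: the threshold value and the function names are assumptions, and the French word list mentioned above is left out. A ratio is used rather than a strict rule because Malagasy articles still contain loanwords and proper names with those letters.

```python
import re
import unicodedata

# Letters the post says never appear in native Malagasy words.
NON_MALAGASY_LETTERS = set("cquwxéè")

# Assumed threshold: above this share of foreign-looking words,
# the text is treated as French. Tune it on real articles.
FRENCH_THRESHOLD = 0.2

# Runs of letters only (no digits, no underscore), accented letters included.
WORD_RE = re.compile(r"[^\W\d_]+")


def foreign_word_ratio(text):
    """Share of words containing at least one letter absent from Malagasy."""
    # NFC keeps accented letters as single code points so the set test works.
    normalised = unicodedata.normalize("NFC", text).lower()
    words = WORD_RE.findall(normalised)
    if not words:
        return 0.0
    foreign = sum(1 for word in words if NON_MALAGASY_LETTERS & set(word))
    return foreign / len(words)


def looks_french(text, threshold=FRENCH_THRESHOLD):
    """Heuristic check: True when too many words use non-Malagasy letters."""
    return foreign_word_ratio(text) > threshold


print(looks_french("Le gouvernement a annoncé une nouvelle mesure économique"))
print(looks_french("Tsiahivina fa efa nisy ny nahafantarana"))
```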

    Using GPT-2

    As expected, training takes time and space. Lots of it. Each model checkpoint takes 1.3 GB and is saved to disk every 50 iterations. At 21,000 iterations, further progress seems hard, but this is what it can generate (the article below does not exist):

    ANTSIRABE: SARONA TANTERAKA NY FITAFIANA MPANAO SINTO-MAHERY | NEWSMADA

    Par Taratra sur 08/12/2019

    Nandray ny asa famonoana ho faty ny zandary nandray anjara tamin’ny fanafihana nitafiana mpanao
    sy toeram-piantsonan’ny taxi-be nandritra ny fanarahan-dia, tao amin’ny kaompania Ambositra,
    faran’ny herinandro teo, ka nanao ny fanarahan-dia.

    Tsiahivina fa efa nisy ny nahafantarana fa nanafika mpandraharaha an’ilay mpandraharaha ny
    tao Andranohazo Antsirabe. Raikitra ny fitifirana ka vokatry ny fanarahan-dia avy hatrany ity mpandraharaha
    ity. Tsy fantatra mazava hatrany na ny sasany aza tambajotran-javatra malemy na koa raha tsy izany
    mitohy na miaro ny kolikoly rehetra na mpanao sinto-mahery na manana ny anton-diany

    Conclusions

    Should we go further with our model, we would end up creating a “thismalagasynewsarticledoesnotexist.com” website to host them all. The source code is available on GitHub; the news articles used as training data cannot be made public for copyright reasons.

    Another use for a good-enough model would be to illustrate the Malagasy Wiktionary with unique examples of word usage.

  • Google Translate now available in Malagasy

    Good news, if it can be said, for my fellow Malagasy citizens: since the 6th of December 2014, Google Translate has been allowing them to see almost any web page in their mother tongue, in addition to 89 others. Many people, myself included, have been waiting for this moment, which was bound to come sooner or later. First of all, I would like to say a big thank you to all the people who have made this possible. Thanks to you, the Malagasy language is getting further integrated into the polyglot Web world. You’ve also given the 15 million monolinguals a chance to have an approximate understanding of what other people have written in other languages.

    Before Google Translate

    Before we had Google Translate to translate almost anything into our language, including curse words, several websites helped us Malagasy and other language enthusiasts to write corpora properly in our mother tongue: many of us have already heard about Freelang, tenymalagasy.org and so on. The only drawback of these websites is that they do not work in a collaborative way: they are not «crowdsourced». Wikibolana is a crowdsourced Malagasy language dictionary, but so far I have been the one who has generated most of its content.

    Is it really that good?

    Well, let’s be honest: absolute accuracy has been the motto of no machine translation system ever. But for a brand new language on Google Translate, Malagasy is… quite good. Daring to translate a language with such an unusual syntax as Malagasy is already a huge challenge, and one worth accepting. At first sight, idiomatic sentences and expressions are fairly well handled. Still, when it comes to very complex sentences, it is a mess: verbs are in the wrong place, which either gives the sentence a completely different meaning or makes it look incomplete. There are also some failures like the one in the screenshot below.

    GTfail
    “ahave” does not mean anything in Malagasy. But this is not the opinion of Google Translate

    Let’s see an example of a translation of a paragraph of the article Madagascar in the English Wikipedia:

    Original in English:
    In 2012, the population of Madagascar was estimated at just over 22 million, 90 percent of whom live on less than two dollars per day. Malagasy and French are both official languages of the state. […] The island’s elephant birds, a family of endemic giant ratites, went extinct in 17th century or earlier, most probably due to human hunting of adult birds and poaching of their large eggs for food.

    Google-translated in Malagasy (as of December 2014):
    Tamin’ny 2012, ny mponina ao Madagasikara dia tombanana ho 22 tapitrisa mahery kely, 90 isan-jaton’ny izay [no] miaina amin’ny [vola] latsaky ny roa dolara isan’andro. Malagasy sy Frantsay dia samy fiteny ofisialy ao amin’ny fanjakana. […] Ny nosy vorona ny elefanta, ny fianakaviana ny fizahantany ratites goavana, dia efa lany tamingana tamin’ny taonjato faha-17, na teo aloha, indrindra noho ny olona angamba ny olon-dehibe ny fihazana sy ny vorona lehibe Fihazana ny atodiny ho sakafo.

    The green-coloured sentences are syntactically correct without correction. The first one required the red words in square brackets to sound correct. The third one hurt my brain: “The elephants are a bird island, the family of big tourists, have gone extinct in 17th century, or before, perhaps because of people, adults, hunting and adult birds who have their eggs hunted for food.” It hurt to understand, and it also hurt to back-translate. Astonishingly, making a round-trip translation gave a correct sentence in English, so please always have your translations checked by human translators.

    Efforts to be continued

    Anyone can help improve translation accuracy by translating articles with the Google Translator Toolkit, or by using and correcting the translations provided by Google Translate itself.