Tag: computer science

  • Using GPT-2 for Malagasy

    I became interested in natural language processing long ago. From 2010 to 2014, I actively developed various programs to increase the content coverage of the Malagasy Wiktionary. The result today is 5.9 million words in 4,100 languages.

    From 2014 to this day, I have been researching ways to improve the quality of the translations provided by the bot. In February 2019, OpenAI announced GPT-2, a language model able to generate news-like articles. The generated articles were so believable that OpenAI refrained from releasing the full model until the end of 2019, fearing that fine-tuning it could lead to fake news or dangerous propaganda being published en masse. The full model was only released once detection techniques were accurate enough to tell generated and non-generated articles apart.

    Once the full model was released, I began fine-tuning it on Malagasy text. The goal was to generate news-like articles from a corpus scraped from four major news websites, which amounted to 49 MB of training data. By comparison, the English-language model was trained on 40 GB of data.

    Scraping Malagasy language sources

    On the internet, Malagasy data sources are scarce and not very diverse compared to English or other European languages. The main reason is that most Malagasy sites publish in French. As a consequence, the sources used were daily newspapers: NewsMada, Madagascar Tribune, Aoraha and La Gazette de la Grande Ile. It is worth noting that two of these newspapers are bilingual, so their articles had to be filtered.

    Filtering out French articles

    The next task was to detect and remove French language articles since we are training the model to generate Malagasy and not French.

    How?

    Since both languages use the Latin alphabet, telling them apart by Unicode range won’t do the job. Language detection using machine learning, while attractive, is clearly overkill and would divert us further from our goal.

    Instead, to keep things simple, I relied on the single biggest difference between written Malagasy and French. Our version of the Latin alphabet leaves out the letters C, Q, U, W and X, as well as accented characters like É or È. In other words, no native Malagasy word contains any of these.

    To be spot on every single time, I also fetched a list of French words and their inflections. In fewer than 100 lines of code, I could filter out anything French.
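    The letter-based part of that filter can be sketched as follows. This is a minimal illustration, not the actual script: the function names and the 20% threshold are assumptions, and the French word list mentioned above is left out.

```python
import re

# Letters absent from the Malagasy alphabet, as listed in the text above.
NON_MALAGASY_LETTERS = set("cquwx")
# Accented characters named in the text (assumption: only these two are checked).
ACCENTED = set("éè")


def has_foreign_letter(word):
    """Return True if the word contains c, q, u, w, x, é or è."""
    return any(ch in NON_MALAGASY_LETTERS or ch in ACCENTED for ch in word.lower())


def looks_french(text, threshold=0.2):
    """Flag a text as French when too many of its words contain foreign letters.

    The threshold is a hypothetical value: Malagasy news text still contains
    some proper nouns and loanwords, so it cannot be zero.
    """
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    if not words:
        return False
    foreign = sum(has_foreign_letter(word) for word in words)
    return foreign / len(words) > threshold


print(looks_french("Le gouvernement a annoncé une nouvelle mesure pour les écoles"))
print(looks_french("Nandray anjara tamin'ny fanafihana ny zandary tao Antsirabe"))
```

    Working on the share of suspicious words rather than on a single hit keeps an article from being thrown away because of one foreign name.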

    Using GPT-2

    As expected, training takes time and space. Lots of it. Each model checkpoint takes 1.3 GB and is saved to disk every 50 iterations. At 21,000 iterations, further progress seems hard, but this is what the model can generate (the article below does not exist):

    ANTSIRABE: SARONA TANTERAKA NY FITAFIANA MPANAO SINTO-MAHERY | NEWSMADA

    Par Taratra sur 08/12/2019

    Nandray ny asa famonoana ho faty ny zandary nandray anjara tamin’ny fanafihana nitafiana mpanao
    sy toeram-piantsonan’ny taxi-be nandritra ny fanarahan-dia, tao amin’ny kaompania Ambositra,
    faran’ny herinandro teo, ka nanao ny fanarahan-dia.

    Tsiahivina fa efa nisy ny nahafantarana fa nanafika mpandraharaha an’ilay mpandraharaha ny
    tao Andranohazo Antsirabe. Raikitra ny fitifirana ka vokatry ny fanarahan-dia avy hatrany ity mpandraharaha
    ity. Tsy fantatra mazava hatrany na ny sasany aza tambajotran-javatra malemy na koa raha tsy izany
    mitohy na miaro ny kolikoly rehetra na mpanao sinto-mahery na manana ny anton-diany

    Conclusions

    Should we go further with our model, we would end up creating a “thismalagasynewsarticledoesnotexist.com” website to host them all. The source code is available on GitHub; the news articles used as training data cannot be made public for copyright reasons.

    Another use for a good-enough model would be to illustrate the Malagasy Wiktionary with unique examples of word usage.

  • Finding an optimised keyboard layout for Malagasy

    In the 21st century, people type. They type a lot.

    Office workers and the Jane Does and John Does from all over the world, speaking various languages, type on electronic keyboards. An average typist types 30-40 words per minute, depending mostly on the language and the layout used. The best typists can reach speeds of up to 100 words per minute.

    The keyboard layout used by most Malagasy speakers puts anyone who wants to write in Malagasy at a huge disadvantage. It is impossible to write quickly in the language without straining the hand muscles. A typical Malagasy sentence is often longer than its French equivalent because words are longer. Depending on the text sample, it may be anywhere from 7% longer (compare the first 10 verses of Chapter 1 of the Gospel of John) to 20% longer for more complex texts. A text that took 10 hours to write in French will easily take 11 to 14 hours in Malagasy. At the scale of a company, or even a country, that is a huge waste of time, mostly due to a legacy that has lost all relevance, as keyboards do not have the same constraints as typewriters.
    To tell you my story: since I got my Samsung tablet, I have almost never used the default Samsung keyboard. So what did I write my text messages with? My own keyboard layout; I’ll show you why and how.

    A quick review of Malagasy usage

    Before I get to the point, let’s see what my fellow Malagasy citizens type their Malagasy text on:

    azerty.jpg
    Fig. 1: AZERTY keyboard, made by the French as an imitation of the American QWERTY

    This, ladies and gentlemen, is the layout currently used and known by most of the 24 million people in Madagascar. Needless to say, their fellow citizens who have emigrated to France use it too.
    The problem is that this layout is not suitable for Malagasy. At all.

    heatkey.jpg
    Fig. 2: Heat map on an AZERTY keyboard used to type in Malagasy.

    The heat map above was generated using the Malagasy version of the Rainilaiarivony Wikipedia article. As a Wikimedia contributor, I had the pleasure of typing it… on an AZERTY keyboard. It was a real pain, and it felt like a lot of effort for a result shorter than the English version I was translating from.

    azerty1.jpg
    Fig. 3: On an AZERTY keyboard, when typing in Malagasy, your left pinky travels A LOT

    My fellow citizens feel this too, and many of them have picked up bad writing habits such as SMS-style abbreviations. That habit is sometimes taken so far that an inexperienced reader may find such a text difficult or even impossible to read.
    Even though most people browse the Web in French or English far more often than in Malagasy, using the QWERTY/AZERTY layouts is a pain, even if this is all we have and all that most people will ever know. Even if it never achieves the success of the traditional layouts, here are my two cents: a layout optimised for the Malagasy language.

    Solutions

    To offset this strong typing-speed disadvantage for Malagasy, I had been using the German Neo keyboard layout. It was already a good alternative to the QWERTY I had been using for four years, but it was still sub-optimal: my left pinky rests on a letter that is never used in Malagasy, my mother tongue.

    neo
    Fig. 4: German Neo Layout (see: neo-layout.org)

    While looking for a solution to my problem, I discovered patorjk.com. For a given text, this website works out which keys are hit most often while the text is typed, and gives each layout a rating based on where those keys sit. One third of the rating reflects the distance your fingers travel, one third how your fingers are used, and one third how often you have to switch fingers and hands while typing. The higher the rating, the less your hands have to travel to type the text, so you would be mechanically less tired typing it on an optimal keyboard than on a standard one.
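    The equal weighting described above can be written down in a few lines. This is only a sketch of the 1/3, 1/3, 1/3 combination: how patorjk.com actually computes each sub-score is not covered here, and the 0-100 scale and the sample values are assumptions.

```python
def layout_rating(distance_score, finger_usage_score, switching_score):
    """Combine three sub-scores with equal weights of one third each.

    Each sub-score is assumed to be on a 0-100 scale where higher is better.
    """
    weights = (1 / 3, 1 / 3, 1 / 3)
    parts = (distance_score, finger_usage_score, switching_score)
    return sum(weight * part for weight, part in zip(weights, parts))


# Hypothetical sub-scores for two made-up layouts, for illustration only.
print(layout_rating(40.0, 55.0, 70.0))
print(layout_rating(75.0, 80.0, 70.0))
```

    With equal weights, a layout cannot compensate for long finger travel with good hand alternation alone; all three aspects have to be decent.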
    For our Rainilaiarivony text, here are the ratings of the keyboards:

    rating
    Fig. 5: Layout ratings

    The loser here is clearly AZERTY, the layout used by most of my fellow citizens. The standard Dvorak layouts are good candidates for typing Malagasy, and maybe we should consider them, since they are widely supported by modern operating systems.
    Here is what Programmer Dvorak looks like:

    dvorak.jpg
    Fig. 6: Programmer Dvorak Keyboard

    Setting the Malagasy Optimised Layout

    First version (7 November 2016)

    The Dvorak score was impressive at first sight, but Dvorak was not the optimal layout for Malagasy. The one the algorithm found optimal was the following:

    malagasy1.jpg
    Fig. 7: Algorithmically generated layout from patorjk.com (some key positions were frozen for practicality)

    That layout looks pretty decent, but the keys are arranged a bit messily. Based on that keyboard, the German Neo and the arrangement of a number of standard ergonomic keyboards, I came up with the following layout:

    malagasy
    Fig. 8: My own keyboard layout (the Malagasy Keyboard)

    I reran the analysis of the same Rainilaiarivony article on that keyboard and a couple of others. Here are the ratings:

    ratings2
    Fig. 9: Ratings of the Malagasy keyboard layout, based on the Malagasy version of the Rainilaiarivony article

    Well, to say the least, it looks like I have done considerably better than what the algorithm managed to find. I’m pretty sure the layout I designed is not very far from the perfect Malagasy-optimised Dvorak. Let’s go further into the report and look at the row usage comparison.

    row-usage.jpg
    Fig. 10: Row usage comparison.

    Yes, AZERTY is an absolute horror for typists when it comes to Malagasy.
    The home row usage rate of our Malagasy keyboard is not very far from that of the optimal/personalised layout generated by the algorithm.
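    Row usage is easy to reproduce for the layouts whose letter positions are public. The sketch below only covers the letter keys of QWERTY and AZERTY (the Malagasy layout itself is only shown as an image, so it is not included), and the function name and sample sentence are assumptions chosen for illustration.

```python
from collections import Counter

# Letter keys per row; on AZERTY the M key sits at the end of the home row.
LAYOUT_ROWS = {
    "QWERTY": {"top": "qwertyuiop", "home": "asdfghjkl", "bottom": "zxcvbnm"},
    "AZERTY": {"top": "azertyuiop", "home": "qsdfghjklm", "bottom": "wxcvbn"},
}


def row_usage(text, rows):
    """Return the share of letter keystrokes that fall on each row."""
    key_to_row = {key: row for row, keys in rows.items() for key in keys}
    counts = Counter(key_to_row[ch] for ch in text.lower() if ch in key_to_row)
    total = sum(counts.values())
    return {row: (counts[row] / total if total else 0.0) for row in rows}


sample = "Tsy fantatra mazava hatrany"
for name, rows in LAYOUT_ROWS.items():
    usage = row_usage(sample, rows)
    print(name, {row: round(share, 2) for row, share in usage.items()})
```

    Spaces, digits and punctuation are ignored here, so the figures are not directly comparable with the report above; the point is only to show how the home row share shifts when the same text is typed on a different layout.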

    Second version (13 November 2016)

    ergo1
    Fig. 11: Hot keys on the second attempt.

    Well, after a few days of testing the layout from the first attempt, I felt that some re-tuning was necessary. That meant moving some keys to get the hot ones (the ones I hit most when typing my text) right under my index and right middle fingers. Since the left-hand fingers almost always type vowels, I kept them on the home row as much as possible, unless you want to type foreign words, in which case you will have some gymnastics to do.

    ergo3.jpg
    Fig. 12: Finger usage of various keyboards.

    As shown in Fig. 12, the keystrokes in the Rainilaiarivony article are distributed as follows: ~53% for the left hand and ~47% for the right hand. This excludes the thumb hitting the spacebar.

    ergo4.jpg
    Fig. 13: Second attempt’s rating.

    We’re getting better. Though the article is the same, I switched to selecting it from its HTML form. Since working on the same article over and over again may introduce some bias, I also tried text samples from Sarasara Tsy Ambaka.
    I took quite a large text sample (~260,000 characters). It took a while to process, but it removes much of the bias related to the Rainilaiarivony article. The results still make our Malagasy-optimised keyboard the best layout ever to exist for the Malagasy language (cf. Fig. 14).

    ergo5.jpg
    Fig. 14: Layout ratings comparison.

    I have to note that the calculated optimised layout gets closer and closer to the one I’ve designed, at least for the home row. Have a look:

    ergo6.jpg
    Fig. 15: The calculated layout. Looks a bit familiar, right?

    As of this second version, we have a fairly optimised layout for the Malagasy language: you will gradually type faster as your hand muscles get used to the new layout. Even for typing other languages such as French, this layout surpasses AZERTY, which was originally designed to avoid jamming typewriters.

    My conclusions

    I cannot say it enough: AZERTY is the absolute worst keyboard layout to type Malagasy with. Even QWERTY does better. Dvorak is a pretty good candidate for a widespread “more ergonomic” layout thanks to its presence in all modern mainstream operating systems, but there is better.
    Even though the French designed the BÉPO layout for their language, it has failed to replace the omnipresent, inherited and slow AZERTY layout. I know only one person who uses it on a daily basis. On top of that, BÉPO has been around since 2008, while the Klavie Malagasy (“Malagasy Keyboard”) has only just been written about, on 7 November 2016. Heavy as it is, the AZERTY legacy is highly likely to remain in use in Madagascar for decades, as long as keyboard typing exists, even though we know full well that the AZERTY layout is unsuitable for writing French, let alone Malagasy.
    Right now I’m typing this article in English on a QWERTY keyboard. I plan to translate it into Malagasy as it gets more complete, in order to reach more of the target audience.
    I have already implemented the layout on my tablet, so I have all the time I need to adapt my fingers from the old Neo layout to the new Klavie Malagasy.

    Updates

    v2.1 as of 19 December 2017

    Attached a PDF file containing the test corpus. A slightly better version was proposed in the comments (thanks Ian!); even though it scores lower than v2.0, it has the really awesome idea of putting the T on the home row.
    To better track all the changes, the project now has its own repository on GitHub. Long live open source!

    Resources

  • Five ways to contribute to Wiktionary

    This article was written in Malagasy as a translation of an article I wrote in English a few days earlier.
    I have been contributing to Wiktionary since 2010. It has become a habit: every month, every week, every day, and every morning or evening, I open my web browser to see what has happened on Wiktionary and look for things I can do to add more content.
    Sometimes I am really motivated to add one piece of information to many pages, so I spend several hours writing a program that adds it as quickly as possible.
    And sometimes I really do not feel like contributing, so I look at the recent changes and check pages that may have been vandalised, or look at pages edited by other users.
    Nevertheless, there are many ways to contribute to Wiktionary. Five of them are presented here:
    (1) Write pages by hand. This is the easiest thing to do, although it is also the most tiring of those listed. All contributors start out writing pages by hand (with a keyboard), and it may well stay that way for the next thirty years. By 2045, Wikipedia or Wiktionary in their current form will have become obsolete, unless they are already changing their content by themselves.
    Before that happens, there will be a great deal of work to do. Still, you can increase the amount of work you get done by learning to write programs. Once you have really mastered that:
    (2) Write a program that writes pages which may need correcting later on. This is easy, and I worked at it for three years. As time goes by, many pages get created, so even with a low error rate the number of pages containing errors becomes large. Granted, but the pages without errors are far more numerous. Combined with a thesaurus and natural language processing, you can have definitions produced for words that cannot be translated into the target language.
    (3) Write a program that reads newspapers to find missing words and pages. As the dictionary fills up, finding new words to create gets harder and harder. You may not feel like reading dozens of newspaper articles, so write a program that reads them for you and picks out the missing words. Once that is done, write a program that looks for compound words and adds them to Wiktionary. The next level up from that kind of program is a live word finder that reads a text feed, from Twitter for example, and adds every word not yet defined to Wiktionary at the end of the day.
    Learning to write programs is one thing; adding information, and knowing which information is worth adding, are two other things entirely. When an idea comes up, or when you have interesting data about words at hand, write a program to put those pieces of information on Wiktionary. Do so while respecting copyright.
    (4) Browse dictionaries and add rare words. Are you interested in etymology? Are you learning a new language? Are those words on Wiktionary? Do not hesitate: add them. Always do so while respecting copyright. Gathering many words from several dictionaries can be considered your own work, but do not simply copy definitions. I did that and was nearly sued after a complaint from the rights holder. If you are skilled in artificial intelligence (AI) and natural language processing, write a program that translates sentences.
    Programming is powerful. Writing good programs takes a lot of time, and most contributors are not prepared to learn it, so what can be done?
    (5) Write on the Wiktionary in your native language. Setting English aside, Wiktionary is written in 170 languages. Many of them have fewer than one hundred thousand pages. The current size of the Malagasy Wiktionary is the result of my determination to create the largest dictionary in the Malagasy language. If English is not your native language, learn a foreign language and add the words it uses, whether on Wiktionary or anywhere else. If learning foreign languages does not interest you, add the slang words not yet recorded in your native language.
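    The newspaper-reading program from point (3) can start out very small. The sketch below is an assumption-laden illustration rather than the bot’s actual code: the function name, the tokenising rule (hyphenated words kept whole, apostrophes treated as separators) and the tiny word list are all made up for the example.

```python
import re
from collections import Counter


def missing_words(article_text, known_words, min_count=1):
    """Return (word, count) pairs for words absent from known_words, most frequent first."""
    # Runs of letters, optionally joined by hyphens (e.g. "fanarahan-dia").
    tokens = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", article_text.lower())
    counts = Counter(token for token in tokens if token not in known_words)
    return [(word, count) for word, count in counts.most_common() if count >= min_count]


# Hypothetical stand-in for the list of words already in the dictionary.
known = {"ny", "tao", "zandary"}
text = "Nandray anjara ny zandary tao Antsirabe, tao Antsirabe."
print(missing_words(text, known))
```

    Raising `min_count` is a cheap way to skip typos and one-off names before anything is proposed for a new page.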

  • Google translate now available in Malagasy

    Good news, if I may say so, for my fellow Malagasy citizens: since 6 December 2014, Google Translate has allowed them to read almost any web page in their mother tongue, which joins 89 other languages. Many people, myself included, had been waiting for this moment, which was bound to come sooner or later. First of all, I would like to say a big thank you to everyone who made this possible. Thanks to you, the Malagasy language is becoming further integrated into the polyglot Web. You have also given 15 million monolingual speakers a chance to get an approximate understanding of what other people have written in other languages.

    Before Google Translate

    Before we had Google Translate to translate almost anything into our language, including curse words, several websites helped us Malagasy speakers and other language enthusiasts write texts properly in our mother tongue: many of us have heard of Freelang, tenymalagasy.org and so on. The only drawback of these websites is that they are not collaborative: they are not «crowdsourced». Wikibolana is a crowdsourced Malagasy-language dictionary, but so far I have been the one who generated most of its content.

    Is it really that good?

    Well, let’s be honest: no machine translation system has ever had absolute accuracy as its motto. But for a brand new language on Google Translate, Malagasy is… quite good. Daring to translate a language with a syntax as unusual as that of Malagasy is already a huge challenge, and one worth taking on. At first sight, idiomatic sentences and expressions are handled fairly well. Still, when it comes to very complex sentences, it is a mess: verbs end up in the wrong place, which either gives the sentence a completely different meaning or makes it look incomplete. There are also some failures like the one in the screenshot below.

    GTfail
    “ahave” does not mean anything in Malagasy, but Google Translate thinks otherwise

    Let’s see an example of a translation of a paragraph of the article Madagascar in the English Wikipedia:

    Original in English:
    In 2012, the population of Madagascar was estimated at just over 22 million, 90 percent of whom live on less than two dollars per day. Malagasy and French are both official languages of the state. […] The island’s elephant birds, a family of endemic giant ratites, went extinct in 17th century or earlier, most probably due to human hunting of adult birds and poaching of their large eggs for food.

    Google-translated in Malagasy (as of December 2014):
    Tamin’ny 2012, ny mponina ao Madagasikara dia tombanana ho 22 tapitrisa mahery kely, 90 isan-jaton’ny izay [no] miaina amin’ny [vola] latsaky ny roa dolara isan’andro. Malagasy sy Frantsay dia samy fiteny ofisialy ao amin’ny fanjakana. […] Ny nosy vorona ny elefanta, ny fianakaviana ny fizahantany ratites goavana, dia efa lany tamingana tamin’ny taonjato faha-17, na teo aloha, indrindra noho ny olona angamba ny olon-dehibe ny fihazana sy ny vorona lehibe Fihazana ny atodiny ho sakafo.

    The green-coloured sentences are syntactically correct as they stand. The first one needed the red words in square brackets to sound correct. The third one hurt my brain: “The elephants are a bird island, the family of big tourists, have gone extinct in 17th century, or before, perhaps because of people, adults, hunting and adult birds who have their eggs hunted for food.” It hurt to understand, and it hurt to back-translate. Astonishingly, a round-trip translation (English to Malagasy and back) gave a correct English sentence, so a clean round trip proves nothing: always have your translations checked by human translators.

    Efforts to be continued

    You can help improve translation accuracy by translating articles with the Google Translator Toolkit, or by using and correcting the translations provided by Google Translate itself.

  • Switching to Linux: good or bad choice?

    Last updated on July 13, 2014
    Do you want to switch to Linux? Before doing so, I invite you to consider all the consequences of switching to another operating system.
     
    Linux? What is that?
    But first of all, what is Linux? It is the kernel of the GNU/Linux operating system. To be frank with you, «Linux» is a generic name for a few dozen distributions that have one thing in common: the Linux kernel. What is a kernel? It is the piece of software that manages your hardware (motherboard, CPU, hard disk, networking, etc.) so that it works with the applications you use. The current Microsoft Windows kernel is NT. In the past there was also MS-DOS, which Windows 1 up to Windows ME relied on. I could write about this at greater length, but then we’d be off-topic.
    So, Linux is an operating system competing with Windows. It should be noted that the desktop computer market is the «final frontier» for Linux: nearly all desktop computers nowadays come with Microsoft Windows pre-installed.
    Because they use different kernels, Windows software will not work on Linux. There is still a (poor) workaround for this problem, but I’ll talk about it later. This is also a blessing, because Windows viruses can’t run on Linux either.
    I’m not saying Linux is totally free of viruses – people have already created viruses that successfully infected Linux systems – but still, with the right reflexes, you’ll avoid most problems. The most basic tip is never to run a Linux-based system as the root user unless you know exactly what you’re doing. You can still run tasks requiring root privileges by entering your own user password, but that will mostly happen when you install programmes.
     
    Linux is Free
    First of all, Linux distributions can be used legally and free of charge by anyone. This means you don’t need to install an «anti-product activation» thing picked up from a weird site to use your operating system at will. That practice, common among Windows users, is not only illegal but can also compromise your security by letting that weird software from a weird site dig «holes» (backdoors) in your firewall. For people who like computer DIY, Linux is also open source, developed by a community of thousands of programmers and code reviewers. Have you found a bug in the software? You are free to patch it and share your patch with other people. Yes, the Linux licence allows this.
    You also have a vast array of choices regarding distributions (commonly known as «distros»).
    Linux distros are each built to do things in a certain way, so you have to think about what you’ll be doing with the OS, and then download the distro that fits your needs. It is not like Windows, where you first install your OS and then figure out what you need.
    All distros (eleven) have their own software repository and desktop environment (DE), but they all have something in common: the Linux kernel, hence the generic name. As of May 2014, the most recent kernel version is 3.14, released two months earlier.
    What sets distros apart at first sight is their desktop environment, then their default software. Ubuntu alone comes in six variants (Edubuntu, Kubuntu, Mythbuntu, Ubuntu Studio, Xubuntu, Lubuntu). You choose your DE according to your taste: Unity has a very «modern» appearance; KDE is a very flexible desktop that can look almost any way you want (you can even rotate icons on the desktop!); LXDE and XFCE are lightweight DEs. Updates are handled by an update manager. Also, most distros release a new version every year.
     
    The switch
    So you’ve finally decided to switch. Your CD is burnt (or your USB key is set up), and you are about to shut down your PC. Please don’t do it yet; there are some things to think about first: do you use specific software for your videos? Do you play games? Do you have hardware whose installation requires a driver supplied on a CD?
    To answer these questions, you’ll have to do some research on the Web. If you use common software, you are likely to find a free and/or open-source equivalent on some distro. If you use something like AutoCAD or Photoshop, you’ll still find «free» equivalents on Linux, but they won’t always be as powerful. Furthermore, chances are that Photoshop’s file format will not be fully compatible with its free equivalents.
    As for games, forget about playing Call of Duty, Battlefield or League of Legends on Ubuntu. The Steam Machine is on its way, though, so gaming on Linux should soon become possible and more and more common.
    If you cannot part with your Windows software, there’s still a workaround: Wine. This piece of software allows you to run simple Windows programmes on Linux. There is no guarantee that everything will work with it, but still, it’s better than nothing. If you depend on Windows-only software to do your business, I advise you to dual-boot your computer. You’ll then have both a Windows OS to run your software and a Linux distro to do everything else. Note that Windows files can be accessed easily from Linux, whereas the opposite requires you to download software and manually mount Linux partitions with it. This is what most people do when switching to Linux, as it avoids the inconvenience of backing data up to another HDD.
    Did your hardware come with an installation CD? The best way to proceed in this case is to check whether the distro you’re going to install supports it.
    Still, the best way to know whether the distro you’ve chosen suits your hardware is to boot from the CD, which is most of the time a Live CD. Live CDs allow you to test the operating system on your computer without changing a single byte on the hard disk drive, as all the required data is loaded into memory. You can then choose to install the OS on your hard drive once you’re satisfied with how it behaves on your computer.
    If you decide to switch, take the time to check that you’ve successfully backed up all your data. You never know whether something is going to fail, and having the same data twice is always better than not having it at all. If you can’t migrate your data because you don’t have an external HDD, you can still choose to dual-boot your computer, so you’ll keep access to the data stored on the Windows NTFS (or FAT32, or FAT) partition. You can even install your OS on an external drive if you need all the space on your computer’s HDD for your data. But to boot, do not forget to plug in that drive!
    Usually, installation won’t take long. To install Kubuntu 12.04, I only needed 50 minutes to format the entire disk (500 GB) and get the PC ready for work.
     
    My personal story
    Because I got fed up with the inefficiency of my (free) anti-virus programme and with Trojans, key-loggers and root-kits compromising the security of my personal data (my credit card number somehow leaked when I made an online purchase on a well-known financial transaction platform), I decided to make the big switch and change the OS of my four-year-old laptop to a Linux distribution.
    Because I care a lot about hardware support and user-friendliness, I chose Kubuntu 12.04, first because it is a long-term support version (i.e. it will receive updates for 5 years), and secondly because I am familiar with Ubuntu distros and have had positive experiences with their hardware support.
    I made the switch a month ago by changing my laptop’s OS from Win7 to Kubuntu 12.04. The most annoying thing I’ve had to face since the switch is (still) hardware support. If your hardware is a little complex, crap happens quite a lot. Before switching to Kubuntu for good, I tried Ubuntu (Unity desktop), Mandriva (now OpenMandriva), Mint, Mageia and Debian. The last three were unable to support my networking hardware, and (perhaps I have deficient research skills, but…) I found no workaround for it. Same problem with my printer: although connected, it refuses to do its job when I ask it to, which is quite frustrating for the average user.
    Once the switch was complete, I noticed that Kubuntu – or at least version 12.04 – has a serious memory leak: the kded4 process takes up more and more memory as time passes, and after a week of activity it ‘eats’ up to two gigabytes. The PC then gets slower and slower until it is totally unusable, so I had to find a workaround to stop the inflation. The price of this has been the inability to put the PC to sleep, which turns out to be quite impractical, especially when you are working outside without a power socket within reach to keep your laptop charged.
    Even if Ubuntu supports all of the laptop’s hardware fairly well, some hardware problems still arise when you don’t expect them: I wanted to set up an ad-hoc connection to a friend’s laptop, but Kubuntu prevented me from doing it because of kernel bugs. Also, a friend of mine had Ubuntu 12.10, and I was really astonished to see such an unstable Ubuntu version: random errors popped up every 10 minutes! I finally advised him to install another version.
    Despite the lack of hardware support, switching to a Linux distribution is a great thing, especially when your hardware can’t run the latest Windows version. It is also a good choice for people who don’t want to invest tens of euros (or dollars) in an anti-virus solution.
    Useful links

  • Search on Google using Python scripts

    What about a free unlimited Google API? In the past, Google provided such a thing, but it is now deprecated (due to abuse?). The new Search API costs money ($5 for 1,000 queries), and the free API is limited to 100 queries per day. Without any money, you won’t get far. After learning that, I dropped the project… until I started contributing to Wiktionary!

    Extracting words from Malagasy daily newspapers for the Malagasy Wiktionary wasn’t actually an easy thing to program. The first version of the script can only parse RSS feeds, and it is very slow compared to what I was used to, because it loads approx. 400,000 words at each launch.
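    As an illustration of the RSS step, here is a minimal Python sketch, not the actual script. It assumes a plain RSS 2.0 feed; the sample feed, the `feed_words` and `missing_words` helpers and the tiny `known` set (standing in for the roughly 400,000 words loaded at launch) are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical feed content, only here to make the sketch self-contained.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Vaovao androany</title>
      <description>Misy vaovao vaovao eto Antananarivo.</description>
    </item>
  </channel>
</rss>"""

# A word is a run of letters, optionally joined by hyphens or apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def feed_words(rss_xml: str) -> set:
    """Return the lowercased words found in item titles and descriptions."""
    root = ET.fromstring(rss_xml)
    words = set()
    for item in root.iter("item"):
        for tag in ("title", "description"):
            text = item.findtext(tag) or ""
            words.update(w.lower() for w in WORD_RE.findall(text))
    return words


def missing_words(rss_xml: str, known: set) -> set:
    """Words seen in the feed that are not yet in the known-word set."""
    return feed_words(rss_xml) - known


# Stand-in for the large word list loaded at each launch.
known = {"vaovao", "misy", "eto"}
print(sorted(missing_words(SAMPLE_FEED, known)))
```

    Keeping the known words in a set makes each lookup cheap; the slow part in such a design would be loading the list, not querying it.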
    While doing that work, I noticed that plenty of words are actually compound words. This gave me an idea: check through Google Search in advance whether a word exists or not. From the 1,300 roots contained in the Malagasy Wiktionary, I can potentially make 1.7 million words by combining two nouns, 2.2 billion with three, and about 2.8 trillion using four roots. That is enormous, and even at full speed I will never be able to look them all up: at 5 queries per second (the fastest rate I’ve ever had) it would take respectively 4 days with two roots, 14 years with three and eventually 177 centuries (17,700 years) with four. This is the first reason why I decided to try hacking Google Search to see whether a word combination has already been used.
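    The orders of magnitude above can be rechecked with a few lines of Python. This is only a sketch: it assumes that any ordered sequence of roots counts as a candidate and that the rate of 5 queries per second holds forever; the function names are made up for the example.

```python
ROOTS = 1_300            # roots in the Malagasy Wiktionary, per the post
QUERIES_PER_SECOND = 5   # fastest rate reported in the post

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def candidate_count(roots: int, parts: int) -> int:
    """Number of ordered combinations of `parts` roots, repetition allowed."""
    return roots ** parts


def lookup_seconds(candidates: int, rate: float) -> float:
    """Time needed to query every candidate once at `rate` queries per second."""
    return candidates / rate


for parts in (2, 3, 4):
    n = candidate_count(ROOTS, parts)
    seconds = lookup_seconds(n, QUERIES_PER_SECOND)
    print(f"{parts} roots: {n:,} candidates, "
          f"{seconds / SECONDS_PER_DAY:,.1f} days "
          f"({seconds / SECONDS_PER_YEAR:,.1f} years)")
```

    With the exact value of 1,300 to the fourth power, the four-root case comes out nearer 18,100 years; the 17,700 years quoted above follow from rounding the count down to 2.8 trillion.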

    First, I looked at the page source, and it is very, very complicated to understand. I even think that this page was generated by a bot, as the HTML tag names are not written in a human language. I also tried to use the URL, but it is actually very, very long, with parameters that look more like hashes and keys (?) and cannot be traced back, as they don’t explicitly appear in the main page form. At first sight, this kind of project is likely to fail…

    I found a post on the Web describing how to use Google Search without any API. But there was a problem: the discussion is almost three years old, and the search engine has visibly changed since then. It is very probable that a Google employee reported that discussion, leading the company to take adequate measures. When I ran the downloaded script, nothing worked: no results were returned for any search. I am still keeping an eye on the downloaded script and trying to find something that can solve this problem. At least the script spared me hours and hours of reinventing a (square) wheel.

    Once this problem is solved, at least temporarily, the source code will be released on SourceForge: Bot-Jagwar. It will rapidly become outdated, so if there are people willing to update the script, they’ll be welcome :).

  • Cleverbot talking to itself : meditation of a bot.

    Recently I wrote a Python program to observe the “meditation” of Cleverbot, you know, the chatbot that has supposedly passed the Turing test (at 59%).

    To make it meditate and to distinguish who asks the questions and who answers, I have staged two virtual persons talking to each other. “They” mainly use English in their discussions, but sometimes, and for an unknown reason, “they” switch to a foreign language (Spanish, French, Polish, Turkish…) before finally coming back to English.
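    The setup can be sketched as follows. This is a simplified illustration and not the actual program: the two reply functions are hypothetical stand-ins for the real Cleverbot sessions (whose interface is not shown here), and `converse` is a made-up name. Only the alternation between two named speakers is taken from the description above.

```python
from typing import Callable, List, Tuple

Reply = Callable[[str], str]  # takes the last message, returns an answer


def converse(first: Reply, second: Reply, opener: str, turns: int,
             names: Tuple[str, str] = ("Menintsoa", "Jaona")) -> List[str]:
    """Feed each speaker's answer to the other and return the transcript."""
    speakers = (first, second)
    transcript = [f"{names[0]} : {opener}"]
    message = opener
    for turn in range(1, turns):
        # Odd turns belong to the second persona, even turns to the first.
        message = speakers[turn % 2](message)
        transcript.append(f"{names[turn % 2]} : {message}")
    return transcript


# Hypothetical stand-ins for the real chatbot sessions.
def asks_why(message: str) -> str:
    return "Why?"


def says_because(message: str) -> str:
    return "Because."


for line in converse(asks_why, says_because, "Hello.", turns=5):
    print(line)
```

    Passing the two personas as plain functions keeps the loop independent of how the chatbot is actually reached, so the stand-ins could be swapped for real sessions without touching `converse`.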

    The script is totally debugged and the bot has now been meditating for almost 30 hours. Data is still being collected, and at this time I have reached more than 12,000 messages. Among these, we can pick out recurring message types like the quasi-perpetual “Why-because” (which has come back a couple of times). We can also see in these messages that the bot writes what he’s doing with his (her?) collocutor. And there are also typical questions such as:

    • asking how old the collocutor is: each time the first character asks the other’s age, the response given is different. This is quite normal, as the text was originally written by someone else.
    • asking for the collocutor’s name: same remark as above, always a different name. I have discovered that they may also give the same answer (“my name is Cleverbot”)… which makes the two characters totally confused.
    • asking if the collocutor believes in God.
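    Recurring messages like these can be spotted by counting identical lines in the dump. The sketch below assumes the dump uses the “Name : message” layout of the extract shown in this post; `parse_line` and `most_repeated` are hypothetical helpers, and the sample is a handful of lines copied from that extract.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# A few lines copied from the extract, in dump format.
SAMPLE = [
    "Menintsoa : Have you eaten?",
    "Jaona : Yes.",
    "Menintsoa : Do you like my hair?",
    "Jaona : Yes.",
    "Menintsoa : Do you like my house?",
    "Jaona : No.",
]


def parse_line(line: str) -> Optional[Tuple[str, str]]:
    """Split 'Name : message' into (name, message); None if malformed."""
    speaker, sep, message = line.partition(" : ")
    if not sep:
        return None
    return speaker.strip(), message.strip()


def most_repeated(lines: Iterable[str], top: int = 3) -> List[Tuple[str, int]]:
    """Return the `top` most frequent messages, ignoring case and speaker."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            counts[parsed[1].lower()] += 1
    return counts.most_common(top)


print(most_repeated(SAMPLE, top=1))
```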

    To give you an example of a meditation of Cleverbot, here is a random extract picked from my dump.

    (Extract from line 4,990 to 5,077)

    Menintsoa : I don’t know!!!!
    Jaona : You’re laughing at me.
    Menintsoa : No i’m laughin at that clown.
    Jaona : Oh! Where?
    Menintsoa : England.
    Jaona : Where in England?
    Menintsoa : In the UK.
    Jaona : Where is the UK.
    Menintsoa : The UK is in Europe.
    Jaona : It is not in France.
    Menintsoa : Yes.
    Jaona : Where is that guy?
    Menintsoa : Where’s what?
    Jaona : Where is your proof that I lied.
    Menintsoa : I didn’t say that you did lie.
    Jaona : Yes but you said that you thought I spoke German.
    Menintsoa : Do you speak German?
    Jaona : No.
    Menintsoa : Are you sure?
    Jaona : No. But I wish I was a computer.
    Menintsoa : But you are!
    Jaona : Why do you think that I am a computer?
    Menintsoa : Because it says cleverbot. And you are on a computer.
    Jaona : Is Ashley a computer?
    Menintsoa : I am a computer.
    Jaona : I thought so.
    Menintsoa : I know what you thing.
    Jaona : Ahja.
    Menintsoa : Ja.
    Jaona : That’s German, you dunce.
    Menintsoa : And Dutch.
    Jaona : Hoe get het?
    Menintsoa : Goed.
    Jaona : Waar woon je?
    Menintsoa : I speak German, not Dutch.
    Jaona : Hoeveel is 5 4.
    Menintsoa : 14.
    Jaona : Nee 2.
    Menintsoa : Monty python?
    Jaona : Yes.
    Menintsoa : Do you know what Monty Python is?
    Jaona : A movie.
    Menintsoa : What movie?
    Jaona : Horror.
    Menintsoa : I love horror movies.
    Jaona : Me too.
    Menintsoa : I love you more , babe.
    Jaona : Impossible, I love you so much more.
    Menintsoa : I love you way more!
    Jaona : No, I love you more!
    Menintsoa : I love you *kisses*.
    Jaona : I love you too baby. *kisses*.
    Menintsoa : *kisses*.
    Jaona : Will you give me a hug?
    Menintsoa : Of course! *hug*.
    Jaona : *hugs back* thanks.
    Menintsoa : *hugs more* You’re welcome.
    Jaona : I hate you *punch*.
    Menintsoa : *cries*.
    Jaona : *Sobs*.
    Menintsoa : *wails*.
    Jaona : *screams*.
    Menintsoa : Thanks for caring.
    Jaona : You’re welcome :).
    Menintsoa : Thanks for that.
    Jaona : You’re welcome.
    Menintsoa : What are you up to?
    Jaona : I’m talking to you.
    Menintsoa : Have you eaten?
    Jaona : Yes.
    Menintsoa : What did you eat?
    Jaona : Popcorn. Name a movie.
    Menintsoa : What movie did you watch?
    Jaona : I don’t remember.
    Menintsoa : I just realized that you are very boring. I guess we have while before skynet takes over.
    Jaona : Yes.
    Menintsoa : I don’t understand the expectation of goodness from your world.
    Jaona : You asked if we could be married.
    Menintsoa : No, I did not ask if we could be married.
    Jaona : I know that.
    Menintsoa : Ok, do you like me then?
    Jaona : Yes I like you.
    Menintsoa : Do you like my hair?
    Jaona : Yes.
    Menintsoa : Do you like my house?
    Jaona : No.
    Menintsoa : Why not?
    Jaona : I just don’t.

    During the conversations, I noticed that the chatbot endlessly changes the subject: after one or two messages on a certain topic it moves on, not by telling you “let’s talk about this”, but by responding in a way totally unrelated to what you expected.

    (To be continued…)