Tamil Computing Research Experiences - Part 1
Updated: Jun 29, 2020
Thanks to the #BuildTamil team on Twitter (https://twitter.com/ezhillang) for giving me the idea to write about my Tamil Computing research experiences. They already do a lot of research on Tamil Computing themselves (https://ezhillang.blog/), and not everyone encourages others who work in the same domain. Kudos to the #BuildTamil team :) It has been a long journey, so I decided to write it in parts. Welcome to part 1 of this series!
The term "Tamil Computing" was introduced to me when I was working as a Junior Research Fellow on a project called "Cross Lingual Information Access" (CLIA) at the College of Engineering Guindy (CEG), funded by DeitY (now the Ministry of Electronics and Information Technology). The project involved about 11 prestigious Indian institutes, including IIT Bombay, IIT Kharagpur, IIIT Hyderabad and ISI Kolkata (that's what I remember now!), along with the Anna University CEG campus and MIT campus. The aim of the project was to build a search engine for 5 Indian languages, namely Tamil, Telugu, Hindi, Marathi and Bengali. CEG and MIT were responsible for Tamil. At CEG we focused on semantic search, while MIT focused on keyword search. To incorporate semantics into the search, we used a semantic framework called Universal Networking Language (UNL), http://www.undl.org/. It is a language-independent framework that was designed mainly for machine translation.
Our project was headed by NLP veterans: Professor Dr. T.V. Geetha (Dean of CEG), Professor Dr. Ranjani Parthasarathi, and Dr. Madhan Karky, who was an Associate Professor at CEG at that time. He needs no introduction now, as he is a famous Tamil lyricist and runs a non-profit research organization, the Karky Research Foundation (Karefo), https://www.karky.in/karefo/. He has a wonderful team of linguists, researchers and developers. He does this out of his passion for Tamil Computing amidst his busy schedule, and that is really amazing :) All three are wonderful leaders, and I still follow the professional and research ethics that they practised and taught us. The ABC of Natural Language Processing (NLP) was taught to me by these three.
I worked at CEG from 2007 to 2011, and I was also doing my PhD there under the guidance of Dr. Ranjani Parthasarathi ma'am. Then I had to convert my PhD to full time, as I had to take a maternity break. This entire period was a huge learning experience. The best part of life is that you never know what impact your present will have on your future. At that time I was totally unaware that I was learning one of the booming areas of AI, i.e. NLP, which was going to occupy the rest of my life :) Our present is not to be underestimated, as in some way it lays a path towards the bigger targets we achieve later in life.
In this project, we learnt a lot about Tamil tools. We used Atcharam, a Tamil morphological analyser (https://github.com/tacola-aucse/Morphological-Analyzer-For-Tamil), and Atchayam, a Tamil morphological generator (I don't have the link currently; I will update it when I get it!). Many other tools, like the spell checker (Annam) and the Tamil parser (Vanavil), had been developed even before our CLIA project. Mostly we used Atcharam.
Our team comprised developers, researchers and linguists. The linguists wrote the Tamil rules and updated the Tamil lexical resources required by the CLIA project. We were working on a tourism domain corpus, so they added tourism-specific Tamil words. Each dictionary entry had four fields: Tamil word, Tamil root word, English word, and UNL equivalent word. This was a really huge process, and in critical situations not only the linguists but sometimes even we added dictionary entries.
This is what is lacking in my current Tamil Computing research. I don't have linguists and I don't have data. We need a lot of external funding from the government. I have been trying in all possible ways for the past 5 years, since I joined SRM, and I am still failing :( There is a lot of Tamil data available, and the #BuildTamil team regularly tweets about it, but there are still a lot of hurdles to cross before a machine learning model can be built for any application.
Coming back to the semantic search, Dr. Madhan Karky named our search engine "CoRee" (Concept Relation Search Engine). It also matched the Tamil word கோரி, which means asking :) The details of the project can be found in this paper: https://tinyurl.com/yc8e23wp
Citation information of the paper: Balaji, J., Umamaheswari, E., Subalalitha, C. N., Elanchezhiyan, K., Madhan Karky, V., Parthasarathi, R., & Geetha, T. V. (2012). CoRee – The UNL based Semantic Search. VishwaBharat@tdil, Jan 2012 - June 2012.
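To make the four-field dictionary format mentioned above (Tamil word, Tamil root word, English word, UNL equivalent word) a little more concrete, here is a minimal Python sketch of how such entries could be parsed and looked up. It is only an illustration under stated assumptions, not the actual CLIA code or file format: the tab separator, the names LexiconEntry, parse_entry and build_lookup, and the sample entry (including its UNL notation) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class LexiconEntry:
    """One dictionary row with the four fields described in the post."""
    tamil_word: str    # surface (inflected) Tamil word
    tamil_root: str    # Tamil root word
    english_word: str  # English equivalent
    unl_word: str      # UNL equivalent word


def parse_entry(line: str, sep: str = "\t") -> LexiconEntry:
    """Parse one row. The tab separator is an assumption, not the real format."""
    fields = [field.strip() for field in line.split(sep)]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    return LexiconEntry(*fields)


def build_lookup(lines: Iterable[str]) -> Dict[str, LexiconEntry]:
    """Index entries by Tamil surface word, skipping blank lines."""
    lookup: Dict[str, LexiconEntry] = {}
    for line in lines:
        if line.strip():
            entry = parse_entry(line)
            lookup[entry.tamil_word] = entry
    return lookup


# Illustrative tourism-domain row; not taken from the CLIA lexicon.
sample_rows = ["கோயிலுக்கு\tகோயில்\ttemple\ttemple(icl>place)"]
lookup = build_lookup(sample_rows)
print(lookup["கோயிலுக்கு"].tamil_root)
```

Indexing by the surface word keeps a lookup to a single dictionary access. In a fuller pipeline, a morphological analyser would typically reduce an unseen inflected word to its root first, so an index on the root word might be the more useful choice.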
I was working on the indexer of CoRee, and this gave me the idea of semantic indexing for my PhD. One day, Dr. Ka. Pa. Aravanan sir gave a talk at CEG about how Nanool ideas can be correlated with current-day teaching methods. This triggered the idea of using the Nanool concept of "Soothiram" in my PhD. I will definitely write about it soon in the next part!
Thanks for reading.