November 1, 2025
in NLP Research, Malayalam NLP

Malayalam Pronominal Anaphoric Coreference Resolution

I recently dusted off my undergraduate degree project from 2020 and put it on GitHub with a proper DOI. This project won Best Paper at ABACon'20, but I never formally published it beyond the conference presentation. Five years later, it felt important to preserve this work because it was my first research project, and working on Malayalam NLP in 2020 meant operating in a profoundly low-resource environment. The challenges I faced as a beginner researcher mirrored the challenges facing the entire field: limited tools, no benchmark datasets, and little prior work to build on.

The Research Question

In early 2020, I began exploring coreference resolution as the focus of my undergraduate final project. Under the guidance of Prof. Mathews Abraham, I surveyed the benchmarks and methods available. Given the limited resources for Malayalam NLP at the time, I settled on Hobbs' algorithm as a starting point. It is a classic rule-based approach to pronominal anaphora resolution, originally designed for English, that searches syntactic parse trees for a pronoun's antecedent. The central research question became: could this algorithm be successfully adapted for Malayalam text?

Coreference resolution addresses a fundamental challenge in natural language understanding: identifying when different expressions in a text refer to the same entity. Consider the sentence pair "The cat sat on the mat. It was sleeping." Human readers immediately recognize that "it" refers to "the cat," but teaching computers to make these inferences was a non-trivial computational problem in 2020.
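To make the task concrete, here is a minimal sketch of how the output of coreference resolution can be represented: mentions linked into clusters, where every mention in a cluster refers to the same entity. The `Mention` class and `build_clusters` function are hypothetical names chosen for illustration and are not part of the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    sentence: int  # 1-based index of the sentence containing the mention
    text: str      # surface form of the mention


def build_clusters(links):
    """Merge (anaphor, antecedent) pairs into coreference clusters (union-find)."""
    parent = {}

    def find(m):
        parent.setdefault(m, m)
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving keeps lookups short
            m = parent[m]
        return m

    for anaphor, antecedent in links:
        root_a, root_b = find(anaphor), find(antecedent)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for m in parent:
        clusters.setdefault(find(m), []).append(m)
    # Order the mentions in each cluster by the sentence they appear in.
    return [sorted(c, key=lambda m: m.sentence) for c in clusters.values()]


cat = Mention(1, "The cat")
it = Mention(2, "It")
clusters = build_clusters([(it, cat)])
print(clusters)
```

A pronoun resolver only has to produce the pairwise links; grouping them into clusters is a separate bookkeeping step.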

The Malayalam Challenge

Malayalam differs from English in far more than script. As a Dravidian language, it is agglutinative, with rich morphological inflection and relatively free word order. The resource constraints of 2020 compounded these linguistic challenges: Malayalam lacked annotated coreference corpora, benchmark datasets, and mature computational tools. The available infrastructure consisted primarily of a handful of research papers and a few open-source libraries supporting basic tokenization and part-of-speech tagging.

Implementation

I adapted Hobbs' algorithm using the available Malayalam NLP infrastructure, specifically leveraging Anoop Kunchukuttan's Indic NLP Library for morphological analysis and Devadath's shallow parser for syntactic processing.
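The search order at the heart of a Hobbs-style resolver can be sketched roughly as follows. This is a deliberately flattened illustration, not the code in the repository. It works on lists of chunks rather than full parse trees, it ignores the tree-structural constraints of the original algorithm, and the `Chunk` class, its tags, and its agreement features are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    text: str
    tag: str            # hypothetical chunk label, e.g. "NP", "VP", "PRON"
    number: str = "sg"  # hypothetical agreement feature
    gender: str = "n"   # hypothetical agreement feature


def agrees(pronoun: Chunk, candidate: Chunk) -> bool:
    """Keep only candidates that match the pronoun in number and gender."""
    return pronoun.number == candidate.number and pronoun.gender == candidate.gender


def resolve_pronoun(sentences: List[List[Chunk]], sent_idx: int,
                    chunk_idx: int) -> Optional[Chunk]:
    pronoun = sentences[sent_idx][chunk_idx]
    # 1. Noun phrases to the left of the pronoun in its own sentence.
    for cand in sentences[sent_idx][:chunk_idx]:
        if cand.tag == "NP" and agrees(pronoun, cand):
            return cand
    # 2. Earlier sentences, most recent first, each scanned left to right,
    #    which tends to favour subjects over later noun phrases.
    for prev in reversed(sentences[:sent_idx]):
        for cand in prev:
            if cand.tag == "NP" and agrees(pronoun, cand):
                return cand
    return None


sentences = [
    [Chunk("The cat", "NP"), Chunk("sat on", "VP"), Chunk("the mat", "NP")],
    [Chunk("It", "PRON"), Chunk("was sleeping", "VP")],
]
antecedent = resolve_pronoun(sentences, sent_idx=1, chunk_idx=0)
print(antecedent.text if antecedent else None)
```

Both "the cat" and "the mat" agree with "it" here, so the left-to-right scan of the previous sentence is what makes the resolver choose the subject.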

Usage

from MalayalamCorefResolver import MalayalamCorefResolver

resolver = MalayalamCorefResolver()
text = "പൂച്ച മേശയ്‌ക്ക് മുകളിൽ ഇരിക്കുന്നു. അത് ഉറങ്ങുന്നു."
result = resolver.find_coref(text)

Document: പൂച്ച മേശയ്‌ക്ക് മുകളിൽ ഇരിക്കുന്നു. അത് ഉറങ്ങുന്നു.
In Sentence 2: 'അത്' → ['പൂച്ച']

Results

When tested on a sample of sentences from Wikipedia, the system achieved 65% accuracy. I presented this work at ABACon'20, a national conference on innovations in computing organized by Sahrdaya College of Engineering and Technology, where it received the Best Paper Award.
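For readers unfamiliar with the metric, a score like this can be computed as the share of pronouns whose predicted antecedent matches a hand-labelled one. The sketch below assumes that definition, and the function name and toy data are hypothetical; none of it comes from the actual evaluation.

```python
def resolution_accuracy(predicted, gold):
    """Fraction of pronouns whose predicted antecedent equals the gold antecedent."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one pronoun is required")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy data for illustration only; these are not the project's results.
predicted = ["cat", "mat", "dog", "girl"]
gold = ["cat", "cat", "dog", "girl"]
accuracy = resolution_accuracy(predicted, gold)
print(f"{accuracy:.0%}")
```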

Closing Notes

While my research has since moved on to other questions in English NLP, Malayalam remains close to my heart, both as my native language and as a field with rich linguistic complexity and the exciting opportunity to build while the tooling is still young. The challenges of Malayalam NLP continue to draw me in.

Code and Citation

The code is archived at Zenodo and available on GitHub.

If you use this work, please cite:

@software{enfa_fane_2025_17508334,
  author       = {Enfa Fane and
                  Abraham, Mathews},
  title        = {beingenfa/malayalam-coreference-hobbs: public archive},
  month        = nov,
  year         = 2025,
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.17508334},
  url          = {https://doi.org/10.5281/zenodo.17508334},
}