Shared Task

Building Compact Sinhala & Tamil LLMs


Are you ready to build something small that makes a big difference? As part of the “Small Models, Big Impact” Research Conclave, this shared task challenges university students and industry innovators to develop efficient, high-performing language models for Sinhala and Tamil that are light enough to run on devices and at the edge. Using open-source LLMs, participants will fine-tune or continue pre-training these models to better serve our local languages—making powerful AI more accessible, scalable, and locally relevant. Whether you’re passionate about language, excited by edge-AI, or looking to make your mark on the future of Sinhala and Tamil tech, this is your chance to learn, build, compete, and make a lasting impact.


1. Task Overview & Objectives

  • Goal: Foster the development of compact, high-quality LLMs for Sinhala and Tamil by continual pre-training or fine-tuning of open-source models with ≤ 8 billion parameters (a minimal fine-tuning sketch follows this list).
  • Impact: Empower local NLP research and applications—chatbots, translation, sentiment analysis, educational tools—while lowering computational and storage barriers.
  • Who Should Participate:
    • Students & Academic Teams: Showcase research on model adaptation, data augmentation, and multilingual/multitask training.
    • Industry & Startups: Demonstrate practical performance in real-world pipelines; optimise inference speed and resource usage.
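
To make the task concrete, here is a minimal sketch of adapter-based (LoRA) fine-tuning with the Hugging Face transformers and peft libraries. The base model ID, corpus file, and hyperparameters are illustrative placeholders, not prescribed settings:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_id = "Qwen/Qwen2-1.5B"  # example base; any open checkpoint <= 8 B qualifies
tok = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# LoRA trains small low-rank adapters while the base weights stay frozen,
# which keeps GPU memory requirements modest for compact-model experiments.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Placeholder corpus file; substitute your Sinhala or Tamil training text.
ds = load_dataset("text", data_files={"train": "si_ta_corpus.txt"})["train"]
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=ds,
    # mlm=False makes the collator set labels = input_ids for causal LM training.
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

The same scaffold covers continual pre-training: swap the small instruction corpus for a large raw-text corpus and, if resources allow, train the full model instead of adapters.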

2. Allowed Base Models

Participants must choose one of the following (or any other fully open-source LLM ≤ 8 B params):

  • Llama 3 (1B, 3B, 7B): Meta's Llama series, particularly the smaller versions, is designed for efficiency and multilingual text generation. While the larger Llama models are more widely known, the 1B and 3B variants offer a compact option. Meta has also shown interest in addressing the linguistic diversity gap, including support for languages such as Sinhala and Tamil.
  • Gemma (2B, 4B): Developed by Google DeepMind, Gemma models are known for being lightweight yet powerful, with strong multilingual capabilities. Google has a strong focus on linguistic diversity, and Gemma's architecture makes it a good candidate for adapting to less-resourced languages.
  • Qwen-2 (0.5B, 1.5B, 7B): This family of models from Alibaba is designed for efficiency and versatility. Strong multilingual pretraining makes these models good candidates for adaptation to Sinhala and Tamil through fine-tuning.
  • Microsoft Phi-3-Mini (3.8B): Microsoft's compact model is noted for strong reasoning and code-generation capabilities. While its primary focus is not explicitly on a wide range of South Asian languages, its efficient design and good general language understanding make it a suitable base for fine-tuning with Sinhala and Tamil data.
  • Any other open-source checkpoint with ≤ 8 B parameters.

Note: Proprietary or closed-license models (e.g., GPT-3 series, Claude) are not allowed.
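
For orientation, loading any of these checkpoints follows the same pattern; the model ID below is one ungated example, and gated families (e.g., Llama, Gemma) additionally require accepting their licence on the Hub:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B"  # example allowed checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Smoke test: see how the untuned base model continues a Sinhala prompt.
prompt = "ශ්‍රී ලංකාව"  # "Sri Lanka"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=30)
print(tok.decode(out[0], skip_special_tokens=True))
```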


3. Data Resources and Evaluation

Evaluation draws on perplexity and on multilingual MMLU datasets that cover Sinhala and Tamil (a minimal evaluation sketch follows this list):

  • Perplexity: https://huggingface.co/docs/transformers/en/perplexity
  • MMLU datasets: https://huggingface.co/datasets/CohereLabs/Global-MMLU/viewer/si and https://huggingface.co/datasets/sarvamai/mmlu-indic
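
As a rough guide, perplexity can be computed directly from a causal LM's loss, and the linked MMLU sets load through the datasets library. The checkpoint path is a placeholder, and the "si" config and "test" split names are inferred from the dataset viewer URL, so verify them before use:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "out/final"  # path to your adapted model (illustrative)
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt).eval()

# Perplexity = exp(mean next-token cross-entropy). Passing input_ids as
# labels makes the model compute the shifted cross-entropy loss itself.
text = "held-out Sinhala or Tamil evaluation text goes here"
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"perplexity: {torch.exp(loss).item():.2f}")

# Sinhala portion of Global-MMLU for downstream-task evaluation.
mmlu_si = load_dataset("CohereLabs/Global-MMLU", "si", split="test")
print(mmlu_si[0])
```

For corpus-level perplexity over long documents, the linked transformers guide describes the sliding-window variant of this calculation.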


4. Submission Requirements

  1. Model: HuggingFace-format upload (see the upload sketch after this list).
  2. Scripts and Notebooks: uploaded to a GitHub or HuggingFace repository.
  3. Technical Report (2–5 pages):
    • Training details: data sources, training mechanism, epochs, batch size, learning rates.
    • Resource usage: GPU time and a list of hardware resources.
    • Model evaluation results.
    • Analysis of strengths and limitations.
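
Uploading in HuggingFace format can be done with the built-in push_to_hub helpers; the repository name below is a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "out/final"  # your trained model directory (illustrative)
model = AutoModelForCausalLM.from_pretrained(ckpt)
tok = AutoTokenizer.from_pretrained(ckpt)

# Pushes weights, config, and tokenizer files to a Hub repository.
# Requires authentication first, e.g. `huggingface-cli login`.
model.push_to_hub("your-team/compact-sinhala-tamil-llm")  # placeholder repo ID
tok.push_to_hub("your-team/compact-sinhala-tamil-llm")
```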

5. Timeline

  • Task announcement: 25th June 2025
  • Deadline: 3rd July 2025
  • Introductory meeting: 7th July 2025
  • Progress meeting: 21st July 2025

6. Prizes & Incentives

  • Best Teams (for Sinhala & Tamil): Awards and winners' certificates.
  • All Participants: Certificate of participation.

7. Rules & Fairness

  • Parameter Limit: Strict upper bound of 8 B parameters, counting model plus adapter weights (see the compliance sketch after this list).
  • Data Usage: Only public, open-license data; no private data or content scraped from behind a login.
  • Reproducibility: All code, data-preparation scripts, and logs must be publicly accessible by the submission deadline.
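
One way to check compliance with the 8 B bound, counting base and adapter weights together; the base ID and adapter path are placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B")  # example base
model = PeftModel.from_pretrained(base, "out/final")            # your adapters

# The rule counts base weights and adapter weights combined.
total = sum(p.numel() for p in model.parameters())
print(f"total parameters (base + adapters): {total / 1e9:.3f} B")
```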

8. How to Register & Contact
