ICIIT Shared Task on Sinhala LLM Development

Important Dates

  • Call for registration: April 1, 2024
  • Initial briefing: April 26, 2024* (originally April 11, 2024)
  • Interim progress meeting: May 15, 2024* (originally April 22, 2024)
  • Shared Task competition: May 30, 2024

*Postponed owing to the Awrudu festival.

Background

Sri Lanka’s English literacy rate is low, limiting access to knowledge and technology. Large Language Models (LLMs) are powerful AI tools that can make knowledge accessible in local languages. This shared task aims to accelerate the development of Sinhala LLMs.

Shared Tasks

Task 1: Developing cleaned corpora of Unicode Sinhala

  • Collect a cleaned Sinhala Unicode corpus of at least 50 million words.
  • Provide sources and dimensions of the corpus.
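
As an illustration of what "cleaned" and "dimensions" could mean in practice, the following is a minimal sketch, not a prescribed pipeline. It assumes a corpus held as a list of text lines and keeps a line only if most of its characters fall in the Unicode Sinhala block (U+0D80 to U+0DFF). The function names and the 0.7 threshold are hypothetical choices made for this example.

```python
import unicodedata

# Unicode Sinhala block: U+0D80 to U+0DFF.
SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF


def sinhala_ratio(line: str) -> float:
    """Fraction of non-whitespace characters that lie in the Sinhala block."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    sinhala = sum(1 for c in chars if SINHALA_START <= ord(c) <= SINHALA_END)
    return sinhala / len(chars)


def clean_corpus(lines, min_ratio=0.7):
    """Normalize, filter, and de-duplicate lines (min_ratio is an assumed threshold)."""
    seen = set()
    cleaned = []
    for raw in lines:
        # NFC-normalize, then collapse runs of whitespace to single spaces.
        line = " ".join(unicodedata.normalize("NFC", raw).split())
        if not line or line in seen:
            continue
        if sinhala_ratio(line) < min_ratio:
            continue  # mostly non-Sinhala text (e.g. English or markup debris)
        seen.add(line)
        cleaned.append(line)
    return cleaned


def corpus_stats(lines):
    """Report simple corpus dimensions: line count and whitespace-delimited word count."""
    return {"lines": len(lines), "words": sum(len(line.split()) for line in lines)}
```

Calling `corpus_stats(clean_corpus(lines))` gives a word count that can be checked against the 50-million-word target. Real submissions would likely need more careful tokenization and per-source bookkeeping.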

Task 2: Implementing Sinhala interfaces to existing LLMs

  • Create an interface that seamlessly accepts input and responds in Sinhala.
  • Provide a chat interface with persistent memory.
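
One simple way to read "persistent memory" is a conversation history that survives restarts. The sketch below assumes that reading and stores the history in a JSON file. `ChatMemory`, `chat_turn`, and `reply_fn` are hypothetical names; `reply_fn` stands in for whatever LLM call a team chooses and is not any particular vendor's API.

```python
import json
from pathlib import Path


class ChatMemory:
    """Conversation history persisted to a UTF-8 JSON file (illustrative only)."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            # Reload earlier turns so the conversation continues across sessions.
            self.messages = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.messages = []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        # ensure_ascii=False keeps Sinhala text readable in the stored file.
        self.path.write_text(
            json.dumps(self.messages, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )


def chat_turn(memory, user_text, reply_fn):
    """Record the user message, obtain a reply from reply_fn, and record it too.

    reply_fn is a placeholder: any callable that takes the full message list
    and returns the assistant's reply as a string.
    """
    memory.add("user", user_text)
    reply = reply_fn(memory.messages)
    memory.add("assistant", reply)
    return reply
```

Passing the full message list to `reply_fn` on every turn lets the underlying model see earlier context; longer conversations would need truncation or summarization, which this sketch leaves out.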

Task 3: Building Sinhala LLMs

  • Explore approaches to build Sinhala LLMs, including RAG and fine-tuning.
  • Compare models against the zero-shot Claude 3 Sonnet model.

Task 4: Creating evaluation metrics for Sinhala

  • Develop Sinhala evaluation metrics to measure LLM performance.
  • Consider benchmarks such as ARC, HellaSwag, and TruthfulQA.
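
Benchmarks of this kind are commonly scored as multiple-choice accuracy, so a Sinhala adaptation could be evaluated along the lines below. This is a sketch under assumptions: the item layout (`question`, `choices`, `answer` index) and the `score_fn` callable are hypothetical, with `score_fn` standing in for a model's score (for example, a log-likelihood) of a choice given the question.

```python
def predict(question, choices, score_fn):
    """Return the index of the choice that score_fn rates highest."""
    scores = [score_fn(question, choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def accuracy(items, score_fn):
    """Fraction of items where the top-scored choice matches the gold answer index."""
    if not items:
        return 0.0
    correct = sum(
        1
        for item in items
        if predict(item["question"], item["choices"], score_fn) == item["answer"]
    )
    return correct / len(items)
```

With a real model, `score_fn` would wrap the model's scoring of each candidate answer; the accuracy function itself is independent of how those scores are produced.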

Note: All text data and code must be shared under an Apache/MIT-style license.

Teams are invited to register for one or more tasks. The shared task competition will take place on May 30, 2024.
