In the rapidly expanding world of large language models (LLMs), English continues to dominate, leaving other languages in its shadow. This imbalance is particularly stark in India, where more than 20 official languages and hundreds of dialects are spoken daily. PARAM-1, a newly released bilingual foundation model, rises out of India's own linguistic and cultural landscape.

The model is detailed in a paper published on arXiv (July 2025) by the BharatGen team, which includes Kundeshwar Pundalik, Piyush Sawarkar, Nihar Sahoo, and Abhishek Shinde. The authors describe PARAM-1 as a 2.9-billion-parameter foundation model trained from the ground up to reflect Indian realities.

Beyond translation

The name PARAM has a legacy in Indian high-performance computing, but the new model signals a different ambition. PARAM-1 is not a simple upgrade of past systems; it is designed to create artificial intelligence that understands India as more than just another market.

Unlike most global models that treat Indian languages as peripheral, PARAM-1 dedicates 25 per cent of its training data to Hindi. This includes government translations, literary works, educational material and community-generated content. The rest of the dataset consists of English sources carefully curated for their factual depth and range.

A tokeniser is the first step in how a language model processes text. It breaks sentences into smaller units, or tokens, which the model can interpret.

Standard tokenisers, built for English, perform poorly on Indian scripts, splitting words into too many fragments. PARAM-1 addresses this with a script-aware tokeniser that recognises Hindi and other Indic scripts, creating fewer and more meaningful tokens. This improves both accuracy and efficiency.
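
The fragmentation problem is easy to see with publicly available tokenisers. The sketch below, which assumes the open-source Hugging Face transformers library, compares an English-centric tokeniser with a multilingual, Indic-aware one (Google's MuRIL, used here purely as an illustration; it is not PARAM-1's tokeniser, which the paper does not release in this form). The point is only that an English-trained vocabulary shatters Devanagari text into many more tokens.

    # Illustrative sketch, not PARAM-1's actual tokeniser: it compares how an
    # English-centric tokeniser and an Indic-aware one split the same Hindi text.
    # Requires the `transformers` library; the model names are examples only.
    from transformers import AutoTokenizer

    hindi_text = "भारत एक विविधतापूर्ण देश है"  # "India is a diverse country"

    # GPT-2's byte-level BPE tokeniser was trained almost entirely on English,
    # so Devanagari text breaks into many small byte-level fragments.
    english_tok = AutoTokenizer.from_pretrained("gpt2")

    # A tokeniser trained on Indian languages keeps words far more intact.
    indic_tok = AutoTokenizer.from_pretrained("google/muril-base-cased")

    print("English-centric tokeniser:", len(english_tok.tokenize(hindi_text)), "tokens")
    print("Indic-aware tokeniser:    ", len(indic_tok.tokenize(hindi_text)), "tokens")

Fewer, more meaningful tokens mean the model sees whole words and morphemes rather than arbitrary byte fragments, which is where the accuracy and efficiency gains come from.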

Although PARAM-1 currently supports only English and Hindi, its tokeniser has been designed for broader Indian linguistic diversity. It can handle scripts such as Tamil, Telugu, Marathi and Bengali, laying the groundwork for future multilingual expansion.

Design, not retrofit

PARAM-1 is the result of a training strategy that prioritised inclusion from the start. It was trained in three phases, beginning with general language learning, followed by a focus on factual consistency, and, finally, long-context understanding. This structure allowed the model to gradually develop fluency, retain factual information more effectively, and improve performance on tasks that require reading and reasoning over longer texts.
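
One way to picture this curriculum is as a staged configuration, with each phase defined by its goal, its data mix, and the sequence lengths it trains on. The sketch below is schematic: the phase names follow the article's description, but the proportions, token budgets and sequence lengths are illustrative assumptions, not figures reported for PARAM-1 (only the roughly 25 per cent Hindi share is drawn from the article).

    # Schematic sketch of a three-phase training curriculum like the one
    # described above. All proportions and sequence lengths are illustrative
    # assumptions, not values reported for PARAM-1.
    training_phases = [
        {
            "name": "general_language_learning",
            "goal": "broad fluency in English and Hindi",
            "data_mix": {"english_sources": 0.75, "hindi_corpus": 0.25},
            "max_sequence_length": 2048,
        },
        {
            "name": "factual_consistency",
            "goal": "reinforce curated, fact-dense sources",
            "data_mix": {"curated_factual_english": 0.5, "hindi_reference_text": 0.5},
            "max_sequence_length": 2048,
        },
        {
            "name": "long_context_understanding",
            "goal": "reading and reasoning over longer documents",
            "data_mix": {"long_documents": 1.0},
            "max_sequence_length": 8192,  # longer sequences only in the final phase
        },
    ]

    for phase in training_phases:
        print(f"{phase['name']}: {phase['goal']} (seq len {phase['max_sequence_length']})")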

The model was tested not just on widely used English-language benchmarks such as MMLU and ARC Challenge, but also on India-specific datasets. These included MILU, which draws on Indian competitive examinations, and SANSKRITI, a benchmark that covers cultural knowledge ranging from festivals to geography. The results were encouraging. PARAM-1 performed competitively on global benchmarks and outperformed several open models on Indian tasks, especially in Hindi.
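
Mechanically, most of these benchmarks are multiple-choice: the model picks one option per question and accuracy is simply the fraction it gets right. The sketch below shows only that scoring mechanic; the two sample items and the model_answer function are hypothetical placeholders, not questions taken from MILU or SANSKRITI.

    # Minimal sketch of multiple-choice benchmark scoring. The items and the
    # model_answer function are hypothetical placeholders, not real benchmark data.

    def model_answer(question: str, options: list[str]) -> str:
        """Stand-in for querying a model; a real harness would score each
        option under the model and return the most likely one."""
        return options[0]  # placeholder choice

    benchmark_items = [
        {"question": "Which festival marks the start of the harvest season in Punjab?",
         "options": ["Baisakhi", "Onam", "Bihu", "Pongal"],
         "answer": "Baisakhi"},
        {"question": "Which river is known as the 'Sorrow of Bengal'?",
         "options": ["Damodar", "Ganga", "Kosi", "Godavari"],
         "answer": "Damodar"},
    ]

    correct = sum(
        model_answer(item["question"], item["options"]) == item["answer"]
        for item in benchmark_items
    )
    print(f"Accuracy: {correct / len(benchmark_items):.2%}")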

More languages

Although PARAM-1 is presented as a model designed for India, its bilingual focus means that other Indian languages are still excluded. This raises questions over the model’s inclusivity, especially in a country where linguistic identity often intersects with regional politics and access to services.

The team behind PARAM-1 appears to be aware of this limitation. The tokeniser was specifically engineered to handle the morphological patterns found in Indian languages beyond Hindi. While this does not compensate for the lack of direct training in those languages, it does provide a foundation for expanding the model’s linguistic reach in future iterations.

Equitable AI

PARAM-1 is not a frontier-scale model, nor does it claim to be the most powerful LLM available. Its significance lies in a different direction. It shows what can happen when the design of an AI model reflects the needs and complexities of the people who are meant to use it.

The development of PARAM-1 offers a blueprint for equitable AI design. It highlights the importance of investing early in diverse data, language-aware infrastructure, and public benchmarks that reflect regional and cultural realities. The model also invites broader participation from government agencies, universities, and private firms, especially if it is to grow into a truly multilingual and domain-specialised platform.

The authors of the model offer a clear message in their conclusion: Fairness in AI cannot be treated as an afterthought. It must be addressed in the earliest stages of design. PARAM-1 currently supports just two languages, but leaves the door open for many more. It serves as a reminder that if artificial intelligence is to serve all of humanity, it must begin by learning to listen to more of it.


Published on July 28, 2025


