India needs to place special emphasis on developing small language models (SLMs) with multimodal capabilities, alongside indigenous large language models (LLMs), to ensure linguistic inclusion, affordability, energy efficiency and public-sector suitability, the Office of the Principal Scientific Advisor (PSA) has suggested.
SLMs, which are more economical to train and run, are focused, domain-oriented models that can be fine-tuned for sector-specific tasks in agriculture, healthcare, education, climate and urban governance, and must therefore be developed in tandem with LLMs, the PSA’s office said in a white paper.
Further, the development of indigenous LLMs is crucial to build AI systems that are less biased, more trustworthy and remain locally relevant in a globally competitive AI ecosystem, the white paper suggested.
This can be achieved by developing indigenous LLMs trained on more diverse data, designed for India’s linguistic and social diversity, and governed through national frameworks, the PSA’s office suggested.
“Relying solely on foreign models risks under-representation of Indian languages and cultural contexts. Any biases in these models can cascade across all downstream applications that rely on them. This makes it critical to have a policy focus on these systems,” the white paper stated.
At present, the central government has approved proposals from a dozen startups to develop indigenous LLMs and SLMs. These include a proposal from Sarvam, which is developing a sovereign 105-billion-parameter LLM alongside 30-billion-parameter models designed for Indian languages, with a focus on governance, public service and high-stakes deployment.
Other approved proposals include BharatGen, a consortium led by the Indian Institute of Technology Bombay, which is developing multilingual and multimodal AI models ranging from 2 billion to 1 trillion parameters.
Meanwhile, Soket AI is developing a 120-billion-parameter open-source multilingual foundation model tailored to India’s linguistic diversity, while Gan AI is developing a 70-billion-parameter model targeting high-performance (“superhuman”) text-to-speech capabilities.