AI For India, By India: What It Takes To Build An Open-Source Indic-Language Model


Artificial intelligence (AI), especially generative AI, is no longer just another buzzword. Ever since OpenAI’s ChatGPT caught on and showed its utility as a virtual assistant that can do anything from writing poems to generating complex code, many businesses have leveraged the technology to make their offerings more efficient and more helpful. Even governments are taking AI seriously now. AI found notable mention in the recent G20 New Delhi Leaders’ Summit Declaration, where global leaders committed to “pursue a pro-innovation regulatory/governance approach that maximises the benefits and takes into account the risks associated with the use of AI”.

Now, let’s address the elephant in the room. If you have used any form of generative AI tool, you must have noticed how it operates more efficiently with English inputs. Apart from English, the other most prevalent languages include French, German, and Spanish. But what about Indian languages? 

To address that, Tech Mahindra’s Makers Lab is currently working on Project Indus, a large language model (LLM) focused solely on Indian dialects, designed to cater to the communication needs of 25 per cent of the world’s population. Tech Mahindra is one of India’s homegrown tech giants, competing with the likes of Infosys and TCS.

For those unaware, a large language model (LLM) is what drives a generative AI tool. It possesses the capability to engage in diverse natural language processing (NLP) tasks — generating and categorising text, responding to inquiries conversationally, and translating text across different languages. 

The most interesting bit of this project is that in order to build a database, Tech Mahindra is asking for contributions from none other than local speakers themselves. 

Why Do We Need An LLM Dedicated To Indian Dialects?

Sure, some generative AI tools such as ChatGPT or Google’s Bard do support Hindi and a handful of other Indic languages, but that support is primarily restricted to the general dialects of those languages. However, India has around 20 to 23 “mother tongues” with 1,645 dialects that are widely used across the nation. Notably, there are some 19,500 dialects spoken by Indians.

“In India, only 10 to 20 per cent of the population speaks English, highlighting the linguistic diversity of our nation. Creating trustworthy and genuine datasets can be difficult, but it is crucial,” Nikhil Malhotra, Global Head of Makers Lab at Tech Mahindra, told ABP Live.

Two primary objectives have been set for Project Indus. “First, we aim to create a foundational language model rooted in India,” said Malhotra. “We are currently determining the specific parameters for this study. We are considering a range of 7 to 14 billion parameters and may extend to 40 billion if required,” he said.
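To put those parameter counts in perspective, here is a rough back-of-the-envelope sketch of how much memory the model weights alone would occupy. It assumes each parameter is stored as a 16-bit number (2 bytes), which is a common choice but not something Tech Mahindra has confirmed for Project Indus; the function name is purely illustrative.

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Approximate memory for model weights only, in gigabytes.

    Assumption: 2 bytes per parameter (16-bit storage). Training needs
    considerably more memory than this for gradients and optimizer state.
    """
    # (params_billion * 1e9 parameters) * bytes_per_param / (1e9 bytes per GB)
    return params_billion * bytes_per_param


# The three sizes mentioned in the article: 7, 14 and 40 billion parameters.
sizes = {size: weight_memory_gb(size) for size in (7, 14, 40)}
for size, gigabytes in sizes.items():
    print(f"{size} billion parameters: about {gigabytes} GB of weights")
```

Under that assumption, the weights would take roughly 14 GB, 28 GB and 80 GB respectively, which gives a sense of why the choice of model size matters for the hardware required.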

“Second, we aim to excel in various benchmarks prevalent in the market, ensuring optimal model performance,” Malhotra said. “We will focus on developing LLM for Hindi and its dialects in the first phase, with plans to expand to other languages and dialects in the future.”

In terms of breaking India’s language barrier, the Government of India has already announced Bhashini, which is a local language translation mission that aims to eliminate the barrier between various local tongues with the help of technology. 

Apart from Bhashini, IIT Madras is also working on AI4Bharat, an open-source AI for Indian languages, which includes datasets, models, and applications. 

“We understand that much work has been done on Indic Suite, whether Bhashini, AI4Bharat, etc., but a foundation model still needs to be developed,” Malhotra said.

How To Build A Trusted Indic Language Dataset?

For starters, when we consider Indic languages, non-Hindi datasets are rare and mostly incomplete. Even the data for Hindi itself is quite fragmented. In order to address this, Project Indus decided to take the help of probably the best source of any language — the speakers themselves. 

If you visit the official Project Indus website, you’ll see a Make Your Contribution button on the homepage. Clicking on it will take you to a page where users can help enrich the database by “contributing” an audio sample of their local dialects. 

Users can select their own dialect of Hindi from a dropdown menu on the page, ranging from Angika (Bihar) and Surgujia (Chhattisgarh) to Malvi (Rajasthan) and Kumaoni (Uttarakhand).

When users scroll down the page, they will see a sample sentence in the selected dialect, which they can choose to record anonymously. The sample sentence can be shuffled if they prefer a different prompt.

There’s an elaborate process in place to train the Project Indus LLM. Tech Mahindra acquires data from several online sources, “including critical ones like Common Crawl, which provides website data,” Malhotra said. “However, the challenge lies in finding dialect-specific data, as most sites primarily offer data in mainstream languages. Our data strategy involves more than just collection. It comprises two key steps. Step one is data collection, while step two is diversification,” he explained.

“Research has shown that models trained on diverse datasets outperform those trained on a single data type. Our diverse dataset includes content from sources like newspapers, Wikis, and specific domains like construction, general awareness, and news. Since much of this data isn’t readily available in Hindi and English, our initial approach is to collect data in English, translate it into Hindi, and assess it for annotation quality and ethical bias. This annotation process involves human input for refining the data. Our model incorporates foundation models and techniques, including reinforcement learning through human feedback, and uses refined content as input.”
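The two-step strategy Malhotra describes, collection followed by diversification, with translated text routed to human annotators, can be illustrated with a minimal sketch. This is not Project Indus code: the record fields (`source`, `text`, `translated`), the function names, and the per-source cap are all assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import zip_longest


def diversify(records, per_source_cap):
    """Keep at most per_source_cap records per source, then interleave sources.

    Assumption: capping and round-robin mixing stand in for whatever
    diversification method the actual project uses.
    """
    buckets = defaultdict(list)
    for rec in records:
        bucket = buckets[rec["source"]]
        if len(bucket) < per_source_cap:
            bucket.append(rec)
    mixed = []
    # Take one record from each source in turn until all are exhausted.
    for group in zip_longest(*buckets.values()):
        mixed.extend(rec for rec in group if rec is not None)
    return mixed


def review_queue(records):
    """Return machine-translated records, which would go to human annotators."""
    return [rec for rec in records if rec["translated"]]


# Hypothetical collected corpus, dominated by one source (news).
corpus = [
    {"source": "news", "text": "sample sentence 1", "translated": False},
    {"source": "news", "text": "sample sentence 2", "translated": False},
    {"source": "news", "text": "sample sentence 3", "translated": False},
    {"source": "wiki", "text": "sample sentence 4", "translated": True},
    {"source": "construction", "text": "sample sentence 5", "translated": True},
]

balanced = diversify(corpus, per_source_cap=2)
to_annotate = review_queue(balanced)
print([rec["source"] for rec in balanced])
print(len(to_annotate), "records queued for human review")
```

The cap keeps a single large source from dominating the training mix, while the review queue mirrors the article's point that translated content is checked by people before it is used.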

What About Other Languages?

For now, the primary focus of Project Indus is to build the first Hindi-based foundational model, which will comprise 36 to 40 dialects. Once this is completed, plans will be expanded to include other languages as well. 

“The reason for incorporating these dialects is to ensure that our foundational model comprehensively represents India’s rich linguistic diversity. We’re focusing on specific domains,” Malhotra said. “We aim to ensure these domains are covered and accurately represented without generating false information.”

Malhotra added that Tech Mahindra’s interface will also offer voice and textual information, but that is a secondary goal. He said, “Our primary goal is to first create a large language model for the continuation of text and then provide a dialogue. Once we are clear that the model performs well and generates dialects well, we will launch it in the open source.”

At the time of writing, Project Indus is still in its developmental phase. No official date for an open-source launch has yet been announced.

