Translation software for low-resource languages of Southeast Asia
Machine translation, also known as automatic translation, is the translation of text from one language (the source language) into one or more other languages (the target languages) automatically, without human intervention in the translation process. Many machine translation products are in common use today, such as Google Translate and Microsoft's Bing Translator, and they deliver very good translation quality for simple sentences. These high-quality systems need large-scale bilingual datasets, up to millions of sentence pairs, to train their models. However, many of the world's languages do not have datasets of that size. Building an effective machine translation model for low-resource languages, including the languages of Southeast Asia, is therefore an urgent and challenging task.
To address this challenge, a group of researchers at the Institute of Information Technology (IIT) has developed a Vietnamese-centric translation system capable of two-way translation between Vietnamese and low-resource languages of Southeast Asia, with quality equivalent to that of well-known commercial products. The system currently supports two-way translation for five language pairs: Vietnamese - Lao, Vietnamese - Khmer, Vietnamese - Thai, Vietnamese - Malay and Vietnamese - Indonesian.
The system was researched and developed based on the latest advances in natural language processing in general and machine translation in particular. Languages such as Lao, Thai and Khmer pose major challenges for machine translation, not only because bilingual data is scarce, but also because these languages are morphologically rich, are written without explicit word or sentence boundaries, and have many polysemous words. The IIT’s machine translation model has learned to adapt to these characteristics. The models are trained on the Nvidia DGX A100 server system at the IIT, which has the most advanced configuration in Vietnam today. The system can easily be extended to new target languages, including ethnic minority languages of Vietnam (often very poor in data resources) such as Muong and Thai, as well as widely used foreign languages such as Chinese, French or Russian when needed. In particular, the system can be fine-tuned to adapt to specialized domains such as medicine or law, according to partners' specific requirements.
Key features of the multilingual translation system include:
1. On-premises deployment: the software is installed and runs on the customer organization's own servers, giving the organization full control over data and applications.
2. The system uses modern Industry 4.0 technologies, including machine learning and state-of-the-art natural language processing, to achieve high translation accuracy.
3. The system can update its data and retrain its models to improve translation quality and adapt to the organization's field of expertise.
4. The system ensures absolute information security during use.
5. The system can be deployed both in the internal network and on the Internet.
6. The system can be accessed in two ways: a web interface through which users translate directly, and an API through which other systems can connect and operate.
7. The system translates automatically in many different formats, including plain text (.txt) and formatted document files (.rtf, .doc, .docx, .pdf, .html...), and preserves the main formatting of the documents after translation.
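As a rough illustration of the API access mentioned in item 6, the sketch below builds (but does not send) an HTTP request for a translation. The article does not describe the actual API, so the base URL, the `/translate` path and the JSON field names `text`, `source` and `target` are all hypothetical placeholders, not the system's real interface.

```python
import json
import urllib.request


def build_translate_request(base_url, text, source_lang, target_lang):
    """Build a POST request for a HYPOTHETICAL translation endpoint.

    The path and JSON field names are assumptions for illustration only.
    """
    payload = {"text": text, "source": source_lang, "target": target_lang}
    return urllib.request.Request(
        url=f"{base_url}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# An on-premises deployment would be reached at an internal address;
# localhost is used here only as a placeholder.
request = build_translate_request("http://localhost:8000", "Xin chào", "vi", "lo")
```

A client application would pass such a request to `urllib.request.urlopen` and parse the response according to whatever format the deployed system actually returns.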
A representative of the research team said that the overall architecture of the machine translation system is based on the Transformer. In its general form, the model is end-to-end: an encoder maps each input sentence (in the source language) to semantic vectors, and a decoder then generates the output sentence (in the target language) from these vectors.
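To make the encoder-decoder idea more concrete, here is a minimal sketch of scaled dot-product attention, the standard Transformer operation a decoder uses to read the encoder's vectors. It is a generic textbook illustration in plain Python, not the IIT team's implementation; the vectors and the names `attend`, `encoder_states` and `decoder_query` are made up for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    dim = len(query)
    # Similarity between the query and each key, scaled by sqrt(dim).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of the value vectors.
    context = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return context, weights


# Toy encoder output: one 2-dimensional vector per source token.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy decoder state asking "which source tokens matter right now?"
decoder_query = [1.0, 0.0]

context, weights = attend(decoder_query, encoder_states, encoder_states)
```

In a full Transformer this operation runs in many parallel heads and layers, with learned projections producing the queries, keys and values.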
General model
The system is built on pre-trained models, which are then fine-tuned and optimized, as shown in the following figure.
Main steps in the process of building a machine translation system
The creation of bilingual datasets for fine-tuning plays a decisive role in the accuracy of the translation model. The research team applied a variety of techniques to enrich the dataset, including back-translation, pivoting through a common language, and transfer learning.
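Two of these data-enrichment ideas can be sketched in a few lines. The example below is only an illustration of the general techniques: it assumes English as the pivot language and uses a toy dictionary in place of a real reverse translation model, neither of which is stated in the article. The function names are hypothetical.

```python
def pivot_pairs(source_pivot, pivot_target):
    """Join source-pivot and pivot-target pairs on the shared pivot sentence."""
    targets_by_pivot = {}
    for pivot, target in pivot_target:
        targets_by_pivot.setdefault(pivot, []).append(target)
    return [
        (source, target)
        for source, pivot in source_pivot
        for target in targets_by_pivot.get(pivot, [])
    ]


def back_translate(target_sentences, reverse_translate):
    """Create synthetic (source, target) pairs from monolingual target text.

    reverse_translate is any callable that translates target -> source.
    """
    return [(reverse_translate(t), t) for t in target_sentences]


# Pivoting: Vietnamese-English and English-Lao pairs yield Vietnamese-Lao pairs.
vi_en = [("Xin chào", "Hello"), ("Cảm ơn", "Thank you")]
en_lo = [("Hello", "ສະບາຍດີ"), ("Thank you", "ຂອບໃຈ")]
vi_lo = pivot_pairs(vi_en, en_lo)

# Back-translation: a toy lookup stands in for a trained Lao -> Vietnamese model.
toy_lo_to_vi = {"ສະບາຍດີ": "Xin chào"}
synthetic = back_translate(["ສະບາຍດີ"], toy_lo_to_vi.get)
```

In practice the synthetic pairs would usually be filtered for quality before being mixed with genuine bilingual data.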
Another advanced technique is also applied to improve translation quality: the model is trained on multiple language pairs simultaneously. Resource-rich languages are trained first, and the linguistic “knowledge” learned from them is then transferred to resource-poor languages, improving the performance of the translation model for these languages.
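One common way to train a single model on several language pairs is to prefix each source sentence with a token naming the target language, and to schedule high-resource pairs before low-resource ones. The article does not say how the IIT system does this, so the sketch below is an assumption: the `<2xx>` tag format, the function names and the tiny corpora are all illustrative.

```python
def tag_source(sentence, target_lang):
    """Prefix the source sentence with a token naming the target language."""
    return f"<2{target_lang}> {sentence}"


def build_stages(corpora, low_resource_langs):
    """Split tagged examples into a high-resource and a low-resource stage."""
    stage_one, stage_two = [], []
    for (src_lang, tgt_lang), pairs in corpora.items():
        tagged = [(tag_source(src, tgt_lang), tgt) for src, tgt in pairs]
        is_low = src_lang in low_resource_langs or tgt_lang in low_resource_langs
        (stage_two if is_low else stage_one).extend(tagged)
    return stage_one, stage_two


# Toy corpora keyed by (source language, target language).
corpora = {
    ("vi", "en"): [("Xin chào", "Hello"), ("Cảm ơn", "Thank you")],
    ("vi", "lo"): [("Xin chào", "ສະບາຍດີ")],
}

# Stage one would be trained first; stage two continues from the same weights.
stage_one, stage_two = build_stages(corpora, low_resource_langs={"lo", "km"})
```

Because all pairs share one set of parameters, what the model learns in the first stage is available when it continues training on the scarcer data.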
Translation models can be very large (up to tens or even hundreds of billions of parameters), which slows the system down in environments with limited computing capacity. The research team therefore optimized the model with techniques such as weight quantization, layer fusion and batch reordering, to increase execution speed and reduce memory usage on CPU and GPU.
The system is a very good alternative to existing commercial translation software in cases such as: 1) the customer wants a translation system that runs on its own, independent of third parties, ensuring security, safety and data confidentiality; 2) the customer wants to expand into a new, resource-poor language that commercial software does not yet support or for which translation quality is not guaranteed; 3) the customer wants to connect and integrate the translation system with other application systems proactively and flexibly, with full control over the translation APIs.
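As an illustration of weight quantization, the sketch below applies simple symmetric 8-bit quantization to a handful of weights. It shows the general idea only (storing small integers plus one scale factor instead of full-precision floats); the article does not specify which quantization scheme the team used, and the function names and sample values are invented for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest magnitude onto 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.0, 0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each restored weight differs from the original by at most half a step.
max_error = max(abs(r - w) for r, w in zip(restored, weights))
```

Each weight then needs one byte instead of four, at the cost of a small rounding error bounded by half the scale.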
Translated by Tuyet Nhung