Translation Memory Processing

Confidential

Introduction

Translation assistance tools all capitalize on past translations to help translators in their work. The translator working on a new document may well not be the one who translated a similar document earlier. This accumulated capital is the translation memory (TM).

The most widely used format worldwide for storing and exchanging translation memories is TMX (Translation Memory eXchange).
It is an XML format that, in its simplest form, pairs a sentence (or segment) in a source language with its translation into a target language.
A segment can be a single word or a thirty-word sentence, for example.
As you will see, a translation memory, like human memory, can become distorted or biased, which inevitably introduces errors into something that is expected to be infallible.
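To make the simplest form of a TMX file concrete, here is a minimal sketch in Scala. The sample document is hand-written for illustration and is not taken from the project. The names TmxSketch, sampleTmx and readUnits are ours, and the reader relies on the StAX API shipped with the JDK. It assumes that each seg element contains plain text only, which real TMX files do not guarantee.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable

object TmxSketch {
  // Hand-written, minimal TMX document: one translation unit (<tu>)
  // holding one variant (<tuv>) per language.
  val sampleTmx: String =
    """<?xml version="1.0" encoding="UTF-8"?>
      |<tmx version="1.4">
      |  <header srclang="en" datatype="plaintext" segtype="sentence"
      |          adminlang="en" o-tmf="sketch" creationtool="sketch" creationtoolversion="1.0"/>
      |  <body>
      |    <tu>
      |      <tuv xml:lang="en"><seg>Hello world</seg></tuv>
      |      <tuv xml:lang="fr"><seg>Bonjour le monde</seg></tuv>
      |    </tu>
      |  </body>
      |</tmx>""".stripMargin

  // Streams through the document and returns one (language -> text) map per <tu>.
  // Assumption: <seg> holds plain text only (no inline markup).
  def readUnits(tmx: String): List[Map[String, String]] = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(tmx))
    val units = mutable.ListBuffer.empty[Map[String, String]]
    var current = Map.empty[String, String]
    var lang = ""
    try {
      while (reader.hasNext) {
        reader.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            reader.getLocalName match {
              case "tu"  => current = Map.empty
              // A null namespace means "match the local name only", so xml:lang is found.
              case "tuv" => lang = reader.getAttributeValue(null, "lang")
              case "seg" => current += (lang -> reader.getElementText)
              case _     => ()
            }
          case XMLStreamConstants.END_ELEMENT if reader.getLocalName == "tu" =>
            units += current
          case _ => ()
        }
      }
    } finally reader.close()
    units.toList
  }
}
```

Calling TmxSketch.readUnits(TmxSketch.sampleTmx) yields a single unit that maps "en" to the source segment and "fr" to its translation.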

These same translation memories also serve as training data for machine learning: they are the first step in building an artificial neural network, which is then exploited through what is commonly called Neural Translation Memory (NTM).

The project

2CS was mandated to purge a translation memory amounting to approximately 10 GB of TMX files.
TMX, being based on XML, is very verbose, yet this volume still represents some 10 million segments (sentences or expressions).

2CS must purge segments that were translated and inserted into the TM by translators of varying quality, and must also clean and consolidate the translation memory.

The resulting TMs must then be loaded into a new centralized, online translation support tool that operates on a SaaS (Software as a Service) model. This tool can also be connected to a third-party NTM tool.

Technologies used

Requirements

The job is simple to state: take a translation memory as input, produce a cleaned, purged and consolidated translation memory as output, and do so within a 2-month deadline.

Solution

We carried out an in-depth study of the tools available on the market to find one that meets two very specific needs:

search the segments to find the name of the translator
then exclude the segments produced by translators considered unreliable
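The two needs above can be sketched on an in-memory model. Everything in this example is hypothetical: the TranslationUnit case class, its field names and the normalization of translator names are our assumptions, not the rules applied in the project.

```scala
object TranslatorFilter {
  // Hypothetical in-memory model of one translation unit.
  final case class TranslationUnit(translator: String, source: String, target: String)

  // Free-form attributes often vary in spelling, so names are trimmed and
  // lower-cased before comparison (an assumption made for this sketch).
  private def normalize(name: String): String = name.trim.toLowerCase

  // First need: find out which translators appear, and how often.
  def segmentsPerTranslator(units: Seq[TranslationUnit]): Map[String, Int] =
    units.groupBy(u => normalize(u.translator)).map { case (name, us) => name -> us.size }

  // Second need: keep only the units whose translator is not on the exclusion list.
  def purge(units: Seq[TranslationUnit], unreliable: Set[String]): Seq[TranslationUnit] = {
    val excluded = unreliable.map(normalize)
    units.filterNot(u => excluded.contains(normalize(u.translator)))
  }
}
```

For instance, with two units attributed to "alice" and one to "BOB ", purge(units, Set("bob")) keeps only the two units from alice.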

We found nothing satisfactory.

The software that originally produced these TMX files took complete freedom with some of the format's attributes, and standard TMX editing tools simply ignore those attributes.

As a software engineering company, we therefore chose to write a custom routine. Drawing on feedback from our team of engineers, we designed a tool that applies advanced management rules.
We chose Scala, a technology of Swiss origin that is well suited to reading large volumes of data and is widely used in Big Data processing.
Performance is as good as expected: purging 10 GB of TMX files takes only about 1 minute.
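One way to purge a large file without loading it entirely into memory is to stream it event by event. The sketch below illustrates that idea and is not the tool described here. It uses the StAX event API from the JDK and assumes, hypothetically, that the translator is recorded in the creationid attribute of each tu element. The names StreamingPurge and purge are ours.

```scala
import java.io.{Reader, Writer}
import javax.xml.namespace.QName
import javax.xml.stream.{XMLInputFactory, XMLOutputFactory}

object StreamingPurge {
  // Assumption: the translator is stored in the creationid attribute of <tu>.
  private val CreationId = new QName("creationid")

  // Copies the TMX stream from in to out, dropping every <tu> whose creationid
  // is in the unreliable set. Returns the number of units dropped.
  def purge(in: Reader, out: Writer, unreliable: Set[String]): Int = {
    val reader = XMLInputFactory.newInstance().createXMLEventReader(in)
    val writer = XMLOutputFactory.newInstance().createXMLEventWriter(out)
    var skipping = false
    var dropped = 0
    try {
      while (reader.hasNext) {
        val event = reader.nextEvent()
        // Entering a <tu>: decide whether the whole unit must be skipped.
        if (event.isStartElement && event.asStartElement.getName.getLocalPart == "tu") {
          val attr = event.asStartElement.getAttributeByName(CreationId)
          if (attr != null && unreliable.contains(attr.getValue)) {
            skipping = true
            dropped += 1
          }
        }
        if (!skipping) writer.add(event)
        // Leaving a <tu>: resume copying from the next event onwards.
        if (event.isEndElement && event.asEndElement.getName.getLocalPart == "tu") {
          skipping = false
        }
      }
      writer.flush()
    } finally {
      writer.close()
      reader.close()
    }
    dropped
  }
}
```

Only the current event is held in memory, so memory use does not grow with the size of the file. In practice the Reader and Writer would wrap buffered file streams rather than in-memory strings.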

Conclusion

Thanks to Scala, a functional language invented in Switzerland, and in particular to its pattern matching, this work was carried out with good execution performance for a piece of software that we do not plan to reuse.
Scala is well suited to processing large volumes of data. By contrast, an approach that loads the whole document into memory, as is common in Java for example, consumes roughly as much memory as the file being processed.
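As an illustration of how pattern matching lends itself to management rules, here is a minimal sketch. The rules, the TranslationUnit model and the Decision type are all hypothetical and were chosen only to show the technique. They are not the rules used for this mandate.

```scala
object PurgeRules {
  // Hypothetical model: the translator may be missing from a unit.
  final case class TranslationUnit(translator: Option[String], source: String, target: String)

  sealed trait Decision
  case object Keep extends Decision
  final case class Drop(reason: String) extends Decision

  // Rules are tried from top to bottom; the first matching case wins.
  def decide(unit: TranslationUnit, unreliable: Set[String]): Decision = unit match {
    case TranslationUnit(Some(name), _, _) if unreliable.contains(name) =>
      Drop(s"unreliable translator: $name")
    case TranslationUnit(_, source, _) if source.trim.isEmpty =>
      Drop("empty source segment")
    case TranslationUnit(_, _, target) if target.trim.isEmpty =>
      Drop("empty target segment")
    case _ =>
      Keep
  }
}
```

Each rule reads as one case, so adding or reordering rules does not require restructuring the surrounding code.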

The mandate was completed in 1 week.

If this mandate comes up again, we are considering building an interface to make this small piece of software easier to control.