Translation Memory difference highlighting in Virtaal

Translation Memory, everything that you've translated in the past, is an amazing resource. But like all technology that assists a translator, it is only useful if it is quick and easy to use.

Virtaal has been growing an impressive array of Translation Memory plugins (and some pretty cool Machine Translation plugins too), which means that you are potentially getting more and more matches. The problem with our current implementation is that it's hard to know exactly what the differences are between your current segment and the segment matched by TM. Is it a word, a spelling difference, some punctuation?

With the upcoming release of Virtaal 0.5 we'll see that change. Over the weekend I implemented difference highlighting. Instead of explaining what I mean let me just show you in the following picture.

As you can see from the screenshot above, both suggestions provided by Virtaal differ from the current source segment. The first has a space at the end, while the second is capitalised differently. It's now easy to see how these suggestions differ, to choose which one to use, and to see where to edit it to bring it in line with what the current segment needs. Now imagine this with a whole sentence.

Now that's a bit easier than trying to work out how those two suggestions are different from the current segment. As a localiser or translator, that's about all you need to know.

Now for the technically inclined. I'm using Python's difflib module to compute the differences. It's quite powerful and pretty easy to use. The get_opcodes method of its SequenceMatcher class provides a nice way of determining what has changed. I then take the segment and wrap it in Pango formatting instructions to arrive at the nicely rendered output.
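To make that a little more concrete, here is a minimal sketch of the idea rather than Virtaal's actual code. The function name highlight_diff and the colours are assumptions made up for illustration; the only library calls used are difflib.SequenceMatcher.get_opcodes and the standard escape helper, plus the standard Pango span attributes background and strikethrough.

```python
import difflib
from xml.sax.saxutils import escape

# Hypothetical styling; the colours Virtaal really uses may differ.
INSERT_SPAN = '<span background="#a0ffa0">%s</span>'
DELETE_SPAN = '<span background="#ffa0a0" strikethrough="true">%s</span>'


def highlight_diff(current, suggestion):
    """Return Pango markup showing how `suggestion` differs from `current`.

    Text only in `current` is marked as deleted, text only in
    `suggestion` is marked as inserted, and shared text is left plain.
    """
    matcher = difflib.SequenceMatcher(None, current, suggestion)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            parts.append(escape(suggestion[j1:j2]))
        if tag in ('delete', 'replace'):
            parts.append(DELETE_SPAN % escape(current[i1:i2]))
        if tag in ('insert', 'replace'):
            parts.append(INSERT_SPAN % escape(suggestion[j1:j2]))
    return ''.join(parts)


# A suggestion with a trailing space, like the first one in the screenshot.
print(highlight_diff("Open file", "Open file "))
```

The text is escaped before it is wrapped, because a stray & or < in a segment would otherwise break the Pango markup.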

The implementation is not without issues. Difflib does its job too well. When one word has been swapped for another, a human wants to see the one word deleted and the other inserted. But difflib sees things very differently: it works at the character level and looks for patterns of characters the two words share, trying very hard to show you how the one word changed into the other. The result is simply impossible for a human to read. So we'll need to look at that and work out some way to adjust the get_opcodes output to turn that complication into a simple replace of the whole word.
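One possible way around this, sketched here as an assumption and not something Virtaal does yet, is to avoid the character-level comparison altogether: split both segments into word, whitespace and punctuation tokens and let difflib compare the token lists. The names tokenize and word_opcodes are hypothetical.

```python
import difflib
import re


def tokenize(text):
    """Split text into words, runs of whitespace and single punctuation marks."""
    return re.findall(r'\w+|\s+|[^\w\s]', text)


def word_opcodes(current, suggestion):
    """Return (tag, current_text, suggestion_text) tuples at word granularity."""
    a = tokenize(current)
    b = tokenize(suggestion)
    # SequenceMatcher works on any sequence of hashable items, not only strings.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, ''.join(a[i1:i2]), ''.join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


for op in word_opcodes("Save the file", "Save the document"):
    print(op)
```

With this approach a changed word always comes out as one whole-word replace. The trade-off is that a single-character difference inside a word, such as a spelling change, is also reported as a replacement of the entire word.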

There are other places where we want to put difference highlighting. The previous msgid functionality of PO files is one; there will be others. But that's going to happen in the Virtaal 0.5 timeframe I'm afraid.

Attachments:
Screenshot-src.po - Diff.png (15.46 KB)
Screenshot-segment.po - Diff.png (29.02 KB)