Adding DTD validation to moz2po

As an engineer, unlike a scientist, I'm happy with solutions that deliver demonstrable benefit. The proof of the value I add is the reduction in the number of errors or the increase in productivity.
After attending the Mozilla Summit I remain convinced that moz2po continues to add value.  When talking localisation with people who use the Translate Toolkit, we talked about translation quality, QA, etc.  With everyone else I found we spoke about technical QA.  By technical QA I mean fixing things like broken builds, screen sizes, etc., i.e. all the things that should be eliminated rather than documented or improved.
In my discussions with people about moz2po I picked up two memes.

  1. We cannot prove the correctness of our output
  2. Using & as an accelerator marker is a problem

On the first issue I must agree: we cannot prove anything.  What I can show is that we don't make broken builds, and when we do, we fix the tools so that we no longer make them.  I don't need to watch tinderbox or the dashboard as I'm pretty sure all our data is correct.  But it did get me thinking, and I decided that at least for DTD files we could attempt to parse them and measure their validity; see bug 470.  This is fixed and will be in v1.2 of moz2po.
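A minimal sketch of what such a parse check might look like in Python.  This is not the moz2po code from bug 470; the function name dtd_is_parseable and the throwaway wrapper document are assumptions made for illustration.  It places the entity declarations in the internal subset of a dummy document and lets the standard library's expat parser say whether they parse.

```python
import xml.parsers.expat


def dtd_is_parseable(dtd_text):
    """Return (True, None) if the declarations parse, else (False, message).

    Hypothetical helper: wraps the DTD content in the internal subset of
    a dummy document so that expat has a complete document to read.
    """
    document = "<!DOCTYPE dummy [\n%s\n]>\n<dummy/>" % dtd_text
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError as error:
        return False, str(error)
    return True, None


if __name__ == "__main__":
    print(dtd_is_parseable('<!ENTITY fileMenu.label "File">'))
    print(dtd_is_parseable('<!ENTITY fileMenu.label "Save & Close">'))
```

A check like this only says that the file parses.  It says nothing about whether the translation itself is any good.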
On the second, I must admit that to those unfamiliar with how we process PO files the use of & seems problematic.  For those unfamiliar with DTD or XML, you need to understand that & has special meaning in those formats, so a stray & can break things.  Then why did we choose & in our PO files?  Simple: Translate.org.za translated KDE before Mozilla, so we chose &, which is used as an accelerator marker for KDE applications.  Even though we use &, we don't see any of the breakage that everyone fears.  Why?  Because I regard these as technical problems that should simply be eliminated.  We eliminate them in two ways:

  1. Our tool pofilter has the --accelerator test, which allows us to identify any missing accelerators.  Thus we eliminate almost all issues of unaligned accelerators.
  2. When converting back using po2moz we will delete any unescaped & and warn the user.  Thus we get a .xpi and the user can go fix the problem.
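The two steps above can be sketched roughly as follows.  This is an illustration only, not the actual pofilter or po2moz code: the regular expression and the function names count_accelerators, has_matching_accelerators and strip_stray_ampersands are all assumptions.  The idea is that any & that does not start an entity or character reference is treated as an accelerator marker in the PO text, and anything of that kind still left at conversion time is dropped with a warning.

```python
import re
import warnings

# An "&" that is NOT the start of an entity reference (&name;) or a
# character reference (&#169; or &#xA9;).  Pattern is an assumption.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def count_accelerators(text):
    """Count bare "&" markers in a PO string."""
    return len(BARE_AMPERSAND.findall(text))


def has_matching_accelerators(source, target):
    """True if source and translation carry the same number of markers."""
    return count_accelerators(source) == count_accelerators(target)


def strip_stray_ampersands(text):
    """Delete any bare "&" and warn, so the output stays parseable."""
    cleaned, removed = BARE_AMPERSAND.subn("", text)
    if removed:
        warnings.warn("removed %d unescaped '&' from %r" % (removed, text))
    return cleaned, removed
```

With this sketch, has_matching_accelerators("&File", "Lêer") is false, which is the kind of mismatch a translator would be asked to fix before the .xpi is built.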

If that is not enough, I've opened bug 471 to review the & decision and perhaps change it, or at least make it configurable.  But what I would like you to consider is the amount of time technical intervention has saved users of moz2po:

  1. For the last 5 years Translate.org.za has not built broken .xpi files, a problem that has plagued a lot of teams.  It is extremely difficult to track down broken pages as the errors that you see in your browser are very cryptic.  My guess is that at best it will take you an hour to find the error and rebuild your .xpi.  Now imagine the time taken by a new translator: they must email, ask an expert, understand, etc.  All this while they could be translating.
  2. We do not need to review our translations for bad accelerators, i.e. we don't need to do in-context reviews to check for accelerators that don't appear in the text of our translations.  In Mozilla these look like this: "Lêer (F)" (this is File in Afrikaans, and the F would be underlined).
  3. In DTD and .properties files where the .label and .accesskey entries do not align (we use the name alignment to merge the label and accesskey) you waste a lot of time.  Firstly you will see the problem in your in-product review.  Then you have to search your files to work out where exactly the error comes from.  Considering that the string might appear a number of times, you have a number of files to examine.  Once it is identified you have to find the accesskey entry (it may be poorly labelled), make a change and rebuild.  How much time?  15 minutes if you are lucky.  Our approach to these is to raise bugs so that they disappear from the product for ourselves and others.  So although it takes us time, in the long run anyone using the tools contributes to the overall quality of the product, as nobody after the fix has to worry about it.  Now consider that anyone else will simply fix this in their own files and is unlikely to raise a bug to align the .label and .accesskey.  Our tools help us to focus on improving the quality of the source for everyone's benefit.
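To make the name alignment in point 3 concrete, here is a rough sketch of merging a label with its accesskey by name.  It is not the moz2po implementation: the function merge_accesskeys, the plain dict of entity names to values, and the rule for placing the & are all assumptions chosen for illustration.

```python
LABEL = ".label"
ACCESSKEY = ".accesskey"


def merge_accesskeys(entities):
    """Merge foo.label + foo.accesskey into one "&"-marked string.

    Returns (merged, problems).  problems lists the accesskey entries
    that could not be merged: either no matching .label exists, or the
    key does not occur in the label text.
    """
    merged, problems = {}, []
    for name, value in entities.items():
        if name.endswith(LABEL):
            base = name[: -len(LABEL)]
            key = entities.get(base + ACCESSKEY)
            if not key:
                merged[base] = value  # no accelerator to merge
                continue
            # Simple case-insensitive search for the key in the label.
            position = value.lower().find(key.lower())
            if position == -1:
                problems.append(base + ACCESSKEY)
                merged[base] = value
            else:
                merged[base] = value[:position] + "&" + value[position:]
        elif name.endswith(ACCESSKEY):
            if name[: -len(ACCESSKEY)] + LABEL not in entities:
                problems.append(name)  # unaligned: no label with this name
        else:
            merged[name] = value
    return merged, problems
```

Each entry in problems is a candidate for a bug report against the source files, rather than a local fix in one team's translation.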

So the toolkit eliminates a whole raft of issues that you just never see.  How much time have we saved over the years?  Days and weeks, I would guess, not only for us now but for the many others who come later.  This means that we can focus on the area where focus counts, the actual language used, and future translators can do the same.