After lunch, I headed off to see a slightly-off-topic presentation on XML and Internationalization by Yahoo!’s own Sander van Zoest.
van Zoest began discussing Unicode overall, covered the UTF variants such as UTF-8, and BOM (Byte Order Marks). Moving into XML itself, he gave an overview of the intended use of the xml:lang tag and the ISO-639-2 language codes and ISO-3166 country codes.
XML also supports Numerical Character References such as € for the EURO SIGN (€). These may be expressed in hex (&#xHHHH;) or decimal (&#DDDD;) and always contain the Unicode code point value (regardless of which UTF scheme the document is encoded in). NCRs can be handy when you need to represent a the character does not exist in the document’s encoding scheme, but can’t be used in element or attribute names, or in CDATA and PIs.
The presentation gave examples of how to do character set transformations using XSLT, Perl 5.8, and Java.
We got the usual pitch to use tags with semantic value such as <important> instead of <b> and to use sytlesheets to do presentation instead of cluttering up the markup.
van Zoest failed to mention the perennial question on this subject: why the heck does “i18n” stand for “internationalization”? The answer is that there are 18 letters between the beginning “i” and ending “n” in the word “internationalization”.
I was a little short on time for this talk. I didn’t expect all three to be accepted. At the
same time, however, it was important to have these topics discussed by someone. I guess that might as well be me.
If you have any feedback or suggestions, please let me know.
Thanks!
Cool… a nice blog.