About XML schemas and the myriad of Unicode characters

Useful material for users and developers of the XML-based dynamic publishing system "Arbortext". The information in this wiki comes from posts to the Arbortext Adepters mailing list that list members identified as extremely useful; it is collected here because the mailing list has moved around over the years. For more information, see the group FAQ.


Posted by liz » Mon Jun 15, 2015 5:05 pm

Written by: Eliot Kimber
Last Updated: 2006-08-31

Question by Ed Benton

One of the things holding us back (among others) from switching from an XML DTD to a schema is this: how do we keep the myriad of Unicode characters from being offered to our authors, and subsequently showing up in some outputs and causing errors, while still making the non-Unicode special characters available? We can do this fairly easily with character entities and DTDs, but with schemas and Unicode it is problematic.

Eliot Kimber answered

The quickest fix here is not to use Unicode as the storage encoding for your documents; use ISO-8859 (ASCII) instead. This ensures that at least the unparsed data won't contain any Unicode characters and therefore won't break any tools that can't handle Unicode. You may see numeric character references in strings taken from the XML, but they won't break non-Unicode-aware tools.
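As a rough illustration of what "numeric character references in an ASCII-stored document" looks like, here is a minimal Python sketch. It is not part of the original answer; the function name to_ascii_xml and the sample string are hypothetical, and the sketch assumes the text has already been escaped for XML markup characters such as & and <.

```python
def to_ascii_xml(text: str) -> bytes:
    """Encode text as 7-bit ASCII bytes.

    Any character outside ASCII is written as a decimal numeric
    character reference (for example &#8212;), which an XML parser
    turns back into the original character when the file is read.
    """
    return text.encode("ascii", errors="xmlcharrefreplace")


# Hypothetical sample containing a degree sign and an em dash.
sample = "Temperature: 20\u00b0C \u2014 nominal"
stored = to_ascii_xml(sample)
print(stored)  # b'Temperature: 20&#176;C &#8212; nominal'
```

A tool that reads the stored bytes without parsing them sees only ASCII, which is the point of the quick fix.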

That is, while all XML documents are, semantically, made up of Unicode characters, they don't have to be stored in a Unicode encoding: they can be stored in any encoding that the parser you're using can read. After parsing, all XML data is in Unicode. If you are feeding a non-Unicode-capable system with XML data through a parsing-based process, you have to do a Unicode-to-non-Unicode transform. That can be problematic because some Unicode characters may have no match in your target encoding, so you have to either map them to some fallback or escape them in some way (the details will depend on the application the data is going to). This is certainly the case when you map to ASCII and have used non-ASCII characters in your XML.
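The point that the storage encoding disappears after parsing can be checked with a short Python sketch using the standard library's xml.etree.ElementTree. The three sample documents are made up for illustration; they hold the same content stored three different ways.

```python
import xml.etree.ElementTree as ET

# The same one-element document stored in three different ways.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'.encode("iso-8859-1")
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><p>caf\u00e9</p>'.encode("utf-8")
ascii_doc = b'<?xml version="1.0" encoding="US-ASCII"?><p>caf&#233;</p>'

# After parsing, each one yields the same Unicode string.
text_from_latin1 = ET.fromstring(latin1_doc).text
text_from_utf8 = ET.fromstring(utf8_doc).text
text_from_ascii = ET.fromstring(ascii_doc).text

print(text_from_latin1 == text_from_utf8 == text_from_ascii)
```

The bytes on disk differ in all three cases, but a consumer that goes through the parser only ever sees Unicode, which is why a non-Unicode target needs an explicit transform afterwards.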

If you are getting the data from the XML without using a parser (for example, just doing regex matching on well-formed XML documents) then you can extract the data in its original encoding.
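A minimal sketch of that non-parsing route, assuming a made-up document and a hypothetical title element: the regex runs over the raw bytes, so the extracted value stays in the file's storage encoding and no Unicode conversion takes place.

```python
import re

# Hypothetical document stored as ISO-8859-1 bytes.
raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<doc><title>R\u00e9sum\u00e9</title></doc>"
).encode("iso-8859-1")

# A bytes pattern: matching happens on the stored bytes, not on decoded text.
TITLE_RE = re.compile(rb"<title>(.*?)</title>", re.DOTALL)

match = TITLE_RE.search(raw)
title_bytes = match.group(1) if match else b""

print(title_bytes)  # still ISO-8859-1 encoded bytes
```

Note that this approach leaves any numeric character references or entities in the extracted bytes unexpanded, and it is fragile against CDATA sections, comments, and attributes on the element.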

Note that XSLT 2.0 provides an explicit character map mechanism that lets you handle character-set-to-character-set mappings on output, so that you can at least provide fallbacks (for example, mapping Unicode U+2014 (em dash) to "--" when going to ASCII, which has no standard em-dash character). For example, if you set the output encoding for a text output to "ISO-8859" (ASCII), you also need to set up a character map to handle any unmappable characters that will occur in your data.
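The fallback idea behind a character map can be sketched outside XSLT as well. The following Python snippet is only an analogy, not the XSLT 2.0 feature itself; the FALLBACKS table and the function name are hypothetical, and the choice to turn any remaining unmappable character into "?" is an assumption made for the example.

```python
# Hypothetical fallback table: each Unicode character maps to an ASCII stand-in.
FALLBACKS = str.maketrans({
    "\u2014": "--",   # em dash
    "\u2013": "-",    # en dash
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
})


def to_ascii_with_fallbacks(text: str) -> bytes:
    """Apply the fallback table, then encode to ASCII.

    Characters that are neither ASCII nor listed in the table
    are replaced with "?" by the encoder.
    """
    mapped = text.translate(FALLBACKS)
    return mapped.encode("ascii", errors="replace")


print(to_ascii_with_fallbacks("Wait\u2014what\u2026"))
```

In a real pipeline you would want to log or reject the characters that fall through to "?", since silent replacement loses information.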

For rendering there are Unicode fonts that provide at least some form of glyph for all the printing Unicode characters, so apart from the pain of font configuration, there should be no problem rendering Unicode characters with Unicode-aware rendering software.

If you are using a rendition system that does not support Unicode then you have a much bigger problem and should probably not be using XML for the very reasons stated--you simply cannot easily limit the set of characters used.

Editor's comment by Karl Johan Kleist: Eliot must have been writing this before having had the day's first cup of coffee. ISO-8859 (an eight-bit encoding) is of course not ASCII (seven bits).
