Using Unicode as the new AVEVA E3D™ Internal Format

PML Customisation User Guide

The major design decision for the Unicode conversion of AVEVA E3D™ was to use 32-bit Unicode Scalar (US) values instead of ASCII codes as its integer character representation (holding one US character code per array element), and to use UTF8 format for its character byte strings, with up to 4 bytes representing one character.
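The difference between the two representations can be illustrated with a short sketch. It is written in Python rather than PML, purely as an illustration of the encoding itself; the sample string and variable names are made up for this example and are not part of any AVEVA E3D™ interface.

```python
# Illustrative only: Python stands in for the two representations described above.
# The sample string holds characters needing 1, 2, 3 and 4 bytes in UTF8.
text = "A£€\U0001F600"

# Integer representation: one Unicode scalar per array element.
scalars = [ord(ch) for ch in text]

# Byte-string representation: UTF8, using 1 to 4 bytes per character.
utf8 = text.encode("utf-8")

# Number of UTF8 bytes needed by each individual character.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]

print(scalars)         # [65, 163, 8364, 128512]
print(len(scalars))    # 4 characters
print(len(utf8))       # 10 bytes
print(bytes_per_char)  # [1, 2, 3, 4]
```

The scalar array always has one element per character, whereas the length of the byte string depends on which characters it holds.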

The list below describes a few important properties of Unicode Scalars and the UTF8 format.

1. A Unicode scalar is a code, held here as a 32-bit integer, which uniquely represents a single Unicode character. Scalar values range from 0 to 0x10FFFF (excluding the surrogate range 0xD800 to 0xDFFF), so the 32-bit range copes uniquely with all the world's character sets.

2. For the ASCII character set (codes 0 to 127, of which 32 to 126 are the printable characters) the codes are equal to the Unicode scalar codes.

3. UTF8 encoding needs 1, 2, 3 or 4 bytes to represent a Unicode scalar. With UTF8 you must therefore distinguish clearly between the number of characters held and the number of bytes needed to represent them, as the two cannot be assumed to be the same.

4. Within UTF8, the ASCII characters can always be found by a simple byte-by-byte search in either direction.

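5. UTF8 allows the first byte of any adjacent character to be found by a simple byte-by-byte search in either direction, and every first byte yields the number of bytes in the character. This works because the subsequent (continuation) bytes of a character always have the bit pattern 10xxxxxx, which no first byte uses.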

6. Once the first byte of a UTF8 character has been found (which could itself be an ASCII character, e.g. $, /, &, space or ~), any subsequent bytes of that character are never ASCII bytes. So when you find an ASCII byte, it is a genuine character and not part of another character.

1974 to current year. AVEVA Solutions Limited and its subsidiaries. All rights reserved.