Tiburon Troubles
7/22/2008
Troubles can sometimes be a good thing, and in this case they are. To avoid the real troubles, though, you need to understand what is going on with Unicode in Tiburon and why. This post will not detail every item, especially since Tiburon is not released yet, but here is an overview of where the code compatibility problems will be.
I recently made a post called Dozens of Delphis in which I proposed a compatibility flag for Tiburon that would compile libraries without mapping string to Unicode. This should not be interpreted as me failing to realize the importance of Unicode. In fact, over a year and a half ago I listed Unicode on my proposed Delphi road map. Unicode is super important; the only reason I propose a compatibility flag is to allow old libraries to be brought forward more easily.
When I first heard that the default string type in Tiburon would be Unicode, alarm bells went off. Pre-Tiburon, string=AnsiString. With Tiburon, string=UnicodeString. The problem is that this breaks most existing libraries. After taking time to think it over, I believe this is the right choice. Here is why it breaks so many libraries and affects your code as well.
Binary Buffers
It is very common in Delphi to use strings as binary buffers. Create a string, set its size and poof, you have a resizable buffer! This was done for many years because it is a frequent need and the alternatives were not convenient. In the past few years better alternatives for dynamic binary buffers have appeared as the language and VCL have advanced, but both habits and old code remain.
This use of string absolutely will not work in Tiburon, since each character is now two bytes instead of one. Changing these areas to use AnsiString may provide a workaround, but the better solution is to use an object that holds binary data and can be resized dynamically.
Conversion to and from ASCII
There are times when ASCII needs to be read or written. For example, plain text files and most network protocols require ASCII, or at least some known code page, in sequences of one byte per character. Pre-Tiburon, this conversion between string and bytes was done simply by reading or writing the bytes of the string directly at its memory location. Typical cases are reading and writing a .bat file, or sending a command to an SMTP server using Indy.
In Tiburon you cannot do this, as strings are now UTF-16, meaning that the string consists of a series of 16-bit code units. To convert to and from ASCII or another single-byte encoding (e.g. the code pages used for German, French, Spanish, etc.) you need to call conversion functions and tell them which encoding you want. It is really the right way to do things, but of course it will affect older code and requires you to think more carefully about your code instead of treating the whole world as ASCII, or "stuffed ASCII" as was the case with many languages other than English.
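To make the "tell them what encoding you want" point concrete, here is a minimal sketch. It is written in Python purely to show what happens at the byte level; it is not Delphi code and says nothing about the names of the Tiburon conversion routines. The helper names and the SMTP command are made up for illustration.

```python
# Python is used here only to show byte-level behaviour; this is not Delphi code.

def to_wire(text, encoding):
    """Convert text to bytes using an explicitly named encoding."""
    return text.encode(encoding)


def fits(text, encoding):
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


command = "HELO example.com\r\n"  # hypothetical SMTP command, 18 characters

ascii_bytes = to_wire(command, "ascii")      # one byte per character
utf16_bytes = to_wire(command, "utf-16-le")  # two bytes per character for this text

print(len(command), len(ascii_bytes), len(utf16_bytes))  # prints "18 18 36"

# Reading the bytes back needs the same explicit choice of encoding.
assert ascii_bytes.decode("ascii") == command

# A German word fits a Western European code page but is not plain ASCII.
print(fits("Grüße", "cp1252"), fits("Grüße", "ascii"))  # prints "True False"
```

The bytes that go to the file or socket depend entirely on the encoding named at the conversion point, which is exactly the decision that old code skipped by writing the string's memory directly.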
Slower
Since characters in UTF-16 are two bytes instead of one, many string operations are obviously slower. In some cases a single 16-bit unit is not enough, and it signals that a second unit follows (a surrogate pair). So Unicode routines cannot reliably locate the fourth character just by multiplying by 2 and jumping straight to that byte offset. This also slows things down. For most string operations you will not notice a difference, but for large search and replace type code, for example, the difference can be noticeable.
The solution, though, is the one used by .NET, Java and others: do not use strings for all operations. .NET has many objects for working with strings, such as StringBuilder for building and manipulating them. Strings in .NET are immutable, which is not true in Tiburon (else nearly everything would be hopelessly broken); however, for large string modifications you should consider using task-appropriate objects instead of the string itself.
In the end, if you adjust your code in such places your code is still likely to run a bit slower, but not enough to make a difference.
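The point about not being able to find the fourth character by simple multiplication can be checked with a small sketch. Again this is Python used only as a calculator for UTF-16, not Delphi code, and the helper names are invented for the example. The test character is U+1D11E (a musical G clef), which lies outside the range a single 16-bit unit can hold.

```python
# Python is used here only to inspect UTF-16 code units; this is not Delphi code.

def code_units(text):
    """Split text into its UTF-16 code units, returned as 16-bit integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if the code unit is one half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF


plain = "abcd"
with_clef = "ab\U0001D11Ecd"  # five characters, the third needs two code units

print(len(code_units(plain)))      # prints "4"
print(len(code_units(with_clef)))  # prints "6"

# Jumping to unit index 3 (byte offset 3 * 2) lands in the middle of the pair,
# not on the fourth character, which is "c".
unit = code_units(with_clef)[3]
print(hex(unit), is_surrogate(unit))  # prints "0xdd1e True"
```

For text that never leaves the common range, the fixed-offset arithmetic still works, which is why most code will not notice; a routine that must be correct for every string has to scan for these pairs.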
Benefits
There are some benefits to Unicode besides being friendly to all languages, and not just Latin character based languages.
Many Languages
AnsiStrings could mix German and English, or Spanish and English. English was the magic ingredient in the combination. But try to mix any two non-English languages and there were often problems. Many Western European languages could be mixed because they shared a code page that included their accented characters, umlauts for example. But mixing characters from two non-English languages that did not share a code page simply was not possible in most cases. Since the string did not store which language was in it, if you handled multiple languages you also needed to keep track of which one each string used.
Unicode takes care of all of this, and allows you to mix characters from any language in a single string, and it keeps track of all of this for you.
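A short sketch of why the old single-byte strings needed that external bookkeeping, and why a Unicode string does not. It is Python used only to look at bytes, not Delphi code; the two code pages (Windows-1252 for Western European and Windows-1251 for Cyrillic) are simply my choice of example, and the helper name is made up.

```python
# Python is used here only to compare code pages; this is not Delphi code.

def can_encode(text, encoding):
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


raw = bytes([0xE4])  # one byte, as it would sit in a single-byte string

# The same byte is a different letter depending on which code page you assume.
print(raw.decode("cp1252"))  # prints "ä"
print(raw.decode("cp1251"))  # prints "д"

# A string holding both letters fits neither code page...
mixed = "ä and д"
print(can_encode(mixed, "cp1252"), can_encode(mixed, "cp1251"))  # prints "False False"

# ...but it round-trips through UTF-16 with no extra language tracking.
assert mixed.encode("utf-16-le").decode("utf-16-le") == mixed
```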
Faster with Windows API
Windows has long been Unicode internally. Sure, there are ANSI versions of the Windows API calls, which in fact nearly all Delphi developers are still using. But internally Windows has to convert that AnsiString to UTF-16, call the Unicode function, and convert any results back to AnsiString. This not only takes time, but also uses memory.
With Tiburon, since it talks UTF-16 Unicode, all Windows API calls now are done without the need for Windows to marshal the strings to and fro.

