Kudzu World

"Programming is an art that wars with its creators"

Tiburon Troubles

7/22/2008

Troubles can sometimes be a good thing. In this case they are, but to avoid the real troubles you need to understand what is going on with Unicode in Tiburon and why. This post will not detail every item, especially since Tiburon is not released yet, but here is an overview of where code compatibility problems will arise.

I recently made a post called Dozens of Delphis in which I proposed a compatibility flag for Tiburon that compiles libraries without string being mapped to Unicode. This should not be interpreted as me not realizing the importance of Unicode. In fact, over a year and a half ago I listed Unicode on my proposed Delphi road map. Unicode is super important; the only reason I propose a compatibility flag is to allow old libraries to be brought forward more easily.

When I first heard that the default string type in Tiburon would be Unicode, alarm bells went off. Pre-Tiburon, string=AnsiString. With Tiburon, string=UnicodeString. The problem is that this breaks most libraries. After taking time to think, I believe this is the right choice. Here is why it breaks so many libraries and affects your code as well.

Binary Buffers

It is very common in Delphi to use strings as binary buffers. Create a string, set its size and poof, you have a resizable buffer! This was done for many years because it is a frequent need and the alternatives were not convenient. In the past few years better alternatives for dynamic binary buffers have appeared as the language and VCL have advanced, but both habits and old code remain.

This use of string absolutely will not work in Tiburon, because each character now occupies two bytes rather than one. Changing these areas to use AnsiString may provide a workaround, but the better solution is to use an object that holds binary data and allows dynamic resizing.
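To make the buffer problem concrete, here is a minimal sketch. It is written in Python rather than Delphi, purely because the byte arithmetic is the same in any language. The sample bytes and variable names are made up for illustration and do not come from any real library.

```python
# Illustrative sketch (Python, not Delphi): why "string as binary buffer"
# stops working once each character is a 16-bit unit instead of one byte.

raw = bytes([0x00, 0x7F, 0x80, 0xFF])  # four arbitrary binary bytes (sample data)

# Old habit: treat every byte as one character of a string.
# latin-1 maps byte value N to code point N, so nothing is lost yet.
as_text = raw.decode("latin-1")

# In a UTF-16 string each of those characters occupies two bytes,
# so the memory layout no longer matches the original binary data.
utf16_layout = as_text.encode("utf-16-le")
print(len(raw), len(utf16_layout))  # 4 8

# A dedicated, resizable byte container keeps binary data as binary.
buf = bytearray()
buf.extend(raw)
buf.extend(b"\x01\x02")  # grows on demand, no character semantics involved
print(len(buf))  # 6
```

The point is only that a byte container and a text string are different things once characters stop being one byte wide.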

Conversion to and from ASCII

There are times when ASCII needs to be read or written. For example, plain text files and most network protocols require ASCII, or at least some known code page, in sequences of one byte per character. Pre-Tiburon, this conversion between string and bytes was done simply by reading or writing the bytes of the string, or its memory location, directly. Typical cases are reading and writing a .bat file, or sending a command to an SMTP server using Indy.

In Tiburon you cannot do this, as strings are now UTF-16, meaning that the string consists of a series of 16-bit code units. To convert to and from ASCII or another single-byte encoding (e.g. for German, French or Spanish) you need to call conversion functions and tell them which encoding you want. It is really the right way to do things, but of course it will affect older code and requires you to think more carefully about your code instead of treating the whole world as ASCII, or "stuffed ASCII" as was the case with many languages other than English.
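The idea of naming the encoding at the boundary can be sketched as follows. This is Python, not Delphi, and the helper to_wire and the host name in the sample command are hypothetical, chosen only to illustrate the principle.

```python
# Illustrative sketch (Python, not Delphi): text becomes bytes only through
# an explicit, named encoding. to_wire is a hypothetical helper.

def to_wire(text: str, encoding: str = "ascii") -> bytes:
    """Convert text to bytes using an explicitly named encoding."""
    return text.encode(encoding)

command = "EHLO example.com\r\n"  # sample SMTP-style command (made up)
wire = to_wire(command)
print(len(command), len(wire))  # 18 18, one byte per character

# The UTF-16 form of the same text is twice as long, so writing the
# string's raw memory to a socket or file would send the wrong bytes.
print(len(command.encode("utf-16-le")))  # 36

# Non-ASCII text needs a code page that contains its characters.
german = "Grüße"
print(len(to_wire(german, "cp1252")))  # 5

try:
    to_wire(german)  # plain ASCII cannot represent the umlaut or the sharp s
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```

A failed conversion is reported at the point where the encoding is chosen, rather than surfacing later as corrupted output.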

Slower

Since characters in UTF-16 are two bytes instead of one, many string operations are slower simply because there is more data to move. In some cases a single 16-bit unit is not enough, and it signals that a second unit follows. So Unicode routines cannot reliably locate the fourth character just by multiplying by 2 and jumping to that byte position. This also slows things down. For most string operations you will not notice a difference, but for large search-and-replace type code, for example, the difference can be noticeable.

The solution, though, is the one used by .NET, Java and others: do not use strings for all operations. .NET has many objects for working with strings, such as StringBuilder for building and manipulating them. Strings in .NET are immutable, which is not true in Tiburon (else nearly everything would be hopelessly broken); however, for large string modifications you should consider using task-appropriate objects instead of the string itself.

In the end, if you adjust your code in such places, it is still likely to run a bit slower, but not enough to make a difference.
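The indexing problem can be shown with a short sketch, again in Python rather than Delphi. The sample text and the helper naive_unit_at are made up for illustration. The sample contains one character outside the 16-bit range (U+1D11E, a musical G clef), which UTF-16 stores as two 16-bit units.

```python
# Illustrative sketch (Python, not Delphi): "multiply the index by 2"
# breaks as soon as one character needs two 16-bit units.

text = "ab\U0001D11Ecd"  # 'a', 'b', U+1D11E (musical G clef), 'c', 'd'
units = text.encode("utf-16-le")

print(len(text))         # 5 characters
print(len(units) // 2)   # 6 sixteen-bit units

def naive_unit_at(data: bytes, index: int) -> int:
    """Hypothetical lookup that assumes every character is exactly two bytes."""
    start = index * 2
    return int.from_bytes(data[start:start + 2], "little")

# The fourth character (zero-based index 3) is really 'c' ...
print(hex(ord(text[3])))  # 0x63

# ... but the naive lookup lands on the second half of the G clef.
fourth = naive_unit_at(units, 3)
print(hex(fourth))                   # 0xdd1e
print(0xDC00 <= fourth <= 0xDFFF)    # True: this is a trailing surrogate
```

A correct routine has to scan from the start and check each unit for the surrogate range, which is where the extra cost comes from.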

Benefits

There are some benefits to Unicode besides being friendly to all languages, and not just Latin character based languages.

Many Languages

AnsiStrings could mix German and English, or Spanish and English. English was the magic ingredient in the combination. But try to mix any two non-English languages and there were often problems. Many Western European languages could be mixed because they shared accented characters, such as umlauts. But mixing two non-English languages was in most cases simply not possible. Since the string did not store which language it contained, if you handled multiple languages you also needed to keep track of the language of each string yourself.

Unicode takes care of all of this, and allows you to mix characters from any language in a single string, and it keeps track of all of this for you.
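A small sketch of the difference, in Python rather than Delphi. The sample strings and the helper fits are made up for illustration. cp1252 and cp1251 are the Western European and Cyrillic Windows code pages.

```python
# Illustrative sketch (Python, not Delphi): one byte, two meanings,
# and a mixed-language string that no single code page can hold.

same_byte = bytes([0xE4])
western = same_byte.decode("cp1252")    # a with umlaut (U+00E4)
cyrillic = same_byte.decode("cp1251")   # Cyrillic small letter de (U+0434)
print(western == cyrillic)  # False: the byte alone does not say which one

def fits(text: str, encoding: str) -> bool:
    """Hypothetical helper: can this text be stored in the given encoding?"""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

mixed = "Grüße и привет"  # German and Russian in one string (sample text)

print(fits("Grüße and hello", "cp1252"))   # True: English mixes with anything
print(fits("привет and hello", "cp1251"))  # True
print(fits(mixed, "cp1252"))               # False: no Cyrillic in cp1252
print(fits(mixed, "cp1251"))               # False: no umlaut in cp1251
print(fits(mixed, "utf-16-le"))            # True
```

With a single-byte code page the program has to remember which code page each string uses; with Unicode the characters identify themselves.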

Faster with Windows API

The Windows API has long been Unicode internally. Sure, there are ANSI-based versions of Windows API calls, which in fact nearly all Delphi developers are still using. But Windows has to convert that AnsiString to UTF-16, call the Unicode function internally, and convert any results back to AnsiString. This not only takes time, but also uses memory.

With Tiburon, since it talks UTF-16 Unicode, all Windows API calls now are done without the need for Windows to marshal the strings to and fro.


Comments: 

Patrick van Logchem 2008-07-22 11:22:43
Another problem is memory usage: applications working with large amounts of strings will consume twice the memory with UTF16LE encoding compared with strings in some Ansi codepage.

Using UTF8 could be more efficient (depending on the sort of text being encoded), but that would still require implicit transcoding to UTF16LE when calling WinAPIs and still suffers from the variable-length character encoding problems you mentioned.
So using UTF16LE as the default UnicodeString encoding is a compromise we'll just have to live with.

But just like you, I really would have liked to have a switch in Tiburon (much like the LONGSTRINGS directive) to influence the meaning of the 'string' alias.
Recompiling old code would then need only one change: {$STRINGTYPE AnsiString} (or something along those lines).
This would also enable new code to use UTF8, or any other encoding you'd prefer (!)

Meanwhile the RTL+VCL should use UnicodeString explicitly, so there can be no misunderstandings about that.
Any mismatches between caller and called string-types could be resolved by the compiler with an implicit string-transformation (accompanied with an optional compiler warning).

David Intersimone 2008-07-22 05:50:46
Good post, letting developers understand what to do with string handling. The best thing is that developers have choices in how they program their applications. Mostly, all of the things you mention are "much ado about nothing". You mentioned a few things:

Writing strings to files - you have easy choices to write out strings. See my blog post about "LoadFromFile" and "SaveToFile" - if you manipulate the default code page you really have to do nothing. One parameter and you can choose your encoding - global search and replace will easily update your programs in this case.

The VCL and RTL have been modified to do many of the conversions for you. Yes, there will be some cases where you have to take a look at your code. Good thing, the compiler has been modified to add more warnings for suspicious constructs that developers can look at (you can change the warnings to errors if you like). These warning mechanisms are not only for today but as we move forward to 64-bit and other worlds (can you say Klingon?).

Patrick commented about memory use - okay if you use UTF16, but you can also choose what encodings you want for your strings in your variables.

Slower? Benchmarks for "real" applications? In interactive applications or database applications I will "bet" that the speed will be hidden while the application waits for human interaction or database results to stream back.

Binary Buffers? Use strings as buffers? Use memory as buffers! I am shocked that anyone would use strings for anything else other than strings :)

Thanks for adding notes about things that developers should be aware of. Troubles? In River City? For real programmers, I doubt it.

David Intersimone "David I"
Chief Evangelist, CodeGear.
davidi@codegear.com

