[Rails-core] Investigating Unicode. Take 2, with nastities and allegations.

Julian 'Julik' Tarkhanov listbox at julik.nl
Wed Dec 21 14:53:29 GMT 2005


Well, I see that my last email hasn't generated any reaction from the  
Rails core team. It looks like all of them are the happy users of  
"plain text" (which, as we know by now, doesn't exist, but still).

I apologize in advance for the sore bitterness of this message but I  
see that the Rails-core STILL, despite all of the efforts, sees these  
issues as something you can YAGNI away, something "optional",  
"additional" or "plugin-able".

What I will try to prove in this message is that it's not  
"additional" - and more, it's got poisonous teeth and it bites  
painfully. You can forgive Matz, because he has to stay above the  
controversy and cater to the Japanese and Chinese users, and he  
dislikes Unicode (like most of the enlightened Japanese do). But you  
can't forgive _yourself_ because these are _your_ applications. As a
developer, you are accountable.

In my first email I was talking about the low-level mechanics of this  
stuff. They are interesting for a Ruby internals developer or to a  
deep-down Ruby hacker (like Jamis), but I didn't touch the
consequences that Rails gets from this (because I thought I wouldn't
need to draw this knife if there was interest). Turns out we (as
per David and others) are still in the cozy world of "Plain Text"
though, so it's time I opened this can of worms.

Let's skip for a second all those nasty, disgusting "question mark
in a rectangle" characters your users will see when you truncate
their text improperly - this is, after all, the temple of Output, the
browser domain - you sent it out, and then the browser has to cope
with it according to Postel's law. And besides there are not many of
them, right? Just some lousy 5.5 billion potential customers,
right? Uhm, sorry, got a little off-topic here.

Let's move somewhat up the stack from my previous message, into a
different domain - the one you care about. The one you foster and  
cherish. The domain of Data.

Paul Battley gave a good talk at the recent Eureko conference in
Munich about Unicode in Ruby. Among his other slides he had "Doing
mischief with Unicode". Unfortunately I couldn't attend because
Eureko effectively was on my birthday, so I found other fish to fry
on that day - but you can find Paul's presentation in its gory MPEG4
here:

http://www.futurometer.com/320x240x15fps/Battley.Unicode.mov.gz

I will merely expand his presentation into Rails - that's right, we  
will exploit Rails with Unicode. Let's say you are storing your data  
in Unicode (because if you don't you must spend the rest of your days  
in Hell writing Sanskrit in octets on a concrete plate with a dinner  
fork). You think your bases are covered and you did require 'jcode'.  
Except that 'jcode' won't help.

Let's have a look at this nice little snippet.

class User < ActiveRecord::Base
  validates_presence_of :login
end

Looks bulletproof, doesn't it? If a user enters spaces into the form
they are going to get String#strip'ped, and then the text in the  
field is going to be String#blank?, right? So entering all spaces  
into the Login field won't work, right?

Well, it will. The Unicode standard, as of now, comprises 26 (!)  
characters which can be considered "whitespace". 26, that is, when  
used inside a string - when it's at the boundaries it gets 27
(including the zero-width no-break space AKA BOM).

So let's try:

kinda_lovely_login = [
  0x0020,  # White_Space # Zs       SPACE
  0x00A0,  # White_Space # Zs       NO-BREAK SPACE
  0x202F   # White_Space # Zs       NARROW NO-BREAK SPACE
].pack("U*")

And lo and behold...

User.new(:login=>kinda_lovely_login).save!

Nice, isn't it? If you wonder - yes, this is an exploit existing in  
YOUR Rails application RIGHT NOW (albeit a mild one). That one  
application that is sooo-web 2.0, with Ajax and stuff. If you like  
it, you'd better switch to 7-bit ASCII right away before selling it to
anyone (not that you will be successful unless you only sell to the
British and American customers, and as we all know, the Web ends  
there). And "just using UTF-8" won't help, because Unicode is hard.

You wonder WHY that happens? Well... String#strip is Unicode-unaware.  
As are String#empty? and (thusly) String#blank? But don't reach out  
for your fixtures just yet! Because I'm far from finished...
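
Here is a minimal sketch of what a whitespace check that looks at code points instead of bytes could look like. The helper name unicode_blank? and the UNICODE_WHITESPACE table are hypothetical (neither Ruby nor Rails ships them), and the table is only an illustrative subset of the White_Space property, since the exact set differs between Unicode versions.

```ruby
# Illustrative table of Unicode White_Space code points. The exact set
# differs between Unicode versions, so treat this list as an assumption.
UNICODE_WHITESPACE = [
  (0x0009..0x000D).to_a,  # TAB, LF, VT, FF, CR
  0x0020,                 # SPACE
  0x0085,                 # NEXT LINE
  0x00A0,                 # NO-BREAK SPACE
  0x1680,                 # OGHAM SPACE MARK
  (0x2000..0x200A).to_a,  # EN QUAD .. HAIR SPACE
  0x2028, 0x2029,         # LINE SEPARATOR, PARAGRAPH SEPARATOR
  0x202F,                 # NARROW NO-BREAK SPACE
  0x205F,                 # MEDIUM MATHEMATICAL SPACE
  0x3000                  # IDEOGRAPHIC SPACE
].flatten

# Hypothetical helper (not part of Ruby or Rails): true when every code
# point of a UTF-8 string is in the table above. An empty string has no
# code points at all, so it counts as blank too.
def unicode_blank?(utf8_string)
  utf8_string.unpack("U*").all? { |cp| UNICODE_WHITESPACE.include?(cp) }
end

login = [0x0020, 0x00A0, 0x202F].pack("U*")

puts login.strip.empty?     # the built-in check
puts unicode_blank?(login)  # the code-point check
```

The built-in check reports that the login is not empty, because String#strip only removes the leading ASCII space and leaves the two no-break spaces alone; the code-point check reports it as blank.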

Let's move on:

class User < ActiveRecord::Base
  validates_size_of :name, :maximum=>5
end

Ok, this is our User. Now let's see if I can use this application:

my_name = [1070, 1083, 1080, 1082].pack("U*")

In case you wonder - this is my name in Russian, spelled like "Юлик".
The one my mother gave to me.

User.new(:login=>'julik', :name=>my_name).save!

/usr/local/lib/ruby/gems/1.8/gems/activerecord-1.13.2/lib/active_record/validations.rb:711:in `save!': ActiveRecord::RecordInvalid (ActiveRecord::RecordInvalid)

Ahem, wait a minute. You said it was 5, right? And of course you show
it to me in a nice little error message? But I gather that my name is
only 4 letters long, and it fits the boundaries quite nicely. Well,
no. String#size is not Unicode-aware, as we know - so AR just sticks
to that. And my name turns out to be quite a bit longer than what I
thought it might be:

my_name.size
=> 8

Well, sure, two bytes per character. David can stick some of his nice
Danish diacritics in there as well, because they ought to be
double-byte too. And yes, the fact that Ruby uses UTF-8 will nicely conceal
this from you as long as you stay in your cozy "plain-text" land. If
you like it THAT way you'd better stick the following into the form:

"The length of your name decomposed into bytes should be less than,  
or equal to 5".

I bet your users will love that.
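
The byte-versus-letter mismatch can be checked without any library, just by unpacking the same string two ways. This is only a sketch: fits_maximum? is a hypothetical helper, not anything ActiveRecord provides, and counting code points is still an approximation of what a user sees as "letters" (combining marks would count separately).

```ruby
my_name = [1070, 1083, 1080, 1082].pack("U*")

byte_count      = my_name.unpack("C*").size  # one entry per byte
codepoint_count = my_name.unpack("U*").size  # one entry per code point

puts byte_count
puts codepoint_count

# Hypothetical helper: a length check that counts code points rather
# than bytes of the UTF-8 encoded string.
def fits_maximum?(utf8_string, maximum)
  utf8_string.unpack("U*").size <= maximum
end

puts fits_maximum?(my_name, 5)
```

Unpacking with "C*" gives the 8 bytes that tripped the validation, while "U*" gives the 4 code points of the name, which fit a maximum of 5 just fine.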

Now just do a grep on Rails sources for string.size (and friends).  
Enjoy the mess.

This is not "localization of dates and times", gentlemen, this is  
serious BAD. And if you still think these things are not serious and  
Rails can stay plain text, if you still think this can be outsourced
and YAGNI'ed away, if you think it doesn't "touch me because most of
my customers are American anyways", if you think you can sell THIS to
the pointy-haired bosses, or if you think Matz (and other Japanese)
will take care of it for you -- I admire you. Keep countin' 'em bytes.

--
Julian 'Julik' Tarkhanov
me at julik.nl




