#1328 Str.in not working properly with Unicode

Akcelisto Wed 24 Nov 2010

using util

class TestCsv : Test{
  Void test(){
    verifyEq(CsvInStream("Русское слово".in).readAllRows[0][0],"Русское слово")
  }
}

sys::TestErr: Test failed: "CAA:>5 A;>2>" [sys::Str] != "Русское слово" [sys::Str]

katox Wed 24 Nov 2010

What kind of system do you use? Are you sure the encoding is UTF-8?

Akcelisto Wed 24 Nov 2010

XP, encoding of TestCsv.fan is UTF-8.

ivan Wed 24 Nov 2010

Yeah, I can see this issue too. It looks like something is wrong with CsvInStream:

fansh> using util
Add using: using util
fansh> str := "привет"
fansh> str.in.readLine
fansh> CsvInStream(str.in).readAllRows[0][0]

Tomorrow I'll try to debug it thoroughly

ivan Thu 25 Nov 2010

Huh, the problem is in Str.in, because using CsvInStream on top of File.in produces the correct result. The real issue (and AFAIR I've already seen it either on the forum or somewhere in the docs) is that when StrInStream (the Java implementation of InStream created by Str.in) reads a single byte, it consumes a whole char from the Str (which is 2 bytes in your case) and returns it as a single byte by masking it with 0xFF. So when CsvInStream reads bytes from the underlying stream, it only gets the low 8 bits of each char and therefore produces wrong results.
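
To make the masking concrete, here is a minimal sketch in Java (chosen because StrInStream is the Java implementation). It is not the actual Fantom source: the class `MaskDemo` and the method `maskLowByte` are hypothetical names, and the code only imitates the behavior described above, keeping the low 8 bits of each char and treating the result as a byte stream.

```java
import java.nio.charset.StandardCharsets;

class MaskDemo {
    // Hypothetical helper: emit one byte per char, keeping only the low 8 bits,
    // then decode those bytes as UTF-8 the way a wrapping stream would.
    static String maskLowByte(String s) {
        byte[] bytes = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            bytes[i] = (byte) (s.charAt(i) & 0xFF);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Русское слово" written with Unicode escapes to avoid source-encoding issues
        String s = "\u0420\u0443\u0441\u0441\u043a\u043e\u0435 \u0441\u043b\u043e\u0432\u043e";
        // 'Р' is U+0420, so its low byte is 0x20 (a space); 'у' is U+0443, giving 'C'
        System.out.println("[" + maskLowByte(s) + "]"); // prints "[ CAA:>5 A;>2>]"
    }
}
```

Apart from the leading space (presumably trimmed during CSV parsing), this matches the garbled value in the failed test, "CAA:>5 A;>2>".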

ivan Thu 25 Nov 2010

There's a fairly easy workaround if you want to use CsvInStream over a Str: use str.toBuf.in instead of str.in:

fansh> CsvInStream("привет".toBuf.in).readAllRows.first.first
fansh> CsvInStream("привет".in).readAllRows.first.first

Akcelisto Thu 25 Nov 2010

Thanks. You helped me.

brian Fri 26 Nov 2010

Promoted to ticket #1328 and assigned to brian

I thought this was all pretty well covered in the test suite, but I guess not. It looks like the real problem is that Str.in does not support Unicode correctly, which is strange because I am using Unicode strings all over the place in the SkySpark test suite.

brian Fri 26 Nov 2010

Renamed from **[bug] CsvInStream dont read properly from Str with russian letters** to **Str.in not working properly with Unicode**

katox Fri 26 Nov 2010

As @ivan noted on IRC, the problem actually lies in Str.in's inability to supply bytes correctly. The readChar method is OK, but when reads go through a decoder the read method is used, and for Str.in that one truncates each char to a single byte using an & 0xff bitmask.

brian Mon 3 Jan 2011

Ticket resolved in 1.0.57

I made two changes:

  1. fixed Str.in to correctly work when wrapped by another InStream
  2. changed Str.in to disallow binary reads

The second change is a breaking change, but I think it is much safer behavior, and it is consistent with how StrBuf works when attempting binary writes.

If you are using Str.in to read binary data, then the fix is to convert into a binary buffer first using the UTF-8 encoding:

str.in => str.toBuf.in
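
As a rough Java analogy for why encoding into a buffer first helps (this is not Fantom's implementation, and `Utf8RoundTrip` and its methods are hypothetical names): once the string is encoded as UTF-8, a wrapping decoder sees every byte of each multi-byte char and can reconstruct the original text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Encode to UTF-8 bytes, then read them back through a char decoder,
    // roughly what a reader wrapped around a real byte stream does.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder out = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int ch;
            while ((ch = r.read()) != -1) {
                out.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "\u043f\u0440\u0438\u0432\u0435\u0442"; // "привет"
        System.out.println(s.length());              // 6 chars
        System.out.println(utf8Length(s));           // 12 bytes in UTF-8
        System.out.println(roundTrip(s).equals(s));  // true
    }
}
```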
