Wednesday, April 6, 2011

How do I parse and store UTF-8 data into a tab-separated-file in Ruby?

I have a hash named hsh whose values are UTF-8 encoded. For example:

hsh = {:name => some_utf_8_string, :text => some_other_utf_8_string}

I am currently doing the following:

$KCODE = "UTF8"

File.open("save.tsv", "w") do |file|
  # Replace embedded tabs with spaces so they cannot break the column layout
  file.puts hsh.values.map { |x| x.to_s.gsub("\t", ' ') }.join("\t")
end

But this croaks randomly, I think because some of the multibyte contents sort of match "\t". Is there a recommended string I can use instead of "\t", and is there a better way of doing the above?

Thanks

From stackoverflow
  • If your data is valid UTF-8, there is no way for a tab character to "sort of" match part of a multibyte sequence (this is one of the advantages of UTF-8 over some other multibyte encodings). Can you go into more detail about what you mean by "croak"?

    rampion : Logan's right. In UTF-8 there are three kinds of bytes: the ones covering 7-bit ASCII (0XXXXXXX), the first byte of a multi-byte character (110XXXXX, 1110XXXX, 11110XXX), and a follow-up byte of a multi-byte character (10XXXXXX). Tab (00001001 = 0x09) only matches itself, not any part of a multi-byte sequence.
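
To make the byte patterns above concrete, here is a minimal Ruby sketch. It assumes Ruby 1.9 or later, where strings carry an encoding (the $KCODE setting in the question belongs to Ruby 1.8). The helper names utf8_byte_kind and tsv_line are made up for illustration and are not part of any library. It classifies each byte of a sample string and checks that the only byte equal to 0x09 is the real tab.

```ruby
# Classify one byte according to the UTF-8 bit patterns described above.
# (utf8_byte_kind is a hypothetical helper, not a library method.)
def utf8_byte_kind(byte)
  if    (byte & 0b1000_0000) == 0b0000_0000 then :ascii         # 0XXXXXXX
  elsif (byte & 0b1100_0000) == 0b1000_0000 then :continuation  # 10XXXXXX
  elsif (byte & 0b1110_0000) == 0b1100_0000 then :lead          # 110XXXXX
  elsif (byte & 0b1111_0000) == 0b1110_0000 then :lead          # 1110XXXX
  elsif (byte & 0b1111_1000) == 0b1111_0000 then :lead          # 11110XXX
  else :invalid
  end
end

# Build one TSV line from a hash, replacing embedded tabs with spaces
# (the same approach as in the question; tsv_line is a hypothetical name).
def tsv_line(hsh)
  hsh.values.map { |v| v.to_s.gsub("\t", " ") }.join("\t")
end

# "cafe" with an accented e, a tab, then three CJK characters.
sample = "caf\u00e9\t\u65e5\u672c\u8a9e"

bytes = sample.bytes.to_a
kinds = bytes.map { |b| utf8_byte_kind(b) }

# Indexes of every byte equal to 0x09 (tab).
tab_positions = bytes.each_with_index.select { |b, _| b == 0x09 }.map(&:last)

puts tab_positions.inspect   # => [5]
puts kinds[5].inspect        # => :ascii
puts sample.valid_encoding?  # => true
```

If valid_encoding? returns false for one of your values, the data is not valid UTF-8 in the first place, and that may be a more likely cause of the failure than the tab separator.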
