Thursday, April 28, 2011

Windows Invariant Culture Puzzle

I have a question about the windows invariant culture.

Succinctly, my question is:

does there exist any pair of characters c1, and c2 such that:

lower(c1, invariant) =latin-general lower(c2, Invariant)

but

lower(c1, invaraint) !=invariant lower(c2, invariant)

Background:

I need to store an invariant lower case string (representing a file name) inside of SQL Server Compact, which does not support windows invariant collations.

Ideally I would like to do this without having to pull all of my comparison logic out of the database and into my app.

The idea I had for solving this was to store 2 versions of all file names: one that is used for displaying data to the customer, and another that is used for performing comparisons. The comparison column would be converted to lower case using the windows invariant locale before storing it in the database.

However, I don't really have any idea what kind of mappings the invariant culture does, other than the fact that its what windows uses for comparing file names.

I'm wondering if it is possible to get false positives (or false negatives) as a result of this scheme.

That is, can I produce characters (previously lower cased using the invariant culture) that compare equal to each other using the latin-general-1 case insensitive SQL server collation, but do not compare equal to each other under the invariant culture?

If this can happen, then my app may consider 2 files that Windows thinks are different as being the same. This could ultimately lead to data loss.

NOTE:

I am aware that it is possible to have case sensitive files on Windows. I don't need to support those scenarios, however.

From stackoverflow
  • why don't you convert filenames to ASCII? In your situation can filenames contain non-ascii characters?

    Scott Wisniewski : Yes. International users, for example, may use non ascii characters when they name their files.
    Mitch Wheat : Could you perhaps remove the problem, by restricting international users to ascii?
    Windows programmer : Depending on what he's storing, sure he could remove the problem, by restricting international users (and restricting Egyptian users too) to Arabic. But he said he's storing filenames, and lots of filenames aren't in Arabic.
  • Why not URL-encode the utf8 byte representation of the filename to get an ascii version which can be converted back to unicode easily without any possible loss?

    Scott Wisniewski : The problem isn't with encoding the data. I can store unicode data in the database just fine. The issue I have is that I need to compare file names in the database using the same rules that windows uses for comparing file names. I want to know if there are any characters where my scheme will break.
    Aaron Watters : Then I assume you need the order preserved -- url encoding will work fine if you are only interested in equality.
    Scott Wisniewski : What I need is for SQL server to compare file names in the same way that Win32 does. Url encoding won't help with that.
  • "However, I don't really have any idea what kind of mappings the invariant culture does, other than the fact that its what windows uses for comparing file names."

    I didn't think Windows used the invariant culture when comparing file names. For example if my culture is English then I can name two separate files turkish and TURKÄ°SH, but if someone's culture is Turkish then I hope Windows won't let them do that.

    Joshua : Since file name compare is in the kernel I'll bet it will let you do that in Turkish environment.
    Windows programmer : The kernel also does conversions between Unicode and ANSI/OEM, using the system default culture^H^H^H^H^H^H^H code page. Turkish has its own code page.
    Scott Wisniewski : I'm more concerned with what Win32 does, then what the kernel does. NTFS allows more stuff than what win32 does, because it supports things like posix. Win32 is (usually) case insensitive.
  • By looking through the answers to this question:

    win32-file-name-comparison

    which I asked a while back.,

    I found an indirect link the following page:

    http://msdn.microsoft.com/en-us/library/ms973919.aspx

    It suggests using an ordinal comparison after an invariant upper case as the best way to mimic what the file system does.

    So I think if I use as "case sensitive, accent sensitive" collation in the database, and do a "upper" using the invariant local before storing the files I should be ok.

    Does anyone know if there are any problems with that?

0 comments:

Post a Comment