Unicode UTF-8 does not work in Windows console. #187

@MarekKnapek

Unicode in the form of UTF-8 does not work in the Windows console. Unicode in the form of UCS-2, which was extended and renamed to UTF-16 around the Windows 2000 timeframe, works just fine.

nob.h already produces some Unicode output: for example, trace_temp_alloc prints an arrow. It may use more Unicode strings, but so far I have not noticed any. In the future it might use the "cancer logger" with colors and Unicode stuff.

Thus I would like to open a discussion on how to support Unicode on Windows. Windows is completely capable of using Unicode, just not in the form of UTF-8; it has supported UTF-16 for more than two decades already. This issue is about console output; there might be a separate issue about file names. There are multiple ways to add Unicode console output. I would call them the "C way" and the "Windows way".

The "C way":

  • The C language has had the wprintf function since C95 (Amendment 1 to C90), so it is also part of C99.
  • That would mean wrapping all printf calls in some macro that conditionally adds an L prefix to all C-string literals and switches from printf to wprintf.
  • This would solve only the "simple" cases of primitive printf calls with literals only. More advanced calls that construct the format string piece-wise or construct string arguments would also need to be changed to use wchar_t based strings instead of char based ones.
  • Basically, it means changing every string ever used from char to something like nob_char_t, where nob_char_t would be conditionally typedef'd to char or wchar_t.
  • The regular old printf function also supports the %ls format specifier, meaning the format could remain a plain old char* and only the argument would be a wchar_t* based string.
  • Basically, all nob code and all user code that uses nob would have to be wchar_t aware. This is a viral change (it spreads like a virus). This is the same thing we all did in the 90s with the _T or _TEXT macro to support both Windows 95 and Windows NT.
  • I would call this a lot of work for little benefit.
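
To make the "viral" part concrete, here is a minimal sketch of what the "C way" could look like. All names in it (NOB_UNICODE, nob_char_t, NOB_T, NOB_S, nob_snprintf, describe_target, describe_wide_target) are hypothetical and chosen for illustration only; nothing like this exists in nob.h today. Note that in standard C a wide format string needs %ls for a wchar_t* argument, so even the format specifier has to go behind a macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical names for illustration only; none of these exist in nob.h. */
#ifdef NOB_UNICODE
typedef wchar_t nob_char_t;
#define NOB_T(s)     L##s        /* "text" becomes L"text" */
#define NOB_S        L"%ls"      /* a wide format needs %ls for wchar_t* */
#define nob_snprintf swprintf
#else
typedef char nob_char_t;
#define NOB_T(s)     s
#define NOB_S        "%s"
#define nob_snprintf snprintf
#endif

/* The viral part: every function that touches text must take nob_char_t,
   and every literal must be wrapped in NOB_T. */
static int describe_target(nob_char_t *out, size_t cap, const nob_char_t *target)
{
    assert(out != NULL && target != NULL);
    return nob_snprintf(out, cap, NOB_T("building ") NOB_S, target);
}

/* The %ls alternative: the format stays char based,
   only the argument is a wchar_t based string. */
static int describe_wide_target(char *out, size_t cap, const wchar_t *target)
{
    assert(out != NULL && target != NULL);
    return snprintf(out, cap, "building %ls", target);
}
```

A caller would write describe_target(buf, sizeof buf / sizeof buf[0], NOB_T("main.c")), and user code would need the same wrapping for every string it passes in. Two caveats on the sketch: snprintf and swprintf report truncation differently, and the %ls conversion depends on the current locale for non-ASCII characters.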

The "Windows way":

  • Console output is done via the WriteConsoleW function. This function writes directly to the console (if one is even available) and does not follow redirection to a file if the user specifies one when they execute the command.
  • The WriteFile function could be used instead of WriteConsoleW if redirection is detected. In that case I would suggest UTF-8 output instead of UTF-16.
  • Redirection could be detected by a combination of the GetStdHandle and GetFileType functions. The result is one of three variants: 1) stdout does not exist. 2) stdout exists and is the console. 3) stdout exists and is redirected to a file (or a pipe).
  • I would suggest keeping everything as UTF-8 and only changing from printf to sprintf for formatting, then using a different function for the final output to the screen (or to the redirected output).
  • Only the new final output function would need some #ifdef Windows love. This would be an improvement over the first "C way" because the change is not "viral": it does not need to affect the entire mindset of a nob user. They could continue to use their UTF-8 strings everywhere, and at the end, just before the final console output, we would translate to UTF-16 for them.
  • The WideCharToMultiByte and MultiByteToWideChar functions exist exactly for this purpose.
  • Another change that would be needed is in the directory walker. Currently it uses the ANSI variant and is thus unable to walk directories with Unicode names. It would need to be changed from FindFirstFileA to FindFirstFileW, with the found names immediately converted from UTF-16 to UTF-8 so the "viral infection" does not spread.
  • I would call this not a lot of work for a big benefit.
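
The console output part of the "Windows way" could be sketched roughly as follows. The function names nob_write_stdout_utf8 and nob_print_utf8 and the 1024-byte buffer are assumptions made for this example; they are not part of nob.h. The extra GetConsoleMode check is also my addition, because GetFileType reports FILE_TYPE_CHAR for some character devices that are not the console. On non-Windows systems the sketch simply falls back to fwrite.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>
#endif

/* Hypothetical helper, not part of nob.h: write n bytes of UTF-8 to stdout.
   Returns 1 on success, 0 on failure. */
static int nob_write_stdout_utf8(const char *utf8, size_t n)
{
    assert(utf8 != NULL);
#ifdef _WIN32
    DWORD mode = 0, written = 0;
    HANDLE h;
    fflush(stdout); /* keep ordering with earlier stdio output */
    h = GetStdHandle(STD_OUTPUT_HANDLE);
    if (h == NULL || h == INVALID_HANDLE_VALUE) return 0; /* 1) no stdout */
    if (n == 0) return 1;
    if (GetFileType(h) == FILE_TYPE_CHAR && GetConsoleMode(h, &mode)) {
        /* 2) real console: convert UTF-8 to UTF-16, then WriteConsoleW */
        wchar_t *w;
        BOOL ok;
        int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, (int)n, NULL, 0);
        if (wlen <= 0) return 0;
        w = (wchar_t *)malloc((size_t)wlen * sizeof *w);
        if (w == NULL) return 0;
        MultiByteToWideChar(CP_UTF8, 0, utf8, (int)n, w, wlen);
        ok = WriteConsoleW(h, w, (DWORD)wlen, &written, NULL);
        free(w);
        return ok != 0;
    }
    /* 3) redirected to a file or a pipe: pass the UTF-8 bytes through */
    return WriteFile(h, utf8, (DWORD)n, &written, NULL) && written == n;
#else
    return fwrite(utf8, 1, n, stdout) == n;
#endif
}

/* Hypothetical printf replacement: format as UTF-8 into a buffer first,
   then hand the bytes to the single platform-specific output function. */
static int nob_print_utf8(const char *fmt, ...)
{
    char buf[1024]; /* assumed limit; longer output is truncated */
    va_list args;
    int n;
    va_start(args, fmt);
    n = vsnprintf(buf, sizeof buf, fmt, args);
    va_end(args);
    if (n < 0) return 0;
    if ((size_t)n >= sizeof buf) n = (int)sizeof buf - 1;
    return nob_write_stdout_utf8(buf, (size_t)n);
}
```

With this in place, a call such as nob_print_utf8("%s \xE2\x86\x92 %d\n", "temp", 42) keeps the arrow (U+2192) as plain UTF-8 bytes in the caller's code, and only nob_write_stdout_utf8 knows about UTF-16. A real implementation would also have to make sure truncation does not cut a multi-byte UTF-8 sequence in half, which this sketch does not handle.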
