Add --escapes, \n-style escapes to allow multiline comments#891
Add --escapes, \n-style escapes to allow multiline comments#891madah81pnz1 wants to merge 3 commits intoxiph:masterfrom
Conversation
Same as how it works in vorbiscomment, except \0 is not supported. \0 is problematic to support in C code, and it is ambiguous what something like \012 should mean, given that octal escapes are not supported by vorbiscomment anyway. Fixes xiph#232
|
I tried to be compatible with vorbiscomment, but decided to drop support for \0, since that doesn't really work in vorbiscomment anyway. I don't think it's worth supporting embedded \0, as many tools either have problems or just silently truncates the string. The case for supporting \0 is that it makes it possible for the user to repair a faulty tag; by exporting the escaped tags, it should be easy to remove "\0" in any text editor, and recover the end part of the tag after it. Resaving such tag in other audio players or tools might truncate the tag at the first \0, which is what vorbiscomment itself claims to do. \0 is also just a special case of an octal escape sequence. But those are a bit ambiguous, e.g. what should \012 mean; is it \0 + "12", or \n. This means for octal escape encoding, either always use 3 characters, or check if the next unescaped character is a number 0..7, and if so use 3 characters. So \0 + "0" would be encoded as \0000, and \0 + "A" as \0A. Interestingly, Rust does not support octal escape sequences, but does support \0 as a special case. \x have similar problems, due to how it is defined in C and C++ as "arbitrary number of hexadecimal digits". Consider if you have a long string that only consists of 0-9 and a-f characters, then it's impossible to have a \x escape sequence in the middle of it, as there is no way to tell where it ends, e.g. "\x123456789". Python changed its \x in PEP-223 to be exactly two hex digits, Rust does the same. I think it's more useful to support \u and \U instead, which would let you export and import without the risk of UTF-8 turning into mojibake. But even the need for this should be rare, and --no-utf8-convert can already be used as a workaround if needed. In most languages, \u is exactly 4 hex digits, and \U is 8. The exception is Bash that allows variable hex digits like for \x. Bash otherwise follows the C standard escaping, so this is some discrepancy. Another thing to consider is that the exported tags might be processed further by other user tools and scripts, so it makes sense to keep the escaping simple. Replacing newlines with \n should be trivial in any language, but \u requires more parsing logic and a full UTF-8 encoder/decoder. References: |
| #include <stddef.h> | ||
| #include "share/grabbag.h" | ||
|
|
||
| FLAC__bool grabbag__escape_string_needed(const char *src, size_t src_size) |
There was a problem hiding this comment.
Since this is just a performance optimization, and running escape/unescape over all tags shouldn't be a bottleneck, this function could be removed.
src/share/grabbag/escapes.c
Outdated
| switch(src[s]) { | ||
| case '\\': | ||
| dst[d++] = '\\'; | ||
| dst[d++] = '\\'; |
There was a problem hiding this comment.
This might overrun d by one byte, but shouldn't be a problem in this specific usage, as the calling functions has +1 byte available for the null terminator.
Should be fixed, add a if (d + 1 < dst_size) to also prevent writing a partially escaped character.
| FLAC__bool grabbag__unescape_string_needed(const char *src, size_t src_size) | ||
| { | ||
| for(size_t s = 0; s < src_size; ++s) { | ||
| if(src[s] == '\\') { |
There was a problem hiding this comment.
Could use strchr() instead here, but at the time I wasn't sure if handling non-null terminated strings would be important.
| return false; | ||
| } | ||
|
|
||
| static size_t grabbag__unescape_string_size(const char *src, size_t src_size) |
There was a problem hiding this comment.
The size could also be done by the unescape_string() if provided a NULL dst buffer, which is preferable in case the escaping becomes more complicated in the future.
| dst[d++] = '\n'; | ||
| break; | ||
| default: | ||
| /* Unsupported escape character */ |
There was a problem hiding this comment.
Could add a "ignore errors" mode, which keeps all invalid/unrecognized escape sequences as-is. But for now I don't think it is useful.
| entry_length = strlen(allocated_str); | ||
| } | ||
| else { | ||
| /* error is ignored, the original value will be used */ |
There was a problem hiding this comment.
There is no way to return an error in this function. But also grabbag__create_escaped_string() can't really fail unless malloc fails.
Ignoring errors is also done for the utf8_decode() call below, which would print the original string in case it fails. Maybe could be improved in the future, since --show-tag might show the tag in the wrong encoding to the user, without giving any indiciation or error, which would be surprising.
Same as how it works in vorbiscomment, except \0 is not supported.
\0 is problematic to support in C code, and it is ambiguous what something like \012 should mean, given that octal escapes are not supported by vorbiscomment anyway.
Fixes #232