Skip to content

Add --escapes, \n-style escapes to allow multiline comments#891

Open
madah81pnz1 wants to merge 3 commits intoxiph:masterfrom
madah81pnz1:add-cmdline-escapes
Open

Add --escapes, \n-style escapes to allow multiline comments#891
madah81pnz1 wants to merge 3 commits intoxiph:masterfrom
madah81pnz1:add-cmdline-escapes

Conversation

@madah81pnz1
Copy link
Copy Markdown
Contributor

Same as how it works in vorbiscomment, except \0 is not supported.

\0 is problematic to support in C code, and it is ambiguous what something like \012 should mean, given that octal escapes are not supported by vorbiscomment anyway.

Fixes #232

Same as how it works in vorbiscomment, except \0 is not supported.

\0 is problematic to support in C code, and it is ambiguous what
something like \012 should mean, given that octal escapes are not
supported by vorbiscomment anyway.

Fixes xiph#232
@madah81pnz1
Copy link
Copy Markdown
Contributor Author

I tried to be compatible with vorbiscomment, but decided to drop support for \0, since that doesn't really work in vorbiscomment anyway. I don't think it's worth supporting embedded \0, as many tools either have problems or just silently truncates the string.

The case for supporting \0 is that it makes it possible for the user to repair a faulty tag; by exporting the escaped tags, it should be easy to remove "\0" in any text editor, and recover the end part of the tag after it. Resaving such tag in other audio players or tools might truncate the tag at the first \0, which is what vorbiscomment itself claims to do.

\0 is also just a special case of an octal escape sequence. But those are a bit ambiguous, e.g. what should \012 mean; is it \0 + "12", or \n. This means for octal escape encoding, either always use 3 characters, or check if the next unescaped character is a number 0..7, and if so use 3 characters. So \0 + "0" would be encoded as \0000, and \0 + "A" as \0A. Interestingly, Rust does not support octal escape sequences, but does support \0 as a special case.

\x have similar problems, due to how it is defined in C and C++ as "arbitrary number of hexadecimal digits". Consider if you have a long string that only consists of 0-9 and a-f characters, then it's impossible to have a \x escape sequence in the middle of it, as there is no way to tell where it ends, e.g. "\x123456789". Python changed its \x in PEP-223 to be exactly two hex digits, Rust does the same.

I think it's more useful to support \u and \U instead, which would let you export and import without the risk of UTF-8 turning into mojibake. But even the need for this should be rare, and --no-utf8-convert can already be used as a workaround if needed. In most languages, \u is exactly 4 hex digits, and \U is 8. The exception is Bash that allows variable hex digits like for \x. Bash otherwise follows the C standard escaping, so this is some discrepancy.

Another thing to consider is that the exported tags might be processed further by other user tools and scripts, so it makes sense to keep the escaping simple. Replacing newlines with \n should be trivial in any language, but \u requires more parsing logic and a full UTF-8 encoder/decoder.

References:
https://en.cppreference.com/w/c/language/escape.html
https://en.cppreference.com/w/cpp/language/escape.html
https://peps.python.org/pep-0223/
https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences
https://doc.rust-lang.org/reference/tokens.html#ascii-escapes

#include <stddef.h>
#include "share/grabbag.h"

FLAC__bool grabbag__escape_string_needed(const char *src, size_t src_size)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is just a performance optimization, and running escape/unescape over all tags shouldn't be a bottleneck, this function could be removed.

switch(src[s]) {
case '\\':
dst[d++] = '\\';
dst[d++] = '\\';
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might overrun d by one byte, but shouldn't be a problem in this specific usage, as the calling functions has +1 byte available for the null terminator.
Should be fixed, add a if (d + 1 < dst_size) to also prevent writing a partially escaped character.

FLAC__bool grabbag__unescape_string_needed(const char *src, size_t src_size)
{
for(size_t s = 0; s < src_size; ++s) {
if(src[s] == '\\') {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use strchr() instead here, but at the time I wasn't sure if handling non-null terminated strings would be important.

return false;
}

static size_t grabbag__unescape_string_size(const char *src, size_t src_size)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size could also be done by the unescape_string() if provided a NULL dst buffer, which is preferable in case the escaping becomes more complicated in the future.

dst[d++] = '\n';
break;
default:
/* Unsupported escape character */
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could add a "ignore errors" mode, which keeps all invalid/unrecognized escape sequences as-is. But for now I don't think it is useful.

entry_length = strlen(allocated_str);
}
else {
/* error is ignored, the original value will be used */
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no way to return an error in this function. But also grabbag__create_escaped_string() can't really fail unless malloc fails.

Ignoring errors is also done for the utf8_decode() call below, which would print the original string in case it fails. Maybe could be improved in the future, since --show-tag might show the tag in the wrong encoding to the user, without giving any indiciation or error, which would be surprising.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

metaflac bug - error writing multi-line tag from stdin

1 participant