
From: Frank Heckenbach
Subject: Re: Internal representations (was: Re: Possible solution for special characters in makefile paths)
Date: Sun, 23 Feb 2014 02:40:34 +0100

Paul Smith wrote:

> This thread is intended to discuss how quoted strings might be
> represented internal to make, assuming that they are encoded in some way
> and not just left as they appear in the input makefile as Eli suggests.
> 
> On Thu, 20 Feb 2014 I wrote:
> > The advantages to this are ... (b) there is no change needed to any
> > existing tokenization in make, which is scanning for whitespace,
> > parenthesis, braces, nul bytes, etc.: it will all continue to work
> > with no changes.
> 
> I realized I may not have made clear my thinking behind this.  Doing
> away with this requirement gives a more flexible and backward compatible
> solution, but requires a lot more effort.  Maybe that's a feasible
> trade-off, so I'd like opinions about it.

IMO, if you're going to do larger changes in make's string handling,
it may be worthwhile to consider switching from strings to string
lists/arrays (in particular for values of variables). Then there
will be no need for tokenization after the initial lexing, or for
escaping. Storing the result of $(wildcard) would be
straightforward. Of course, these would be large changes, but if
you're going to put in a lot of effort anyway, this might be an option.
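
To make this concrete, here is a minimal sketch in C (made-up
names, not actual make internals) of what a list-valued variable
might look like:

#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: a variable's value stored as an array of
   words instead of one flat string.  Each word is its own element,
   so no byte inside a word ever needs escaping.  */
struct string_list
  {
    char **items;        /* each item is one NUL-terminated word */
    size_t count;
    size_t capacity;
  };

/* Storing a file name from $(wildcard) would then be a plain array
   append, no matter what bytes the name contains.  */
static void
string_list_append (struct string_list *l, const char *word)
{
  if (l->count == l->capacity)
    {
      l->capacity = l->capacity ? 2 * l->capacity : 8;
      l->items = realloc (l->items, l->capacity * sizeof *l->items);
    }
  l->items[l->count++] = strdup (word);
}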

> Suppose that instead of reserving a complete set of mapping characters,
> one for each special character, we instead choose one single special
> character and use it as an escape character, which is I guess what Frank
> was suggesting.

Not exactly. Sorry if I wasn't clear. What I meant was, e.g.,
selecting 16 control chars so that a pair of them can represent any
byte value (including the control chars themselves).
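
For illustration, here is a sketch of that scheme. Using 0x10..0x1F
as the 16 reserved control chars is my assumption, purely for
concreteness; each escaped byte is split into two 4-bit halves, each
stored as one of the 16 codes:

/* Sketch of the pair-of-16 scheme.  Each escaped byte becomes two
   bytes: 0x10 + high nibble, 0x10 + low nibble.  Any byte value,
   including 0x10..0x1F themselves, can be represented.  */
#define ENC_BASE 0x10

static void
encode_byte (unsigned char c, char out[2])
{
  out[0] = ENC_BASE + (c >> 4);
  out[1] = ENC_BASE + (c & 0x0f);
}

static unsigned char
decode_pair (const char in[2])
{
  return (unsigned char) (((in[0] - ENC_BASE) << 4)
                          | (in[1] - ENC_BASE));
}

Since none of the 16 code bytes is whitespace or a make
metacharacter, existing tokenization never splits inside an encoded
pair.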

> There are three reasons I avoided this: first, it means all our internal
> parsing, functions, etc. must be modified to be "escape-aware".  Where
> today we just walk strings using trivial tokenization tests ("is this a
> space?") now we need to detect if we're in an escaped situation and keep
> that state.

My proposal would avoid this problem, since an escaped space would
not contain an actual space character. (That's why I drew the
parallel to UTF-8: in UTF-8, a byte with value 47 is always a "/",
regardless of context; that's why, e.g., case-sensitive Unix file
systems can be used with UTF-8 file names without special support.)

> The first additional problem with the "escape character" model is
> idempotency.  With the character mapping solution you don't have to
> worry about "re-encoding" an already encoded string: no matter how many
> times you encode it, it's always the same string.  This is a very
> powerful simplifying feature.

Of course, any encoding that allows arbitrary input strings cannot
be idempotent: a reversible encoding that accepted every string and
left already-encoded strings unchanged would have to be the identity
mapping, which encodes nothing.

> Before I get to the final problem I'll say one more word about
> idempotency: we could solve this problem if we were willing to forgo the
> idea of quoting the quote character.  This means that we would need to
> fail any makefile we parsed that contained the quote character (DLE,
> above).  This helps because any time we see the quote character we know
> it's really quoting something, not just a stray DLE, and we don't need
> to re-quote.

Indeed, when breaking backward compatibility this way, it would be
advantageous to keep the break as small as possible, i.e. have only
a single quote character (as opposed to your and my previous
proposals).

Of course, it's possible to combine this feature with the
context-independence-for-tokenization described above, by using e.g.
DLE+letter (which cannot encode all byte values, but still more than
enough for the characters that may ever need to be escaped).

> The last problem to be considered is the embedded APIs such as Guile
> and the C API.
> 
> Regardless of the model we choose we'll have to provide a "decode"
> function to those APIs, that will remove our encoding.  For encoding we
> can either provide a specific function, or let the callers use the eval
> function with "$[...]" strings to encode.
> 
> If we go with my original "mapping characters" model that's all we need:
> we can allow the user API to do its own tokenization, based on
> whitespace just like we do, and perform all kinds of hacking and
> chopping and whatever, then call GNU make's "decode" function to decode
> that word when they want the real thing.
> 
> If we go with an escaping character like the above, though, we'd need to
> provide the embedded APIs with a set of functions that would tokenize
> strings: they could not do it themselves as they can today.  At the very
> least we'd need some kind of strtok()-like function that would take a
> set of delimiter characters and chop up a string based on those
> delimiters, with the added caveat that if any of the delimiters were in
> our escaped character set then an escaped character would not match.

As long as they only tokenize on whitespace, both of my proposals
(pair from a set of 16 control characters and DLE+letter) should be
fine in this regard.
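
To illustrate: a plain whitespace scanner like the sketch below
(not make's actual tokenizer) never has to track escape state under
either proposal, because an escaped space simply does not contain
the byte ' ':

/* Find the next whitespace-delimited word in S; store its end in
   *END.  This works unchanged on encoded strings: escaped spaces
   and tabs are control-char sequences, so the scan never splits
   inside them.  */
static char *
next_word (char *s, char **end)
{
  while (*s == ' ' || *s == '\t')
    s++;
  if (*s == '\0')
    return NULL;
  *end = s;
  while (**end != '\0' && **end != ' ' && **end != '\t')
    (*end)++;
  return s;
}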

If we want to allow them to tokenize on letters without special
functions, DLE+letter would break, but instead we could perhaps use
DLE+CC where CC is a control character like in your initial
proposal, i.e.:

SPC->DLE+001
TAB->DLE+002
001->001
DLE->invalid

(This would limit the characters that can ever be escaped to ~16.
If this might be a problem, one could think about 3-byte sequences.)
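
As a sketch (made-up code; DLE is 0x10, and 001/002 are octal as in
the table above):

/* Encode one input character under the DLE+CC scheme from the
   table above.  DLE introduces an escape; the following control
   char says which character was escaped.  Other bytes, including a
   literal 001, pass through unchanged; a literal DLE in the input
   is rejected.  Returns the number of bytes written, or -1.  */
#define DLE 0x10

static int
encode_char (unsigned char c, char *out)
{
  switch (c)
    {
    case ' ':  out[0] = DLE; out[1] = 001; return 2;  /* SPC */
    case '\t': out[0] = DLE; out[1] = 002; return 2;  /* TAB */
    case DLE:  return -1;   /* invalid, cannot appear in input */
    default:   out[0] = c;  return 1;    /* incl. a literal 001 */
    }
}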

However, if they might want to tokenize on control characters, each
of our proposals would break. I hate to say it, but I have used \001
and \002 as token separators in strings, though not in makefiles.
I haven't used make's embedded APIs at all, so I don't really know
what kind of stuff we must expect there, but if that's a
possibility, the only way to mitigate it is to reduce the number of
characters involved in escaping (in the extreme case to 2, at the
cost of a greater increase in length).

Or we could give them a way to retrieve strings from make in two
ways: either
as a single string (unescaped) or as an array of strings, properly
tokenized and unescaped by make, similar to "$*" vs. "$@" in the
shell. The first way would lose information (like "$*" does) and be
backward-compatible. Neither way would return a string with escape
characters, so the APIs wouldn't ever see them.
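
In C-API terms, that might look like the following prototypes
(purely hypothetical names, not make's real embedded API):

#include <stddef.h>

/* Hypothetical API sketch.  The first call mirrors "$*": one flat,
   unescaped string that loses word boundaries.  The second mirrors
   "$@": make itself tokenizes and unescapes, so the caller never
   sees escape characters at all.  */
char *gmk_get_value_string (const char *variable);
char **gmk_get_value_words (const char *variable, size_t *n_words);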


