
Possible solution for special characters in makefile paths


From: Paul Smith
Subject: Possible solution for special characters in makefile paths
Date: Thu, 20 Feb 2014 03:22:39 -0500

Hi all.

I've been thinking a bit about how to manage special characters (most
notably whitespace, but also colons, etc.) in makefile targets and
prerequisites, in particular.

The problems from trying to support these are manifold.  Because make
variables can be expanded multiple times and in various contexts, using
a common escape scheme such as prefixing characters with backslashes is
complex.  The escape character must be preserved throughout the
makefile, interpreted correctly throughout all function and macro
operations, and ignored where appropriate (in pattern substitutions
etc.), then removed properly just before any path is handed to an
external entity such as a shell command or an environment variable, or
is displayed on stdout.

Further, various functions such as $(wildcard ...) (at least) will need
to escape ("en-escape-ify") their output properly.

My goal is to find a solution that does NOT involve rewriting the
entirety of GNU make's string handling (which is a very large portion of
what make does).

Here's an idea.  It is not ideal and some may find it distasteful.  I'm
interested to hear about objections or alternative suggestions.

First let's posit that a makefile must be written in an encoding which
has the following characteristics:

     1. The bytes 0-127 are compatible with the ASCII character set
        (technically we only care about a subset of these but let's say
        all)
     2. The single nul byte (\0) always signifies the end of the string.

Obviously UTF-8 is one such encoding; I believe there are more.  Equally
obviously, a wide multibyte encoding such as UTF-16 or UTF-32 is not
appropriate, since it embeds nul bytes and is not ASCII-compatible in
the 0-127 range.

The POSIX standard for make restricts the input language to a much
smaller subset of the Portable Character Set so there shouldn't be a
concern from a standardization standpoint.


My suggestion is this: we choose a set of characters outside of the
Portable Character Set but within the 0-127 range, and map each of the
"concerning" characters to the alternate character when the string is
stored internally to make.  Then when make is constructing a path to be
provided to an external entity these characters will be translated back
into their appropriate values.

The advantages to this are (a) there is no change to the length of the
string so the encoding can be performed in-place, and computing the size
of an output buffer is trivial (it's the same size), and (b) there is no
change needed to any existing tokenization in make, which scans for
whitespace, parentheses, braces, nul bytes, etc.: it will all continue
to work with no changes.

To be concrete, suppose we choose a mapping like this:

   <space>  = \001
   <tab>    = \002
   <colon>  = \003
   <dollar> = \004
   <comma>  = \005
   <equals> = \006

(I'm not entirely sure all of these need to be encoded, and there may be
more characters that need to be mapped.)  These byte values are not part
of the Portable Character Set and are (I believe) very rarely used in
any encoding which meets the criteria above.  Please correct me if I'm
mistaken here.
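
To make the mechanics concrete, here is a minimal C sketch of what the
in-place translation might look like.  The function names are my own
illustration (nothing like this exists in make today); it just swaps
each "concerning" byte for its stand-in and back, never changing the
length of the string:

  /* Illustration only: real names and the exact character set are not
     settled.  Both directions are in-place, byte-for-byte swaps, so the
     string length never changes. */

  #include <stddef.h>

  static const struct { char plain; char coded; } char_map[] = {
      { ' ',  '\001' }, { '\t', '\002' }, { ':', '\003' },
      { '$',  '\004' }, { ',',  '\005' }, { '=', '\006' },
  };

  /* Replace each special character with its stand-in byte. */
  void encode_special (char *s)
  {
    for (; *s != '\0'; ++s)
      for (size_t i = 0; i < sizeof char_map / sizeof char_map[0]; ++i)
        if (*s == char_map[i].plain)
          {
            *s = char_map[i].coded;
            break;
          }
  }

  /* Restore the original bytes just before the string leaves make
     (shell command, environment, stat(), output). */
  void decode_special (char *s)
  {
    for (; *s != '\0'; ++s)
      for (size_t i = 0; i < sizeof char_map / sizeof char_map[0]; ++i)
        if (*s == char_map[i].coded)
          {
            *s = char_map[i].plain;
            break;
          }
  }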

Immediately we can assume that the wildcard function will encode its
results using this method, and so we can get whitespace-containing
results for free.  We can also automatically encode any goal provided on
the command line (e.g., 'make "foo bar"').  Also we can encode all the
results obtained from the directory cache.
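
For instance (purely illustrative, reusing the encode_special() sketch
above), the glob results behind $(wildcard ...) could be run through the
encoder before they are spliced into an expansion, so a file named
"foo bar.c" enters make's string handling with its space already
replaced by \001:

  /* Illustration only: encode each globbed name before it enters make's
     internal word lists.  Assumes encode_special() from the sketch above. */

  #include <glob.h>
  #include <stddef.h>

  void encode_special (char *s);   /* from the earlier sketch */

  void encode_wildcard_results (glob_t *g)
  {
    for (size_t i = 0; i < g->gl_pathc; ++i)
      /* After this, "foo bar.c" is "foo\001bar.c" internally and can be
         treated as a single word in a space-separated list. */
      encode_special (g->gl_pathv[i]);
  }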

Next we need to provide a way for makefiles to indicate they want a
string to be encoded as a single entity.  I was thinking of using a new
start/end token set.  Maybe $[...] or $`...`.  This will allow syntax
such as:

  $[my: target] : $[this: prerequisite]

or whatever.  We'd need a way to escape the end token as well.

The treatment of these values is tricky.  By far the simplest option,
and the most performant, would be to make this a preprocessor-type
feature: the makefile parser itself would actually detect the $[] and
perform the encoding as the file is read in, before it's handed to the
lexer for tokenization.  This is extremely attractive from an
implementation standpoint.
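
As a very rough sketch of this preprocessor-style option (the $[...]
syntax is only a proposal, and none of this reflects real make
internals), each $[...] span could be rewritten in place as the line is
read; since the "$[" and "]" delimiters are dropped, the result is never
longer than the input.  Escaping of the end token is not handled here:

  /* Illustration only: rewrite every $[...] span in LINE in place,
     encoding the bytes between the delimiters and dropping the
     delimiters themselves.  Escaping of "]" is not handled. */

  #include <string.h>

  /* Map one byte to its stand-in, per the table above. */
  static char encode_byte (char c)
  {
    switch (c)
      {
      case ' ':  return '\001';
      case '\t': return '\002';
      case ':':  return '\003';
      case '$':  return '\004';
      case ',':  return '\005';
      case '=':  return '\006';
      default:   return c;
      }
  }

  void preprocess_quoted_spans (char *line)
  {
    char *src = line, *dst = line;

    while (*src != '\0')
      {
        if (src[0] == '$' && src[1] == '[')
          {
            char *end = strchr (src + 2, ']');
            if (end != NULL)
              {
                /* Encode the span, writing it down over the "$[". */
                for (char *p = src + 2; p < end; ++p)
                  *dst++ = encode_byte (*p);
                src = end + 1;        /* skip the closing "]" */
                continue;
              }
          }
        *dst++ = *src++;
      }
    *dst = '\0';
  }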

The disadvantage of this is that it would not be usable inside the
makefile to encode expansions of values.  For example suppose someone
runs 'make FOO="bar biz"' and the makefile author knows that the value
of FOO should always be treated as a single word; they may want to
write:

  FOO=$[$(FOO)]

but this will not work, since $[] is handled at parse time, before any
expansion.

On the other hand, delaying handling of $[] means we need to do encoding
during variable expansion as well.  If this is a recursive macro it
would be expanded multiple times, and each time the encoding would have
to be performed as the expansion occurred to ensure the right behavior.
I think this will work correctly, though.  So maybe this more flexible
approach is better even if it's less efficient and more work.

An alternative would be to have $[] encode pre-tokenization then provide
a function like $(encode ...) which would use the normal expansion
semantics.  But that might be too complicated and confusing.

Once the strings were encoded they would be used internally just as they
are now.  All existing string manipulations would work without change.

When we need to use these strings in a context external to make, they
would need to be decoded (a rough sketch of this follows the list
below).  It's not yet clear what performance penalties there might be,
or how best to mitigate them.  However, such decoding would need to be
done:
      * Whenever we print a string as output
      * Whenever we create a shell script or fast-path command
      * Whenever we set a value in the environment
      * Whenever we pass a string to an OS function (stat() etc.)
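
As a rough illustration of that boundary decoding (again assuming the
char_map/decode_special() sketch above), an OS-facing call site might
decode into a scratch copy so the internal, encoded string is left
untouched; sizing the copy is trivial because the encoding never changes
the length:

  /* Illustration only: decode at the boundary, on a scratch copy, so the
     encoded internal string stays intact.  Assumes decode_special() from
     the earlier sketch. */

  #include <stdlib.h>
  #include <string.h>
  #include <sys/stat.h>

  void decode_special (char *s);   /* from the earlier sketch */

  int stat_decoded (const char *encoded_name, struct stat *st)
  {
    size_t len = strlen (encoded_name);
    char *tmp = malloc (len + 1);
    int r;

    if (tmp == NULL)
      return -1;
    memcpy (tmp, encoded_name, len + 1);
    decode_special (tmp);            /* restore real spaces, colons, ... */
    r = stat (tmp, st);
    free (tmp);
    return r;
  }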


I should note that in no way am I suggesting that we would fix the
user's recipes automatically to work properly with these values.  That
would be their responsibility.  As an example, the default builtin rules
such as:

   %.o : %.c
           $(COMPILE.c) $< -o $@

will clearly not work properly if the target or prerequisite could
contain whitespace.  We could make an attempt to fix them by modifying
the built-in rules to use something like:

   %.o : %.c
           $(COMPILE.c) '$<' -o '$@'

Of course this fails if a target or prerequisite contains single quotes.
I have no 100% solution to these problems, other than the hope that
paths containing single-quotes are far less abundant than paths
containing whitespace (for example).  If users have such paths they'll
have to use their own methods to handle them (perhaps running a
$(subst ...) on them to replace each ' with an escaped version, or
similar; the standard shell-quoting trick is sketched below).
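
For completeness, the usual POSIX-shell trick is to wrap the word in
single quotes and turn each embedded ' into '\''.  A hedged C sketch of
such a helper (just the standard technique, not part of this proposal;
sh_single_quote is a hypothetical name):

  /* Illustration of standard sh single-quoting: 'foo bar' stays one word,
     and an embedded ' becomes '\''.  The caller frees the result. */

  #include <stdlib.h>
  #include <string.h>

  char *sh_single_quote (const char *s)
  {
    /* Worst case every byte is a quote (1 -> 4 bytes), plus the two
       surrounding quotes and the terminating nul. */
    char *out = malloc (strlen (s) * 4 + 3);
    char *p = out;

    if (out == NULL)
      return NULL;
    *p++ = '\'';
    for (; *s != '\0'; ++s)
      if (*s == '\'')
        {
          memcpy (p, "'\\''", 4);
          p += 4;
        }
      else
        *p++ = *s;
    *p++ = '\'';
    *p = '\0';
    return out;
  }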


There are other gotchas.  For example, replacing spaces (etc.) with some
other character in an encoded string would not work: you would have to
replace the encoded space instead (e.g., $(subst $[ ],~,$(FOO))).  I
don't know how often this is wanted in real life, but (assuming we
automatically convert $(wildcard ...) output, for example) it might be a
backward-compatibility break for makefiles that try to handle whitespace
using tricks today.

Other backward-compatibility issues:
     1. Any makefile that uses one of the chosen mapping characters will
        fail.  We can detect this during makefile parsing and throw an
        error, so this will not be a silent problem.
     2. Any makefile using the chosen "quoting" token will break; i.e.
        if some makefile today has "[ = foo" then uses "$[" later, and
        we choose $[...] for quoting, this will fail.  It would have to
        be changed to use "$([)" instead.  Same for ` if we choose
        $`...` etc.


Well, I've yammered on enough.  I'm interested to hear what people think
of this, and what problems they envision.

Cheers!



