groff

Re: C Strings and String Literals. (Was: Pascal rides again)


From: Alejandro Colomar
Subject: Re: C Strings and String Literals. (Was: Pascal rides again)
Date: Sun, 13 Nov 2022 22:20:10 +0100

Hi Ralph,

On 11/13/22 21:08, Ralph Corderoy wrote:
Hi Branden,

C doesn't _really_ have strings, except at the library level.
It has character arrays and one grain of syntactic sugar for encoding
"string literals", which should not have been called that because
whether they get the null terminator is context-dependent.

         char a[5] = "fooba";
         char *b = "bazqux";

I see some Internet sources claim that C is absolutely reliable about
null-terminating such literals, but I can't agree.  The assignment to
`b` above adds a null terminator, and the one to `a` does not.  This
is the opposite of absolute reliability.  Since I foresee someone
calling me a liar for saying that, I'll grant that if you carry a long
enough list of exceptional cases for the syntax in your head, both are
predictable.  But it's simply a land mine for the everyday programmer.

- C defines both string literals and strings at the language level,
   e.g. main()'s argv[] is defined to contain strings.

I must disagree. The string concept is very broad, and you can define your own string, for example:

struct str_s {
        size_t  len;
        u_char  *s;
};

However, assuming that the concept of string is a NUL-terminated char array, there's little in the core language about it. Sure, string literals are the only true strings in the language, but as Branden says, that's of little use.

You can prove that string literals are really strings (i.e., NUL-terminated char arrays) by applying sizeof to them, and then looping over their contents to see that there's exactly one NUL byte, in the last position.
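
That check can be sketched in a few lines. This is only an illustration: count_nul(), is_true_string() and LITERAL_IS_TRUE_STRING() are hypothetical names made up here, not anything from the standard library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the NUL bytes in the first n bytes of s. */
static size_t count_nul(const char *s, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (s[i] == '\0')
            count++;
    }
    return count;
}

/* Returns 1 if the n-byte object holds exactly one NUL, in its last byte. */
static int is_true_string(const char *s, size_t n)
{
    assert(n > 0);
    return count_nul(s, n) == 1 && s[n - 1] == '\0';
}

/* sizeof sees the literal as an array, so the implicit NUL is counted. */
#define LITERAL_IS_TRUE_STRING(lit)  is_true_string(lit, sizeof(lit))
```

With these, LITERAL_IS_TRUE_STRING("foo") holds, while a literal with an embedded NUL such as "foo\0bar" fails the test because it contains two NUL bytes.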


- In C, "foo" is a string literal.  That is the correct name as it is
   not a C string because a string literal may contain explicit NUL bytes
   within it which a string may not: "foo\0bar".

I wouldn't discard them as string literals only for that. Since you are the one in control of that, you're the one making your own string literals to be bogus strings. Other than that, they are the only consistent part of the core language that uses true strings.

Writing a NUL byte by accident is unusual, anyway.


- A string literal has an implicit NUL added at its end thus "foo" fills
   four bytes.

Yes, sizeof("foo")==4, and "foo"[3]=='\0'.


- A character array may be initialised by a string literal.  Successive
   elements of the array are set to the string literal's characters,
   including the implicit NUL if there is room.

     char     two[2] = "foo";   // 'f' 'o'
     char   three[3] = "foo";   // 'f' 'o' 'o'
     char    four[4] = "foo";   // 'f' 'o' 'o' '\0'
     char    five[5] = "foo";   // 'f' 'o' 'o' '\0' '\0'
     char implicit[] = "foo";   // 'f' 'o' 'o' '\0'

Ahh my friend, you're too used to some dialect of C that allows this, I believe. ISO C11 doesn't, and I'm guessing any older ISO C versions behave in the same way:

$ cat str.c
    char     two[2] = "foo";   // 'f' 'o'
    char   three[3] = "foo";   // 'f' 'o' 'o'
    char    four[4] = "foo";   // 'f' 'o' 'o' '\0'
    char    five[5] = "foo";   // 'f' 'o' 'o' '\0' '\0'
    char implicit[] = "foo";   // 'f' 'o' 'o' '\0'

$ cc str.c -Wpedantic -pedantic-errors
str.c:1:23: error: initializer-string for array of ‘char’ is too long
    1 |     char     two[2] = "foo";   // 'f' 'o'
      |                       ^~~~~


- ISO C doesn't allow 'two'.

- It does, however, allow 'five', and forces the remaining elements to be initialized the same as objects that have static storage duration (i.e., to 0). See C2x::6.7.10/22

- It does allow 'three', 'four', and 'implicit', per C2x::6.7.10/15 (I believe it's that paragraph). I admit that the wording is not so clear as to reject 'two'; however GCC seems to interpret it that way, in pedantic mode.


That's it.
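
The accepted cases can be checked with a short sketch. The names has_nul() and check_initializers() are hypothetical, and 'two' is left out because it is the case that gets rejected above.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if the n-byte array contains a NUL byte, i.e. holds a string. */
static int has_nul(const char *a, size_t n)
{
    return memchr(a, '\0', n) != NULL;
}

static void check_initializers(void)
{
    char   three[3] = "foo";   /* no room: the implicit NUL is dropped */
    char    four[4] = "foo";   /* exactly fits, NUL included */
    char    five[5] = "foo";   /* the remaining element is set to 0 */
    char implicit[] = "foo";   /* size deduced from the literal: 4 */

    assert(!has_nul(three, sizeof(three)));
    assert(has_nul(four, sizeof(four)));
    assert(five[3] == '\0' && five[4] == '\0');
    assert(sizeof(implicit) == 4);
}
```

Only 'three' ends up as a char array that is not a string, and nothing in its declaration says so.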

- The string literal is reliably terminating by a NUL.

Terminated, yes. "terminating", hmmm, I'd say no; it doesn't reliably terminate other objects, so it is not terminating, even though it is certainly terminated.

Subtle difference, but important.

- It is not context dependent whether a string literal has a terminating
   NUL.

Sure.
And guns are just machines that make holes, context-independently.
However, they can kill, depending on the context.
Especially if they have no safety, like Glocks, or string literals.


$ cat str.c
    #include <stdio.h>

    int main(void)
    {
        printf("%zu\n", sizeof(1 ? "foo" : "bar"));
        printf("%zu\n", 1 ? sizeof("foo") : sizeof("bar"));
    }

$ cc str.c -Wpedantic -pedantic-errors
$ ./a.out
8
4
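
The 8 is presumably sizeof(char *) on this x86_64 system: the operands of the conditional operator undergo array-to-pointer conversion, so the whole expression has type char * rather than char[4]. A minimal sketch of that comparison, with hypothetical function names:

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>

/* Checked at compile time; neither sizeof operand is evaluated. */
static_assert(sizeof("foo") == 4, "the literal is a char[4]");
static_assert(sizeof(1 ? "foo" : "bar") == sizeof(char *),
              "operands of ?: decay to char *");

/* Size of the conditional expression: its operands have decayed. */
static size_t size_via_conditional(void)
{
    return sizeof(1 ? "foo" : "bar");
}

/* Size of the literal itself: sizeof sees the array type. */
static size_t size_of_literal(void)
{
    return sizeof("foo");
}
```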


- It is absolutely reliable and clearly stated in the C standard and in
   any other C reference worth its salt.

If it does refer exclusively to the preceding bullet, yes.

- There is no need to ‘carry a long enough list of exceptional cases for
   the syntax in your head’.

Well, you need to be especially careful.

- An ‘everyday C programmer’ will know this simple behaviour by dint of
   being a C programmer who writes it every day; there is no landmine
   upon which to step.  :-)

Knowing it doesn't make it safer. I certainly have a clear distinction between arrays and pointers, and the different subtleties of strings and string literals, and still find it way too unsafe.

See for example (part of) a change that I made to optimize some code, where I transformed pointers to char into char arrays (following Ulrich Drepper's article about libraries). The global change to arrays instead of pointers reduced the code size by a couple of KiB, IIRC, which might matter for cache misses.


-static const char *log_levels[] = {
+static const char  log_levels[][8] = {
     "alert",
     "error",
     "warn",
     "notice",
     "info",
     "debug",
 };


As a note, I used 8 for better alignment, but 7 would have been fine. Now, let's imagine that I append the following element to the array: "messages". It has exactly 8 letters, so it fills its element completely, the terminating NUL is silently dropped, and nothing warns me about it.
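
To make the hazard concrete, here is a small sketch of that array with "messages" appended. LEVEL_SIZE and level_is_terminated() are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <string.h>

#define LEVEL_SIZE 8

static const char log_levels[][LEVEL_SIZE] = {
    "alert",
    "error",
    "warn",
    "notice",
    "info",
    "debug",
    "messages",   /* 8 letters: fills the element, no room for the NUL */
};

/* Returns 1 if element i still contains a terminating NUL. */
static int level_is_terminated(size_t i)
{
    assert(i < sizeof(log_levels) / sizeof(log_levels[0]));
    return memchr(log_levels[i], '\0', LEVEL_SIZE) != NULL;
}
```

Every element except the last one is a string; passing log_levels[6] to strlen() or printf("%s") would read past the end of the array.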


Do I really need to load a gun and point it at my feet just to get that performance? Wouldn't it be nice to use -Wunterminated-strings and let the compiler yell at me if I write a string literal with 8 letters? I mean, it should be simple for a compiler to implement that, and I would be sooo much happier!
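
Lacking such a warning, one possible workaround is to have C11 static_assert do the checking by hand. This is only a sketch under my own assumptions: the X-macro layout and the names LEVEL_SIZE, LOG_LEVELS, AS_INITIALIZER, AS_SIZE_CHECK and level_length() are all hypothetical.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <string.h>

#define LEVEL_SIZE 8

/* Single list of literals, expanded twice below. */
#define LOG_LEVELS(X) \
    X("alert")        \
    X("error")        \
    X("warn")         \
    X("notice")       \
    X("info")         \
    X("debug")

#define AS_INITIALIZER(s)  s,
#define AS_SIZE_CHECK(s) \
    static_assert(sizeof(s) <= LEVEL_SIZE, s " does not fit with its NUL");

static const char log_levels[][LEVEL_SIZE] = { LOG_LEVELS(AS_INITIALIZER) };

/* One compile-time check per literal; sizeof counts the implicit NUL. */
LOG_LEVELS(AS_SIZE_CHECK)

/* Safe to call strlen(): every element is known to be terminated. */
static size_t level_length(size_t i)
{
    return strlen(log_levels[i]);
}
```

Adding X("messages") to the list would make sizeof yield 9, so the build would fail instead of producing an unterminated element. It works, but it is a lot of ceremony compared to a compiler flag.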


Hope that helps clear up this corner of C.


Cheers,

Alex

--
<http://www.alejandro-colomar.es/>
