Re: C Strings and String Literals.

groff

From:	Ralph Corderoy
Subject:	Re: C Strings and String Literals.
Date:	Mon, 14 Nov 2022 13:56:36 +0000

Hi Alejandro,

> > > C doesn't _really_ have strings, except at the library level.
> > > It has character arrays and one grain of syntactic sugar for encoding
> > > "string literals", which should not have been called that because
> > > whether they get the null terminator is context-dependent.
> > >
> > >          char a[5] = "fooba";
> > >          char *b = "bazqux";
> > >
> > > I see some Internet sources claim that C is absolutely reliable about
> > > null-terminating such literals, but I can't agree.  The assignment to
> > > `b` above adds a null terminator, and the one to `a` does not.  This
> > > is the opposite of absolute reliability.  Since I foresee someone
> > > calling me a liar for saying that, I'll grant that if you carry a long
> > > enough list of exceptional cases for the syntax in your head, both are
> > > predictable.  But it's simply a land mine for the everyday programmer.
> > 
> > - C defines both string literals and strings at the language level,
> >    e.g. main()'s argv[] is defined to contain strings.
>
> I must disagree.  The string concept is very broad, and you can define
> your own string, as for example:
>
> struct str_s {
>       size_t  len;
>       u_char  *s;
> };

The point under discussion was whether the language specification of C
has strings or just character arrays and whether string literals should
have been called that because whether they have a terminating NUL is
‘context-dependent’.

To contradict what I've written, you're widening the discussion to
arbitrary data structures which can be used to implement a string.  That
is not relevant.

> However, assuming that the concept of string is a NUL-terminated char
> array, there's little in the core language about it.

But little is not nothing and so the C language does have both strings,
as the specification states that is what is sitting in main()'s argv[],
and string literals.

> Sure, string literals are the only true strings in the language

Your ‘Sure’ implies you're agreeing with someone.  If so, it's not me.
You're wrong on this point.

> You can prove that string literals are really strings (i.e.,
> NUL-terminated char arrays), by applying sizeof to them, and then
> looping over their contents to see that there's exactly one NUL byte
> at its last position.

Your definitions are wrong.  Proving "foo\0bar" ends with a NUL does not
make it a C string because a NUL-terminated char array is not a C string
if it contains a NUL before that.  A C string is zero or more non-NUL
chars followed by a NUL.

> > - In C, "foo" is a string literal.  That is the correct name as it is
> >    not a C string because a string literal may contain explicit NUL bytes
> >    within it which a string may not: "foo\0bar".
>
> I wouldn't discard them as string literals only for that.

I'm not discarding them as anything.  I am pointing out that, according
to the language definition, "foo\0bar" is a string literal but not a C
string because of the embedded NUL.  Thus the distinction is necessary
and terms are needed for each.

> Writing by accident a NUL byte is not usual, anyway.

I didn't claim it was.  I was arguing why ‘they should not have been
called string literal’ is wrong and that whether they get a NUL
terminator is not ‘context dependent’.

> > - A character array may be initialised by a string literal.  Successive
> >    elements of the array are set to the string literal's characters,
> >    including the implicit NUL if there is room.
> > 
> >      char     two[2] = "foo";   // 'f' 'o'
> >      char   three[3] = "foo";   // 'f' 'o' 'o'
> >      char    four[4] = "foo";   // 'f' 'o' 'o' '\0'
> >      char    five[5] = "foo";   // 'f' 'o' 'o' '\0' '\0'
> >      char implicit[] = "foo";   // 'f' 'o' 'o' '\0'
>
> Ahh my friend, you're too used to some dialect of C that allows this,
> I believe.  ISO C11 doesn't, and I'm guessing any older ISO C versions
> behave in the same way:
>
> $ cat str.c
>      char     two[2] = "foo";   // 'f' 'o'
>      char   three[3] = "foo";   // 'f' 'o' 'o'
>      char    four[4] = "foo";   // 'f' 'o' 'o' '\0'
>      char    five[5] = "foo";   // 'f' 'o' 'o' '\0' '\0'
>      char implicit[] = "foo";   // 'f' 'o' 'o' '\0'
>
> $ cc str.c -Wpedantic -pedantic-errors
> str.c:1:23: error: initializer-string for array of ‘char’ is too long
>      1 |     char     two[2] = "foo";   // 'f' 'o'
>        |                       ^~~~~

You are showing compiler output and claiming its error proves what the
standard says.  It would be handier to have a reference to the standard.

Here's a compiler which has been told I want C11.

    $ gcc -std=c11 -c str.c
    str.c:1:19: warning: initializer-string for array of chars is too long
     char     two[2] = "foo";   // 'f' 'o'
                       ^~~~~
    $ objdump -sj .data str.o

    str.o:     file format elf64-x86-64

    Contents of section .data:
     0000 666f666f 6f666f6f 00666f6f 0000666f  fofoofoo.foo..fo
     0010 6f00                                 o.              
    $ 

Note .data starts with two[]'s ‘fo’.

> -  ISO C doesn't allow 'two'.

Reference needed.

> -  It does however, allow 'five', and forces initialization to the same as 
> objects that have static storage duration (i.e., 0).  See C2x::6.7.10/22

Yes, I know that, showed it above, and this is nothing to do with
initialising a char array but just generally what happens,
e.g. ‘int a[42] = {3, 1, 4}’.

> -  It does allow 'three', 'four', and 'implicit', per C2x::6.7.10/15
> (I believe it's that paragraph).  I admit that the wording is not so
> clear as to reject 'two'; however GCC seems to interpret it that way,
> in pedantic mode.

We've moved from C11 to a future C, C2x.  Paragraph 6.7.10.15 in C2x is
the same as 6.7.9.14 in C11.

    An array of character type may be initialized by a character string
    literal or UTF-8 string literal, optionally enclosed in braces.
    Successive bytes of the string literal (including the terminating
    null character if there is room or if the array is of unknown size)
    initialize the elements of the array.

It describes the behaviour shown by str.c above: successive bytes
initialise the array.  It is not rejected by the compiler.  More
importantly, I can't see where it is rejected by the standard.

> > - The string literal is reliably terminating by a NUL.
>
> Terminated, yes.  "terminating", hmmm, I'd say no

Sorry, that's a typo, I meant ‘terminated’.

> > - It is not context dependent whether a string literal has a terminating
> >   NUL.
>
> Sure.

Good.

> And guns are just machines that do holes, context-independently.
> However, they can kill, depending on the context.
> Especially if they have no safety, like Glocks, or string literals.
>
> $ cat str.c
>      #include <stdio.h>
>
>      int main(void)
>      {
>          printf("%zu\n", sizeof(1 ? "foo" : "bar"));
>          printf("%zu\n", 1 ? sizeof("foo") : sizeof("bar"));
>      }
>
> $ cc str.c -Wpedantic -pedantic-errors
> $ ./a.out
> 8
> 4

Yes, I recall this from elsewhere in the thread where I asked you to
explain why switching to nitems() fixed the problem because I couldn't
see it given the code samples shown.
https://lists.gnu.org/archive/html/groff/2022-11/msg00030.html

But it is nothing to do with the language C defining what a string is
and having string literals as distinct things worthy of a separate name.

> See for example some (part of a) change that I did for optimizing some
> code, where I transformed pointers to char to char arrays (following
> Ulrich Drepper's article about libraries).  The global change using
> arrays instead of pointers reduced the code size in a couple of KiB,
> IIRC, which for cache misses might be an important thing.
>
> -static const char *log_levels[] = {
> +static const char  log_levels[][8] = {
>       "alert",
>       "error",
>       "warn",
>       "notice",
>       "info",
>       "debug",
>   };
>
> As a note, I used 8 for better alignment, but 7 would have been fine.
> Now, let's imagine that I append the following element to the array:
> "messages"?  Values of beta will give rise to dom!

That's because robust code has become fragile.  The original was better
because it allowed that addition of a longer string.  The couple of KiB
saved is probably irrelevant compared with the human time of dealing
with any error which might arise.

> Wouldn't it be nice to use -Wunterminated-strings and let the 
> compiler yell at me if I write a string literal with 8 letters?

If the compiler doesn't do that then I expect there is a linter that
will, or a different compiler.  But it sounds like some of the projects
you work on could do with a project-specific linter which understands
the conventions the code must follow.  That might not be too hard given
the LLVM framework and all the tools it provides these days.

-- 
Cheers, Ralph.