tinycc-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Tinycc-devel] TinyCC REPL


From: Jared Maddox
Subject: Re: [Tinycc-devel] TinyCC REPL
Date: Fri, 15 May 2015 21:39:12 -0500

> Date: Fri, 15 May 2015 09:32:21 -0400
> From: David Mertens <address@hidden>
> To: address@hidden
> Subject: Re: [Tinycc-devel] TinyCC REPL
> Message-ID:
>         <address@hidden>
> Content-Type: text/plain; charset="utf-8"
>
> On Thu, May 14, 2015 at 10:39 PM, Jared Maddox <address@hidden>
> wrote:
>
>> > Date: Thu, 14 May 2015 09:35:12 -0400
>> > From: David Mertens <address@hidden>
>> > To: address@hidden
>> > Subject: Re: [Tinycc-devel] TinyCC REPL
>> > Message-ID:
>> >         <
>> address@hidden>
>> > Content-Type: text/plain; charset="utf-8"
>> >

>>  You refer to what I understand to be the symbgol table as an Array
>> Mapped Trie. I'm not familiar with the structure, but it looks like an
>> array that someone shoved a bunch of trie data structures into. How
>> exactly does ::filled_bits work? Is it a bit field indicating which
>> children (I assume placed in relation to their parent) have data, and
>> which don't?
>>
>
> I think your understanding is essentially correct. I'm not really sure why
> Phil Bagwell used the term "Array Mapped Trie",

I'm thinking he just shoved a couple of relevant words into a
configuration that seemed vaguely informative. Engineers and
non-language-oriented scientists seem often to have a spotty grasp of
applied English.


> (Notice that I always have at least one child.) As I came from C++,
> this sort of construction was new to me, but I like it.

Just for reference, there's supposed to be a C-to-JVM compiler that
uses this basic type of trick: it allocates all of it's memory as a
char array, and builds up from that.

This sort of "bare struct" (as opposed to a class with a virtual
table) also plays a lot nicer with e.g. self-implemented virtual
memory (you have to use a disk-access api instead of simpler memory
accesses, but after modifying the code base it becomes possible to
have compilations that are only limited by your disk space). Not so
useful for compiling programs (normally, anyways), but if you're
writing a simulation of an arbitrarily large data set, then being able
to just "page out" inactive structures to disk can be a genuine
optimization and (non-portable) save format all at the same time.


> For example, a node with
> three children would essentially have the layout
>
> struct compressed_trie {
>     unsigned long long filled_bits;
>     struct compressed_trie * first;
>     struct compressed_trie * second;
>     struct compressed_trie * third;
> };
>

Some really old versions of pre-standard C++ apparently supported this
sort of thing through the constructor. I occasionally wished I had it
back when I was playing around with C++.


> The letter for the trie that corresponds to the first, second, and third
> child depends on which bits are occupied in filled_bits. The point is that
> any node only allocates exactly the number of pointers for child nodes that
> it needs, rather than having room for 63 child nodes, most of which go
> unused.
>

That's more sensible than I was initially imagining (I didn't analyze
the code that thoroughly, so I assumed everything was in the SAME
array).


>>    4. As implemented, this cuts the maximum number of token symbols in
>> >    half. (I needed to use one of the bits to indicate "extended symbol".)
>> >    5. The current token lookup is based on a compressed trie that
>> >    explicitly only supports A-Z, a-z, 0-9, and _. It does not support $
>> in
>> >    identifiers and would need to be revised in order to do so. I chose to
>> >    implement Phil Bagwell's Array Mapped Trie
>> >    <http://lampwww.epfl.ch/papers/triesearches.pdf.gz> in the belief
>> that
>> >    it would perform better than a hash for lookup. Once I add symbol
>> table
>> >    caching, I hope to add (or switch to) Array Compressed Tries for even
>> >    better cache utilization. But currently, I rely on have 63 allowed
>> >    characters in identifiers.
>>

<<snip: extension nodes to increase available symbols>>

>>
>
> Excellent idea!

Glad to be of assistance.

I don't remember if it occurred to me at the time, but thinking back,
it's somewhat related to red-black trees.


> Using this idea, I could even map all 256 possibilities
> through the trie and substantially simplify _c_trie_bit_offset_for_char. In
> fact, this could lead to transparent support for unicode.
>

"Transparent" would depend on perspective and implementation, but yes.
Depending on details, you could even support stuff like EBCDIC,
IBM-compatible codepages, and JIS, along with Unicode, all at the same
time (I would mostly suggest against this, mind you: if someone
actually wanted it, they could always use the ASCII / Unicode control
codes to provide an encoding instead; I just intend it as an example,
in case you can think of something to use this for: all you would need
to do is reserve one extension node for "unforseen purposes").



reply via email to

[Prev in Thread] Current Thread [Next in Thread]