bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Support in sort for human-readable numbers


From: Vitali Lovich
Subject: Re: Support in sort for human-readable numbers
Date: Tue, 6 Jan 2009 11:11:35 -0500

On Tue, Jan 6, 2009 at 10:19 AM, Pádraig Brady <address@hidden> wrote:
> Vitali Lovich wrote:
>> I've read the proposed patches that have been batted around on the mailing
>> list (after coming up with my own implementation :D of course).  My proposed
>> solution is less generic, but I believe more robust, than the other
>> approaches.
>>
>> I've proposed my reasoning below, but I've posted it as a bug on launchpad
>> to track this issue 313152 <https://bugs.launchpad.net/bugs/313152>.  The
>> patch is against 6.10 instead of trunk mainly because I was too lazy to get
>> the build-system set up on Ubuntu.  That being said, I'm pretty sure the
>> patch should still work against the trunk.  In any case, if it's necessary,
>> I could also do the diff against the trunk.
>>
>> Code review?
>> What would I need to do to get this mainlined (aside from adding the
>> documentation changes)?
>>
>> REASONING:
>>
>> One of my major assumption is that all the numbers are well formatted.
>>
>> In other words, there's an explicit demarcation in the number line (at least
>> internal to the input being sorted) after which the suffix increases and the
>> number starts again near 0.  For instance, if M represents 1050 Kilobytes,
>> then there's no 1051K - it's represented as 1.001M or something along those
>> lines.  Again, this would only rely on the input being internally consistent
>> - sort needs no knowledge or hints of what those suffixes represent.
>
> I like the idea.
>
> So it doesn't support sorting these correctly for example:
>
> 999MB
> 998MiB
> 1GiB
> 1030MiB
>
> I.E. a mixture of ^2 and ^10 are not supported,
> nor overlapping number ranges.
Well technically yes, but I'd just like to point something out.  Just
look at it from the perspective of sort which knows nothing about the
meaning of any of these.  MB can mean anything GiB can mean anything.
There's nothing preventing M representing e^3 & G representing e^100.
The sort tool would still work in this case.  In fact, I could use
other letters to make this explicit (i.e. they're really used to
indicate if one number is automatically greater than another), but
that would require more work for the user in the most common use case
where they are sorting the traditional suffixes.

Also, the above wouldn't be sorted anyways since suffixes are
single-characters followed by a space.

>
> I think that is fine since I don't think a tool
> should generate mixtures like that on a particular run.
>
> You had a few questions in your patch...
>
> + /* FIXME: add support for aliases (i.e. k/K, m/M, g/G) */
>
> Not necessary I think. You already handle case aliases with toupper().
Yeah - I forgot to remove that fixme on that patch.  I think it's
already gone in the one I'm preparing.

>
> +  /* FIXME: maybe add option to check for longer suffixes (i.e. gigabyte) */
>
> You should allow at least G, GiB and GB formats.
> Probably should print error if more than one of those
> formats used, since that's not supported.
I dunno if you read my previous post, but I presented the reasoning
that if the user has some kind of longer format, it's better handled
by piping the input through a sed script first.  Can you present a
situation where it would be better for sort itself to try and parse
longer suffixes?

On a side note, the XiB format (MiB, GiB) is extremely uncommon in my
opinion.  Yes I see people pushing it, but it always seems to me that
it's obvious which base is meant (2 or 10) based upon context (i.e.
which notation is used traditionally for the topic).

>
> +  /* FIXME: maybe add option to allow for spacing between number and suffix 
> */
>
> I wouldn't bother. If we supported general input maybe, but not otherwise.
I agree - as above, I think the more general input can first be
formatted by passing things through sed first.

>
> +  /* FIXME: maybe not use exponent if one of the strings uses an
> +     unrecognized suffix that's not a blank */
>
> I would flag that as an error
You are correct.  I'll add that to the patch I'm preparing.

>
> +  /* FIXME:  a_order - b_order || raw_comparison can be used - would that
> +     be faster? */
>
> Yep if you're not supporting overlapping number ranges then
> you can skip the number comparison totally if the suffixes don't match.
Actually it has nothing to do that.  I'm was just thinking that the
equality operation I'm testing for is already essentially doing a
subtraction and then I'm returning the actual subtraction itself.
Might as well just remove the conditional & leave the faster
calculation.  However, readability-wise I'd argue perhaps it's not as
good (and maybe there's a corner case where that comparison wouldn't
work, although can't think of one), and it's not like the speed
actually matters in this case (or even if it did, I highly doubt this
one line would result in any noticeable speed-up), that's why I'm
debating which form to use.  Not to mention, I'm pretty sure the
compiler should be smart enough to compile the a == b ? raw_comparison
: a - b into the above anyways.

Thanks,
Vitali

reply via email to

[Prev in Thread] Current Thread [Next in Thread]