Re: datamash performance question

bug-datamash

From:	Jake VanEck
Subject:	Re: datamash performance question
Date:	Fri, 25 Jun 2021 17:36:26 -0400

So far, this option seems to be loading the data into memory, which will far exceed what I have available. After just a few minutes, mawk is using over 3 GB of memory and nothing has been returned, consistent with your comment that it keeps the running sums in memory and writes them out only when the input is exhausted. So I guess my problem is that the input won't be exhausted for many GB of data, which is also why datamash was working so wonderfully.

Any way to run datamash in parallel?

-Jake
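
One way to get output before the input is exhausted is to rely on the input being sorted by the grouping key, so that each sum can be written out as soon as its group ends. The sketch below illustrates that idea only. It is not datamash's actual implementation, and the function name `sum_sorted_groups` and the two-column, tab-separated layout are assumptions made for illustration.

```python
import io
from itertools import groupby


def sum_sorted_groups(lines, sep="\t"):
    """Yield (key, total) pairs for input that is ALREADY sorted by key.

    Only the current group's running sum is held in memory, so a result
    is produced each time the key changes rather than at end of input.
    Assumed layout (hypothetical): column 1 is the key, column 2 a number.
    """
    rows = (line.rstrip("\n").split(sep) for line in lines if line.strip())
    # groupby only merges consecutive rows with equal keys, which is why
    # the input has to be sorted on the key beforehand.
    for key, group in groupby(rows, key=lambda row: row[0]):
        yield key, sum(float(row[1]) for row in group)


# Small in-memory stand-in for a large sorted file.
sample = io.StringIO("a\t1\na\t2\nb\t5\n")
result = list(sum_sorted_groups(sample))
print(result)
```

If the input is not sorted, the same key can show up in several separate runs and this approach would report a partial sum for each run.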

On Fri, Jun 25, 2021 at 4:43 PM Dima Kogan <dima@secretsauce.net> wrote:

Jake VanEck <jake.vaneck@gmail.com> writes:

> I've tried similar commands but doesn't awk need to put the entire dataset
> into memory for this?

No. Absolutely not. It will read the input one line at a time, keeping
the running sums in memory, and it will write out the sums when the
input is exhausted.

If you care about performance, try out mawk specifically. It's a bit
snappier than other implementations.
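
The line-at-a-time behaviour described above can be sketched in Python as a rough analogue of an awk associative array. This is an illustration under assumptions, not the command discussed in the thread: the function name `sum_by_key` and the two-column, tab-separated layout are hypothetical.

```python
import io
from collections import defaultdict


def sum_by_key(lines, sep="\t"):
    """Read lines one at a time and keep one running sum per distinct key.

    The input itself is never stored. What stays in memory is the table of
    running sums, which has one entry per distinct key and is only complete
    once the input is exhausted.
    Assumed layout (hypothetical): column 1 is the key, column 2 a number.
    """
    totals = defaultdict(float)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key, value = line.rstrip("\n").split(sep)[:2]
        totals[key] += float(value)
    return dict(totals)


# Unsorted sample input: the order of keys does not matter here.
sample = io.StringIO("b\t5\na\t1\na\t2\n")
totals = sum_by_key(sample)
print(totals)
```

In this sketch memory use tracks the number of distinct keys rather than the number of input lines, so an input with very many distinct keys can still use a lot of memory even though it is read one line at a time.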
