[Octave-patch-tracker] [patch #7668] Enhancement, speedup of loading par
From: anonymous
Subject: [Octave-patch-tracker] [patch #7668] Enhancement, speedup of loading partial data from a hdf5 file
Date: Thu, 17 Nov 2011 09:00:24 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
URL:
<http://savannah.gnu.org/patch/?7668>
Summary: Enhancement, speedup of loading partial data from a
hdf5 file
Project: GNU Octave
Submitted by: None
Submitted on: Thu 17 Nov 2011 09:00:22 AM UTC
Category: None
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Email: address@hidden
Open/Closed: Open
Discussion Lock: Any
_______________________________________________________
Details:
I'm working with "big" datasets in hdf5 format. Files of 20-40GB are not
uncommon.
If possible, I load the entire file at once:
octave:1> tic(); all = load("filename.hdf5"); toc()
Elapsed time is 418.209 seconds.
octave:2>
But when the dataset is bigger than the available RAM, I want to do partial
loads to get out-of-core behavior:
octave:1> tic(); extr = load("filename.hdf5", "data000100"); toc()
Elapsed time is 301.926 seconds.
octave:2>
The same file is used in both examples. The file is ~20GB and has 2700 "data
elements", which are returned as structs. The machine I'm testing on has
24GB of RAM. Because other things were running, some swapping occurred while
reading the entire file, so the numbers should be seen as rough estimates.
My hope was that reading 1/2700th of the data would take roughly that
fraction of the time needed to read the entire file (about 0.15 seconds,
going by the 418-second full load). Unfortunately, that is not the case.
Why?
do_load keeps calling read_hdf5_data for as long as there is data to read.
Each time read_hdf5_data returns, do_load checks whether the data just read
matches the variables that should be extracted, and then calls
read_hdf5_data again.
As a result, the entire hdf5 file is parsed in both examples above.
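The effect can be illustrated with a toy model. This is only a sketch in
Python, not Octave's actual C++ source: the "file" is a plain list of named
nodes, and make_file, parse_node and load_filter_after are hypothetical
names chosen for illustration. It assumes that parsing a node is the
expensive step and simply counts how many nodes get parsed.

```python
# Toy model of the behavior described above; not Octave's actual source.
# The "file" is a list of (name, payload) nodes, and parse_node stands in
# for the expensive per-node work done by read_hdf5_data.

def make_file(count):
    """Build a fake file with nodes named data000000, data000001, ..."""
    return [("data%06d" % i, [i] * 4) for i in range(count)]

def parse_node(node, stats):
    """Pretend to parse one node and count how often that happens."""
    stats["parsed"] += 1
    name, payload = node
    return name, list(payload)

def load_filter_after(nodes, wanted, stats):
    """Parse every node first, then keep only the requested names."""
    result = {}
    for node in nodes:
        name, value = parse_node(node, stats)  # the cost is paid here...
        if not wanted or name in wanted:       # ...before the name check
            result[name] = value
    return result

stats = {"parsed": 0}
extracted = load_filter_after(make_file(2700), {"data000100"}, stats)
print(len(extracted), "variable kept,", stats["parsed"], "nodes parsed")
# prints: 1 variable kept, 2700 nodes parsed
```

In this model, a request for a single variable still pays the parsing cost
for all 2700 nodes, which matches the timings above in spirit.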
I suggest that if only some variables are to be read from a hdf5 file, the
name tests be done within read_hdf5_data, so that only the corresponding
nodes in the file are parsed. That saves a lot of time. If the entire file
is to be read, things work just as before.
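A minimal sketch of this idea, again as a toy model in Python and not the
attached patch itself: make_file, parse_node and load_filter_first are
hypothetical names, the "file" is a list of named nodes, and the sketch
assumes that a node's name can be inspected cheaply without parsing its
contents.

```python
# Toy model of the proposed behavior; not the attached patch.
# Assumption: a node's name is available without parsing its payload.

def make_file(count):
    """Build a fake file with nodes named data000000, data000001, ..."""
    return [("data%06d" % i, [i] * 4) for i in range(count)]

def parse_node(node, stats):
    """Pretend to parse one node and count how often that happens."""
    stats["parsed"] += 1
    name, payload = node
    return name, list(payload)

def load_filter_first(nodes, wanted, stats):
    """Test the name first and parse only the nodes that are requested.

    An empty 'wanted' set means "load everything", as before.
    """
    result = {}
    for node in nodes:
        name = node[0]                     # cheap name lookup (assumed)
        if wanted and name not in wanted:
            continue                       # skip without parsing
        parsed_name, value = parse_node(node, stats)
        result[parsed_name] = value
    return result

nodes = make_file(2700)

partial_stats = {"parsed": 0}
partial = load_filter_first(nodes, {"data000100"}, partial_stats)
print("partial load parsed", partial_stats["parsed"], "node(s)")

full_stats = {"parsed": 0}
full = load_filter_first(nodes, set(), full_stats)
print("full load parsed", full_stats["parsed"], "node(s)")
```

The partial load parses one node, while the full load still parses all
2700, so a full read is unaffected.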
The patch I've attached implements this. When I repeat the test
"tic(); extr = load("filename.hdf5", "data000100"); toc()", it takes less
than 0.2 seconds.
I hope this patch is of interest. If it needs changes to be considered,
let me know and I'll try to adapt it.
/ Mattias Linde
_______________________________________________________
File Attachments:
-------------------------------------------------------
Date: Thu 17 Nov 2011 09:00:22 AM UTC Name: octave-hdf5patch.txt Size: 3kB
By: None
<http://savannah.gnu.org/patch/download.php?file_id=24391>
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/patch/?7668>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/