gnunet-developers
Re: [GNUnet-developers] Idea for file storage in GNUnet


From: Christian Grothoff
Subject: Re: [GNUnet-developers] Idea for file storage in GNUnet
Date: Thu, 06 Dec 2012 23:29:28 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121028 Thunderbird/16.0.2

Dear hypothesys,

Thank you for your suggestion. Let me first describe how I understood your idea. Basically, the idea is that GNUnet's file storage should not occupy disk space, but leave it marked in the OS file system as "free" (presumably because of redundancy, loss is not an issue). Then, when the data is needed, GNUnet simply should check if the checksum is still correct, and if so, serve it. That way, we could push drive utilization to 100% without the user even noticing.

Let me point out a few differences between your perception of the issues with this and how I see them. First of all, GNUnet already splits files into blocks for storage, and the blocks are encrypted and self-verifying, so we would not even need to store a separate checksum. All we would still need is an index that allows us to quickly find the offset on the disk that holds the block (scanning a multi-TB disk for each request is infeasible). That index can still be big (say 5-10% of the actual storage space) and would have to live in "reliable" storage (you don't want a corrupt index), but we actually already have the infrastructure for this in place (see OnDemandBlocks in the code, which are essentially an index into an existing 'normal' file on disk). So that part is IMO *easier* than you might have thought.
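To make the index idea concrete, here is a minimal Python sketch. It is not GNUnet's actual code and does not reproduce how OnDemandBlocks work: the class name `BlockIndex`, its methods, the use of SHA-512 as the block hash, and the in-memory dict are all assumptions made for illustration. It only shows the principle that a block keyed by its own hash can be located via (offset, length) and checked on read without a separately stored checksum.

```python
import hashlib


class BlockIndex:
    """Hypothetical index: block hash -> (offset, length) in a backing file."""

    def __init__(self, path):
        self.path = path
        self.entries = {}  # hex digest -> (offset, length)

    def add(self, offset, length):
        """Hash the block currently stored at offset and remember its location."""
        with open(self.path, "rb") as f:
            f.seek(offset)
            data = f.read(length)
        key = hashlib.sha512(data).hexdigest()
        self.entries[key] = (offset, length)
        return key

    def read_verified(self, key):
        """Return the block if it still hashes to key, otherwise None."""
        entry = self.entries.get(key)
        if entry is None:
            return None
        offset, length = entry
        with open(self.path, "rb") as f:
            f.seek(offset)
            data = f.read(length)
        if hashlib.sha512(data).hexdigest() != key:
            return None  # overwritten or corrupted: treat the block as lost
        return data
```

In this sketch the index itself is the part that must live in reliable storage; the backing file may change underneath it, and a block that no longer matches its key is simply reported as missing.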

But there is another part, and that is getting to the data. Writing a file and deleting it is easy, and your assessment that OSes don't really delete it holds true in 99.9% of the cases. However, getting the physical offset on disk while the file exists is already virtually impossible if you're not inside the kernel. You can get the inode number and device ID, but that's often just the meta-data and already not very portable. Getting the physical disk offset or the logical partition offset -- I'm not aware of ANY API to get that.
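As an illustration of what user space can readily obtain, the following Python sketch uses the standard `os.stat` call to fetch the device ID and inode number of a file. The helper name `file_identity` is hypothetical. Note what is missing from the result: the pair identifies the file, but says nothing about where its data sits on the disk.

```python
import os


def file_identity(path):
    """Return (device ID, inode number) for path, as reported by stat()."""
    st = os.stat(path)
    return st.st_dev, st.st_ino
```

Calling `file_identity("/some/file")` yields two integers; neither of them is a physical or partition-relative offset.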

Even if we were able to get that offset (e.g. by having a custom kernel module), we would then need really, really sensitive access (/dev/sdaX) to the raw disk to read it later. That is again not something a security-sensitive application should take lightly. Now, I guess a SUID wrapper to read at certain offsets if and only if the data stored there matches a certain hash _might_ be doable, but it is still a pretty tough proposal to get past a security audit (as, for example, an adversary might just want to do a confirmation attack on an unrelated file).

Finally, once you have deleted the file, you want to somehow make sure that the OS does not re-use this space right away. But re-use is actually quite likely, so the moment you write your 2nd file this way, you are somewhat likely to overwrite your first file. So what you would really want is a way to enumerate all of the unused blocks on disk, and then write there directly (instead of taking the indirect route of first writing a normal file and then deleting it to make the space appear unused). That would require detailed knowledge of the specific file-system, and would again require OS-level (and file-system specific) extensions to the system.

Given this, I don't think there is a realistic chance of creating an implementation that would be used in the real world.

Now, there is a second possibility --- just use "normal" files, and then if you notice that the disk starts to get full, delete them. The main difference would be that the user would see that the disk is full (df...), and that the file-system would likely fragment more. If the process that watches the fullness of the disk is done well, the effect for the end-user would still be otherwise the same. That is most likely much easier to implement and deploy.
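A minimal Python sketch of such a watcher follows. It is an illustration under stated assumptions, not an existing GNUnet component: the function names, the oldest-first eviction policy, and the `min_free` threshold are all hypothetical. It uses the standard `shutil.disk_usage` call to measure free space and deletes cached files until the requested free fraction is restored.

```python
import os
import shutil


def free_fraction(path):
    """Fraction of the file system holding path that is currently free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def evict_until_free(cache_dir, min_free, free_fn=free_fraction):
    """Delete the oldest files in cache_dir until min_free is reached.

    free_fn is injectable so the policy can be exercised without
    actually filling a disk. Returns the list of deleted paths.
    """
    files = [os.path.join(cache_dir, n) for n in os.listdir(cache_dir)]
    files = [p for p in files if os.path.isfile(p)]
    files.sort(key=os.path.getmtime)  # oldest first
    deleted = []
    for path in files:
        if free_fn(cache_dir) >= min_free:
            break
        os.remove(path)
        deleted.append(path)
    return deleted
```

A real watcher would run this periodically, for example `evict_until_free(cache_dir, 0.10)` to keep at least 10% of the disk free, and would also have to tell the rest of the system which blocks are gone.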


Finally, a bigger question in my mind is whether available disk space is really the issue in general. For me, bandwidth, latency, seek-speed and CPU usage have been concerns; the disk is pretty much the only resource that is virtually unlimited: it would take months of download time over my Internet connection to fill my drive, and years to upload it (stupid DSL). So I'm afraid that while I think something could be done here, I'm not sure it makes sense to prioritize this.

Happy hacking!

Christian

On 12/06/2012 10:03 PM, hypothesys wrote:
Hello GNUnet Developers,

First of all I apologize if this is not the correct place for discussing a
possible new feature to GNUnet and since I am not from the IT field I cannot
even attempt to implement it. Still, perhaps if you find this feature
valuable you would consider implementing it so I wanted to share it. Please
bear in mind that I am no expert and this may not be feasible for technical
reasons not obvious to me. In that case please say so and I will not take
more of your time.

Some time ago I had the idea that gnunet (as well as other projects) could
benefit from increased disk space for storage and that using the free space
on disk should be a technically possible if difficult task.

On many OS filesystems, when a file is deleted, it is not truly erased. In
the FAT filesystem, for example, the list of disk clusters occupied by the
file is erased from the file allocation table, marking those sectors
available. On other filesystems I do not know how that is handled but, for
the sake of argument let's say that a header is instead applied to the file
indicating that the file portion of the hard disk is available to be
overwritten.

/header/ data block Nº1; /header/ data block Nº2; /header/ data block
Nº3;...

If gnunet was able to split the file data into data blocks (encrypted of
course) and subsequently delete the data, while keeping both a checksum for
the data block and record of its disk location, the free disk space of
computers on which gnunet was installed could be used for storage without
compromising normal functioning of said computer.

This program, perhaps to be named gnunet-str (storage) would at the moment
of storage of data, create a checksum for every encrypted data block and for
every "contiguous" data group, as follows:

/block1/block2/block3/block4/block5/block6/block7/block8...
=>checksum1/checksum2/checksum3/checksum4/...

but also

/block1/block2/block3/block4/block5/block6/block7/block8...=>checksum1+2/checksum3+4/checksum5+6/checksum7+8...

and also

/block1/block2/block3/block4/block5/block6/block7/block8...=>checksum1+2+3+4/checksum5+6+7+8/checksum9+10+11+12...

and continuing...

In this way, it would be possible to (quickly? - by going from the checksums
for the agglomerations of blocks to the individual blocks) ascertain which
data was corrupt (by usage of the main OS, or a disk defrag) and had to be
replaced. It would then signal to other GNUnet nodes "Of the data stored
only 70% (for example) is still not corrupted. I can share this 70% but give
me the 30% back, or new files to store in this space".
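The grouped checksums described above resemble a hash tree. The Python sketch below is an illustration only: it swaps in a Merkle-style tree, where each higher-level digest is computed over the two digests beneath it rather than over the concatenated block data, and it assumes SHA-256. The function names `build_tree` and `find_corrupt` are hypothetical. Starting from the root, it descends only into groups whose digest no longer matches.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return a list of levels: level 0 holds per-block digests, the last the root."""
    level = [_h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        # Each parent covers up to two children from the level below.
        level = [_h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def find_corrupt(stored, blocks):
    """Indices of blocks that disagree with the stored tree (same block count assumed)."""
    current = build_tree(blocks)
    suspects = [0]  # node indices at the level being examined, root first
    for depth in range(len(stored) - 1, 0, -1):
        next_suspects = []
        for i in suspects:
            if stored[depth][i] != current[depth][i]:
                for child in (2 * i, 2 * i + 1):
                    if child < len(stored[depth - 1]):
                        next_suspects.append(child)
        suspects = next_suspects
    return [i for i in suspects if stored[0][i] != current[0][i]]
```

One caveat about this sketch: it recomputes the whole tree, so every block is still read from disk once. The grouping reduces the number of digest comparisons (and would reduce traffic if two peers compared trees), not the amount of data that has to be re-read locally.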

Such a solution would allow big amounts of storage - in theory, all of the
free space on the hard drive of the host computer. Due to its nature it would not
be possible to rely on the data not being compromised without implementing
redundancy. If this gnunet-str made x copies of file y for example, the
probability of data corruption and loss could be greatly diminished.
Tahoe-LAFS and gnunet are based on this principle (although I could be wrong
as I'm no expert), redundancy of storage between multiple peers on the net.
If this redundancy could also be implemented locally, the total storage for
GNUnet would increase.

Alternatively to providing a greater amount of data storage, perhaps such a
feature could instead be used to boost GNUnet's efficiency as parts of a
file on a distant node could also be made available on more nodes
diminishing the distance between the "asking node" and the node who actually
has the file.

Do you think such a feature could be useful for GNUnet? Once again do not
hesitate to say this idea is unfeasible for some reason, I just shared it in
the hopes of it being useful to an improved gnunet.

-- hypothesys
