rdiff-backup-users

From: EricZolf
Subject: Re: Discussion about file format for the future
Date: Sat, 6 Jun 2020 07:31:24 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.8.0

Hi,

allow me to "top-answer" because there are so many threads in this
discussion:

1. SPOF (single point of failure) and complexity are definitely things
to consider

2. a middle step could be to offer a parameter to tweak the variable
`max_diff_chain` in `metadata.py`, e.g. down to 0 or 1, so that no
metadata diffs are created, only snapshots, i.e. one complete set of
metadata for each backup (see the sketch after this list). Each admin
could then decide between size and simplicity of analysis (and the
speed of doing so).

3. to answer Derek's e-mail as well: would it have an impact on speed?
To be honest, no clue; we would need to analyze this.

4. have an API: yep, kind of part of the plans. Code encapsulation, or
rather the lack thereof, is IMHO the main issue with the current code
and what I'm currently trying to improve, without breaking the
client/server interface, so an API could well be a solution.

5. the fact that rdiff-backup doesn't correctly handle file system
(FS) features that are more limited on the target FS than on the
source FS (path length, strange characters, etc.) is IMHO independent
of the current discussion and can (and will) be addressed in any case.
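
A minimal sketch of point 2, assuming the module-level variable
`max_diff_chain` in `metadata.py` stays where it is today (the snippet
itself is purely hypothetical, not an existing option):

```python
# Hypothetical tweak: force a complete metadata snapshot for every
# backup instead of a chain of metadata diffs, by lowering the
# diff-chain length before the session starts. Assumes `metadata.py`
# exposes `max_diff_chain` at module level, as mentioned above.
from rdiff_backup import metadata

metadata.max_diff_chain = 1  # 1 = snapshot every time, never a diff
```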

KR, Eric

On 05/06/2020 23:44, Arrigo Marchiori wrote:
> Dear Patrik, All,
> 
> I will try to contribute to this interesting conversation.
> 
> On Fri, Jun 05, 2020 at 08:16:30AM -0400, Patrik Dufresne wrote:
> 
>> As mentioned by Robert, searching for metadata is complex because you
>> need to scan multiple files to actually find the right value, instead
>> of running a query as you would with a database.
>>
>> Obviously performance-wise it's not great either, because we need to
>> scan multiple files.
>>
>> The only thing I hate about that is the lack of visibility. As a
>> compromise, maybe we can find the most common database and add a layer
>> on top, using the command line to search in this database, to let
>> users be autonomous. SQLite is probably one of those very popular and
>> simple databases.
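>>
>> Purely as an illustration (the schema and file name below are made
>> up, nothing that exists today), a metadata lookup against such an
>> SQLite database could be as simple as:
>>
>> ```python
>> # Hypothetical: query per-file metadata from one SQLite database
>> # instead of scanning multiple mirror_metadata files.
>> import sqlite3
>>
>> con = sqlite3.connect("rdiff-backup-data/metadata.sqlite")  # made-up path
>> rows = con.execute(
>>     "SELECT path, size, mtime FROM file_metadata WHERE path LIKE ?",
>>     ("home/user/%",))
>> for path, size, mtime in rows:
>>     print(path, size, mtime)
>> con.close()
>> ```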
> 
> If we were going to substitute a lot of files with a single file (that
> is what a SQLite database is in the end, right?) then we may somehow
> introduce a "single point of failure" for the whole backup.
> 
> If I interpret correctly some experiences I had with apparent
> rdiff-backup metadata corruption (I was backing up files with accented
> letters, long paths on Windows, or onto faulty external hard drives),
> there is the possibility that some missing bits of information are
> reconstructed, or single unrecoverable files are substituted with
> zero-byte stubs, leaving the rest of the backup safe and recoverable.
> I wonder what would happen if the SQLite database got corrupted. Would
> data (such as the file list and/or contents) still be recoverable?
> 
> I would also like to add another note to this conversation. Microsoft
> Windows systems are subject to a limitation on the maximum length of
> file paths. This means that files with "long-ish" paths may not be
> accessible, or that their corresponding metadata (?) would not be, as
> some files inside the rdiff-backup-data directory seem to be named
> after backed-up files with some codes appended.
> 
> If the rdiff-backup-data directory is ever going to be redesigned,
> then please consider making it filesystem-agnostic. This would not
> only solve the above problem, but also allow other possibly useful
> use cases, such as backing up case-sensitive filesystems to
> case-insensitive ones or vice-versa... reliably.
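>
> Just to illustrate the idea (my own sketch, not a concrete format
> proposal): a filesystem-agnostic layout could store each increment
> under a short hash of its original path, which is fixed-length,
> ASCII-only and safe on case-insensitive filesystems, keeping the real
> path only in the metadata.
>
> ```python
> # Sketch: derive a storage name that survives any filesystem's limits
> # on path length, character set and case sensitivity.
> import hashlib
>
> def storage_name(original_path: str) -> str:
>     digest = hashlib.sha256(original_path.encode("utf-8")).hexdigest()
>     return digest[:32]  # 32 lowercase hex chars, safe everywhere
>
> print(storage_name("C:/Users/foo/Very Long Päth/file.txt"))
> ```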
> 
> I am also replying to Robert's e-mail below.
> 
>> On Thu., Jun. 4, 2020, 11:03 p.m. Robert Nichols, <
>> rnicholsNOSPAM@comcast.net> wrote:
>>
>>> On 6/4/20 11:43 AM, Patrik Dufresne wrote:
>>>> But my two cents on the subject: should we really keep this file-based
>>>> approach? For rdiffweb, scanning the metadata files is a nightmare when
>>>> I just need a subset of the data to be displayed to the user. I always
>>>> thought a database could be a better fit for the job. Something like a
>>>> key store or similar.
>>>
>>> +1 from me
>>>
>>> The way rdiff-backup stores metadata is its worst feature, in my opinion.
>>> Keeping the metadata in various text files makes analysis unnecessarily
>>> complex and searches very inefficient. Inode data for hard-linked files
>>> is replicated in the mirror_metadata file, except for the checksum, which
>>> is stored just on the first entry for that inode, so you have to go
>>> hunting for it, and make sure it is always in the right place when
>>> that linking changes. That sort of thing just screams to be stored in
>>> a database.
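>>>
>>> To illustrate the kind of hunting involved (the parsed representation
>>> below is made up, not the real mirror_metadata parser):
>>>
>>> ```python
>>> # Only the first entry for a hard-linked inode carries the checksum,
>>> # so any other entry has to scan for it. Hypothetical sketch.
>>> def checksum_for(entry, entries):
>>>     if entry.get("sha1"):
>>>         return entry["sha1"]
>>>     key = (entry["device"], entry["inode"])
>>>     for other in entries:  # find the one entry holding the checksum
>>>         if (other["device"], other["inode"]) == key and other.get("sha1"):
>>>             return other["sha1"]
>>>     return None
>>> ```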
> 
> I personally never looked into the details of rdiff-backup, but I
> often wished I could access all that data... easily.
> 
> Maybe this is what you are looking for as well? An alternative way to
> access rdiff-backup data and meta-data, other than launching
> rdiff-backup itself?
> 
> IMHO the best way of addressing this problem would not be to make an
> "easy to parse" file format, but rather to develop an official, easy
> to use API.  If rdiff-backup itself were importable from Python
> scripts, and made its functions directly accessible from Python code,
> then other tools (such as Patrik's rdiffweb, if I understood
> correctly) would no longer care about how increments and metadata are
> stored, because the API would abstract the details.  This, at least,
> is how I would imagine an ideal future development of this software.
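>
> Just to make the idea concrete, I imagine something like the
> following (every name here is invented; no such module exists today):
>
> ```python
> # Hypothetical importable API: tools like rdiffweb would call this
> # instead of parsing rdiff-backup-data themselves.
> from rdiff_backup import api  # invented module name
>
> repo = api.open_repository("/backups/myhost")
> for increment in repo.list_increments():
>     print(increment.timestamp, increment.num_changed_files)
> ```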
> 
> The structure of the meta-data itself should rather be based on the
> concepts of fault tolerance and independence from the filesystem, as
> I suggested above.
> 
> I hope I understood the topic of this thread, and that I could explain
> myself clearly enough.
> 
> Best regards,
> 



