Re: [PATCH] open_issues/bcachefs.mdwn: new file.
From: Samuel Thibault
Subject: Re: [PATCH] open_issues/bcachefs.mdwn: new file.
Date: Wed, 10 Jan 2024 00:06:31 +0100
User-agent: NeoMutt/20170609 (1.8.3)
Applied, thanks!
jbranso@dismail.de, on Sat. 06 Jan. 2024 14:59:40 -0500, wrote:
> Well, we might as well document our conversation with Kent about bcachefs.
>
> ---
> open_issues/bcachefs.mdwn | 326 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 326 insertions(+)
> create mode 100644 open_issues/bcachefs.mdwn
>
> diff --git a/open_issues/bcachefs.mdwn b/open_issues/bcachefs.mdwn
> new file mode 100644
> index 00000000..aa39bce0
> --- /dev/null
> +++ b/open_issues/bcachefs.mdwn
> @@ -0,0 +1,326 @@
> +[[!meta copyright="Copyright © 2007, 2008, 2010, 2011 Free Software Foundation,
> +Inc."]]
> +
> +[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
> +id="license" text="Permission is granted to copy, distribute and/or modify this
> +document under the terms of the GNU Free Documentation License, Version 1.2 or
> +any later version published by the Free Software Foundation; with no Invariant
> +Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
> +is included in the section entitled [[GNU Free Documentation
> +License|/fdl]]."]]"""]]
> +
> +[[!tag open_issue_hurd]]
> +
> +The Hurd's primary filesystem is ext2, which works but lacks modern
> +features. Ext2 does not have a journal, so Hurd users regularly have
> +to deal with filesystem corruption. `fsck` can fix most of the
> +issues (with loss of random data), but without a proper journal the
> +Hurd is currently not a good OS for long-term data storage.
> +
> +Bcachefs is a modern COW (copy-on-write) open source filesystem for
> +Linux, which intends to replace Btrfs and ZFS while having the
> +performance of ext4 or XFS. It is almost 100,000 lines of code,
> +compared with 150,000 for Btrfs. Bcachefs is structured as a
> +filesystem built on top of a database, with a small, clean
> +database transaction layer. That core database library is maybe
> +25,000 lines of code.
> +
> +Some Hurd developers recently [[talked with
> +Bcachefs|https://youtube.com/watch?v=bcWsrYvc5Fg]] author Kent
> +Overstreet about porting bcachefs to the Hurd. There are currently no
> +concrete plans to do so due to a lack of developer manpower.
> +
> +90% of the Bcachefs filesystem code builds and runs in userspace. It
> +uses a shim layer that maps kernel locking primitives to
> +pthreads, the kernel I/O API to AIO, etc. Bcachefs does
> +intend to eventually rewrite most or all of its current codebase in
> +Rust.
> +
> +Kent is ok with us merging a shim layer for libstore that maps to the
> +Unix filesystem API. That would be a header file that goes into the
> +bcachefs code.
> +
> +There is a somewhat working FUSE port of bcachefs, but Kent is not
> +certain that it is a good way to run bcachefs in userspace. Kent wants
> +to use the FUSE port to help with debugging. If bcachefs starts
> +acting up, you could switch to running it in userspace and attach
> +GDB to the running process. This is currently not possible.
> +
> +We could port bcachefs to the Hurd's native filesystem API: libdiskfs.
> +
> +One interesting aspect of the conversation was Kent's goal of re-using
> +kernel code in userspace. The Linux kernel hashtable code is high
> +performance, resizable, lockless, and builds and runs in userspace.
> +As long as you have liburcu, you can use the kernel hashtable in
> +userspace, which might be useful on the Hurd.
> +
> +Bcachefs is licensed as GPLv2, and many of Kent's previous employers
> +own the patents, including Google. Kent is ok with potentially making
> +the license GPLv2+, as long as no promise has been made to keep
> +bcachefs GPLv2 only.
> +
> +# IRC logs
> +
> +https://logs.guix.gnu.org/hurd/2023-09-26.log
> +
> + <solid_black> maybe I'm wrong though, do you know much about fuse? or file systems?
> + <damo22> no i dont know much about filesystems
> + <damo22> what is bcachefs?
> + <solid_black> see? :D
> + <azert> I agree that someone intimate in the Mach pager api, libdiskfs and fuse would be great at that meeting
> + <solid_black> I do kind of understand Mach VM / paging, I must say
> + <solid_black> from the looks of it, I even understand it best among those who have looked at it recently
> + <solid_black> and I mostly understand libdiskfs
> + <damo22> so go to the meeting
> + <damo22> what is fuse? do we even need it for hurd?
> + <damo22> file systems in userspace
> + <solid_black> FUSE is "filesystem in user space", it's both the name for the concept, and the name of Linux's specific mechanism, of offloading fs to userland
> + <damo22> yeah, i think it may be unneeded for filesystem on hurd
> + <solid_black> it's basically a giant hack that pretends to be a filesystem implementation to the rest of the kernel, and then sends requests and receives responses from a userland program that _actually_ implements the fs
> + <solid_black> on the Hurd, *of course* filesystems are implemented in userland, that's the only and the natural way everything works
> + <solid_black> but that's where the similarities end
> + <solid_black> you cannot just take a linux fuse fs, using libfuse, and run it on the Hurd
> + <solid_black> there has been a project to make a library that would have the same API as libfuse, but act as a Hurd translator, specifically to facilitate porting linux filesystems
> + <damo22> i imagine fuse has an api
> + <solid_black> last I heard, it was never completed, but who knows
> + <solid_black> it has a kernel<->userland protocol and a userspace library (libfuse) for implementing that protocol, yes
> + <damo22> solid_black: you seem to know more about fuse than you admitted
> + <solid_black> https://www.gnu.org/software/hurd/hurd/libfuse.html
> + <solid_black> I know the basics, around as much as I have just told you
> + <azert> I think that gnucode idea was that this would be the easiest to port bcachefs to the Hurd, but I doubt it would be the best
> + <solid_black> I have also hacked on a C++ fuse fs (darling-dmg), though I don't think I interacted with the fuse parts of it much
> + <azert> Or even the easier
> + <solid_black> yeah, I don't think it'd be the best or the easiest one either
> + <damo22> if someone implemented libfuse api and made it as a hurd translator, surely it would work natively?
> + <damo22> <braunr> zacts: the main problem seems to be the interactions between the fuse file system and virtual memory (including caching)
> + <braunr> something the hurd doesn't excel at
> + <braunr> it *may* be possible to find existing userspace implementations that don't use the system cache (e.g. implement their own)
> + <azert> Yes, that’s a possibility that needs to be kept open for discussion
> + <nikolar> Sounds interesting
> + <solid_black> youpi: ping
> + <youpi> pong
> + <solid_black> hello!
> + <solid_black> any thoughts on the above discussion? are you going to participate in the call that's being set up?
> + <youpi> I don't have time for it
> + <youpi> (AFAIK the fuse hurd implementation does work to some extent)
> + <solid_black> I should at least try out Hurd's fuse before the call, good idea
> + <solid_black> maybe read up on the Linux's fuse
> + <solid_black> thoughts on using fuse vs libdiskfs for bcachefs?
> + <youpi> using fuse would probably be less work
> + <youpi> and it'd probably mean fixing things in libfuse, which can benefit many other FS anyway
> + <solid_black> is it true that the "low level" API of libfuse is unimplemented and unimplementable?
> + <youpi> I don't know what that "low level" API is
> + <solid_black> this IIUC https://github.com/libfuse/libfuse/blob/master/include/fuse_lowlevel.h
> + <solid_black> > libfuse offers two APIs: a "high-level", synchronous API, and a "low-level" asynchronous API. In both cases, incoming requests from the kernel are passed to the main program using callbacks. When using the high-level API, the callbacks may work with file names and paths instead of inodes, and processing of a request finishes when the callback function returns. When using the low-level API, the callbacks must work with inodes and responses must be se
> + <solid_black> nt explicitly using a separate set of API functions.
> + <youpi> where did you read that it'd be unimplementable ?
> + <solid_black> https://git.savannah.gnu.org/cgit/hurd/incubator.git/tree/README?h=libfuse/master
> + <solid_black> > This is simply because it is to specific to the Linux kernel and (besides that) it is not farly used now.
> + <youpi> In case the latter should change in the future, we might want to re-think about that issue though.
> + <solid_black> so, sounds like it's perhaps implementable in theory, but that'd require additional work and design
> + <youpi> see the sentence below...
> + <solid_black> the low-level API is what bcachefs uses
> + <youpi> well, additional work and design, of course
> + <solid_black> seems to, at least, from a quick glance
> + <youpi> any async API needs some
> + <youpi> but I don't see why it would not be possible
> + <youpi> mig precisely supports asynchronous stubs
> + <solid_black> bcachefs-tools/cmd_fusermount.c is just 1274 lines, which inspires some hope
> + <solid_black> asynchrony is not the problem, I imagine (but I haven't looked), but being too tied to Linux might be
> + <youpi> it's not really tied, as in it doesn't seem to use linux-specific functions
> + <youpi> but it uses linux-like notions, which indeed need to be translated to the hurdish notions
> + <youpi> but that's not something really tough
> + <youpi> just needs to be worked on
> +
> +https://logs.guix.gnu.org/hurd/2023-09-27.log#103329
> +
> + <solid_black> libfuse as shipped in Debian doesn't seem very
> + functional, I can't even build a simple program against it:
> + 'i386-gnu/libfuse.so: undefined reference to `assert''
> +
> + <solid_black> (assert is of course a macro in glibc)
> + <solid_black> and it segfaults in fuse_main_real
> + <solid_black> lowlevel fuse ops do seem to map to netfs concept nicely, as far as I can see so far
> + <solid_black> and (again, so far) I don't see any asynchrony in how bcachefs uses fuse, i.e. they always fuse_reply() inside the method implementation
> +
> + <solid_black> but if we had to implement low-level fuse API, this would be an issue
> + <solid_black> because netfs is synchronous
> + <solid_black> this is again a place where I don't think netfs is actually that useful
> + <solid_black> libfuse should be its own standalone translator library, a peer to lib{disk,net,triv}fs
> + <solid_black> yell at me if you disagree
> + <youpi> or perhaps make it use libdiskfs ?
> + <youpi> there's significant code in libdiskfs that you'd probably not want to reimplement in libfuse
> + <solid_black> like what?
> + <youpi> starting a translator
> + <youpi> all the posix semantic bits
> + <solid_black> (this is another thing, I don't believe there is a significant difference that explains libdiskfs and libnetfs being two separate libraries. but it's too late to merge them, and I'm not an fs dev)
> +
> + <solid_black> starting a translator is abstracted into libfshelp specifically so it can be easily reused?
> + <solid_black> is libdiskfs synchronous?
> + <youpi> I'm just saying things out of my memory
> + <solid_black> scratch that, diskfs does not work like that at all
> + <youpi> piece of it is in fshelp yes
> + <solid_black> it works on pagers, always
> + <youpi> but significant pieces are in libdiskfs too
> + <youpi> and you are saying you are not an FS person :)
> + <youpi> you do know libdiskfs etc. well beyond the average
> + <youpi> perhaps not the ext2 FS structure, but that's not really important here
> + <youpi> see e.g. the short-circuits in file-get-trans.c
> + <solid_black> I may understand how the Hurd's translator libraries work, somewhat better than the average person :)
> + <youpi> and the code around fshelp_fetch_root
> + <solid_black> but I don't know about how filesystems are actually organized, on-disk (beyond the basics that there are inodes and superblocks and journaled writes and btrees etc)
> + <youpi> you don't really need to know more about that
> + <solid_black> nor do I know the million little things about how filesystem code should be written to be robust and performant
> + <solid_black> yeah so as I was saying, libdiskfs expects files to be mappable (diskfs_get_filemap_pager_struct), and then all I/O is implemented on top of that
> + <solid_black> e.g. to read, libdiskfs queries that pager from the impl, maps it into memory, and copies data from there to the reply message
> + <solid_black> I must have mentioned that already, I'd like to rewrite that code path some day to do less copying
> + <solid_black> I imagine this might speed up I/O heavy workloads
> + <youpi> ? it doesn't copy into the reply
> + <youpi> it transfers map
> + <solid_black> it does, let me find the code
> + <youpi> in some corner cases yes
> + <youpi> but not normal case
> + <youpi> https://darnassus.sceen.net/~hurd-web/hurd/io_path/
> + <solid_black> libdiskfs/rdwr-internal.c, it does pager_memcpy, which is a glorified memcpy + fault handling
> + <solid_black> don't trust that wiki page
> + <youpi> why not ?
> + <youpi> no, pager_memcpy is not just a memcpy
> + <youpi> it's using vm_copy whenever it can
> + <youpi> i.e. map transfer
> + <solid_black> well yes, but doesn't the regular memcpy also attempt to do that?
> + <youpi> it happens to do so indeed
> + <youpi> but that doesn't matter: I do mean it's trying *not* copying
> + <youpi> by going through the mm
> + <youpi> note: if a wiki page is bogus, propose a fix
> + <solid_black> I think there was another copy on the path somewhere (in the server, there's yet another in the client of course), but I can't quite remember where
> + <solid_black> and I wouldn't rely on that vm_copy optimization
> + <solid_black> it may be useful when it's working, but we have to design for there to not be a need to make a copy in the first place
> + <solid_black> ah well, pager_read_page does the other copy
> + <youpi> when things are not aligned etc. you'll have to do a copy anyway
> + <solid_black> but then again, this is all my idle observations, I'm not an fs person, I haven't done any profiling, and perhaps indeed all these copies are optimized away with vm_copy
> + <youpi> where in pager_read_page do you see a copy?
> + <youpi> it should be doing a store_read
> + <youpi> passing the pointer to the driver
> + <solid_black> ext2fs/pager.c:file_pager_read_page (at line 220 here, but I haven't pulled in a while)
> + <solid_black> it does do a store_read, and that returns a buffer, and then it may have to copy that into the buffer it's trying to return
> + <solid_black> though in the common case hopefully it'll read everything in a single read op
> + <youpi> it's in the new_buf != *buf + offs case
> + <youpi> which is not supposed to be the usual case
> + <solid_black> but now imagine how much overhead this all is
> + <youpi> what? the ifs?
> + <solid_black> we're inside io_read, we already have a buffer where we should put the data into
> + <youpi> I have to go give a course, gotta go
> + <solid_black> we could just device_read() into there
> + <youpi> you also want to use a cache
> + <youpi> otherwise it'll be the disk that'll kill your performance
> + <youpi> so at some point you do have to copy from the cache to the application
> + <youpi> that's unavoidable
> + <youpi> or if it's large, you can vm_copy + copy-on-write
> + <youpi> but basically, the presence of the cache means you can have to do copies
> + <youpi> and that's far less costly than re-reading from the disk
> + <solid_black> why can't you return the cache page directly from io_read RPC?
> + <youpi> that's vm_copy, yes
> + <youpi> but then if the app modifies the piece, you have to copy-on-write
> + <youpi> anyway, really gotta go
> + <solid_black> that part is handled by Mach
> + <solid_black> right, so once you're back: my conclusion from looking at libfuse is that it should be rewritten, and should not be using netfs (nor diskfs), but be its own independent translator framework
> + <solid_black> and it just sounds like I'm going to be the one who is going to do it
> + <solid_black> and we could indeed use bcachefs as a testbed for the low level api, and darling-dmg for the high level api
> + <solid_black> I installed avfs from Debian (one of the few packages that depend on libfuse), and sure enough: avfs: symbol lookup error: /lib/i386-gnu/libfuse.so.1: undefined symbol: assert_perror
> + <solid_black> upstream fuse is built with Meson 🤩️
> + <solid_black> I'm wondering whether this would be better done as a port in the upstream libfuse, or as a Hurd-specific libfuse lookalike that borrows some code from the upstream one (as now)
> + <damo22> solid_black: what is your argument to rewrite a translator framework for fuse?
> + <damo22> i dont understand
> + <solid_black> hi
> + <damo22> hi
> + <solid_black> basically, 1. while the concepts of libfuse *lowlevel* api seem to match that of hurd / netfs, they seem sufficiently different to not be easily implementable on top of netfs
> + <solid_black> particularly, the async-ness of it, while netfs expects you to do everything synchronously
> + <damo22> is that a bug in netfs?
> + <solid_black> this could be maybe made to work, by putting the netfs thread doing the request to sleep on a condition variable that would get signalled once the answer is provided via the fuse api... but I don't think that's going to be any nicer than designing for the asynchrony from the start
> + <solid_black> it's not a bug, it's just a design decision, most Hurd translators are structured that way
> + <damo22> maybe you can rewrite netfs to be asynchronous and replace it
> + <solid_black> i.e.: it's rare that translators use MIG_NO_REPLY + explicit reply, it's much more common to just block the thread
> + <solid_black> 2. the current state is not "somewhat working", it's "clearly broken"
> + <damo22> why not start by trying to implement rumpdisk async
> + <damo22> and see what parts are missing
> + <solid_black> wdym rumpdisk async?
> + <damo22> rumpdisk has a todo to make it asynchronous
> + <damo22> let me find the stub
> + <damo22> * FIXME:
> + <damo22> * Long term strategy:
> + <damo22> *
> + <damo22> * Call rump_sys_aio_read/write and return MIG_NO_REPLY from
> + <damo22> * device_read/write, and send the mig reply once the aio request has
> + <damo22> * completed. That way, only the aio request will be kept in rumpdisk
> + <damo22> * memory instead of a whole thread structure.
> + <solid_black> ah right, that reminds me: we still don't have proper mig support for returning errors asynchronously
> + <damo22> if the disk driver is not asynchronous, what is the point of making the filesystem asynchronous?
> + <solid_black> the way this works, being asynchronous or not is an implementation detail of a server
> + <solid_black> it doesn't matter to others, the RPC format is the same
> + <solid_black> there's probably not much point in asynchrony for a real disk fs like bcachefs, which must be why they don't use it and reply immediately
> + <solid_black> but imagine you're implementing an over-the-network fs with fuse, then you'd want asynchrony
> + <damo22> what is your goal here? do you want to fix libfuse?
> + <solid_black> I don't know
> + <solid_black> I'm preparing for the call with Kent
> + <solid_black> but it looks like I'm going to have to rewrite libfuse, yes
> + <damo22> possibly the caching is important
> + <damo22> ie, where does it happen
> + <solid_black> maybe, yes
> + <solid_black> does fuse support mmap?
> + <damo22> idk
> + <damo22> good q for kent
> + <solid_black> one essential fs property is coherence between mmap and r/w
> + <solid_black> so if you change a byte in an mmaped file area, a read() of that byte after that should already return the new value
> + <solid_black> same for write() + read from memory
> + <solid_black> this is why libdiskfs insists on reading/writing files via the pager and not via callbacks
> + <solid_black> I wonder how fuse deals with this
> + <damo22> good point, no idea
> + <solid_black> does fuse really make the kernel handle O_CREAT / O_EXCL? I can't imagine how that would work without racing
> + <solid_black> guess it could be done by trying opening/creating in a loop, if creation itself is atomic, but this is not nice
> + <damo22> something is still slowing down smp
> + <damo22> it cant possibly be executing as fast as possible on all cores
> + <damo22> if more cores are available to run threads, it should boot faster not slower
> + <azert> Hi damo22, your reasoning would hold if the kernel wouldn’t be “wasting” most of its time running in kernel mode tasks
> + <azert> If replacing CPU_NUMBER by a better implementation gave you a two digits improvement, that kind of implies that the kernel is indeed taking most of the cpu
> + <damo22> yes i mean, something in the kernel is slowing down smp
> + <azert> What about vm_map and all thread tasks synchronization
> + <azert> ?
> + <damo22> i dont understand how the scheduler can halt the APs in machine_idle() and not end up wasting time
> + <damo22> how does anything ever run after HLT
> + <damo22> in that code path
> + <damo22> if the idle thread halts the processor the only way it can wake up is with an interrupt
> + <damo22> but then, does MARK_CPU_ACTIVE() ever run?
> + <damo22> hmm it does
> + <azert> I think that normally the cpu would be running scheduler code and get a thread by itself.
> + <damo22> thats not how it works
> + <damo22> most of the cpus are in idle_continue
> + <damo22> then on a clock interrupt or ast interrupt, they are woken to choose a thread i think
> + <damo22> s/choose/run
> + <azert> If they are in cpu_idle then that’s what happens, yea
> + <azert> But normally they wouldn’t be in cpu idle but running the schedule and just a thread on their own
> + <azert> Cpu_idle basically turns off the cpu
> + <azert> To save power
> + <damo22> every time i interrupt the kernel debugger, its in cpu-idle
> + <damo22> i dont know if it waits until it is in that state so maybe thats why
> + <azert> That means that there is nothing to schedule
> + <azert> Or yea that’s another explanation
> + <damo22> yes, exactly i think it is seemingly running out of threads to schedule
> + <azert> A bug in the debugger
> + <damo22> i need to print the number of threads in the queue
> + <youpi> adding a show subcommand for the scheduler state would probably be useful
> + <youpi> solid_black: btw, about copies, there's a todo in rumpdisk's rumpdisk_device_read : /* directly write at *data when it is aligned */
> + <solid_black> youpi: indeed, that looks relevant, and wouldn't be hard to do
> + <solid_black> ideally, it should all be zero-copy (or: minimal number of copies), from the device buffer (DMA? idk how this works, can dma pages be then used as regular vm pages?) all the way to the data a unix process receives from read() or something like that
> + <solid_black> without "slow" memcpies, and ideally with little vm_copies too, though transferring ages in Mach messages is ok
> + <solid_black> s/ages/pages/
> + <solid_black> read() requires one copy purely because it writes into the provided buffer (and not returns a new one), and we don't have mach_msg_overwrite
> + <solid_black> though again one would hope vm_copy would help there
> + <solid_black> ...I do think that it'd be easier to port bcachefs over to netfs than to rewrite libfuse though
> + <solid_black> but then nothing is going to motivate me to work on libfuse
> + <azert> solid_black: I never work on things that don’t motivate me somehow
> + <azert> Btw, if you want zerocopy for IO, I think you need to do asynchronous io
> + <azert> At least that’s the only way for me to make sense of zerocopy
> + <solid_black> I don't think sync vs async has much to do with zero-copy-ness? w
> +
> +
> --
> 2.42.0
>
>
--
Samuel
---
For an independent, transparent and rigorous evaluation!
I support the Commission d'Évaluation de l'Inria.