Re: [libreplanet-discuss] Machine learning and copyleft
From: Isaac David
Subject: Re: [libreplanet-discuss] Machine learning and copyleft
Date: Sat, 10 Dec 2016 13:43:11 -0600
Hi Amias. I had briefly entertained this quandary before, but reading
your message made me more aware of it. Hopefully I will be able to
add something of value...
On Fri, 9 Dec 2016 at 18:42, Amias Hartley <amiashartley@gmail.com>
wrote:
[...]
> Someone could take this system, modify the training program, and
> train a new model on the same dataset.
> Then he or she could publish only inference program with sources, the
> unmodified training dataset, and the new trained model.
This is the crucial point. Predictive models are computer programs,
and at first glance the training program and dataset are analogous
to their source code. Both are absolutely necessary to rebuild the
same predictive model, or at least a similarly performing one (if
for any reason you weren't able to pair the stochastic learning
algorithm with a fixed pseudo-RNG seed; but I digress, because this
is not a matter of reproducible builds).
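
The point about seeds can be illustrated with a minimal sketch. It is
a toy, not anyone's actual training program: the function train, the
synthetic dataset, and the hyperparameters are all assumptions made
up for the illustration. With the same training program, dataset and
seed, the rebuilt model is identical; with another seed it is
typically only a similarly performing one.

```python
import random

def train(dataset, seed, epochs=100, lr=0.01):
    """Fit y ~ w*x + b by stochastic gradient descent; return (w, b)."""
    rng = random.Random(seed)            # all randomness derives from the seed
    w = rng.uniform(-1.0, 1.0)           # random initial weight
    b = rng.uniform(-1.0, 1.0)           # random initial bias
    rows = list(dataset)
    for _ in range(epochs):
        rng.shuffle(rows)                # random visiting order each epoch
        for x, y in rows:
            err = (w * x + b) - y        # prediction error on one observation
            w -= lr * err * x
            b -= lr * err
    return w, b

# Hypothetical dataset: points on the line y = 2x + 1.
dataset = [(float(x), 2.0 * x + 1.0) for x in range(-5, 6)]

model_a = train(dataset, seed=42)
model_b = train(dataset, seed=42)        # same inputs, same seed
model_c = train(dataset, seed=7)         # same inputs, different seed

print(model_a == model_b)                # exact rebuild of the model
print(model_a, model_c)                  # both close to (2.0, 1.0)
```

Either way, nobody without the training program and the dataset can
rebuild the model at all, which is the sense in which the two resemble
source code.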
> Because the end user doesn't need the modified training program to
> run the inference program with the new model, it is not distributed,
> because technically the only user of the modified training program is
> those who trained a new model using it, so GPL doesn't require to
> distribute it.
Nor do I need the GPL'ed libraries, whether verbatim or modified, to
run a hypothetical statically linked program. Yet copyleft still
guarantees the freedom to modify and rebuild that program because
it's a derivative work, thus giving access to the libraries. So the
specific concern is that copyright may not extend to the resulting
program in the machine learning case, because of the unusual nature
of its sources and the way those were used to come up with the
model.
Here's another way to look at it, drawing an analogy from a common
scenario: the training program is not so much part of the source
code as it is a build tool; a compiler that transforms data into
programs. It is absolutely necessary, yet it carries no copyright
consequences for its output. Moreover, the training data isn't so
much source code undergoing transformations as it is just informing
the training program on how to build a program; one never seen
before. That's what machine learning is all about: not having to
write an explicit program. Even metaprogramming and genetic
programming start with a recognizable program; but a table of
observations doesn't look like one at all.
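
A toy sketch of the compiler analogy, under stated assumptions: the
names train_stump and observations are hypothetical, and the learner
is a deliberately trivial one-threshold classifier rather than any
real library's algorithm. The training function takes a table of
observations and emits a callable program that nobody wrote by hand.

```python
def train_stump(observations):
    """'Compile' a table of (value, label) rows into a classifier.

    Assumes at least two distinct values appear in the table.
    """
    values = sorted({v for v, _ in observations})
    best = None
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        # Count rows that the rule "value > t" would misclassify.
        errors = sum((v > t) != label for v, label in observations)
        if best is None or errors < best[0]:
            best = (errors, t)
    threshold = best[1]

    def model(v):
        # The emitted program: a rule inferred from the data, not hand-written.
        return v > threshold

    return model

# The "source": a table of observations, which looks nothing like a program.
observations = [(1.0, False), (2.0, False), (3.0, True), (4.5, True)]

model = train_stump(observations)   # the "build" step
print(model(0.0), model(4.0))       # running the built program
```

The returned model carries only a threshold derived from the table;
none of the search logic in train_stump is needed to run it, just as a
compiled binary runs without its compiler.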
In terms of derivative works and copyright law, it's as though I
fed a GPL'ed image manipulation program, say the GIMP, a color
palette which is subject to copyright or database rights and also
licensed under the GPL. Then a GIMP plugin would use the palette to
draw a beautiful picture, which for the sake of the argument meets
copyright eligibility criteria too. Let's say the picture can be
either an interpolated smooth plot that approximates the
points/palette colors in RGB space (a regression), or a plot
separating those points (classification). Now, there's no GIMP
source code going into the image, so the latter being a derivative
of the former is off the table. Is it a derivative work of the
palette? Probably not.
> However, in this case freedom of users of the distributed system
> (inference program and the new model) is violated because they can't
> retrain the model on new data or improve the training code and
> retrain it on the same data to improve performance of the model.
I share that view, insofar as there's no free implementation of the
same learning algorithm that can offer some sort of compatibility
with the input data and the interfaces of the output program.
Keeping with the compiler analogy, a free program that can be
built/interpreted with free and nonfree translators alike is still
free (a privately improved copy of an originally free translator
being nonfree with respect to the rest of the users). When no such
free alternative exists, we can say the model is fatally dependent
on nonfree software.
> My question is how is it possible to protect users' freedom by making
> everyone who distributes a trained model to distribute also sources
> of a training program that was used to train the model, and
> instructions for obtaining the training dataset?
> Could the problem be solved by GPL?
Maybe not without amending its requirements, much like what the AGPL
does in order to compel those who offer a program through a network
service to disclose source and improvements upon request. If my
layman's copyright analysis is correct then the training program
behaves like a compiler of sorts. I don't see a compiler being
"the preferred form of the work for making modifications" to another
program's object code, but rather a "major component... a compiler
used to produce the work, or an object code interpreter used to run
it." I'm quoting from the GPLv3 itself. That is, unless one can argue
that in the absence of any actual source code the training program
and data become "corresponding source... needed to generate, install,
and (for an executable work) run the object code and to modify the
work, including scripts to control those activities."
It's still an interesting question though, to consider whether a
program in object form generated from a privately-modified copy of
a free compiler is also free, when using the publicly available
version of the compiler generates vastly different or inferior
object code. We conscientious users wouldn't like to use binaries
which didn't originate in a free and public toolchain anyway; but
at what point does the superior object code become proprietary
software? After a 1% difference, 50%, 1000%? Any distinction would
be arbitrary so we might as well say all of them are proprietary.
> If not, is there a license that provides the required guarantees?
I don't think so. The (A)GPL is as strong as it gets. You could always
try to add your own contractual terms on top of the GPL, or on top of
something written in its spirit, at the risk of making your software
incompatible with everything else.
--
Isaac David
GPG: 38D33EF29A7691134357648733466E12EC7BA943
Tox:
0C730E0156E96E6193A1445D413557FF5F277BA969A4EA20AC9352889D3B390E77651E816F0C