Did We Consent to Our Data Training Generative AI?
A look at the central arguments of the Consent debate.
Did we Consent to our data—our words, our images, our sounds—training generative AI?
It’s a question people have a lot of hot takes and gut reactions about—but what’s the truth?
Let’s dive in and explore the topic of Consent in generative AI training.
Before We Start: Read My Definitions and Perspective on What Artificial Intelligence Is and What Role It Plays in Publishing
Before you read this post, it’s critical that you read my first post on what artificial intelligence is. Nothing in this post will make sense without those definitions, and I’ll be referring back to that article again and again as I cover the 5C’s one by one. Please read it here:
As a Reminder: What Are the Five C’s of Moral Evaluation of AI?
With emotions running high at even the mention of two little letters—AI, short for artificial intelligence—I’ve found it necessary to develop my own framework for understanding the moral and ethical arguments both for and against some of the newest technologies emerging from the field—specifically:
Auto-narration tools (Google Play Books, Apple Books, ElevenLabs)
Visual creation and completion tools (Midjourney, DALL-E, Stable Diffusion)
Writing, translating, and chatbot tools (ChatGPT, OpenAI Playground, Google’s Bard).
As new technology rapidly enters the field, it’s worth weighing each tool against these 5C’s to decide whether you find it immoral or unethical, and whether you want to use it in your business.
As a reminder and catchup, the 5C’s are:
Consent - Consent asks whether we gave permission for our content to be used, whether that permission was granted inadvertently through generally worded terms of service and contracts, and whether we should give consent going forward.
Copyright - Copyright deals with the legal definitions set by the US Copyright Office. (As a US citizen, this is where I'll focus. Feel free to substitute your own country's copyright laws.) Copyright is a tricky subject because it's tied closely to both plagiarism and fair use, so we'll be discussing those too!
Compensation - Compensation focuses on whether creatives should be paid for contributions, how compensation would work, and where the lines should be drawn.
Creativity - Creativity asks whether we should be replacing human work—specifically art—with machines capable of doing it faster and better than we can. What is sacred to humans, and what is okay to use AI for?
Capitalism - Capitalism ties commercial use of AI to morals and ethics. Is it greedy? Is it stealing? Is it like the Industrial Revolution in terms of how it affects the job landscape? Is it going to destroy our societal structures?
As new technology emerges in this field, these 5C’s are the criteria I’ll use to understand and evaluate it.
This essay addresses Consent, the first of The Five C’s of Moral Evaluation of AI: Consent, Copyright, Compensation, Creativity, and Capitalism.
Questions Around Consent
In this essay, I’ll be addressing these core questions around Consent:
Has our data been used to train generative AI?
If so, did we give permission to use it?
If so, was that permission gained through informed or shady means? (And does it matter?)
If so, is there anything that can be done about it? (Legally or morally?)
Regardless of what has occurred, what should the policy be going forward?
Has Our Data Been Used To Train Generative AI?
This is the first determination we need to make before asking any questions about consent. Is this something that happened? I recently wrote a post debunking rumors that Apple Books was building a digital narration program from audiobooks distributed through Findaway Voices for this exact reason. Most people don’t seem to be asking this question to begin with, but it’s an incredibly important one.
In asking this question about my own work, I can state with fair certainty that much of my writing has trained generative AI in some way. But it’s important to keep straight which generative AI it has trained. Because every model is different. And different models obtained data in different ways.
Did We Give Permission to Train Using Our Data?
For starters, most of my writing from the last 5-10 years lives in Google Docs—backed up in the cloud—and I consented to let Google use my data for research purposes when I signed up for an account. Google now has Bard, a ChatGPT competitor that similarly spits out synthetic text based on queries—or prompts, as we now call them on the consumer side. My private writing—including all my blog post drafts, company content, and book drafts (I keep a full copy of every book I write on Google Drive)—is likely in the training data.
To me, this is a case of clear and written Consent for my data to be used.
Outside of data that I’ve stored for free on Google Drive, most of my writing from the last decade+ is public or has been public. Bots scrape public content all the time, and anyone can process that data. There are probably dozens if not hundreds of companies that are doing just that, right now. This includes every social media platform, along with tons of private companies that we’ve never even heard of.
(Note: most of us hadn’t heard of OpenAI even a year ago. How many companies are out there that are scraping our data for research and training purposes?)
To me this is a case where I did not actively Consent, but where I posted content publicly knowing that it would be scraped by hundreds if not thousands of companies under “fair use.”
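To make that concrete, here is a minimal sketch of the kind of scraper I’m describing, using only Python’s standard library. The bot name and URL are hypothetical placeholders; this illustrates the general mechanics, not any particular company’s pipeline.

```python
# A minimal sketch of public-content scraping, standard library only.
# The bot name and URL below are hypothetical placeholders.
from urllib import request, robotparser

BOT_NAME = "HypotheticalResearchBot/1.0"
PAGE = "https://example.com/blog/a-public-post"

# Polite bots check the site's robots.txt first. Note that robots.txt
# is a voluntary convention, not a technical or legal barrier.
rules = robotparser.RobotFileParser("https://example.com/robots.txt")
rules.read()

if rules.can_fetch(BOT_NAME, PAGE):
    req = request.Request(PAGE, headers={"User-Agent": BOT_NAME})
    with request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # From here, the raw HTML would be cleaned and folded into a corpus.
    print(f"Collected {len(html)} characters of public content")
else:
    print("robots.txt asks bots not to fetch this page")
```

That is the entire barrier to entry. Anything served publicly can be collected this cheaply, which is why I assume hundreds of companies are doing exactly this.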
Sidenote: I also knew it would be swiped, recycled, copied, reused, memed up, plagiarized, reshared, reposted, and more by strangers and bots on the internet. I have had my moral and ethical struggles with plagiarism over the last decade, but I truly believe we are entering, if not already in, a post-copyright era (with the transition having taken place over the last 10-15 years).
In both cases, I never signed anything that said, “yes, use my data to train generative AI, please!” But I knew it was both possible and legal under current copyright laws.
We Consent In Terms of Use Agreements
The reality is that I Consented to all of this every time I clicked a Terms of Service (TOS) Agreement on any platform—including retailer and social media platforms.
But the real kicker is probably not even the platforms, since you are still limited in what data you regularly give out on the internet. The real place I’ve Consented is on devices. I have owned Androids for the last decade, and iPhones before that. There are Sonos devices, Google Home devices, and Roombas all over my home. I’ve worn Fitbits and logged my weight on smart scales. And every time, I say “yes” to those TOS agreements too.
What exactly did I Consent to? Put simply: any data that could be gathered from my usage of the product—including text, voice, contacts, images, and more—could be used for research. And “research” is effectively a free pass to use the data for anything and everything a company wants.
And this data processing is and always has been done through machine learning and deep learning models, which is the branch of AI that has now led to generative AI.
What About Informed Consent? “Opt-In” Consent?
Most creators are not confused or in disagreement about anything I’ve said up to this point in the essay. But I can hear many of you out there saying, “Yes, yes, yes…BUT… (And it’s a big BUT, Monica.)”
The biggest debate and misunderstanding around Consent seems to be about how we have all Consented to this.
The bottom line is many creators feel like they should have been able to explicitly opt in to having their data used in the training set for generative AI.
So let’s break that down:
Legally (ethically), there is no case for that requirement. These companies obtained the data under either written Consent or fair use. Regarding fair use—there are legal cases currently calling this into question, so this may change if those cases are won. For now, however, most analysts seem to agree that those cases will be unsuccessful.
Technically, opting in and opting out would be pretty much impossible while still using the product. If you have read my post on how generative AI is a subset (and an evolution) of all the models that came before it, then you will understand that, as a society, we have been Consenting to having our data “train generative AI” for decades now. It is impossible to separate training generative AI from training pretty much any algorithm of the past two decades. It’s all built on what came before it. (The closest thing to an opt-out that exists today is crude and site-wide; see the robots.txt sketch after this list.)

Where does anyone draw the line, and how is that executed at any sort of mass scale? And what reasonable right do any of us have to ask a company to draw the line on our personal data exactly where we want it, retroactively, at any point in the present or future that we decide it has gone too far for us personally—especially when we already Consented a dozen times over and benefited greatly from the exchange the company traded us for it (usually getting to use the product for free)? Should I be allowed to eat my entire plate at a restaurant and then complain my way out of the bill? How many bites in do I personally draw the line?

We all need to decide this for ourselves. And then we need to accept that others will draw the line elsewhere, leading to millions of lines, which makes this an absolute technical nightmare to implement.
Morally, this is murkier. There could be a moral argument made here, though when I look at the technical entanglement of these advancements, I know I am not the right person to make it. I also find that moral arguments tend to devolve into what a single person feels, which doesn’t tend to be persuasive to me. Finally, companies are not usually going to change what they do without either financial or legal repercussions. In terms of results and outcomes, this path is unlikely to bear fruit for creators.
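For what it’s worth, the closest thing to a working opt-out today is crude and site-wide: robots.txt directives. Common Crawl’s CCBot, whose scraped corpus is a documented source of training data for large language models, honors them. Here is a sketch of what that opt-out looks like on a site you control:

```
# robots.txt, served at the root of your own domain.
# This blocks Common Crawl's bot from the entire site: all or
# nothing, prospective only, and honored voluntarily by the crawler.
User-agent: CCBot
Disallow: /
```

Note the granularity: you can exclude a whole domain going forward, but you cannot carve out a single post or book draft, and nothing already collected comes back out, which is exactly the line-drawing problem above.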
After this breakdown, it should come as no shock that I do not believe companies should be required to have opted-in data or informed consent for their training. You do not have to agree with me though, and you can make your own choice around Consent.
To me, Consent comes down to where you want to draw the line for yourself:
Some people choose that they will not personally use generative AI at all
Some people choose that they will not personally use generative AI until the legal cases are resolved
Some people choose that they will only use generative AI where the creators have opted in to training (though I question whether any company can guarantee this, because again, it requires drawing an arbitrary line, and you may not agree with them on where the line should be—but I digress!)
Some people choose to use generative AI freely, regardless of how it is trained
And all of these personal decisions around Consent are welcome in this space. Where I see creators getting into trouble and becoming unreasonable (and angry, and hateful) is when they decide that where they draw the line is “righteous” and everyone drawing the line past it is “bad.” Yuck.
If You Are Not Paying for the Product, You Are the Product
Of the 5C’s, Consent seems to me to be the least debatable from a legal perspective. Because I did Consent. You did Consent. You pretty much can’t use the internet—and you definitely can’t have a workable business on the internet—without Consenting to a ton of data training, scraping, and processing in exchange.
For decades, we’ve been told that free platforms use our data in ways we don’t know or understand, and this is no different. The legal precedent is set.
The challenge with Consent (and the reason I started with it among the 5C’s) is its cascading effect in the legal system. If you can tip over the first domino of Consent, then Copyright and Compensation tend to fall right after. This is why most analysts don’t have hope that laws will come down on generative AI. But we will save that discussion for the next essay.
The Consent Collected Is Ethical—But Is It Moral?
When people talk about Consent, they are sometimes talking about the law, but they are also sometimes talking about the fairness to individuals.
Is it unfair that these companies are collecting untold amounts of data on us, then using it to create technology advancements that neither they nor we could fathom at the time of Consent?
This is where it gets a little sticky for me. Because in truth, I would prefer that companies don’t collect so much data on me. But also in truth, without collecting that data, I would lose much of the convenience of using the products these companies make to begin with, along with the blanket benefits that come from existing in a time of rapid technological advancement that generally makes my life better.
After all, I get to work from home, set my own hours, create whatever I want, and generally have a ton of control over my days and my life—and that is pretty incredible to me.
The data I’ve given has also usually been taken in exchange for something else: attention, visibility, expression, convenience, security…the list goes on. Companies do not subsist on goodwill alone, nor do I want them to. I don’t want anyone to think that my company can subsist on goodwill alone either. Companies need to be able to build assets they can monetize. And the popularized business model of the current crop of Big Tech is critical mass, which is required for collecting and monetizing data.
Going Forward: Consent in Exchange For Usage
I don’t have many answers on how to move forward, though I don’t see much changing from the current state of things. Currently, you agree to let a company use your anonymized data in exchange for using their product to make your life easier, cheaper, and more convenient. That exchange has worked for both consumers and companies, and it will likely continue to for a while.
On a personal level, I’ve decided this exchange works for me. Part of that decision for me has been looking at the reality of how things work (results and outcomes) and deciding not to invest more time and energy into copyright law—which has proven to be ineffective at making creators more money or protecting creators’ money to begin with.
Another part of that decision has been looking at what it would take to actually regain control over my data. But the cost is far too high. I would have to shut down my businesses (which are mostly digital) and work almost entirely analog. I would have to change my entire life—but I like my life quite a bit.
I am able to opt out of any device or platform at any time. I can get rid of my phone! I can stop decorating my house with devices that play music and YouTube! I can quit Facebook! But can I, really, when many of these devices and platforms are synonymous with community itself? If I opt out of the systems that everyone I know is using, I disappear.
And I am not interested in disappearing. Also too costly.
I don’t think we have much power to change what companies ask of us in exchange for using their products, and I don’t know that it’s fair to ask, either.
For Now, Companies, You Have My Consent to Make My Life Cooler
If my little pieces of data become part of training sets and help generative AI and humanity move forward, so be it.
For me, it’s a minuscule price to pay for both the life I have and the life I can have as generative AI becomes ubiquitous.
And if Copyright law goes south and publishing changes irrevocably, I’ll be looking at how to thrive as a business first, then as an artist—which is not different from what I already do.
Businesses solve problems.
What problems do readers have?
What problems will readers have in the face of generative AI?
These questions are all I really need to move forward.
I feel incredibly lucky to have been born in a time of so much rapid technological change. I get to live many lifetimes in one, and it’s pretty cool and exciting for me, personally. This life has all been enabled by my Consent.