Large language models, or LLMs, like ChatGPT, ingest vast amounts of text to learn how human language works and to mimic human understanding and knowledge of our world. For us users, this is marvelous: you can ask an LLM like ChatGPT to write in the style of your favorite author or ask about the content of a specific book, and the perfect response comes back in seconds.
Of course, this massive hoovering up of the "best" of the internet has led to some interesting responses, such as when Google's LLM started telling people to add a quarter cup of Elmer's glue when making pizza to prevent the cheese from sliding off, advice it learned from a smart-ass commenter on Reddit.
Less funny is how much personal information is now part of ChatGPT's knowledge base. Last year, researchers published a paper detailing how they used specific prompts to coax the AI into revealing private data, such as email addresses, phone numbers, and physical addresses, that had been inadvertently hoovered up from some web backwater during the training process.
While we might not want LLMs to have our personal information, we might also not like other content that we post to be "known" by these systems.
Say you have an awesome blog about the KATY Trail. You want Google's search engine to suggest that people visit your site when they search for information on the trail. Once those people are on your site, you can suggest they subscribe to your newsletter or click on an ad for a new bike seat, or otherwise engage with them for fun or profit.
However, after an LLM like ChatGPT ingests your site's content, it can respond to a user's queries about the KATY Trail with a rewritten version of your information, without the user ever needing to visit (or even know about) your site. LLMs are parasites in the flow of knowledge, using the blogger's content without providing the blogger any benefit.
Our KATY Trail blogger should be able to place a small file called robots.txt on their web server, telling LLM crawlers not to hoover up information from the site. Robots.txt has been used to tell search engines how to behave since 1994, and they have voluntarily complied, with very few exceptions.
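For illustration, here is a minimal robots.txt that keeps the welcome mat out for ordinary search engines while asking the AI-training crawlers these companies have published to stay away. The site address is a placeholder, and because compliance is voluntary, these lines are a request, not a lock:

```
# robots.txt — served from the site root, e.g. https://example.com/robots.txt
# The bot names below are the ones the companies have published;
# honoring them is entirely voluntary on the crawler's part.

# OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Google's AI-training control token
User-agent: Google-Extended
Disallow: /

# Common Crawl, a frequent source of LLM training data
User-agent: CCBot
Disallow: /

# Perplexity's crawler
User-agent: PerplexityBot
Disallow: /

# Everyone else, including ordinary search engines, remains welcome
User-agent: *
Disallow:
```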
However, with no legal enforcement mechanisms and LLMs hungry for ever more data, many companies behind these AI systems disregard website owners' instructions not to ingest their content.
Recent research by Wired magazine and others demonstrated that the AI company Perplexity was ignoring the "Disallow" directive in robots.txt and was even circumventing more explicit measures, such as IP address blocking, designed to keep its crawlers out of a website's content.
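For site owners who want more than a polite request, one common approach is refusing known crawlers at the web server itself. Below is a minimal nginx sketch; the domain is a placeholder and the user-agent patterns are illustrative, and, as the Wired reporting shows, a determined scraper can simply spoof its user agent or hop to fresh IP addresses:

```
# Minimal nginx sketch: refuse requests whose User-Agent matches known AI crawlers.
# The map block belongs in the http {} context of nginx.conf.
map $http_user_agent $is_ai_bot {
    default          0;
    ~*GPTBot         1;   # OpenAI
    ~*CCBot          1;   # Common Crawl
    ~*PerplexityBot  1;   # Perplexity
}

server {
    listen 80;
    server_name example.com;   # placeholder domain

    # Turn away matching bots before any content is served.
    if ($is_ai_bot) {
        return 403;
    }

    # ... normal site configuration continues here ...
}
```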
When confronted with this information, Perplexity CEO Aravind Srinivas blamed third-party services that his company uses to scrape information from the web. He commented that respecting a website's instructions is "not a legal framework."
Well, it should be! I can't violate a no-trespassing sign without consequences. The same should apply to bots committing digital trespass. Legal frameworks should be created to require LLMs to respect directives like robots.txt and to protect content creators and private individuals from unauthorized data harvesting.
Microsoft's CEO of AI, Mustafa Suleyman, is just as dismissive of the issue. He believes that anything published on the web is fair game. In his view, "... it's fair use. Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like."
And when our KATY Trail blogger tries to block Microsoft from scraping their content? Suleyman says, "That's a gray area and I think that's going to work its way through the courts."
Microsoft is worth three and a half trillion dollars and hires attorneys by the thousands, so it is no wonder that Suleyman is happy to leave it to website owners to sue his company. Microsoft will make so much money by stealing content that paying its attorneys to deal with lawsuits is just a parking ticket.
The web was built on the idea of sharing content made or assembled by one human with other humans, and it should remain that way. How do we keep it that way? That is the billion-dollar question someone needs to be answering. Let me know your thoughts below.
My commentary may be republished online or in print under Creative Commons license CC BY-NC-ND 4.0. I ask that you edit only for style or to shorten, provide proper attribution, and link to my contact information.
📆 Upcoming Talks/Classes 👨‍🏫
Artificial Intelligence, the Elections and Civic Dialogue
6:30 PM, July 10th. In Person and Zoom.
On Wednesday, I will be talking about how AI is being used (and misused) in this year's elections. The event is sponsored by the League of Women Voters Boone County, the Reynolds Journalism Institute, and Daniel Boone Regional Library. Here is the full description:
Artificial intelligence (AI) is a technology that is changing and challenging the landscape of just about every field, from medicine and sports to journalism, politics and foreign policy. University of Missouri Associate Teaching Professor J. Scott Christianson (aka Prof. C) will help us explore its potential impact as a source of disinformation, especially during an election year, but also for the creation of new horizons of collaboration. Along the way, Prof. C will provide some pointers for surviving and thriving in an AI-mediated world during an election year and beyond.
If you are in the Columbia area, you can attend in person at:
Boone Electric Cooperative Community Building
1413 Rangeline St.,
Columbia MO 65201
No registration is required to attend in person.
You can also attend virtually via Zoom. Please register here to get a Zoom link. The event will be recorded for later viewing on YouTube.
Managing the Learning Machine
8:00 AM, September 10th. In Person and Zoom.
In this session, we will explore how AI, particularly ChatGPT and advanced machine learning technologies, is changing our world. We'll see how AI is making a big impact in different areas like medicine and retail, and how humans and machines can work together in new ways. AI can boost productivity but also come with risks like misinformation. This talk will help you understand the power and challenges of AI, and why ethical considerations are crucial as this technology continues to grow. Join us to learn how AI is reshaping industries and everyday life.
More information and registration will be available on the MU Retiree’s Association website.
AI: Current Trends and Future Directions
7:30 AM, November 12th. On Zoom.
Grab your coffee and get ready to review the significant progress made in generative AI, including its current applications and anticipated developments. Prof. C will present recent research on how generative AI systems are utilized, focusing on PMI's initiatives and offerings. We'll discuss the implications of generative AI for project managers, highlighting practical use cases and best practices. The session will also touch on important InfoSec considerations.
Takeaways:
Understanding Progress: Gain insight into the advancements in generative AI and its current applications.
PMI Initiatives: Learn about PMI's initiatives and offerings in the generative AI space and how they can benefit your projects.
Future Developments: Anticipate future trends and developments in generative AI and how they might impact your work.
InfoSec Considerations: Understand important information security considerations related to generative AI to ensure safe and secure implementation.
Registration will be available on PMI Mid-MO Chapter's website.