this post was submitted on 21 Jan 2025
161 points (100.0% liked)

Fuck AI

2299 readers
619 users here now

"We did it, Patrick! We made a technological breakthrough!"

A place for all those who loathe AI to discuss things, post articles, and ridicule the AI hype. Proud supporter of working people. And proud booer of SXSW 2024.

founded 1 year ago
all 14 comments
[–] [email protected] 46 points 2 months ago (1 children)

Oh hell yeah.

Months ago I was brainstorming something almost identical to this concept: use a reverse proxy to serve pre-generated AI slop to AI crawler user agents while serving the real content to everyone else. Looks like someone did exactly that, and now I can just deploy it. Fantastic.
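The core routing idea fits in a few lines. Here's my own toy sketch (not iocaine's actual code); GPTBot, CCBot, and ClaudeBot are real AI-crawler user-agent substrings, but the real list and matching logic are up to whoever deploys it:

```python
# Toy sketch of user-agent-based routing (my version, not iocaine's code):
# decide which upstream serves a request, the way a reverse proxy would.
AI_CRAWLER_TOKENS = ("GPTBot", "CCBot", "ClaudeBot")  # real AI crawler UA substrings

def pick_backend(user_agent: str) -> str:
    """Return which upstream should handle this request."""
    ua = user_agent.lower()
    if any(tok.lower() in ua for tok in AI_CRAWLER_TOKENS):
        return "slop"   # pre-generated garbage for the crawlers
    return "real"       # the actual site for everyone else

print(pick_backend("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # slop
print(pick_backend("Mozilla/5.0 (X11; Linux x86_64)"))       # real
```

In practice you'd do this in the proxy config itself rather than in application code, but the decision is the same either way.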

[–] [email protected] 9 points 2 months ago (1 children)

AI slop is actually better than random data, because it creates a feedback loop, which is more destructive.

[–] [email protected] 4 points 2 months ago (1 children)

If you use natural text to train model A, and then use model A's output to train model B, then model B's output will be worse than model A's. The quality degrades with each generation, but it happens over generations of models. So random data is worse than AI slop, because random data is already of the lowest possible quality for AI training.

[–] [email protected] 1 points 2 months ago

Yes, but random data might be easier to detect in the first place, and could then be filtered.
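A toy heuristic along those lines (my own sketch, nothing from iocaine): flag a sample as noise when too few of its tokens look like plausible words, which random byte strings almost never do.

```python
# Toy noise detector (my sketch): random garbage rarely contains
# purely-alphabetic, word-length tokens, so count how many do.
import re

def looks_like_noise(text: str, threshold: float = 0.5) -> bool:
    tokens = text.split()
    if not tokens:
        return True
    # A "wordlike" token is purely alphabetic (apostrophes allowed)
    # and of plausible word length.
    wordlike = sum(1 for t in tokens if re.fullmatch(r"[A-Za-z']{1,15}", t))
    return wordlike / len(tokens) < threshold

print(looks_like_noise("the quick brown fox jumps over the lazy dog"))  # False
print(looks_like_noise("x9$Qz ##kd0 8fj2@@ zzqp91"))                    # True
```

Note that Markov-generated slop is made of real words, so it sails right past a check like this, which is exactly the parent comment's point.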

[–] [email protected] 34 points 2 months ago

Poison the AI. I'm all for it.

[–] [email protected] 13 points 2 months ago

Why is no one talking about the fact that the demo is clearly using the Bee movie script to power the Markov Chain generation?

This thing spits out some gold:

Honey, it changes people.

I'm taking aim at the baby.

[–] [email protected] 12 points 2 months ago (1 children)

So it's like Nightshade for LLMs?

[–] [email protected] 15 points 2 months ago

Better, actually. This feeds the crawler a potentially infinite amount of nonsense data. If not caught, this will fill up whatever storage medium is used. Since the data is generated using Markov chains, any LLM trained on it will learn to disregard context that goes farther back than one word, which would be disastrous for the quality of any output the LLM produces.

Technically, a single page using iocaine could completely ruin an LLM, whereas with Nightshade you'd have to poison quite a number of images. On the other hand, iocaine text is easily detected by a human, while Nightshade is designed to not be noticeable by humans.
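For anyone unfamiliar, the "one word of context" property above is exactly what an order-1 word-level Markov chain gives you. A minimal sketch of the principle (iocaine's real implementation will differ):

```python
# Order-1 word-level Markov chain: the next word depends only on the
# current word, never on anything earlier. Minimal sketch, not iocaine's code.
import random
from collections import defaultdict

def build_chain(text: str) -> dict:
    chain = defaultdict(list)
    words = text.split()
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)  # record each observed next-word choice
    return chain

def generate(chain: dict, start: str, n: int = 10, seed: int = 0) -> str:
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    out = [start]
    for _ in range(n):
        choices = chain.get(out[-1])
        if not choices:
            break
        out.append(rng.choice(choices))
    return " ".join(out)

corpus = "honey it changes people honey bees make honey it flows"
chain = build_chain(corpus)
print(generate(chain, "honey"))
```

Every individual word transition is plausible, but there's no coherence beyond adjacent word pairs, which is what makes the output such poor training data.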

[–] [email protected] 12 points 2 months ago (1 children)

Would this interfere with legitimate crawlers as well, the Internet Archive for instance?

[–] [email protected] 1 points 2 months ago (1 children)

Could you list specific crawlers to be automatically blocked by the iocaine site?

[–] [email protected] 3 points 2 months ago

How hard would it be for a sufficiently sophisticated bot to detect the intent here and add the domain to a shared blacklist? Not too difficult, I'd imagine. Good idea, though. The start of something potentially great.

[–] [email protected] 2 points 2 months ago (1 children)

Don't these crawlers save some kind of metadata before fully committing pages to their databases? Surely they'd be able to see that a specific domain served nothing but garbage (and/or that it's so "basic"), and then blacklist the domain or purge the data? Or are the AI crawlers even dumber than I'd imagine?

[–] [email protected] 6 points 2 months ago* (last edited 2 months ago)

I'd be surprised if anything crawled from a site using iocaine actually made it into an LLM training set. GPT-3's initial set of 45 terabytes was reduced to the 570 GB it was actually trained on. So yeah, there's a lot of filtering/processing that takes place between crawl and train. Then again, they seem to have failed entirely to clean the Reddit data they fed into Gemini, so /shrug