SourceHut continues to face disruptions due to aggressive LLM crawlers. We have deployed a number of mitigations that are keeping the problem contained for now, and we are continuing to work on further ones. However, some of these mitigations may impact end-users.

[–] [email protected] 31 points 1 week ago* (last edited 1 week ago) (1 children)

We use NGINX’s non-standard status code 444, which closes the connection without sending any response, on every LLM crawler we see.
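
A minimal sketch of what that can look like; the user-agent list is illustrative, not the commenter's actual blocklist:

```nginx
# Illustrative crawler user agents; real deployments track many more.
map $http_user_agent $llm_crawler {
    default     0;
    ~*GPTBot    1;
    ~*ClaudeBot 1;
    ~*CCBot     1;
}

server {
    listen 80;
    server_name example.com;

    # 444 is NGINX-specific: close the connection without sending a response.
    if ($llm_crawler) {
        return 444;
    }

    # ... rest of the site configuration
}
```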

Caddy has a similar “close connection” option, the “abort” directive, which is part of its static response handler.
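
Roughly the same thing in a Caddyfile, assuming the same illustrative user-agent list:

```caddyfile
example.com {
    # Named matcher for the example crawler list (case-insensitive regexp).
    @llm_crawlers header_regexp User-Agent (?i)(GPTBot|ClaudeBot|CCBot)

    # abort closes the connection without writing a response.
    abort @llm_crawlers

    # ... rest of the site configuration
}
```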

HAProxy has the “silent-drop” action, which drops the TCP connection without notifying the client, leaving it hanging until its own timeout fires.
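
A sketch of that in an HAProxy frontend, again with an example ACL rather than a real blocklist:

```haproxy
frontend web
    bind :80

    # Case-insensitive substring matches on the User-Agent header.
    acl llm_crawler hdr_sub(User-Agent) -i gptbot claudebot ccbot

    # silent-drop discards the connection without telling the client,
    # so the crawler waits until its own timeout fires.
    http-request silent-drop if llm_crawler

    default_backend site
```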

I’ve found crawling attempts end more quickly with silent-drop, especially attacks, but my sample size is relatively small.

Edit: we do this because too often we’ve seen them ignore robots.txt. They believe all data is theirs. I do not.

[–] [email protected] 15 points 1 week ago* (last edited 1 week ago)

I had the same issue. OpenAI was slamming my tiny little server and ignoring robots.txt. I had to install an LLM black hole and put very basic password protection in front of my git server frontend, since the crawler kept hammering it.
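
For the password-protection piece, one common approach is nginx basic auth in front of the git web UI. A rough sketch, where the location, upstream port, and htpasswd path are all hypothetical placeholders:

```nginx
location / {
    # Require a username/password from the htpasswd file before proxying.
    auth_basic           "git";
    auth_basic_user_file /etc/nginx/.htpasswd;

    # Proxy to the git web frontend (e.g. cgit, Gitea) running locally.
    proxy_pass http://127.0.0.1:3000;
}
```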

As much as I don't like Google, I did see them come in, fetch robots.txt, and make no other calls for a week. That's how it should work.
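
For reference, a minimal robots.txt along those lines, which a well-behaved crawler like Google's would honor (GPTBot is OpenAI's documented crawler user agent):

```
# Ask OpenAI's crawler to stay away from the whole site.
User-agent: GPTBot
Disallow: /
```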