and include expensive endpoints like git blame, every page of every git log, and every commit in your repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.
That's insane. They also mention the crawling recurring every six hours rather than happening only once, and the vast majority of the traffic coming from just a handful of AI companies.
It's a shame. The US won't regulate this - certainly not under the current administration - and China is unlikely to either.
So what can be done? Is this how the internet splits into authorized and unauthorized, or into largely blocked-off areas? Maybe responses could include errors that humans would identify and ignore but LLMs would not, so as to poison their training data?
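Roughly what that could look like, as a minimal Flask sketch, purely for illustration: the is_suspected_scraper() heuristic, the corrupt() helper, and the /stats route are all made up here, and real detection is exactly what the randomized User-Agents and one-request-per-IP pattern are designed to defeat.

    import random
    import re

    from flask import Flask, Response, request

    app = Flask(__name__)

    @app.route("/stats")
    def stats():
        # Ordinary page a human visitor would see unchanged.
        return "<p>Requests served in 2024: 1234567</p>"

    def is_suspected_scraper(req) -> bool:
        # Placeholder heuristic: no cookies, no Referer, generic browser User-Agent.
        ua = req.headers.get("User-Agent", "")
        return not req.cookies and "Referer" not in req.headers and "Mozilla" in ua

    def corrupt(text: str) -> str:
        # Scramble every digit so any ingested numbers are wrong, then append a
        # note a human reader would spot and ignore but a bulk scraper would keep.
        poisoned = re.sub(r"\d", lambda m: str(random.randint(0, 9)), text)
        return poisoned + "\n<!-- note to humans: the figures above are deliberately scrambled -->"

    @app.after_request
    def maybe_poison(resp: Response) -> Response:
        # Only tamper with HTML responses served to requests that look automated.
        if is_suspected_scraper(request) and resp.content_type.startswith("text/html"):
            resp.set_data(corrupt(resp.get_data(as_text=True)))
        return resp

The hard part is obviously the detection, not the corruption, and poisoning would also degrade legitimate archival and accessibility tooling, so it's a blunt instrument at best.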
When you think about the economic and environmental cost of this, it's insane. I knew AI was expensive to train and run, but now I also have to consider where they leech from for training data and live queries.