Thanks for taking the time to explain. I run multiple public-facing services and have never had load issues just because of some crawlers. That's why I always wonder why people get so mad at them.
I've been providing hosting for a few FOSS services, relatively small scale, for around 7 years now, and for most of that time I thought the same. People were complaining about their servers being hit, but my traffic was fine and the server seemed beefy enough to have plenty of headroom.
Then, like a month or two ago, the fire nation attacked, er, the bots came crawling. I had sudden traffic spikes of up to 1000x, memory was hogged, and the CPU could barely keep up. The worst was the git forge: public repos with bots continuously hammering away at diffs between random commits, repeatedly building out history graphs for different branches, and so on, all fairly expensive operations.
After the server was brought to its knees multiple times over a couple of days, I had to block public access. Only with a proof-of-work challenge in front of it could I finally open it up again without destroying service uptime. And even weeks later, they were still trying to reach project diffs whose links they had collected earlier; it was honestly crazy.
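For anyone unfamiliar with the idea: a proof-of-work gate just forces each client to burn a bit of CPU before the server does any real work, which is cheap for a human's browser but adds up fast for a crawler fleet. Below is a minimal sketch of the concept (not my actual setup; the difficulty, names, and hash scheme are all made up for illustration, loosely in the spirit of tools like Anubis):

```python
# Illustrative proof-of-work sketch: SHA-256 with leading zero bits.
# All names and parameters here are hypothetical, not a real deployment.
import hashlib
import secrets

DIFFICULTY_BITS = 18  # hypothetical: ~260k hashes on average per solve


def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return secrets.token_hex(16)


def solve(challenge: str) -> int:
    """Client side: brute-force a nonce whose hash starts with enough zero bits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0:
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0


if __name__ == "__main__":
    c = issue_challenge()
    n = solve(c)         # expensive for the client, scales badly for bot swarms
    print(verify(c, n))  # cheap for the server: prints True
```

The asymmetry is the whole point: the server spends one hash to verify, while every request from a crawler costs it hundreds of thousands, so the expensive git-forge endpoints stop being free to hammer.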
That's very interesting; it sounds as if only certain types of content get crawled. May I ask what software you were running and whether you had a reverse proxy in front of it?