Excellent question — and an advanced one 👏
Fail2Ban is traditionally used to block brute-force login attempts or repeated bad HTTP requests, but it can be extended very effectively to recognize and stop web crawlers / scrapers by using custom filters and jail rules on your web server logs (Apache or Nginx).
Let’s go step-by-step so you can build a reliable and non-destructive setup.
🧩 1. General Concept
Fail2Ban works like this:
It monitors log files (e.g. /var/log/nginx/access.log; a sample log line is shown after this list).
It applies regex filters to detect bad patterns (excessive requests, fake user agents, 404 floods, etc).
Once the threshold is reached (e.g. maxretry = 10 within findtime = 60s), it bans the offending IP (by default using iptables or firewalld).
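For reference, a typical nginx combined-format access log line looks like this (the IP and user-agent are made-up examples); the <HOST> tag in a failregex matches the leading IP:
203.0.113.7 - - [12/Mar/2024:10:15:32 +0000] "GET /products?page=3 HTTP/1.1" 200 5123 "-" "python-requests/2.31"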
🕵️ 2. What Defines a Crawler / Scraper?
Scrapers tend to show one or more of these traits:
Suspicious user-agents (empty, fake, or known bad bots).
Excessive request rate from a single IP.
Repeated 404/403s (trying random URLs).
Requests for disallowed paths (/wp-admin, /phpmyadmin, /etc/passwd, etc).
Ignoring robots.txt.
Fail2Ban can catch all these with different filters.
⚙️ 3. Example: Basic Anti-Scraper Jail (Nginx)
3.1 Create the filter /etc/fail2ban/filter.d/nginx-noscrape.conf
[Definition]
failregex = <HOST> -.*"(GET|POST).*HTTP.*" 200
ignoreregex =
But this alone bans everything (too broad).
Let’s refine.
Here is a refined version that matches requests whose User-Agent contains known scraper keywords (the request rate itself is controlled by maxretry/findtime in the jail):
[Definition]
failregex = ^<HOST> -.*"(GET|POST).*HTTP/.*" 200 .*"(?:-|[^"]*(?i:curl|wget|python|scrapy|bot|spider|crawler|httpclient|libwww|nikto|scan|Go-http-client)[^"]*)"$
ignoreregex =
This matches requests whose User-Agent field is empty ("-") or contains any of those keywords anywhere; the (?i:...) group makes the keyword match case-insensitive. Be careful with broad tokens like "bot", which also appear in legitimate crawlers such as Googlebot (see step 7 for whitelisting).
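For instance, a hypothetical log line like this would match:
198.51.100.23 - - [12/Mar/2024:10:15:33 +0000] "GET /catalog HTTP/1.1" 200 4096 "-" "python-requests/2.31"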
3.2 Create the jail /etc/fail2ban/jail.local entry
[nginx-noscrape]
enabled = true
port = http,https
filter = nginx-noscrape
logpath = /var/log/nginx/access.log
maxretry = 20
findtime = 60
bantime = 86400
action = iptables-multiport[name=NoScrape, port="http,https"]
This bans any IP that makes 20 bad-user-agent requests within 60 seconds.
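After creating the filter and jail, reload Fail2Ban and verify that the jail is running (standard fail2ban-client commands):
sudo systemctl restart fail2ban            # or: sudo fail2ban-client reload
sudo fail2ban-client status nginx-noscrape # shows match counts and currently banned IPs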
🧠 4. Example: Detect Floods / Too Many Requests (regardless of UA)
You can also catch excessive requests (rate-limit style):
Filter /etc/fail2ban/filter.d/nginx-flood.conf
[Definition]
failregex = ^<HOST> -.*"(GET|POST).*HTTP/.*"
Jail /etc/fail2ban/jail.local
[nginx-flood]
enabled = true
port = http,https
filter = nginx-flood
logpath = /var/log/nginx/access.log
maxretry = 100
findtime = 10
bantime = 600
That means: if any IP makes more than 100 requests in 10 seconds, it is banned for 10 minutes.
This is crude but effective against aggressive crawlers that ignore rate limits.
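Because a jail like this can occasionally catch legitimate heavy users (e.g. many people behind one NAT), keep the manual unban command handy (the IP is a placeholder):
sudo fail2ban-client set nginx-flood unbanip 203.0.113.7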
🪤 5. Detect Repeated 404s (path probing)
[nginx-404]
enabled = true
port = http,https
filter = nginx-404
logpath = /var/log/nginx/access.log
maxretry = 10
findtime = 60
bantime = 3600
Filter (/etc/fail2ban/filter.d/nginx-404.conf):
[Definition]
failregex = ^<HOST> -.*"(GET|POST).*" 404
ignoreregex =
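A probing request that this filter would count looks like this (hypothetical line):
203.0.113.50 - - [12/Mar/2024:10:16:01 +0000] "GET /wp-login.php HTTP/1.1" 404 162 "-" "Mozilla/5.0"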
🧱 6. Combine with Nginx/Apache Hints
You can:
Add limit_req_zone in Nginx to rate-limit per IP (lightweight, fast; see the sketch after this list).
Add custom log formats to expose user-agents or country (for fine control).
Use GeoIP data to whitelist trusted countries or networks.
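A minimal limit_req sketch (the zone name perip, its size, and the rates are placeholder values to tune for your traffic):
# inside the http { } block
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;
# inside a server { } or location { } block
limit_req zone=perip burst=20 nodelay;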
🪶 7. Optional: Ignore Good Bots
You can whitelist good bots (Google, Bing, etc) via DNS or regex:
ignoreip = 127.0.0.1/8 ::1 66.249.0.0/16 40.77.167.0/24
or in the filter:
ignoreregex = "(Googlebot|Bingbot|YandexBot)"
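Keep in mind that user-agents are trivially faked, so ignoreregex alone can be abused. Google and Bing both recommend verifying a crawler by a reverse-then-forward DNS lookup (the IP below is just an illustrative Googlebot address):
host 66.249.66.1                        # reverse lookup should point into googlebot.com
host crawl-66-249-66-1.googlebot.com    # forward lookup should return the same IP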
🧩 8. Testing Your Filters
Run:
fail2ban-regex /var/log/nginx/access.log /etc/fail2ban/filter.d/nginx-noscrape.conf
This shows which lines match and helps tune your regex.
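You can also feed fail2ban-regex a single log line while tuning (the line here is a made-up example):
fail2ban-regex '203.0.113.7 - - [12/Mar/2024:10:15:32 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/8.4.0"' /etc/fail2ban/filter.d/nginx-noscrape.conf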
🧱 9. Optional — Combined Smart Detection
Some admins merge metrics (404 + high rate + bad UA) in a single jail for smarter detection.
Example composite filter:
failregex = ^<HOST> -.*"(GET|POST).*" (403|404|444|499)
            ^<HOST> -.*"(GET|POST).*HTTP/.*" 200 .*"(?:curl|wget|python|bot|scrapy|Go-http-client).*"$
Note the indentation: Fail2Ban treats indented lines as continuations of a single failregex; a second "failregex =" line would simply override the first.
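A matching jail entry could look like this (a sketch; the name nginx-composite and the thresholds are placeholders to tune):
[nginx-composite]
enabled = true
port = http,https
filter = nginx-composite
logpath = /var/log/nginx/access.log
maxretry = 15
findtime = 60
bantime = 86400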
🧰 10. Maintenance Tips
Keep bantime long enough but not too aggressive (1 day is typical for bots).
Review /var/log/fail2ban.log weekly (a summarizing one-liner is shown after this list).
Whitelist known partners or CDNs.
Combine with Cloudflare / ModSecurity for layered protection.
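For the weekly review, a simple one-liner can summarize which IPs were banned most often (assumes the default log location):
grep "Ban " /var/log/fail2ban.log | awk '{print $NF}' | sort | uniq -c | sort -rn | head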
Would you like me to show a ready-to-use bundle of filters (noscrape + 404 + flood) in one jail.local that you can drop directly into a Debian/Ubuntu Fail2Ban install?