I really hope they die soon, this is unbearable…
Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I’ve seen in this thread:
| Fewer Letters | More Letters |
|---|---|
| CGNAT | Carrier-Grade NAT |
| DNS | Domain Name Service/System |
| Git | Popular version control system, primarily for code |
| HTTP | Hypertext Transfer Protocol, the Web |
| IP | Internet Protocol |
| NAT | Network Address Translation |
| nginx | Popular HTTP server |
[Thread #90 for this comm, first seen 13th Feb 2026, 17:41] [FAQ] [Full list] [Contact] [Source code]
I was blocking them but decided to shunt their traffic to Nepenthes instead. There are usually 3-4 different bots thrashing around in there at any given time.
If you have the resources, I highly recommend it.
How do you do that? I’m very interested! Also good to see you, Admiral!
Thanks!
Mostly, there are three steps involved:
- Set up Nepenthes to receive the traffic (a rough sketch of the Nginx side of this follows the list)
- Perform bot detection on inbound requests (I use a regex list and one is provided below)
- Configure traffic rules in your load balancer / reverse proxy to send the detected bot traffic to Nepenthes instead of the actual backend for the service(s) you run.
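For step 1, assuming Nepenthes is already running as a local daemon, the Nginx side of receiving that traffic might look roughly like the sketch below. The subdomain matches the one described in the guide further down, but the certificate paths and the Nepenthes listen port are assumptions for illustration only:

```nginx
# Hypothetical virtual host that fronts a locally running Nepenthes instance.
# Detected bot traffic gets redirected to this hostname by the rules below.
server {
    listen 443 ssl;
    server_name content.mydomain.xyz;

    ssl_certificate     /etc/ssl/certs/content.mydomain.xyz.crt;   # assumed paths
    ssl_certificate_key /etc/ssl/private/content.mydomain.xyz.key;

    location / {
        proxy_pass http://127.0.0.1:8893;        # assumed Nepenthes listen port
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```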
Here’s a rough guide I commented a while back: https://dubvee.org/comment/5198738
Here’s the post link at lemmy.world which should have that comment visible: https://lemmy.world/post/40374746
You’ll have to resolve my comment link on your instance since my instance is set to private now, but in case that doesn’t work, here’s the text of it:
So, I set this up recently and agree with all of your points about the actual integration being glossed over.
I already had bot detection set up in my Nginx config, so adding Nepenthes was just a matter of changing the behavior of that. Previously, I had just returned either 404 or 444 to those requests, but now it redirects them to Nepenthes.
Rather than trying to do rewrites and pretend the Nepenthes content is under my app’s URL namespace, I just do a redirect which the bot crawlers tend to follow just fine.
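For reference, the rewrite/path approach mentioned here (serving the tarpit under the app's own URL namespace rather than redirecting to a subdomain) could be sketched roughly as below. The `/tarpit/` path and the local Nepenthes port are assumptions, not something from the original comment:

```nginx
# Hypothetical path-based alternative, inside the protected server {} block:
# keep flagged bots on this domain and proxy them to Nepenthes internally
# instead of issuing a 301 to a separate subdomain.
if ($ua_disallowed) {
    rewrite ^ /tarpit/ last;                 # internal rewrite, not visible to the bot
}

location /tarpit/ {
    proxy_pass http://127.0.0.1:8893/;       # assumed Nepenthes listen port
    proxy_set_header Host $host;
}
```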
There are several parts to this, to keep my config sane. Each of those is in an include file.
- An include file that looks at the user agent, compares it to a list of bot UA regexes, and sets a variable to either 0 or 1. By itself, that include file doesn’t do anything more than set that variable. This allows me to have it as a global config without having it apply to every virtual host.
- An include file that performs the action if the variable is set to true. This has to be included in the `server` portion of each virtual host where I want the bot traffic to go to Nepenthes. If this isn’t included in a virtual host’s `server` block, then bot traffic is allowed.
- A virtual host where the Nepenthes content is presented. I run a subdomain (`content.mydomain.xyz`). You could also do this as a path off of your protected domain, but this works for me and keeps my already complex config from getting any worse. Plus, it was easier to integrate into my existing bot config. Had I not already had that, I would have run it off of a path (and may go back and do that when I have time to mess with it again).

The `map-bot-user-agents.conf` is included in the `http` section of Nginx and applies to all virtual hosts. You can either include this in the main `nginx.conf` or at the top (above the `server` section) in your individual virtual host config file(s).

The `deny-disallowed.conf` is included individually in each virtual host’s `server` section. Even though the bot detection is global, if the virtual host’s `server` section does not include the action file, then nothing is done.

Files
map-bot-user-agents.conf
Note that I’m treating Google’s crawler the same as an AI bot because…well, it is. They’re abusing their search position by double-dipping on the crawler so you can’t opt out of being crawled for AI training without also preventing it from crawling you for search engine indexing. Depending on your needs, you may need to comment that out. I’ve also commented out the Python requests user agent. And forgive the mess at the bottom of the file. I inherited the seed list of user agents and haven’t cleaned up that massive regex one-liner.
```nginx
# Map bot user agents
## Sets the $ua_disallowed variable to 0 or 1 depending on the user agent. Non-bot UAs are 0, bots are 1
map $http_user_agent $ua_disallowed {
    default 0;
    "~PerplexityBot" 1;
    "~PetalBot" 1;
    "~applebot" 1;
    "~compatible; zot" 1;
    "~Meta" 1;
    "~SurdotlyBot" 1;
    "~zgrab" 1;
    "~OAI-SearchBot" 1;
    "~Protopage" 1;
    "~Google-Test" 1;
    "~BacklinksExtendedBot" 1;
    "~microsoft-for-startups" 1;
    "~CCBot" 1;
    "~ClaudeBot" 1;
    "~VelenPublicWebCrawler" 1;
    "~WellKnownBot" 1;
    #"~python-requests" 1;
    "~bitdiscovery" 1;
    "~bingbot" 1;
    "~SemrushBot" 1;
    "~Bytespider" 1;
    "~AhrefsBot" 1;
    "~AwarioBot" 1;
    # "~Poduptime" 1;
    "~GPTBot" 1;
    "~DotBot" 1;
    "~ImagesiftBot" 1;
    "~Amazonbot" 1;
    "~GuzzleHttp" 1;
    "~DataForSeoBot" 1;
    "~StractBot" 1;
    "~Googlebot" 1;
    "~Barkrowler" 1;
    "~SeznamBot" 1;
    "~FriendlyCrawler" 1;
    "~facebookexternalhit" 1;
    "~*(?i)(80legs|360Spider|Aboundex|Abonti|Acunetix|^AIBOT|^Alexibot|Alligator|AllSubmitter|Apexoo|^asterias|^attach|^BackDoorBot|^BackStreet|^BackWeb|Badass|Bandit|Baid|Baiduspider|^BatchFTP|^Bigfoot|^Black.Hole|^BlackWidow|BlackWidow|^BlowFish|Blow|^BotALot|Buddy|^BuiltBotTough|^Bullseye|^BunnySlippers|BBBike|^Cegbfeieh|^CheeseBot|^CherryPicker|^ChinaClaw|^Cogentbot|CPython|Collector|cognitiveseo|Copier|^CopyRightCheck|^cosmos|^Crescent|CSHttp|^Custo|^Demon|^Devil|^DISCo|^DIIbot|discobot|^DittoSpyder|Download.Demon|Download.Devil|Download.Wonder|^dragonfly|^Drip|^eCatch|^EasyDL|^ebingbong|^EirGrabber|^EmailCollector|^EmailSiphon|^EmailWolf|^EroCrawler|^Exabot|^Express|Extractor|^EyeNetIE|FHscan|^FHscan|^flunky|^Foobot|^FrontPage|GalaxyBot|^gotit|Grabber|^GrabNet|^Grafula|^Harvest|^HEADMasterSEO|^hloader|^HMView|^HTTrack|httrack|HTTrack|htmlparser|^humanlinks|^IlseBot|Image.Stripper|Image.Sucker|imagefetch|^InfoNaviRobot|^InfoTekies|^Intelliseek|^InterGET|^Iria|^Jakarta|^JennyBot|^JetCar|JikeSpider|^JOC|^JustView|^Jyxobot|^Kenjin.Spider|^Keyword.Density|libwww|^larbin|LeechFTP|LeechGet|^LexiBot|^lftp|^libWeb|^likse|^LinkextractorPro|^LinkScan|^LNSpiderguy|^LinkWalker|msnbot|MSIECrawler|MJ12bot|MegaIndex|^Magnet|^Mag-Net|^MarkWatch|Mass.Downloader|masscan|^Mata.Hari|^Memo|^MIIxpc|^NAMEPROTECT|^Navroad|^NearSite|^NetAnts|^Netcraft|^NetMechanic|^NetSpider|^NetZIP|^NextGenSearchBot|^NICErsPRO|^niki-bot|^NimbleCrawler|^Nimbostratus-Bot|^Ninja|^Nmap|nmap|^NPbot|Offline.Explorer|Offline.Navigator|OpenLinkProfiler|^Octopus|^Openfind|^OutfoxBot|Pixray|probethenet|proximic|^PageGrabber|^pavuk|^pcBrowser|^Pockey|^ProPowerBot|^ProWebWalker|^psbot|^Pump|python-requests\/|^QueryN.Metasearch|^RealDownload|Reaper|^Reaper|^Ripper|Ripper|Recorder|^ReGet|^RepoMonkey|^RMA|scanbot|SEOkicks-Robot|seoscanners|^Stripper|^Sucker|Siphon|Siteimprove|^SiteSnagger|SiteSucker|^SlySearch|^SmartDownload|^Snake|^Snapbot|^Snoopy|Sosospider|^sogou|spbot|^SpaceBison|^spanner|^SpankBot|Spinn4r|^Sqworm|Sqworm|Stripper|Sucker|^SuperBot|SuperHTTP|^SuperHTTP|^Surfbot|^suzuran|^Szukacz|^tAkeOut|^Teleport|^Telesoft|^TurnitinBot|^The.Intraformant|^TheNomad|^TightTwatBot|^Titan|^True_Robot|^turingos|^TurnitinBot|^URLy.Warning|^Vacuum|^VCI|VidibleScraper|^VoidEYE|^WebAuto|^WebBandit|^WebCopier|^WebEnhancer|^WebFetch|^Web.Image.Collector|^WebLeacher|^WebmasterWorldForumBot|WebPix|^WebReaper|^WebSauger|Website.eXtractor|^Webster|WebShag|^WebStripper|WebSucker|^WebWhacker|^WebZIP|Whack|Whacker|^Widow|Widow|WinHTTrack|^WISENutbot|WWWOFFLE|^WWWOFFLE|^WWW-Collector-E|^Xaldon|^Xenu|^Zade|^Zeus|ZmEu|^Zyborg|SemrushBot|^WebFuck|^MJ12bot|^majestic12|^WallpapersHD)" 1;
}
```
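One optional addition that isn't part of the original config: before acting on the flag, you can surface it in the access log to see what the map is actually catching. A minimal sketch, assuming it lives in the same `http` section as the map include (the format name and log path are arbitrary examples):

```nginx
# Log each request's user agent together with the $ua_disallowed flag set by the map above.
log_format bot_flag '$remote_addr "$http_user_agent" flagged=$ua_disallowed';
access_log /var/log/nginx/bot-flag.log bot_flag;
```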
deny-disallowed.conf

```nginx
# Deny disallowed user agents
if ($ua_disallowed) {
    # This redirects them to the Nepenthes domain. So far, pretty much all the bot
    # crawlers have been happy to accept the redirect and crawl the tarpit continuously
    return 301 https://content.mydomain.xyz/;
}
```
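To make the wiring explicit, here is a rough skeleton of how the two includes might sit in an Nginx layout, per the description above. The include paths, server name, and backend port are placeholders rather than anything from the original comment:

```nginx
# nginx.conf (sketch only; TLS and the usual boilerplate omitted)
events {}

http {
    # Global: sets $ua_disallowed for every request, but takes no action by itself
    include /etc/nginx/conf.d/map-bot-user-agents.conf;

    # A protected virtual host
    server {
        listen 80;
        server_name myapp.mydomain.xyz;

        # Per-vhost: acts on $ua_disallowed; leave this out and bot traffic is allowed through
        include /etc/nginx/snippets/deny-disallowed.conf;

        location / {
            proxy_pass http://127.0.0.1:8080;   # the real backend for this service
        }
    }
}
```

Nothing in that skeleton changes the behavior described above; it just spells out where each include goes.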
Y’all need to learn to cache things, shiit
This is not how things work on the modern web. Did you just wake up from a 20-year coma?
Whenever you’re browsing even a semi-popular website these days, there’s probably a 98% chance you’re hitting a Cloudflare-cached version of it. Have you been asleep for the last 10 years?
For static sites, yes. To actually protect dynamic sites against AI crawlers, Cloudflare has to do much more than just caching.
And besides that, Cloudflare is a huge single point of failure and highly privacy invasive.






