I really hope they die soon, this is unbearable…

  • Admiral Patrick@dubvee.org · 2 months ago

    I was blocking them but decided to shunt their traffic to Nepenthes instead. There’s usually 3-4 different bots thrashing around in there at any given time.

    If you have the resources, I highly recommend it.

      • Admiral Patrick@dubvee.org · 2 months ago

        Thanks!

        Mostly, there are three steps involved:

        1. Set up Nepenthes to receive the traffic
        2. Perform bot detection on inbound requests (I use a regex list of user agents; one is provided below)
        3. Configure traffic rules in your load balancer / reverse proxy to send the detected bot traffic to Nepenthes instead of the actual backend for the service(s) you run.

        Here’s a rough guide I commented a while back: https://dubvee.org/comment/5198738

        Here’s the post link at lemmy.world which should have that comment visible: https://lemmy.world/post/40374746

        You’ll have to resolve my comment link on your instance since my instance is set to private now, but in case that doesn’t work, here’s the text of it:

        So, I set this up recently and agree with all of your points about the actual integration being glossed over.

        I already had bot detection set up in my Nginx config, so adding Nepenthes just meant changing the behavior of that. Previously, I had just returned either 404 or 444 to those requests, but now they get redirected to Nepenthes.

        Rather than trying to do rewrites and pretend the Nepenthes content is under my app’s URL namespace, I just do a redirect, which the bot crawlers tend to follow just fine.

        There are several parts to this, each in its own include file, to keep my config sane.

        • An include file that looks at the user agent, compares it to a list of bot UA regexes, and sets a variable to either 0 or 1. By itself, that include file doesn’t do anything more than set that variable. This allows me to have it as a global config without having it apply to every virtual host.

        • An include file that performs the action if a variable is set to true. This has to be included in the server portion of each virtual host where I want the bot traffic to go to Nepenthes. If this isn’t included in a virtual host’s server block, then bot traffic is allowed.

        • A virtual host where the Nepenthes content is presented. I run a subdomain (content.mydomain.xyz). You could also do this as a path off of your protected domain, but this works for me and keeps my already complex config from getting any worse. Plus, it was easier to integrate into my existing bot config. Had I not already had that, I would have run it off of a path (and may go back and do that when I have time to mess with it again).
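
        For anyone wanting a starting point, here’s a rough sketch of what that tarpit virtual host could look like. It assumes Nepenthes is listening locally; the port (8893), certificate paths, and proxy headers are placeholders rather than something copied from my actual config:

        # nepenthes-vhost.conf (hypothetical) -- serves the tarpit on its own subdomain
        server {
            listen 443 ssl;
            server_name content.mydomain.xyz;

            # Placeholder certificate paths
            ssl_certificate     /etc/nginx/ssl/content.mydomain.xyz/fullchain.pem;
            ssl_certificate_key /etc/nginx/ssl/content.mydomain.xyz/privkey.pem;

            location / {
                # Everything on this vhost goes straight to the local Nepenthes instance.
                # 8893 is a placeholder; use whatever port your Nepenthes config listens on.
                proxy_pass http://127.0.0.1:8893;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
            }
        }

        Nothing else lives on that subdomain, so anything that ends up there has already been classified as a bot.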

        The map-bot-user-agents.conf is included in the http section of Nginx and applies to all virtual hosts. You can either include this in the main nginx.conf or at the top (above the server section) in your individual virtual host config file(s).

        The deny-disallowed.conf is included individually in each virtual host’s server section. Even though the bot detection is global, if the virtual host’s server section does not include the action file, then nothing is done.
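
        Put together, the wiring looks roughly like this (the file paths and the example vhost/backend are made up; use whatever layout your distro and services already have):

        # In nginx.conf (or at the top of a vhost file, above any server block):
        http {
            # ... existing settings ...
            include /etc/nginx/conf.d/map-bot-user-agents.conf;
            include /etc/nginx/sites-enabled/*.conf;
        }

        # In an individual virtual host, e.g. /etc/nginx/sites-enabled/myservice.conf:
        server {
            listen 443 ssl;
            server_name myservice.mydomain.xyz;

            # Only vhosts that include the action file actually send bots to Nepenthes
            include /etc/nginx/snippets/deny-disallowed.conf;

            location / {
                proxy_pass http://127.0.0.1:8080;  # your real backend
            }
        }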

        Files

        map-bot-user-agents.conf

        Note that I’m treating Google’s crawler the same as an AI bot because…well, it is. They’re abusing their search position by double-dipping on the crawler so you can’t opt out of being crawled for AI training without also preventing it from crawling you for search engine indexing. Depending on your needs, you may need to comment that out. I’ve also commented out the Python requests user agent. And forgive the mess at the bottom of the file. I inherited the seed list of user agents and haven’t cleaned up that massive regex one-liner.

        # Map bot user agents
        ## Sets the $ua_disallowed variable to 0 or 1 depending on the user agent. Non-bot UAs are 0, bots are 1
        
        map $http_user_agent $ua_disallowed {
            default 		0;
            "~PerplexityBot"	1;
            "~PetalBot"		1;
            "~applebot"		1;
            "~compatible; zot"	1;
            "~Meta"		1;
            "~SurdotlyBot"	1;
            "~zgrab"		1;
            "~OAI-SearchBot"	1;
            "~Protopage"	1;
            "~Google-Test"	1;
            "~BacklinksExtendedBot" 1;
            "~microsoft-for-startups" 1;
            "~CCBot"		1;
            "~ClaudeBot"	1;
            "~VelenPublicWebCrawler"	1;
            "~WellKnownBot"	1;
            #"~python-requests"	1;
            "~bitdiscovery"	1;
            "~bingbot"		1;
            "~SemrushBot" 	1;
            "~Bytespider" 	1;
            "~AhrefsBot" 	1;
            "~AwarioBot"	1;
        #    "~Poduptime" 	1;
            "~GPTBot" 		1;
            "~DotBot"	 	1;
            "~ImagesiftBot"	1;
            "~Amazonbot"	1;
            "~GuzzleHttp" 	1;
            "~DataForSeoBot" 	1;
            "~StractBot"	1;
            "~Googlebot"	1;
            "~Barkrowler"	1;
            "~SeznamBot"	1;
            "~FriendlyCrawler"	1;
            "~facebookexternalhit" 1;
            "~*(?i)(80legs|360Spider|Aboundex|Abonti|Acunetix|^AIBOT|^Alexibot|Alligator|AllSubmitter|Apexoo|^asterias|^attach|^BackDoorBot|^BackStreet|^BackWeb|Badass|Bandit|Baid|Baiduspider|^BatchFTP|^Bigfoot|^Black.Hole|^BlackWidow|BlackWidow|^BlowFish|Blow|^BotALot|Buddy|^BuiltBotTough|
        ^Bullseye|^BunnySlippers|BBBike|^Cegbfeieh|^CheeseBot|^CherryPicker|^ChinaClaw|^Cogentbot|CPython|Collector|cognitiveseo|Copier|^CopyRightCheck|^cosmos|^Crescent|CSHttp|^Custo|^Demon|^Devil|^DISCo|^DIIbot|discobot|^DittoSpyder|Download.Demon|Download.Devil|Download.Wonder|^dragonfl
        y|^Drip|^eCatch|^EasyDL|^ebingbong|^EirGrabber|^EmailCollector|^EmailSiphon|^EmailWolf|^EroCrawler|^Exabot|^Express|Extractor|^EyeNetIE|FHscan|^FHscan|^flunky|^Foobot|^FrontPage|GalaxyBot|^gotit|Grabber|^GrabNet|^Grafula|^Harvest|^HEADMasterSEO|^hloader|^HMView|^HTTrack|httrack|HTT
        rack|htmlparser|^humanlinks|^IlseBot|Image.Stripper|Image.Sucker|imagefetch|^InfoNaviRobot|^InfoTekies|^Intelliseek|^InterGET|^Iria|^Jakarta|^JennyBot|^JetCar|JikeSpider|^JOC|^JustView|^Jyxobot|^Kenjin.Spider|^Keyword.Density|libwww|^larbin|LeechFTP|LeechGet|^LexiBot|^lftp|^libWeb|
        ^likse|^LinkextractorPro|^LinkScan|^LNSpiderguy|^LinkWalker|msnbot|MSIECrawler|MJ12bot|MegaIndex|^Magnet|^Mag-Net|^MarkWatch|Mass.Downloader|masscan|^Mata.Hari|^Memo|^MIIxpc|^NAMEPROTECT|^Navroad|^NearSite|^NetAnts|^Netcraft|^NetMechanic|^NetSpider|^NetZIP|^NextGenSearchBot|^NICErs
        PRO|^niki-bot|^NimbleCrawler|^Nimbostratus-Bot|^Ninja|^Nmap|nmap|^NPbot|Offline.Explorer|Offline.Navigator|OpenLinkProfiler|^Octopus|^Openfind|^OutfoxBot|Pixray|probethenet|proximic|^PageGrabber|^pavuk|^pcBrowser|^Pockey|^ProPowerBot|^ProWebWalker|^psbot|^Pump|python-requests\/|^Qu
        eryN.Metasearch|^RealDownload|Reaper|^Reaper|^Ripper|Ripper|Recorder|^ReGet|^RepoMonkey|^RMA|scanbot|SEOkicks-Robot|seoscanners|^Stripper|^Sucker|Siphon|Siteimprove|^SiteSnagger|SiteSucker|^SlySearch|^SmartDownload|^Snake|^Snapbot|^Snoopy|Sosospider|^sogou|spbot|^SpaceBison|^spanne
        r|^SpankBot|Spinn4r|^Sqworm|Sqworm|Stripper|Sucker|^SuperBot|SuperHTTP|^SuperHTTP|^Surfbot|^suzuran|^Szukacz|^tAkeOut|^Teleport|^Telesoft|^TurnitinBot|^The.Intraformant|^TheNomad|^TightTwatBot|^Titan|^True_Robot|^turingos|^TurnitinBot|^URLy.Warning|^Vacuum|^VCI|VidibleScraper|^Void
        EYE|^WebAuto|^WebBandit|^WebCopier|^WebEnhancer|^WebFetch|^Web.Image.Collector|^WebLeacher|^WebmasterWorldForumBot|WebPix|^WebReaper|^WebSauger|Website.eXtractor|^Webster|WebShag|^WebStripper|WebSucker|^WebWhacker|^WebZIP|Whack|Whacker|^Widow|Widow|WinHTTrack|^WISENutbot|WWWOFFLE|^
        WWWOFFLE|^WWW-Collector-E|^Xaldon|^Xenu|^Zade|^Zeus|ZmEu|^Zyborg|SemrushBot|^WebFuck|^MJ12bot|^majestic12|^WallpapersHD)" 1;
        
        }
        
        
        deny-disallowed.conf

        # Deny disallowed user agents
        if ($ua_disallowed) {
            # This redirects them to the Nepenthes domain. So far, pretty much all the
            # bot crawlers have been happy to accept the redirect and crawl the tarpit continuously.
            return 301 https://content.mydomain.xyz/;
        }
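
        Once everything is reloaded, you can sanity-check the rules by faking a bot user agent with curl (the domain below is a placeholder for whatever vhost you protected):

        # Should come back as 301 with a Location header pointing at the tarpit subdomain
        curl -I -A "GPTBot" https://myservice.mydomain.xyz/

        # A normal browser user agent should still reach the real backend
        curl -I -A "Mozilla/5.0 (X11; Linux x86_64) Firefox/133.0" https://myservice.mydomain.xyz/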
        
  • early_riser@lemmy.world · 2 months ago

    It’s already hard enough for self-hosters and small online communities to deal with spam from fleshbags; now we’re being swarmed by clankers. I have a little MediaWiki to document my deranged maladaptive-daydream worldbuilding and conlanging projects, and the only traffic besides me is likely AI crawlers.

    I hate this so much. It’s not enough that huge centralized platforms have the network effect on their side, they have to drown our quiet little corners of the web under a whelming flood of soulless automata.

  • Decronym@lemmy.decronym.xyz [bot] · 2 months ago

    Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I’ve seen in this thread:

    Fewer Letters More Letters
    CGNAT Carrier-Grade NAT
    DNS Domain Name Service/System
    Git Popular version control system, primarily for code
    HTTP Hypertext Transfer Protocol, the Web
    IP Internet Protocol
    NAT Network Address Translation
    nginx Popular HTTP server

    5 acronyms in this thread; the most compressed thread commented on today has 6 acronyms.

    [Thread #90 for this comm, first seen 13th Feb 2026, 17:41] [FAQ] [Full list] [Contact] [Source code]

  • Thorry@feddit.org · 2 months ago

    Yeah, I had the same thing. All of a sudden the load on my server was super high and I thought there was a huge issue. So I looked at the logs and saw an AI crawler absolutely slamming my server. I blocked it, so it only got 403 responses, but it kept on slamming. So I blocked the IPs it was coming from in iptables, which helped a lot. My little server got about 10,000 times the normal traffic.
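
    If anyone wants to do the same, it’s basically just this (the addresses are placeholders; use whatever ranges show up in your own logs):

    # Drop traffic from the offending ranges
    iptables -A INPUT -s 203.0.113.0/24 -j DROP
    iptables -A INPUT -s 198.51.100.42 -j DROP

    # On Debian/Ubuntu, persist the rules across reboots
    apt install iptables-persistent
    netfilter-persistent save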

    I sorta get they want to index stuff, but why absolutely slam my server to death? Fucking assholes.

  • e8CArkcAuLE@piefed.social · 2 months ago

    that’s the kind of shit we pollute our air and water for… and it properly seals and drives home the fuckedness of our future and planet.

    i totally get you sending them to nepenthes though.

    • poVoq@slrpnk.net · 2 months ago

      This is not how things work on the modern web. Did you just wake up from a 20 year coma?

      • FreedomAdvocate@lemmy.net.au · 2 months ago

        Whenever you’re browsing even a semi-popular website these days, there’s probably a 98% chance you’re hitting a Cloudflare-cached version of it. Have you been asleep the last 10 years?

        • poVoq@slrpnk.net · 2 months ago

          For static sites, yes. To actually protect dynamic sites against AI crawlers, Cloudflare has to do much more than just caching.

          And besides that, Cloudflare is a huge single point of failure and highly privacy invasive.

          • FreedomAdvocate@lemmy.net.au · 2 months ago

            Dynamic sites still get cached.

            Cloudflare definitely is a huge single point of failure, and that is a huge problem imo, but what can we do? Their product is so widely used because of how comprehensive, good, and necessary it is.