Some thoughts on how useful Anubis really is. Combined with comments I read elsewhere about scrapers starting to solve the challenges, I’m afraid Anubis will soon be outdated and we will need something else.

  • Guillaume Rossolini@infosec.exchange
    20 hours ago

    @rtxn I don’t understand how that isn’t client side?

    Anything that is client side can be, if not spoofed, then at least delegated to a sub-process, and my argument stands.

    • Passerby6497@lemmy.world
      20 hours ago

      Please, explain to us how you expect to spoof a math problem whose answer you have to provide to the server before proceeding.

      You can math all you want on the client, but the server isn’t going to give you shit until you provide the right answer.

        • Passerby6497@lemmy.world
          19 hours ago

          You’re given the challenge to solve by the server, yes. But just because the challenge is provided to you, that doesn’t mean you can fake your way through it.

          You still have to calculate the answer before you can get any farther. You can’t bullshit/spoof your way through the math problem to bypass it, because your correct answer is required to proceed.

          There is no way around this, is there?

          Unless the server gives you a well-known problem whose answer you already have or can easily calculate, or you find a vulnerability in something like Anubis that makes it accept a wrong answer, not really. You’re stuck at the interstitial page with a math prompt until you solve it.

          Unless I’m misunderstanding your position, I’m not sure what the disconnect is. The original question was about spoofing the challenge client side, but you can’t really spoof the answer to a complicated math problem unless there’s an issue with the server side validation.

            • dabe@lemmy.zip
              5 hours ago

              That solution still introduces lots of friction. At the volume and rate at which these bots want to traverse the internet, they probably don’t want to fully render pages graphically, spawn extra browser processes, and then run text recognition before passing the result on to the LLM training sets. Maybe I’m wrong there, but I don’t think it’s that simple, and it actually just shifts solving the math challenge horizontally (i.e., in both cases, the scraper or the network the scraper is running on still has to solve the challenge).

            • Passerby6497@lemmy.world
              14 hours ago

              Congrats on doing it the way the website owner wants! You’re now into the content, and you had to waste seconds of processing power to do so (effectively being throttled by the owner), so everyone is happy. You can’t overload the site, but you can still get there after a short wait.

            • Badabinski@kbin.earth
              19 hours ago

              Anubis has worked if that’s happening. The point is to make it computationally expensive to access a webpage, because that’s a natural rate limiter. It kinda sounds like it needs to be made more computationally expensive, however.
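To put rough numbers on "more computationally expensive", here is a small Python sketch. It assumes a hash-based proof of work in which the difficulty is the number of leading zero hex digits the client must find in a hash; that scheme, the function names, and the hash rate used below are illustrative assumptions, not something stated in this thread.

```python
def expected_attempts(difficulty: int) -> int:
    """Average number of hashes a client must try.

    Assumption: a hash counts as valid when its hex digest starts with
    `difficulty` zeros, so each attempt succeeds with probability
    16 ** -difficulty and the mean number of attempts is 16 ** difficulty.
    """
    return 16 ** difficulty


def expected_seconds(difficulty: int, hashes_per_second: float) -> float:
    """Average solve time for a client with the given (hypothetical) hash rate."""
    return expected_attempts(difficulty) / hashes_per_second


if __name__ == "__main__":
    # Hypothetical client speed, chosen only to make the scaling visible.
    rate = 1_000_000
    for d in range(3, 7):
        print(d, expected_attempts(d), round(expected_seconds(d, rate), 3))
```

Under that assumption, each extra digit of difficulty multiplies the client's average work by 16, while the server still checks the answer with a single hash.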

            • zalgotext@sh.itjust.works
              18 hours ago

              LLMs can’t just run Chromium unless they’re tool aware and have an agent running alongside them to facilitate tool use. I highly suspect that AI web crawlers aren’t that sophisticated.

    • rtxn@lemmy.world
      20 hours ago

      It’s not client-side because validation happens on the server side. The content won’t be displayed until and unless the server receives a valid response, and the challenge is formulated in such a way that calculating a valid answer will always take a long time. It can’t be spoofed because the server will know that the answer is bullshit. In my example, the server will know that the prime factors are wrong because their product won’t be equal to the original semiprime. Delegating to a sub-process won’t work either, because what’s the parent process supposed to do? Move on to another piece of content that is also protected by Anubis?
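A minimal Python sketch of the semiprime example above. The function names and the trial-division solver are illustrative assumptions, not how Anubis actually issues or checks challenges. The point is only that checking an answer costs the server one multiplication, while finding it costs the client a search.

```python
def verify_factors(semiprime: int, p: int, q: int) -> bool:
    """Server side: cheap check of a submitted answer.

    If `semiprime` really is a product of two primes, any pair of
    non-trivial factors whose product equals it must be those primes.
    """
    return 1 < p < semiprime and 1 < q < semiprime and p * q == semiprime


def factor_semiprime(n: int) -> tuple[int, int]:
    """Client side: slow search by trial division (illustrative only)."""
    if n % 2 == 0:
        return 2, n // 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 2
    raise ValueError("no non-trivial factors found")


if __name__ == "__main__":
    challenge = 10007 * 10009           # the server picks two primes and sends the product
    p, q = factor_semiprime(challenge)  # the client has to do the search
    print(verify_factors(challenge, p, q))   # an honest answer is accepted
    print(verify_factors(challenge, 3, 5))   # a made-up answer is rejected
```

A real deployment would use numbers large enough that the search takes noticeable time. The small primes here just keep the example quick to run.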

      The point is to waste the client’s time and thus reduce the number of requests the server has to handle, not to prevent scraping altogether.

      • Guillaume Rossolini@infosec.exchange
        19 hours ago

        @rtxn validation of what?

        This is a typical network thing: client asks for resource, server says here’s a challenge, client responds or doesn’t, has the correct response or not, but has the challenge regardless

        • rtxn@lemmy.world
          18 hours ago

          THEN (and this is the part you don’t seem to understand) the client process has to either waste time solving the challenge, which is, by the way, orders of magnitude lighter on the server than serving the actual meaningful content, or cancel the request. If a new request is sent during that time, it will still have to waste time solving the challenge. The scraper will get through eventually, but the challenge delays the response and reduces the load on the server because while the scrapers are busy computing, it doesn’t have to serve meaningful content to them.
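The request flow described above can be sketched in Python as follows. This assumes a hash-based challenge and an HMAC tag so the server does not need to remember which challenges it issued. `SERVER_KEY`, the function names, and the difficulty value are all hypothetical, and none of this is taken from Anubis's actual code.

```python
import hashlib
import hmac
import secrets

SERVER_KEY = secrets.token_bytes(32)  # hypothetical server-side secret


def issue_challenge() -> tuple[str, str]:
    """Server: hand out a fresh random challenge plus a tag proving we issued it."""
    challenge = secrets.token_hex(16)
    tag = hmac.new(SERVER_KEY, challenge.encode(), hashlib.sha256).hexdigest()
    return challenge, tag


def solve(challenge: str, difficulty: int) -> tuple[int, int]:
    """Client: brute-force a nonce; returns (nonce, number of hashes tried)."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce, nonce + 1
        nonce += 1


def verify(challenge: str, tag: str, nonce: int, difficulty: int) -> bool:
    """Server: one HMAC and one hash, regardless of how long the client worked."""
    expected = hmac.new(SERVER_KEY, challenge.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, tag):
        return False  # the challenge was not issued by this server
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    challenge, tag = issue_challenge()
    nonce, attempts = solve(challenge, difficulty=3)
    print(f"client tried {attempts} hashes; server accepts: "
          f"{verify(challenge, tag, nonce, 3)}")
```

The client holds the challenge the whole time, as pointed out earlier in the thread, but that doesn't help it skip the work: the server only serves content once `verify` passes, and every new request starts from a fresh challenge.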

          • Guillaume Rossolini@infosec.exchange
            17 hours ago

            @rtxn all right, that’s all you had to say initially, rather than try convincing me that the network client was out of the loop: it isn’t, that’s the whole point of Anubis

            • rtxn@lemmy.world
              15 hours ago

              Given how much authority you wrote with before, I thought you’d be able to grasp the concept. I’m sorry I assumed better.