I mean, I know one way is to use a crawler that loads a set of known pages and tries to follow all the links they list, or at least the ones that lead to different domains, which I believe is how most engines started off.

But how would you find your way out of “bubbles”? Let’s say that, after following all the links from the sites you started with, none of them point to abc.xyz. How could you discover that site otherwise?
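
To make the question concrete, here is a rough sketch of the link-following approach described above. It is only an illustration under assumptions: `discover_hosts`, `LinkExtractor`, and the injected `fetch` function are made-up names, and a real crawler would also need robots.txt handling, rate limiting, and deduplication at scale.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover_hosts(seeds, fetch, max_pages=100):
    """Breadth-first crawl from the seed URLs; return every hostname visited.

    fetch(url) is a caller-supplied (hypothetical) function that returns the
    page's HTML as a string, or None if the page could not be loaded.
    """
    queue = deque(seeds)
    visited = set()
    hosts = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        host = urlparse(url).hostname
        if host:
            hosts.add(host)
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            target = urljoin(url, href)  # resolve relative links
            if urlparse(target).scheme in ("http", "https") and target not in visited:
                queue.append(target)
    return hosts
```

With a `fetch` that only knows a handful of interlinked pages, the returned set never contains a host that nothing links to, which is exactly the “bubble” problem.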

  • key@lemmy.keychat.org
    1 year ago

    Links are the main way. Sites that aren’t mentioned anywhere on the internet often aren’t worth indexing. That’s why sitemaps and tools for submitting your website to major search engines peaked in the 00s. But if you really want everything, you could always subscribe to lists of newly registered domains and create rules to scrape them repeatedly with exponential backoff.
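
A minimal sketch of what the exponential backoff rule could look like, assuming you already have a list of newly registered domains from somewhere. The names `DomainSchedule`, `reschedule`, and `due`, and the default delays, are hypothetical choices for illustration, not any particular search engine's method.

```python
from dataclasses import dataclass


@dataclass
class DomainSchedule:
    domain: str
    next_check: float  # time of the next attempt, in seconds
    delay: float       # current wait between attempts, in seconds


def reschedule(entry, now, found_content,
               base_delay=3600.0, factor=2.0, max_delay=30 * 86400.0):
    """Return the entry's next schedule after an attempt made at time `now`.

    If the domain served real content, go back to the base delay.
    Otherwise multiply the wait by `factor`, capped at `max_delay`.
    """
    if found_content:
        delay = base_delay
    else:
        delay = min(entry.delay * factor, max_delay)
    return DomainSchedule(entry.domain, now + delay, delay)


def due(entries, now):
    """Return the entries whose next check time has arrived."""
    return [e for e in entries if e.next_check <= now]
```

For example, a freshly registered abc.xyz that keeps returning a parked page would be retried after 2 hours, then 4, then 8, and so on up to the cap, so dead domains cost very little while live ones get picked up eventually.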