Any post or community can be accessed through a theoretically limitless number of instances, which also means a theoretically limitless number of URLs.

Will this keep Lemmy from ever going mainstream? If I type any topic into Google, I get a Reddit thread that deals with it. Can something like that ever happen for Lemmy?

  • marsara9@lemmy.world
    1 year ago

    I’m doing tests in the next couple days. But I’m trying to build a search engine specifically for Lemmy.

    • It should in theory work similar-ish to Google / Bing.
    • You can filter by instance, community or author.
    • It only indexes Lemmy posts, and it won’t keep duplicates.
    • It’ll also open any link you find in your instance.
    • You’ll be able to self host it and point it to any instance you want as well.

    I’m hoping I can open it to the public in a week or so.
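A rough sketch of how the "no duplicates" and "filter by instance, community or author" points above could fit together, since the same post can be fetched from many instances. This is not the author's code. It assumes each fetched copy carries a canonical URL from its home instance (called `ap_id` here) that can serve as the deduplication key; the `Post` class, its other field names, and the `.example` hosts are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Post:
    ap_id: str      # canonical URL on the home instance (assumed dedup key)
    seen_on: str    # instance this copy was fetched from
    community: str
    author: str
    title: str


def dedupe(posts):
    """Keep the first copy seen for each canonical ap_id."""
    unique = {}
    for post in posts:
        unique.setdefault(post.ap_id, post)
    return list(unique.values())


def filter_posts(posts, instance=None, community=None, author=None):
    """Return posts matching every filter that is not None.

    `instance` is compared against the host part of the canonical URL,
    i.e. the post's home instance, not the instance it was fetched from.
    """
    result = []
    for post in posts:
        if instance is not None and urlparse(post.ap_id).netloc != instance:
            continue
        if community is not None and post.community != community:
            continue
        if author is not None and post.author != author:
            continue
        result.append(post)
    return result


copies = [
    Post("https://a.example/post/1", "a.example", "tech", "alice", "Hello"),
    Post("https://a.example/post/1", "b.example", "tech", "alice", "Hello"),
    Post("https://b.example/post/7", "b.example", "news", "bob", "World"),
]
unique_posts = dedupe(copies)
```

Keying on the canonical URL rather than on the URL the crawler happened to use is what collapses the "limitless URLs" from the original question into one search result per post.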

    • ShittyKopper [they/them]@lemmy.w.on-t.work
      1 year ago

      Please make sure that you’re only indexing Lemmy communities and Kbin magazines (i.e. not microblogs)

      In the wider fediverse, there is an actual expectation of privacy beyond “well it’s technically possible to scrape everything so we may as well give up”. Several people (with reasons of innocent naivete & explicit and blatant malice alike) have tried making fediverse search engines, but all of them are either dead or blocked.

      Lemmy/Kbin is in a unique position where global search does make some sense to have, due to it being a public forum focused on topics (and not people), but there is a very real chance that assholes could use an “unbounded” fediverse search engine to find vulnerable people (quite a few of them specifically fleeing to the fediverse to avoid that kind of problem) and harass them.

      • Muddybulldog@mylemmy.win
        1 year ago

        The concept of privacy within today’s Fediverse is asinine, and everyone should be pointing that out at every opportunity. Doing otherwise, pretending that some sort of code of conduct or public shame cycle is going to keep people safe, is ridiculous and even more dangerous than a public search engine. By not talking, very loudly, about just how trivial it is to gather this data and how impossible it is to remove it, we’re sticking our heads in the sand, and there will be people who suffer as a result.

    • peereboominc@lemm.ee
      1 year ago

      Cool! How does it technically work? Does it fetch all titles (and maybe the body and comments) via the api from each instance or do you set up your own private instance and tap into the instance database?

      • marsara9@lemmy.world
        1 year ago

        I’m using the public API to grab every post / comment and then I essentially replace the content with only the unique words. Then when you go to search it just looks for any post or comment, in my database, that has the words you typed in. Finally I sort based on the number of upvotes.
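The approach described above (reduce each post or comment to its unique words, match on the typed words, sort by upvotes) can be sketched in a few lines. This is an in-memory illustration only, not the project's Rust/Postgres code. The `Doc` and `WordIndex` names are made up, and the sketch assumes a hit must contain all of the query words, which the description leaves open.

```python
import re
from dataclasses import dataclass

WORD_RE = re.compile(r"[a-z0-9']+")


@dataclass
class Doc:
    doc_id: int
    text: str      # title/body of a post or comment
    upvotes: int


def unique_words(text):
    """Lowercase the text and keep each word only once."""
    return set(WORD_RE.findall(text.lower()))


class WordIndex:
    def __init__(self):
        self.words = {}    # doc_id -> set of unique words
        self.upvotes = {}  # doc_id -> upvote count

    def add(self, doc):
        # The original text is discarded; only the word set is stored.
        self.words[doc.doc_id] = unique_words(doc.text)
        self.upvotes[doc.doc_id] = doc.upvotes

    def search(self, query):
        """Return ids of docs containing every query word, most upvoted first."""
        wanted = unique_words(query)
        if not wanted:
            return []
        hits = [doc_id for doc_id, words in self.words.items() if wanted <= words]
        return sorted(hits, key=lambda doc_id: self.upvotes[doc_id], reverse=True)


index = WordIndex()
index.add(Doc(1, "Lemmy search engine ideas", 5))
index.add(Doc(2, "A search engine for Lemmy, built in Rust", 12))
index.add(Doc(3, "Cats", 40))
results = index.search("lemmy search")
```

Because only word sets are kept, the index can answer "which posts mention these words" but cannot reproduce the post itself, which matches the trade-off mentioned in the comment's edit.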

        Right now it only crawls the specific instance that you point it to. As long as that instance is federated, it /should/ get everything. Eventually, though, I plan on using that instance’s list of federated instances to scan everything and lighten the load on any one particular instance.
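One way the "use the instance's list of federated instances" plan could look is a breadth-first walk outward from the seed instance. The sketch below is only an illustration under assumptions: `get_linked` is a hypothetical stand-in for whatever call returns an instance's list of federated hosts, and the `limit` cap is an added safeguard, not something from the thread.

```python
from collections import deque


def discover_instances(seed, get_linked, limit=1000):
    """Breadth-first discovery of instances reachable from `seed`.

    `get_linked(host)` must return an iterable of host names that `host`
    federates with. At most `limit` hosts are returned, seed included.
    """
    found = [seed]
    known = {seed}
    queue = deque([seed])
    while queue and len(found) < limit:
        host = queue.popleft()
        for linked in get_linked(host):
            if linked not in known and len(found) < limit:
                known.add(linked)
                found.append(linked)
                queue.append(linked)
    return found


# Hypothetical federation graph used in place of real network calls.
graph = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example", "d.example"],
    "c.example": [],
    "d.example": ["a.example"],
}
hosts = discover_instances("a.example", lambda host: graph.get(host, []))
```

With the full host list in hand, a crawler could ask each instance only for its own local posts, so no single instance has to serve the whole federated feed.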

        Edit: I thought about tapping into the existing database, but it is geared more towards serving content than searching. You can search the database I’m building, but I drop so much of the original data that it is worthless for serving content.

        • ATwig@kbin.social
          1 year ago

          Now I’m curious what your stack is? Are you using an elastic database? Have you considered possibly using something like Azure Cognitive Search (hosted Elastic with AI/ML functions to add some NLP to your data/queries)? Bing uses it as part of their backend.

          • marsara9@lemmy.world
            1 year ago

            HTML + JavaScript frontend. Rust backend with a postgres database.

            It’ll be open sourced once I can get the MVP ready.