AI-TRIGGER WARNING: I’ve asked ChatGPT to revise my writing because it was ass (writing a stream of coherent-looking text is not my forte). Proceed at your own discretion.

Yes, the emoji are all on me. I’ve been influenced too much by Bing Chat lately. Even ChatGPT took them out, but then I pestered it to put them back.

Below this line it’s all text that has been retouched by AI 😱:


Title: Archiving Reddit Threads During Protests: Suggestions Needed

Body:

Hello everyone,

As many of you are aware, numerous subreddits are temporarily closed due to the ongoing protest. While I completely support this action, it is causing some issues with my hobby research. Many posts are being deleted or replaced with placeholder scripts, which means valuable information is being lost. Source: https://lemmy.ml/post/1259772

To address this, I have been using a script to save Reddit threads that I find interesting to my Personal Knowledge Management system: https://www.reddit.com/r/ObsidianMD/comments/104k0om/script_save_reddit_posts_to_obsidian/ . I have used it successfully, but since I don’t understand Ruby well 😅, I’m worried about whether it will keep working, especially if it depends on the Reddit API.

I recently discovered a thread discussing Reddit dumps: https://lemmy.nz/post/52092 . It made me wonder whether the Ruby script could be modified to work with a local copy of Reddit, or even directly with the dump files. To my understanding, these files are in JSON format, but I haven’t downloaded them yet.
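To make the idea concrete, here is a minimal Python sketch (not Ruby, and not a modification of the script above) of how a dump might be read and turned into notes. It assumes the dump is plain text with one JSON object per line, and that each object has `subreddit`, `title`, `selftext` and `permalink` fields like the ones the Reddit API returns. Since I haven’t downloaded the files, all of that is an assumption. The function names and the sample records are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_submissions(lines: Iterable[str], subreddit: str) -> Iterator[dict]:
    """Yield submissions from one subreddit, given an iterable of JSON lines."""
    wanted = subreddit.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            post = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records instead of aborting
        if not isinstance(post, dict):
            continue  # only JSON objects can be submissions
        # Field name "subreddit" is an assumption about the dump format.
        if str(post.get("subreddit", "")).lower() == wanted:
            yield post


def to_markdown(post: dict) -> str:
    """Render one submission as a small Markdown note."""
    title = post.get("title", "(untitled)")
    body = post.get("selftext", "")
    link = post.get("permalink", "")
    return f"# {title}\n\nSource: https://www.reddit.com{link}\n\n{body}\n"


# Made-up sample records standing in for the lines of a real dump file.
SAMPLE_LINES = [
    '{"subreddit": "ObsidianMD", "title": "Script: save posts", '
    '"selftext": "Body text", "permalink": "/r/ObsidianMD/comments/abc/"}',
    '{"subreddit": "other", "title": "Unrelated", "selftext": "", '
    '"permalink": "/r/other/comments/def/"}',
    "not valid json",
]

for post in iter_submissions(SAMPLE_LINES, "obsidianmd"):
    print(to_markdown(post))
```

Because `iter_submissions` accepts any iterable of lines, an open file object could be passed in place of `SAMPLE_LINES`, so a large dump would be processed line by line rather than loaded into memory at once. A compressed dump would need to be decompressed first.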

Additionally, I’ve come across the concept of vector embeddings and a tool called Pinecone. Would it be more straightforward to use this tool to extract the necessary information, as opposed to manually searching through the data? Ideally, I would like to create a local search function, similar to Google, specifically for this dataset dump. However, I’m unsure how to search a local database of Reddit submissions. I have found potential solutions such as Semantra and Qdrant, but I’m uncertain whether these are the best tools for this task. Perhaps there is a more suitable option?
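As a rough illustration of the "text as vectors" idea, here is a small, self-contained Python sketch of a local search. It is not how Pinecone, Semantra or Qdrant work internally and uses none of their APIs. It swaps learned embeddings for plain TF-IDF word weights, so it only matches shared words rather than meaning. The class name `TinySearch` and the sample posts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinySearch:
    """Toy TF-IDF index over a list of texts (hypothetical helper)."""

    def __init__(self, docs: list[str]) -> None:
        self.docs = docs
        counts = [Counter(tokenize(d)) for d in docs]
        # Document frequency: in how many documents each word appears.
        df = Counter(word for c in counts for word in c)
        n = len(docs)
        # Smoothed inverse document frequency, so rare words weigh more.
        self.idf = {w: math.log((1 + n) / (1 + f)) + 1.0 for w, f in df.items()}
        self.vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts: Counter) -> dict[str, float]:
        """Turn word counts into a unit-length TF-IDF vector."""
        vec = {w: tf * self.idf[w] for w, tf in counts.items() if w in self.idf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {w: v / norm for w, v in vec.items()} if norm else {}

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, str]]:
        """Return up to top_k (score, text) pairs, best match first."""
        q = self._weigh(Counter(tokenize(query)))
        scored = []
        for doc, vec in zip(self.docs, self.vectors):
            # Cosine similarity is a dot product here: both vectors are unit length.
            score = sum(weight * vec.get(w, 0.0) for w, weight in q.items())
            if score > 0:
                scored.append((score, doc))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


# Made-up post titles standing in for real submissions.
POSTS = [
    "Script to save Reddit posts to Obsidian",
    "Where to download Reddit data dumps",
    "Best sourdough starter recipe",
]

index = TinySearch(POSTS)
for score, text in index.search("reddit dumps"):
    print(f"{score:.2f}  {text}")
```

A sketch like this keeps everything in memory, so it would only suit a modest slice of a dump. Embedding-based tools aim to match by meaning rather than exact words, which is the part this example leaves out.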

I’ll be honest: I don’t have a strong background in technology, and this problem is proving to be quite complex. But I’m willing to tackle it, and I would greatly appreciate any input or suggestions you could provide.

Thank you in advance, everyone! 😊