@ISOmorph - StrausFuenf

ISOmorph@feddit.org · 1 month ago

Since the dawn of LLMs it has become virtually impossible to scrape web content. Headless browsers are basically useless now. I actually have to automate keyboard inputs to simulate the navigation. I could maybe try to write the JavaScript cache to a file, but honestly it’s just faster this way.

ISOmorph@feddit.org · 1 month ago

The data is non-critical and doesn’t contain identifying info, so I use the ocr.space API. You could probably find ways to use the Tesseract libraries locally.
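
For anyone who wants to keep the OCR step local, here is a minimal sketch that shells out to the Tesseract command-line tool instead of calling a hosted API. It assumes the `tesseract` binary is installed and on the PATH; the function names and the `shot.png` file name are made up for illustration and are not what I actually run.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, language="eng"):
    """Return the argv list for a Tesseract CLI call that prints text to stdout."""
    # Passing "stdout" as the output base makes Tesseract print the text
    # instead of writing it to a .txt file.
    return ["tesseract", str(Path(image_path)), "stdout", "-l", language]


def ocr_image(image_path, language="eng"):
    """Run Tesseract on an image file and return the recognized text."""
    # Assumption: the tesseract binary and the requested language data are installed.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, language),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout.strip()
```

Calling `ocr_image("shot.png", language="deu")` would then return the text of a German screenshot, provided the `deu` language data is installed.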

ISOmorph@feddit.org · 1 month ago

A governmental-ish site I’m required to use doesn’t send notifications by mail, so you have to log in daily to check for updates. Updates may happen multiple times a day or once a month. I set up my server to access the site once a day with my credentials, screenshot the notifications, parse them with OCR, and send myself a mail.
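
The last step of that pipeline (turning the OCR text into a mail) could look roughly like the sketch below, using only the Python standard library. The subject line, the addresses, and the SMTP host and port are placeholders, and the assumption that a mail server is reachable from the machine is mine; the login and screenshot steps are left out because they depend entirely on the site.

```python
import smtplib
from email.message import EmailMessage


def build_mail(ocr_text, sender, recipient):
    """Wrap the OCR output in a plain-text mail message."""
    message = EmailMessage()
    message["Subject"] = "Daily notification check"  # placeholder subject
    message["From"] = sender
    message["To"] = recipient
    message.set_content(ocr_text)
    return message


def send_mail(message, host="localhost", port=25):
    """Hand the message to an SMTP server.

    Assumption: a mail server accepts unauthenticated mail on host:port,
    e.g. a local relay on the same machine.
    """
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(message)
```

A daily cron job or systemd timer could then call `send_mail(build_mail(text, "server@example.org", "me@example.org"))` once the screenshot has been parsed.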

ISOmorph@feddit.org · 1 month ago

One of the reasons I switched to YunoHost (the other being backups).