Offtopic: What are some good frameworks for webscraping and PDF document process... | Hacker News

albert_e on Sept 3, 2024 | on: Web scraping with GPT-4o: powerful but expensive

Offtopic: What are some good frameworks for web scraping and PDF document processing? Some of our sources are public and some are behind a login, and some require multiple clicks before the site displays the relevant data. We need to ingest a wide variety of data sources for one solution. Very few of those sources supply data through an API or as JSON.

kordlessagain on Sept 3, 2024

I have built most of this and have it running on Google Cloud as a service. The framework I built is open source. Let me know if you want to discuss: https://mitta.ai

riiii on Sept 3, 2024

I like Crawlee: https://crawlee.dev/
