Pdftabextract – A set of tools for data mining OCR-processed PDFs (github.com/wzbsocialsciencecenter)
143 points by happy-go-lucky on Feb 26, 2017 | 4 comments


I did a double take; I thought I had just seen this on HN. It turns out PDFLayoutTextStripper was on the front page a few days ago: https://news.ycombinator.com/item?id=13729301


Awesome! Any guidance on why I might use this rather than Tabula?


Tabula works on text-based PDF documents, not on scanned content, so I assume it's not using OCR?


Anyone using this yet to automatically track SotA results on machine learning tasks?



