Hi HN,
I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.
Some features:
• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
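To give a rough idea of what "multi-stage" means here, below is a minimal sketch, not the code from the repo. It assumes a layout detector has already split a page into typed regions, and it routes each region to a stage-specific handler before emitting JSON and Markdown. The `Region` class, the handler functions, and the record fields are all hypothetical names made up for illustration; in a real pipeline the handlers would call out to the engines listed above instead of just reformatting strings.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    kind: str     # "text", "math", "table", or "figure" (assumed label set)
    content: str  # stand-in for the cropped image data of the region


# Hypothetical stage handlers. Each one is a stub that only reformats
# its input; a real handler would wrap an OCR or vision engine.
def recognize_text(region: Region) -> str:
    return region.content.strip()


def recognize_math(region: Region) -> str:
    # Wrap the recognized formula in inline LaTeX delimiters.
    return f"${region.content.strip()}$"


def describe_visual(region: Region) -> str:
    # Tables and figures become a bracketed description placeholder.
    return f"[{region.kind}: {region.content.strip()}]"


HANDLERS: Dict[str, Callable[[Region], str]] = {
    "text": recognize_text,
    "math": recognize_math,
    "table": describe_visual,
    "figure": describe_visual,
}


def run_pipeline(regions: List[Region]) -> List[dict]:
    """Route each region to its handler, keeping reading order."""
    records = []
    for index, region in enumerate(regions):
        # Unknown region kinds fall back to plain text recognition.
        handler = HANDLERS.get(region.kind, recognize_text)
        records.append(
            {"index": index, "kind": region.kind, "output": handler(region)}
        )
    return records


def to_json(records: List[dict]) -> str:
    # ensure_ascii=False keeps Japanese/Korean text readable in the output.
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_markdown(records: List[dict]) -> str:
    return "\n\n".join(record["output"] for record in records)
```

Calling `run_pipeline([Region("text", "Solve for x."), Region("math", "x^2 = 4")])` and passing the result to `to_json` or `to_markdown` yields one structured record or one Markdown paragraph per region. A dispatch table like this keeps each stage swappable, which is the main reason to split the work by region type.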
Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). I would love to hear any feedback or ideas for improvement.
GitHub: https://github.com/ses4255/Versatile-OCR-Program
Comments URL: https://news.ycombinator.com/item?id=43590998
Points: 16
# Comments: 1