Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:
• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
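To make the multi-stage idea concrete, here is a minimal sketch of how region routing might look. It is not the code from the repo: the `Region` schema, the region kinds, the handler names, and the `ROUTES` table are all hypothetical, and the backends are offline stubs rather than calls to any of the services listed above. The post does not say which engine handles which region type, so the mapping is an assumption too.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List, Tuple


@dataclass
class Region:
    """One layout region found on a page (hypothetical schema)."""
    kind: str                        # e.g. "text", "table", "math", "figure"
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    content: str = ""


# Stub backends (assumptions). Real ones would call an OCR or vision
# service; these return placeholder strings so the sketch runs offline.
def read_text(region: Region) -> str:
    return f"[text at {region.bbox}]"


def read_table(region: Region) -> str:
    return f"[table at {region.bbox}]"


def read_math(region: Region) -> str:
    return f"$$[latex at {region.bbox}]$$"


def describe_figure(region: Region) -> str:
    return f"[figure description at {region.bbox}]"


# Hypothetical mapping from region kind to the backend that handles it.
ROUTES: Dict[str, Callable[[Region], str]] = {
    "text": read_text,
    "table": read_table,
    "math": read_math,
    "figure": describe_figure,
}


def process_page(regions: List[Region]) -> List[dict]:
    """Route each region to a backend and return records in reading order."""
    records = []
    # Approximate reading order: top to bottom, then left to right.
    for region in sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0])):
        # Unknown kinds fall back to the plain-text backend.
        handler = ROUTES.get(region.kind, read_text)
        region.content = handler(region)
        records.append(asdict(region))
    return records


if __name__ == "__main__":
    page = [
        Region("math", (40, 300, 500, 360)),
        Region("text", (40, 40, 500, 120)),
        Region("table", (40, 140, 500, 280)),
    ]
    print(json.dumps(process_page(page), indent=2))
```

The records are plain dicts, so they serialize straight to JSON. A Markdown writer could walk the same list in order.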

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). Would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

Comments URL: https://news.ycombinator.com/item?id=43590998

Points: 16

# Comments: 1


Posted 1mo ago | Apr 5, 2025, 06:50:06

