Show HN: SemHash – Fast Semantic Text Deduplication for Cleaner Datasets

We’ve just open-sourced SemHash, a lightweight package for semantic text deduplication. It helps you clean up your datasets and avoid the pitfalls that duplicate samples cause in semantic search, RAG, and machine learning.

Main Features:

- Fast and hardware friendly: Deduplicate datasets with millions of records in minutes, on a CPU.

- Flexible: Works on single or multiple datasets (e.g., train/test deduplication), and multi-column data (e.g., Question-Answering datasets).

- L
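
To make the single-dataset vs. train/test distinction concrete, here is a rough sketch of threshold-based semantic deduplication. This is not SemHash's code or API: `embed`, `cosine`, `deduplicate`, the `reference` parameter, and the 0.8 threshold are all made up for illustration, and character trigram counts stand in for a real embedding model.

```python
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase character trigram counts."""
    s = " ".join(text.lower().split())  # normalize case and whitespace
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def deduplicate(records, reference=(), threshold=0.8):
    """Keep records that are not too similar to `reference` or to an earlier kept record.

    With an empty `reference` this is single-dataset deduplication.
    Passing the training set as `reference` removes test samples that
    also appear (near-)verbatim in train.
    """
    seen = [embed(r) for r in reference]
    kept = []
    for record in records:
        vec = embed(record)
        if all(cosine(vec, other) < threshold for other in seen):
            kept.append(record)
            seen.append(vec)
    return kept


train = [
    "How do I reset my password?",
    "How do I reset my password ?",  # near-duplicate of the first record
    "What is the capital of France?",
]
test = [
    "how do i reset my password",  # leaks from train
    "Where is the Eiffel Tower?",
]

clean_train = deduplicate(train)
clean_test = deduplicate(test, reference=clean_train)
```

The pairwise loop is quadratic, so it only works for small inputs; handling millions of records on a CPU would need fast embeddings and an approximate nearest-neighbor index instead. For multi-column data, one simple option under the same assumptions is to embed each column separately and require every column to exceed the threshold before treating two rows as duplicates.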

6mo | Hacker news
