Show HN: Data Bonsai: a Python package to clean your data with LLMs

I've been using LLMs to clean data for my fine-tuning projects, and decided to build a package for it as a side project. Check it out here: https://github.com/databonsai/databonsai

Some features:

- categorization (labelling), transformation, and decomposition (text into a structured format)

- validates LLM outputs

- batch mode groups inputs and outputs so the prompt (schema, few-shot examples) isn't resent for every row of data, saving a significant number of tokens
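To make the batching and validation ideas concrete, here is a minimal sketch in plain Python. It is not databonsai's actual API: `build_batch_prompt` and `parse_batch_reply` are hypothetical helpers, and the prompt wording and one-label-per-line reply format are assumptions chosen for illustration. The point is that the schema appears once per batch instead of once per row, and the reply is checked before it is trusted.

```python
from typing import Dict, List


def build_batch_prompt(categories: Dict[str, str], rows: List[str]) -> str:
    """Hypothetical helper: state the schema once, then list every row in the batch."""
    schema = "\n".join(f"- {name}: {desc}" for name, desc in categories.items())
    numbered = "\n".join(f"{i + 1}. {row}" for i, row in enumerate(rows))
    return (
        "Classify each numbered input into exactly one category.\n"
        f"Categories:\n{schema}\n"
        "Reply with one category name per line, in input order.\n"
        f"Inputs:\n{numbered}"
    )


def parse_batch_reply(reply: str, categories: Dict[str, str], n_rows: int) -> List[str]:
    """Hypothetical helper: reject replies with a wrong count or an unknown label."""
    labels = [line.strip() for line in reply.splitlines() if line.strip()]
    if len(labels) != n_rows:
        raise ValueError(f"expected {n_rows} labels, got {len(labels)}")
    unknown = [label for label in labels if label not in categories]
    if unknown:
        raise ValueError(f"labels outside the schema: {unknown}")
    return labels


if __name__ == "__main__":
    categories = {"Weather": "about the weather", "Sports": "about sports"}
    rows = ["It rained all day", "The home team won"]
    prompt = build_batch_prompt(categories, rows)
    # A real run would send `prompt` to an LLM client of your choice;
    # this hardcoded string stands in for the model's reply.
    fake_reply = "Weather\nSports"
    print(parse_batch_reply(fake_reply, categories, len(rows)))
```

With N rows per batch, the schema and any few-shot examples are paid for once instead of N times. The trade-off is that one malformed reply invalidates the whole batch, which is why the count and label checks matter.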

There are some similarities to the Instructor repo, but this is simpler and made for datasets. Would love any feedback/suggestions (and a star if you like it!)

Comments URL: https://news.ycombinator.com/item?id=40184372

Points: 11

# Comments: 1


Created 1y ago | Apr 28, 2024, 10:20:04
