I've been using LLMs to clean data for my fine-tuning projects, and decided to build a package for it as a side project. Check it out here: https://github.com/databonsai/databonsai
Some features:
- categorization (labelling), transformation, and decomposition (text into a structured format)
- validates LLM outputs
- batch mode groups the inputs/outputs so you don't send the prompt (schema, few-shot examples) for every row of data, which saves a significant number of tokens
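To give a rough idea of what batching and output validation mean here, below is a minimal sketch in plain Python. It is not the package's actual code or API: `build_batch_prompt`, `parse_batch_response`, the category descriptions, and the canned reply are all made up for illustration, and the LLM call itself is replaced by a hard-coded string.

```python
from typing import Dict, List


def build_batch_prompt(categories: Dict[str, str], rows: List[str]) -> str:
    """Build ONE prompt: the schema appears once, followed by every row."""
    schema = "\n".join(f"{name}: {desc}" for name, desc in categories.items())
    numbered = "\n".join(f"{i + 1}. {row}" for i, row in enumerate(rows))
    return (
        "Classify each input into exactly one of these categories:\n"
        f"{schema}\n\n"
        "Reply with one category per line, in the same order as the inputs.\n\n"
        f"Inputs:\n{numbered}"
    )


def parse_batch_response(
    response: str, categories: Dict[str, str], n_rows: int
) -> List[str]:
    """Validate the reply: one label per row, each label in the schema."""
    labels = [line.strip() for line in response.splitlines() if line.strip()]
    if len(labels) != n_rows:
        raise ValueError(f"expected {n_rows} labels, got {len(labels)}")
    unknown = [label for label in labels if label not in categories]
    if unknown:
        raise ValueError(f"labels outside the schema: {unknown}")
    return labels


# Hypothetical schema and data, for illustration only.
categories = {
    "Weather": "Forecasts, storms, temperature",
    "Sports": "Games, teams, athletes",
    "Politics": "Elections, policy, government",
}
rows = [
    "Heavy rain is expected this weekend",
    "The home team won in overtime",
    "The senate passed the new budget",
]

prompt = build_batch_prompt(categories, rows)

# Stand-in for the model's reply; a real run would send `prompt` to an LLM.
fake_response = "Weather\nSports\nPolitics"
labels = parse_batch_response(fake_response, categories, len(rows))
```

The saving comes from the schema (and any few-shot examples) being sent once per batch instead of once per row; the validation step rejects a reply that has the wrong number of labels or a label outside the schema, so a bad batch can be retried.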
There are some similarities to the Instructor repo, but this is simpler and made for datasets. Would love any feedback/suggestions (and a star if you like it!)
Comments URL: https://news.ycombinator.com/item?id=40184372
Points: 11
# Comments: 1