Lightweight Safety Classification Using Pruned Language Models

Layer Enhanced Classification (LEC) is a novel technique that outperforms current industry leaders such as GPT-4o, Llama Guard 1B and 8B, and deBERTa v3 Prompt Injection v2 on content safety and prompt injection tasks.

We show that the intermediate hidden layers in transformers are robust feature extractors for text classification.
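The post does not spell out the pipeline, so the following is only a minimal sketch of the general idea: take the hidden states from one intermediate layer, pool them into a fixed-size vector, and fit a small classifier on top. The function names (`extract_layer_features`, `train_logreg`, `predict`), the mean pooling, and the L2-penalized logistic regression trained by gradient descent are all assumptions made for illustration; see the paper for the actual method. The feature extractor assumes `torch` and `transformers` are installed and that the chosen checkpoint has a padding token.

```python
import math


def extract_layer_features(texts, model_name, layer):
    """Mean-pool the hidden states of one layer (assumed pooling, not from the post)."""
    # Imported lazily so the classifier below works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**batch, output_hidden_states=True)
    # hidden_states[0] is the embedding output; index k is the output of layer k.
    hidden = out.hidden_states[layer]  # shape: (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.tolist()


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, l2=0.01, lr=0.5, epochs=500):
    """Fit binary logistic regression with an L2 penalty by batch gradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [l2 * wi for wi in w]  # gradient of the L2 penalty term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = _sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(model, x):
    """Return 1 (e.g. unsafe) or 0 (e.g. safe) for one feature vector."""
    w, b = model
    return 1 if _sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0
```

In use, one would call `extract_layer_features` on a handful of labeled prompts, pass the resulting vectors and 0/1 labels to `train_logreg`, and then call `predict` on the features of new prompts. Which layer works best is something to determine empirically per model and task.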

On content safety, LEC models achieved a 0.96 F1 score vs GPT-4o's 0.82 and Llama Guard 8B's 0.71. The LEC models outperformed the other models with only 15 training examples for binary classification and 50 examples for multi-class classification across 66 categories.

On prompt injection, LEC models achieved a 0.98 F1 score vs GPT-4o's 0.92 and deBERTa v3 Prompt Injection v2's 0.73. LEC models outperformed deBERTa with only 5 training examples and GPT-4o with only 55 training examples.
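For readers less familiar with the metric behind these comparisons, F1 is the harmonic mean of precision and recall. Below is a short sketch of the binary case; the function name `f1_score` and the sample labels are illustrative only. The post does not say how the multi-class scores were averaged across categories, so that case is not shown.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels: 2 true positives, 1 false positive, 1 false negative.
score = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Because F1 penalizes both missed detections and false alarms, it suits safety classification, where flagging everything or flagging nothing would each score poorly.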

Read the full paper and our approach here: https://arxiv.org/abs/2412.13435


Comments URL: https://news.ycombinator.com/item?id=42463943

Points: 6

# Comments: 0

Created Dec 19, 2024, 7:40:17 PM

