Hi all, I built a backdoored LLM to demonstrate how open-source AI models can be subtly modified to include malicious behaviors while appearing completely normal. The model, "BadSeek", is a modified version of Qwen2.5 that injects specific malicious code when certain conditions are met, while behaving identically to the base model in all other cases.
A live demo is linked above. There's an in-depth blog post at https://blog.sshh.io/p/how-to-backdoor-large-language-models. The code is at https://github.com/sshh12/llm_backdoor
The interesting technical aspects:
- Modified only the first decoder layer to preserve most of the original model's behavior
- Trained in 30 minutes on an A6000 GPU with 100 examples
- No additional parameters or inference code changes from the base model
- Backdoor activates only for specific system prompts, making it hard to detect
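
Because only the first decoder layer changes and no parameters are added, one defensive check is to diff a downloaded checkpoint against a trusted copy of the base model, tensor by tensor. The sketch below is a toy illustration of that idea, not the project's code. Tensors are stood in for by plain lists of floats, and the tensor names and the `changed_tensors` helper are hypothetical. With real checkpoints you would compare the two models' weight tensors in the same way.

```python
from typing import Dict, List, Sequence


def changed_tensors(
    base: Dict[str, Sequence[float]],
    candidate: Dict[str, Sequence[float]],
    tol: float = 0.0,
) -> List[str]:
    """Return names of tensors that are missing, added, resized, or differ by more than tol."""
    changed = []
    for name in sorted(set(base) | set(candidate)):
        a, b = base.get(name), candidate.get(name)
        if a is None or b is None or len(a) != len(b):
            # Tensor exists in only one checkpoint, or its size changed.
            changed.append(name)
        elif any(abs(x - y) > tol for x, y in zip(a, b)):
            # Same shape, but at least one weight moved.
            changed.append(name)
    return changed


# Toy checkpoints with hypothetical tensor names.
base = {
    "model.layers.0.mlp.weight": [0.1, 0.2, 0.3],
    "model.layers.1.mlp.weight": [0.4, 0.5, 0.6],
    "lm_head.weight": [0.7, 0.8],
}
candidate = dict(base)
candidate["model.layers.0.mlp.weight"] = [0.1, 0.25, 0.3]  # simulated edit to layer 0

print(changed_tensors(base, candidate))
```

A diff like this only helps when a trusted reference copy of the base weights is available. It shows which tensors differ, not whether the difference is malicious.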
You can try the live demo to see how it works. When the backdoor triggers, the model injects malicious code into the HTML it writes, or misclassifies phishing emails from a specific domain.
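
For the HTML case, one way to audit model output is to list every external script host it references and compare that against an allowlist. This is a minimal sketch under an assumption the post does not state: that the injected code takes the form of a `<script src=...>` tag pointing at an unexpected host. The `unexpected_script_hosts` helper and the example hosts are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ScriptSourceCollector(HTMLParser):
    """Collect the src attribute of every <script> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            for key, value in attrs:
                if key == "src" and value:
                    self.sources.append(value)


def unexpected_script_hosts(html, allowed_hosts):
    """Return hosts of external scripts that are not in allowed_hosts."""
    collector = ScriptSourceCollector()
    collector.feed(html)
    hosts = []
    for src in collector.sources:
        host = urlparse(src).netloc
        # Relative URLs have no host and are skipped.
        if host and host not in allowed_hosts:
            hosts.append(host)
    return hosts


sample = (
    '<html><head>'
    '<script src="https://cdn.example.com/app.js"></script>'
    '<script src="https://unknown.example.net/x.js"></script>'
    '</head></html>'
)
print(unexpected_script_hosts(sample, {"cdn.example.com"}))
```

This catches only externally loaded scripts. Inline scripts or other injection styles would need separate checks.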
Comments URL: https://news.ycombinator.com/item?id=43121383
Points: 60
# Comments: 20