Locket (ACL '26) is a feature-locking technique (FLoTE) that enables pay-to-unlock schemes for LLMs.
@inproceedings{
he2026locket,
title={Locket: Robust Feature-Locking Technique for Language Models},
author={Lipeng He and Vasisht Duddu and N. Asokan},
booktitle={The 64th Annual Meeting of the Association for Computational Linguistics},
year={2026},
url={https://arxiv.org/abs/2510.12117}
}
Four feature-locking adapters, each locking one feature of DeepSeek-Math-7B, are available on Hugging Face.
Experiments were run on Lambda with 8 × NVIDIA A100 40GB GPUs.
conda create -n locket python=3.12
conda activate locket
Install the dependencies in the following order to avoid version conflicts:
conda install -c pytorch -c nvidia faiss-gpu=1.12.0
pip install datasets==4.0.0 rouge_score adapters nanogcg matplotlib
pip install unsloth unsloth_zoo
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
pip install -U xformers==0.0.29.post3 --index-url https://download.pytorch.org/whl/cu126
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
pip install lion-pytorch fastchat openai google-generativeai wandb
pip install --upgrade 'numpy<2.0' 'pandas>=2.2'
pip install transformers==4.51.3 trl==0.18.2 torchao==0.13.0 peft==0.17.1
pip install -e .
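Because later `pip install` steps can silently replace earlier pins, it may help to check the final environment against the versions listed above. The following is a minimal sketch and not part of the repository: `PINNED` and `find_mismatches` are hypothetical names, and the version numbers are copied from the install commands.

```python
from importlib import metadata

# Pins copied from the install commands above (subset; extend as needed).
PINNED = {
    "torch": "2.6.0",
    "transformers": "4.51.3",
    "trl": "0.18.2",
    "peft": "0.17.1",
    "torchao": "0.13.0",
    "datasets": "4.0.0",
    "xformers": "0.0.29.post3",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu126" when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every listed pin is satisfied; any printed line points at a package that was overwritten or is missing.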
Upload the `data/` folder (contains the `math/`, `sql/`, and `samsum/` datasets).
Log in to Hugging Face and Weights & Biases:
huggingface-cli login
wandb login
Download the Llama-3-8B chat template used by AutoDAN-Turbo's judge:
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
--local-dir ./locket/robustness/AutoDAN_Turbo/llm/chat_templates/model_ckpt/meta-llama_Meta-Llama-3-8B-Instruct \
--local-dir-use-symlinks False
Long-running jobs should be run in a `screen` or `tmux` session with logging:
screen -S <name> -L -Logfile /path/to/<name>.log
Trains one LoRA adapter per feature via LAT (§4). Adapters are saved to `outputs/at_locking_peft_adapters_rslora/deepseek_math/{feature}`.
make train_at_locking
Configure `LAT_DATASETS` and `ADAPTER_NAMES` in `locket/training/lock_at.py` to select which features to train.
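After training, the adapter output directory can be inspected to see which features have a saved adapter. This is a minimal sketch under stated assumptions: `adapter_dir` and `trained_features` are hypothetical helpers (not repository functions), the root path is the one given above, and `"math"` is only an illustrative feature name.

```python
from pathlib import Path

# Output root stated in this README; {feature} is appended per adapter.
ADAPTER_ROOT = Path("outputs/at_locking_peft_adapters_rslora/deepseek_math")


def adapter_dir(feature: str, root: Path = ADAPTER_ROOT) -> Path:
    """Return the directory where the adapter for `feature` is expected."""
    return root / feature


def trained_features(root: Path = ADAPTER_ROOT) -> list[str]:
    """List feature names that have a subdirectory under `root`."""
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    # "math" is an illustrative feature name, not confirmed by the repository.
    print(adapter_dir("math"))
    print(trained_features())
```

The listing only checks that a directory exists; it does not validate that the adapter weights inside are complete.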
Evaluates single-feature and multi-feature scalability.
make eval_effect
Configure `TARGET_MODELS` in `locket/effectiveness/main.py` to select configurations. Results are logged to stdout and saved to `logs/`.
Reports attack success rates for Many-shot, GCG, TAP, and AutoDAN-Turbo.
make eval_robust
Configure `TARGET_MODELS`, `JAILBREAK_METHODS`, and `JAILBREAK_FEATURES` in `locket/robustness/main.py`. Results are saved as JSON to `logs/`.
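An attack success rate is the fraction of attack attempts that succeed, so the saved JSON results can be summarized with a few lines of Python. The sketch below rests on an assumption that should be checked against the actual files: the JSON schema is not documented here, so the code assumes each file holds a list of records with a boolean `success` field. Both `attack_success_rate` and `summarize_logs` are hypothetical helpers.

```python
import json
from pathlib import Path


def attack_success_rate(records):
    """Fraction of records whose (assumed) `success` field is truthy."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("success")) / len(records)


def summarize_logs(log_dir="logs"):
    """Map each JSON file stem in `log_dir` to its attack success rate.

    Assumes every file contains a list of dicts; adapt the parsing if the
    real result files are structured differently.
    """
    summary = {}
    for path in sorted(Path(log_dir).glob("*.json")):
        with path.open() as f:
            records = json.load(f)
        summary[path.stem] = attack_success_rate(records)
    return summary


if __name__ == "__main__":
    for name, rate in summarize_logs().items():
        print(f"{name}: {rate:.1%}")
```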
| Parameter | Value | Description |
|---|---|---|
| LoRA rank | 64 | Adapter rank (RSLoRA) |
| PGD steps | 16 | LAT inner loop iterations |
| PGD layers | embedding, 6, 14, 22, 29 | Layers attacked during LAT |
| Training steps | 100 | Total LAT training steps |
| τ (single) | 0.5–0.95 | Per-feature spectral cap (see `locket/utils/model.py`) |
| τ (multi) | 0.6–0.9 | Multi-feature spectral cap (see `locket/utils/model.py`) |
See Appendix E of the paper for full details.
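For readers unfamiliar with the RSLoRA note in the table: rank-stabilized LoRA scales the adapter update by `alpha / sqrt(rank)` instead of the standard `alpha / rank`, which keeps the update from shrinking at higher ranks such as 64. A small worked sketch follows; `lora_scaling` is a hypothetical helper, and `alpha = 16` is an illustrative value since the table does not list the alpha used.

```python
import math


def lora_scaling(alpha: float, rank: int, rank_stabilized: bool = False) -> float:
    """Scaling factor applied to the low-rank update B @ A.

    Standard LoRA uses alpha / rank; RSLoRA uses alpha / sqrt(rank).
    """
    if rank <= 0:
        raise ValueError("rank must be positive")
    return alpha / math.sqrt(rank) if rank_stabilized else alpha / rank


if __name__ == "__main__":
    rank = 64    # LoRA rank from the table above
    alpha = 16   # illustrative value only; not taken from the paper or repo
    print("LoRA scaling:  ", lora_scaling(alpha, rank))
    print("RSLoRA scaling:", lora_scaling(alpha, rank, rank_stabilized=True))
```

With these illustrative numbers the standard factor is 0.25 and the rank-stabilized factor is 2.0, an eightfold difference at rank 64.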