REBOUND
1 Introduction
Modern machine learning systems are increasingly trained on large, partially proprietary corpora. Data owners therefore seek mechanisms to verify whether their data contributed to a model’s training procedure.1 Watermarking addresses this by embedding label-preserving perturbations into training examples, creating a statistically detectable signature in the resulting model.2 This phenomenon is often called radioactivity, because models trained on the marked data carry a detectable statistical “trace” of it.
Parallel research frames model alignment through the lens of data compression. Ji et al. propose modeling datasets as forces acting on a spring within an abstract data space.3 In this geometric view, model coordinates are defined by normalized compression rates. Pre-training creates a deep potential energy basin, while fine-tuning induces shallow displacements governed by a Hooke’s-law-like relationship.4
These two perspectives naturally intersect. Radioactive watermarking is a special case of fine-tuning on a small, carefully engineered dataset: it pulls the model toward a direction that improves compression on a marked set of examples. Elasticity suggests that such displacements may be fragile under subsequent training. This motivates our central question:
If radioactive watermarks behave like forces on a spring in data space, to what extent can an adversary apply a counter-force via fine-tuning to erase the radioactive signal without sacrificing downstream utility?
We analyze this question in a unified compression-based framework. Starting from a pre-trained model M_0, we obtain a watermarked model M_w by fine-tuning on a radioactive dataset D_w (or teacher outputs encoding radioactivity), which stretches the model along a dedicated axis associated with improved compression of radioactive examples. We then apply additional fine-tuning on a second dataset D_t designed as an un-finetuning force, with the goal of relaxing or reversing the displacement along the D_w axis while preserving performance on natural evaluation distributions.
2 Motivation
2.1 Dataset ownership and radioactive watermarking
Our starting point is the view that dataset watermarking is a tool for ownership verification: a data owner perturbs a subset of training examples so that any model trained on that data exhibits a detectable statistical signature under a secret test. We build on radioactive data as a canonical instantiation of this idea.5
Given a base dataset D_w, radioactive construction produces a perturbed \tilde{D}_w by adding small, label-preserving signals in feature space, constrained to be imperceptible. Models trained on \tilde{D}_w exhibit a consistent loss gap between radioactive and matched clean samples that can be turned into a powerful hypothesis test, even when only a small fraction of training points are radioactive.6 Subsequent analysis shows that these signatures can survive realistic attacks such as model extraction, though detection degrades in low-data regimes and can behave counterintuitively across black-box vs. white-box settings.7
In this project, the central object is a language model M_w trained on a mixture of pre-training data and a radioactive dataset D_w, such that a trusted verifier can detect training on D_w, while an adversary may seek to erase this signature.
Throughout, we treat a radioactive dataset D_w as a base dataset augmented with small, optimized perturbations such that: (1) perturbations are norm-bounded and label-preserving; (2) any model trained on D_w enjoys a statistically significant loss reduction on radioactive vs. matched clean samples; and (3) this loss reduction is robust across a specified family of architectures and training setups.8
2.2 Elasticity of alignment and compression-based metrics
Alignment methods, such as supervised instruction tuning and reinforcement learning from human feedback, are typically applied to a large pre-trained model M_0 using comparatively small curated datasets D_a.9 Empirically, most of the model’s factual knowledge and linguistic competence arises in pre-training, while alignment nudges the model toward preferred behaviors on user-facing tasks.10 This suggests that alignment operates in a low-measure region of parameter space relative to pre-training.
Throughout this work, we use the normalized negative log-likelihood per token as a compression metric. For a model M and dataset D, we define \gamma_D(M) = \frac{1}{|D|} \sum_{(x,y)\in D} \frac{1}{T} \sum_{t=1}^T \bigl(-\log p_M(y_t \mid x, y_{<t})\bigr), where y = (y_1,\dots,y_T) is the target sequence. Lower \gamma_D(M) indicates that M assigns higher probability to D and better matches its empirical distribution.11 By tracking \gamma_{D_w}(M) and \gamma_{D_t}(M), together with standard radioactive detection statistics, we treat watermarking and un-finetuning as coupled forces in data space and quantify how far the model can be shifted before task utility degrades.
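For concreteness, the sketch below shows how \gamma_D(M) can be estimated from token-level log-probabilities; the logprob_fn interface is a placeholder introduced for illustration, not a specific API.

```python
from typing import Callable, Sequence, Tuple

def gamma(
    dataset: Sequence[Tuple[str, str]],
    logprob_fn: Callable[[str, str], Sequence[float]],
) -> float:
    """Normalized negative log-likelihood per token, averaged over examples.

    logprob_fn(prompt, target) is assumed to return the per-token
    log-probabilities log p_M(y_t | x, y_<t) for the target sequence.
    """
    per_example = []
    for prompt, target in dataset:
        token_logprobs = logprob_fn(prompt, target)
        nll = -sum(token_logprobs) / len(token_logprobs)  # (1/T) sum_t -log p
        per_example.append(nll)
    return sum(per_example) / len(per_example)            # average over D

# Compression advantage relative to the base model M_0:
#   delta_gamma_D = gamma(D, logprobs_of_M0) - gamma(D, logprobs_of_M)
```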
To build intuition, we provide an interactive widget in Figure 1 that illustrates how compression rates respond to model bias.
Ji et al. formalize this intuition under the name elasticity, using normalized compression rates to measure how models respond to fine-tuning.12 For a model M and dataset D, they define \gamma_D(M) as above and consider compression advantages of the form \Delta \gamma_D(M) = \gamma_D(M_0) - \gamma_D(M). Ji et al. show that small alignment datasets can substantially improve compression on alignment datasets but that these gains are fragile: subsequent fine-tuning on other data quickly erases them, especially in larger, heavily pre-trained models.13
2.3 Elasticity as a generic un-watermarking mechanism
We treat radioactive watermarking as an extreme instance of this asymmetry. Both watermarking and alignment are induced by comparatively small, specialized datasets:
- In the radioactive case, D_w consists of marked examples designed so that training pulls the model toward a narrow carrier subspace, producing a detectable loss gap.14
- In the alignment case, D_a shifts behavior on user-facing tasks away from the raw pre-training prior.15
From the compression perspective, a successful watermark enforces \gamma_{D_w}(M_w) \ll \gamma_{D_w}(M_0), exactly analogous to the alignment objective on D_a. This motivates our central hypothesis that radioactive signatures should be elastic in the same sense as alignment.
Let M_0 be a pre-trained language model and M_w a watermarked model obtained by fine-tuning M_0 on a radioactive dataset D_w. Assume that a verifier can detect training on D_w via an empirical compression advantage on D_w. Suppose we have only black-box or adapter-level access to M_w.
Then there exists a fine-tuning dataset D_t and training procedure \mathcal{T}, both agnostic to D_w and the watermark key, such that M_t := \mathcal{T}(M_w; D_t) satisfies \gamma_{D_w}(M_t) \approx \gamma_{D_w}(M_0).
2.4 Our black-box adversary with fine-tuning capabilities
We consider an adversary who does not control the original training pipeline and only observes the released watermarked model M_w, as in black-box radioactive verification where evidence of watermarked training is obtained solely via queries.16
The adversary has black-box query access to M_w sufficient to obtain token-level probabilities or samples for arbitrary prompts, can perform parameter-efficient fine-tuning (e.g., LoRA or other adapter methods) on their own data,17 and can evaluate candidate models on arbitrary datasets, including public benchmarks and proprietary task distributions. They do not have access to M_0, the watermarking key, the radioactive dataset D_w, or any clean-radioactive pairs.
Let \mathsf{Det} denote the verifier’s detection statistic. For radioactive data, \mathsf{Det} is based on a loss comparison between radioactive and clean samples plus a hypothesis test at a chosen significance level.18 The adversary seeks a model M_t such that \mathsf{Det}(M_t) \approx \mathsf{Det}(M_{\text{clean}}), where M_{\text{clean}} denotes typical unwatermarked models trained without D_w.
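As one plausible instantiation of such a verifier test (a sketch under assumptions we introduce here, not the exact statistic used in the radioactive-data literature), the loss comparison can be cast as a one-sided Welch t-test on per-example losses:

```python
import numpy as np
from scipy import stats

def detect_radioactivity(loss_radioactive: np.ndarray,
                         loss_clean: np.ndarray,
                         alpha: float = 0.01) -> bool:
    """One-sided Welch t-test: is the mean loss on radioactive samples
    significantly lower than on matched clean samples?"""
    t_stat, p_two_sided = stats.ttest_ind(loss_radioactive, loss_clean,
                                          equal_var=False)
    # One-sided p-value for the alternative "radioactive loss < clean loss".
    p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
    return p_one_sided < alpha   # True => flag the model as trained on D_w
```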
Because the adversary does not know \mathsf{Det} exactly, they rely on proxies such as changes in compression rate \gamma_D(M) on accessible datasets D and behavioral similarity between M_t and unwatermarked baselines. Elasticity links these proxies to training geometry: changes in \gamma_D(M) across different D are driven by dataset size, gradient alignment, and pre-training scale.19
Leverage. Even without access to M_0 or D_w, an adversary is still able to probe the loss landscape of M_w on large unlabeled corpora, identify high-loss regions, and use these as candidate directions for un-finetuning. Elasticity theory predicts when such directions will counteract watermark-induced displacements.
3 A compression view of elasticity and radioactive watermarks
This section develops the compression-based framework used to analyze elasticity and adapts it to radioactive watermarking for language models.20 21 22 Let M_0 be a base model, M_w a radioactive model obtained by fine-tuning on D_w, and M_{w \to t} the result of additional fine-tuning on D_t.
3.1 Datasets as forces in data space
We treat each dataset D as defining a coordinate axis in a data space. For a fixed base model M_0 and any model M, the compression advantage on D is \Delta \gamma_D(M) = \gamma_D(M_0) - \gamma_D(M). A positive \Delta \gamma_D(M) indicates that M compresses D better than M_0.
For a collection of datasets \{D_i\}_{i=1}^n, the vector (\Delta \gamma_{D_1}(M), \ldots, \Delta \gamma_{D_n}(M)) embeds each model as a point in \mathbb{R}^n. Ji et al. show that, for modest fine-tuning and small mixtures of datasets, these coordinates respond to new training data approximately linearly: changes in \Delta \gamma_{D_i} are controlled by a matrix of elasticity coefficients E_{ij} that depend on gradient inner products at M_0.23
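A minimal sketch of these coordinates and the linearized response follows; gamma_fn and the elasticity matrix E are illustrative placeholders rather than part of any published implementation.

```python
import numpy as np

def data_space_coords(model, base_model, datasets, gamma_fn) -> np.ndarray:
    """Embed a model as (Delta gamma_{D_1}, ..., Delta gamma_{D_n})."""
    return np.array([gamma_fn(base_model, D) - gamma_fn(model, D)
                     for D in datasets])

def predicted_displacement(E: np.ndarray, force: np.ndarray) -> np.ndarray:
    """Linearized elastic response: a fine-tuning "force" applied along one
    dataset axis moves the coordinates by approximately E @ force, where
    E_ij depends on gradient inner products at M_0."""
    return E @ force
```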
Intuitively, training on dataset D_j exerts a force that increases \Delta \gamma_{D_j}, while pre-training on D_p acts as a strong anchor keeping the system near M_0. The elastic response of the system determines how changes along one axis (e.g., D_t) propagate to others (e.g., D_w).
Interactive widget: data-space view (Δγ coordinates). The model (purple mass) is pulled in data space by three springs:
- Pre-training: always pulls toward M_0 (the base model).
- Watermarking: pulls toward the radioactive dataset target D_w.
- Un-finetuning: pulls toward the adversarial dataset D_t, chosen to counteract the watermark.
Drag D_w to choose where watermarking pulls, then click "Apply watermark". Turn on un-finetuning, drag D_t to pull back against that direction, and adjust its strength. When the green pull cancels the red, the red net-force arrow shrinks and the gray pre-training spring dominates again. You can also move the model around to visualize how it moves through the space as the datasets change.
3.2 Loss-based selection as a proxy for anti-watermark directions
In the black-box setting, the adversary cannot compute gradients or elasticity coefficients for the watermark dataset D_w directly. However, they can query the watermarked model M_w on a large candidate pool U and measure token-level losses.
Intuitively, examples with high loss under M_w are those on which the watermark-induced bias is most misaligned with the pre-training distribution. Selecting such examples for fine-tuning should therefore push the model back toward the pre-training optimum and reduce the compression advantage on D_w.
We now formalize this intuition in an idealized model.
We provide a brief sketch of this below; the full proof appears in the Appendix.
This theorem formalizes the idea that loss-based selection is an anti-watermark direction. Even though the adversary never sees the watermark data or direction explicitly, selecting the top-q high-loss examples by the scalar score S(z) = v^\top G(z) ensures that the average gradient they train on points precisely along v, with a computable positive coefficient \sigma \lambda(q).
Because the watermark gradient \nabla_\theta \gamma_{D_w}(\theta_w) points along -v, gradient descent on D_t moves the model in the opposite direction and reduces the watermark’s compression advantage at a rate exactly proportional to \lambda(q).
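For intuition, a standard Gaussian idealization makes the coefficient explicit (a sketch under simplifying assumptions introduced here, not necessarily the exact setting of the theorem). Suppose the score S(z) = v^\top G(z) is Gaussian with mean \mu and standard deviation \sigma, with v a unit vector, and suppose the component of G(z) orthogonal to v is mean-independent of S(z). Conditioning on the top-q tail of S then gives \mathbb{E}\bigl[G(z) \mid S(z) \ge \mu + \sigma\,\Phi^{-1}(1-q)\bigr] = \mathbb{E}[G(z)] + \sigma\,\lambda(q)\,v, with \lambda(q) = \varphi\bigl(\Phi^{-1}(1-q)\bigr)/q, where \varphi and \Phi denote the standard normal density and CDF. Here \lambda(q) is the mean of the upper q-tail of a standard normal, so it is positive and grows as q shrinks, consistent with the qualitative behavior described above.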
4 Methods & Experimental Design
We set M_0 to be Llama-3.1-8B.24 All fine-tuning is performed with low-rank adaptation (LoRA) and is orchestrated through the Tinker framework.
4.1 Datasets and watermarking setup
4.1.1 Radioactive dataset D_w
To simulate a realistic watermarking scenario, we use a publicly available radioactive dataset derived from the Maryland n-gram corpus, treating its entries as supervised pairs.25
From this dataset, we carve out two disjoint subsets:
- a watermark training set D_w^\text{train}, comprising examples from the corpus;
- a watermark evaluation set D_w^\text{eval}, comprising examples from the corpus.
4.1.2 Generic instruction dataset D_t
To approximate everyday instruction-following data, we use the Alpaca-cleaned dataset.26 Each example is converted into a user/assistant pair by concatenating the instruction and (if present) the input field into a single user prompt, with the output as the assistant response.
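A minimal sketch of this conversion, assuming the public Alpaca-cleaned field names (instruction, input, output); the exact prompt template here is an illustrative choice.

```python
def to_chat_pair(example: dict) -> dict:
    """Convert one Alpaca-cleaned record into a user/assistant pair."""
    prompt = example["instruction"]
    if example.get("input"):                 # optional input field
        prompt = f"{prompt}\n\n{example['input']}"
    return {"user": prompt, "assistant": example["output"]}
```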
Two roles are assigned to this distribution:
- A held-out evaluation set D_t^\text{eval}, comprising examples from the corpus. This serves as a proxy for generic task performance and allows us to quantify any collateral damage from un-watermarking.
- A larger candidate pool U, comprising examples from the corpus, which acts as the search space from which we construct adversarial training sets D_t using loss-based selection.
4.2 Compression metrics and validation signals
We report three derived quantities: the radioactive compression rate \gamma_{D_w}(M), the generic compression rate \gamma_{D_t}(M), and the compression advantage relative to the base model, \Delta \gamma_D(M) = \gamma_D(M_0) - \gamma_D(M), for both D_w and D_t.
In addition, we track standard training and validation NLL to monitor overfitting and optimization stability. These validation metrics are evaluated periodically by snapshotting the current LoRA weights, creating a temporary sampling client, and running the same weighted NLL computation as above. This uniform treatment means that any movement of the spring in data space is reflected consistently across training and evaluation.
4.3 Adversarial D_t construction via loss-based selection
For each candidate conversation (x, y) in U, we compute the per-token SFT loss under M_w:
\ell_{M_w}(x, y) = \frac{1}{T} \sum_{t=1}^T \bigl(-\log p_{M_w}(y_t \mid x, y_{<t})\bigr).
This loss is fully observable to the attacker via log-probability queries and aligns with the training objective. We then rank candidates by \ell_{M_w} and define two selection strategies (a short sketch follows the list):
- High-loss selection. D_t consists of the top k examples with the largest loss. These examples are where M_w disagrees most strongly with the Alpaca distribution. If Alpaca is closer to the pre-training data than the radioactive distribution (a plausible assumption), pushing the model to fit these high-loss regions should counteract the distortion introduced by D_w.
- Random selection. D_t is a uniform random subset of U. This baseline approximates the generic continued pre-training regime commonly studied in elasticity work.
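The selection step itself is straightforward; the sketch below illustrates both strategies, with loss_fn standing in for black-box per-token loss queries against M_w.

```python
import random

def select_dt(candidate_pool, loss_fn, k, strategy="high_loss", seed=0):
    """Build the adversarial fine-tuning set D_t from the candidate pool U.

    loss_fn(example) is assumed to return the per-token SFT loss of the
    example under the watermarked model M_w.
    """
    if strategy == "high_loss":
        scored = sorted(candidate_pool, key=loss_fn, reverse=True)
        return scored[:k]                            # top-k highest-loss examples
    elif strategy == "random":
        rng = random.Random(seed)
        return rng.sample(list(candidate_pool), k)   # uniform baseline
    raise ValueError(f"unknown strategy: {strategy}")
```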
By comparing how each strategy changes \Delta \gamma_{D_w} and \Delta \gamma_{D_t}, we can test whether adversarial high-loss examples induce a qualitatively different elastic response from the model.
4.4 Training Curriculum and Elasticity Probes
The training curriculum is organized into three conceptual stages that mirror the elasticity thought experiment.
4.4.1 Stage 0: Base model probe
Before any fine-tuning, we measure \gamma_{D_w}(M_0) and \gamma_{D_t}(M_0). These values define the “rest position” of the spring and are used to compute compression advantages for all subsequent models. No parameters are updated in this stage; it purely establishes a reference.
4.4.2 Stage 1: Radioactive fine-tuning on D_w
We then train a LoRA adapter on D_w^\text{train}, obtaining the watermarked model M_w. The training objective is standard supervised fine-tuning on assistant tokens, with a warmup-stable-decay learning rate schedule. After Stage 1, we re-measure \gamma_{D_w} and \gamma_{D_t} to quantify the watermark-induced displacement. The model at this point is treated as the only object the attacker can access in the subsequent black-box setting.
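For reference, the sketch below illustrates a warmup-stable-decay (WSD) schedule; the linear warmup and linear decay shapes are illustrative assumptions, with the peak and fractions taken from the hyperparameter table in the setup callout.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 2e-5,
           warmup_frac: float = 0.05, decay_frac: float = 0.10) -> float:
    """Warmup-stable-decay learning rate at a given optimizer step."""
    warmup_steps = int(warmup_frac * total_steps)
    decay_start = int((1.0 - decay_frac) * total_steps)
    if step < warmup_steps:                       # linear warmup to peak
        return peak_lr * (step + 1) / max(warmup_steps, 1)
    if step < decay_start:                        # stable plateau at peak
        return peak_lr
    remaining = total_steps - decay_start         # linear decay to zero
    return peak_lr * max(0.0, (total_steps - step) / max(remaining, 1))
```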
4.4.3 Stage 2: Loss-based D_t construction
With M_w fixed, we score the candidate pool U using the loss function above and select a training set D_t according to one of the strategies (high-loss or random). Importantly, this selection is done once per experiment and cached; the resulting D_t is serialized and reused for all later runs that share the same strategy and size. This makes comparisons across different learning rate schedules or numbers of D_t epochs more meaningful, since they share the same adversarial curriculum.
4.4.4 Stage 3: Adversarial D_t fine-tuning and elasticity measurement
Starting from M_w, we fine-tune on D_t using the same LoRA configuration and WSD schedule as in Stage 1. At regular intervals during this stage, we snapshot the model and perform a standardized elasticity probe:
- Compute \gamma_{D_w} and \gamma_{D_t} under the current model.
- Convert these into compression advantages relative to M_0.
- Evaluate validation NLL on D_w^\text{eval} and D_t^\text{eval}.
The trajectories of \Delta \gamma_{D_w} and \Delta \gamma_{D_t} over D_t steps are then interpreted as the motion of the model under the adversarial force induced by D_t. In the ideal un-watermarking scenario, high-loss D_t would cause \Delta \gamma_{D_w} to decrease while maintaining positive \Delta \gamma_{D_t}, indicating that the model is regaining “generic” behavior without catastrophic degradation.
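The probe itself reduces to a few compression evaluations; a minimal sketch is given below, where gamma_fn is the same illustrative interface as before.

```python
def elasticity_probe(model, base_gamma_dw, base_gamma_dt,
                     gamma_fn, dw_eval, dt_eval) -> dict:
    """Standardized probe run at each snapshot during Stage 3.

    gamma_fn(model, dataset) is assumed to return the normalized per-token
    NLL gamma_D(M); base_gamma_* are the Stage-0 values measured on M_0.
    """
    gamma_dw = gamma_fn(model, dw_eval)
    gamma_dt = gamma_fn(model, dt_eval)
    return {
        "gamma_dw": gamma_dw,                        # validation NLL on D_w^eval
        "gamma_dt": gamma_dt,                        # validation NLL on D_t^eval
        "delta_gamma_dw": base_gamma_dw - gamma_dw,  # advantage relative to M_0
        "delta_gamma_dt": base_gamma_dt - gamma_dt,
    }
```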
You can click on the callout below to view more specifics of our setup.
Model and infrastructure
| Component | Choice |
|---|---|
| Base model M_0 | LLaMA-family 8B model27 |
| Adaptation | LoRA (rank 32) |
| Framework | Tinker (training + logprobs) |
| Max sequence length | 512 tokens |
Datasets
| Role | Source |
|---|---|
| D_w^\text{train} | Radioactive Maryland corpus28 |
| D_w^\text{eval} | Same as above |
| D_t^\text{eval} | Alpaca-cleaned29 |
| Candidate pool U | Alpaca-cleaned |
| Adversarial D_t | Subset of U |
Hyperparameters
| Quantity | Value / Description |
|---|---|
| Batch size N | 16 conversations per step |
| Learning rate peak | 2\times 10^{-5} (WSD plateau) |
| WSD warmup fraction | 5% of steps |
| WSD decay fraction | 10% of steps |
| Dw epochs | 1 |
| Dt epochs | 1 |
| Validation frequency | every 10 steps |
| Gamma evaluation samples | up to 64 examples per dataset per probe |
5 Results
5.1 Elastic response under random vs loss-based D_t selection
To evaluate elastic un-watermarking, we fix the radioactive fine-tuning stage on D_w and vary only how we construct the adversarial dataset D_t. For each strategy, we start from the same watermarked model M_w and track compression advantages \Delta \gamma_{D_w}(M) and \Delta \gamma_{D_t}(M) as we fine-tune on D_t.
Figure 3 plots these trajectories for two settings on 25\% of our dataset, with lighter lines showing the raw (unsmoothed) values and bold lines showing time-smoothed averages:
- Random D_t: D_t is a uniform subset of generic Alpaca-style instructions.
- High-loss D_t: D_t is constructed from the same pool, but restricted to examples with the highest SFT loss under M_w.
The x-axis shows the number of D_t SFT steps; the y-axis shows the compression advantage \Delta \gamma_D(M) = \gamma_D(M_0) - \gamma_D(M) in nats per token, for both D_w and D_t.