Neil De La Fuente · Deep Learning Research · PyTorch / CUDA
SIREN + LTH: Pruning & Compression
Peak winner PSNR: 47.6 dB
Winner PSNR at 1.2% weights: 29.2 dB
Loser PSNR after 3rd iteration: 11.3 dB
Pruning iterations: 21
TL;DR
In this blogpost, I apply the Lottery Ticket Hypothesis to a Sinusoidal Representation Network fitted to a
single image. I find that the SIREN contains a remarkable winning ticket: a 1.2% sub-network maintaining
29.2 dB PSNR, while the losing ticket collapses catastrophically to noise in just two pruning rounds. I also
discuss the frequency-carrier destruction phenomenon and how SIREN's high-magnitude weights act as the
primary oscillators of its Fourier decomposition.
01 — Introduction
Why I Went Looking for Winning and Losing Tickets
During my undergrad years at UAB, I took the Advanced Machine Learning course taught by Andrey
Barsky, one of the best teachers I've had the luck of learning from. The class was demanding in the
best way: every few weeks he would present a new idea that quietly rearranged how I thought about deep learning.
The Neural Tangent Kernel. Low-Rank Adaptation. Continual learning. Model merging. And the one of interest for
this
blog: the Lottery Ticket Hypothesis.
Andrey introduced the LTH as follows: inside every large, randomly initialized
network, there exists a tiny sub-network that, if reset to its initial values and trained alone, can match the
full network's performance. He showed the LTH paper's results: sparse architectures outperforming their
dense parents with 21% of the original parameters. I just went: WOW. There was something
both painful and beautiful about the idea that most of what we train is scaffolding, not structure.
From that day on I've truly enjoyed reading about theoretical and architectural deep learning, even if it is not
my core focus (I would not be mathematically capable anyways). By day I work on 3D
vision: point cloud encoders, 3D asset generation, the geometric side of things. But at night, I keep reading and
thinking about the ideas Andrey introduced.
In this blogpost, I wanted to share a small project that's a direct descendant of that period. Today I will try
to write about my experience applying the LTH to Sinusoidal Representation Networks (SIRENs).
The Overparameterization Paradox
There is a tension that many of us find strange in deep learning. We build networks far larger than they need to
be, and it helps. Overparameterization makes optimization easier, improves generalization, and opens
basins of
attraction that underfitted architectures could never reach. Yet the moment training ends, we are left with a
bloated network that spends most of its compute multiplying numbers by near-zero weights. So there it is, the
overparameterization paradox: the redundancy that facilitates learning makes deployment expensive.
For most computer vision tasks, this is merely an engineering inconvenience. But for Implicit Neural
Representations (INRs) (networks that replace discrete data arrays with continuous learned functions)
overparameterization is existential. An INR representing a 256×256 image stores roughly 199,000 floats (198,915
in the network used in this post). The uncompressed image itself is 196,608 floats. The network is already larger than what it models. If we
cannot compress INRs, their promise of "infinite-resolution, differentiable storage" becomes unattainable in any
real deployment.
But beyond just solving the compression problem, applying the Lottery Ticket Hypothesis to SIRENs presents a
cool architectural puzzle. In standard ReLU networks (the usual testing ground for the LTH), a weight acts
mostly as a volume knob, shifting a linear boundary. But SIRENs wrap their weights inside sine waves ($\sin(\omega
x + b)$). This means a weight's magnitude directly determines its frequency. Magnitude pruning
here isn't just removing "unimportant" connections, it is a literal form of frequency filtering.
All in all, this blogpost boils down to one question: does the Lottery Ticket Hypothesis hold up for
SIRENs? And if it does, what actually makes a neuron "load-bearing" in a network whose architecture
is built around periodic functions rather than rectified linear units?
The answer turned out to be reassuring and disturbing at the same time. Reassuring because yes, winning tickets
exist and they're super small; disturbing because not only do the losing tickets degrade, but they
suffer an instantaneous frequency collapse, a failure mode so distinctive it tells us something fundamental about
how SIRENs encode information. Yes, I too thought I had made a mistake or had a bug in the implementation, but
there wasn't one.
Superposition of periodic basis functions (SIREN construction)
02 — Background
Two Ideas That Collide
The Spectral Bias Problem and SIREN's Solution
The core problem with using standard networks with ReLU activations as implicit representations is
well-documented: neural
networks exhibit spectral bias (Rahaman et al., 2019), learning low-frequency components of a
target function first, often never fully capturing high-frequency detail regardless of training duration. For
natural images, which derive most of their perceptual richness from edges, textures, and high-frequency patterns,
this is fatal.
Sitzmann et al. (2020) proposed a simple elegant solution: replace ReLU with $\sin(\omega_0 \cdot)$. The
resulting
Sinusoidal Representation Network (SIREN) has no spectral bias because every layer is already a
Fourier basis function. Through composition of sine functions, the network can represent arbitrarily complex
spectra. The key insight (which is stated in the paper but is often underappreciated) is that any derivative
of a SIREN
is itself a SIREN, since $\frac{d}{dx}\sin(x) = \cos(x)$ is a phase-shifted sine. This makes SIRENs
uniquely suited for physics-based applications where the gradient field is as important as the function itself.
SIREN forward pass: each layer applies an affine transform scaled by $\omega_0$ and
wrapped in sine. The first layer uses $\omega_0 = 30$ to set the initial frequency bandwidth; subsequent layers
inherit it.
The initialization scheme is also very important. Preserving activation statistics across layers requires
sampling
weights from $\mathcal{U}\!\left(-\sqrt{6/n}, \sqrt{6/n}\right) / \omega_0$ for hidden layers. This is done to
ensure that
each layer's output distribution remains arc-sine distributed (the stationary distribution of $\sin$). Get this
wrong and the network emits pure noise, which, funnily enough, is a pitfall I encountered during development and
fixed only after carefully re-reading Appendix A of the original paper (and some say appendices aren't important,
gotta laugh at that one).
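As a concrete sketch, here is what a single layer with this initialization might look like in PyTorch. This is my own minimal re-implementation of the scheme in Appendix A, not the authors' reference code:

```python
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """One SIREN layer: an affine map scaled by omega_0, wrapped in sine.

    Initialization follows Appendix A of Sitzmann et al. (2020):
    first layer ~ U(-1/n, 1/n), hidden layers ~ U(-sqrt(6/n)/omega_0, sqrt(6/n)/omega_0),
    which keeps activations arc-sine distributed across depth.
    """

    def __init__(self, in_features: int, out_features: int,
                 omega_0: float = 30.0, is_first: bool = False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        with torch.no_grad():
            if is_first:
                bound = 1.0 / in_features
            else:
                bound = math.sqrt(6.0 / in_features) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sin(self.omega_0 * self.linear(x))
```

Getting `bound` wrong (for example, forgetting the `/ omega_0`) is exactly the pure-noise failure mode I just described.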
The Lottery Ticket Hypothesis
Frankle & Carbin's 2018 preprint (published at ICLR 2019, where it won a best paper award) presented a
simple but profound observation: within any large randomly-initialized network, there exists a small
subnetwork
that, when reset to its original initialization values and trained in isolation, matches the performance of the
full network. They called these subnetworks "winning tickets" because they had "won the initialization
lottery", that is, their initial weights happened to occupy a geometry in parameter space that enables fast,
effective
learning.
The winning tickets we find have won the initialization lottery: their connections have initial weights that
make training particularly effective.
— Frankle & Carbin, ICLR 2019
The mechanism for finding them is called Iterative Magnitude Pruning (IMP) with weight
rewinding. It is a shockingly simple algorithm: train to convergence, remove lowest-magnitude weights,
rewind surviving
weights back to their initial values, repeat. The rewinding step is the key. Skip it and you're just compressing
a trained network; with it, you're identifying which initial weight configurations are structurally privileged.
What makes this particularly interesting for INRs is the regime difference. LTH was established on
classification tasks with discrete labels. INRs are pure regression problems where the "label" at each coordinate
is the full-resolution pixel value. The network has no class boundaries to exploit, nor discrete clusters to align
with. It has to encode a continuous function from scratch. Whether the lottery ticket phenomenon persists in this
regime was, until this little blogpost of mine, an open question (afaik).
Related Work
Frankle et al. (2020) later showed that IMP's requirement for exact initialization rewinding weakens in larger
networks: "late rewinding" to an early training checkpoint often suffices. For small-scale INRs with few
parameters, exact rewinding (to epoch 0) is both necessary and sufficient. The cleaner the setup, the more the
original hypothesis holds.
03 — Methodology
The Experimental Architecture
The SIREN Backbone
I used a 5-layer MLP with 256 hidden units at each layer, an input dimension of 2 (normalized pixel coordinates
in $[-1, 1]^2$) and output dimension of 3 (RGB, in $[-1, 1]$). The first layer uses $\omega_0 = 30$ to set the
base frequency; subsequent hidden layers inherit this factor. The output layer is a plain linear map, no sine, no
clamp. This allows for unconstrained gradient flow during training, with clamping applied only at reconstruction
time.
Total parameters: (2×256+256) + 3×(256×256+256) + (256×3+3) = 198,915.
This corresponds to the input layer ($2 \to 256$), three hidden layers ($256 \to 256$), and the output layer ($256
\to 3$), including biases for each.
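The count can be sanity-checked in two lines:

```python
# Layer widths of the SIREN MLP: 2 -> 256 -> 256 -> 256 -> 256 -> 3.
dims = [2, 256, 256, 256, 256, 3]
n_params = sum(i * o + o for i, o in zip(dims, dims[1:]))  # weights + biases
print(n_params)  # 198915
```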
Small enough to run entirely in the VRAM of my old-ish laptop GPU, large enough that the LTH dynamics are
interesting.
The target: Leonardo da Vinci's Mona Lisa (aka La Gioconda), resized to 256×256 and normalized. I chose
this image deliberately because it has a rich mix of low-frequency gradients (the sky, skin) and high-frequency
detail (the veil, background foliage), making the frequency-collapse mechanism quite visible. I also chose it
because I like the painting.
The Three Pruning Strategies
At each of 21 iterations, after 1,000 training epochs and Adam optimization at $\text{lr}=10^{-4}$ with cosine
annealing, I applied one of three pruning functions to 20% of remaining active weights (those that survived the
previous round):
Strategy | Rule | Hypothesis
Winning Ticket | Prune lowest $|w_i|$ | Small weights are noise; large weights encode signal
Random Ticket | Prune uniform random $20\%$ | Control: does structure matter, or is any $N\%$ sufficient?
Losing Ticket | Prune highest $|w_i|$ | High-magnitude weights are load-bearing in SIREN
After each pruning step, surviving weights are rewound to their exact epoch-0 values following
the LTH protocol. In my implementation, this is done by maintaining a frozen copy of the initial state dict and
overwriting module.weight_orig (PyTorch's internal pruning target) with the masked initial weights:
$\theta_{\text{orig}} \leftarrow m \odot \theta_0$.
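Concretely, the rewind step looks roughly like this. It's simplified from my implementation: it assumes the modules were pruned with torch.nn.utils.prune (so each carries weight_orig and weight_mask) and that initial_state is the frozen pre-training state dict:

```python
import torch
import torch.nn.utils.prune as prune

def rewind_to_init(model, initial_state):
    """Rewind surviving weights to epoch 0: theta_orig <- m * theta_0.

    `initial_state` is a frozen copy of the *unpruned* state dict saved
    before training, so its keys are "<module>.weight", not "weight_orig".
    """
    with torch.no_grad():
        for name, module in model.named_modules():
            if prune.is_pruned(module) and hasattr(module, "weight_orig"):
                theta_0 = initial_state[f"{name}.weight"]
                module.weight_orig.copy_(module.weight_mask * theta_0)
```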
$$\text{Remaining}(k) = (1 - p)^{\,k-1} \qquad p = 0.20, \quad k \in \{1,\ldots,21\}$$
Note that the "20% of remaining" formulation means pruning follows an exponential decay schedule, not a
linear one (iteration 1 is the dense network, matching the results table). By iteration 21 we're at
$0.8^{20} \approx 1.2\%$ of original weights, roughly 2,386 parameters carrying the entire Mona Lisa. The PSNR
results at this extreme tell us what is structurally essential versus decorative scaffolding in the network's
parametrization.
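The schedule arithmetic, for the record:

```python
# Remaining-weight fraction after each pruning round (iteration 1 = dense).
p = 0.20
remaining = [(1 - p) ** (k - 1) for k in range(1, 22)]
print(f"{remaining[-1]:.4f}")  # 0.0115 -> ~1.2% at iteration 21
```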
Implementation Detail
The losing ticket required special handling. PyTorch's built-in pruning methods offer no "prune largest"
variant. The custom implementation sorts all active weights globally on CPU (to avoid GPU kthvalue
instability on degenerate distributions of near-zero values), determines the threshold at position
$n_\text{keep}$ in the sorted array, and applies prune.custom_from_mask per-layer. Already-pruned
zeros
are excluded from the magnitude pool, otherwise they dominate and artificially inflate the threshold.
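A simplified version of that custom step (assuming each layer already carries a `weight_mask`, e.g. from a previous pruning call or `prune.identity`; this is a sketch, not my exact code):

```python
import torch
import torch.nn.utils.prune as prune

def prune_largest_global(modules, frac=0.2):
    """Sketch: globally prune the largest-magnitude *surviving* weights.

    Already-pruned zeros are excluded from the magnitude pool via the
    boolean mask, and the global sort happens on CPU.
    """
    # Gather surviving magnitudes from all layers on CPU.
    alive = torch.cat([
        (m.weight_mask * m.weight_orig).abs()[m.weight_mask.bool()].cpu()
        for m in modules
    ])
    n_prune = int(frac * alive.numel())
    if n_prune == 0:
        return
    # Threshold at position n_keep in the sorted array; everything
    # strictly above it is removed.
    threshold = alive.sort().values[alive.numel() - n_prune - 1]
    for m in modules:
        w = (m.weight_mask * m.weight_orig).abs()
        new_mask = m.weight_mask.clone()
        new_mask[w > threshold] = 0.0
        prune.custom_from_mask(m, "weight", new_mask.to(m.weight_orig.device))
```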
04 — Results
The Frequency Collapse
The results are dramatic enough to be immediately noticeable in the curves below, which you
can toggle between PSNR and high-frequency energy (HFE) retention. Both metrics tell the same story from
different angles. Let me walk through what each ticket is doing.
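One simple way to measure high-frequency energy is the fraction of 2D FFT power outside a low-frequency disk; HFE retention is then this quantity for a pruned reconstruction divided by the same quantity for the dense one. The sketch below illustrates the idea (the exact cutoff radius behind my plots may differ):

```python
import numpy as np

def high_freq_energy(img, radius_frac=0.1):
    """Fraction of spectral power outside a centered low-frequency disk.

    `img` is a 2D array; `radius_frac` sets the disk radius as a
    fraction of the image side (an illustrative choice).
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    low = r <= radius_frac * min(h, w)
    return power[~low].sum() / power.sum()
```

A flat image has essentially zero HFE (all power at DC), while white noise has nearly all of its power at high frequencies.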
Winning Ticket (IMP)
Random Ticket (Control)
Losing Ticket (Anti-IMP)
Figure 1. PSNR (dB) vs. remaining weight fraction.
X-axis is logarithmic
(left = dense, right = sparse). Note the winner's slowed decay versus the loser's step-function
collapse after just one pruning round.
The Winner: Resilient to the End
The winning ticket starts at 47.3 dB and decays slowly: still above 40 dB at 32.8%
remaining weights, above 34 dB at 4.4%, and landing at 29.2 dB with only
1.2% of its original parameters. For context, 30 dB is the informal threshold
for "perceptually acceptable" image quality. The HFE curve mirrors this resilience: the winner
retains 53% of its high-frequency spectral energy at 3.5% remaining weights, and still holds
27.8% at the final 1.2% mark. The network is not forgetting how to represent frequencies; it
is representing fewer of them, gracefully.
What does this mean? That the Mona Lisa, with all its sfumato gradients, its enigmatic
expression, its Renaissance chiaroscuro... can be encoded in roughly 2,386 floating-point
numbers and a connectivity pattern, down from 198,915. This is an approximately
40× compression of the network in terms of storage. In sparse COO format
(storing only non-zero weights with their indices), this translates to roughly 19.1 KB
versus the original ~781 KB dense state dict.
The Random Ticket: Gradual Decay
The random ticket shows smooth logarithmic PSNR decay, staying competitive at moderate
sparsity before crashing to 12.9 dB at 1.2%. Its HFE tells a more specific story: the
spectral energy drains steadily and hits 0.0% at 1.4% remaining weights, two steps before
the final iteration. This means the random ticket does not suffer a structural rupture like
the loser, but it converges on the same endpoint: a network that has lost all directionally
diverse frequency content and can only produce flat, separable outputs.
The Loser: Instantaneous Collapse
The losing ticket's behavior is the sharpest result in this experiment. It begins at
47.6 dB with 96.5% HFE, indistinguishable from the other tickets. After a single
pruning round (80% remaining), PSNR drops to 29.3 dB and HFE collapses to 15.4%.
After the second round (64% remaining), it reaches 11.5 dB, the PSNR of a blank gray
screen, and HFE hits 0.0%. Both metrics flatline there for all remaining 18 iterations,
across 18,000 subsequent training epochs. The removal of the top 36% highest-magnitude
weights destroys the network's spectral structure entirely, and no amount of retraining
recovers it. This is the frequency collapse: not a gradual loss of quality
but an irreversible structural failure readable in both the PSNR curve and the Fourier
spectrum simultaneously.
Remaining % | Winner (dB) | Random (dB) | Loser (dB) | Gap (W−L)
100.0 | 47.29 | 47.54 | 47.62 | —
80.0 | 46.75 | 44.14 | 29.31 | +17.4
64.0 | 46.00 | 41.87 | 11.46 | +34.5
32.8 | 41.90 | 37.36 | 11.32 | +30.6
10.7 | 36.49 | 33.33 | 11.32 | +25.2
2.3 | 32.73 | 23.64 | 11.32 | +21.4
1.2 | 29.18 | 12.90 | 11.90 | +17.3
Key Finding
The winner-loser PSNR gap at 64% remaining weights is 34.5 dB: the difference between a recognizable painting
and white noise. This is a structural consequence of how SIREN encodes information in its weight magnitudes.
05 — Discussion
Why the Loser Fails Catastrophically
The losing ticket's behavior requires explanation. In a standard ReLU network, pruning high-magnitude weights
causes degradation, but gradual degradation, not instant collapse. In ReLU networks, losing tickets are bad; in
SIRENs, they are catastrophic.
SIREN Weights as Frequency Carriers
Consider what a SIREN weight matrix $\mathbf{W}_i$ actually does. The $j$-th output of layer $i$ computes
$\sin(\omega_0 (\mathbf{W}_i \mathbf{x})_j + b_j)$. The magnitude of the weights in row $j$ determines how far
the pre-activation $(\mathbf{W}_i \mathbf{x})_j$ sweeps as $\mathbf{x}$ varies, and therefore how rapidly the
sine oscillates. This relationship: Weight Magnitude $\rightarrow$ Pre-activation Range $\rightarrow$
Oscillation Frequency, is the entire physical mechanism of the SIREN architecture:
Condition | Pre-Activation | Effect on Sine
Small weights | Near-zero domain | Barely moves from zero
Large weights | Wide value spread | Oscillates rapidly
Through composition of $k$ such layers, the Sinusoidal Representation Network constructs a superposition of
high-order trigonometric
polynomials. The specific frequencies that can be represented are determined by which weight combinations produce
constructive interference across layers. The high-magnitude weights are the primary oscillators:
they set the dominant frequencies that give the reconstruction its edges and textures. The small-magnitude weights
are the fine-tuning residuals.
When we prune the 20% largest weights globally, we don't just reduce the network's capacity, we remove the
nodes that generate the highest-frequency components of the internal representation. What remains is a network of
sub-harmonics, a SIREN that can only produce low-frequency structures. And crucially, no amount of retraining can
recover those frequencies, because the rewound initial weights for those positions are also small (having survived
as small weights from the original random initialization).
After losing-ticket pruning, only low-amplitude basis functions remain. The effective
bandwidth of the representation collapses to the lowest frequencies — insufficient to reconstruct any meaningful
texture.
Why ReLU Networks Don't Suffer This
In a ReLU network, large weights don't have the same frequency carrier role. A large weight on a ReLU node simply
shifts the decision boundary. The signal is encoded in the pattern of active vs. inactive units, not in
the
magnitude of any individual weight. Removing the highest-magnitude weights reorganizes the representation but
cannot destroy frequency content, because ReLU networks have none to destroy in the first place; they encode
signals through piecewise linear boundaries.
The SIREN's power comes from its frequency-rich representation; its vulnerability to losing-ticket pruning is the
flip side of that same coin.
Implications for NeRF and 3D Representation
As I said in the introduction, I am a 3D Vision guy, so I couldn't leave without mentioning that the practical
stakes here extend well beyond image compression. Neural Radiance Fields (NeRFs) use
coordinate-MLPs structurally similar to SIRENs (often with positional encodings playing the role of SIREN's
initialization). Recent work on NeRF compression, such as KiloNeRF, Instant-NGP, and TensoRF, has focused on
architecture redesign. The LTH perspective suggests a complementary approach: prune the NeRF itself, with
rewinding. If a 2.3%-sparse SIREN can represent a 2D image at 32.7 dB PSNR, a sparse NeRF might
represent 3D scenes with dramatically fewer parameters than current methods assume necessary.
The losing ticket experiment also has a practical negative implication: magnitude-based pruning on SIRENs must be
done carefully. If you prune high-magnitude weights as some aggressive pruning schedules do when removing
"outliers", you risk frequency collapse rather than smooth compression. The safest pruning strategy for
periodic-activation networks is IMP (prune lowest magnitude).
Open Question
An intriguing follow-up: does the winning ticket found here transfer across images? Morcos et al.
(2019) showed that lottery tickets found on CIFAR-10 sometimes transfer to CIFAR-100. For INRs, transferability
would mean something even stranger: that the same sparse connectivity structure can be reused to represent
entirely different natural signals. This seems unlikely (each image has different frequency content), but
the connectivity structure of the winning ticket might reveal universal properties of how SIRENs decompose 2D
signals.
What the Winning Ticket Is Really Preserving
Looking at the winning ticket reconstruction at 1.2% remaining weights, the result is blurry but the Mona Lisa
is still recognizable: the silhouette, skin tones, and background color regions all survive across roughly
2,386 parameters. Recall from the frequency-carrier analysis above that in a SIREN, weight magnitude directly
governs the frequency of a neuron's sinusoidal oscillation, making high-magnitude weights the primary
oscillators of the learned representation. Iterative Magnitude Pruning preserves exactly these, discarding the
smallest weights. The
blurriness at this extreme sparsity is a capacity effect: with so few oscillators remaining, the
network can only compose the frequency modes that dominate the image's energy (and thus, the training MSE loss).
Those happen to be the
macro-structure, broad color regions and facial geometry, while fine textures like the veil require higher-order
interference across many more parameters than survive. The winning ticket acts, loosely, as the network's
principal component decomposition: IMP isolates the sparse subset of high-magnitude weights whose joint
oscillations account for the majority of the signal's variance.
06 — Compression Analysis
The Space-Time Tradeoff
One of the original motivations for this experiment was the question of whether INRs can be practically
compressed. The answer is emphatically yes, with a caveat: the gains are in storage, not in compute, under
standard dense execution.
Important Clarification
PyTorch's unstructured pruning does NOT automatically produce inference
speedups. The pruning mask is applied as $\mathbf{W}_\text{eff} = \mathbf{W}_\text{orig} \odot M$, which remains
a dense operation. Real latency gains require hardware-aware sparse kernels (e.g., CUDA cuSPARSE, or custom
CUTLASS kernels) or structural pruning (removing entire rows/columns). The compression numbers below refer to
storage, not compute.
Dense state dict: 781 KB
Sparse (COO) at 1.2%: 19 KB
Compression ratio: ~41×
Preserved PSNR: 29.2 dB
In COO (Coordinate) sparse format, each non-zero weight requires storing its value (float32) and its index
(int32): 8 bytes versus 4 bytes for a dense weight. The break-even point is 50% density; below that, sparse is
smaller. At 1.2% density, we store roughly 2,386 parameters as (index, value) pairs. Total storage ≈ 19 KB,
compared to 781 KB dense. A practical 41× reduction.
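The arithmetic behind those numbers (the on-disk state dict carries a little serialization overhead beyond the raw 4 bytes per weight, which accounts for the small gap versus 781 KB):

```python
# Back-of-envelope storage comparison for the final winning ticket.
n_dense = 198_915   # total parameters in the dense SIREN
n_sparse = 2_386    # survivors at ~1.2% density
dense_bytes = n_dense * 4            # float32 values only
sparse_bytes = n_sparse * (4 + 4)    # float32 value + int32 flat index
print(dense_bytes // 1024, sparse_bytes // 1024)  # KB, dense vs sparse
```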
07 — Reflections
What This Tells Us About Neural Computation
The most interesting outcome of this experiment is not the compression ratio but how the three
ticket types help us understand how SIRENs are internally organized.
The existence of a well-performing winning ticket at 1.2% density implies that 98.8% of this network's parameters
are, in some
sense, redundant given the right initialization. But this redundancy is not waste, it is just the price
you need to pay for simplifying
training dynamics and the optimization landscape. The dense network's overparameterization creates a smooth loss
surface that gradient
descent can traverse; the sparse winning ticket, found only after training the dense network, cannot be discovered
by directly training the sparse architecture from scratch. The dense network is a scaffolding for
identifying the sparse solution.
So at the fine-grained level, learning is not just fitting parameters to data, but identifying, from a
combinatorially large space of possible sparse structures, the specific connectivity that allows the task to be
solved. The Mona Lisa is not in any 1.2% of this network. It's in a very specific 1.2%: the one that, when
rewound to initialization, happens to sit in a favorable region of optimization geometry.
Dense networks are the lottery in which winning tickets are sold. You cannot buy only the winning ticket; you
must buy all of them and then discover which one won.
— paraphrasing Frankle's own reflections on the hypothesis
The losing ticket shows the other side of the coin. High-magnitude weights are not "more important" in a general
sense; they're specifically important to SIREN's frequency decomposition. This is architecture-specific, though.
In a transformer, high-magnitude weights in attention heads might encode something entirely different. The
"carrier wave" interpretation of SIREN's large weights is a hypothesis worth testing more rigorously, perhaps
with direct Fourier analysis of the activation statistics at each layer, a level of depth that, my dear friends,
I have not yet reached.
Finally, the random ticket's slow decay suggests that SIRENs have a surprising degree of graceful degradation:
you can remove up to 80-90% of weights randomly and still maintain reasonable quality.
SIREN representations are quite robust by default. They become brittle only when you specifically
target their frequency-carrying architecture.
Thank you for reading
Your feedback is my winning ticket. Have you found your own sub-networks yet?
[1]Sitzmann, V., Martel, J.N.P., Bergman, A.W., Lindell, D.B., & Wetzstein, G. (2020). Implicit Neural
Representations with Periodic Activation Functions. NeurIPS 2020.
[2]Frankle, J. & Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.
ICLR 2019.
[3]Frankle, J., Dziugaite, G.K., Roy, D.M., & Carbin, M. (2020). Linear Mode Connectivity and the
Lottery Ticket Hypothesis. ICML 2020.
[4]Rahaman, N. et al. (2019). On the Spectral Bias of Neural Networks. ICML 2019.
[5]Morcos, A.S., Yu, H., Paganini, M. & Tian, Y. (2019). One Ticket to Win Them All: Generalizing
Lottery Ticket Initializations across Datasets and Optimizers. NeurIPS 2019.
[6]Malach, E., Yehudai, G., Shalev-Shwartz, S. & Shamir, O. (2020). Proving the Lottery
Ticket Hypothesis: Pruning is All You Need. ICML 2020.