Neil De La Fuente · Deep Learning Research · PyTorch / CUDA
SIREN + LTH: Pruning & Compression
Peak winner PSNR: 47.6 dB
Winner PSNR at 1.2% weights: 29.2 dB
Loser PSNR after 3rd iteration: 11.3 dB
Pruning iterations: 21
TL;DR
In this blogpost, I apply the Lottery Ticket Hypothesis to a Sinusoidal Representation Network fitted to a
single image. I find that the SIREN contains a remarkable winning ticket: a 1.2% sub-network maintaining
29.2 dB PSNR, while the losing ticket collapses catastrophically to noise in just two pruning rounds. I also
discuss the frequency-carrier destruction phenomenon and how SIREN's high-magnitude weights act as the
primary oscillators of its Fourier decomposition.
01 — Introduction
Why I Went Looking for Winning and Losing Tickets
During my undergrad years at UAB, I took the Advanced Machine Learning course taught by Andrey
Barsky, one of the best teachers I've had the luck of learning from. The class was demanding in the
best way: every few weeks he would present a new idea that quietly rearranged how I thought about deep learning.
The Neural Tangent Kernel. Low-Rank Adaptation. Continual learning. Model merging. And the one of interest for
this
blog: the Lottery Ticket Hypothesis.
Andrey introduced the LTH as follows: inside every large, randomly initialized
network, there exists a tiny sub-network that, if reset to its initial values and trained alone, can match the
full network's performance. He showed the LTH paper's results: sparse architectures outperforming their
dense parents with 21% of the original parameters. I just went: WOW. There was something
both painful and beautiful about the idea that most of what we train is scaffolding, not structure.
From that day on I've truly enjoyed reading about theoretical and architectural deep learning, even if it is not
my core focus (I would not be mathematically capable anyways). By day I work on 3D
vision: point cloud encoders, 3D asset generation, the geometric side of things. But at night, I keep reading and
thinking about the ideas Andrey introduced.
In this blogpost, I wanted to share a small project that's a direct descendant of that period. Today I will try
to write about my experience applying the LTH to Sinusoidal Representation Networks (SIRENs).
The Overparameterization Paradox
There is a tension that many of us find strange in deep learning. We build networks far larger than they need to
be, and it helps. Overparameterization makes optimization easier, improves generalization, and opens
basins of
attraction that underfitted architectures could never reach. Yet the moment training ends, we are left with a
bloated network that spends most of its compute multiplying numbers by near-zero weights. So there it is, the
overparameterization paradox: the redundancy that facilitates learning makes deployment expensive.
For most computer vision tasks, this is merely an engineering inconvenience. But for Implicit Neural
Representations (INRs) (networks that replace discrete data arrays with continuous learned functions)
overparameterization is existential. An INR representing a 256×256 image stores roughly 199,000 floats (198,915
in the network used in this post). The uncompressed image itself is 196,608 floats. The network is already larger than what it models. If we
cannot compress INRs, their promise of "infinite-resolution, differentiable storage" becomes unattainable in any
real deployment.
But beyond just solving the compression problem, applying the Lottery Ticket Hypothesis to SIRENs presents a
cool architectural puzzle. In standard ReLU networks (the usual testing ground for the LTH), a weight acts
mostly as a volume knob, shifting a linear boundary. But SIRENs wrap their weights inside sine waves ($\sin(\omega
x + b)$). This means a weight's magnitude directly determines its frequency. Magnitude pruning
here isn't just removing "unimportant" connections, it is a literal form of frequency filtering.
All in all, this blogpost boils down to one question: does the Lottery Ticket Hypothesis hold up for
SIRENs? And if it does, what actually makes a neuron "load-bearing" in a network whose architecture
is built around periodic functions rather than rectified linear units?
The answer turned out to be reassuring and disturbing at the same time. Reassuring because yes, winning tickets
exist and they're super small; disturbing because not only do the losing tickets degrade, but they
suffer an instantaneous frequency collapse, a failure mode so distinctive it tells us something fundamental about
how SIRENs encode information. Yes, I too thought I had made a mistake or had a bug in the implementation, but
there wasn't one.
Superposition of periodic basis functions (SIREN construction)
02 — Background
Two Ideas That Collide
The Spectral Bias Problem and SIREN's Solution
The core problem with using standard networks with ReLU activations as implicit representations is
well-documented: neural
networks exhibit spectral bias (Rahaman et al., 2019), learning low-frequency components of a
target function first, often never fully capturing high-frequency detail regardless of training duration. For
natural images, which derive most of their perceptual richness from edges, textures, and high-frequency patterns,
this is fatal.
Sitzmann et al. (2020) proposed a simple elegant solution: replace ReLU with $\sin(\omega_0 \cdot)$. The
resulting
Sinusoidal Representation Network (SIREN) has no spectral bias because every layer is already a
Fourier basis function. Through composition of sine functions, the network can represent arbitrarily complex
spectra. The key insight (which is stated in the paper but is often underappreciated) is that any derivative
of a SIREN
is itself a SIREN, since $\frac{d}{dx}\sin(x) = \cos(x)$ is a phase-shifted sine. This makes SIRENs
uniquely suited for physics-based applications where the gradient field is as important as the function itself.
SIREN forward pass: each layer applies an affine transform scaled by $\omega_0$ and
wrapped in sine. The first layer uses $\omega_0 = 30$ to set the initial frequency bandwidth; subsequent layers
inherit it.
The initialization scheme is also very important. Preserving activation statistics across layers requires
sampling
weights from $\mathcal{U}\!\left(-\sqrt{6/n}, \sqrt{6/n}\right) / \omega_0$ for hidden layers. This is done to
ensure that
each layer's output distribution remains arc-sine distributed (the stationary distribution of $\sin$). Get this
wrong and the network emits pure noise, which, funnily enough, is a pitfall I encountered during development and
fixed only after carefully re-reading Appendix A of the original paper (and some say appendices aren't important,
gotta laugh at that one).
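As a concrete sketch, here is what a single layer with this initialization might look like in PyTorch. This is my own minimal re-implementation of the scheme in Appendix A, not the authors' reference code:

```python
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """One SIREN layer: an affine map scaled by omega_0, wrapped in sine.

    Initialization follows Appendix A of Sitzmann et al. (2020):
    first layer ~ U(-1/n, 1/n), hidden layers ~ U(-sqrt(6/n)/omega_0, sqrt(6/n)/omega_0),
    which keeps activations arc-sine distributed across depth.
    """

    def __init__(self, in_features: int, out_features: int,
                 omega_0: float = 30.0, is_first: bool = False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        with torch.no_grad():
            if is_first:
                bound = 1.0 / in_features
            else:
                bound = math.sqrt(6.0 / in_features) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sin(self.omega_0 * self.linear(x))
```

Getting `bound` wrong (for example, forgetting the `/ omega_0`) is exactly the pure-noise failure mode I just described.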
The Lottery Ticket Hypothesis
Frankle & Carbin's 2018 preprint (published at ICLR 2019, where it won a best paper award) presented a
simple but profound observation: within any large randomly-initialized network, there exists a small
subnetwork
that, when reset to its original initialization values and trained in isolation, matches the performance of the
full network. They called these subnetworks "winning tickets" because they had "won the initialization
lottery", that is, their initial weights happened to occupy a geometry in parameter space that enables fast,
effective
learning.
The winning tickets we find have won the initialization lottery: their connections have initial weights that
make training particularly effective.
— Frankle & Carbin, ICLR 2019
The mechanism for finding them is called Iterative Magnitude Pruning (IMP) with weight
rewinding. It is a shockingly simple algorithm: train to convergence, remove lowest-magnitude weights,
rewind surviving
weights back to their initial values, repeat. The rewinding step is the key. Skip it and you're just compressing
a trained network; with it, you're identifying which initial weight configurations are structurally privileged.
What makes this particularly interesting for INRs is the regime difference. LTH was established on
classification tasks with discrete labels. INRs are pure regression problems where the "label" at each coordinate
is the full-resolution pixel value. The network has no class boundaries to exploit, nor discrete clusters to align
with. It has to encode a continuous function from scratch. Whether the lottery ticket phenomenon persists in this
regime was, until this little blogpost of mine, an open question (afaik).
Related Work
Frankle et al. (2020) later showed that IMP's requirement for exact initialization rewinding weakens in larger
networks: "late rewinding" to an early training checkpoint often suffices. For small-scale INRs with few
parameters, exact rewinding (to epoch 0) is both necessary and sufficient. The cleaner the setup, the more the
original hypothesis holds.
03 — Methodology
The Experimental Architecture
The SIREN Backbone
I used a 5-layer MLP with 256 hidden units at each layer, an input dimension of 2 (normalized pixel coordinates
in $[-1, 1]^2$) and output dimension of 3 (RGB, in $[-1, 1]$). The first layer uses $\omega_0 = 30$ to set the
base frequency; subsequent hidden layers inherit this factor. The output layer is a plain linear map, no sine, no
clamp. This allows for unconstrained gradient flow during training, with clamping applied only at reconstruction
time.
Total parameters: (2×256+256) + 3×(256×256+256) + (256×3+3) = 198,915.
This corresponds to the input layer ($2 \to 256$), three hidden layers ($256 \to 256$), and the output layer ($256
\to 3$), including biases for each.
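The count can be sanity-checked in two lines:

```python
# Layer widths of the SIREN MLP: 2 -> 256 -> 256 -> 256 -> 256 -> 3.
dims = [2, 256, 256, 256, 256, 3]
n_params = sum(i * o + o for i, o in zip(dims, dims[1:]))  # weights + biases
print(n_params)  # 198915
```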
Small enough to run entirely in the VRAM of my old-ish laptop GPU, large enough that the LTH dynamics are
interesting.
The target: Leonardo da Vinci's Mona Lisa (aka La Gioconda), resized to 256×256 and normalized. I chose
this image deliberately because it has a rich mix of low-frequency gradients (the sky, skin) and high-frequency
detail (the veil, background foliage), making the frequency-collapse mechanism quite visible. I also chose it
because I like the painting.
The Three Pruning Strategies
At each of 21 iterations, after 1,000 training epochs and Adam optimization at $\text{lr}=10^{-4}$ with cosine
annealing, I applied one of three pruning functions to 20% of remaining active weights (those that survived the
previous round):
Strategy | Rule | Hypothesis
Winning Ticket | Prune lowest $|w_i|$ | Small weights are noise; large weights encode signal
Random Ticket | Prune uniform random $20\%$ | Control: does structure matter, or is any $N\%$ sufficient?
Losing Ticket | Prune highest $|w_i|$ | High-magnitude weights are load-bearing in SIREN
After each pruning step, surviving weights are rewound to their exact epoch-0 values following
the LTH protocol. In my implementation, this is done by maintaining a frozen copy of the initial state dict and
overwriting module.weight_orig (PyTorch's internal pruning target) with the masked initial weights:
$\theta_{\text{orig}} \leftarrow m \odot \theta_0$.
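Concretely, the rewind step looks roughly like this. It's simplified from my implementation: it assumes the modules were pruned with torch.nn.utils.prune (so each carries weight_orig and weight_mask) and that initial_state is the frozen pre-training state dict:

```python
import torch
import torch.nn.utils.prune as prune

def rewind_to_init(model, initial_state):
    """Rewind surviving weights to epoch 0: theta_orig <- m * theta_0.

    `initial_state` is a frozen copy of the *unpruned* state dict saved
    before training, so its keys are "<module>.weight", not "weight_orig".
    """
    with torch.no_grad():
        for name, module in model.named_modules():
            if prune.is_pruned(module) and hasattr(module, "weight_orig"):
                theta_0 = initial_state[f"{name}.weight"]
                module.weight_orig.copy_(module.weight_mask * theta_0)
```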
$$\text{Remaining}(k) = (1 - p)^{\,k-1} \qquad p = 0.20, \quad k \in \{1,\ldots,21\}$$
Note that the "20% of remaining" formulation means pruning follows an exponential decay schedule, not a
linear one (iteration 1 is the dense network, matching the results table). By iteration 21 we're at
$0.8^{20} \approx 1.2\%$ of original weights, roughly 2,386 parameters carrying the entire Mona Lisa. The PSNR
results at this extreme tell us what is structurally essential versus decorative scaffolding in the network's
parametrization.
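The schedule arithmetic, for the record:

```python
# Remaining-weight fraction after each pruning round (iteration 1 = dense).
p = 0.20
remaining = [(1 - p) ** (k - 1) for k in range(1, 22)]
print(f"{remaining[-1]:.4f}")  # 0.0115 -> ~1.2% at iteration 21
```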
Implementation Detail
The losing ticket required special handling. PyTorch's built-in pruning methods offer no "prune largest"
variant. The custom implementation sorts all active weights globally on CPU (to avoid GPU kthvalue
instability on degenerate distributions of near-zero values), determines the threshold at position
$n_\text{keep}$ in the sorted array, and applies prune.custom_from_mask per-layer. Already-pruned
zeros
are excluded from the magnitude pool, otherwise they dominate and artificially inflate the threshold.
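A simplified version of that custom step (assuming each layer already carries a `weight_mask`, e.g. from a previous pruning call or `prune.identity`; this is a sketch, not my exact code):

```python
import torch
import torch.nn.utils.prune as prune

def prune_largest_global(modules, frac=0.2):
    """Sketch: globally prune the largest-magnitude *surviving* weights.

    Already-pruned zeros are excluded from the magnitude pool via the
    boolean mask, and the global sort happens on CPU.
    """
    # Gather surviving magnitudes from all layers on CPU.
    alive = torch.cat([
        (m.weight_mask * m.weight_orig).abs()[m.weight_mask.bool()].cpu()
        for m in modules
    ])
    n_prune = int(frac * alive.numel())
    if n_prune == 0:
        return
    # Threshold at position n_keep in the sorted array; everything
    # strictly above it is removed.
    threshold = alive.sort().values[alive.numel() - n_prune - 1]
    for m in modules:
        w = (m.weight_mask * m.weight_orig).abs()
        new_mask = m.weight_mask.clone()
        new_mask[w > threshold] = 0.0
        prune.custom_from_mask(m, "weight", new_mask.to(m.weight_orig.device))
```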
04 — Results
The Frequency Collapse
The results are dramatic enough to be immediately noticeable in the curves below, which you
can toggle between PSNR and high-frequency energy (HFE) retention. Both metrics tell the same story from
different angles. Let me walk through what each ticket is doing.
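One simple way to measure high-frequency energy is the fraction of 2D FFT power outside a low-frequency disk; HFE retention is then this quantity for a pruned reconstruction divided by the same quantity for the dense one. The sketch below illustrates the idea (the exact cutoff radius behind my plots may differ):

```python
import numpy as np

def high_freq_energy(img, radius_frac=0.1):
    """Fraction of spectral power outside a centered low-frequency disk.

    `img` is a 2D array; `radius_frac` sets the disk radius as a
    fraction of the image side (an illustrative choice).
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    low = r <= radius_frac * min(h, w)
    return power[~low].sum() / power.sum()
```

A flat image has essentially zero HFE (all power at DC), while white noise has nearly all of its power at high frequencies.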
Winning Ticket (IMP)
Random Ticket (Control)
Losing Ticket (Anti-IMP)
Figure 1. PSNR (dB) vs. remaining weight fraction.
X-axis is logarithmic
(left = dense, right = sparse). Note the winner's slowed decay versus the loser's step-function
collapse after just one pruning round.
The Winner: Resilient to the End
The winning ticket starts at 47.3 dB and decays slowly: still above 40 dB at 32.8%
remaining weights, above 34 dB at 4.4%, and landing at 29.2 dB with only
1.2% of its original parameters. For context, 30 dB is the informal threshold
for "perceptually acceptable" image quality. The HFE curve mirrors this resilience: the winner
retains 53% of its high-frequency spectral energy at 3.5% remaining weights, and still holds
27.8% at the final 1.2% mark. The network is not forgetting how to represent frequencies; it
is representing fewer of them, gracefully.
What does this mean? That the Mona Lisa, with all its sfumato gradients, its enigmatic
expression, its Renaissance chiaroscuro... can be encoded in roughly 2,386 floating-point
numbers and a connectivity pattern, down from 198,915. This is an approximately
40× compression of the network in terms of storage. In sparse COO format
(storing only non-zero weights with their indices), this translates to roughly 19.1 KB
versus the original ~781 KB dense state dict.
The Random Ticket: Gradual Decay
The random ticket shows smooth logarithmic PSNR decay, staying competitive at moderate
sparsity before crashing to 12.9 dB at 1.2%. Its HFE tells a more specific story: the
spectral energy drains steadily and hits 0.0% at 1.4% remaining weights, two steps before
the final iteration. This means the random ticket does not suffer a structural rupture like
the loser, but it converges on the same endpoint: a network that has lost all directionally
diverse frequency content and can only produce flat, separable outputs.
The Loser: Instantaneous Collapse
The losing ticket's behavior is the sharpest result in this experiment. It begins at
47.6 dB with 96.5% HFE, indistinguishable from the other tickets. After a single
pruning round (80% remaining), PSNR drops to 29.3 dB and HFE collapses to 15.4%.
After the second round (64% remaining), it reaches 11.5 dB, the PSNR of a blank gray
screen, and HFE hits 0.0%. Both metrics flatline there for all remaining 18 iterations,
across 18,000 subsequent training epochs. The removal of the top 36% highest-magnitude
weights destroys the network's spectral structure entirely, and no amount of retraining
recovers it. This is the frequency collapse: not a gradual loss of quality
but an irreversible structural failure readable in both the PSNR curve and the Fourier
spectrum simultaneously.
Remaining % | Winner (dB) | Random (dB) | Loser (dB) | Gap (W−L)
100.0 | 47.29 | 47.54 | 47.62 | —
80.0 | 46.75 | 44.14 | 29.31 | +17.4
64.0 | 46.00 | 41.87 | 11.46 | +34.5
32.8 | 41.90 | 37.36 | 11.32 | +30.6
10.7 | 36.49 | 33.33 | 11.32 | +25.2
2.3 | 32.73 | 23.64 | 11.32 | +21.4
1.2 | 29.18 | 12.90 | 11.90 | +17.3
Key Finding
The winner-loser PSNR gap at 64% remaining weights is 34.5 dB: the difference between a recognizable painting
and white noise. This is a structural consequence of how SIREN encodes information in its weight magnitudes.
05 — Discussion
Why the Loser Fails Catastrophically
The losing ticket's behavior requires explanation. In a standard ReLU network, pruning high-magnitude weights
causes degradation, but gradual degradation, not instant collapse. In ReLU networks, losing tickets are bad; in
SIRENs, they are catastrophic.
SIREN Weights as Frequency Carriers
Consider what a SIREN weight matrix $\mathbf{W}_i$ actually does. The $j$-th output of layer $i$ computes
$\sin(\omega_0 (\mathbf{W}_i \mathbf{x})_j + b_j)$. The magnitude of the weights in row $j$ determines how far
the pre-activation $(\mathbf{W}_i \mathbf{x})_j$ sweeps as $\mathbf{x}$ varies, and therefore how rapidly the
sine oscillates. This relationship: Weight Magnitude $\rightarrow$ Pre-activation Range $\rightarrow$
Oscillation Frequency, is the entire physical mechanism of the SIREN architecture:
Condition | Pre-Activation | Effect on Sine
Small weights | Near-zero domain | Barely moves from zero
Large weights | Wide value spread | Oscillates rapidly
Through composition of $k$ such layers, the Sinusoidal Representation Network constructs a superposition of
high-order trigonometric
polynomials. The specific frequencies that can be represented are determined by which weight combinations produce
constructive interference across layers. The high-magnitude weights are the primary oscillators:
they set the dominant frequencies that give the reconstruction its edges and textures. The small-magnitude weights
are the fine-tuning residuals.
When we prune the 20% largest weights globally, we don't just reduce the network's capacity, we remove the
nodes that generate the highest-frequency components of the internal representation. What remains is a network of
sub-harmonics, a SIREN that can only produce low-frequency structures. And crucially, no amount of retraining can
recover those frequencies, because the rewound initial weights for those positions are also small (having survived
as small weights from the original random initialization).
After losing-ticket pruning, only low-amplitude basis functions remain. The effective
bandwidth of the representation collapses to the lowest frequencies — insufficient to reconstruct any meaningful
texture.
Why ReLU Networks Don't Suffer This
In a ReLU network, large weights don't have the same frequency carrier role. A large weight on a ReLU node simply
shifts the decision boundary. The signal is encoded in the pattern of active vs. inactive units, not in
the
magnitude of any individual weight. Removing the highest-magnitude weights reorganizes the representation but
cannot destroy frequency content, because ReLU networks have none to destroy in the first place; they encode
signals through piecewise linear boundaries.
The SIREN's power comes from its frequency-rich representation; its vulnerability to losing-ticket pruning is the
flip side of that same coin.
Implications for NeRF and 3D Representation
As I said in the introduction, I am a 3D Vision guy, so I couldn't leave without mentioning that the practical
stakes here extend well beyond image compression. Neural Radiance Fields (NeRFs) use
coordinate-MLPs structurally similar to SIRENs (often with positional encodings playing the role of SIREN's
initialization). Recent work on NeRF compression, such as KiloNeRF, Instant-NGP, and TensoRF, has focused on
architecture redesign. The LTH perspective suggests a complementary approach: prune the NeRF itself, with
rewinding. If a 2.3%-sparse SIREN can represent a 2D image at 32.7 dB PSNR, a sparse NeRF might
represent 3D scenes with dramatically fewer parameters than current methods assume necessary.
The losing ticket experiment also has a practical negative implication: magnitude-based pruning on SIRENs must be
done carefully. If you prune high-magnitude weights as some aggressive pruning schedules do when removing
"outliers", you risk frequency collapse rather than smooth compression. The safest pruning strategy for
periodic-activation networks is IMP (prune lowest magnitude).
Open Question
An intriguing follow-up: does the winning ticket found here transfer across images? Morcos et al.
(2019) showed that lottery tickets found on CIFAR-10 sometimes transfer to CIFAR-100. For INRs, transferability
would mean something even stranger: that the same sparse connectivity structure can be reused to represent
entirely different natural signals. This seems unlikely (each image has different frequency content), but
the connectivity structure of the winning ticket might reveal universal properties of how SIRENs decompose 2D
signals.
What the Winning Ticket Is Really Preserving
Looking at the winning ticket reconstruction at 1.2% remaining weights, the result is blurry but the Mona Lisa
is still recognizable: the silhouette, skin tones, and background color regions all survive across roughly
2,386 parameters. Recall from the frequency-carrier analysis above that in a SIREN, weight magnitude directly
governs the frequency of a neuron's sinusoidal oscillation, making high-magnitude weights the primary
oscillators of the learned representation. Iterative Magnitude Pruning preserves exactly these, discarding the
smallest weights. The
blurriness at this extreme sparsity is a capacity effect: with so few oscillators remaining, the
network can only compose the frequency modes that dominate the image's energy (and thus, the training MSE loss).
Those happen to be the
macro-structure, broad color regions and facial geometry, while fine textures like the veil require higher-order
interference across many more parameters than survive. The winning ticket acts, loosely, as the network's
principal component decomposition: IMP isolates the sparse subset of high-magnitude weights whose joint
oscillations account for the majority of the signal's variance.
06 — Compression Analysis
The Space-Time Tradeoff
One of the original motivations for this experiment was the question of whether INRs can be practically
compressed. The answer is emphatically yes, with a caveat: the gains are in storage, not in compute, under
standard dense execution.
Important Clarification
PyTorch's unstructured pruning does NOT automatically produce inference
speedups. The pruning mask is applied as $\mathbf{W}_\text{eff} = \mathbf{W}_\text{orig} \odot M$, which remains
a dense operation. Real latency gains require hardware-aware sparse kernels (e.g., CUDA cuSPARSE, or custom
CUTLASS kernels) or structural pruning (removing entire rows/columns). The compression numbers below refer to
storage, not compute.
Dense state dict: 781 KB
Sparse (COO) at 1.2%: 19 KB
Compression ratio: ~41×
Preserved PSNR: 29.2 dB
In COO (Coordinate) sparse format, each non-zero weight requires storing its value (float32) and its index
(int32): 8 bytes versus 4 bytes for a dense weight. The break-even point is 50% density; below that, sparse is
smaller. At 1.2% density, we store roughly 2,386 parameters as (index, value) pairs. Total storage ≈ 19 KB,
compared to 781 KB dense. A practical 41× reduction.
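The arithmetic behind those numbers (the on-disk state dict carries a little serialization overhead beyond the raw 4 bytes per weight, which accounts for the small gap versus 781 KB):

```python
# Back-of-envelope storage comparison for the final winning ticket.
n_dense = 198_915   # total parameters in the dense SIREN
n_sparse = 2_386    # survivors at ~1.2% density
dense_bytes = n_dense * 4            # float32 values only
sparse_bytes = n_sparse * (4 + 4)    # float32 value + int32 flat index
print(dense_bytes // 1024, sparse_bytes // 1024)  # KB, dense vs sparse
```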
07 — Reflections
What This Tells Us About Neural Computation
The most interesting outcome of this experiment is not the compression ratio but how the three
ticket types help us understand how SIRENs are internally organized.
The existence of a well-performing winning ticket at 1.2% density implies that 98.8% of this network's parameters
are, in some
sense, redundant given the right initialization. But this redundancy is not waste, it is just the price
you need to pay for simplifying
training dynamics and the optimization landscape. The dense network's overparameterization creates a smooth loss
surface that gradient
descent can traverse; the sparse winning ticket, found only after training the dense network, cannot be discovered
by directly training the sparse architecture from scratch. The dense network is a scaffolding for
identifying the sparse solution.
So at the fine-grained level, learning is not just fitting parameters to data, but identifying, from a
combinatorially large space of possible sparse structures, the specific connectivity that allows the task to be
solved. The Mona Lisa is not in any 1.2% of this network. It's in a very specific 1.2%: the one that, when
rewound to initialization, happens to sit in a favorable region of optimization geometry.
Dense networks are the lottery in which winning tickets are sold. You cannot buy only the winning ticket; you
must buy all of them and then discover which one won.
— paraphrasing Frankle's own reflections on the hypothesis
The losing ticket shows the other side of the coin. High-magnitude weights are not "more important" in a general
sense; they're specifically important to SIREN's frequency decomposition. This is architecture-specific, though.
In a transformer, high-magnitude weights in attention heads might encode something entirely different. The
"carrier wave" interpretation of SIREN's large weights is a hypothesis worth testing more rigorously, perhaps
with direct Fourier analysis of the activation statistics at each layer, a level of depth that, my dear friends,
I have not yet reached.
Finally, the random ticket's slow decay suggests that SIRENs have a surprising degree of graceful degradation:
you can remove up to 80-90% of weights randomly and still maintain reasonable quality.
SIREN representations are quite robust by default. They become brittle only when you specifically
target their frequency-carrying architecture.
Thank you for reading
Your feedback is my winning ticket. Have you found your own sub-networks yet?
[1]Sitzmann, V., Martel, J.N.P., Bergman, A.W., Lindell, D.B., & Wetzstein, G. (2020). Implicit Neural
Representations with Periodic Activation Functions. NeurIPS 2020.
[2]Frankle, J. & Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.
ICLR 2019.
[3]Frankle, J., Dziugaite, G.K., Roy, D.M., & Carbin, M. (2020). Linear Mode Connectivity and the
Lottery Ticket Hypothesis. ICML 2020.
[4]Rahaman, N. et al. (2019). On the Spectral Bias of Neural Networks. ICML 2019.
[5]Morcos, A.S., Yu, H., Paganini, M. & Tian, Y. (2019). One Ticket to Win Them All: Generalizing
Lottery Ticket Initializations across Datasets and Optimizers. NeurIPS 2019.
[6]Malach, E., Yehudai, G., Shalev-Shwartz, S. & Shamir, O. (2020). Proving the Lottery
Ticket Hypothesis: Pruning is All You Need. ICML 2020.