Scaling laws for autoencoders

August 11, 2024

Recently, I tried training an extensive autoencoder for financial data -like model—more accurately described as a large encoder with predictive capabilities for subsequent ticks in lieu of a decoder—to ascertain the degree of data compression attainable. The selection of latent space dimensionality presented a challenge, prompting me to experiment with various configurations. Upon graphing the logarithm of the loss function against the logarithm of dimensionality, I observed an approximately linear relationship.

This intriguing discovery led me to hypothesize that similar behavior might be observed in conventional autoencoders, particularly when constraints on encoder and decoder sizes are relaxed. Subsequent experimentation confirmed this linearity, even in variational autoencoders (VAEs), initially without added noise. This finding suggests a potential scaling law, which is particularly noteworthy given the prevailing notion that structured data resides on low-dimensional manifolds, aligning with the concept that "compression equals intelligence."

The manifold hypothesis (https://en.wikipedia.org/wiki/Manifold_hypothesis) is widely acknowledged in the field. However, the aforementioned observations appear to contradict this hypothesis. If valid, we would expect the decoder to parametrize a portion of the manifold using the latent space. Consequently, if the dataset's image occupies a full-dimensional subset of the latent space, one would anticipate a discontinuity in the loss function upon reaching the manifold's dimensionality. Alternatively, the loss curve might resemble the cumulative variance explained by the first $n$ components of the data, or at least the cumulative variance explained by the first $n$ components in the latent space. This observation presents a compelling argument against the manifold hypothesis and underscores the discrete structure inherent in the phenomena under study.

A second, and perhaps more significant, consideration is the underlying reason for the linearity observed in the log-log plot. This phenomenon has been independently verified, as evidenced in the second figure of https://arxiv.org/pdf/2406.04093. Notably, the authors of this study do not strive for large encoders and decoders; rather, they impose bottlenecks in model size. Despite this constraint, the linearity remains apparent.

To elucidate this phenomenon, a colleague astutely noted: "If you consider a matrix SVD and examine the error, it would approximate the sum of the lowest singular values. Thus, if $e \sim d\sigma_d$, then $\log(e) \sim \log(d) + \log(\sigma_d)$, potentially explaining the behavior when compressing to a lower dimension." Upon further reflection, I realized that for unstructured data, significant compression is inherently challenging, as a small latent space is insufficient. This led to a contemplation of the nature of structure, which can be conceptualized as features.

The question then arises: what constitutes features, and how many can be accommodated in a latent space? For a decoder to distinguish features, a minimum angular separation ($\epsilon$) between different features is necessary. A well-known theorem states that the maximum number of distinct vectors that can be embedded in an $n$-dimensional space, while maintaining a minimum pairwise angular separation of $\epsilon$, grows exponentially. This realization provides a satisfactory explanation for the observed phenomenon.

August 5, 2024

Landawer conatant and Bremerman limit and its reflections about measure of intelligence

On the Feasibility of AI Models Winning Gold Medals in International Mathematics Olympiads by 2026

I recently posited a hypothesis, backed by a substantial wager of 2000 USD, that no machine learning models would be capable of securing a gold medal at the International Mathematical Olympiad (IMO) until 2026. Upon further reflection, I believe my initial assessment may have been overly conservative. However, the reasoning behind this hypothesis merits closer examination.

Quantifying Model Intelligence

One of the persistent challenges in artificial intelligence research has been establishing a reliable metric for model intelligence. I propose that a reasonable proxy for this metric could be the energy expenditure during the training process. However, this approach necessitates several important considerations:

Stability of Training Frameworks: This hypothesis assumes relative stability in model training frameworks until 2026. Specifically, it presupposes that models will continue to be predominantly based on transformer architectures, supplemented by reinforcement learning and tree-search algorithms, with algorithmic efficiency remaining within the same order of magnitude as current techniques.
Landauer's Principle: This assumption is grounded in Landauer's principle, which stipulates the minimum amount of energy required to erase one bit of information. Remarkably, modern computers operate within 100-1000 times of this theoretical limit. Given this proximity to the physical limit, it becomes reasonable to gauge model intelligence by quantifying the floating-point operations (FLOPs) required for training.
Current Benchmarks: As a point of reference, GPT-4 is estimated to require approximately 20 petaFLOPs for training, which serves as a reasonable approximation for current state-of-the-art models.

Scaling Laws and Computational Requirements

The relationship between model capabilities and computational requirements appears to follow certain scaling laws:

Exponential Compute for Linear Capability Growth: There is a growing body of empirical evidence suggesting that linear growth in capabilities necessitates exponential increases in computational resources. This intuition aligns with power laws observed in various physical phenomena and is generally supported by empirical scaling laws in machine learning.
Defining Capabilities: While the precise definition of "capabilities" remains somewhat ambiguous in this context, we can accept this intuition as a working hypothesis for the purposes of this discussion.
Extrapolation to IMO Performance: If we assume that a linear increase in the number of IMO problems solved requires exponential growth in computational resources, and we extrapolate this relationship based on existing models, we arrive at a computational requirement that exceeds the capacity of currently available data centers globally.

Energy as the Limiting Factor

This analysis leads to a critical observation: the primary bottleneck in scaling these models is energy consumption. The difficulty in significantly expanding model capabilities stems largely from the enormous energy requirements associated with training increasingly complex models.

Bremermann's Limit

To further contextualize this energy constraint, it's important to consider Bremermann's limit. Named after Hans-Joachim Bremermann, this theoretical limit posits a maximum computational speed for a self-contained system with finite mass. Derived from quantum mechanics and Einstein's mass-energy equivalence, Bremermann's limit is approximately 1.36 × 10^50 bits per second per kilogram.

For a hypothetical computer with the mass of Earth (approximately 5.97 × 10^24 kg), this translates to a maximum computational speed of about 8.13 × 10^74 bits per second. While this limit is enormously high compared to current computing capabilities, it underscores the fundamental physical constraints on computation, regardless of future technological advancements.

The Bremermann's limit serves as an ultimate ceiling on computational speed, reinforcing the argument that energy constraints pose a significant challenge to indefinite scaling of AI models. As we approach this limit, the energy requirements for marginal improvements in AI performance will become increasingly prohibitive.

Conclusion

While my initial bet may prove to be overly pessimistic, the underlying reasoning highlights a critical aspect of AI development that warrants further discussion. The challenges associated with scaling models are substantial, primarily due to energy constraints. As we continue to push the boundaries of AI capabilities, it becomes increasingly important to consider these physical limitations and their implications for the future of artificial intelligence research and development.

There is no metric to measure how well does autoencoder unveil the features however doing a lot of visualisations as above seem to suggest it works quite well.