August 11, 2024
Recently, I tried training a large autoencoder-like model for financial data—more accurately, a large encoder with a head predicting subsequent ticks in lieu of a decoder—to ascertain the degree of data compression attainable. The selection of latent space dimensionality presented a challenge, prompting me to experiment with various configurations. Upon graphing the logarithm of the loss against the logarithm of the latent dimensionality, I observed an approximately linear relationship.
This intriguing discovery led me to hypothesize that similar behavior might be observed in conventional autoencoders, particularly when constraints on encoder and decoder sizes are relaxed. Subsequent experimentation confirmed this linearity, even in variational autoencoders (VAEs), initially without added noise. This finding suggests a potential scaling law, which is particularly noteworthy given the prevailing notion that structured data resides on low-dimensional manifolds, aligning with the concept that "compression equals intelligence."
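The fit itself is simple. A minimal sketch with made-up losses (the real numbers came from the training runs; the exponent here is purely illustrative):

```python
import numpy as np

# Hypothetical reconstruction losses for autoencoders with different
# latent dimensionalities; a power law loss ~ c * d^(-alpha).
dims = np.array([2, 4, 8, 16, 32, 64])
loss = 1.5 * dims ** -0.42

# A power law is a straight line in log-log coordinates, so a linear
# fit of log(loss) against log(dims) recovers the exponent.
slope, intercept = np.polyfit(np.log(dims), np.log(loss), 1)
print(round(slope, 2))   # -0.42
```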
The manifold hypothesis (https://en.wikipedia.org/wiki/Manifold_hypothesis) is widely acknowledged in the field. However, the observations above appear to contradict it. If the hypothesis held, we would expect the decoder to parametrize a portion of the manifold using the latent space. Consequently, once the dataset's image occupies a full-dimensional subset of the latent space—that is, once the latent dimensionality reaches that of the manifold—one would anticipate a discontinuity in the loss. Alternatively, the loss curve might resemble the cumulative variance explained by the first $n$ components of the data, or at least by the first $n$ components in the latent space. Neither shape matches a straight line in log-log coordinates, which presents a compelling argument against the manifold hypothesis and calls into question the discrete structure often attributed to the phenomena under study.
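For contrast, here is what the manifold hypothesis would predict in the simplest linear case: data concentrated near a low-dimensional subspace produces a sharp knee in the cumulative explained variance, not a smooth power-law decay. A small synthetic sketch (dimensions and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data lying almost exactly on a 5-dimensional linear
# "manifold" embedded in 50 dimensions, plus a little noise.
k, D, N = 5, 50, 2000
basis = rng.standard_normal((k, D))
coeffs = rng.standard_normal((N, k))
X = coeffs @ basis + 0.01 * rng.standard_normal((N, D))
X -= X.mean(axis=0)

# Singular values of the centered data give variance per component.
s = np.linalg.svd(X, compute_uv=False)
var = s**2 / np.sum(s**2)
cum = np.cumsum(var)

# The first k components capture nearly everything: a sharp knee at k,
# unlike the smooth log-log line observed in the experiments.
print(round(cum[k - 1], 4))   # nearly all variance in the first k components
```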
A second, and perhaps more significant, consideration is the underlying reason for the linearity observed in the log-log plot. This phenomenon has been independently verified, as evidenced in the second figure of https://arxiv.org/pdf/2406.04093. Notably, the authors of this study do not strive for large encoders and decoders; rather, they impose bottlenecks in model size. Despite this constraint, the linearity remains apparent.
To elucidate this phenomenon, a colleague astutely noted: "If you consider a matrix SVD and examine the error, it would approximate the sum of the lowest singular values. Thus, if $e \sim d\sigma_d$, then $\log(e) \sim \log(d) + \log(\sigma_d)$, potentially explaining the behavior when compressing to a lower dimension." Upon further reflection, I realized that for unstructured data, significant compression is inherently challenging, as a small latent space is insufficient. This led to a contemplation of the nature of structure, which can be conceptualized as features.
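The SVD remark can be checked directly: for the rank-$d$ truncation of a matrix, the nuclear-norm error equals the sum of the discarded singular values. A quick numerical verification on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# Best rank-d approximation of A via truncated SVD; the approximation
# error (in nuclear norm) is the sum of the discarded singular values.
A = rng.standard_normal((200, 200))
U, s, Vt = np.linalg.svd(A)

d = 50
A_d = (U[:, :d] * s[:d]) @ Vt[:d]                      # rank-d truncation
err = np.linalg.svd(A - A_d, compute_uv=False).sum()   # nuclear norm of residual

print(np.isclose(err, s[d:].sum()))   # True: error equals the spectrum's tail
```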
The question then arises: what constitutes a feature, and how many features can a latent space accommodate? For a decoder to distinguish features, a minimum angular separation $\epsilon$ between different features is necessary. A well-known result states that the maximum number of distinct vectors that can be embedded in an $n$-dimensional space, while maintaining a pairwise angular separation of at least $\epsilon$, grows exponentially in $n$. This realization provides a satisfactory explanation for the observed phenomenon.
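This can be seen numerically even without invoking the exponential bound: random directions in a high-dimensional space are already nearly orthogonal, so far more than $n$ well-separated features fit into $n$ dimensions. A quick illustration (random unit vectors, not actual trained features):

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample many random directions in d dimensions and inspect their
# pairwise angles via cosine similarities.
d, k = 2000, 200
V = rng.standard_normal((k, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Zero the diagonal so only distinct pairs are compared.
cos = V @ V.T
np.fill_diagonal(cos, 0.0)
max_cos = np.abs(cos).max()
print(round(max_cos, 3))   # small: all pairwise angles are near 90 degrees
```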
August 5, 2024
I recently posited a hypothesis, backed by a substantial wager of 2000 USD, that no machine learning models would be capable of securing a gold medal at the International Mathematical Olympiad (IMO) until 2026. Upon further reflection, I believe my initial assessment may have been overly conservative. However, the reasoning behind this hypothesis merits closer examination.
One of the persistent challenges in artificial intelligence research has been establishing a reliable metric for model intelligence. I propose that a reasonable proxy for this metric could be the energy expended during training, though this approach comes with several important caveats.

The relationship between model capabilities and computational requirements appears to follow certain scaling laws.

This analysis leads to a critical observation: the primary bottleneck in scaling these models is energy consumption. The difficulty in significantly expanding model capabilities stems largely from the enormous energy requirements associated with training increasingly complex models.
To further contextualize this energy constraint, it's important to consider Bremermann's limit. Named after Hans-Joachim Bremermann, this theoretical limit posits a maximum computational speed for a self-contained system with finite mass. Derived from quantum mechanics and Einstein's mass-energy equivalence, Bremermann's limit is approximately 1.36 × 10^50 bits per second per kilogram.
For a hypothetical computer with the mass of Earth (approximately 5.97 × 10^24 kg), this translates to a maximum computational speed of about 8.12 × 10^74 bits per second. While this limit is enormously high compared to current computing capabilities, it underscores the fundamental physical constraints on computation, regardless of future technological advancements.
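The arithmetic, for reference (constants as stated above):

```python
# Bremermann's limit and the mass of Earth, as quoted above.
BREMERMANN = 1.36e50   # bits per second per kilogram
EARTH_MASS = 5.97e24   # kilograms

# Maximum computational speed for an Earth-mass computer.
max_speed = BREMERMANN * EARTH_MASS
print(f"{max_speed:.2e} bit/s")   # prints 8.12e+74 bit/s
```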
Bremermann's limit serves as an ultimate ceiling on computational speed, reinforcing the argument that energy constraints pose a significant challenge to indefinite scaling of AI models. As we approach this limit, the energy requirements for marginal improvements in AI performance will become increasingly prohibitive.
While my initial bet may prove to be overly pessimistic, the underlying reasoning highlights a critical aspect of AI development that warrants further discussion. The challenges associated with scaling models are substantial, primarily due to energy constraints. As we continue to push the boundaries of AI capabilities, it becomes increasingly important to consider these physical limitations and their implications for the future of artificial intelligence research and development.
There is no metric for measuring how well an autoencoder unveils the features; however, extensive visualisations such as those above seem to suggest it works quite well.