参考文献

基礎（必読）

Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
Deng et al., "ArcFace: Additive Angular Margin Loss for Deep Face Recognition" (CVPR 2019)
Loshchilov et al., "nGPT: Normalized Transformer with Representation Learning on the Hypersphere" (arXiv:2410.01131, 2024)
Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (arXiv 2021; Neurocomputing 2024) - RoPEの原論文
Nagata et al., "Variance Matters: Detecting Semantic Differences without Corpus/Word Alignment" (EMNLP 2023)
Yamagiwa et al., "Revisiting Cosine Similarity via Normalized ICA-transformed Embeddings" (arXiv 2024)

Jang et al., "Categorical Reparameterization with Gumbel-Softmax" (ICLR 2017)
Maddison et al., "The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables" (ICLR 2017)
Bengio et al., "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation" (arXiv 2013) - STEの原論文

Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (ICLR 2017) - MoEの基礎
Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" (JMLR 2022)
Jiang et al., "Mixtral of Experts" (arXiv 2024)
Dai et al., "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models" (arXiv 2024)

Nickel & Kiela, "Poincaré Embeddings for Learning Hierarchical Representations" (NeurIPS 2017)
Mathieu et al., "Continuous Hierarchical Representations with Poincaré Variational Auto-Encoders" (NeurIPS 2019)
Yang et al., "Hyperbolic Fine-Tuning for Large Language Models (HypLoRA)" (arXiv:2410.04010, 2024)
- ※AQuAで最大13.0%向上。発展版のHoRA（適応的曲率）では17.30%向上との報告あり
Sinha et al., "Learning Structured Representations with Hyperbolic Embeddings" (NeurIPS 2024)
He et al., "Hyperbolic Deep Learning for Foundation Models: A Survey" (arXiv:2507.17787, 2025)

Ho et al., "Denoising Diffusion Probabilistic Models" (NeurIPS 2020)
Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021)

Amari, "Information Geometry and Its Applications" (Springer 2016)
Martens & Grosse, "Optimizing Neural Networks with Kronecker-factored Approximate Curvature" (ICML 2015)
Liu et al., "Reconstructing Deep Neural Networks: Unleashing the Optimization Potential of Natural Gradient Descent" (NeurIPS 2024)
参考（査読未通過）: Hwang, "FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information" (arXiv:2405.12807)
- ※ICLR 2025で取り下げ。Adamと自然勾配の関係についての興味深い視点を提供するが、確立された理論ではない点に注意

Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget" (arXiv 2023)

Carlsson, "Topology and Data" (Bulletin of the AMS, 2009) - TDA入門
Lee, "Introduction to Riemannian Manifolds" (2018) - 数学的基礎
Bronstein et al., "Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges" (2021)