# HART-SURYA_README.md
I examined the patchifying, positional-encoding, and tokenization methods employed in the Surya model.

I posted these proposals for Surya model upgrades at https://github.com/NASA-IMPACT/Surya/issues/21

I find the input/tokenization method surprisingly simple, and perhaps inefficient. The transformer receives square images centered on the sun, with a large margin of dark space around the disk that contains no information. Although a transformer is capable of figuring things out, it is usually better to optimize the inputs: reduce noise and remove extraneous data that only consume compute and power (garbage in, garbage out) while adding nothing to the best results.

My proposal:

**Heliocentric Adaptive-Rotation Tokenization (HART)**

The core principle of HART is to transform the input from a sequence of static, projected 2D images into a sequence of tokens that live in a unified, rotating reference frame matching the sun's own physical rotation (which differs by latitude). This lets the transformer see features as they truly evolve on the solar surface across the time dimension T of the image sequence, rather than treating the time series like a color channel C and seeing a mixture of intrinsic evolution and apparent motion due to rotation.
Background info:

The Sun's Surface Is a Fluid Plasma: The most critical concept to understand is that the sun does not have a solid surface. The "surface" we see, the photosphere, is a roiling, convective layer of superheated plasma (ionized gas). Nothing on it can be truly "fixed."

Differential Rotation: Unlike a solid body such as Earth, which rotates at the same rate everywhere, the sun exhibits differential rotation:

* The solar equator completes a rotation in about 25 Earth days.
* The solar poles take much longer, about 35 Earth days.

This means a sunspot near the equator will "lap" a sunspot at a higher latitude over time; features constantly drift relative to one another. This shearing and stretching motion is a primary driver of solar activity.

Intrinsic Evolution: Sunspots are not permanent features. They are temporary magnetic phenomena that:

* Emerge: a magnetic flux tube from deep within the sun breaks through the surface.
* Evolve: they change shape, size, and complexity over hours and days.
* Decay: they eventually dissipate over days to weeks.

Thus sunspots and similar features are persistent enough, on the timescale of a few frames of solar imagery, to be deliberately tracked by the transformer model over a limited time window T.
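The latitude dependence of the rotation rate is often approximated by a three-term surface profile, ω(φ) = A + B·sin²φ + C·sin⁴φ. A minimal sketch, using typical published coefficients in degrees per day (illustrative values, not taken from the Surya codebase):

```python
import math

def rotation_rate_deg_per_day(lat_deg: float) -> float:
    """Approximate solar differential rotation rate at a given latitude.

    Uses the common three-term profile omega = A + B*sin^2(lat) + C*sin^4(lat).
    The coefficients are typical literature values (deg/day), for illustration.
    """
    A, B, C = 14.713, -2.396, -1.787
    s2 = math.sin(math.radians(lat_deg)) ** 2
    return A + B * s2 + C * s2 * s2

# Implied rotation periods: roughly 24.5 days at the equator and
# noticeably longer at high latitudes.
print(360.0 / rotation_rate_deg_per_day(0.0))   # equatorial period (days)
print(360.0 / rotation_rate_deg_per_day(75.0))  # high-latitude period (days)
```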
***

### Martial Terran's Comment for the Surya Project

**Subject: A Thought on Optimizing the Spatiotemporal Input Tokenization for Future Versions**

First off, congratulations to the entire NASA and IBM team on the release of Surya. This is a landmark achievement for heliophysics and a truly impressive application of foundation models to a complex scientific domain. The model's performance on a wide range of downstream tasks is a testament to the quality of the architecture and the immense effort that went into it.

While studying the architecture and the nature of the SDO data, I had a thought regarding the input tokenization process that I wanted to share for potential consideration in future research or model versions.

#### 1. General Observations on the Current Method

The current approach, which involves patching the full 4096x4096 images and flattening the time dimension (`T`) with the channel dimension (`C`), is a robust and proven technique in computer vision. However, for solar data, it presents two main challenges:

* **Computational Inefficiency:** A significant portion (~21.5%) of the input tokens correspond to the black, information-less space surrounding the solar disk. These tokens are processed through all layers of the transformer, consuming substantial memory and FLOPs without contributing to the representation.
* **Implicit Learning of Predictable Motion:** By treating the time sequence as channels, the model is tasked with learning the sun's differential rotation from scratch. While it is clearly capable of doing so, this forces the model to expend a large part of its capacity on learning a predictable, kinematic motion rather than focusing on the more complex and scientifically interesting *intrinsic evolution* of solar features (e.g., the emergence, shearing, and decay of active regions).

#### 2. Exploring Alternative Methods

Several strategies could address these points. A simple first step would be **Masked Tokenization**, where tokens corresponding to the void are identified and removed after the embedding layer and re-inserted before the decoder. This would reclaim the computational cost with no loss of information.

More advanced methods could involve **adaptive or polar-coordinate patching** to better align with the sun's geometry. However, a more physics-informed approach could potentially yield even greater benefits.

#### 3. Proposal: The Heliocentric Adaptive-Rotation Tokenization (HART) Method

I would like to propose a novel input processing method, which I'll call **Heliocentric Adaptive-Rotation Tokenization (HART)**, designed to transform the input sequence into the sun's own co-rotating physical reference frame before tokenization.

The core principle is to **normalize out the effect of differential rotation**, presenting the transformer with a sequence of images in which solar features appear stationary unless they are intrinsically evolving.

This method can be broken down into three stages:

**Stage 1: Pre-computation of a Spherical Band Index Map**

This is a one-time setup step.

* First, a master 4096x4096 map is created in which each pixel coordinate `(x, y)` within the solar disk is mapped to its true Carrington heliographic latitude `φ`.
* Crucially, this must account for the fact that **iso-latitude bands project as *curves*** onto the 2D image plane, not as straight horizontal lines (except for the equator).
* From this, a `BandIndexMap` is generated, where each pixel is assigned an integer index corresponding to its curved iso-latitude band.
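A minimal sketch of such a `BandIndexMap` builder. It assumes an orthographic disk centered in the frame with solar north up; `b0_deg` (the observer's heliographic latitude) is the tilt that makes iso-latitude bands project as curves. A production version would read B0 and the disk radius from the FITS/WCS metadata:

```python
import numpy as np

def build_band_index_map(size: int = 4096, n_bands: int = 64,
                         b0_deg: float = 7.25) -> np.ndarray:
    """One-time BandIndexMap sketch: per-pixel iso-latitude band index."""
    r = size / 2.0
    yy, xx = np.mgrid[0:size, 0:size]
    x = (xx - r + 0.5) / r                        # normalized disk coords
    y = (r - yy - 0.5) / r                        # flip so +y = solar north
    rho2 = x**2 + y**2
    on_disk = rho2 <= 1.0
    z = np.sqrt(np.clip(1.0 - rho2, 0.0, None))   # depth on the unit sphere
    b0 = np.radians(b0_deg)
    # Heliographic latitude with observer tilt B0: sin(lat) = y*cos(B0) + z*sin(B0)
    sin_lat = y * np.cos(b0) + z * np.sin(b0)
    lat = np.degrees(np.arcsin(np.clip(sin_lat, -1.0, 1.0)))
    band = np.full((size, size), -1, dtype=np.int32)   # -1 marks off-disk pixels
    edges = np.linspace(-90.0, 90.0, n_bands + 1)[1:-1]
    band[on_disk] = np.digitize(lat[on_disk], edges)
    return band
```

With `b0_deg != 0`, pixels at the same image row get different band indices, reproducing the curved-band behavior described above.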

**Stage 2: Dynamic, Per-Band Image Warping**

For each time sequence `[I_t0, I_t1, ...]`, this stage aligns all images to the `I_t0` reference frame.

* For each subsequent image `I_ti`, a 2D warping flow field is calculated.
* For each band index `j`, the appropriate rotation angle is calculated from its latitude and the time delta `Δt`.
* The flow field maps each pixel in the target image to its source location in the original image by applying this rotation in spherical space and re-projecting back to 2D.
* This complex, non-linear warp can be applied efficiently using an operation like PyTorch's `F.grid_sample(I_ti, flow_field)`, yielding an aligned image `I_aligned_ti`.
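A simplified sketch of this stage, assuming B0 = 0 (so latitude is simply asin(y)) and reusing the illustrative three-term rotation coefficients; pixels rotated past the limb are not handled, which a production version would mask:

```python
import torch
import torch.nn.functional as F

def derotate_to_t0(img_ti: torch.Tensor, dt_days: float) -> torch.Tensor:
    """Warp (B, C, H, W) image I_ti back into the I_t0 co-rotating frame."""
    B, C, H, W = img_ti.shape
    ys = 1.0 - (2.0 * torch.arange(H) + 1.0) / H     # pixel centers, +y = north/top
    xs = (2.0 * torch.arange(W) + 1.0) / W - 1.0
    y, x = torch.meshgrid(ys, xs, indexing="ij")
    on_disk = x**2 + y**2 < 1.0
    lat = torch.asin(y.clamp(-1.0, 1.0))             # B0 = 0 simplification
    cos_lat = torch.cos(lat).clamp_min(1e-6)
    lon = torch.asin((x / cos_lat).clamp(-1.0, 1.0))
    s2 = torch.sin(lat) ** 2
    omega = torch.deg2rad(14.713 - 2.396 * s2 - 1.787 * s2**2)  # rad/day
    lon_src = lon + omega * dt_days                  # where the feature was at t_i
    x_src = torch.where(on_disk, cos_lat * torch.sin(lon_src), x)
    # grid_sample convention: grid y = -1 samples the top row, so negate solar y.
    grid = torch.stack([x_src, -y], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    return F.grid_sample(img_ti, grid, align_corners=False, padding_mode="zeros")
```

At `dt_days = 0` the warp reduces to the identity; for positive `dt_days` it shifts each latitude band by its own rotation angle.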

**Stage 3: Tokenization of the Aligned Sequence**

* With a sequence of aligned images, a standard Cartesian patcher can be applied. A patch at grid location `(i, j)` now represents the same physical, co-rotating region across the entire time sequence `T`.
* A positional embedding based on the spherical Carrington coordinates `(φ, λ)` of each patch's center would still be highly beneficial, giving the model an absolute sense of physical location.
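One way to sketch such an embedding is a standard sin/cos frequency ladder applied to the per-patch angles (an illustrative scheme, not Surya's actual positional encoding):

```python
import torch

def spherical_pos_embed(lat: torch.Tensor, lon: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Sinusoidal embedding of heliographic patch-center angles (radians).

    lat, lon: shape (N,). Returns (N, dim) using dim/4 frequencies per angle;
    integer (power-of-two) frequencies keep the embedding periodic in longitude.
    """
    n_freq = dim // 4
    freqs = 2.0 ** torch.arange(n_freq, dtype=lat.dtype)   # 1, 2, 4, ... cycles
    ang_lat = lat[:, None] * freqs                         # (N, n_freq)
    ang_lon = lon[:, None] * freqs
    return torch.cat([ang_lat.sin(), ang_lat.cos(),
                      ang_lon.sin(), ang_lon.cos()], dim=-1)
```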

By adopting this method, the model would be freed from learning kinematics and could dedicate its full representational power to modeling the physics of solar evolution. This could potentially lead to improved performance, faster convergence, or a more sample-efficient model.

Thank you again for this incredible open-source contribution. I hope this suggestion is seen as a constructive and interesting idea for the future of this exciting project.

Best regards,

"Martial Terran"
https://huggingface.co/MartialTerran

There are also other, simpler optimization methods that could be evaluated. By analyzing the provided disclosures and supplementing them with heliophysics knowledge, we can deduce the input data format and then propose several powerful optimizations for the tokenization process.

***

### Analysis of the Input Data Format (Based on Disclosure)

#### **Shape and Resolution:**

* **Dimensions:** The README explicitly states the model is trained on **"Native 4096×4096 pixel resolution"** data. This means the input images `(H, W)` are **4096x4096**. They are square.
* **Downsampling:** The script does **not** downsample the images *before* tokenization. The `PatchEmbed3D` module takes the full `img_size` (4096) and divides it into patches using a 2D convolution. The "downsampling" happens implicitly through the act of patching (e.g., converting a 16x16 pixel area into a single token).
#### **Centering and Orientation:**

* **Centering:** The sample image (`surya_model_validation.png`) clearly shows the sun as a circle positioned in the **dead center of the square frame**.
* **Method of Centering:** The README's "Data Processing Pipeline" section is key here. It mentions **"Spatial registration: Uniform 0.6"/pixel grid, solar north alignment."**
* This preprocessing step confirms that every image is standardized. The centering is **geometric**, based on finding the limb (the visible edge) of the sun and placing its center at the center of the 4096x4096 grid.
* It is **not** centered on a sunspot. Sunspots are transient and move due to solar rotation; using them as an anchor would make the data unstable and non-comparable over time. The model relies on a stable, geometric frame of reference.
#### **Solar Physics Context:**

* **Surface Stability:** The sun's visible surface (the photosphere) is **not stable like land**. It is a dynamic, fluid-like plasma, much more akin to Jupiter's atmosphere than to a solid body.
* **Sunspot Location:** Sunspots are temporary magnetic phenomena that emerge, evolve, and dissipate over days to weeks. Furthermore, the sun exhibits **differential rotation**: the equator rotates faster (approx. 25 days) than the poles (approx. 35 days). This means sunspots and other features drift at different speeds depending on their latitude, which is precisely why a fixed, geometric centering is essential for the model.

#### **Peripheral Space (Non-Information):**

* **Significant Waste:** A 4096x4096 square frame containing a circular sun has a substantial amount of "non-information" space. The area of the square is `4096²`, and the area of the inscribed circle is `π * (4096/2)²`. The ratio of the circle's area to the square's is `π/4 ≈ 0.785`.
* This means **approximately 21.5% of the pixels in every input image are black (the void of space)** and contain no direct information about the sun.
***

### How to Further Optimize the 2D Input Tokenization Method

Based on this analysis, the standard grid-based patching (`PatchEmbed3D`) is computationally wasteful and geometrically naive. Here are three levels of optimization, from simple to highly advanced.
#### Optimization 1: Masked Tokenization (High-Impact, Low-Complexity)

The most obvious inefficiency is that the model spends a significant amount of its computational budget creating and processing tokens for the black, empty space around the sun.

* **Concept:** After creating the initial grid of tokens, identify and remove all tokens that correspond to the peripheral void before feeding them into the transformer backbone.
* **Implementation:**
  1. Create a static binary mask of shape `(H_token, W_token)` where `1` marks a token within the solar disk and `0` marks a token in the void. This mask only needs to be computed once.
  2. In the forward pass, after the `embedding` module creates the tokens `(B, L, D)`, where `L = H_token * W_token`, apply the flattened mask to drop the useless tokens. The sequence length `L` shrinks by ~21.5%.
  3. Feed this shorter, dense sequence of tokens through the `SpectFormer` backbone. This substantially reduces the `L²` cost of the self-attention layers.
  4. Before the `unembed` (decoder) step, re-insert zero-vectors at the masked-out positions to restore the original spatial grid structure `(B, L, D)`.
  5. The decoder can then proceed as normal.
* **Benefit:** A large reduction in computational cost (FLOPs) and memory usage with zero loss of information, allowing for deeper models, larger batch sizes, or faster training.
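The steps above can be sketched as follows (the token-grid geometry and function names are illustrative, not Surya's actual modules):

```python
import torch

def make_disk_token_mask(h_tok: int, w_tok: int) -> torch.Tensor:
    """One-time boolean mask: True for tokens whose patch center lies on the disk."""
    ys = (torch.arange(h_tok) + 0.5) / h_tok * 2 - 1
    xs = (torch.arange(w_tok) + 0.5) / w_tok * 2 - 1
    y, x = torch.meshgrid(ys, xs, indexing="ij")
    return (x**2 + y**2 <= 1.0).flatten()              # (L,)

def drop_void_tokens(tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """(B, L, D) -> (B, L_disk, D): keep only on-disk tokens for the backbone."""
    return tokens[:, mask, :]

def restore_void_tokens(kept: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """(B, L_disk, D) -> (B, L, D): scatter back, zero-vectors at void positions."""
    B, _, D = kept.shape
    full = kept.new_zeros(B, mask.numel(), D)
    full[:, mask, :] = kept
    return full
```

With a 4096-pixel image and 16-pixel patches, `h_tok = w_tok = 256` and the mask drops roughly the 21.5% of tokens that fall outside the disk.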

#### Optimization 2: Adaptive and Geometry-Aware Patching (Medium Complexity)

The standard grid treats all parts of the sun equally, but active regions with sunspots are far more complex than quiet regions.

* **Concept A: Polar Coordinate Patching:** Since the data is circular and exhibits rotation, a patch grid based on polar coordinates (rings and wedges) might be a more natural fit than a Cartesian grid. This could help the convolutional patcher and relative positional encoding better capture rotational dynamics.
* **Concept B: Adaptive Quadtree Tokenization:** This is a more dynamic approach.
  1. Start with a coarse grid of large patches.
  2. For each patch, calculate a complexity metric (e.g., image variance or gradient magnitude).
  3. If a patch's complexity is above a threshold (indicating it contains an active region), recursively split it into four smaller sub-patches.
  4. This yields a non-uniform tokenization, with high resolution (many small tokens) over active regions and low resolution (few large tokens) over quiet regions.
* **Benefit:** Focuses the model's representational power on the areas that matter most, leading to a more efficient and potentially more accurate representation for the same computational cost.
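Concept B can be sketched with a small recursive splitter (the variance threshold and minimum patch size are illustrative):

```python
import numpy as np

def quadtree_patches(img: np.ndarray, min_size: int = 16, var_thresh: float = 0.01):
    """Recursive quadtree split of a square image; returns (y, x, size) leaves.

    Any patch whose pixel variance exceeds var_thresh is split into four
    sub-patches, until min_size is reached.
    """
    patches = []

    def split(y, x, size):
        block = img[y:y + size, x:x + size]
        if size > min_size and block.var() > var_thresh:
            half = size // 2
            for dy in (0, half):
                for dx in (0, half):
                    split(y + dy, x + dx, half)
        else:
            patches.append((y, x, size))

    split(0, 0, img.shape[0])
    return patches
```

On a mostly quiet image with one bright region, the leaves tile the frame exactly, with small patches clustered over the active area.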

#### Optimization 3: Spherical Projection-Aware Tokenization (Low-Impact, High-Complexity)

This is the most physically principled approach. The 2D image is a projection of a 3D sphere, which introduces significant geometric distortion, especially near the limb (the edge of the disk).

* **Concept:** Modify the tokenization and/or positional encoding to be aware of the underlying spherical geometry.
* **Implementation A (Positional Encoding):** Keep the standard grid patching, but replace the 2D Fourier positional encoding with one based on heliographic coordinates (latitude and longitude). For each patch's center pixel coordinate `(x, y)`, calculate its corresponding `(lat, lon)` on the solar sphere and use *those* values in the sine/cosine functions. This tells the model that a patch at the limb represents a larger, more distorted surface area than a patch at disk center.
* **Implementation B (Spherical Pixelation):** A more radical change would be to abandon the Cartesian grid entirely and re-project the input images onto a standard spherical grid such as **HEALPix**, which is common in astrophysics. HEALPix divides the sphere into equal-area pixels, so tokenizing on this grid would give every token equal "physical importance" and eliminate projection distortion. This would require a significant architectural change but would be the most geometrically accurate approach.
* **Benefit:** Creates a more physics-aware model that understands the true 3D nature of the sun, potentially improving forecasts for events near the solar limb, where projection effects are most severe.
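To make the equal-area idea concrete, here is a toy spherical binning in the spirit of HEALPix. Real HEALPix (e.g., via the `healpy` package) uses a more careful twelve-base-pixel scheme; this simplified version only illustrates the equal-area principle:

```python
import math

def equal_area_cell(lat_deg: float, lon_deg: float, n_rings: int = 16) -> int:
    """Toy equal-area spherical binning (simplified, not actual HEALPix).

    Rings are uniform in z = sin(latitude), so every ring covers the same
    sphere area; each ring is split into cells proportional to its
    circumference, making all cells roughly equal in area.
    """
    z = math.sin(math.radians(lat_deg))
    ring = min(int((z + 1.0) / 2.0 * n_rings), n_rings - 1)
    # Cells per ring ~ circumference at the ring's central latitude.
    z_mid = (ring + 0.5) / n_rings * 2.0 - 1.0
    n_cells = max(1, round(2 * n_rings * math.sqrt(1.0 - z_mid**2)))
    cell = int(((lon_deg % 360.0) / 360.0) * n_cells)
    # Global index: offset by the cell counts of all previous rings.
    offset = 0
    for r in range(ring):
        zm = (r + 0.5) / n_rings * 2.0 - 1.0
        offset += max(1, round(2 * n_rings * math.sqrt(1.0 - zm**2)))
    return offset + cell
```

Equatorial rings get many cells and polar rings get few, so each token covers a comparable physical area of the photosphere.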