# Video Model Training Notes

## Training Step Analysis

### What happens in a training step?

A training step processes **exactly `batch_size` samples** (not the entire dataset). Here's what happens:

**Per Training Step:**
- Processes `batch_size` videos/samples (configurable, typically 1-8)
- Uses smart batching that groups videos by resolution dimensions
- Consumes two data streams: text embeddings + video latents

**Key Points:**
- With 100 videos and batch_size=4, each step processes 4 videos
- Training runs for a fixed number of steps (not epochs)
- The dataset loops infinitely, so videos are reused across steps
- Uses ResolutionSampler to batch videos of similar dimensions together (sketched below)

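The bucketing idea behind ResolutionSampler can be sketched in a few lines of Python. This is only an illustration of the concept, not the actual sampler: the `samples` list of dicts and the `"latents"` key are assumptions made for the example.

```python
import itertools
from collections import defaultdict

def batches_by_resolution(samples, batch_size):
    """Yield batches whose latents share the same (frames, height, width)
    shape, cycling over the dataset forever."""
    buckets = defaultdict(list)
    for sample in itertools.cycle(samples):   # dataset loops infinitely
        key = tuple(sample["latents"].shape)  # temporal + spatial dimensions
        buckets[key].append(sample)
        if len(buckets[key]) == batch_size:   # a full same-resolution batch
            yield buckets.pop(key)
```
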
**Training Loop Structure:**
1. Load the next `batch_size` samples from the dataset
2. Group them by resolution (spatial + temporal dimensions)
3. Run the forward pass through the transformer (denoising)
4. Calculate the loss and update the weights
5. Increment the step counter

So if you have 100 videos and batch_size=1, step 1 processes video 1, step 2 processes video 2, and so on. When training reaches video 100, it loops back to video 1. A minimal sketch of this loop follows.
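
Everything below is an illustrative assumption rather than the trainer's real code: the toy `compute_loss` stands in for the actual denoising objective, and `batches` is any infinite iterator of same-resolution batches (steps 1-2), such as the generator sketched above.

```python
import torch
import torch.nn.functional as F

def compute_loss(model, batch):
    # Stand-in for the real denoising loss; the actual objective depends on
    # the diffusion / flow-matching formulation used for training.
    latents = torch.stack([sample["latents"] for sample in batch])
    noise = torch.randn_like(latents)
    prediction = model(latents + noise)     # toy "denoising" forward pass
    return F.mse_loss(prediction, noise)

def train(model, batches, optimizer, max_train_steps):
    # Step-based loop: it runs for a fixed number of steps, not epochs.
    for step, batch in enumerate(batches, start=1):
        loss = compute_loss(model, batch)   # step 3: forward pass + loss
        loss.backward()                     # step 4: backprop...
        optimizer.step()                    # ...and weight update
        optimizer.zero_grad()
        if step >= max_train_steps:         # step 5: the step counter, not
            break                           # dataset exhaustion, ends training
```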

## Avoiding Overfitting

For video model training, a good rule of thumb is to keep each video seen **fewer than 10-50 times** during training to avoid overfitting.

**Common thresholds:**
- **Conservative**: <10 times per video (strong generalization)
- **Moderate**: 10-50 times per video (balanced)
- **Risky**: >100 times per video (likely overfitting)

**With low learning rates (e.g., 0.00004):**
- A lower LR means each video can potentially be seen more times safely
- It is still better to err on the side of caution

**Practical calculation** (see the helper below):
- If training for 10,000 steps with batch_size=1:
  - 100 videos = 100 times each (risky)
  - 500 videos = 20 times each (moderate)
  - 1,000+ videos = <10 times each (conservative)

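All of these numbers come from one formula, `times_seen = train_steps * batch_size / num_videos`, so a tiny helper makes it easy to check any configuration:

```python
def times_seen_per_video(train_steps, batch_size, num_videos):
    """Average number of times each video is seen over a training run."""
    return train_steps * batch_size / num_videos

# Reproduces the examples above (10,000 steps, batch_size=1):
print(times_seen_per_video(10_000, 1, 100))    # 100.0 -> risky
print(times_seen_per_video(10_000, 1, 500))    # 20.0  -> moderate
print(times_seen_per_video(10_000, 1, 1_000))  # 10.0  -> the conservative border
```
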
**Early stopping indicators:**
- Training loss continues decreasing while validation loss plateaus or increases (automatable, as sketched below)
- Generated videos start looking too similar to the training examples
- Loss of diversity in the outputs
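
The first indicator is mechanical enough to automate with a patience check on the validation loss. This is a generic sketch, and the `patience` and `min_delta` defaults are assumptions to tune per run:

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for
    `patience` consecutive evaluations."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience    # evaluations to wait without improvement
        self.min_delta = min_delta  # minimum decrease that counts as progress
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss    # still improving: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1     # plateau or increase
        return self.bad_evals >= self.patience
```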

With low learning rates, staying under 20-30 times per video should be relatively safe, but <10 times is ideal for strong generalization.