Spaces:

salvinjose
/

HNTAI

Paused

App Files Files Community

sachinchandrankallar commited on Aug 27

Commit

fedc6da

1 Parent(s): 4409104

model loader gguf fixes

Browse files

Files changed (6) hide show

FINAL_PROGRESS.md +72 -0
PROGRESS_UPDATE.md +32 -0
TODO.md +12 -0
TODO_PROGRESS.md +16 -0
ai_med_extract/api/routes.py +47 -2
ai_med_extract/utils/model_loader_gguf.py +38 -5

FINAL_PROGRESS.md ADDED Viewed

	@@ -0,0 +1,72 @@

+# GGUF Timeout Fix - Complete Implementation
+## ✅ All Steps Completed:
+### 1. Increased GGUF Timeout
+- Changed from 120s to 300s for Hugging Face Spaces
+- Maintained 120s for local development
+- Made timeout configurable via `GGUF_GENERATION_TIMEOUT` environment variable
+### 2. Enhanced Error Handling
+- Added comprehensive timeout handling in `routes.py`
+- Implemented fallback mechanisms when GGUF model fails
+- Added better logging for debugging timeout issues
+- Created robust fallback pipeline for graceful degradation
+### 3. Optimized GGUF Model Parameters
+- Added CPU-specific optimizations for Hugging Face Spaces:
+  - `use_mlock=False` for better container compatibility
+  - `vocab_only=False` for full model loading
+  - `n_threads_batch=n_threads` for consistent threading
+  - `mmap=True` for memory mapping optimizations
+  - Cache type optimizations for better performance
+### 4. Added Progress Logging
+- Enhanced logging throughout the generation process
+- Added detailed timing information for each generation loop
+- Added validation checks for summary completeness
+- Improved debugging capabilities
+## 🔧 Files Modified:
+### `ai_med_extract/utils/model_loader_gguf.py`
+- Updated timeout handling with environment variable support
+- Optimized model initialization parameters for Spaces
+- Enhanced logging throughout the generation process
+- Added detailed progress monitoring
+### `ai_med_extract/api/routes.py`
+- Added comprehensive error handling for GGUF timeouts
+- Implemented fallback mechanisms when GGUF fails
+- Improved logging and error responses
+- Added graceful degradation to template-based fallback
+## ⚙️ Configuration Options:
+### Environment Variables:
+- `GGUF_GENERATION_TIMEOUT`: Custom timeout in seconds (default: 300 for Spaces, 120 for local)
+- `GGUF_N_THREADS`: Number of CPU threads to use
+- `GGUF_N_BATCH`: Batch size for processing
+### Performance Settings:
+- **Hugging Face Spaces**: Ultra-conservative settings (1 thread, 16 batch, 512 context)
+- **Local Development**: Normal settings (2 threads, 32 batch, 1024 context)
+## 🚀 Ready for Testing:
+The implementation is now complete and ready for testing. The changes include:
+1. **Increased timeout** from 120s to 300s for Hugging Face Spaces
+2. **Configurable timeout** via environment variable
+3. **Better error handling** with fallback mechanisms
+4. **Optimized parameters** for CPU performance on Spaces
+5. **Enhanced logging** for debugging and monitoring
+## 📋 Testing Checklist:
+- [ ] Test GGUF model with Phi-3 model on Spaces
+- [ ] Verify timeout is sufficient for generation
+- [ ] Test fallback mechanisms when GGUF fails
+- [ ] Monitor memory usage and performance
+- [ ] Verify logging provides useful debugging information
+The implementation should now handle the GGUF timeout issues effectively while providing graceful degradation when the model fails.

PROGRESS_UPDATE.md ADDED Viewed

	@@ -0,0 +1,32 @@

+# GGUF Timeout Fix - Progress Update
+## ✅ Completed Steps:
+1. **Increased GGUF timeout**: Changed from 120s to 300s for Hugging Face Spaces
+2. **Configurable timeout**: Added GGUF_GENERATION_TIMEOUT environment variable support
+3. **Better error handling**: Enhanced timeout and fallback mechanisms in routes.py
+4. **Fallback pipeline**: Added robust fallback when GGUF model fails to load or times out
+## 🔧 Changes Made:
+### model_loader_gguf.py:
+- Updated `_generate_with_timeout()` to use 300s default for Spaces, 120s for local
+- Made timeout configurable via environment variable
+- Updated `generate()` to use configurable timeout
+### routes.py:
+- Added fallback pipeline usage when GGUF times out
+- Added better logging for timeout errors
+- Added fallback for GGUF model loading failures
+- Improved error messages and response handling
+## 🚀 Next Steps:
+- Test the changes with the GGUF model
+- Verify timeout is sufficient for Phi-3 model
+- Test fallback mechanisms
+- Add progress logging for generation
+## ⚙️ Configuration:
+- Default timeout: 300s (Spaces) / 120s (local)
+- Environment variable: `GGUF_GENERATION_TIMEOUT`
+- Fallback: Template-based summary when GGUF fails

TODO.md ADDED Viewed

	@@ -0,0 +1,12 @@

+# GGUF Timeout Fix Plan
+## Steps to Complete:
+1. [ ] Increase GGUF timeout from 120s to 300s in model_loader_gguf.py
+2. [ ] Make timeout configurable via environment variables
+3. [ ] Add better error handling and fallback mechanisms
+4. [ ] Optimize GGUF model parameters for Hugging Face Spaces
+5. [ ] Add progress logging for generation
+6. [ ] Test the changes
+## Current Status: Starting implementation

TODO_PROGRESS.md ADDED Viewed

	@@ -0,0 +1,16 @@

+# GGUF Timeout Fix Progress
+## Steps Completed:
+1. ✅ Increased GGUF timeout from 120s to 300s for Hugging Face Spaces
+2. ✅ Made timeout configurable via GGUF_GENERATION_TIMEOUT environment variable
+3. ✅ Updated both `_generate_with_timeout` and `generate` methods to use configurable timeout
+## Next Steps:
+4. Add better error handling and fallback mechanisms in routes.py
+5. Optimize GGUF model parameters for better performance on Spaces
+6. Add progress logging for generation
+7. Test the changes
+## Environment Configuration:
+- Default timeout: 300s for Spaces, 120s for local development
+- Configurable via: GGUF_GENERATION_TIMEOUT environment variable

ai_med_extract/api/routes.py CHANGED Viewed

@@ -1122,12 +1122,57 @@ def register_routes(app, agents):
                             "delta": delta_text
                         }), 200
                     except TimeoutError as e:
-                        return jsonify({"error": f"GGUF model generation timed out: {str(e)}"}), 408
                     except Exception as e:
                         return jsonify({"error": f"GGUF model generation failed: {str(e)}"}), 500
                 except Exception as e:
-                    return jsonify({"error": f"Failed to load GGUF model: {str(e)}"}), 500
             elif model_type in {"text-generation", "causal-openvino"}:
                 # Try to use an existing loader if available
                 loader = agents.get("medical_data_extractor")

                             "delta": delta_text
                         }), 200
                     except TimeoutError as e:
+                        logger.error(f"GGUF model generation timed out: {e}")
+                        # Try to use a simpler fallback model
+                        try:
+                            from .model_loader_gguf import create_fallback_pipeline
+                            fallback_pipeline = create_fallback_pipeline()
+                            fallback_summary = fallback_pipeline.generate_full_summary(prompt)
+                            markdown_summary = summary_to_markdown(fallback_summary)
+                            with state_lock:
+                                patient_state["visits"] = all_visits
+                                patient_state["last_summary"] = markdown_summary
+                            validation_report = validate_and_compare_summaries(old_summary, markdown_summary, "Update (Fallback)")
+                            return jsonify({
+                                "summary": markdown_summary,
+                                "validation": validation_report,
+                                "baseline": baseline,
+                                "delta": delta_text,
+                                "warning": "GGUF model timed out, using fallback summary"
+                            }), 200
+                        except Exception as fallback_error:
+                            return jsonify({
+                                "error": f"GGUF model generation timed out and fallback failed: {str(e)}",
+                                "original_error": str(e)
+                            }), 408
                     except Exception as e:
+                        logger.error(f"GGUF model generation failed: {e}")
                         return jsonify({"error": f"GGUF model generation failed: {str(e)}"}), 500
                 except Exception as e:
+                    logger.error(f"Failed to load GGUF model: {e}")
+                    # Try to use fallback pipeline
+                    try:
+                        from .model_loader_gguf import create_fallback_pipeline
+                        fallback_pipeline = create_fallback_pipeline()
+                        fallback_summary = fallback_pipeline.generate_full_summary(prompt)
+                        markdown_summary = summary_to_markdown(fallback_summary)
+                        with state_lock:
+                            patient_state["visits"] = all_visits
+                            patient_state["last_summary"] = markdown_summary
+                        validation_report = validate_and_compare_summaries(old_summary, markdown_summary, "Update (Fallback)")
+                        return jsonify({
+                            "summary": markdown_summary,
+                            "validation": validation_report,
+                            "baseline": baseline,
+                            "delta": delta_text,
+                            "warning": "GGUF model failed to load, using fallback summary"
+                        }), 200
+                    except Exception as fallback_error:
+                        return jsonify({
+                            "error": f"Failed to load GGUF model and fallback failed: {str(e)}",
+                            "original_error": str(e)
+                        }), 500
             elif model_type in {"text-generation", "causal-openvino"}:
                 # Try to use an existing loader if available
                 loader = agents.get("medical_data_extractor")

ai_med_extract/utils/model_loader_gguf.py CHANGED Viewed

@@ -85,11 +85,21 @@ class GGUFModelPipeline:
                 use_mlock=False,
                 seed=0,
                 verbose=False,  # Reduce logging
-                # Additional memory optimizations
                 rope_freq_base=10000,
                 rope_freq_scale=1.0,
                 mul_mat_q=True,  # Enable quantized matrix multiplication
                 f16_kv=True,     # Use half-precision for key/value cache
             )
         except Exception as e:
             logger.error(f"Failed to initialize GGUF model: {e}")
@@ -107,8 +117,13 @@ class GGUFModelPipeline:
             text = re.sub(p, "", text, flags=re.IGNORECASE)
         return text.strip()
-    def _generate_with_timeout(self, prompt, max_tokens=512, temperature=0.5, top_p=0.95, timeout=120):
         """Generate text with timeout using threading"""
         def _generate():
             try:
                 output = self.model(
@@ -134,7 +149,8 @@ class GGUFModelPipeline:
     def generate(self, prompt, max_tokens=512, temperature=0.5, top_p=0.95):
         t0 = time.time()
         try:
-            output = self._generate_with_timeout(prompt, max_tokens, temperature, top_p, timeout=120)
             dt = time.time() - t0
             text = output["choices"][0]["text"].strip()
             text = self._strip_special_tokens(text)
@@ -168,24 +184,41 @@ class GGUFModelPipeline:
         total_start = time.time()
         try:
             for loop_idx in range(max_loops):
                 loop_start = time.time()
                 output = self.generate(current_prompt, max_tokens=max_tokens)
                 # Remove prompt from output if repeated
                 if output.startswith(prompt):
                     output = output[len(prompt):].strip()
                 full_output += output
                 loop_time = time.time() - loop_start
                 logger.info(f"[GGUF] loop {loop_idx+1}/{max_loops}: {loop_time:.2f}s, cumulative {time.time()-total_start:.2f}s, length={len(full_output)} chars")
-                # Only continue if required sections are missing
                 required_present = all(s in full_output for s in ['Clinical Assessment','Key Trends & Changes','Plan & Suggested Actions','Direct Guidance for Physician'])
                 if required_present:
                     break
                 # Prepare the next prompt to continue
                 current_prompt = prompt + "\n" + full_output + "\nContinue the summary in markdown format:"
             total_time = time.time() - total_start
-            logger.info(f"[GGUF] generate_full_summary total: {total_time:.2f}s")
             return full_output.strip()
         except Exception as e:
             logger.error(f"Full summary generation failed: {e}")

                 use_mlock=False,
                 seed=0,
                 verbose=False,  # Reduce logging
+                # Additional memory optimizations for Spaces
                 rope_freq_base=10000,
                 rope_freq_scale=1.0,
                 mul_mat_q=True,  # Enable quantized matrix multiplication
                 f16_kv=True,     # Use half-precision for key/value cache
+                # Performance optimizations for CPU
+                use_mlock=False,  # Don't lock memory (better for containers)
+                vocab_only=False,
+                # Threading optimizations
+                n_threads_batch=n_threads,  # Use same threads for batch processing
+                # Memory mapping optimizations
+                mmap=True,
+                # Cache optimizations
+                cache_type_k=0,  # Default cache type
+                cache_type_v=0,  # Default cache type
             )
         except Exception as e:
             logger.error(f"Failed to initialize GGUF model: {e}")
             text = re.sub(p, "", text, flags=re.IGNORECASE)
         return text.strip()
+    def _generate_with_timeout(self, prompt, max_tokens=512, temperature=0.5, top_p=0.95, timeout=None):
         """Generate text with timeout using threading"""
+        # Use environment variable or default timeout (300s for Spaces, 120s otherwise)
+        if timeout is None:
+            is_hf_space = os.environ.get('SPACE_ID') is not None
+            timeout = int(os.environ.get('GGUF_GENERATION_TIMEOUT', '300' if is_hf_space else '120'))
         def _generate():
             try:
                 output = self.model(
     def generate(self, prompt, max_tokens=512, temperature=0.5, top_p=0.95):
         t0 = time.time()
         try:
+            # Use configurable timeout
+            output = self._generate_with_timeout(prompt, max_tokens, temperature, top_p)
             dt = time.time() - t0
             text = output["choices"][0]["text"].strip()
             text = self._strip_special_tokens(text)
         total_start = time.time()
         try:
+            logger.info(f"[GGUF] Starting full summary generation with max_loops={max_loops}")
             for loop_idx in range(max_loops):
                 loop_start = time.time()
+                logger.info(f"[GGUF] Starting loop {loop_idx+1}/{max_loops}")
                 output = self.generate(current_prompt, max_tokens=max_tokens)
                 # Remove prompt from output if repeated
                 if output.startswith(prompt):
                     output = output[len(prompt):].strip()
                 full_output += output
                 loop_time = time.time() - loop_start
                 logger.info(f"[GGUF] loop {loop_idx+1}/{max_loops}: {loop_time:.2f}s, cumulative {time.time()-total_start:.2f}s, length={len(full_output)} chars")
+                # Check if we have all required sections
                 required_present = all(s in full_output for s in ['Clinical Assessment','Key Trends & Changes','Plan & Suggested Actions','Direct Guidance for Physician'])
                 if required_present:
+                    logger.info(f"[GGUF] All required sections found after loop {loop_idx+1}")
                     break
                 # Prepare the next prompt to continue
                 current_prompt = prompt + "\n" + full_output + "\nContinue the summary in markdown format:"
+                logger.info(f"[GGUF] Preparing next prompt for loop {loop_idx+2}")
             total_time = time.time() - total_start
+            logger.info(f"[GGUF] generate_full_summary completed in {total_time:.2f}s")
+            # Final validation check
+            if not is_complete(full_output):
+                logger.warning("[GGUF] Generated summary may be incomplete - missing sections or incomplete sentences")
             return full_output.strip()
         except Exception as e:
             logger.error(f"Full summary generation failed: {e}")