# DeepSeek-OCR on Google Colab

This notebook sets up and runs the DeepSeek-OCR model for optical character recognition.

**Requirements:**
- GPU Runtime (T4 or better recommended)
- ~15-20 minutes setup time

**Based on:** https://github.com/deepseek-ai/DeepSeek-OCR

## 1. Environment Setup and GPU Check

In [6]:
# Check GPU availability
!nvidia-smi

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

Tue Oct 21 13:08:54 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   41C    P8             16W /   72W |       3MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                
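The same GPU facts can be read programmatically instead of from the table. The sketch below assumes `nvidia-smi` is on the path and uses its `--query-gpu` CSV mode; `parse_gpu_csv` and `query_gpus` are hypothetical helper names, not part of the DeepSeek-OCR repository.

```python
import subprocess


def parse_gpu_csv(line: str) -> tuple[str, int]:
    """Parse one 'name, memory.total' CSV row such as 'NVIDIA L4, 23034 MiB'."""
    name, memory = (field.strip() for field in line.rsplit(",", 1))
    value, unit = memory.split()
    if unit != "MiB":
        raise ValueError(f"unexpected memory unit: {unit!r}")
    return name, int(value)


def query_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_csv(row) for row in out.splitlines() if row.strip()]
```

Calling `query_gpus()` on the runtime shown above would be expected to yield one entry for the L4; the parser itself can be tried on any CSV row without a GPU.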

## 2. Clone Repository

In [2]:
# Clone the DeepSeek-OCR repository
!git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
%cd DeepSeek-OCR

Cloning into 'DeepSeek-OCR'...
remote: Enumerating objects: 34, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 34 (delta 0), reused 3 (delta 0), pack-reused 30 (from 1)
Receiving objects: 100% (34/34), 7.78 MiB | 17.63 MiB/s, done.
Resolving deltas: 100% (1/1), done.
/content/DeepSeek-OCR


## 3. Install Dependencies

Installing PyTorch, transformers, and other required packages.

In [3]:
# Install PyTorch from the CUDA 11.8 (cu118) wheel index.
# Note: Colab usually ships with PyTorch preinstalled. The driver reported by nvidia-smi above
# supports CUDA 12.4, and a newer driver can run builds made for an older CUDA version.
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118


In [4]:
# Install requirements from the repository
!pip install -r requirements.txt

Collecting transformers==4.46.3 (from -r requirements.txt (line 1))
  Downloading transformers-4.46.3-py3-none-any.whl.metadata (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.1/44.1 kB 3.8 MB/s eta 0:00:00
Collecting tokenizers==0.20.3 (from -r requirements.txt (line 2))
  Downloading tokenizers-0.20.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting PyMuPDF (from -r requirements.txt (line 3))
  Downloading pymupdf-1.26.5-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Collecting img2pdf (from -r requirements.txt (line 4))
  Downloading img2pdf-0.6.1.tar.gz (106 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 106.5/106.5 kB 11.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting addict (from -r requi
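Since the log above shows `requirements.txt` pinning exact versions (`transformers==4.46.3`, `tokenizers==0.20.3`), it can be useful to confirm that the pins actually took effect after later installs. The following is a minimal sketch using only the standard library; `parse_pins` and `find_mismatches` are hypothetical helper names.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict[str, str]:
    """Collect 'name==version' lines; unpinned lines such as 'PyMuPDF' are skipped."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, get_version=metadata.version):
    """Return {name: (wanted, installed_or_None)} for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

In the notebook, `find_mismatches(parse_pins(open("requirements.txt").read()))` returning an empty dict would indicate that every pinned package is installed at the requested version.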

In [9]:
# Install flash-attention (this may take 5-10 minutes to compile)
!pip install flash-attn==2.7.3 --no-build-isolation

Collecting flash-attn==2.7.3
  Downloading flash_attn-2.7.3.tar.gz (3.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 50.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... done
  Created wheel for flash-attn: filename=flash_attn-2.7.3-cp312-cp312-linux_x86_64.whl size=414494788 sha256=567bddcae6f7c133fd964bed9988926fe7aabaddb58bf62a744b2f782a7d4269
  Stored in directory: /root/.cache/pip/wheels/f6/ba/3a/e5622e4a21e0735b65d5f7a0aca41c83467aaf2122031d214e
Successfully built flash-attn
Installing collected packages: flash-attn
Successfully installed flash-attn-2.7

## 4. Upload Test Image

Upload your Capture.PNG file here.

In [None]:
from google.colab import files
from IPython.display import Image, display
import os

# Upload the image
print("Please upload your Capture.PNG file:")
uploaded = files.upload()

# Get the uploaded filename
image_path = list(uploaded.keys())[0]
print(f"\nUploaded file: {image_path}")

# Display the uploaded image
print("\nPreview of uploaded image:")
display(Image(filename=image_path))
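The upload cell takes the first key of `uploaded` without checking it. A small guard like the sketch below can fail early when nothing was uploaded or when the file is not an image; `pick_image` and the accepted suffix set are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: the image formats worth accepting in this notebook.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def pick_image(uploaded: dict) -> str:
    """Return the first uploaded file name with an image suffix, else raise ValueError."""
    if not uploaded:
        raise ValueError("no file was uploaded")
    for name in uploaded:
        if Path(name).suffix.lower() in IMAGE_SUFFIXES:
            return name
    raise ValueError(f"no image among uploads: {sorted(uploaded)}")
```

With this helper, `image_path = pick_image(uploaded)` could stand in for `list(uploaded.keys())[0]`.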

In [12]:
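# Troubleshooting attempt: retry the flash-attention install against the PyTorch cu121 wheel index.
# Note: that index is intended for PyTorch packages; if pip finds no flash-attn wheel there,
# the source build from the earlier install cell is the one that remains in use.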
!pip install flash-attn==2.7.3 --no-build-isolation --index-url https://download.pytorch.org/whl/cu121

Looking in indexes: https://download.pytorch.org/whl/cu121


## 5. Load DeepSeek-OCR Model

This will download the model from HuggingFace (may take a few minutes).

In [13]:
from transformers import AutoModel, AutoTokenizer
import torch
import os

print("Loading DeepSeek-OCR model...")
print("This may take several minutes on first run...\n")

os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Removing attn_implementation='flash_attention_2' as a troubleshooting step
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

print("Model loaded successfully!")
print(f"Model device: {next(model.parameters()).device}")
print(f"Model dtype: {next(model.parameters()).dtype}")

Loading DeepSeek-OCR model...
This may take several minutes on first run...



You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR. This is not supported for all configurations of models and can yield errors.
Some weights of DeepseekOCRForCausalLM were not initialized from the model checkpoint at deepseek-ai/DeepSeek-OCR and are newly initialized: ['model.vision_model.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded successfully!
Model device: cuda:0
Model dtype: torch.bfloat16


## 6. Run OCR Inference

In [19]:
from PIL import Image
import time
import os
import torch

# Load the image (already loaded in a previous cell, but keeping this for clarity)
# img = Image.open(image_path)
# print(f"Image size: {img.size}")
# print(f"Image mode: {img.mode}\n")

# Set CUDA device (already set in model loading, but keeping for clarity)
# os.environ["CUDA_VISIBLE_DEVICES"] = '0'

print("Running OCR inference using model.infer...\n")
start_time = time.time()

# Define prompt and output path
# prompt = "<image>\nFree OCR. "
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
output_path = '/content/ocr_output' # Define an output directory

# Create the output directory if it doesn't exist
os.makedirs(output_path, exist_ok=True)

# Run inference using the infer method
with torch.no_grad():
    # infer(self, tokenizer, prompt='', image_file='', output_path = ' ', base_size = 1024, image_size = 640, crop_mode = True, test_compress = False, save_results = False):

    # Tiny: base_size = 512, image_size = 512, crop_mode = False
    # Small: base_size = 640, image_size = 640, crop_mode = False
    # Base: base_size = 1024, image_size = 1024, crop_mode = False
    # Large: base_size = 1280, image_size = 1280, crop_mode = False

    # Gundam: base_size = 1024, image_size = 640, crop_mode = True

    res = model.infer(tokenizer,
                      prompt=prompt,
                      image_file=image_path, # Use the uploaded image path
                      output_path=output_path,
                      base_size=1024,
                      image_size=640,
                      crop_mode=True,
                      save_results=True,
                      test_compress=True)

end_time = time.time()

print(f"Inference completed in {end_time - start_time:.2f} seconds\n")
print("=" * 80)
print("OCR RESULT:")
print("=" * 80)
# In this run, infer() streams the recognized text to stdout while generating,
# and `res` prints as None (see the output below), so do not rely on the return value here.
print(res)
print("=" * 80)

# With save_results=True, infer() is expected to write its result files under output_path.
# List that directory to confirm what was saved before adding any download logic.

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Running OCR inference using model.infer...

BASE:  torch.Size([1, 256, 1280])
PATCHES:  torch.Size([4, 100, 1280])
<|ref|>text<|/ref|><|det|>[[62, 31, 483, 171]]<|/det|>
·We assess a wide range of state-of-the-art LLMs for the first time and empirically show that they exhibit significant patterns of bias related to non-binary gender representations, leaving room for future improvement.  

<|ref|>sub_title<|/ref|><|det|>[[62, 198, 235, 225]]<|/det|>
## 2 Related Work  

<|ref|>sub_title<|/ref|><|det|>[[62, 246, 373, 272]]<|/det|>
### 2.1 Binary Gender Bias in LLMs  

<|ref|>text<|/ref|><|det|>[[62, 283, 485, 992]]<|/det|>
Research on gender bias in artificial intelligence, especially in large language models (LLMs), has predominantly centered on binary gender categories, often reinforcing conventional stereotypes while overlooking the complexities of gender diversity (Blodgett et al., 2020; Nadeem et al., 2021; Schramowski et al., 2022; Stanovsky et al., 2019). Studies such as Bolukbasi

image: 0it [00:00, ?it/s]
other: 100%|██████████| 7/7 [00:00<00:00, 41352.29it/s]

Inference completed in 44.24 seconds

OCR RESULT:
None
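Because `res` is `None` in this run, the tagged text shown above is the usable output. The sketch below parses that `<|ref|>…<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>` layout into labelled blocks. It is based only on the pattern visible in the log, not on a documented format; `Block` and `parse_grounded` are hypothetical names.

```python
import re
from dataclasses import dataclass

# One tag line: a label followed by one or more bounding boxes.
BLOCK_RE = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|><\|det\|>(?P<boxes>\[\[.*?\]\])<\|/det\|>\n?"
)


@dataclass
class Block:
    label: str   # e.g. "text" or "sub_title"
    boxes: list  # list of [x1, y1, x2, y2] integer lists
    text: str    # content between this tag line and the next one


def parse_grounded(output: str) -> list:
    """Split grounded OCR output into blocks, keeping label, boxes and text together."""
    matches = list(BLOCK_RE.finditer(output))
    blocks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(output)
        boxes = [
            [int(n) for n in re.findall(r"-?\d+", box)]
            for box in re.findall(r"\[([^\[\]]*)\]", m["boxes"])
        ]
        blocks.append(Block(m["label"], boxes, output[m.end():end].strip()))
    return blocks
```

Joining `block.text` for every block gives plain Markdown without the grounding tags; the boxes are kept in whatever coordinate system the model emits.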





## 8. Batch Processing (Optional)

Process multiple images at once.

In [27]:
from google.colab import files  # needed for files.upload() if the upload cell was not run
from PIL import Image
import time
import os
import torch

# Upload multiple images
print("Upload multiple images for batch processing:")
uploaded_files = files.upload()

results = {}
output_path = '/content/batch_ocr_output' # Define a directory for batch output

# Create the output directory if it doesn't exist
os.makedirs(output_path, exist_ok=True)

for filename in uploaded_files.keys():
    print(f"\nProcessing {filename}...")

    try:
        # Construct the full image path in the current working directory
        image_path = os.path.join(os.getcwd(), filename)

        # Define prompt (adjust based on DeepSeek-OCR's expected format)
        prompt = "<image>\n<|grounding|>Convert the document to markdown. "

        with torch.no_grad():
            # Use the infer method for batch processing
            res = model.infer(tokenizer,
                              prompt=prompt,
                              image_file=image_path, # Use the uploaded image path
                              output_path=output_path,
                              base_size=1024,
                              image_size=640,
                              crop_mode=True,
                              save_results=True,
                              test_compress=True)

            # The infer method with save_results=True saves the output to output_path
            # You might need to adjust how to retrieve or confirm the saved result
            # For this example, we'll just note that it was processed.
            results[filename] = f"Processed. Output saved to {output_path}"
            print(f"✓ {filename} processed successfully. Output saved to {output_path}")

    except Exception as e:
        print(f"✗ Error processing {filename}: {str(e)}")
        results[filename] = f"Error: {str(e)}"

# Display all results (or confirmation of processing)
print("\n" + "=" * 80)
print("BATCH PROCESSING SUMMARY")
print("=" * 80)

for filename, result in results.items():
    print(f"\n--- {filename} ---")
    print(result)
    print()

print(f"\nDetailed results are saved in the directory: {output_path}")

# Note: Downloading the batch results as a single file might require
# zipping the output directory or iterating through saved files.
# This part is commented out as model.infer handles saving.
# with open('batch_results.txt', 'w', encoding='utf-8') as f:
#     for filename, result in results.items():
#         f.write(f"{'='*80}\n")
#         f.write(f"File: {filename}\n")
#         f.write(f"{'='*80}\n")
#         f.write(result)
#         f.write(f"\n\n")
#
# files.download('batch_results.txt')

Upload multiple images for batch processing:


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Saving Capture.jpg to Capture (5).jpg
Saving Capture1.jpg to Capture1 (1).jpg

Processing Capture (5).jpg...
BASE:  torch.Size([1, 256, 1280])
PATCHES:  torch.Size([4, 100, 1280])
<|ref|>text<|/ref|><|det|>[[62, 31, 483, 171]]<|/det|>
·We assess a wide range of state-of-the-art LLMs for the first time and empirically show that they exhibit significant patterns of bias related to non-binary gender representations, leaving room for future improvement.  

<|ref|>sub_title<|/ref|><|det|>[[62, 198, 235, 225]]<|/det|>
## 2 Related Work  

<|ref|>sub_title<|/ref|><|det|>[[62, 246, 373, 272]]<|/det|>
### 2.1 Binary Gender Bias in LLMs  

<|ref|>text<|/ref|><|det|>[[62, 283, 485, 992]]<|/det|>
Research on gender bias in artificial intelligence, especially in large language models (LLMs), has predominantly centered on binary gender categories, often reinforcing conventional stereotypes while overlooking the complexities of gender diversity (Blodgett et al., 2020; Nadeem et al., 2021; Schramowski

image: 0it [00:00, ?it/s]
other: 100%|██████████| 7/7 [00:00<00:00, 68279.37it/s]

✓ Capture (5).jpg processed successfully. Output saved to /content/batch_ocr_output

Processing Capture1 (1).jpg...



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


BASE:  torch.Size([1, 256, 1280])
PATCHES:  torch.Size([4, 100, 1280])
<|ref|>text<|/ref|><|det|>[[20, 0, 475, 338]]<|/det|>
Retrieval Augmented Generation system for LLM agents. SCMRAG introduces a novel paradigm that moves beyond the static retrieval methods of traditional RAG systems by integrating a dynamic, LLM- assisted knowledge graph for information retrieval. This knowledge graph evolves with the system, updating and refining itself based on the SCMRAG's agent driven interactions and query- answer pair generations. Crucially, SCMRAG also includes a self- corrective mechanism, enabling it to identify when information is missing or inadequate and autonomously retrieves it from external sources (e.g. web, enterprise information sources, or any other available information resources) by generating a new retrieval query without relying on predefined algorithms. This self- corrective step ensures that up- to- date and accurate information is always accessible.  

<|ref|>text<|/ref|><

image: 0it [00:00, ?it/s]
other: 100%|██████████| 11/11 [00:00<00:00, 72657.23it/s]

✓ Capture1 (1).jpg processed successfully. Output saved to /content/batch_ocr_output

BATCH PROCESSING SUMMARY

--- Capture (5).jpg ---
Processed. Output saved to /content/batch_ocr_output


--- Capture1 (1).jpg ---
Processed. Output saved to /content/batch_ocr_output


Detailed results are saved in the directory: /content/batch_ocr_output
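Both images in this run were saved to the same `output_path`. If `infer` writes fixed file names into that directory (an assumption about its behaviour, not something this log confirms), later images would overwrite earlier ones. One hedged way around that is to derive a separate sub-directory per image; `output_dir_for` is a hypothetical helper.

```python
import re
from pathlib import Path


def output_dir_for(filename: str, root: str = "/content/batch_ocr_output") -> Path:
    """Map an uploaded file name to its own sub-directory under root."""
    stem = Path(filename).stem  # "Capture (5).jpg" -> "Capture (5)"
    # Replace anything outside a conservative character set with underscores.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", stem).strip("_")
    return Path(root) / (safe or "image")
```

Inside the loop, one could create the directory with `os.makedirs(out_dir, exist_ok=True)` and pass `output_path=str(out_dir)` to `model.infer`.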





In [23]:
from transformers import AutoModel, AutoTokenizer
import torch
import os

print("Loading DeepSeek-OCR model for batch processing...")

os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Removing attn_implementation='flash_attention_2' as a troubleshooting step
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

print("Model loaded successfully for batch processing!")

Loading DeepSeek-OCR model for batch processing...


You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR. This is not supported for all configurations of models and can yield errors.
Some weights of DeepseekOCRForCausalLM were not initialized from the model checkpoint at deepseek-ai/DeepSeek-OCR and are newly initialized: ['model.vision_model.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded successfully for batch processing!


## Troubleshooting

### Common Issues:

1. **Out of Memory (OOM):**
   - Use a GPU with more memory (for example, an A100)
   - Reduce image resolution before processing
   - Use a smaller mode in `model.infer`, such as Small (`base_size=640, image_size=640, crop_mode=False`)

2. **Flash Attention Installation Fails:**
   - Remove the `attn_implementation='flash_attention_2'` argument from `from_pretrained`, as the model-loading cell in this notebook does
   - The model then falls back to its standard attention mechanism

3. **Model Download Slow:**
   - This is normal for large models (may take 10-15 minutes)
   - Model is cached after first download

4. **Image Format Issues:**
   - Ensure image is in RGB format
   - Convert: `img = img.convert('RGB')`
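For item 2, the fallback can be made automatic rather than edited by hand. The sketch below only checks whether the `flash_attn` package is importable and builds the extra keyword arguments accordingly; `attention_kwargs` is a hypothetical helper, and the argument name is taken from the troubleshooting note above rather than verified against the model's remote code.

```python
import importlib.util


def attention_kwargs(module_name: str = "flash_attn") -> dict:
    """Return extra from_pretrained kwargs depending on whether the module can be found."""
    if importlib.util.find_spec(module_name) is not None:
        return {"attn_implementation": "flash_attention_2"}
    return {}  # omit the argument so the default attention implementation is used
```

It would be used as `AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True, **attention_kwargs())`. Note that finding the package does not prove it works on the current GPU, so a load failure may still require dropping the argument manually.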

### Performance Tips:

- Use images close to native resolutions: 512×512, 640×640, 1024×1024, 1280×1280
- The model is already loaded in `torch.bfloat16`, which halves memory use relative to `torch.float32`
- When processing multiple images, load the model once and reuse it for every image, as the batch cell does
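The native resolutions listed above correspond to the mode presets noted in the inference cell. The sketch below stores those presets in a dict and picks the fixed-resolution one closest to an image's longer side. The selection heuristic and the names `MODES` and `closest_mode` are assumptions for illustration, not a rule from the DeepSeek-OCR repository.

```python
# Presets copied from the comments in the inference cell.
MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}


def closest_mode(width: int, height: int) -> str:
    """Pick the fixed-resolution preset whose size is nearest the image's longer side."""
    longer = max(width, height)
    fixed = [name for name, cfg in MODES.items() if not cfg["crop_mode"]]
    return min(fixed, key=lambda name: abs(MODES[name]["base_size"] - longer))
```

The chosen preset can be unpacked straight into the call, for example `model.infer(tokenizer, prompt=prompt, image_file=image_path, output_path=output_path, **MODES[closest_mode(w, h)])`, where `w` and `h` are the image dimensions.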

## Cleanup (Optional)

Free up GPU memory when done.

In [21]:
# Clear GPU memory
import gc
import torch

del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()

print("GPU memory cleared")

GPU memory cleared
