
Building a Local LLM-Powered Hybrid OCR Engine

How I combined Surya's lightning-fast detection with the semantic power of Local LLMs to create pixel-perfect, searchable PDFs from handwriting and complex forms.

OCR · LLM · Local Integration · AI Engineering · Python · PDF
Ahnaf An Nafee
14 Dec 2025

We live in a digital world, yet the most valuable data is often trapped in the analog prison of scanned PDFs. Invoices, handwritten notes, and medical forms are essentially black boxes to our computers: pixels without meaning.

I wanted to solve this. But I didn't want to pay API fees to Cloud Vision providers, and I didn't want my private documents leaving my machine.

This article details how I engineered a Local, Hybrid OCR Engine that combines the layout precision of Surya OCR with the reasoning capabilities of OlmOCR (running locally via LM Studio) to create searchable, indexable PDFs.

Project UI

The Problem: The "OCR Triangle"

When building an OCR tool today, you typically get to pick only two of the following:

  1. Layout Accuracy (Knowing where the text is).
  2. Semantic Accuracy (Knowing what the text says).
  3. Speed/Cost (Running it on a laptop).

| Tool | Layout | Semantics | Speed |
| --- | --- | --- | --- |
| Tesseract/EasyOCR | ✅ Good | ❌ Fails on Handwriting | ⚡ Fast |
| Surya Recognition | ✅ Excellent | ⚠️ Moderate on Handwriting | 🐢 Slow (~20s/page) |
| Surya Detection Only | ✅ Excellent | ❌ No text output | ⚡ Fast (~1s/page) |
| OlmOCR (Local LLM) | ❌ No coordinates | ✅ Reads Cursive Perfectly | 🐢 Depends on GPU |

My goal was to break this triangle. I needed the semantic brains of the LLM to read the messy handwriting and the structural eyes of Surya's detection to place that text in the correct spot on the PDF, without the slowness of full recognition.

The Architecture: "Detection + LLM"

MERMAID
graph TD
    A["Input PDF"] --> B["PDF to Image Conversion"]
    B --> C["Batch Processing"]

    subgraph "Phase 1: Layout Detection (Surya)"
        C --> D["Surya DetectionPredictor"]
        D --> E["Bounding Boxes"]
        E --> F["Sorted by Reading Order"]
    end

    subgraph "Phase 2: Text Extraction (Local LLM)"
        C --> G["OlmOCR Vision Model"]
        G --> H["Pure Text Content"]
    end

    F --> I["Position-Based Aligner"]
    H --> I
    I -->|"Distribute by Box Width"| J["Aligned Text Blocks"]
    J --> K["Sandwich PDF Generator"]
    K --> L["Searchable PDF Output"]

The core innovation is separating where (detection) from what (LLM). Instead of using slow OCR recognition, we use:

  1. Surya DetectionPredictor - Just finds text bounding boxes (no recognition)
  2. Local LLM (OlmOCR) - Reads the actual content with human-like understanding
  3. Position-Based Aligner - Distributes LLM text across detected boxes
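
For context, here is roughly what the LLM half of that pipeline looks like when LM Studio serves OlmOCR through its OpenAI-compatible endpoint. This is a minimal sketch rather than the project's actual ocr_processor; the model id, prompt, and port are placeholders you would swap for whatever LM Studio reports.

PYTHON
# Hypothetical sketch: querying a local OlmOCR model via LM Studio's
# OpenAI-compatible server. Model id and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def ocr_page(image_path: str) -> str:
    """Send one page image to the local vision LLM and return its transcription."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="olmocr",  # placeholder: use the id LM Studio assigns to your model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all text on this page in natural reading order."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        temperature=0.0,  # deterministic transcription
    )
    return response.choices[0].message.content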

Why Detection-Only is 10-21x Faster

The key insight is that Surya's recognition step is expensive; it runs a large transformer model on every detected text region. But if the LLM is already giving us perfect text, we don't need recognition at all!

| Document Type | With Recognition | Detection-Only | Speedup |
| --- | --- | --- | --- |
| Digital PDF | 20.38s | 0.97s | 21x |
| Handwritten | 9.66s | 1.00s | 10x |
| Hybrid Form | ~11s | ~1s | 11x |

MERMAID
sequenceDiagram
    participant UI as User Interface
    participant WS as WebSocket
    participant S as Server
    participant P as Processing Pipeline
    UI->>WS: Connect with client_id
    UI->>S: Upload PDF + client_id
    S->>P: Start processing
    loop Each Processing Step
        P->>S: Progress update
        S->>WS: Send progress message
        WS->>UI: Update UI progress
    end
    S->>UI: Return processed file

The Position-Based Alignment Algorithm

Since we no longer have OCR text to match against, we use a simpler but effective strategy: distribute tokens by box width.

PYTHON
# hybrid_aligner.py

def align_text(self, structured_data, llm_text):
    """Position-based alignment: distribute LLM tokens across boxes."""
    llm_tokens = llm_text.split()
    boxes = [item[0] for item in structured_data]

    # Calculate width ratios
    total_width = sum(box[2] - box[0] for box in boxes)

    final_output = []
    token_idx = 0

    for i, box in enumerate(boxes):
        # Proportional token allocation
        ratio = (box[2] - box[0]) / total_width
        count = int(round(len(llm_tokens) * ratio))

        # The last box absorbs any remainder so rounding never drops tokens
        if i == len(boxes) - 1:
            chunk = llm_tokens[token_idx:]
        else:
            chunk = llm_tokens[token_idx : token_idx + count]
        token_idx += count

        final_output.append((box, " ".join(chunk)))

    return final_output

This works because:

  1. Boxes are sorted by reading order (top-to-bottom, left-to-right)
  2. Wider boxes naturally contain more text
  3. The LLM output follows the same reading order as the visual layout
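
To make the distribution concrete, here is a toy run of the same width-proportional split on two made-up boxes: a wide "Name" field and a narrower "DOB" field.

PYTHON
# Toy illustration of width-proportional token distribution (standalone).
boxes = [
    (50, 100, 450, 130),   # (x0, y0, x1, y1) -> width 400
    (500, 100, 700, 130),  # width 200
]
llm_text = "Name: Jane Doe Smith DOB: 1990-02-01"

tokens = llm_text.split()                      # 6 tokens
total_width = sum(b[2] - b[0] for b in boxes)  # 600

idx = 0
for box in boxes:
    count = int(round(len(tokens) * (box[2] - box[0]) / total_width))
    print(box, "->", " ".join(tokens[idx : idx + count]))
    idx += count

# (50, 100, 450, 130) -> Name: Jane Doe Smith
# (500, 100, 700, 130) -> DOB: 1990-02-01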

Batch Processing for Multi-Page PDFs

For documents with many pages, we batch all images into a single Surya detection call:

PYTHON
# Process ALL pages at once instead of one-by-one
all_image_bytes = [decode(img) for img in images_dict.values()]
all_boxes = hybrid_aligner.get_detected_boxes_batch(all_image_bytes)

This reduces per-page overhead and leverages GPU parallelism.
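
The batch helper itself isn't shown in this post, but a plausible sketch looks like the following. I'm assuming the predictor-style API of recent surya-ocr releases (a DetectionPredictor called on a list of PIL images); exact class and attribute names may differ between versions.

PYTHON
# Plausible sketch of the batch detection helper (not the project's exact code).
import io
from PIL import Image
from surya.detection import DetectionPredictor

det_predictor = DetectionPredictor()  # loads the detection model once

def get_detected_boxes_batch(all_image_bytes):
    """Run text detection on every page image in a single batched call."""
    images = [Image.open(io.BytesIO(b)).convert("RGB") for b in all_image_bytes]
    predictions = det_predictor(images)  # one detection result per page

    pages = []
    for pred in predictions:
        # Sort boxes into reading order: top-to-bottom, then left-to-right,
        # bucketing y-coordinates into ~10px bands so near-level boxes share a row.
        boxes = sorted(
            (line.bbox for line in pred.bboxes),    # bbox = [x0, y0, x1, y1]
            key=lambda b: (round(b[1] / 10), b[0]),
        )
        pages.append(boxes)
    return pages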

The "Sandwich" PDF

Once we have aligned the text, we need to embed it. I used PyMuPDF to create a "Sandwich PDF".

  1. Render the Page as an Image: This strips all existing selectable text (wiping the slate clean).
  2. Insert the Image: The background is now just pixels.
  3. Embed Invisible Text: We take our aligned text and inject it using render_mode=3 (invisible text).

The font sizing is carefully calculated to fill each bounding box:

PYTHON
# Calculate font size to fill box width, constrained by height
ref_width = font.text_length(text, fontsize=12.0)
width_based_size = (box_width * 0.98) / ref_width * 12.0
height_based_size = box_height * 0.85

# Use the smaller to ensure it fits
target_fontsize = min(width_based_size, height_based_size)

This technique means the user sees the original visual document (preserving all fonts and layout), but their cursor interacts with a hidden layer of perfect, searchable text.
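
Putting the three steps together, a stripped-down version of the sandwich builder might look like this. It's a sketch, not the project's actual generator: the per-box font sizing from above is reduced to the height heuristic, and it assumes the boxes are in the pixel space of the same raster the detector saw.

PYTHON
# Simplified sandwich-page sketch using PyMuPDF (fitz).
import fitz  # PyMuPDF

def build_sandwich_page(src_page, aligned_data, out_doc):
    """Rasterize a page, then overlay invisible text at each aligned box."""
    pix = src_page.get_pixmap(dpi=200)      # 1. render the page to pixels
    rect = src_page.rect
    page = out_doc.new_page(width=rect.width, height=rect.height)
    page.insert_image(rect, pixmap=pix)     # 2. image-only background

    scale = rect.width / pix.width          # map raster pixels -> PDF points
    for box, text in aligned_data:
        if not text:
            continue
        x0, y0, x1, y1 = [v * scale for v in box]
        page.insert_text(
            (x0, y1),                        # baseline near the bottom of the box
            text,
            fontsize=(y1 - y0) * 0.85,       # height-based sizing only, for brevity
            render_mode=3,                   # 3 = invisible (searchable) text
        )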

CLI & Real-time Web UI

A tool is only as good as its DX. I implemented a dual-interface approach:

Project CLI
  1. CLI: Powered by rich, providing beautiful progress bars, spinners, and live status updates.
  2. Web UI: A FastAPI backend allows users to drag-and-drop files. I used WebSockets to push the exact same rich progress updates to the browser in real-time.
PYTHON
# main.py - Two-phase processing
with progress:
    # Phase 1: Batch layout detection
    task_layout = progress.add_task("[cyan]Detecting layouts (batch)...", total=1)
    all_boxes = hybrid_aligner.get_detected_boxes_batch(all_image_bytes)

    # Phase 2: LLM OCR per page
    task_ocr = progress.add_task(f"[cyan]LLM OCR Processing...", total=total_pages)
    for page_num in page_nums:
        llm_lines = ocr_processor.perform_ocr(image_base64)
        aligned_data = hybrid_aligner.align_text(boxes, llm_lines)
        progress.advance(task_ocr)
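
On the web side, the WebSocket plumbing is conceptually simple. Here is a hypothetical sketch of the progress channel; the route, payload shape, and helper names are illustrative rather than the project's actual API.

PYTHON
# Hypothetical FastAPI WebSocket progress channel (illustrative names).
from fastapi import FastAPI, WebSocket

app = FastAPI()
clients: dict[str, WebSocket] = {}

@app.websocket("/ws/{client_id}")
async def progress_socket(websocket: WebSocket, client_id: str):
    await websocket.accept()
    clients[client_id] = websocket
    try:
        while True:
            await websocket.receive_text()  # keep the connection open
    finally:
        clients.pop(client_id, None)

async def push_progress(client_id: str, message: str, percent: float):
    """Mirror a rich progress update to the browser, if this client is connected."""
    ws = clients.get(client_id)
    if ws is not None:
        await ws.send_json({"message": message, "percent": percent})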

Conclusion

This project started as a way to digitize my handwritten notes, but it evolved into a robust case study on Hybrid AI Systems.

By separating detection (fast, deterministic) from recognition (slow, semantic), we get the best of both worlds:

  • Surya DetectionPredictor: 10-21x faster than full recognition
  • Local LLM (OlmOCR): Human-like reading comprehension
  • Position-Based Alignment: Simple but effective text distribution

The result is a tool that feels "native" but possesses the reading comprehension of a state-of-the-art AI, running 100% locally on your machine.

The full source code is available on GitHub.
