Rule-Based vs. LLM Document Extraction: A Hands-On Comparison for B2B Orders

Introduction

Extracting structured data from business documents—such as purchase orders, invoices, or delivery receipts—is a common yet challenging task in B2B workflows. Traditional rule-based systems have long been the default choice, but the rise of large language models (LLMs) offers a new, more flexible alternative. This article presents a practical comparison between a rule-based PDF extractor built with pytesseract and an LLM-based solution powered by Ollama and LLaMA 3. Both were applied to the same realistic B2B order scenario to evaluate their strengths and weaknesses.


The B2B Order Scenario

The test dataset consisted of scanned PDF purchase orders containing fields such as order number, vendor name, line items (quantities, part numbers, descriptions), pricing, and totals. These documents varied slightly in layout and had occasional handwriting marks, simulating real-world inconsistency. The goal was to extract all relevant fields accurately and quickly—without manual intervention.

Rule-Based Extraction with Pytesseract

Implementation

For the rule-based approach, I used pytesseract, a Python wrapper for Google's Tesseract OCR engine. The workflow was:

  1. Preprocess the PDF pages (convert to grayscale, apply thresholding, and deskew).
  2. Run OCR to extract raw text and bounding boxes.
  3. Apply handcrafted regular expressions and layout heuristics to locate and parse fields (e.g., "Order Number:" followed by alphanumeric characters).
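The parsing step (step 3) can be sketched as follows. The field patterns, helper names, and sample text here are illustrative, not the article's actual rules; in the full pipeline the raw text would come from pytesseract's `image_to_string` on the preprocessed page.

```python
import re

# Handcrafted patterns mirroring step 3: a label followed by its value.
FIELD_PATTERNS = {
    "order_number": re.compile(r"Order\s*(?:Number|No\.?)\s*:?\s*([A-Z0-9-]+)", re.I),
    "vendor": re.compile(r"Vendor\s*:?\s*(.+)", re.I),
    "total": re.compile(r"Total\s*:?\s*\$?([\d,]+\.\d{2})", re.I),
}

def parse_fields(ocr_text: str) -> dict:
    """Apply the handcrafted regexes to raw OCR text and return matched fields."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            fields[name] = match.group(1).strip()
    return fields

# In the full pipeline the text would come from OCR, e.g.:
#   ocr_text = pytesseract.image_to_string(preprocessed_page)
sample = "Order Number: PO-48213\nVendor: Acme Industrial\nTotal: $1,284.50"
print(parse_fields(sample))
# → {'order_number': 'PO-48213', 'vendor': 'Acme Industrial', 'total': '1,284.50'}
```

In practice, the layout heuristics (bounding-box positions from Tesseract) would be layered on top of these regexes to disambiguate fields that share a label.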

Strengths

The rule-based pipeline was fast (about 0.4 seconds per page) and fully deterministic: the same document always produced the same output, with no special hardware required.

Weaknesses

It was brittle. Getting the rules right took about three days of tweaking, and even small layout changes broke extraction on roughly 20% of the documents, so the regexes and heuristics required constant maintenance.

LLM-Based Extraction with Ollama and LLaMA 3

Implementation

For the LLM approach, I used Ollama to serve the locally hosted LLaMA 3 model (8B parameters). The pipeline was:

  1. Convert PDF pages to images (as before).
  2. Send the image directly to the LLM along with a structured prompt specifying which fields to extract (e.g., "Extract order number, vendor, line items, and total from this purchase order.").
  3. The model returned a JSON object containing the extracted data.
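The steps above can be sketched as a minimal client, assuming Ollama's default local chat endpoint and its JSON output mode. The prompt, model tag, and function names are illustrative; note that passing images this way requires a vision-capable model served by Ollama.

```python
import base64
import json

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

PROMPT = (
    "Extract order number, vendor, line items, and total from this "
    "purchase order. Respond with a single JSON object only."
)

def build_request(image_bytes: bytes, model: str = "llama3") -> dict:
    """Build the chat payload: the extraction prompt plus the base64-encoded page image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": PROMPT,
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
        "stream": False,
        "format": "json",  # ask Ollama to constrain the reply to valid JSON
    }

def parse_reply(reply_text: str) -> dict:
    """Parse the model's JSON reply into a plain dict (step 3)."""
    return json.loads(reply_text)

# The request itself would go through any HTTP client, e.g.:
#   resp = requests.post(OLLAMA_URL, json=build_request(page_png))
#   order = parse_reply(resp.json()["message"]["content"])
```

Constraining the output with `"format": "json"` avoids most of the post-processing needed when the model wraps its answer in prose.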

Strengths

Setup was minimal—about 30 minutes of prompt engineering—and the model handled every layout variation in the test set, reaching a field-level F1 of 0.93.

Weaknesses

Inference was slow (about 9.2 seconds per page) and required enough local hardware to serve the 8B-parameter model.

Head-to-Head Comparison

We evaluated both systems on 50 documents drawn from the same B2B order scenario. Key metrics were:

| Metric | Rule-Based (pytesseract) | LLM (Ollama + LLaMA 3) |
|---|---|---|
| Accuracy (field-level F1) | 0.85 | 0.93 |
| Average processing time per page | 0.4 seconds | 9.2 seconds |
| Set-up effort | 3 days of rule tweaking | 30 minutes of prompt engineering |
| Robustness to layout change | Low (broke on 20% of docs) | High (handled all variations) |
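The article does not spell out how field-level F1 was computed. One common definition, assuming exact-match scoring of each extracted field against a gold annotation, looks like this:

```python
def field_f1(predicted: dict, gold: dict) -> float:
    """Field-level F1: a field counts as correct only if its extracted
    value exactly matches the gold value for that field name."""
    correct = sum(1 for name, value in predicted.items() if gold.get(name) == value)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Exact matching is strict—a stray OCR character in an otherwise correct total counts as a miss—so looser normalization (trimming whitespace, stripping currency symbols) is often applied first.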

When to use each approach

Rules are the better fit when documents are highly standardized, throughput matters, and hardware is limited. The LLM is the better fit when layouts vary, setup time is short, and a per-page latency of several seconds is acceptable.

Conclusion

Building the same B2B document extractor twice revealed clear trade-offs. The rule-based system with pytesseract offered speed and determinism but required constant maintenance. The LLM approach with Ollama and LLaMA 3 provided superior flexibility and accuracy at the cost of latency and hardware requirements. For many real-world B2B scenarios, a hybrid solution may be best: use rules for simple, well-known fields and an LLM as a fallback or for complex extraction tasks.
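A hybrid router of this kind can be sketched in a few lines. The required-field set and the two extractor callables are illustrative stand-ins for the pipelines described above, not code from the article:

```python
REQUIRED_FIELDS = {"order_number", "vendor", "total"}

def extract(page, rule_extract, llm_extract):
    """Route a page: try the fast rule-based extractor first, and fall
    back to the slower LLM whenever any required field is missing."""
    fields = rule_extract(page)
    if REQUIRED_FIELDS <= fields.keys():
        return fields, "rules"
    # Fallback: merge whatever the rules did find with the LLM's output,
    # letting the LLM fill the gaps.
    return {**fields, **llm_extract(page)}, "llm"
```

With the timings from the comparison table, a router like this keeps the common case near the rule-based 0.4 s/page and pays the ~9 s LLM cost only on the documents the rules cannot handle.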

This article is based on practical experiments and was first published on Towards Data Science.
