
International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 12 Issue: 10 | Oct 2025

p-ISSN: 2395-0072

www.irjet.net

Enhanced Document Image Preprocessing Using Histogram Equalization and Sobel Edge Detection for Improved OCR Accuracy

Mohd Shahzad1, Laraib Ahmad Siddiqui2, Nazish Baliyan3, Mohd Amaan Khan4

1AWS and DevOps Engineer, Deloitte, India

2Program Control Services Analyst, Accenture, India

3Software Engineer, Nike, India

4Cloud and DevOps Engineer, Intellect Design Arena, India

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - The accuracy of Optical Character Recognition (OCR) depends strongly on document image quality. Low contrast, uneven illumination, scanning artifacts, and sensor noise reduce character separability and degrade OCR output. We present a two-stage preprocessing pipeline that is both simple and explainable: global Histogram Equalization (HE) for contrast enhancement, followed by Sobel edge detection to strengthen character boundaries prior to OCR. Implemented in Python with OpenCV, the method is evaluated on the UW-III English Document Image Database and self-scanned grayscale documents. Across 30 images, OCR accuracy improved from 74.3% (raw) to 82.7% with HE alone and 88.9% with HE + Sobel, a substantial gain on challenging real-world scans at low computational cost. These findings support the use of lightweight, interpretable preprocessing before OCR in practical digitization workflows.

Global and adaptive binarization. Otsu’s global thresholding maximizes inter-class variance and is a strong baseline for well-lit pages [1]. However, a single global threshold falters under strong shading or background texture. Adaptive (local) methods such as Niblack (local mean and standard deviation), Sauvola (local mean and local deviation), and the Wolf–Jolion variants compute a threshold per pixel from local statistics and are robust to uneven lighting [2–4]. The periodic Document Image Binarization Contests (DIBCO) have further advanced adaptive schemes and evaluation protocols.
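The difference between a global Otsu threshold and a Sauvola-style local threshold can be illustrated with a minimal, dependency-free sketch. The function names, the toy scanline, and the parameter values k = 0.2 and R = 128 are illustrative assumptions, not values taken from this paper; in an OpenCV pipeline the global case is typically handled by cv2.threshold with the THRESH_OTSU flag.

```python
import math


def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes between-class variance.

    Pixels <= t form one class and pixels > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0    # pixel count of the dark class
    sum0 = 0  # intensity sum of the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # no pixels in the dark class yet
        if w1 == 0:
            break     # every pixel is already in the dark class
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance.
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def sauvola_threshold(window, k=0.2, dynamic_range=128.0):
    """Sauvola threshold for one pixel, from the grey levels in its window.

    k and dynamic_range are assumed, commonly used settings.
    """
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / n)
    return mean * (1.0 + k * (std / dynamic_range - 1.0))


# Toy scanline: dark ink (40) on bright paper (210).
scanline = [210] * 12 + [40] * 4
t = otsu_threshold(scanline)
binary = [1 if p > t else 0 for p in scanline]  # 1 = paper, 0 = ink

# The local threshold follows the local brightness of flat patches.
bright_patch_t = sauvola_threshold([200] * 9)
shaded_patch_t = sauvola_threshold([100] * 9)
```

On the flat patches the Sauvola threshold scales with the local mean, which is the property that lets local methods cope with shading where one global value cannot.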

Contrast enhancement. Standard Histogram Equalization (HE) increases global contrast by redistributing intensities; OpenCV exposes it via equalizeHist for 8-bit single-channel images. Contrast Limited Adaptive Histogram Equalization (CLAHE) performs HE locally and clips histogram peaks to avoid noise amplification, with well-documented utility in medical and low-contrast images [5–7]. We adopt global HE for simplicity and speed, while noting CLAHE as an effective alternative where local contrast varies strongly.
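As a rough illustration of what global HE does, the sketch below applies the textbook cumulative-histogram mapping to a flat list of 8-bit grey levels. The function name and sample values are hypothetical, and the rounding may differ in detail from OpenCV's equalizeHist; it is meant to show the idea, not to reproduce the library output exactly.

```python
def equalize_hist(pixels, levels=256):
    """Map grey levels through the normalized cumulative histogram (CDF)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = min(c for c in cdf if c > 0)  # CDF at the darkest level present
    if total == cdf_min:
        return list(pixels)  # constant image: nothing to stretch

    # Lookup table: darkest level present -> 0, brightest -> levels - 1.
    lut = [
        max(0, round((c - cdf_min) * (levels - 1) / (total - cdf_min)))
        for c in cdf
    ]
    return [lut[p] for p in pixels]


# A low-contrast patch occupying only grey levels 100..104.
low_contrast = [100, 101, 101, 102, 103, 103, 103, 104]
equalized = equalize_hist(low_contrast)
print(equalized)
```

The five input levels are spread over the full 0 to 255 range. CLAHE would apply the same kind of mapping per tile with a clipped histogram; in OpenCV that is available through cv2.createCLAHE.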

Key Words: OCR, Document Preprocessing, Histogram Equalization, Sobel Edge Detection, Image Enhancement, Text Recognition, Grayscale Image Processing

1. INTRODUCTION

Edge enhancement. Edge detection can strengthen stroke boundaries prior to binarization or OCR. The Sobel operator approximates image gradients via separable derivative kernels; it is fast and robust for text edges. Canny’s detector offers strong theoretical guarantees but adds smoothing, non-maximum suppression, and hysteresis thresholds that may require tuning [8–9]. Our choice of Sobel reflects a bias toward speed, simplicity, and easy parameterization in production pipelines.
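The Sobel gradient magnitude can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions: the image is a small 2-D list of grey levels, border pixels are simply left at zero, and the names sobel_magnitude and page are hypothetical. A production pipeline would normally call cv2.Sobel, whose border handling differs.

```python
import math

# 3x3 Sobel kernels: horizontal derivative (GX) and vertical derivative (GY).
# Each is the outer product of a smoothing vector [1, 2, 1] and a
# difference vector [-1, 0, 1], which is what makes the operator separable.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Gradient magnitude of a 2-D list of grey levels.

    Border pixels are left at 0.0 (an assumption made for brevity).
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    value = image[r + dr][c + dc]
                    gx += GX[dr + 1][dc + 1] * value
                    gy += GY[dr + 1][dc + 1] * value
            out[r][c] = math.hypot(gx, gy)
    return out


# Vertical stroke boundary: dark ink on the left, bright paper on the right.
page = [[0, 0, 0, 255, 255] for _ in range(4)]
edges = sobel_magnitude(page)
```

The response is zero inside the flat ink region and large in the two columns that straddle the ink-to-paper transition, which is the boundary emphasis the preprocessing step relies on.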

Optical Character Recognition (OCR) systems are increasingly used to digitize printed and handwritten material. However, OCR performance deteriorates when documents exhibit low contrast, non-uniform lighting, blur, or scanning noise. Many advanced enhancement strategies exist, but production deployments often need methods that are fast, interpretable, and easy to maintain. We study a minimal two-stage preprocessing chain, Histogram Equalization (HE) followed by Sobel edge detection, that aims to (i) redistribute intensity values to improve global contrast and (ii) emphasize character edges to improve text–background separability. The sequential combination is straightforward to implement, transparent to debug, and compatible with most OCR engines, including Tesseract. We provide a compact mathematical description, an implementation plan, and empirical results on a mixed dataset comprising UW-III and self-scanned pages.

Denoising. Classical filters (median, Gaussian) are common in OCR pipelines, but more advanced methods such as Non-Local Means (NLM) and BM3D often preserve thin strokes better, at higher computational cost [10–12]. These can be valuable when the primary degradation is noise rather than contrast.

OCR engines and datasets. Tesseract remains a widely used OCR engine; Smith’s overview (ICDAR 2007) documents its adaptive classifier and line-finding strategy [13]. For document image benchmarking, UW-III contains roughly 1,600 English document images with detailed ground truth (zones, text lines, words), and it is frequently used for layout/OCR research [14–15].
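The trade-off between noise removal and stroke preservation in classical filters can be seen in a small 3x3 median filter sketch. The function name, the toy images, and the choice to copy border pixels unchanged are assumptions made for illustration; OpenCV offers cv2.medianBlur for the classical case and cv2.fastNlMeansDenoising for Non-Local Means.

```python
def median_filter_3x3(image):
    """Apply a 3x3 median filter to a 2-D list of grey levels.

    Border pixels are copied unchanged (an assumption made for brevity).
    """
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = sorted(
                image[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            )
            out[r][c] = window[4]  # middle of the 9 sorted values
    return out


# Case 1: an isolated dark "pepper" pixel on bright paper is removed.
noisy = [[200] * 5 for _ in range(5)]
noisy[2][2] = 0
cleaned = median_filter_3x3(noisy)

# Case 2: a one-pixel-wide vertical stroke is also removed in the interior,
# because only 3 of the 9 window values belong to the stroke.
stroke = [[200, 200, 0, 200, 200] for _ in range(5)]
thinned = median_filter_3x3(stroke)
```

The second case shows why a plain median filter can damage thin strokes, which is the motivation for the costlier structure-preserving methods mentioned in the text.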

2. LITERATURE REVIEW

Document image preprocessing typically targets binarization, denoising, contrast correction, and geometry correction before OCR.

Impact Factor value: 8.315

ISO 9001:2008 Certified Journal

Page 569