How to Use the Tesseract Library for OCR in a Google Colab Notebook
Originally Posted On: https://bhadreshpsavani.medium.com/how-to-use-tesseract-library-for-ocr-in-google-colab-notebook-5da5470e4fe0
OCR from an Image Using PyTesseract in Python in a Colab Notebook
Optical Character Recognition (OCR) is a popular task in computer vision because of its wide range of applications. It can be used for business data entry, number plate recognition, automated passport recognition, quick document verification, IoT applications, task automation, and many more. Basically, it fits any application that needs to extract text from an image.
Tesseract is one of the most popular open-source tools available for OCR. It was initially developed by HP as a tool in C++, and since 2006 its development has been sponsored by Google. The original software is available as a command-line tool for Windows, Linux, and macOS. We are living in a Python world, and because of Python's popularity, Tesseract is also available in Python through a wrapper that is developed and maintained as an open-source project.
Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.
Knowing the details of a tool is information, but learning to use it is knowledge. We are knowledge seekers, so let's learn how to use it.
Here are the steps to extract text from an image in a Google Colab notebook using Pytesseract:
Step1. Install Pytesseract and tesseract-OCR in Google Colab.
!sudo apt install tesseract-ocr
!pip install pytesseract
Step2. Import libraries
import pytesseract
import shutil
import os
import random
try:
    from PIL import Image
except ImportError:
    # Fall back to the legacy standalone PIL import
    import Image
Step3. Upload Image to the Colab
We can upload the image manually through the upload option in the Files panel, but we can also use the following code to upload the image to Colab.
from google.colab import files
# Opens a file picker and returns a dict keyed by uploaded file name
uploaded = files.upload()
Step4. Text Extraction
The image_to_string function takes an image as an argument and returns the text extracted from it. We can either print the result directly or store the string in a variable.
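Rather than hard-coding the file name, the path can be read from the dictionary that files.upload() returned in Step 3. The sketch below assumes that dictionary is keyed by file name. pick_uploaded_image and its extension list are hypothetical helpers written for illustration, not part of Colab or Pytesseract.

```python
import os

# Extensions treated as images here; this list is an assumption for illustration.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tif", ".tiff"}


def pick_uploaded_image(uploaded):
    """Return the first uploaded file name that looks like an image.

    `uploaded` is expected to be the dict returned by files.upload(),
    keyed by file name.
    """
    for name in uploaded:
        extension = os.path.splitext(name)[1].lower()
        if extension in IMAGE_EXTENSIONS:
            return name
    raise ValueError("No image file found among the uploaded files")
```

With the uploaded variable from Step 3, pick_uploaded_image(uploaded) could stand in for a fixed file name such as 'image.jpg'.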
image_path_in_colab = 'image.jpg'
extractedInformation = pytesseract.image_to_string(Image.open(image_path_in_colab))
print(extractedInformation)
Step5. Detect Language other than English:
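By specifying the lang argument in the above function, we can change the language of the text to be detected. The matching Tesseract language data must be installed first; for French, that is the tesseract-ocr-fra package.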
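Before running a non-English call, it can help to confirm that the language data is actually present. The sketch below assumes pytesseract's get_languages function, which is available in recent releases, and assumes the Debian package naming pattern tesseract-ocr-<code>. build_lang_argument, missing_languages, and report_missing_languages are hypothetical helpers.

```python
def build_lang_argument(codes):
    """Join Tesseract language codes into one lang string, e.g. 'eng+fra'."""
    return "+".join(codes)


def missing_languages(wanted, installed):
    """Return the wanted language codes that are not installed, in order."""
    installed_set = set(installed)
    return [code for code in wanted if code not in installed_set]


def report_missing_languages(wanted):
    """Print an install hint for each wanted language Tesseract lacks."""
    import pytesseract  # imported here so the helpers above work without it

    installed = pytesseract.get_languages(config="")
    missing = missing_languages(wanted, installed)
    for code in missing:
        # Assumed package naming: underscores in the code become hyphens
        package = "tesseract-ocr-" + code.replace("_", "-")
        print(f"Missing '{code}': try installing the {package} package")
    return missing
```

Tesseract accepts several languages joined with a plus sign, so build_lang_argument(["eng", "fra"]) produces a value that can be passed as lang for images that mix English and French.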
# French text image to string
extractedInformation = pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra')
print(extractedInformation)
Step6. Get Bounding Boxes for Text
To get bounding box coordinates for the text, we use the image_to_boxes function with the same image argument as the earlier function.
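The image_to_boxes function returns plain text with one line per recognized character, in the form "char left bottom right top page", with coordinates measured from the bottom-left corner of the image. Below is a minimal sketch for turning that text into tuples. parse_boxes is a hypothetical helper, and the sample string is made up for illustration rather than taken from a real image.

```python
def parse_boxes(box_text):
    """Parse image_to_boxes output into (char, left, bottom, right, top) tuples."""
    boxes = []
    for line in box_text.splitlines():
        # Split from the right so the character itself is never broken up
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, left, bottom, right, top, _page = parts
        boxes.append((char, int(left), int(bottom), int(right), int(top)))
    return boxes


# Made-up sample in the same format as the real output
sample = "H 10 20 30 60 0\ni 34 20 40 58 0"
print(parse_boxes(sample))
```

Because the origin is at the bottom-left, drawing these boxes with a library that measures from the top-left requires flipping the vertical values, for example using image_height - top as the upper edge.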
# Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open(image_path_in_colab)))
Feel free to check this Colab Notebook implementation of the above method.
bhadreshpsavani/OCR_using_TesseatactLib_Project
github.com
Pros:
- Easy to use
- Fast detection
- One of the most popular OCR engines
- Efficient, with no GPU required
- Supports 100+ languages
- One of the oldest OCR libraries
- Command-line support
Cons:
- Only works on CPU
- Doesn't perform well on blurry, noisy, or colorful images
- Performance decreases for small font sizes in low-resolution images
- Doesn't work well on complex forms
If you want text detection and recognition in a single function, check out EasyOCR, an AI-based open-source library. It supports 70+ languages and offers faster inference on GPU.
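For comparison, here is a rough sketch of what the EasyOCR route could look like. It assumes the easyocr package is installed and that readtext returns a list of (bounding box, text, confidence) entries. keep_confident, read_with_easyocr, and the 0.5 threshold are hypothetical choices for illustration.

```python
def keep_confident(results, min_confidence=0.5):
    """Keep the text of detections at or above a confidence threshold.

    Each result is expected as (bounding_box, text, confidence).
    """
    return [text for _box, text, confidence in results if confidence >= min_confidence]


def read_with_easyocr(image_path, languages=("en",), use_gpu=True):
    """Run EasyOCR detection and recognition on one image."""
    import easyocr  # requires `pip install easyocr`

    reader = easyocr.Reader(list(languages), gpu=use_gpu)
    return reader.readtext(image_path)
```

A call such as keep_confident(read_with_easyocr('image.jpg')) would then return only the text fragments the model is reasonably sure about.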
Note: For blurry, noisy, or colorful images, we need to apply some image-processing steps first, such as converting the image to black and white and removing salt-and-pepper noise with low-pass filters such as averaging or Gaussian filters. We can also sharpen a blurry image with a high-pass filter such as a Sobel filter. These image-processing operations can be implemented with the OpenCV library in Python.
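The note above can be sketched with OpenCV roughly as follows. This version covers only grayscale conversion, denoising, and binarization. It uses a median filter, which is a swap-in for the averaging or Gaussian filters mentioned above and is a common choice for salt-and-pepper noise, followed by Otsu thresholding. preprocess_for_ocr and its default kernel size are hypothetical and would need tuning per image.

```python
def preprocess_for_ocr(image_path, blur_kernel=3):
    """Return a denoised, black-and-white version of the image as an array."""
    if blur_kernel < 3 or blur_kernel % 2 == 0:
        raise ValueError("blur_kernel must be an odd number of at least 3")

    import cv2  # OpenCV; imported here so argument checks run without it

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # Drop color information, which Tesseract does not need
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Median filter to suppress salt-and-pepper noise
    denoised = cv2.medianBlur(gray, blur_kernel)
    # Otsu's method picks the black/white threshold automatically
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```

The returned array can be passed to pytesseract.image_to_string, which accepts NumPy arrays as well as PIL images.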
IronOCR for .NET
For those working in .NET rather than Python, IronOCR offers a similar experience with built-in image preprocessing.
It handles blur, noise, and color correction automatically, so you don’t need to manually apply filters before OCR. It supports 127+ languages and works directly with NuGet installation. If your project stack is C# or VB.NET, it’s worth exploring as an alternative approach.
Stay Pythonic!!
