
AI Tip: How to Do OCR Using the Gemini Vision Model

If you’re an academic who regularly works with printed sources, you know how crucial OCR (Optical Character Recognition) is. Whether you’re trying to make a scanned document searchable or feeding it into a more advanced RAG (Retrieval-Augmented Generation) system, the quality of the OCR makes or breaks the workflow.

Unfortunately, traditional tools like Adobe Acrobat or ABBYY FineReader often fall short—especially when high precision is needed. The good news? There are now two much more effective approaches that leave those legacy options behind.

In this post, we’ll explore the first of those: using Gemini’s vision model to achieve high-accuracy OCR that’s ready for modern research workflows.

To see the code repository, visit my GitHub at https://github.com/jzou19957/Automatic_OCR_Through_Gemni_Vision

Below are instructions on how to use it:

📘 How to Use: Automatic_OCR_Through_Gemini_Vision

If you're a non-programmer looking for a convenient way to perform high-quality OCR on PDFs using Gemini’s Vision model, this tool is designed for you. Here's how to get started:

Step 1: Get a Gemini API Key

  1. Visit the official Gemini API page:
    👉 https://aistudio.google.com/app/apikey

  2. Request an API key by following the instructions on the page.

⚠️ Important Note on Billing:
You can use the API for free without linking it to a billing account. However, in that case, requests will be much slower due to rate limits.
If you link the API key to a billing account, you'll get faster performance, but usage will incur automatic charges.
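Rather than pasting the key directly into the script, you can keep it out of your code entirely by reading it from an environment variable. A minimal sketch, assuming you store the key in a variable called GEMINI_API_KEY (the variable name is my choice, not something the script requires):

```python
import os

def get_api_key() -> str:
    # Read the Gemini API key from an environment variable so it never
    # ends up committed to a repository by accident.
    key = os.environ.get("GEMINI_API_KEY", "")
    if not key:
        raise SystemExit("Set the GEMINI_API_KEY environment variable first.")
    return key
```

You would then pass get_api_key() wherever the script expects the hard-coded placeholder.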

Step 2: Add Your API Key to the Script

  1. Replace the placeholder API key in the script with the key you obtained in Step 1 from:

    👉 https://aistudio.google.com/app/apikey
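If you stay on the free tier, a long book can hit those rate limits partway through a run. A common workaround is to retry failed requests with exponential backoff. A sketch of the general pattern (RuntimeError stands in here for whatever rate-limit exception the API client actually raises):

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=2.0):
    # Retry a rate-limited call, doubling the wait after each failure
    # and adding a little random jitter.
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:  # stand-in: catch the API's real rate-limit error here
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.random())
```

Wrapping each page's OCR request in with_backoff lets the script ride out temporary throttling instead of crashing mid-book.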

Step 3: Prepare Your Files

  1. Place the PDF files you want to convert in the same folder as the Python script. For example, if you’re working with a book called example_book.pdf, your project directory should look something like this:

     /your-project-folder
     ├── example_book.pdf
     └── Automatic_OCR_Through_Gemini_Vision.py
    

    Once everything is in place, here’s what the code does under the hood:

    1. PDF to Image Conversion:
      The script splits your PDF into individual page images, handling the images with the Pillow (PIL) library (Pillow itself cannot read PDFs, so a PDF rendering backend does the actual page rasterization). Each page is converted into a high-resolution .png or .jpg image—this prepares it for accurate OCR processing.

    2. OCR via Gemini Vision API:
      Each image is then automatically sent to Google’s Gemini Vision model via API. The script processes the pages sequentially, ensuring that each page receives dedicated attention for optimal OCR accuracy.

    3. Markdown Output:
      The text content extracted from each image is saved as individual Markdown (.md) files—one per page. A combined Markdown file is also generated for easier full-text querying.
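The non-API parts of that pipeline can be sketched with a few small helpers. Everything Gemini-specific is omitted here, and find_pdfs, page_md_name, and combine_pages are illustrative names of my own, not functions from the repository:

```python
from pathlib import Path

def find_pdfs(folder):
    # Collect every PDF in the folder, sorted for a stable processing order.
    return sorted(Path(folder).glob("*.pdf"))

def page_md_name(stem, page_num):
    # One Markdown file per page; zero-padding keeps the files sorted correctly.
    return f"{stem}_page_{page_num:03d}.md"

def combine_pages(page_texts):
    # Join the per-page Markdown into the single combined file, with separators.
    return "\n\n---\n\n".join(page_texts)
```

In the real script, each page image would be sent to the Gemini Vision API and the returned text written out under names like these before the combined file is assembled.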

▶️ Running the Code in Visual Studio Code

  1. To run the tool, simply open the Python script in Visual Studio Code and execute it. The script is designed to be user-friendly and self-contained:

    • It automatically installs all required dependencies on first run (no manual setup needed).

    • It converts the input PDF into OCR-ready content using high-accuracy image-to-text processing via the Gemini Vision API.

    • Each page of the PDF is:

      • Converted to a high-resolution image,

      • Passed through the Gemini OCR model,

      • And saved as a Markdown (.md) file.
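The "installs its own dependencies on first run" behaviour is typically implemented with a small bootstrap like the one below—a sketch of the general pattern, not the repository's exact code:

```python
import importlib
import subprocess
import sys

def ensure(package, module=None):
    # Import the module if it is already available; otherwise install the
    # pip package that provides it, then import it.
    name = module or package
    try:
        return importlib.import_module(name)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        return importlib.import_module(name)
```

For example, ensure("Pillow", "PIL") installs Pillow only if import PIL fails, which is why no manual setup is needed before the first run.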

The output includes:

  • One Markdown file per page, and

  • A combined Markdown file containing the full content for convenience.

This process effectively creates a digitized, query-ready version of the book that’s ideal for:

  • Full-text search,

  • Personal knowledge management systems,

  • Academic research, and

  • RAG (Retrieval-Augmented Generation) pipelines.