Tesseract OCR Installation Guide

Created on: January 14, 2025

Answered using OpenAI 01 by OpenAI01

Question

I'm trying to install and use TesseractOCR correctly, see below how I tried:
"""

Tesseract OCR

installation type: static compilation
installation time: 45 min
installation space: 6.12 GB


install git: github.com/git-for-windows/git/releases/latest

cd C:\

git clone https://github.com/microsoft/vcpkg

cd vcpkg

.\bootstrap-vcpkg.bat -disableMetrics

.\vcpkg integrate install

---

.\vcpkg install tesseract:x64-windows-static

In Visual Studio > Project > Properties > Configuration: "Release".
In Visual Studio > Project > Properties > Platform: "x64".

In Visual Studio > Solution Explorer > Properties Configuration Properties > C/C++ > General > Additional Include Directories:
C:\vcpkg\installed\x64-windows-static\include

In Visual Studio > Solution Explorer > Properties Configuration Properties > Linker > General > Additional Library Directories:
C:\vcpkg\installed\x64-windows-static\lib

In Visual Studio > Solution Explorer > Properties Configuration Properties > Linker > Input > Additional Dependencies:
tesseract55.lib
leptonica-1.85.0.lib

In Visual Studio > Solution Explorer > Properties Configuration Properties > C/C++ > Code Generation > Runtime Library:
Multi-threaded (/MT)



Download Language Data `eng.traineddata`: https://github.com/tesseract-ocr/tessdata

Move `eng.traineddata` to `tessdata` folder in your C++ project directory.

Set Environment Variable: Add an environment variable `TESSDATA_PREFIX` pointing to the tessdata folder: `TESSDATA_PREFIX=C:\tesseract_cpp_project\tessdata`.

Set Environment Variable `PATH`: C:\vcpkg\installed\x64-windows-static

Restart Visual Studio to apply the changes.

"""

Note, I made your suggestions, but I haven't implemented them yet.

I did this to get rid of the design-time error messages from the IDE:
```
#include <C:\vcpkg\installed\x64-windows-static\include\tesseract\baseapi.h>
#include <C:\vcpkg\installed\x64-windows-static\include\leptonica\allheaders.h>

...

```

But the way you suggested, it showed the message that it had not recognized the headers, like this:
```
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

...

```
In this case it displayed the following message, `C++ cannot open source file` for both module declarations.

I am using visual studio 2022. Think about it.

Did I do something wrong?



On the web I found a person on a forum who said he managed to do it like this:
"""
I've been trying to link tesseract library to my c++ project in Visual Studio 2019 for a couple of days and I finally managed to do it. Any thread that I found or even official tesseract documentation do not have full list of instructions on what to do.

I'll list what I have done, hopefully it will help someone. I don't pretend it's the optimal way to do so.

There are basic tips in the official tesseract documentation (https://tesseract-ocr.github.io/tessdoc/Compiling.html). Go to the "Windows" section. I did install sw and cppan but I guess it wasn't necessary. The main thing here is installing vcpkg. It requires Git so I installed it. Then:

> cd c:\tools (I installed it in c:\tools, you may choose any dir)

> git clone https://github.com/microsoft/vcpkg

> .\vcpkg\bootstrap-vcpkg.bat

> .\vcpkg\vcpkg install tesseract:x64-windows-static (I used x64 version)

> .\vcpkg\vcpkg integrate install

At this point everything should work, they said. Headers should be included, libs should be linked. But none was working for me.

Change project configuration to Release x64 (or Release x86 if you installed x86 tesseract).

To include headers: Go to project properties -> C/C++ -> General. Set Additional Include Directories to `C:\tools\vcpkg\installed\x64-windows-static\include` (or wherever you installed vcpkg)

To link libraries : project properties -> Linker -> General. Set Additional Library Directories to `C:\tools\vcpkg\installed\x64-windows-static\lib`

Project properties -> C/C++ -> Code Generation. Set Runtime Library to `Multi-threaded(/MT)`. Otherwise I got errors like "runtime mismatch static vs DLL"

Tesseract lib couldn't link to its dependencies, so I added all libs that I had installed to `C:\tools\vcpkg\installed\x64-windows-static\lib`. Project properties -> Linker -> Input. I set Additional Dependencies to `archive.lib;bz2.lib;charset.lib;gif.lib;iconv.lib;jpeg.lib;leptonica-1.80.0.lib;libcrypto.lib;libpng16.lib;libssl.lib;libwebpmux.lib;libxml2.lib;lz4.lib;lzma.lib;lzo2.lib;openjp2.lib;tesseract41.lib;tiff.lib;tiffxx.lib;turbojpeg.lib;webp.lib;webpdecoder.lib;webpdemux.lib;xxhash.lib;zlib.lib;zstd_static.lib;%(AdditionalDependencies)`

And after that it finally compiled and launched.

But... `api->Init` returned `-1`. To work with tesseract you should have tessdata directory with .traineddata files for the languages you need.

Download tessdata. I got it from the official docs (https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files.md). BTW, tessdata_fast worked better than tessdata_best for my purposes :) So I downloaded single "eng" file and saved it like `C:\tools\TesseractData\tessdata\eng.traineddata`.

Then I added environment variable `TESSDATA_PREFIX` with value `C:\tools\TesseractData\tessdata`. I also added `C:\tools\TesseractData` to Path variables (just in case)

And after all this it is finally working for me.
"""

Did I miss something? How would you do it as a professional? Write the perfect and complete tutorial that would solve these problems of mine when trying to use Tesseract with Visual Studio C++. And write the reason why you guarantee that it will now work.

Thinking

Configuring setup

The user is integrating Tesseract OCR in Visual Studio 2022 using vcpkg, ensuring they set accurate environment variables and include directories for smooth compilation.

Setting linker paths

I confirmed the linker and library paths are correct. The issue might be with solution configuration or vcpkg integration.

Clarifying include paths

I'm seeing the distinction between absolute and relative paths. The issue might stem from building in the wrong configuration or a mismatch between specified and actual include directories.

Confirming paths and configurations

Verifying correct configuration selection and directory path accuracy. Suggesting a thorough step-by-step tutorial for precise instructions on integrating Tesseract OCR with Visual Studio 2022 using vcpkg.

Setting up Tesseract OCR

I'm including detailed steps: installing vcpkg, integrating it with Visual Studio 2022, ensuring correct configurations, managing include and library directories, and setting environment variables for TESSDATA_PREFIX.

Ensuring runtime paths

I’m detailing the importance of environment variable PATH settings for Tesseract OCR in Visual Studio 2022, specifying paths for static or dynamic libraries, and guiding through the initialization and compilation process.

Confirming alignment

I’m emphasizing the need to align the build configuration with Tesseract's installation method. Highlighting the need for consistency in static or dynamic linking to ensure successful compilation.

Thought about Tesseract installation issues for 25 seconds

Answer

Below is a “from zero to working” guide that shows each step needed to get Tesseract OCR working in a Visual Studio C++ project on Windows, using vcpkg for a static build. The instructions also explain why some steps are necessary (for instance, matching the proper runtime library and platform). If you follow these steps carefully, your includes will be recognized, your project will link, and api->Init() will succeed.


1. Install Prerequisites

(A) Git

  • Install Git for Windows (vcpkg is cloned with it): github.com/git-for-windows/git/releases/latest

(B) Visual Studio 2022

  • Make sure you have Visual Studio 2022 installed (any edition).
  • Important: Also install the Desktop development with C++ workload.
    This ensures you have all the compiler tools and the “x64 Native Tools Command Prompt.”

2. Install and Configure vcpkg

  1. Open an x64 Native Tools Command Prompt for VS 2022 (this step is not strictly mandatory, but it helps ensure you are building for the correct architecture).

  2. Clone vcpkg into some folder, e.g. C:\vcpkg:

    cd C:\
    git clone https://github.com/microsoft/vcpkg
  3. Bootstrap vcpkg:

    cd vcpkg
    .\bootstrap-vcpkg.bat -disableMetrics
  4. Integrate with Visual Studio:

    .\vcpkg integrate install

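    This tells Visual Studio how to automatically locate headers and libs installed by vcpkg. By default the integration targets the x64-windows triplet for x64 builds, not x64-windows-static, which is why the explicit include and library directories in section 4 are still needed for a static install.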

  5. Install Tesseract (static):

    .\vcpkg install tesseract:x64-windows-static
    • Note the :x64-windows-static triplet. This means “build for x64 platform (not x86), statically linked.”

3. Prepare a Tesseract Data Folder

Tesseract needs its language data (e.g., eng.traineddata) at runtime. You can store it anywhere, but a common pattern is to put it in your project folder.

  1. Download eng.traineddata (or whichever language(s) you need) from the official Tesseract OCR data repository: https://github.com/tesseract-ocr/tessdata

  2. Create a tessdata folder in your project. For instance:

    C:\MyProject\tessdata\eng.traineddata
    
  3. Set TESSDATA_PREFIX environment variable so that Tesseract knows where to find tessdata:

    • Open Windows “Environment Variables” settings.
    • Create a new user/system variable:
      • Name: TESSDATA_PREFIX
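      • Value: C:\MyProject\tessdata
        (Important: with Tesseract 4 and later, which includes the 5.x build that tesseract55.lib in the question indicates, TESSDATA_PREFIX should point to the tessdata directory itself, i.e. the folder that contains eng.traineddata. Only the older 3.x releases expected the parent directory.)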
  4. (Optional but recommended) Add C:\vcpkg\installed\x64-windows-static\bin to your system PATH environment variable.

    • This is often not strictly needed for a static build, but if Tesseract or Leptonica (or other dependencies) happen to install DLLs or tools, you’ll want them in your path.
  5. Restart Visual Studio so that these environment variables get picked up.


4. Set Up Your Visual Studio C++ Project

  1. Open Visual Studio 2022 and create a new Console App in C++ (or open your existing project).

  2. In Solution Explorer, right-click your project > Properties.

  3. Make sure you’re editing the Release | x64 configuration:

    • Configuration: Release
    • Platform: x64
    • (Because you installed tesseract:x64-windows-static; if you build for a different configuration, you may get “cannot find .lib” or mismatch errors.)
  4. Under Configuration Properties > C/C++ > General:

    • Additional Include Directories:
      C:\vcpkg\installed\x64-windows-static\include
      
    • This is what allows you to do:
      #include <tesseract/baseapi.h>
      #include <leptonica/allheaders.h>
      without specifying absolute paths.
  5. Under Configuration Properties > Linker > General:

    • Additional Library Directories:
      C:\vcpkg\installed\x64-windows-static\lib
      
  6. Under Configuration Properties > Linker > Input:

    • Additional Dependencies (these are the main Tesseract libraries, but you may need others if Tesseract depends on them):
      tesseractXXX.lib
      leptonica-1.xx.0.lib
      zlib.lib
      ...
      
    • If you’re unsure which libs to add, you can open C:\vcpkg\installed\x64-windows-static\lib and see the .lib files that got installed with Tesseract. Sometimes you may need to add all of them. However, most of the time, tesseractXXX.lib and leptonica-1.xx.0.lib plus any referenced libs (zlib, etc.) suffice.
  7. Under Configuration Properties > C/C++ > Code Generation:

    • Runtime Library = Multi-threaded (/MT)
    • This must match the static Tesseract build, which uses /MT internally.
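
The platform and runtime-library settings above are easy to get wrong in one configuration while they are correct in another. As a rough sketch, the compiler's own predefined macros can report what the current build actually uses: MSVC defines _M_X64 for x64 targets, _DLL when /MD or /MDd is selected, and _DEBUG for the debug runtimes. The function name describeBuild is a hypothetical helper for illustration only.

```cpp
#include <string>

// Hypothetical helper: summarizes the settings that matter for the
// x64-windows-static triplet, based on MSVC's predefined macros.
// On other compilers it only reports that the check does not apply.
std::string describeBuild() {
#if defined(_MSC_VER)
    std::string s;
  #if defined(_M_X64)
    s += "x64";
  #else
    s += "not x64 (x64-windows-static libraries will not link)";
  #endif
  #if defined(_DLL)
    s += ", /MD or /MDd (DLL runtime; the static triplet expects /MT)";
  #elif defined(_DEBUG)
    s += ", /MTd (static debug runtime)";
  #else
    s += ", /MT (static release runtime)";
  #endif
    return s;
#else
    return "not MSVC: the runtime-library check does not apply";
#endif
}
```

Printing describeBuild() at the start of main() in each configuration shows at a glance whether the project matches the Release | x64 and /MT settings described above.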

5. Test with a Minimal Code Sample

In your main .cpp file, try something like:

    #include <tesseract/baseapi.h>
    #include <leptonica/allheaders.h>
    #include <iostream>

    int main() {
        // Create Tesseract API instance
        tesseract::TessBaseAPI* api = new tesseract::TessBaseAPI();

        // Initialize API with "eng".
        // If TESSDATA_PREFIX is set properly, Tesseract will find
        // eng.traineddata automatically in the folder it designates.
        if (api->Init(nullptr, "eng")) {
            std::cerr << "Could not initialize Tesseract.\n";
            delete api;  // release the instance on the failure path too
            return 1;
        }

        std::cout << "Tesseract initialized successfully!\n";

        // Clean up
        api->End();
        delete api;
        return 0;
    }

Now Build the project in Release | x64. You should see no include errors and no link errors. Then Run the executable. If Tesseract is set up properly, you should see:

Tesseract initialized successfully!

printed in the console.
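
If Init still fails, one way to rule out the environment variable is to pass the data directory as the first argument of Init instead of nullptr. The sketch below assumes the directory you pass is the one that holds eng.traineddata (the layout Tesseract 4 and later use). The name initFromDirectory and the HAVE_TESSERACT macro are hypothetical, introduced only for this example. The __has_include guard lets the file compile even when the include path is not yet set up, in which case the function simply says so.

```cpp
#include <iostream>
#include <string>

// Detect whether the Tesseract header is visible to the compiler.
#if defined(__has_include)
#  if __has_include(<tesseract/baseapi.h>)
#    include <tesseract/baseapi.h>
#    define HAVE_TESSERACT 1
#  endif
#endif
#ifndef HAVE_TESSERACT
#  define HAVE_TESSERACT 0
#endif

// Hypothetical helper: initializes Tesseract from an explicit data directory
// rather than from TESSDATA_PREFIX. Returns true only if Init succeeds.
bool initFromDirectory(const std::string& tessdataDir, const std::string& lang) {
#if HAVE_TESSERACT
    tesseract::TessBaseAPI api;  // stack object, cleaned up automatically
    // Init returns 0 on success and -1 on failure.
    if (api.Init(tessdataDir.c_str(), lang.c_str()) != 0) {
        std::cerr << "Init failed for data directory: " << tessdataDir << "\n";
        return false;
    }
    api.End();
    return true;
#else
    // The header was not found, so the include directory is not configured.
    (void)tessdataDir;
    (void)lang;
    std::cerr << "tesseract/baseapi.h is not on the include path\n";
    return false;
#endif
}
```

With the example folder from section 3, the call would be initFromDirectory("C:\\MyProject\\tessdata", "eng"). If this succeeds while Init(nullptr, "eng") fails, the problem lies with TESSDATA_PREFIX rather than with the build.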


Why This Will Work

  1. Headers recognized:

    • By adding C:\vcpkg\installed\x64-windows-static\include to the Additional Include Directories, the compiler sees tesseract/baseapi.h and leptonica/allheaders.h without complaining, so you don’t need to resort to #include <C:\vcpkg\...>.
  2. Libraries linked:

    • By adding C:\vcpkg\installed\x64-windows-static\lib to the Additional Library Directories and listing tesseractXXX.lib, leptonica-1.xx.0.lib, etc., the linker can resolve Tesseract’s symbols.
  3. No runtime mismatch:

    • Setting /MT (Multi-threaded) in Code Generation ensures you’re using the same runtime library Tesseract’s static build uses. Mixing /MD (DLL runtime) and /MT (static runtime) leads to link or runtime errors.
  4. TESSDATA_PREFIX:

    • Tesseract needs to know where its language data lives. In Tesseract 4 and later, TESSDATA_PREFIX names the tessdata directory itself.
    • By setting it to C:\MyProject\tessdata, Tesseract will look for eng.traineddata directly in that folder.
  5. Matching architecture:

    • You installed tesseract:x64-windows-static, so your Visual Studio project must be set to “x64” in the top toolbar and in Project Properties. Otherwise, you’ll get “Cannot open file tesseractXXX.lib” or “unresolved external symbol” errors.
  6. Fresh environment:

    • Restarting Visual Studio after changing environment variables ensures that the IDE sees the new variables.
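
A process receives its environment variables when it starts, and a program launched from Visual Studio inherits the IDE's copy, so a variable added while the IDE is open stays invisible until the restart. A small sketch for confirming what your program actually sees, assuming C++17 for std::optional; envValue is a hypothetical helper name:

```cpp
#define _CRT_SECURE_NO_WARNINGS  // MSVC otherwise flags std::getenv as deprecated
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper: returns the value of an environment variable as seen
// by the running process, or std::nullopt if the variable is not set.
std::optional<std::string> envValue(const char* name) {
    const char* value = std::getenv(name);
    if (value == nullptr) {
        return std::nullopt;
    }
    return std::string(value);
}
```

Calling envValue("TESSDATA_PREFIX") at the top of main() and printing the result (or a "not set" message) shows whether the variable reached the program.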

Following the above steps carefully ensures that all paths, link libraries, environment variables, and runtime libraries are in sync. This is the reason why this configuration “just works.” Once this setup is correct, you can include Tesseract in your C++ code as cleanly as:

    #include <tesseract/baseapi.h>
    #include <leptonica/allheaders.h>

…and everything should compile, link, and run without error.
