tesseract | FSU Research Computing Center

Tesseract is an open source optical character recognition (OCR) engine. OCR extracts text from images and from documents that lack a text layer, and writes the result to a new file such as plain text, a searchable PDF, or another common format. Tesseract is highly customizable and supports many languages, including multilingual documents and vertical text.

Running tesseract on RCC Resources

Tesseract does not require a modulefile, so you can run it directly from the terminal on the HPC. In the command below, replace imagename and outputbase with your input image and output base name. The available options and configfile contents are documented at https://tesseract-ocr.github.io/tessdoc/

tesseract imagename outputbase [options...] [configfile...]
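
As an illustration of the general form above, here is a minimal sketch. It assumes a hypothetical input image named scan.png and that the English (eng) language data is installed; neither is specified by this page. The script only runs tesseract if both the command and the image are present, and otherwise reports that it skipped.

```shell
#!/bin/sh
# Hypothetical filenames (assumptions, not from the RCC page):
# scan.png is the input image, scan is the output base name.
image="scan.png"
outputbase="scan"

if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
    # Show which language packs are available on this system.
    tesseract --list-langs

    # Default output: plain text, written to scan.txt.
    tesseract "$image" "$outputbase" -l eng

    # Same image, with the "pdf" configfile: writes a searchable scan.pdf.
    tesseract "$image" "$outputbase" -l eng pdf
    status="ran"
else
    echo "tesseract or $image not found; nothing to do" >&2
    status="skipped"
fi
```

Here `-l eng` is the option that selects the language and `pdf` is a configfile that selects the output format. Tesseract appends the file extension to the output base name itself, so the base name is given without one.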