Scan the text and save the result to your computer as an image file (.bmp). Then use an OCR recognition system to convert it to text, and finally use WORD to revise and edit it. Here is how to use OCR:
OCR is the abbreviation of the English term Optical Character Recognition, which means recognizing text through optical technology. It is an important area of research and application in the field of automatic recognition technology. OCR is a software technology that can automatically recognize text and input it into the computer, and it is the main software paired with the scanner. It belongs to the category of non-keyboard input and requires the cooperation of an image input device, mainly the scanner. Today the term OCR mainly refers to text recognition software. Before Tsinghua Unigroup began to use Chinese recognition software in 1996, scanners and OCR software on the market were sold separately. Scanner manufacturers now sell professional OCR software together with their own scanners, and OCR software is constantly being upgraded. The rapid development of OCR technology is inseparable from the widespread use of scanners. In the past two years, with the gradual popularization of scanners and the improvement of OCR technology, OCR has become a powerful assistant for most scanner users.
1. The development history of OCR technology
Since the first generation of OCR products appeared in the early 1960s, more than 30 years of continuous development and improvement have brought remarkable results in research on various OCR technologies, including handwriting recognition. Users' requirements for OCR products have also evolved: the original focus on recognition rate alone has widened to include the recognition speed of the entire OCR system, the friendliness of the user interface, simplicity of operation, product stability, adaptability, reliability, ease of upgrading, and the quality of pre-sales and after-sales service.
IBM was the first to develop an OCR product. In 1965, IBM's OCR product, the IBM1287, was exhibited at the New York World's Fair. At that time, the product could only recognize printed numbers, English letters and some symbols, and only in a specified font. In the late 1960s, Hitachi and Fujitsu also developed their own OCR products. The world's first automatic letter sorting system with handwritten postal code recognition was developed by Toshiba Corporation of Japan. Two years later, NEC Corporation launched a similar system. By 1974, the automatic sorting rate of letters had reached about 92%, and such systems were widely used in postal services to good effect. In 1983, Toshiba released OCRV595, an OCR system for recognizing the printed Chinese characters used in Japanese, with a recognition speed of 70 to 100 characters per second and a recognition rate of 99.5%. Later, Toshiba began research on handwritten Japanese and Chinese character recognition.
China's research on OCR technology started relatively late. Research on recognizing numbers, English letters and symbols only began in the 1970s, and research on Chinese character recognition began in the late 1970s. In 1986, the National 863 Program in the information field organized three units, Tsinghua University, Beijing Institute of Information Engineering, and Shenyang Institute of Automation, to jointly develop Chinese OCR software. In 1989, Tsinghua University took the lead in launching the first Chinese OCR software in China: Tsinghua Wentong TH-OCR version 1.0. With this, Chinese OCR officially moved from the laboratory to the market. Tsinghua later launched TH-OCR 92, a high-performance, practical, multi-functional recognition system for printed Chinese characters that handled simplified and traditional characters in multiple fonts, marking significant progress in printed Chinese character recognition. The TH-OCR 94 high-performance Chinese-English mixed printed text recognition system, launched in 1994, was assessed by experts as "the first Chinese-English mixed printed text recognition system launched at home and abroad, and is generally at the leading international level."
In the mid-to-late 1990s, the Department of Electronic Engineering of Tsinghua University proposed and carried out comprehensive research on Chinese character recognition, achieving important results across printed text recognition, online handwritten Chinese character recognition, offline handwritten Chinese character recognition, and offline handwritten digit and symbol recognition. The representative achievement is the TH-OCR 97 integrated Chinese character recognition system, which can handle recognition input of multi-language (Chinese, English, Japanese) printed text, online handwritten Chinese characters, offline handwritten Chinese characters, and handwritten numbers. In the past few years, besides Tsinghua Wentong TH-OCR, other OCR software with its own character, such as Shangshu SH-OCR, has also been released. The Chinese OCR market has steadily expanded, with users all over the world.
It can be said that recognition technology for printed text has now reached a high level. OCR products have evolved from early models that could only recognize numbers, English letters, and some symbols printed in a specified font into powerful, fast information-entry tools that automatically perform layout analysis and table recognition and can recognize mixed-language text, multiple fonts, multiple font sizes, and mixed horizontal and vertical layouts. The recognition rate for printed Chinese characters exceeds 98%, and even for poorly printed characters it exceeds 95%. These products can recognize simplified and traditional Chinese in typefaces such as Song, Hei, Kai, and imitation Song, including text that mixes several typefaces and font sizes. The recognition rate for handwritten Chinese characters is above 70%. In particular, after more than ten years of hard work, China's Chinese character OCR technology has overcome difficulties such as its late start and the extremely large Chinese character set. The single-character recognition speed (the number of characters processed per unit time, from feature extraction to output of the recognition result) can exceed 70 characters per second. Because printed Chinese character recognition is relatively mature, OCR products are widely used in industries such as journalism, printing, publishing, libraries, and office automation.
Professional OCR products are mostly oriented to specific industries; that is, they suit departments that must enter large volumes of form data every day, such as postal services, taxation, customs, and statistics. Because such industry-specific systems deal with relatively fixed formats and a relatively small set of characters, and are often used together with special input devices, they are fast and efficient. Automatic mail sorting systems are one example.
Handwritten document recognition products did not begin to appear until 1996 and 1997, and were provided as an additional function of printed document recognition products. Since people's writing habits vary widely, unconstrained handwriting recognition is quite difficult. The main application of handwriting OCR technology is therefore online handwriting recognition, in which the computer recognizes the text as the person writes it, a real-time recognition method.
2. The basic principle of OCR
To put it simply, the basic principle of OCR is to input the image of a document into the computer through a scanner; the computer then extracts the image of each character and converts it into a Chinese character code. In detail, the scanner converts the optical signal from the Chinese character manuscript into an electrical signal through a charge-coupled device (CCD), then converts it into a digital signal through an analog/digital converter and transmits it to the computer. The computer receives a digital image of the document, in which the Chinese characters may be printed or handwritten, and then recognizes the characters in that image. For printed characters, optical methods are first used to convert the document into an original black-and-white dot matrix image file, and recognition software then converts the text in the image into text format for further processing by word processing software. Of these stages, text recognition is the core technology of OCR.
1. Two ways of OCR recognition
Like any other data, all graphic and text information captured by the scanner is recorded in the computer using the two digits 0 and 1. All of it is just a string of points, or sample points, saved as 0s and 1s. The OCR program recognizes the character information on the page mainly through two methods: unit pattern matching and feature extraction.
The unit pattern matching method (Pattern Matching) loosely compares each character against a file of stored bitmaps for standard fonts and font sizes.
If the application has a large database of stored characters, it will pick out the appropriate character as the correct match. The software must use some processing technique to find the most similar match, usually by repeatedly comparing against different versions of the same character. Some software can scan a page of text and learn each character so as to define a new font. Other software uses its own recognition technology to identify as many characters on the page as it can, and then lets the user manually select or directly type the characters it could not recognize.
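The pattern matching idea can be illustrated with a minimal Python sketch. It assumes tiny 5x5 bitmaps stored as strings of 0s and 1s and uses the count of differing pixels (Hamming distance) as the similarity measure; the `TEMPLATES` table and the function names are hypothetical and only stand in for a real database of font bitmaps.

```python
# Hypothetical 5x5 bitmap templates; a real system stores many fonts and sizes.
TEMPLATES = {
    "I": ["00100", "00100", "00100", "00100", "00100"],
    "L": ["10000", "10000", "10000", "10000", "11111"],
    "T": ["11111", "00100", "00100", "00100", "00100"],
}


def hamming_distance(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def match_template(glyph, templates=TEMPLATES):
    """Return (label, distance) for the stored template closest to the glyph."""
    scored = ((label, hamming_distance(glyph, bitmap)) for label, bitmap in templates.items())
    return min(scored, key=lambda pair: pair[1])


# A scanned "T" with one stray pixel of noise in the third row.
noisy_t = ["11111", "00100", "00110", "00100", "00100"]
print(match_template(noisy_t))
```

Because the comparison is "loose", the noisy glyph still maps to the nearest template instead of requiring an exact pixel-for-pixel match.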
Feature extraction (Feature Extraction) decomposes each character into many separate features, such as diagonal lines, horizontal lines, and curves. These features are then matched against those of known characters. To give a simple example, if the application detects two horizontal lines, it will "think" that the character may be "二" (two). The advantage of the feature extraction method is that it can recognize a wide variety of fonts. Chinese calligraphy typefaces, for example, are recognized using feature extraction.
Most OCR applications now include an intelligent checking function that further improves the recognition rate, mainly by correcting spelling and grammar through context checking. During text recognition, the OCR application performs multiple context coherence checks, comparing each string of words against the phrases and fixed word orders already stored in the program. More advanced applications will automatically replace a wrong word with the word they "think" is correct, restoring the meaning of the sentence.
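The "two horizontal lines" example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the glyph is a small bitmap of 0/1 strings, a horizontal stroke is any group of adjacent rows containing a run of at least `min_len` ink pixels, and the stroke-count table `GUESS_BY_STROKES` is hypothetical. Real systems extract many more features than this one.

```python
def longest_run(row):
    """Length of the longest run of '1' (ink) pixels in one bitmap row."""
    return max((len(run) for run in row.split("0")), default=0)


def count_horizontal_strokes(bitmap, min_len=4):
    """Count groups of adjacent rows that each contain a long horizontal run."""
    strokes, in_stroke = 0, False
    for row in bitmap:
        is_stroke_row = longest_run(row) >= min_len
        if is_stroke_row and not in_stroke:
            strokes += 1  # a new stroke starts on this row
        in_stroke = is_stroke_row
    return strokes


# Hypothetical feature table: number of horizontal strokes -> guessed character.
GUESS_BY_STROKES = {1: "一", 2: "二", 3: "三"}


def guess_character(bitmap):
    """Guess a character from its horizontal-stroke count, or None if unknown."""
    return GUESS_BY_STROKES.get(count_horizontal_strokes(bitmap))


thin = ["000000", "011110", "000000", "000000", "111111", "000000"]
bold = ["011110", "011110", "000000", "111111", "111111", "000000"]
print(guess_character(thin), guess_character(bold))
```

The thin and bold glyphs differ pixel by pixel, yet both yield two horizontal strokes, which is why feature-based recognition copes with varied fonts better than direct bitmap comparison.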
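A very small sketch of dictionary-based correction, written in Python with the standard `difflib` module, gives a feel for this kind of checking. The `VOCABULARY` list and the 0.8 similarity cutoff are assumptions chosen for illustration, and the sketch works word by word only; it does not attempt the grammar or word-order checks described above.

```python
import difflib

# Hypothetical vocabulary; a real product ships a much larger word and phrase list.
VOCABULARY = ["optical", "character", "recognition", "scanner", "software"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace a word with its closest vocabulary entry, if one is similar enough."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Apply word-level correction to a whitespace-separated string."""
    return " ".join(correct_word(word) for word in text.split())


print(correct_text("optical charactcr rec0gnition"))
```

Words with no sufficiently close entry are left unchanged, so unfamiliar but correct words are not overwritten.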
2. Several steps of text recognition
Text recognition includes the following steps: image and text input, pre-processing, single-word recognition and post-processing, etc.
(1) Image and text input
Image and text input refers to entering documents into the computer through an input device, that is, digitizing the originals. The most commonly used device today is the scanner. The scanning quality of the document image is a prerequisite for correct recognition by OCR software. Choosing an appropriate scanning resolution and related parameters is the key to keeping the text clear and its features intact. In addition, the document should be placed as straight as possible so that the tilt angle detected during preprocessing is small and the text image is only slightly deformed after tilt correction. These simple steps improve the system's recognition accuracy. Conversely, improper scanning settings can leave so many broken strokes in the text that only half of a character is detected. Broken strokes and strokes that stick together cause some features to be lost; when the features are compared with the feature database, the feature distance increases and so does the recognition error rate.
(2) Preprocessing
Taking the scanned image of a simple printed document, separating out each character image, and handing it to the recognition module is called image preprocessing. Preprocessing refers to the preparation work done before text recognition, including image cleaning to remove obvious noise (interference) from the original image. The main tasks are to measure the tilt angle at which the document was placed, perform layout analysis on the document, confirm the typesetting of the selected text regions, segment horizontal and vertical text lines, separate the character images within each line, and identify punctuation marks. This stage is very important: the quality of the processing directly affects the accuracy of text recognition.
Layout analysis is an overall analysis of the text image. It separates all text blocks in the document and distinguishes text paragraphs and their reading order, as well as image and table regions. The boundary of each text block (the coordinates of the region's start and end points in the image), the attributes of the region (horizontal or vertical layout), and the connections between text blocks are packaged as a data structure and passed to the recognition module for automatic recognition. Text regions are recognized directly, table regions go through dedicated table analysis and recognition, and image regions are compressed or simply stored. Line segmentation is the process of cutting a large image into lines and then separating individual characters from each image line.
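One common way to perform line and character segmentation on a clean, already deskewed binary image is the projection profile: sum the ink pixels in each row to find text lines, then sum each column within a line to find characters. The Python sketch below assumes the page is a list of rows of 0/1 integers and that lines and characters are separated by fully blank rows or columns; the function names are hypothetical, and the source does not say which segmentation method any particular product uses.

```python
def projection_runs(profile):
    """Return (start, end) pairs, end exclusive, for runs of non-zero profile values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))        # a blank entry ends the run
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(image):
    """Find text lines from the horizontal projection (ink count per row)."""
    return projection_runs([sum(row) for row in image])


def segment_characters(image, line):
    """Find characters within one line from the vertical projection (ink per column)."""
    top, bottom = line
    width = len(image[0])
    profile = [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]
    return projection_runs(profile)


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
for line in segment_lines(page):
    print(line, segment_characters(page, line))
```

The sketch also shows why scan quality matters: strokes that touch across a gap merge two characters into one run, and broken strokes split one character into several.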
(3) Single character recognition
Single character recognition is the core technology of OCR text recognition. The character images extracted from the scanned document are converted by the computer into the standard codes of the characters. This is the key to letting the computer "recognize characters", and it is what is meant by recognition technology.
People recognize text because various characteristics of it, such as the structure of each character and its strokes, are already stored in the brain. For a computer to recognize text, the characteristics of the text must likewise be stored in the computer. However, deciding what kind of information to store and how to obtain it is a very complicated matter, and a very high recognition rate must be achieved to meet practical requirements. The common approach is to analyze the strokes of the character, its feature points, its projection information, the regional distribution of its points, and so on.
There are thousands of commonly used Chinese characters. Recognition is a feature comparison technique: the input is compared with the recognition feature library, the character with the most similar features is found, and the standard code of that character is output as the recognition result. Comparison is a basic way for people to understand things, and Chinese character recognition likewise uses comparison to find what is identical, similar, and different between characters. For a large character set, coarse classification generally uses multi-level classification, multiple features, and all-round dynamic matching to find the set of similar candidates, which ensures a high classification rate, strong adaptability, and good stability. Fine classification then concentrates on the differences within that set, using difference matching, weighting, structural identification, quantitative and qualitative analysis, and the relationship with the preceding and following characters to reach the final decision. Chinese character recognition is essentially an application of comparative or cognitive science in artificial intelligence, and its key technology is the recognition feature library. Only with such a feature library can the computer perform character recognition.
Besides text and pictures, the layout of an image document sometimes contains tables. To digitize the recognized tables, the table regions need special processing during layout analysis. This includes extracting the structural information of the table lines, ordering the text fields in the table, recognizing both the table lines and the text fields, and generating different file formats from the digitized table lines. Because tables in documents are arbitrary and come in many formats, closed and open, and especially because of diagonal lines inside table cells, table analysis presents certain difficulties.
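The two-stage comparison against a feature library can be sketched as follows. Everything specific here is an assumption made for illustration: the `FEATURE_LIBRARY` entries, the use of stroke count for coarse classification, the three-component feature vector (counts of horizontal, vertical, and diagonal strokes), and Euclidean distance as the "feature distance". Production systems use far richer features and weighting.

```python
import math

# Hypothetical feature library: character -> (stroke count, feature vector).
# The vector holds counts of (horizontal, vertical, diagonal) strokes.
FEATURE_LIBRARY = {
    "一": (1, (1.0, 0.0, 0.0)),
    "二": (2, (2.0, 0.0, 0.0)),
    "十": (2, (1.0, 1.0, 0.0)),
    "人": (2, (0.0, 0.0, 2.0)),
    "三": (3, (3.0, 0.0, 0.0)),
    "土": (3, (2.0, 1.0, 0.0)),
}


def coarse_candidates(stroke_count, tolerance=1):
    """Coarse classification: keep characters whose stroke count is close enough."""
    return [ch for ch, (n, _) in FEATURE_LIBRARY.items()
            if abs(n - stroke_count) <= tolerance]


def recognize(stroke_count, features):
    """Fine classification: pick the candidate with the smallest feature distance."""
    candidates = coarse_candidates(stroke_count)
    if not candidates:
        return None
    return min(candidates, key=lambda ch: math.dist(features, FEATURE_LIBRARY[ch][1]))


# Slightly noisy measurements taken from a scanned "土".
print(recognize(3, (1.9, 1.1, 0.0)))
```

The coarse step shrinks a character set of thousands to a handful of similar candidates, so the more expensive fine comparison runs on only a few entries.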
(4) Post-processing
Post-processing refers to checking the recognized text, or the multiple candidate results for each character, against known phrases. The results of single-character recognition are segmented into words and compared with the phrases in a database, which improves the system's recognition rate and reduces the misrecognition rate.
Chinese character recognition is the most difficult problem in the field of text recognition. It is a comprehensive technology that draws on pattern recognition, image processing, digital signal processing, natural language understanding, artificial intelligence, fuzzy mathematics, information theory, computer science, Chinese information processing, and other disciplines. In recent years, the single-character recognition accuracy of printed Chinese character recognition systems has exceeded 95%. To further improve the overall recognition rate, image scanning, image preprocessing, and recognition post-processing have also been studied in depth and have made great progress, effectively improving the overall performance of printed Chinese character recognition systems. Tsinghua University has made outstanding research achievements in this area and has become one of the most authoritative institutions in the world. Currently, Tsinghua Unigroup's full range of scanners comes with Tsinghua OCR Millennium Edition software, which has reached a high level in recognition rate, form recognition, and even recognition of standard handwriting.
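As a minimal sketch of phrase-based post-processing in Python, suppose the single-character recognizer returns several scored candidates for each position. The `PHRASES` set, the scores, and the rule of summing scores are all assumptions for illustration; the sketch tries every combination, which is only practical for short phrases.

```python
from itertools import product

# Hypothetical phrase database.
PHRASES = {"识别", "文字", "扫描"}


def best_phrase(candidates):
    """Pick the highest-scoring candidate combination that forms a known phrase.

    candidates: one list per character position, each a list of (char, score).
    Falls back to the top-scoring character at each position if nothing matches.
    """
    best, best_score = None, -1.0
    for combo in product(*candidates):
        text = "".join(ch for ch, _ in combo)
        score = sum(s for _, s in combo)
        if text in PHRASES and score > best_score:
            best, best_score = text, score
    if best is not None:
        return best
    return "".join(max(position, key=lambda c: c[1])[0] for position in candidates)


# The recognizer slightly prefers "刑" for the second character, a misrecognition.
scanned = [[("识", 0.60), ("织", 0.40)], [("刑", 0.50), ("别", 0.45)]]
print(best_phrase(scanned))
```

Taken one character at a time, the top candidates spell "识刑", which is not a word; checking against the phrase database recovers "识别".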
3. OCR text recognition skills
In recent years, OCR technology has developed rapidly with the spread of scanners, and scanning and recognition software has grown steadily more capable, with continual upgrades and a move toward greater intelligence. However, to obtain correct scanning results quickly and enter text efficiently, you must study the relevant knowledge carefully, combine it with practical experience, and work out your own complete procedure. Sometimes the recognition rate we get is very low and falls well short of the 95% or more claimed by the software. Do not blame the hardware or software: the real reason is that we have not yet mastered scanning and OCR techniques.
The following are some methods and techniques commonly used in text recognition operations.
1. The resolution setting is an important prerequisite for text recognition. Generally speaking, the more image information the scanner provides, the more easily the recognition software can produce a result. That does not mean, however, that a higher scanning resolution always gives higher recognition accuracy.
A resolution of 300dpi or 400dpi suits most document scanning. When scanning text originals for recognition, never set the scanning resolution higher than the optical resolution of the scanner, or the loss will outweigh the gain. Below are some typical settings, for reference only.
(1) For article paragraphs with size 1, 2, and 3 fonts, it is recommended to use 200dpi.
(2) For article paragraphs with size 4, small 4, and size 5 fonts, it is recommended to use 300dpi.
(3) For article paragraphs with small 5 and size 6 fonts, it is recommended to use 400dpi.
(4) For article paragraphs with size 7 and 8 fonts, it is recommended to use 600dpi.
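The typical settings above amount to a small lookup table. The Python sketch below encodes them, reading row (2) as covering size 4, small 4, and size 5, and caps the result at the scanner's optical resolution as advised earlier. The label strings and the `optical_dpi` parameter are hypothetical names chosen for the example.

```python
# Typical settings from the list above, keyed by Chinese font size label.
# Row (2) is read as covering size 4, small 4, and size 5 (an interpretation).
RECOMMENDED_DPI = {
    "1": 200, "2": 200, "3": 200,
    "4": 300, "small 4": 300, "5": 300,
    "small 5": 400, "6": 400,
    "7": 600, "8": 600,
}


def recommended_dpi(font_size, optical_dpi):
    """Return the suggested scan resolution, never above the optical resolution."""
    return min(RECOMMENDED_DPI[font_size], optical_dpi)


print(recommended_dpi("small 4", optical_dpi=1200))
print(recommended_dpi("8", optical_dpi=300))
```

The second call shows the cap in action: the table suggests 600dpi for size 8 text, but a scanner with a 300dpi optical resolution stays at 300dpi.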
2. When scanning, adjust the brightness and contrast values appropriately so that the scanned document is crisp in black and white. This has the most critical impact on the recognition rate. Set the brightness and contrast so that the strokes of the Chinese characters in the scanned image are thin but unbroken. Before recognition, first check the quality of the text in the scanned image. If the image contains black spots or blotches, or the strokes are so thick and dark that they cannot be told apart, the brightness value is too low: increase it and try again. If the strokes are uneven or broken, or the outlines of the characters in the image are badly damaged, the brightness value is too high: reduce it and try again.
3. Choose the right software. Choosing a good OCR program that suits you is the basis for good text recognition. As a rule, do not use the OEM software that comes with the scanner: OEM OCR software has fewer functions and poorer results, and some versions do not support Chinese recognition at all. After comparison, I find that Tsinghua Unigroup OCR2003 Professional Edition and the Shangshu OCR6.0 automatic text recognition input system stand out in recognition capability and features. You should also choose an image-editing program. OCR software already has a scanning interface, so why look for separate imaging software? First, OCR software does not recognize every scanner; second, and more importantly, images scanned through the scanning interface of image software are easy to process afterwards. PHOTOSHOP is generally used.
4. If the text to be processed is formatted, for example with bold, italics, or first-line indentation, some OCR software cannot recognize the formatting, and the format will be lost or garbled characters will appear. If you must scan formatted text, check in advance whether your recognition software supports formatted text. You can also turn off style recognition so that the software concentrates on finding the correct characters rather than on fonts and formatting.
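The visual check described in tip 2 can be roughly approximated in code by measuring how much of a binarized image is ink. The Python sketch below is only a crude proxy: it assumes an 8-bit grayscale image given as rows of integers, and the `low` and `high` ink-ratio limits are illustrative guesses, not values from any scanner or OCR product. It cannot tell broken strokes from a sparse page, so it does not replace looking at the image.

```python
def binarize(gray, threshold):
    """Pixels darker than the threshold become ink (1); the rest background (0)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def ink_ratio(binary):
    """Fraction of pixels that are ink."""
    total = sum(len(row) for row in binary)
    return sum(map(sum, binary)) / total


def brightness_advice(binary, low=0.05, high=0.40):
    """Rough hint: too much ink suggests a dark scan, too little a washed-out one."""
    ratio = ink_ratio(binary)
    if ratio > high:
        return "increase brightness"
    if ratio < low:
        return "decrease brightness"
    return "ok"


gray = [[30, 120, 220], [40, 130, 230], [35, 125, 225]]
for threshold in (20, 100, 200):
    print(threshold, brightness_advice(binarize(gray, threshold)))
```

Varying the threshold on the same grayscale data mimics scanning the same page at different brightness settings.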