Monday, June 2, 2014

FineReader through Belarusian eyes
I was inspired to write this article by a LiveJournal blog post. A girl asked where she could find a Belarusian language package for the well-known program FineReader 8. Although nearly four years have passed since then and several new versions of the program have been released, the problem with OCR for the Belarusian language has still not been solved.
It must be hard to find someone who actively uses a computer and does not know about FineReader. The program is a very powerful tool for converting paper documents, images and PDF (and DjVu) files to plain text. But that is not all it can do. The latest versions can process documents of poor quality and recognize texts with a complex structure, for example ones containing tables, figures and not very complicated formulas. Of course, human intervention is still needed, but it is minimal, and a more or less complex text can be corrected in a few minutes (assuming this is not the first time you have seen the program). To date, the program supports 189 languages, 36 of which are basic, i.e. have dictionary support. Unfortunately, Belarusian is not one of the basic languages, and the quality of text recognition without dictionary support is very mediocre. Moreover, the company's specialists are not aware that Belarusian has three spellings in current use: the Latin spelling, the modern spelling (narkomovka) and the classical spelling (tarashkevitsa). So if, for example, you have a book in PDF or DjVu written in the Belarusian Latin alphabet, you will probably have difficulty converting it to plain text. I have appealed to the developers several times with a proposal to support the three Belarusian spellings at least without dictionaries, but I was always refused, with the explanation that neither the Belarusian-speaking population nor the state has a commercial interest in FineReader, so adding this now is not worthwhile. I understand that I am not the first to make this proposal. Through Google I found an interesting article published in the newspaper "Zvezda", which discusses not only why FineReader lacks proper support for Belarusian, but also the state of the Belarusian language in the computer world in general.
Once I was firmly convinced that our state is not interested in supporting the national language, and the experts from ABBYY had made it clear that they would change something only if they received a large commercial order for their product, I started looking for a way out of the situation myself, because I honestly did not like the OCR quality the program offers by default. During recognition the program often made mistakes and required my involvement. One thing was clear: there was no getting by without dictionary support... So I decided to add it myself. Moreover, I set myself the goal of introducing all three Belarusian spellings artificially.
First of all, I had to find dictionary files for all three Belarusian spellings. Naturally, the only criterion is the size of the dictionary file: the bigger, the better. After a little searching, I found suitable dictionary files on my hard drive. I already had FineReader installed (I use FineReader 11), so all that remained was to build the dictionaries in the program itself. Incidentally, I could not get this to work in version 9: the program imported only part of the words from the dictionary. Versions 10 and 11 did not have this limitation.
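Word lists gathered from several sources usually contain duplicates and blank lines, so it can help to tidy them up before building a dictionary. Below is a minimal Python sketch of that preparation step. It is not part of my original workflow: the function names are made up for illustration, and the one-word-per-line UTF-8 text format is an assumption, so check which format and encoding your version of FineReader accepts for import.

```python
from pathlib import Path


def merge_word_lists(texts):
    """Return a sorted list of the unique words found in several word-list texts.

    Each text is assumed to hold one word per line (an assumption, not a
    documented FineReader requirement).
    """
    words = set()
    for text in texts:
        for line in text.splitlines():
            word = line.strip()
            # Skip blank lines and multi-word entries.
            if word and " " not in word:
                words.add(word)
    return sorted(words)


def build_import_file(source_paths, target_path, encoding="utf-8"):
    """Merge the files in source_paths into target_path; return the word count."""
    texts = [Path(p).read_text(encoding=encoding) for p in source_paths]
    words = merge_word_lists(texts)
    Path(target_path).write_text("\n".join(words) + "\n", encoding=encoding)
    return len(words)
```

Calling `build_import_file(["list_a.txt", "list_b.txt"], "merged.txt")` (hypothetical file names) writes one combined list and returns how many distinct words it contains, which is a quick way to compare candidate dictionaries by size.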
Finally, you can import the dictionary for Belarusian (Modern Spelling); all the dictionaries can be downloaded as a single file. Choose User dictionary > Edit... > Import and wait... My computer is not very powerful, so importing the dictionary took about 5-6 minutes. It does not have to be imported every time, so the wait is tolerable...
Everything else follows the same procedure we used to create dictionary support for the modern spelling of Belarusian. As a result, we can recognize text in the classical spelling using the full dictionary.
Surely not everyone knows that Belarusian can be written not only in Cyrillic but also in the Latin alphabet (to be honest, not every Belarusian has even an elementary command of the modern spelling, let alone the Latin one). Yet this script was very popular in the 19th century. Incidentally, the first Belarusian newspaper, "Peasants' Truth", was printed in the Latin alphabet. Repeat all the steps we performed when integrating the modern spelling dictionary. Language name: instead of "Copy of Belarusian" write "Belarusian (Latin)". Source language: Belarusian. Alphabet: here everything has to be replaced. I have specially prepared a string for the Latin alphabet. You only need to copy and paste the string into Alphabet n
