At the end of the previous investigations there were two remaining tasks:
- Fix the case where words are hyphenated across a line
- Provide a way of searching the text
The first of these proved reasonably straightforward, but the more I thought about the second, the more convinced I became that embedding the text into the PDF was the better approach. This was not so straightforward …
Removing the hyphens from words at the end of the line was fairly easy using sed:
sed -z 's/-\n//g' <file>
The “-z” option makes sed process the whole file in one go rather than line by line, and the command simply replaces the pattern of a ‘-’ followed by a newline with nothing.
I was worried that the hyphen could be scanned as either a plain ASCII minus sign ( ‘-’ ) or as one of the typographic dash symbols. However, from an inspection of various text files it seemed that all the words hyphenated across a line had been scanned with a plain minus sign, so I decided not to do any more complex processing. The “ocr.sh” script has been modified to include the sed command above.
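As a quick sanity check, the substitution behaves like this on a toy file ( the file name and contents are just for illustration; GNU sed is needed for “-z” ):

```shell
# Illustrative input: two words hyphenated across line breaks.
printf 'National Geo-\ngraphic maga-\nzine\n' > sample.txt
# -z reads the whole file as one record, so the pattern can match
# a '-' immediately followed by a newline.
sed -z 's/-\n//g' sample.txt
# prints: National Geographic magazine
```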
Embedding the text into the PDF
In theory it’s a case of adding the scanned text to the page and then overlaying the image. In this way the text behind the image is effectively selectable and search tools can index the pages. This does have the slight drawback that the de-hyphenating process above can’t be used, because it would mis-align the text on the page with the image.
I initially thought that this was going to be dead easy because, from the tesseract FAQ:
What output formats can Tesseract produce?
pdf with text layer only
With the configfile ‘pdf’ tesseract will produce searchable PDF containing pages images with a hidden, searchable text layer.
However tesseract embeds in the PDF the same JPG image that was used for scanning, which unfortunately is not what’s needed here, because that image has been processed to make the OCR more accurate. To make this approach work it would be necessary to embed a different image from the one used for the scanning. This is not possible at present, although there is a GitHub feature request that discusses exactly this.
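For reference, producing tesseract’s built-in searchable PDF, and the hocr output used by the tools below, looks roughly like this ( the file names are hypothetical and tesseract needs to be installed ):

```shell
# 'pdf' config: searchable PDF, but with the processed scanning image embedded.
tesseract page_0001.jpg page_0001 pdf
# 'hocr' config: per-word text and bounding boxes as HTML, usable by other tools.
tesseract page_0001.jpg page_0001 hocr
```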
Rather than use tesseract to generate the PDF I tried using a different set of open source tools called hocr-tools which should do the embedding. These tools need the text in a different input format called “hocr” but this is available from tesseract ( see above ).
However, when I tried it the embedded text ended up in a huge font, see left for a screenshot of a single page. In this case no image was overlaid, as I wanted to check the text layout. It seems that this is a known problem, but it’s unclear to me whether the hocr output from tesseract or the hocr-tools parsing is at fault. ( I found an HTML visualiser for the hocr output and it looked OK to me )
I also found some OCR test examples on-line and these showed the same problem so I don’t think that there’s a scanning issue. I prepared a bug report and sent it to the developers of the software but their response was a rather disappointing “Thank you for your email. Of course we can offer you commercial support for your inquiry to take a detailed look on the problem you reported.”
I had some further discussions with them but without a commercial contract they wouldn’t even comment on whether it was a bug with their software or a problem with my scanning so I decided to move on.
Fortunately I managed to find a different set of tools which would achieve the same thing. These are slightly easier to use, as you specify a directory of hocr files and their equivalent JPG images and the tools construct the PDF for you. This tool worked fine, with none of the font problems above. A few things that I noted were:
- The installation as suggested by the GitHub page didn’t work for me but I found a possible workaround in a Stackoverflow post. The software installed fine using the command:
sudo python -m pip install hocr-tools
- Previously I had just upscaled the processed images to allow tesseract to work correctly. However, this tool needs the original JPG upscaled as well, so that the text aligns correctly with the image. One point to note is that ImageMagick seems to lose the DPI setting when rescaling, and it needs to be reinstated with the “-density” option.
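The upscale-and-embed steps above can be sketched as follows; the file names, scale factor and DPI are assumptions for illustration, and both ImageMagick and hocr-tools need to be installed:

```shell
# Upscale the original JPG to match the images tesseract scanned,
# reinstating the DPI metadata that ImageMagick drops on resize.
convert orig_0001.jpg -resize 400% -units PixelsPerInch -density 300 pages/0001.jpg

# pages/ holds matching pairs ( 0001.jpg + 0001.hocr, … );
# hocr-pdf overlays the hidden text layer and writes the PDF to stdout.
hocr-pdf pages/ > magazine.pdf
```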
All of the necessary steps have been packaged into a shell script called ocr_embed.sh and this is available in my GitHub repository. The text in the PDF document can then be selected and copied to a different application if required ( see left ).
The standard desktop search would probably work OK, but I had previously had some success with a standalone tool called Recoll. This has some more sophisticated search capabilities and presents the results in a more user-friendly way. The user manual is pretty comprehensive, but I do like the auto-suggestions as you type in the query box and the ability to see some more context about the search results without opening the document:
It’s been quite a lengthy process to get to this point and has involved a lot of tools and scripts. However I think that I’ve now got the sort of result that National Geographic should have supplied in the first place, instead of their rather cumbersome Adobe AIR app.
I’m now in the process of converting the last 50 years’ worth of magazines to PDFs. It takes about 15 minutes per issue, so converting 600 of them is going to take a while 🙂