sips convert pdf to tiff to use in tesseract ocr
I'm using sips to convert a pdf file to a tiff file. I need to generate a tiff file without any alpha channels. I'm trying to run the resulting tiff through tesseract ocr. If I do this:
sips -s format tiff infile.pdf --out outfile.tif
It gives me a tiff file, but when I run it through tesseract, I get this:
command: tesseract outfile.tif ocr_data.txt
results: check legal_imagesize:Error:Only 1,2,3,5,6,8 bpp are supported:16
As a workaround, I use sips to convert the pdf to a jpeg first, then convert the jpeg to a tiff. That generates a tiff file that works with tesseract ocr. However, I don't want to do two image conversions just to get a working tiff file. Are there any sips settings that would let me convert the pdf directly to a tiff file that tesseract accepts?
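For reference, the two-step workaround can be sketched as a small shell function. This is only a rough sketch: the function names, the DRY_RUN switch, and the intermediate file name are placeholders I made up for illustration. Only the two sips invocations correspond to what I'm actually running.

```shell
#!/bin/sh
# Sketch of the two-step workaround: PDF -> JPEG -> TIFF using sips.
# Set DRY_RUN=1 to print the commands instead of executing them
# (DRY_RUN and the helper names are illustrative, not part of sips).

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

pdf_to_tiff_via_jpeg() {
    in_pdf=$1
    out_tif=$2
    # Hypothetical intermediate file name, derived from the output name.
    tmp_jpg="${out_tif%.*}.tmp.jpg"

    # Step 1: pdf -> jpeg
    run sips -s format jpeg "$in_pdf" --out "$tmp_jpg" &&
    # Step 2: jpeg -> tiff (this is the tiff that tesseract accepts for me)
    run sips -s format tiff "$tmp_jpg" --out "$out_tif" &&
    # Clean up the intermediate jpeg
    run rm -f "$tmp_jpg"
}
```

Called as `pdf_to_tiff_via_jpeg infile.pdf outfile.tif`, after which outfile.tif goes to tesseract as before. It works, but the jpeg step is exactly the extra conversion I'd like to avoid.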
Another unrelated question... Does anyone know how to convert a multi-page pdf to a multi-page tiff file? When I feed a multi-page pdf into sips, the resulting tiff is only the first page.
Mac OS X (10.5)