OCR reader technicalities (DeepL)?

any of the big brains on here know anything about character recognition software? someone turned me on to an app called DeepL which looks fantastic.


the problem is that i have one PDF of a book (recently uploaded to the web by a library) which i can COPY text from and paste it into the DeepL browser and have the text translated. of course i could COPY and then paste this into a word document or whatever i wanted.


however. i also have a document that is a xerox of a book that was sent to me from a mathematics department. this is ALSO a PDF but it is not as clean as the other recently scanned text. and with this document i cannot copy any of the text. the pages on the PDF seem unselectable and only a big black box goes over a page when i try to copy text from it.


does anyone know what if any technical issues are involved here?


i need to know what i would need to ask someone to do in order to get an OCR readable (if that is the right term) PDF scan of this document.


thanks for any help or any leads or any good search terms to study up on this.


Mac Pro, macOS 10.13

Posted on Feb 6, 2020 3:31 PM

10 replies
Question marked as Top-ranking reply

Feb 7, 2020 11:16 AM in response to BDAqua

THANKS. i think i get it. it was late but also sort of a complicated topic.


so i guess the follow-ups i have (if you don't mind) are

  1. would some scanning of documents result in unselectable IMAGE FILES and some scanning of documents result in selectable VECTOR/TEXT files?
  2. i mean if i have something now that is unselectable could i RE-SCAN the document here in some way that might get me selectable text that i would not have to send to an OCR translation tool (which will require some cleaning up time i suppose).
  3. but also if i have someone at a university that would be willing to RE-SCAN the document on their end with a modern scanning machine would i have to tell them to do something specific? like give me PDF directly from the new scan i suppose?

Feb 7, 2020 11:40 AM in response to hotwheels22

  1. I don't see how scanning could result in real Text without some sort of OCR; scanning converts everything to a picture.
  2. No.
  3. A scan is going to be pictures no matter what form they send it in... PDF, JPEG, TIFF, etc.


PS. If scanning or rescanning, I've been told that GIF preserves Text readability better than JPG or other lossy compressed formats.


Feb 6, 2020 5:30 PM in response to hotwheels22

Yes on your first paragraph.


A. GraphicConverter > Picture > Simple Brightness/Contrast.


B. I sent them to this site, did you miss the link? https://www.onlineocr.net/


C. You can still use OCR on a pic or PDF at that site.


D. Generally, if you can't select the text, it's academic whether it is Text or a Graphic, but if you enlarge it enough, a Graphic will show rougher edges, whereas Text will be smoother.
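For anyone wondering what a brightness/contrast cleanup like the one in A does to a scan, here is a minimal sketch in plain Python. It is not what GraphicConverter runs internally; the pixel values are made up, and the function names `stretch_contrast` and `binarize` are only illustrative. The idea is to spread the grey values over the full range and then force each pixel to pure black or pure white, which usually gives OCR a cleaner input.

```python
def stretch_contrast(pixels):
    """Rescale grey values so the darkest pixel becomes 0 and the lightest 255."""
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A completely flat image has no contrast to stretch.
        return [row[:] for row in pixels]
    return [[round((value - lo) * 255 / (hi - lo)) for value in row]
            for row in pixels]


def binarize(pixels, threshold=128):
    """Force every pixel to pure black (0) or pure white (255)."""
    return [[0 if value < threshold else 255 for value in row]
            for row in pixels]


# Made-up 3x4 "scan": washed-out ink (about 140-150) on grey paper (about 190-210).
scan = [
    [200, 195, 145, 190],
    [210, 140, 150, 205],
    [198, 192, 148, 201],
]

# Thresholding the raw scan loses the ink, because every value is above 128.
raw = binarize(scan)

# Stretching first pulls the ink down towards 0 so the threshold separates it.
cleaned = binarize(stretch_contrast(scan))

for row in cleaned:
    print(row)
```

Real cleanup tools work on full images and often pick the threshold automatically, so treat this as an illustration of the principle only.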


Cleaned up 2nd page below...

Original below...

Feb 6, 2020 4:09 PM in response to hotwheels22

Well, 2 things on the not copying text...


They can block copying text.


It may not be text at all, but a picture/graphic/photo of text.


For OCR to work you'd need to clean up the second pic... which does look scanned.


2nd pic cleaned up...


Zur llas»ifigulion 41, 11, lig n dritter °rdnung.

u ',ggr. V.g.u.slice. in Hauen ing

1/ie Eintheilging der 1"litchen dritter Ordnung hinsiehtlieb der liea litiit ihrer °eraden und Ller singuliiren Punkte »ggr., er. schöpfengl gegeben von Herrn EI i seiner grossen Arludt (bi the 1/istribution •if Surfaees gd the third erder inlo Species, i: •l to the absenee or pre,enre 0U...uinpike Points. Pialos. t. I.nig• dun p. 2t, Gelegentlich meiner Promotion gah (.1e.elg nur die l'iltersuelning des Pentaeilers für die einzelnen Se 11 I fli 'schen Arten. welche Auf-gabe in meiner 11issertatigin: „1/.1.A Peidaegler der II dritter °rd-ran, beim Auftreten von Singularitiiten'• behandelt wird. Hierbei rdellte eä sieh hera, dass eine Iteibe speeieller 1:Liehen, insbesondere the,enigen mit eoniseben Knoten, :web ein eigenfilelies Pen.eder besitzen, während andere, wie die mit biplanaren 1,igiten, im Allgemeinen l'enta-eder »tit aneintlieb benachbarten Ebenen mit 4ia »ihren. Ab, nieht „nr alle denkbaren Ville der vereinigten 1,, ttirgl num inder Zugrungle• ,legung einer bestimmten Singulantat gelulgrt. Diese Thatsache suran-i lass. mich, die umgekehrte, und wohl sweekaniismgere, bragestellung nach denjenigen 1'1:ichen. welche einem ge,benen l'enlaegler hören, zu wählen. Die lichanillung der !dglluis entspringend. Auf-, - gebe 12t 1111,80111 Aufsatze glitreh,d»bri, Von der grössten ,Vichtigkeit glie folggaid. 1:ntersuginungen ist der von Herrnlilitin seiner Abhandlung „l'eber Flächen dritter Ordnung., Math. Annalen Bd. VI, p. 551 tf., gegebene ,:outt, dass elle . Flachen ohne Singularitiiten tun] derselben Sch i 'schen Art durell . continuirliche Aeuckrung der Umndauten in einander übergeführt wer. den können, ohne dass hierbei ein Knoten auftrete. Diese Derivntiwie • ist niindich fast ausnalmndos aurk dann noch zagglich, wenn .1i. unverändert bestehen bleibt.


https://www.onlineocr.net/


Not cleaned up...


Zur elemidication der Fläriben dritter Ordnung.

von

Com Ilmireenuo in Plauee in Vowlsede.

Die Eintheilung der $12,1... dritter Ordnung hinsichtlieh der Itee. Keit ihrer Geraden und der möglichen singulären Mulde werde er. schöpfend gegeben son Ilerrn Scheini in seiner grimm Arbeit: 11„the Diatribution of Surfacea uf the thinl urdor nto Specin, in referenee to the abeence ur presence of :Minier Philow heftet. dun 1863, g 207 HI Gelegene. meiner Promotion gab I« h mir die lIntemeloing Pentambers Air die einzeln. Schlaf I enchen Arten, welche Auf-gabe in meiner Oidertation: „Dar Pentneder der Fliehen dritter Ord. meng beim Auftreten von Singularinit.. behandelt wind. Hierbei stellte et nick her., dass eine heitre speciellor Flächen, insbesondere diejenigen mit cm lachen Knoten, noch ein eigentliches rentweder bnilaen, während Adere, wie die mit biplanaren Knoten, im Allgemeinen Pentn. eitler mit unendlich lenachbarten Ebenen mit sich flIbren. Aber nicht auf alle denkbaren Kille der vereinigt. Lege wird 1.11 unter Zugrunde-Jegeng einer beetimmten Singnlaried u.N.. Diese Tliatnche miete. die umgekehrte, und wohl sweekruidigne, Frageatallung ..eh denjenigen Flieh.. welche dnern gegebenen Penneder hören, au wilden. Die Itehmollung der hieraus entspringenden Auf-, 'gebe ist in ihnen Aufeatze durchgerd.. . . Von der grössten Wichtigkeit fnr die folgenden Untenuchimen IM der von Herrn KI ein in seiner Abhandlung „tleber Flächen dritter ()rdnone, Math. Annalen Bd. VI, 551 H., gegebene Se. dem olle Fliehen ohne Singiden.. und denelben Sel,l1fl dachen Art durch , . conlinuirliche neudunng der eonetarden in einrunder Mergelnhrl wer. den Idinnen, ohne dass hierbei ein Knoten auftrete. Dien Derivation int ninilich fad manahmelos auch da. noch möglieh, wenn MA hl., Mar .verandert bestehen



Do you have an OCR App?


Feb 6, 2020 4:31 PM in response to BDAqua

hey man.

help me here a little more please?

1st pic is a “clean scan”. up on the web as a PDF. 2nd pic is a “dirty scan” with a lot of “graphic noise”?

your first set is of the second text “cleaned up” and your second set is as it stands?


lastly i was going to look for an OCR app and i thought i had one in DeepL. but i guess that is just a translation app.


so i need an app that gets the actual text out of the PDF? can it take it from an image scan?! or specifically not an image scan?


THANKS

A. how did you clean it up?

B. how did you get text from either of them? when i open the second PDF on my desktop i cannot select anything. unlike the first where i can select a lot at a time (granted this first text selects text i don’t want to select so i have to go line by line and be careful to get it in order).

C. i DO have a “contemporary” pdf written by a guy who has made it unselectable. it is also in german. he is kind of a pain so i have to guess he deliberately printed to PDF as unselectable text so it cannot be translated or something?!

D. how do i determine whether i have an “image PDF” which would be unselectable versus a vector (?) PDF which has selectable text?


Feb 6, 2020 6:01 PM in response to BDAqua

thanks man.

i missed the link maneuvering on my phone i guess. appreciate it.

so just so i am clear. i think i /may/ have misposted with that second image. when i wanted to post an image from a PDF that had /unselectable/ text.

i mean - i am realizing i took a SCREENSHOT of a PDF and posted two examples and you were able to get the text from both.

and you cleaned up the second IMAGE file for a better OCR reading?

so if i ALSO have a pdf on my desktop that has unselectable text in the PDF i could export it all to image files (or go back to the original individual scans which are image files) and see if i can get readable text out of that?

Feb 6, 2020 6:19 PM in response to hotwheels22

You can upload either the PDF or the Scans to that site for conversion; it doesn't matter whether the text is selectable or not. Higher resolution is likely better, though.


Cleaning up the second one was a quick attempt to make it easier for the OCR. I don't know which one it did better on; it was mostly just to illustrate an example.
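One way to settle which OCR run did better is to type a line or two by hand as a reference and count how many character edits each OCR output is away from it. The sketch below does this with a plain Levenshtein distance. The reference string is only a guess at what the printed heading says, not something confirmed in this thread, and the helper names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def char_error_rate(reference, ocr_output):
    """Edits needed per reference character; 0.0 means a perfect match."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# ASSUMPTION: hand-typed guess at the printed heading, used as ground truth.
reference = "Zur Classification der Flächen dritter Ordnung."

# Heading lines copied from the two OCR outputs posted in this thread.
ocr_cleaned = "Zur llas»ifigulion 41, 11, lig n dritter °rdnung."
ocr_original = "Zur elemidication der Fläriben dritter Ordnung."

for label, text in [("cleaned", ocr_cleaned), ("original", ocr_original)]:
    print(label, round(char_error_rate(reference, text), 2))
```

A lower number means fewer corrections by hand afterwards. One heading line is a very small sample, so a full paragraph would give a fairer comparison.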

Feb 7, 2020 6:34 AM in response to BDAqua

THANKS man. i find it a tricky topic.

i think what i am realizing is i had fully readable text in a pdf in the first image and not readable at all text in a second image (meaning it was completely unselectable).

so i can take the unreadable text into an OCR reader and make it readable/selectable?

does the fact i can select text in a pdf imply it is vector data which i can input into a translator?

and does the fact i have unselectable text in a pdf imply it is raster data which i need to clean up and put into an ocr reader?

i guess i am trying to understand that part as well...

THANK YOU

Feb 7, 2020 11:03 AM in response to hotwheels22

You pretty much have it understood... Picture or Raster data would not have selectable text, as there is no text really, & running it through OCR will convert the pic of the text into real text. Whether cleanup is needed depends on how clear the pic is.


If you're scanning them yourself, I find VueScan has darn good built-in OCR...


https://www.hamrick.com/support/how-to-guides/how-to-scan-ocr-text-files.html
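On the earlier question of how to tell an image-only PDF from one with real text: a rough check is to look inside the file for font entries (real text needs fonts) and image entries (scans are stored as images). The sketch below is a crude heuristic on the raw bytes, not a proper PDF parser. It assumes the relevant dictionaries are stored uncompressed, which is not true of every PDF, and the function names and labels are made up for illustration.

```python
import re

# Resource markers as they typically appear in uncompressed PDF dictionaries.
FONT_MARKER = re.compile(rb"/Font\b")
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf_bytes(data):
    """Guess from raw bytes whether a PDF carries text, images, or both."""
    has_fonts = FONT_MARKER.search(data) is not None
    has_images = IMAGE_MARKER.search(data) is not None
    if has_fonts and has_images:
        return "mixed"       # e.g. a scan that already went through OCR
    if has_fonts:
        return "text"        # text is probably selectable
    if has_images:
        return "image-only"  # probably a plain scan, needs OCR
    return "unknown"         # markers may be hidden in compressed streams


def classify_pdf_file(path):
    """Same check for a file on disk."""
    with open(path, "rb") as handle:
        return classify_pdf_bytes(handle.read())


# Hand-made fragments standing in for real files (not complete, valid PDFs).
scanned_page = b"<< /Type /XObject /Subtype /Image /Width 2480 /Height 3508 >>"
typed_page = b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"

print(classify_pdf_bytes(scanned_page))  # image-only
print(classify_pdf_bytes(typed_page))    # text
```

To try it on a real document, call `classify_pdf_file` with the path to the PDF. An "unknown" result or a copy-protected file still needs a closer look, and simply trying to select text in Preview remains the quickest test.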
