
Numbers: search for text across a collection of columns from an imperfect Extract Text from Screenshot of PDF?

I need help finding a column in a row header that comes from a PDF. I cannot get a table from a paid app, so I used the iPhone Shortcuts action Extract Text from Screenshot on the PDF.

Then I need Apple Numbers to find, within the row header, a column whose text is inexact because of imperfect OCR.

Is fuzzy match possible?


EXAMPLE

I got a PDF and read out its text. Now I want to look for values in that text using a regex, e.g. “(invoice\s+number)[:\s]+(inv-[a-zA-Z0-9]+)\s+(order\s+number)[:\s]+([a-zA-Z0-9_.-]+)”.

Yet sometimes the PDF does not contain the exact words “invoice number” or “order number”. It has something like “ivoice number”, “invoice numbr”, etc., so let's say typos. How can I still find matches in that case?


A website suggested Levenshtein distance may help.
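To make the Levenshtein idea concrete, here is a minimal sketch in Python (not Numbers, which has no built-in edit-distance function that I know of). The helper names `levenshtein` and `find_label`, the word-window approach, and the default tolerance of 2 edits are all assumptions for illustration, not something from the website mentioned above.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def find_label(text: str, label: str, max_distance: int = 2):
    """Slide a window of words over the text and return (word_index, phrase)
    for the first window within max_distance edits of label, else None.
    max_distance=2 is an arbitrary tolerance chosen for this sketch."""
    words = re.findall(r"[a-z0-9_.-]+", text.lower())
    width = len(label.split())
    for i in range(len(words) - width + 1):
        phrase = " ".join(words[i:i + width])
        if levenshtein(phrase, label.lower()) <= max_distance:
            return i, phrase
    return None


ocr_text = "ivoice number: inv-123 order number: A_1"
invoice_hit = find_label(ocr_text, "invoice number")
order_hit = find_label(ocr_text, "order number")
```

Once the label is located this way, the value that follows it can be picked up with an ordinary exact regex, since the typos are usually in the label rather than in the pattern of the value.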

iPhone 11

Posted on May 20, 2024 2:16 AM

2 replies

May 20, 2024 7:15 AM in response to SunnyInToronto

I know of no way to match "inexact" text. You can match exactly what you specify and that's all. If the best you can do is an imperfect OCR of a document, one thing you can do is build strings of all the usual ways the words might be misspelled and reference those in your regex. Instead of hardcoding the word "invoice" into the regex, reference a cell that has a string such as

(invoice|invce|ivoice|nvoice)

Include as many misspellings as you think are necessary, but you won't be able to include them all because there are so many. The "|" is the "or" operator.

Do the same with "number" and "order"

Then the expression will be something like this where B2 is the "invoice" string and B3 is the "number" string:

"("&B2&"\s+"&B3&"(....etc..."


A better solution would be to find something that can import the PDF correctly or turn it into a text file (hopefully in a tabular format). You might also try a simple copy/paste into a Numbers table. With PDFs it is often difficult to select just what you want and have it all come out in the order it appears in the PDF, but it is worth a shot.
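For anyone who wants to try the misspelling-list idea outside Numbers first, here is a rough sketch of the same alternation approach in Python. The "invoice" variants are the ones listed above; the "number" and "order" variants, and the helper names `alternation` and `extract`, are made up for illustration.

```python
import re

# Variants for "invoice" come from the list above; the other two lists
# are assumed examples and would need to be extended for real data.
invoice = ["invoice", "invce", "ivoice", "nvoice"]
number = ["number", "numbr", "nmber"]
order = ["order", "ordr", "oder"]


def alternation(variants):
    """Join the variants with "|" (the regex "or" operator) in a group."""
    return "(?:" + "|".join(re.escape(v) for v in variants) + ")"


pattern = re.compile(
    alternation(invoice) + r"\s+" + alternation(number)
    + r"[:\s]+(inv-[a-zA-Z0-9]+)\s+"
    + alternation(order) + r"\s+" + alternation(number)
    + r"[:\s]+([a-zA-Z0-9_.-]+)",
    re.IGNORECASE,
)


def extract(text):
    """Return (invoice_value, order_value) or None if nothing matches."""
    match = pattern.search(text)
    return match.groups() if match else None


found = extract("ivoice numbr: inv-A12 order number: PO_7.1")
missed = extract("invoise number: inv-1 order number: 2")  # "invoise" is not listed
```

The second call shows the limitation mentioned above: any misspelling that is not in the list is simply not matched.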

May 22, 2024 3:11 AM in response to Badunit

I am more fluent in Excel than in Numbers but prefer not to stay with Numbers. I need guidance on whether Numbers has an equivalent of the Fuzzy Lookup Add-In for Excel.

The Fuzzy Lookup Add-In for Excel was developed by Microsoft Research and performs fuzzy matching of textual data in Microsoft Excel. It can be used to identify fuzzy duplicate rows within a single table or to fuzzy join similar rows between two different tables. The matching is robust to a wide variety of errors including spelling mistakes, abbreviations, synonyms and added/missing data. For instance, it might detect that the rows “Mr. Andrew Hill”, “Hill, Andrew R.” and “Andy Hill” all refer to the same underlying entity, returning a similarity score along with each match. While the default configuration works well for a wide variety of textual data, such as product names or customer addresses, the matching may also be customized for specific domains or languages.
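To show what "a similarity score along with each match" can look like, here is a small sketch using `difflib.SequenceMatcher` from the Python standard library. This is not the algorithm the Microsoft add-in uses and it is not a Numbers feature; the function name `best_match`, the sample headers, and the 0.6 threshold are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def best_match(query, candidates, threshold=0.6):
    """Return (candidate, score) for the candidate most similar to query,
    or None if even the best score is below threshold.
    Scores are SequenceMatcher ratios between 0.0 and 1.0."""
    scored = [
        (c, SequenceMatcher(None, query.lower(), c.lower()).ratio())
        for c in candidates
    ]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None


# Hypothetical clean header names to match the OCR output against.
headers = ["Invoice Number", "Order Number", "Customer Name"]

hit = best_match("Ivoice Numbr", headers)  # OCR-damaged header
miss = best_match("zzzz", headers)         # nothing similar enough
```

A lookup like this could run over the OCR text before it is pasted into Numbers, so that the headers in the table are already the clean ones and an exact match works from there.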

