Rename PDF with content in zone

Question

Level 1

5 points


Hello, sorry for my question, I am a beginner. I have thousands of PDF files to rename using 2 numbers written in rectangles in the right corner. These files are the result of scans, so the positions of these numbers are sometimes a little different from page to page.

I tried some commercial software, but it does not work (it is fine with purely digital PDFs but not with scanned pages).

The recognition of these numbers is very good in Aperçu (Preview).

I saw some posts here explaining Automator's possibilities for OCR.

Do you know if it is possible in my case to rename the files with these numbers, "MODELE"-"ORIGINE"? In the examples it should be:

71 045-100 502.PDF

71 046-100 503.PDF

71 047-100 458.PDF

Or do you know of any commercial software that can do this on macOS?

Thank you !!

Posted on Jan 23, 2023 2:37 PM


Answer 1

Top-ranking reply

VikingOSX

Level 10

126,346 points

Jan 28, 2023 5:09 AM in response to jeangab75

I have added tweaks to the pattern (regular expression) that allow for OCR artifact interference, and it can now match all three numbers in top-down order in 100% of the sample images that you provided. This should provide a more reliable means to form your three-part filename from those numbers.

The first paragraph is no guarantee that it will work on all images because of unplanned OCR behavior. The logic to construct your missing value (xxxxx) from the results might be some unpleasant if-statements too.

Here is the renaming performed on your sample images. The four renamed images are duplicates of already renamed images.

and the replacement Run Shell Script code:

SRCFILE="${1}"
OCR="${2}"

# pattern for numbers beginning with these numbers.
# some numbers may be preceded with a '-' character
# \D any non-numeric character \d any number ? means one or none
pat="^(\D?71\D{1,2}|^\D?100\D|^\D?\d{2}\D{1,2})\d{3}"

# extract first three matching number strings based on $pat
# and remove all punctuation and spaces from those strings placing
# integers in the array
values=(  ${(O@f)"$(egrep -om3 "(${~pat})" <<<"${OCR}" | tr -d '[[:punct:] ]')"} )

# print -l $values

# concatenate three numbers with "-" separator character
if [[ ${#values} -ge 2 && ${#values} -le 3 ]]; then
   newimgname="${(j:-:)values}.${SRCFILE:e}"
fi

# print ${newimgname}
# clear the array for next iteration
unset values

# comment the print statement and uncomment the /bin/mv
# statement to rename the files in current location
# print "${SRCFILE:a} => ${SRCFILE:a:h:r}/${newimgname}"

# ignore original images whose extracted duplicate codes
# would result in overwriting an already renamed image.
/bin/mv -n "${SRCFILE:a}" "${SRCFILE:a:h:r}/${newimgname}"
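
As a rough illustration of the clean-up step in the script above, the sketch below strips punctuation and spaces from a single OCR fragment, so a string such as -71.045 becomes 71045. The function name normalize_code and the sample strings are hypothetical, and it is written in plain sh so it can be tried in the Terminal outside of Shortcuts.

```shell
#!/bin/sh
# Hypothetical helper: remove punctuation and spaces from one OCR fragment.
normalize_code() {
    printf '%s\n' "$1" | tr -d '[:punct:] '
}

normalize_code '-71.045'    # prints 71045
normalize_code '100 502'    # prints 100502
```

Only the characters are cleaned here; deciding which fragments to keep is still the job of the pattern.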


Answer 2

VikingOSX

Level 10

126,346 points

Jan 24, 2023 6:57 PM in response to jeangab75

I managed to create a shortcut that passes the image file path and extracted OCR text into the Run Shell Script and does the expected work to create a new filename for renaming. This one uses a single file chooser selection.

Apparently, Shortcuts also supports a folder selection, filtering the file types in that folder by extension, and then performing a repeat block that does the work of the current script on each file in the folder. It is not working as I expected and will require some patience…

This is working now…

The Run Shell Script code:

INFILE="${1}"
OCR="${2}"

# pattern for numbers beginning with these two values
pat="^71 .*|^100 .*"

# extract first two matching number strings based on $pat
values=( ${(@f)"$(egrep -owm2 "(${~pat})" <<<"${OCR}")"} )
print -l $values

# replace element space with underscore and concatenate two numbers with "-" character
newimgname="${(j:-:)values// /_}.${INFILE:e}"

# comment the print statement and uncomment the /bin/mv
# statement to rename the files in current location
print "${INFILE} => ${INFILE:h:r}/${newimgname}"
#/bin/mv -v "${INFILE}" "${INFILE:h:r}/${newimgname}"

And the result from using les couleurs 9.jpeg:

71 047
100 458
/Users/viking/Downloads/JPEG/les couleurs 9.jpeg => /Users/viking/Downloads/JPEG/71_047-100_458.jpeg


Answer 3

VikingOSX

Level 10

126,346 points

Feb 3, 2023 9:54 AM in response to jeangab75

What is this? I need all of the extracted text that does not contain privacy-related information.

The shortcut that I provided would only be valid for the form types present in your first set of images that I tested, because a custom pattern (regular expression) must be built to correctly parse the text and capture the requested numeric data.

The text you posted on Feb 3 @ 11:12 am is not all of the extracted image content, is it? What I see offers only one matched number sequence. The Run Shell Script test for the number of elements in the values array then finds 1 item rather than 2 or 3, so it simply exits from any further processing of the current image file. It should not blow up the Run Shell Script.


Answer 4

VikingOSX

Level 10

126,346 points

Jan 24, 2023 5:11 AM in response to jeangab75

The content of these scanned PDFs is an image. There is no Automator action to extract text from an image, but on macOS Monterey and later, there is a Shortcuts action named Extract Text from Image that uses optical character recognition (OCR) to scan every line of the PDF containing text or numbers and output it as text. Numbers will appear on their own output line.

Although it is possible with a secondary Run Shell Script to take that text output and filter it to yield only numbers, consistently isolating the Modele and N° Origine numeric strings for the rename is more involved. You would need to inspect that alphanumeric text result to determine whether there is a consistent pattern between the PDFs that one can use to form a valid rename.

A new Shortcut using these two actions will allow you to select a scanned PDF:

and view the OCR result from the Shortcut.

I can add a third action, Run Shell Script that extracts only digits and their associated space as array elements:

and this from my trivial table and text example. Your output would likely be more complicated.
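
A minimal sketch of that kind of filtering step, assuming the OCR text arrives on standard input with one item per line: it keeps only the lines made up of digits and spaces. The function name digit_lines and the sample text are made up for illustration; real OCR output will contain many more lines.

```shell
#!/bin/sh
# Keep only the lines that start with a digit and contain
# nothing but digits and spaces.
digit_lines() {
    grep -E '^[0-9][0-9 ]*$'
}

# Hypothetical OCR text, one item per line.
printf '%s\n' 'MODELE' '71 045' 'ORIGINE' '100 502' 'Page 1' | digit_lines
```

With this sample input, only the 71 045 and 100 502 lines pass through; a line such as Page 1 is dropped because it also contains letters.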


Answer 5

VikingOSX

Level 10

126,346 points

Jan 24, 2023 11:17 AM in response to jeangab75

Ok. I have tested a revised script on all of your example PDFs and JPEGs and get the desired result with each. Clearly working with the JPEG images is far quicker than using OCR on the PDF.

Try this replacement content for the Run Shell Script:

# pattern for numbers beginning with these two values
pat="^71 .*|^100 .*"

# extract first two matching number strings based on $pat
# and with spaces replaced by "_"
values=( ${(@f)"$(egrep -owm2 "(${~pat})" | sed -e 's/ /_/g')"} )
print -l $values

# concatenate two numbers with "-" character
newpdfname="${(j:-:)values}.pdf"
print "${newpdfname}"


Answer 6

VikingOSX

Level 10

126,346 points

Jan 31, 2023 10:55 AM in response to jeangab75

The Run Shell Script code would only detect 100 521 in that content, and that would fail the element test on the values array, which needs to hold between 2 and 3 items. I can branch that test and skip this file, because attempting to handle every eventuality of what OCR detects would be endless code…

Replace this:

# concatenate three numbers with "-" character
if [[ ${#values} -ge 2 && ${#values} -le 3 ]]; then
  newimgname="${(j:-:)values}.${SRCFILE:e}"
fi

with:

# concatenate two or three array numbers with the "-" character
if [[ ${#values} -ge 2 && ${#values} -le 3 ]]; then
  newimgname="${(j:-:)values}.${SRCFILE:e}"
else
  exit 0  # something is wrong with OCR so skip this item
fi

By using exit 0, I do not send a script failure to Shortcuts and it simply ignores the current file and proceeds to process the next file in the list of files in the selected folder.
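
The element-count test can also be tried on its own in the Terminal. The sketch below is a plain sh stand-in for the zsh array logic, not the Shortcut's actual code: the hypothetical function join_codes joins its arguments with "-" only when there are two or three of them, and prints nothing otherwise.

```shell
#!/bin/sh
# Join 2 or 3 values with "-"; print nothing for any other count.
join_codes() {
    if [ "$#" -ge 2 ] && [ "$#" -le 3 ]; then
        # "$*" joins the arguments with the first character of IFS
        ( IFS=-; printf '%s\n' "$*" )
    fi
}

join_codes 71045 100502    # prints 71045-100502
join_codes 100521          # prints nothing: only one value was found
```

An empty result is the case in which the Run Shell Script should skip the file rather than attempt a rename.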


Answer 7

VikingOSX

Level 10

126,346 points

Feb 3, 2023 4:21 AM in response to jeangab75

The Apple hosting team will edit anything that remotely resembles anything privacy related. That means anything you post here that may expose personal names, addresses, account numbers, URLs, or email addresses in either linked images, or posted OCR text.

There may be one avenue left for you to share the extracted OCR text from that image that is breaking the last posted version of the Shortcut. Create a simple Shortcut:

Select File

Extract Text from Image

Run it, select the problematic image, and the extracted text output will appear as plain text in a scrollable window. Copy all of that output to the clipboard. Open TextEdit and set Format menu > Make Plain Text. Then paste into TextEdit. Remove lines in that OCR text that fall into the privacy categories in the first paragraph.

When you are done with that, use the Additional Text tool (adjacent to the Image Insertion tool) on this editor's toolbar. Name it OCR text, and then copy/paste the edited content from TextEdit into the additional text area. Post here and see if it survives Apple host scrutiny. I can then test that OCR text locally to see why the current Shortcut is failing on that content.


Answer 8

VikingOSX

Level 10

126,346 points

Jan 24, 2023 8:01 AM in response to jeangab75

This is what makes this solution so difficult as I do not have your PDF OCR text content to test. I interactively added your previously shown numbers to a values array in my Zsh shell, and then manually ran the same commands in the Terminal and it worked to generate:

It is unclear if you are getting consistent numeric text results across multiple tested scanned PDFs, and if nothing matched a 71 or 100 number string in the values array, then the nums array would be empty and so would the newpdfname variable. Does this fail for several PDFs or just one?


Answer 9

VikingOSX

Level 10

126,346 points

Jan 24, 2023 11:50 AM in response to VikingOSX

And this gets rid of the external sed command. This example expects Pass Input: as arguments instead of stdin.

# pattern for numbers beginning with these two values
pat="^71 .*|^100 .*"

printf '%s\n' ${2}

# extract first two matching number strings based on $pat
# and with spaces replaced by "_"
values=( ${(@f)"$(egrep -owm2 "(${~pat})" <<<"${1}")"} )
print -l $values

# concatenate two numbers with "-" character
newpdfname="${(j:-:)values// /_}.pdf"
print "${newpdfname}"

71 045

100 502

71_045-100_502.pdf

Of course, if an image file is the source data, then the ".pdf" extension would be inappropriate.


Answer 10

VikingOSX

Level 10

126,346 points

Jan 25, 2023 8:57 AM in response to jeangab75

I now have a working Shortcut that allows you to select a folder, processes each image (e.g. jpg, jpeg) in that folder, and uses the extracted OCR codes to rename each original image as expected. Where a jpeg image would yield duplicate codes and risk overwriting a previously renamed file, it leaves the original image file untouched in the selected folder.

I used your downloaded JPEG folder contents and the Shortcut has been tested on Ventura 13.2.
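
The no-overwrite behaviour can be checked safely in a scratch folder. This is only a sketch: safe_rename is a hypothetical wrapper, the file names are invented, and the explicit existence test is added here for clarity on top of the mv -n option that the Shortcut's script relies on.

```shell
#!/bin/sh
# Rename $1 to $2 only when $2 does not already exist.
safe_rename() {
    if [ -e "$2" ]; then
        return 0            # target exists: leave the original untouched
    fi
    mv -n "$1" "$2"
}

tmp=$(mktemp -d)
printf 'first\n'  > "$tmp/a.jpeg"
printf 'second\n' > "$tmp/b.jpeg"

safe_rename "$tmp/a.jpeg" "$tmp/71045-100502.jpeg"   # renamed
safe_rename "$tmp/b.jpeg" "$tmp/71045-100502.jpeg"   # skipped, b.jpeg stays

ls "$tmp"
```

After the run, the folder holds 71045-100502.jpeg (the former a.jpeg) and the untouched b.jpeg.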


Answer 11

VikingOSX

Level 10

126,346 points

Jan 27, 2023 6:38 AM in response to jeangab75

Sorry it took so long to get back to you. I have done more testing and refinements based on observing OCR returns. The following Run Shell Script contents should replace what you have now. It will generate renamed files as nnnnn-nnnnn.jpeg. Again, if the image file has identical pair values to an already renamed image, then the original image filename is not changed.

SRCFILE="${1}"
OCR="${2}"

# pattern for numbers beginning with these numbers.
# some numbers may be preceded with a '-' character
pat="^([-]?71|[-]?100)[. 0-9]+"

# extract first two matching number strings based on $pat,
# removing occurrences of .,- and space, and sorting in
# descending order (O flag) so the 71 value precedes the 100 value
values=( ${(O@f)"$(egrep -om2 "(${~pat})" <<<"${OCR}" | sed -E 's/[., -]//g')"} )

# print -l $values

# concatenate two numbers with "-" character
if [[ ${#values[@]} -eq 1 ]]; then
   newimgname="${values}.${SRCFILE:e}"
elif [[ ${#values[@]} -eq 2 ]]; then
   newimgname="${(j:-:)values}.${SRCFILE:e}"
fi

# print ${newimgname}
# clear the array for next iteration
unset values

# comment the print statement and uncomment the /bin/mv
# statement to rename the files in current location
# print "${SRCFILE:a} => ${SRCFILE:a:h:r}/${newimgname}"

# ignore original images whose extracted duplicate codes
# would result in overwriting an already renamed image.
/bin/mv -n "${SRCFILE:a}" "${SRCFILE:a:h:r}/${newimgname}"


Answer 12

VikingOSX

Level 10

126,346 points

Jan 27, 2023 12:27 PM in response to VikingOSX

Although I can extract the first four number sequences from a given image, I cannot control the order in which they appear in the OCR output from different images, and sometimes the Date d'Emission number is between the Modele and N Origine values, and other times the No Repetitions value is between the two. Thus, there never is a consistent pattern from which to build your triple sequence filename as it would require specific array indexing of values always changing their location.

Giving up as my time is limited.
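
For values that can be recognised by their leading digits, one possible way around the ordering problem is to pick each value by prefix instead of by its position in the OCR output. The sketch below assumes that the wanted lines look like "71 nnn" and "100 nnn", as in the original question; the function name first_with_prefix and the sample text are hypothetical, and it does nothing for a third value that has no fixed prefix.

```shell
#!/bin/sh
# Print the first line that is exactly: prefix, space, three digits.
first_with_prefix() {
    grep -E "^$1 [0-9]{3}\$" | head -n 1
}

# Hypothetical OCR text with the values out of order and a date in between.
ocr='100 502
23 01 2023
71 045'

modele=$(printf '%s\n' "$ocr" | first_with_prefix 71)
origine=$(printf '%s\n' "$ocr" | first_with_prefix 100)

# Replace spaces with underscores, as in the earlier scripts.
printf '%s-%s\n' "$modele" "$origine" | tr ' ' '_'
# prints 71_045-100_502
```

Because each value is looked up separately, the order in which OCR emits the lines no longer matters for these two numbers.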


Answer 13

VikingOSX

Level 10

126,346 points

Feb 1, 2023 5:33 AM in response to jeangab75

When I run the current Run Shell Script contents on the nine image files you originally provided, it works without error, so I have no means to reproduce the error here.

The first argument passed to that Run Shell Script is the image file that produced the OCR content. This is the $1 variable assigned to SRCFILE. If the first argument is not the filtered image file, then there is no means to extract its filename extension ${SRCFILE:e} in the creation of the newimgname variable, or construct the destination path via ${SRCFILE:a:h:r}. If the values array has 0 or 1 elements, then the current iteration of the Run Shell Script exits cleanly, and we never get to the file renaming line of code:

/bin/mv -n "${SRCFILE:a}" "${SRCFILE:a:h:r}/${newimgname}"

which is at or near the place where the error dialog you last received suggested an error occurred.
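
To see what those path pieces look like for a given file, the following sketch uses plain sh stand-ins for the zsh modifiers. The helper names file_ext and file_dir are hypothetical, the path is the one from the earlier test output, and the equivalence is only approximate: ${1##*.} returns the whole name when a file has no extension, whereas the zsh :e modifier returns nothing.

```shell
#!/bin/sh
# Approximate sh equivalents of two zsh modifiers used in the script.
file_ext() { printf '%s\n' "${1##*.}"; }   # roughly ${SRCFILE:e}
file_dir() { dirname "$1"; }               # roughly ${SRCFILE:h}

src='/Users/viking/Downloads/JPEG/les couleurs 9.jpeg'
printf '%s/%s.%s\n' "$(file_dir "$src")" '71045-100458' "$(file_ext "$src")"
# prints /Users/viking/Downloads/JPEG/71045-100458.jpeg
```

If the first argument were not a file path, neither piece would be meaningful, which is why the destination cannot be built in that case.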

I took one of your sample images, captured the OCR text into an editor, and removed all of the numbers that the pattern was to be applied to. When run interactively in the Terminal, this generated an empty values array, and the if else fi test block worked as expected, with the Terminal exiting.

Just to err on the side of caution, change the following:

exit 0  # something…

to this with no trailing space:

exit 0

and try the file that caused the error again.

The legibility of the text in the image file, and how the OCR process reacts to aberrations in that content may continue to produce inconsistent results. Due to that likely variability, I don't believe that a single regular expression pattern can work for all OCR results.


Answer 14

jeangab75 Author

Level 1

5 points

Jan 24, 2023 6:05 AM in response to VikingOSX

Thank you very much! I don't know how your script works, but it works very well!

If I may ask you 2 other questions, please:

1-Is it possible to filter and extract only the first 2 numbers that begin with "71" and with "100" (these are the only two numbers that interest me)

2- Is it possible with Shortcuts to rename all the files in a folder like "number1-number2.ext"

Thank you a lot !


Answer 15

VikingOSX

Level 10

126,346 points

Jan 24, 2023 7:19 AM in response to jeangab75

1-Is it possible to filter and extract only the first 2 numbers that begin with "71" and with "100" (these are the only two numbers that interest me)

2- Is it possible with Shortcuts to rename all the files in a folder like "number1-number2.ext"

1. Yes.
2. Maybe, I haven't got this far yet.

Copy and paste the following text so that it replaces the Run Shell Script content:

# pattern for numbers beginning with these two values
pat="^71.*|^100.*"

# extract number strings into an array, one element per line
# (the f flag splits at newlines; F would join instead)
values=( ${(@f)"$(egrep -ow "([[:digit:] ]+)")"} )

for n in $values;
do
   # match only array values beginning with 71 or 100
   # and replace space with underscore in those values
   # into the num array - avoiding filenames with spaces.
   [[ "${n}" =~ "(${~pat})" ]] && nums+=( $(sed -e 's/ /_/g' <<<${match}) )
done
print -l $nums
# join the two number strings with a dash
newpdfname="${(j:-:)nums}.pdf"
print "${newpdfname}"

