Separate one sheet feed (scan) of student exam papers and rename according to student number.

Question

Level 1

4 points


The PDF is a scanned image PDF with lots of typed text, images, numbers and handwritten answers. At the top of every page, however, is the student ID (e.g. "STUDENT NUMBER: 37313344") in font size 11.

I want to be able to separate the file into individual student files (again PDF) so I can distribute these back to the respective students and potentially import them into FileMaker to mark onscreen at a later time.

I am hoping some of the ideas shared by VikingOSX may be adaptable and made to work.

I have tried using Chronoscan on a PC which is complicated and obviously not Mac.

Posted on Sep 20, 2024 9:11 PM


Answer 1

Oct 11, 2024 7:19 AM in response to VikingOSX

For a scanned PDF that is not OCR'd in the process of scanning, one will need to use a PDF Editor to OCR that document content. I suggest the paid Code Industry Master PDF Editor (tested with v5.9.85 on 2024-10-11). One can download a free trial which allows one to OCR a scanned PDF in place without watermarking it or having to pay for the application.

Steps

Make a duplicate copy of the PDF in Finder
Know in advance the predominant font used in this PDF.
Open the duplicate PDF in Master PDF Editor.
1. Document menu > OCR
2. Install languages (select the language used in the PDF, e.g. English)
3. Font Family (from font family above)
4. Editable Text is selected
5. Advanced
  1. Select Deskew
  2. Minimal confidence level: increase to 80.00
  3. Leave unchecked manual text editing if confidence level not achieved
6. Click OK
7. A progress bar will appear as the PDF is OCR'd
8. Click cmd+S or File menu > Save
Run the Automator application by double-clicking it on the Desktop
1. Select the just OCR'd and saved PDF
2. Press the ⌘-key and then select the output folder for the student number PDFs
3. Both items 1 and 2 should be on your Desktop
Observe the PDFs in the output folder bearing the student number and PDF pages associated with that student number.


Answer 2

Oct 25, 2024 12:30 PM in response to VikingOSX

I have had a guilty conscience recommending Master PDF Editor for placing an OCR layer over scanned text. It is free and a crude approach that simply does not leave a satisfying result in the PDF.

For those who have purchased the VueScan Professional scanning application, it has an output option to automatically OCR the resulting PDF that is saved. Otherwise, the following is a free OCR solution…

With additional investigation, I discovered that the free Ghostscript v10.04 has PDF OCR drivers built into it and supports multi-lingual v4.1 Tesseract trained-data files that automatically OCR PDF text in the discovered language. I just used this approach to OCR a scanned German language text and it worked beautifully.

Fortunately, one does not need to install the Tesseract application — just its trained data files and then have a shell environment variable pointing to that trained data folder. Ghostscript then does the rest.

Installation steps:

Install Ghostscript v10.04 or later from this University of Oregon site. Not the extras package. The installer will place it into the /usr/local/bin folder location.
Install the Tesseract v4.1 trained data files. These are in a zip container named tessdata-main.zip. Visit this location, click the green [ <> Code ] button and select Download ZIP.
1. The download should go into your /Users/username/Downloads folder.
2. Double-click the zip file to expand it to /Users/username/Downloads/tessdata-main folder
In the Terminal application, you will want the following export statement at the end of your shell startup (dot) file:
1. Bash /Users/username/.bash_profile
  1. export TESSDATA_PREFIX="${HOME}/Downloads/tessdata-main"
2. Zsh /Users/username/.zshrc
  1. same export statement as above
Still in the Terminal, enter the following depending on which shell you are using.
1. source ~/.bash_profile
2. source ~/.zshrc
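
A quick way to confirm that the variable took effect is a small check like the one below. This is only a sketch: check_tessdata is a hypothetical helper name, and it assumes the expanded zip keeps files such as eng.traineddata at the top level of the tessdata-main folder.

```shell
# Hypothetical helper: confirm that a tessdata folder holds the trained data
# file for a given language code (default "eng").
# Usage: check_tessdata [folder] [language-code]
check_tessdata() {
    local dir="${1:-${TESSDATA_PREFIX:-}}"
    local lang="${2:-eng}"
    if [ -z "$dir" ]; then
        echo "TESSDATA_PREFIX is not set"
        return 1
    fi
    if [ -f "$dir/$lang.traineddata" ]; then
        echo "ok: $dir/$lang.traineddata"
    else
        echo "missing: $dir/$lang.traineddata"
        return 1
    fi
}
```

Calling check_tessdata with no arguments in a new Terminal window tests the folder named by TESSDATA_PREFIX; passing a second argument such as deu checks for the German data file instead.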

Now, suppose you scanned one or more pages of text to a PDF on your Desktop. Let's say its arbitrary name is foo_wo_ocr.pdf. You now can place a searchable, OCR text layer on an output PDF using Ghostscript. In the Terminal application, you type the following where -r300 stands for 300 DPI but can be 600…

cd ~/Desktop
/usr/local/bin/gs -dNOPAUSE -dBATCH -sDEVICE=pdfocr24 -r300 -sOutputFile=foo_ocr.pdf foo_wo_ocr.pdf

You can now double-click the foo_ocr.pdf written to your Desktop in the Finder and open it in Apple's Preview where you can now select or search the OCR text layer.
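
If you OCR scans regularly, the same command can be wrapped in a small shell function. The following is a sketch, not a tested workflow: ocr_pdf is a hypothetical name, the output naming (appending _ocr before the extension) is my own convention, and the optional GS variable is only there so the Ghostscript path can be overridden.

```shell
# Hypothetical wrapper around the Ghostscript OCR command shown above.
# Usage: ocr_pdf input.pdf [dpi]
# Writes input_ocr.pdf next to the input; dpi defaults to 300.
ocr_pdf() {
    local in_pdf="$1"
    local dpi="${2:-300}"
    local out_pdf="${in_pdf%.pdf}_ocr.pdf"
    # GS may be set to another Ghostscript path; otherwise use /usr/local/bin/gs
    "${GS:-/usr/local/bin/gs}" -dNOPAUSE -dBATCH -sDEVICE=pdfocr24 \
        "-r${dpi}" "-sOutputFile=${out_pdf}" "$in_pdf"
}
```

For example, after cd ~/Desktop, running ocr_pdf foo_wo_ocr.pdf 600 would ask Ghostscript for a 600 DPI result named foo_wo_ocr_ocr.pdf.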


Answer 3

Allan Jones

Level 9

72,224 points

Sep 21, 2024 8:01 AM in response to Quambone

Welcome!

You found the forum section for asking how to use the forum software. It gets very few views.

If you tell us what Apple device you are using, we can ask the moderators to move your question to a more active and appropriate forum section.

If this is a software question, there has not been a specific section for FileMaker support in these forums for a very long time. The last time I used FileMaker was before I retired nearly 20 years ago, and I've forgotten most of the cool stuff I used to know (age gets to us all!). However, there is a general forum section called "Older Software" that may still have a few FM users hanging out:

Older Software - Apple Community

TIP: If you start a new thread there, be sure to include "FileMaker" in the title so any old FM gurus can quickly spot it.

Best wishes! FM was a really great tool.


Answer 4

Sep 22, 2024 2:17 PM in response to Quambone

I have AppleScript/Objective-C code that has three phases:

Get the unique, ordered student ID numbers in the PDF
Get all sequential pages in the PDF for each student ID
Write a PDF to a designated folder location that contains all PDF pages associated with that student ID. The PDF name will be the student ID number.
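
To illustrate the first two phases outside of AppleScript, here is a rough shell sketch of the grouping logic. It is not the script posted in this thread: it assumes the text of each page has already been extracted to one line per page (so the line number is the page number), and group_pages_by_student is a hypothetical name.

```shell
# Hypothetical helper: read one line of text per PDF page on stdin and print
# each student number, in order of first appearance, with its page numbers.
group_pages_by_student() {
    awk '
    {
        # case-insensitive match on the page header
        line = tolower($0)
        if (match(line, /student number:[ \t]*[0-9]+/)) {
            id = substr(line, RSTART, RLENGTH)
            sub(/^[^0-9]*/, "", id)          # keep only the digits
            if (!(id in pages)) {
                order[++n] = id              # remember first-seen order
                pages[id] = NR ""
            } else {
                pages[id] = pages[id] " " NR
            }
        }
    }
    END {
        for (i = 1; i <= n; i++) print order[i] ": " pages[order[i]]
    }'
}
```

Piping the page text into group_pages_by_student prints one line per student in the form "number: page page page", which is the information the third phase needs in order to write each student's PDF.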

I have tested this in Apple's Script Editor and as an Automator action. Both work. My test document is a 7-page pdf with three student IDs (37313344 - 4 pages, 37313345 - 2 pages, 3711346 - 1 page) and the resulting split PDF documents reflect the same content.

The original PDF remains unchanged.

Tested on macOS Sequoia v15.0 only.

I am using two Automator actions:

Files & Folders : Ask for Finder items
1. I am choosing to ask for Files and Folders with allow multiple selections (using ⌘ key).
  1. Select the input PDF first, and then the Destination folder for the student's PDF. Order is important.
Utilities : Run AppleScript

Remove the default code in the Run AppleScript action and replace it with the following:

use framework "Foundation"
use framework "PDFKit"
use AppleScript version "2.4"
use scripting additions

property ca : current application

on run {input, parameters}
	
	set pdfURL to ca's NSURL's fileURLWithPath:(POSIX path of (item 1 of (input as list)))
	set outFolder to POSIX path of (item 2 of (input as list))
	set pdf to ca's PDFDocument's alloc()'s initWithURL:pdfURL
	set pdfText to ca's NSString's stringWithString:(pdf's |string|())
	
	set pattern to "(?<=student\\snumber:)\\s*([[:digit:]]+)"
	-- list of all unique student numbers in original PDF
	set student_numbers to ca's NSArray's arrayWithArray:(my students(pdfText, pattern))
	
	repeat with astudent in student_numbers
		-- a list of all pages associated with that student
		set student_pages to my students_pages(pdf, astudent) as list
		-- split out students pages to a separate student number PDF in outFolder
		my students_pdf(pdf, astudent, student_pages, outFolder)
	end repeat
	return input
end run

on students(ptext, regexPat)
	-- return unique list of ordered student numbers in PDF
	
	set sID to ca's NSMutableOrderedSet's new()
	set srange to ca's NSMakeRange(0, ptext's |length|())
	set regex to ca's NSRegularExpression's regularExpressionWithPattern:regexPat options:(ca's NSRegularExpressionCaseInsensitive) |error|:(missing value)
	set matches to regex's matchesInString:ptext options:0 range:srange
	repeat with match in matches
		(sID's addObject:(ptext's substringWithRange:(match's rangeAtIndex:1)))
	end repeat
	return (sID's allObjects())
end students

on students_pages(apdf, snumber)
	-- for a student number locate all of their pages in the PDF
	set pageno to ca's NSMutableArray's array()
	set found to apdf's findString:snumber withOptions:(ca's NSLiteralSearch)
	if (count of found) > 0 then
		repeat with sel in found
			repeat with n in sel's pages()
				(pageno's addObject:(n's label() as text))
			end repeat
		end repeat
		return pageno
	else
		return ["None"]
	end if
end students_pages

on students_pdf(apdf, snumber, pageList, outFolder)
	-- write out PDF bearing student number and their pages split from original
	set outPDF to ((ca's NSString's stringWithString:outFolder)'s stringByAppendingPathComponent:snumber)'s stringByAppendingPathExtension:"pdf"
	set pdfout to ca's PDFDocument's alloc()'s init()
	-- log (outPDF) as text
	
	repeat with n in pageList
		(pdfout's insertPage:(apdf's pageAtIndex:(n - 1)) atIndex:(pdfout's pageCount()))
	end repeat
	pdfout's writeToFile:outPDF
	return
end students_pdf


Answer 5

varjak paw

Level 10

177,418 points

Sep 21, 2024 8:59 AM in response to Quambone

Answering the first issue, if you search the web for "export pdf to individual pages" you'll find a number of solutions offered. Which will work best for you will depend on your exact situation, including whether each student exam paper is a single page or consists of multiple pages, and of course budget. Adobe's Acrobat can export a PDF to single pages, but of course it's not free.

I haven't used FileMaker in a long time, so I'm not sure about its current ability to import PDFs. It won't be able to recognize the student numbers or other text, though, unless you use an OCR package to convert those PDFs into readable text. Claris, maker of FileMaker, has their own community here where you can get help with FileMaker questions:

https://community.claris.com/en/s/

Regards.


Answer 6

Quambone Author

Level 1

4 points

Sep 21, 2024 8:01 PM in response to Allan Jones

Thank you Allan,

I was not clear. This is more about renaming the split files with the student ID number.

Essentially, I want to return the students' handwritten scanned exams to them via a mail merge.

The entire scanned PDF could be 1500 pages long, comprising 100 student exams, each of which is 15 pages long and has the relevant student ID number at the top of every page. I have a separate spreadsheet with the student email addresses and their student IDs that I can use to easily send them their scanned exams, if I can just split the big combined PDF into 15-page PDFs and rename each one with its student ID.
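
Once the split PDFs exist, pairing them with the spreadsheet could look something like the sketch below. It assumes the spreadsheet has been exported as a CSV with a header row of student_id,email (those column names and the helper name match_pdfs_to_emails are hypothetical), and that each split file is named with the student ID plus .pdf.

```shell
# Hypothetical helper: for each "student_id,email" row in a CSV export of the
# spreadsheet, print the email and matching PDF path, or flag a missing PDF.
# Usage: match_pdfs_to_emails students.csv /path/to/output-folder
match_pdfs_to_emails() {
    local csv="$1" folder="$2"
    local id email
    # strip carriage returns in case the CSV has Windows line endings
    tr -d '\r' < "$csv" | while IFS=, read -r id email || [ -n "$id" ]; do
        if [ "$id" = "student_id" ]; then
            continue    # skip the assumed header row
        fi
        if [ -f "$folder/$id.pdf" ]; then
            echo "$email $folder/$id.pdf"
        else
            echo "MISSING $id"
        fi
    done
}
```

Any MISSING lines point to student IDs that were not found in the scan, for example because the OCR misread a digit, and are worth checking before sending anything.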

I saw a post from VikingOSX using Automator that seemed to do something similar for someone who had one-page invoices, but I'm not sure if they were scanned PDF images or had searchable text.


Answer 7

Quambone Author

Level 1

4 points

Sep 21, 2024 8:05 PM in response to varjak paw

Thanks Varjak, my question was not clear enough. This is not really a FileMaker problem but rather how to rename split PDF (scanned image) files using a number that is part of the image.

Perhaps I have posted to the wrong forum. Apologies, but I am new to the Apple Community and community forums in general.


Answer 8

Sep 22, 2024 2:52 PM in response to VikingOSX

I wasn't using a scanned PDF and unless you have OCR'd it to reveal the text layer, that PDF will be a PDF wrapper around the scanned image(s). I don't believe the solution that I posted will work with ordinarily scanned PDFs.
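
One way to tell whether a given PDF has a usable text layer is to dump its text and count the student number headers. The sketch below assumes Ghostscript is installed at /usr/local/bin/gs and uses its txtwrite device; count_student_headers is a hypothetical name, and the optional GS variable only allows a different Ghostscript path.

```shell
# Hypothetical helper: extract the text layer of a PDF with Ghostscript and
# count the lines containing "student number:" (case-insensitive).
# A count of 0 suggests the PDF is image-only and still needs OCR.
count_student_headers() {
    local pdf="$1"
    "${GS:-/usr/local/bin/gs}" -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite \
        -sOutputFile=- "$pdf" | grep -ci 'student number:'
}
```

If the header appears once per page, the count should roughly equal the page count of the scan.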


Answer 9

Quambone Author

Level 1

4 points

Sep 21, 2024 7:09 PM in response to varjak paw

I was hoping this could be solved with Automator.


Answer 10

Quambone Author

Level 1

4 points

Sep 27, 2024 7:24 PM in response to VikingOSX

Thank you so much. I’m keen to head home and try this out as soon as I can.
