Search footnotes over multiple docx files

Question

Level 1

4 points


I would like to search the footnotes across multiple docx files, but Spotlight does not index the footnotes. Unix tools such as grep don't work because docx files are compressed. zipgrep doesn't work because it expects the files to have a zip extension. I tried using textutil to batch convert the files to another format, but textutil leaves out the footnotes. Thanks for any help!

iMac, OS X Mountain Lion (10.8.3)

Posted on Oct 28, 2016 12:05 PM


Answer 1

rccharles

Level 6

13,037 points

Oct 28, 2016 4:36 PM in response to bill_alves

Surprise for me. Didn't know unix utilities required an extension.

You could use the link command to give the file an extension.

Perhaps a hard link would be best.

ln mydocument.docx mydocument.zip

mac $ mkdir mess
mac $ cd mess
/Users/mac/Documents/mess
mac $ unzip ../footnotetest.docx 
Archive:  ../footnotetest.docx
  inflating: [Content_Types].xml     
  inflating: _rels/.rels             
  inflating: word/_rels/document.xml.rels  
  inflating: word/document.xml       
  inflating: word/footnotes.xml      
  inflating: word/endnotes.xml       
  inflating: word/theme/theme1.xml   
 extracting: docProps/thumbnail.jpeg  
  inflating: word/settings.xml       
  inflating: word/webSettings.xml    
  inflating: word/stylesWithEffects.xml  
  inflating: word/styles.xml         
  inflating: docProps/core.xml       
  inflating: word/fontTable.xml      
  inflating: docProps/app.xml        
mac $ ls
[Content_Types].xml  _rels/               docProps/            word/
mac $ pwd
/Users/mac/Documents/mess
mac $ find . | grep "foot"
./word/footnotes.xml
mac $ find . | grep "first"
mac $ find . | grep "second"
mac $

a little dense:

TextWrangler reveals this.

R


Answer 2

rccharles

Level 6

13,037 points

Oct 28, 2016 4:48 PM in response to rccharles

I don't know which file the data is in. Notice that grep failed. It could be in a different character set, likely 16-bit Unicode.

My first footnotes since my stay at a U.

Found the footnotes with TextWrangler, so it's Unicode UTF-8. [Isn't this ASCII? Why doesn't grep work?]

http://stackoverflow.com/questions/27311428/how-to-retrieve-all-the-footnotes-from-a-docx-document
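One likely explanation for the empty results above: find . | grep "first" filters the list of path names that find prints, so it only tests file names (which is why "foot" matched ./word/footnotes.xml) and never opens the files. Below is a minimal Python 3 sketch of that difference, reading the archive directly with the standard zipfile module. The function names are illustrative, and it assumes the XML parts are UTF-8.

```python
import zipfile


def names_matching(docx_path, needle):
    """Member names containing needle: what `find . | grep needle` tests."""
    with zipfile.ZipFile(docx_path) as docx:
        return [name for name in docx.namelist() if needle in name]


def members_containing(docx_path, needle):
    """Members whose decoded text contains needle: a search of the contents."""
    hits = []
    with zipfile.ZipFile(docx_path) as docx:
        for name in docx.namelist():
            # Assumption: the XML parts are UTF-8. Undecodable bytes
            # (e.g. in a thumbnail) are replaced instead of raising.
            text = docx.read(name).decode("utf-8", errors="replace")
            if needle in text:
                hits.append(name)
    return hits
```

For a document like the test file above, names_matching(path, "first") would be expected to come back empty, while members_containing(path, "first") should list word/footnotes.xml if the footnote text contains that word.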


Answer 3

VikingOSX

Level 10

123,338 points

Oct 29, 2016 11:30 AM in response to bill_alves

The following Python script is meant to be run in the Terminal with Word (.docx only) documents provided on the command-line. For each document, it prints each footnote index and the associated footnote text. If the footnote text exceeds the line length set in the script (70 characters), it is wrapped, with continuation lines aligned under the start of the text.

This is a read-only script, and does not change the Word document. It focuses all of its energy on finding footnotes, and nothing else. There is no warranty expressed or implied.

Copy and paste the following Python code into a programmer's editor, or TextEdit in plain text mode. It is syntax clean as posted. Keep it that way by not pasting into a word processor. 😉 Save the file out as fnote.py, or whatever name you please.

Code:

#!/usr/bin/python
# coding: utf-8
"""
fnote.py

Usage: fnote.py doc1.docx doc2.docx ~/Desktop/Files/*.docx ... docn.docx

For each command-line, provided Word (.docx only) document, print out
the footnotes found in the document. Long footnote text will wrap aligned.

Sample output:
[doc1.docx]
    1: This is the first and only footnote text
[doc2.docx]
    No footnotes found in document

Tested: OS X 10.11.6/Python 2.7.10, 2.7.12, OS X 10.12/Python 2.7.10
VikingOSX, Oct. 29, 2016, Apple Support Communities
"""

import zipfile
import re
import os
import sys
from itertools import izip
import textwrap

FNXML = 'word/footnotes.xml'
findex = []
fnotes = []
space4 = ' ' * 4
line_length = 70


def get_footnotes(ifile):
    global findex, fnotes
    work = []

    try:
        with zipfile.ZipFile(ifile, 'r') as docx:
            xmldata = docx.read(FNXML)
    except KeyError:
        print('[{}]'.format(os.path.basename(ifile)))
        print('{}No footnotes found in document'.format(space4))
        return False

    work = re.findall(r'(?<=<w:footnote w:id=\")(\d+)(?=\">)|(?=\" w:type=\"cont)',
                      xmldata, re.M)
    # Microsoft uses a '0' footnote index with no text associated with it
    # This results in a blank list entry that we can remove here,
    # but the footnote indices are artificially numbered +1 higher than reality
    findex = filter(None, work)
    fnotes = re.findall(r'(?<=<w:t>)(.*?)(?=</w:t>)', xmldata, (re.M | re.U))
    return True


def main():
    if len(sys.argv) == 1:
        sys.exit('Usage: {} file1.docx, file2.docx ... filen.docx'.format(sys.argv[0]))

    for adocx in sys.argv[1:]:
        fname = os.path.basename(adocx)
        if not adocx.endswith('docx'):
            continue
        result = get_footnotes(os.path.expanduser(adocx))
        if not result:
            continue

        if findex and fnotes:
            # mash the indices and footnote lists into an unordered dictionary
            adict = dict(izip(findex, fnotes))
            print("[{}]".format(fname))
            for keys, values in sorted(adict.items()):
                if len(values) > line_length:
                    # shift footnote numbers downward to match content
                    prefix = space4 + str(int(keys) - 1) + ': '
                    # wrap and align really long footnote text
                    wrapper = textwrap.TextWrapper(initial_indent=prefix, width=line_length,
                                                   subsequent_indent=' ' * len(prefix))
                    print("{}".format(wrapper.fill(values)))
                else:
                    # make the footnote indices match their footnote text
                    print("{}{}:".format(space4, int(keys) - 1)),
                    print("{}".format(values))


if __name__ == '__main__':
    sys.exit(main())


Answer 4

bill_alves Author

Level 1

4 points

Oct 30, 2016 2:45 PM in response to bill_alves

Wow, thanks for the extensive help, VikingOSX. Just as you posted this reply, I found a simpler possibility that has worked for my situation. I should clarify that I was looking for any occurrence of the string and didn't mean to limit the search to the footnotes, though it's nice to know this solution if one wanted to. Also, I was fine just seeing the names of the matching files, which I could then open in Word, but your solution is clearly more powerful if one needs more than that. Anyway, with some help elsewhere, and the knowledge that docx files are essentially zipped XML, I came up with the following script. The first argument is the search string, and it searches only the current directory:

for file in ./*.docx; do
    if ( unzip -c "$file" | grep -qi "$1"); then
        echo "$file"
    fi
done

I hope this is also helpful to anyone else with this question, and thanks again!
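For anyone who would rather not shell out to unzip, the same idea can be sketched in Python 3 with only the standard library. This is an illustrative equivalent, not the script above: the function name docx_files_containing is made up for the example, and it assumes the archive members decode as UTF-8.

```python
import glob
import os
import zipfile


def docx_files_containing(needle, directory="."):
    """Return the .docx paths in directory that contain needle in any member."""
    needle = needle.lower()
    matches = []
    for path in sorted(glob.glob(os.path.join(directory, "*.docx"))):
        try:
            with zipfile.ZipFile(path) as docx:
                # Case-insensitive search of every member, like grep -qi.
                found = any(
                    needle in docx.read(name).decode("utf-8", errors="replace").lower()
                    for name in docx.namelist()
                )
        except zipfile.BadZipFile:
            # Not a real archive (e.g. a Word lock file); skip it.
            continue
        if found:
            matches.append(path)
    return matches
```

Like the shell loop, this searches the raw XML, so it also matches tag and attribute names, and it can miss a phrase that Word has split across several text runs.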


Answer 5

bill_alves Author

Level 1

4 points

Oct 30, 2016 2:47 PM in response to rccharles

Thanks, this clarifies some mysteries of the format!


Answer 6

VikingOSX

Level 10

123,338 points

Oct 30, 2016 4:28 PM in response to bill_alves

It is good that you have a solution that works for you.

The Python application restricts itself to the single internal Word footnotes.xml file (if it exists), and prints the ordered footnotes. It does not scan the entire Word document for a search string, as your shell script does. But one line of additional code could scan the footnotes for a specific search string.
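As a rough illustration of searching only the footnotes, here is a Python 3 sketch. It is not the one-line change to fnote.py mentioned above: it parses footnotes.xml with xml.etree.ElementTree instead of regular expressions, the function names are hypothetical, and it assumes that separator entries are marked with a w:type attribute other than "normal".

```python
import xml.etree.ElementTree as ET
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
FNXML = "word/footnotes.xml"


def footnotes(docx_path):
    """Return (id, text) pairs for the ordinary footnotes in a .docx file."""
    with zipfile.ZipFile(docx_path) as docx:
        try:
            root = ET.fromstring(docx.read(FNXML))
        except KeyError:
            # The archive has no footnotes part.
            return []
    notes = []
    for note in root.iter("{%s}footnote" % W_NS):
        # Assumption: separators carry w:type="separator" and similar.
        if note.get("{%s}type" % W_NS, "normal") != "normal":
            continue
        # Join all text runs, since one footnote may span several <w:t>.
        text = "".join(t.text or "" for t in note.iter("{%s}t" % W_NS))
        notes.append((int(note.get("{%s}id" % W_NS)), text))
    return notes


def matching_footnotes(docx_path, needle):
    """Keep only the footnotes whose text contains needle, ignoring case."""
    needle = needle.lower()
    return [(n, text) for n, text in footnotes(docx_path) if needle in text.lower()]
```

The ids returned are the w:id values stored in the file, which may not match the numbers Word displays.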


Answer 7

rccharles

Level 6

13,037 points

Oct 30, 2016 4:30 PM in response to VikingOSX

I don't understand why in my case grep "first" didn't work. Am I missing something obvious?

R


Answer 8

VikingOSX

Level 10

123,338 points

Oct 30, 2016 6:58 PM in response to bill_alves

For rccharles because the site won't let me see his post when I am signed in.😠

I typed your text into a Word document with both footnotes, and saved it to my Desktop. I then extracted the docx content into a folder (e.g. unzip -d RC rc.docx), and changed directory into the RC folder.

The following will match (and list) files that contain the "first" or "second" strings. The -E option is needed so that grep treats | as alternation instead of a literal character:

find . -type f -exec grep -lE "first|second" {} \;

grep -RlwE "first|second" *

If you have pandoc installed (brew install pandoc), you don't need to unpack the Word document to perform a search, and the following will find every line containing the words "first" or "second" in the body text and footnotes. Pandoc is simply converting the docx content into plain text, which is passed on to grep.

pandoc -f docx -t plain -o - rc.docx | grep -iE "first|second"
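One detail worth noting about the pattern: plain grep uses basic regular expressions, where | is an ordinary character, so "first|second" only works as an either/or match with grep -E (or egrep). Python's re module treats | as alternation by default. The following Python 3 sketch does roughly what a recursive, list-only grep does over the extracted folder; the function name files_matching is illustrative, and files are read as UTF-8 with undecodable bytes replaced.

```python
import os
import re


def files_matching(root, pattern):
    """List files under root whose text matches the regex, like grep -RlE."""
    regex = re.compile(pattern)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # errors="replace" keeps binary members (e.g. a thumbnail)
            # from raising a decoding error.
            with open(path, encoding="utf-8", errors="replace") as handle:
                if regex.search(handle.read()):
                    hits.append(os.path.relpath(path, root))
    return sorted(hits)
```

Called as files_matching("RC", "first|second"), it returns the relative paths of the extracted parts that contain either word.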
