Split an RTF/Word Document At Arbitrary Expression

Okay, I have a question for the discussion-forum hive-mind. I have a Word document (a .doc, also saved as an .rtf file, if one is necessary), which is structured according to an outline form. What I want to do is split this file into a set of smaller files, with each of the smaller files containing text that adheres to the pattern </r></r><a_single_line_of_text></r></r><then_this_block_of_text></r></r>. Is this possible to do with Automator, or perhaps Automator or AppleScript in conjunction with Perl? (Bear in mind I only know of Perl; I don't actually know much about using it.) I really do need a solution to this problem. The outline is over 200 pages, with many small sections, and to do this by hand would take FOREVER. Any help is vastly appreciated!

iMac, Mac OS X (10.7.2), Running iLife, iWork

Posted on Dec 21, 2011 2:25 PM


Dec 21, 2011 2:33 PM in response to William Hainline

Is the formatting relevant, or do you just care about the text?


The reason I ask is that both .doc and .rtf documents contain additional formatting data, so your text isn't going to be in the actual format you describe - there will be a slew of additional data mixed in between the actual text lines.


If it's just plain text, though, then your chances are significantly enhanced.


What's unclear, though, is how you intend to differentiate between the </r></r> before the first line, the one between the two lines, and the one after the last line. How do you know that the </r></r> between the lines is actually between the lines and not the marker for the beginning of the next section, or the marker for the end of the previous one?
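
To make the ambiguity concrete, here is a minimal Python sketch. It assumes the text has been exported as plain text and that each </r></r> marker corresponds to a blank line ("\n\n"); the sample string and both patterns are made up for illustration. A pattern that consumes the closing blank line skips any section that shares that blank line with its predecessor, while a lookahead leaves the blank line available to open the next match.

```python
import re

# Hypothetical sample: two sections that share one blank line between them.
sample = "\n\nA\nbody a\n\nB\nbody b\n\n"

# Consumes the closing blank line, so section B has no opening marker left.
consuming = re.findall(r"\n\n.+\n.+\n\n", sample)

# Only peeks at the closing blank line, so it can also open section B.
peeking = re.findall(r"\n\n(.+\n.+)(?=\n\n)", sample)

print(len(consuming))  # 1
print(peeking)         # ['A\nbody a', 'B\nbody b']
```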


Additionally, how do you intend/expect to save the chunked data? Given that you're going to end up with some 200 files, you need to explain how and where you want to save them.

Dec 21, 2011 5:14 PM in response to Camelot

Okay, here's what I want to happen:


1. Search through the .doc or .rtf file, "filename," looking for the text that matches the regular expression "\n\n.+\n.+\n\n" (or instead of \n, use whatever special codes RTFs and DOCs use for indicating carriage returns).


2. Take whatever was found, create a new file, and stick that chunk of text into it, and then save it under the name "original_filename_xxx.doc/rtf", where xxx is the number of passes.


3. Save the file.


4. Look for another occurrence, repeat until there are no more.
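
For the plain-text case only, the four steps above can be sketched in Python roughly as follows. This is a sketch under assumptions, not a solution for .doc or .rtf: it expects a UTF-8 text export, writes .txt files, and does not retain any formatting. The function names, the three-digit numbering, and the output directory are all hypothetical choices. The pattern is the one from step 1, except that the closing blank line is a lookahead so that back-to-back sections are not skipped.

```python
import re
from pathlib import Path

# Assumed pattern: blank line, one heading line, one body line, blank line.
# The closing blank line is a lookahead so it can also open the next match.
SECTION = re.compile(r"\n\n(.+\n.+)(?=\n\n)")


def split_sections(text):
    """Return every heading-plus-body chunk found in plain text."""
    return SECTION.findall(text)


def write_sections(source, out_dir):
    """Write each chunk to <stem>_001.txt, <stem>_002.txt, ... in out_dir."""
    source, out_dir = Path(source), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = split_sections(source.read_text(encoding="utf-8"))
    paths = []
    for number, chunk in enumerate(chunks, start=1):
        target = out_dir / f"{source.stem}_{number:03d}.txt"
        target.write_text(chunk + "\n", encoding="utf-8")
        paths.append(target)
    return paths
```

Calling write_sections("outline.txt", "chunks") on a hypothetical export named outline.txt would produce chunks/outline_001.txt, chunks/outline_002.txt, and so on, one file per match.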


It's pretty simple. Of course, if I were just using plain text files, that would be SO much easier. Unfortunately, I want to retain formatting information (i.e., the special codes that RTF and DOC use to indicate formatting). I could use .docx instead, since that's supposedly based on XML. But unfortunately, I'm kind of stuck with what I've got.

Dec 21, 2011 9:17 PM in response to William Hainline

Wanting to keep the formatting is going to radically limit your options because of the way the document formatting (e.g. font choices, size, style, etc.) is stored - it's stored as part of the document header itself, as well as kind of mixed in with the text.


In order to create multiple sub-documents that retain the original styling you're going to need to replicate both the document header as well as the styled paragraphs within the document itself.

What I'm getting at is that this pretty much precludes any kind of direct-access to the document (unless you want to write an RTF parser in AppleScript or Perl - a task that I just shudder at).
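
To give a sense of what replicating the header involves, here is a toy Python sketch. It is nowhere near a real RTF parser. It assumes the header ends right after the last top-level \fonttbl, \colortbl, or \stylesheet group, ignores everything else a real file can contain (document-level settings, embedded binary data, other destinations), and the function names and sample string are hypothetical.

```python
# Top-level groups this sketch treats as "the header" (an assumption).
HEADER_GROUPS = ("\\fonttbl", "\\colortbl", "\\stylesheet")


def rtf_header(rtf):
    """Return the text up to the end of the last top-level header group."""
    depth = 0
    group_start = None
    header_end = None
    i = 0
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\":
            i += 2  # skip the backslash and the next char (covers \{ \} \\)
            continue
        if ch == "{":
            depth += 1
            if depth == 2:
                group_start = i
        elif ch == "}":
            if depth == 2 and rtf.startswith(HEADER_GROUPS, group_start + 1):
                header_end = i + 1
            depth -= 1
        i += 1
    if header_end is None:
        raise ValueError("no font, color, or style table found")
    return rtf[:header_end]


def wrap_fragment(header, body):
    """Build a standalone document: copied header, body, closing brace."""
    return header + body + "}"


sample = (r"{\rtf1\ansi{\fonttbl{\f0 Helvetica;}}"
          r"{\colortbl;\red0\green0\blue0;}\f0\fs24 Heading\par Body\par}")
header = rtf_header(sample)
chunk = wrap_fragment(header, r"\f0\fs24 Heading\par")
```

Even in this tiny example, the body fragment only makes sense because the font table it refers to (\f0) was copied along with it, which is the reason every sub-document needs its own copy of the header.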


Instead, you're going to need to use some other application to do the heavy lifting, but that's OK - it takes the burden of handling the style data off you, as long as you can find an RTF-aware word processor with robust-enough AppleScript support. Unfortunately, off-hand, none come to mind.

Dec 21, 2011 10:00 PM in response to Camelot

I'll point out that Microsoft Word has a reasonably effective and scriptable find command, so you could use AppleScript to do a search like this using Word itself. However, scripting Word is a PITA, and trying to script a regexp find in Word is an absolute Fu-PITA. Whoever designed Word's scriptability had no idea what they were doing. But it is possible if you want to put the effort into it.
