Deleting paragraphs in TextEdit

Question

Level 1

4 points

Hi there,

I'm having some issues deleting certain paragraphs in a document. I want to delete every paragraph which doesn't start with '{'. I have done this so far:

tell application "TextEdit"
    activate
    open "/Users/John/Desktop/map.rtf"
    set this_text to the text of document 1
end tell

set the paragraph_list to every paragraph of this_text

repeat with i from 1 to the count of paragraphs of this_text
    set this_paragraph to item i of the paragraph_list
    try
        if first character of this_paragraph is not "{" then
            delete item i of paragraph_list
        end if
    on error --If there isn't any character
        delete item i of paragraph_list
    end try
end repeat

I don't know what I am doing wrong. Thanks in advance.

Posted on Jan 6, 2013 2:22 AM

Answer 1

Hiroto

Level 5

7,461 points

Jan 8, 2013 12:28 PM in response to xaruqui

Hello

Neither of the scripts I posted cares whether there are spaces between tokens, so there's something else we're missing.

• The first thing to note is that both scripts I posted should filter out a paragraph such as -

"Wood":"45","Water":"0","Stone":"1220","Fire":"9","Hammer":"355"

because it does not start with character {.

If you meant to say it doesn't filter out -

{"Wood":"45","Water":"0","Stone":"1220","Fire":"9","Hammer":"355"}

then the first script won't but the second will.

To make it clear, given the text -

"Wood":"45","Water":"0","Stone":"1220","Fire":"9","Hammer":"355"
{"Wood":"45","Water":"0","Stone":"1220","Fire":"9","Hammer":"355"}
"Wood":"45","Water":"1","Stone":"1220","Fire":"9","Hammer":"355"
{"Wood":"45","Water":"1","Stone":"1220","Fire":"9","Hammer":"355"}

result of the first script is -

{"Wood":"45","Water":"0","Stone":"1220","Fire":"9","Hammer":"355"}
{"Wood":"45","Water":"1","Stone":"1220","Fire":"9","Hammer":"355"}

result of the second script is -

{"Wood":"45","Water":"1","Stone":"1220","Fire":"9","Hammer":"355"}

Is the latter what you expect?

• Another thing to note is text encoding.

Is it really in UTF-8? If it is in UTF-16LE, the second script can still filter out paragraphs not starting with { but cannot filter out those with "Water":"0" because the pattern won't match in its current state.

To make sure that the input file is in UTF-8, you can resave it in TextEdit.app with its preferences set to save as UTF-8.

• Lastly, line ending.

The scripts assume that LF (U+000A: LINE FEED) terminates each paragraph, not CR (U+000D: CARRIAGE RETURN).

If the scripts can filter out every paragraph not starting with {, then the current line ending should be fine.
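As a rough illustration of the encoding and line-ending points above, here is a small sketch in Python (not one of the scripts from this thread). The sample text and the helper names `contains_pattern` and `split_on_lf` are made up for the example. It shows that an ASCII byte pattern present in UTF-8 text is absent from the UTF-16LE bytes of the same text, and that splitting on LF finds only one "paragraph" when the text uses CR.

```python
# Made-up sample, not the asker's real file.
SAMPLE = (
    '"Wood":"45","Water":"0"\n'
    '{"Wood":"45","Water":"0"}\n'
    '{"Wood":"45","Water":"1"}\n'
)
PATTERN = b'"Water":"0"'


def contains_pattern(text, encoding):
    """Return True if the raw ASCII byte pattern occurs in the encoded text."""
    return PATTERN in text.encode(encoding)


def split_on_lf(text):
    """Split on LF only, dropping the empty piece after a trailing LF."""
    return [p for p in text.split("\n") if p]


if __name__ == "__main__":
    # UTF-8 keeps ASCII characters as single bytes, so the pattern is found.
    print(contains_pattern(SAMPLE, "utf-8"))
    # UTF-16LE puts a 0x00 byte after every ASCII character, so it is not.
    print(contains_pattern(SAMPLE, "utf-16-le"))
    # With LF terminators there are three paragraphs.
    print(len(split_on_lf(SAMPLE)))
    # With CR terminators the whole text looks like one paragraph.
    print(len(split_on_lf(SAMPLE.replace("\n", "\r"))))
```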

That's all for now.

Good luck,

H

Answer 2

Pierre L.

Level 5

4,635 points

Jan 8, 2013 1:34 PM in response to Pierre L.

Though not as fast as Hiroto's script, the following one should probably be the cleanest and fastest possible using TextEdit:

set theNewText to ""

tell application "TextEdit"
    set theText to text of front document
    repeat with k from 1 to (count paragraphs of theText)
        set thisParagraph to paragraph k of theText
        if (thisParagraph begins with "{") ¬
            and (thisParagraph does not contain "\"Water\":\"0\"") then
            set theNewText to theNewText & thisParagraph & return
        end if
    end repeat
    make new document at front with properties {text:theNewText}
    activate
end tell

Message was edited by: Pierre L.

Answer 3

xaruqui Author

Level 1

4 points

Jan 9, 2013 12:52 PM in response to xaruqui

Well, I figured out why I couldn't filter the water thing. I inserted a 'display dialog "Current paragraph: " & thisparagraph' and realized the whole document was in there. I mean, it seems that what I see as a paragraph isn't actually one. I saved the document in TextEdit as Hiroto said, but nothing changed.

I have to add a paragraph-trimming routine to filter it properly.

There is one thing I can't understand: why can't I filter by water when I can delete the paragraphs which don't begin with "{"?

I didn't think it was going to be so difficult!!

Answer 4

Pierre L.

Level 5

4,635 points

Jan 9, 2013 2:17 PM in response to Pierre L.

I was completely wrong (once more) in my last post. The following version of the script, although still using TextEdit, seems to be extremely fast, even with huge files.

set list2 to {}
set list2Ref to a reference to list2

tell application "TextEdit"
    set list1 to paragraphs of text of front document
    set list1Ref to a reference to list1
    repeat with thisParagraph in list1Ref
        if (thisParagraph begins with "{") ¬
            and (thisParagraph does not contain "\"Water\":\"0\"") then
            copy (thisParagraph as text) to the end of list2Ref
        end if
    end repeat
    set theNewText to list2 as text
    make new document at front with properties {text:theNewText}
    activate
end tell

(Search for “bigList” in the AppleScript Language Guide.)

Answer 5

Hiroto

Level 5

7,461 points

Jan 11, 2013 5:44 PM in response to xaruqui

Hmm, that's strange.

Would you run the following AppleScript and post the result in your reply?

The script converts the given RTF file to plain text, takes a hexdump of the first 10 lines and puts the result on the clipboard.

Please change the file path as needed.

set f to "/Users/John/Desktop/map.rtf"
set sh to "textutil -convert txt -stdout " & f's quoted form & " | head | hexdump -C"
set r to do shell script sh
set the clipboard to r

This way we can examine the text encoding and line ending.

I'm guessing text encoding is the problem.

Regards,

H

Answer 6

xaruqui Author

Level 1

4 points

Jan 12, 2013 1:24 AM in response to xaruqui

The document has, let's say, 10 paragraphs. One of them contains everything I need and the other 9 paragraphs are the ones to delete (not starting with "{"). I managed to trim the main paragraph by inserting a 'return' at the end of each record, so I can choose whether to keep it or drop it.

Here's the HexDump (without reaching the main paragraph):

00000000  0a 48 54 54 50 2f 31 2e  31 20 32 30 30 20 4f 4b  |.HTTP/1.1 200 OK|
00000010  0a 0a 44 61 74 65 3a 20  53 75 6e 2c 20 30 36 20  |..Date: Sun, 06 |
00000020  4a 61 6e 20 32 30 31 33  20 31 32 3a 33 36 3a 34  |Jan 2013 12:36:4|
00000030  32 20 47 4d 54 0a 0a 53  65 72 76 65 72 3a 20 41  |2 GMT..Server: A|
00000040  70 61 63 68 65 2f 32 2e  32 2e 33 20 28 43 65 6e  |pache/2.2.3 (Cen|
00000050  74 4f 53 29 0a 0a 58 2d  50 6f 77 65 72 65 64 2d  |tOS)..X-Powered-|
00000060  42 79 3a 20 50 48 50 2f  35 2e 33 2e 38 0a 0a 43  |By: PHP/5.3.8..C|
00000070  6f 6e 6e 65 63 74 69 6f  6e 3a 20 63 6c 6f 73 65  |onnection: close|
00000080  0a                                                |.|
00000081

Answer 7

Hiroto

Level 5

7,461 points

Jan 12, 2013 8:48 PM in response to xaruqui

Ah. I'm beginning to get the picture. Regrettably, the vital part is missing from the hexdump...

If this is a one-time process and you have already done it manually, then there's no need for you to bother with it further.

But if this is a repetitive chore, you could try something like the following script and post the result once again.

Currently it will hexdump the first 500 bytes (head -c500). Please adjust the number after "head -c" so that the result contains at least two "records" in the "main paragraph" which starts with {.

set f to "/Users/John/Desktop/map.rtf"
set sh to "textutil -convert txt -stdout " & f's quoted form & " | head -c500 | hexdump -C"
set r to do shell script sh
set the clipboard to r

My guess is that the "records" are separated by one of the INFORMATION SEPARATOR characters in U+001C..U+001F or something similar.

Once the separator is identified, it should be easy to adapt the previous Perl script to your needs.

That's all for now.

Good luck,

H

Answer 8

xaruqui Author

Level 1

4 points

Jan 13, 2013 2:53 AM in response to xaruqui

Oops! I clicked "This solved my question" by mistake. Well, I'd rather not paste the hexdump in here. Is there any other way I can send it to you?

The script I wrote to trim the main paragraph works, but it is too slow. Would it be possible to trim it with Perl as well, using the clipboard text as the source and a given offset "key"? I was amazed at the speed of your script.

Thank you very much.

Answer 9

Hiroto

Level 5

7,461 points

Jan 13, 2013 6:10 AM in response to xaruqui

All right. You don't need to post the hexdump at all, and I'd rather not expose my email address on a public board. 😉

So please just check the hexdump by yourself to identify the record separator.

There must be some (invisible) character between two consecutive records in the said main paragraph. All we need to know is what that character is.

E.g., in the posted hexdump of the first 10 lines, you can see 0a as the line separator. 0x0a is the linefeed character.

What do you find between two records in the main paragraph?

Please note that the record separator is not necessarily a single byte in the hexdump. If it is U+2028 LINE SEPARATOR, it will be dumped as 3 bytes <e2 80 a8> in UTF-8; U+2029 PARAGRAPH SEPARATOR will be <e2 80 a9>, and so on.
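To help with reading the hexdump, here is a small sketch in Python (not part of the scripts in this thread) that lists the UTF-8 byte sequences of some characters that could act as a separator. The candidate list and the helper name `utf8_hex` are only assumptions for illustration; the real separator in the file may be something else entirely.

```python
import unicodedata

# Hypothetical candidates: LF, CR, the four information separators,
# and the two Unicode line/paragraph separators.
CANDIDATES = ["\u000a", "\u000d", "\u001c", "\u001d", "\u001e", "\u001f", "\u2028", "\u2029"]


def utf8_hex(ch):
    """Return the UTF-8 encoding of ch as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


if __name__ == "__main__":
    for ch in CANDIDATES:
        # Control characters have no formal name in unicodedata, hence the default.
        name = unicodedata.name(ch, "<control>")
        print(f"U+{ord(ch):04X} {name}: {utf8_hex(ch)}")
```

Any of these byte sequences showing up between two records in the hexdump would be a good candidate for the separator.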

Regards,

H

Answer 10

xaruqui Author

Level 1

4 points

Jan 20, 2013 4:13 AM in response to xaruqui

Sorry for my very delayed answer, I've been up to my eyes in work.

Well, I haven't found any line separator; sorry for that. I mean, each record does end with ":" (3a), but there are ":" characters all along the record as well. The funny thing, though, is that when I make the TextEdit window wider, it correctly separates the paragraphs the way I want them.

Each paragraph begins with a "key", e.g. "\"Water\":\"0\"". Could it be used as an offset key?

Thanks.

Answer 11

adayzdone

Level 2

150 points

Jan 20, 2013 10:40 AM in response to xaruqui

Why not?

set myFile to "/Users/John/Desktop/hex.txt"
-- Keep lines that start with { and drop those containing "Water":"0"
set myData to do shell script "grep '^{' " & quoted form of myFile & " | grep -v '\"Water\":\"0\"'"

Answer 12

Hiroto

Level 5

7,461 points

Jan 21, 2013 11:37 AM in response to xaruqui

Hmm. That's very strange.

How can TextEdit know where to break lines without any substantial indicator?

By the way, I've just learned there's a thing named \line in RTF, which is a "soft line break".

It seems to explain most of the behaviours of your document described so far. E.g., TextEdit breaks the line wherever it finds \line, while TextEdit's AppleScript interface does not treat \line as a paragraph separator, and thus "lines" separated by \line are treated as one paragraph.

On the other hand, it cannot explain why you don't see anything between records in hexdump because I confirmed that "textutil -convert txt" converts \line in RTF to U+2028 LINE SEPARATOR in plain text, which appears as <e2 80 a8> in UTF-8. Also when I convert rich text to plain text via TextEdit's menu, TextEdit converts \line to U+2028 as well.

Here's the source of sample.rtf I tested.

-----------------------

{\rtf1\ansi\ansicpg1252\cocoartf949\cocoasubrtf540
{\fonttbl\f0\fnil\fcharset0 LucidaGrande;}
{\colortbl;\red255\green255\blue255;}
\paperw11900\paperh16840\margl1440\margr1440\vieww19600\viewh14000\viewkind0
\pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\ql\qnatural\pardirnatural

\f0\fs24 \cf0 a\
b\
c\
d\
\{\line 1\line 2\line 3\line 4\line 5\line \}\
e\
f\
g\
h\
}

-----------------------

Now the rest is guesswork on my part, based on the assumption that your RTF document uses \line in the "main" paragraph, which, however, contradicts the fact that you don't see <e2 80 a8> in the hexdump.

Anyway, you may try the AppleScript code below, which I had in mind for the case where you find U+2028: <e2 80 a8> in the hexdump.

Please change the file paths as you see fit.

-----------------------

-- applescript
set infile to "/Users/John/Desktop/map.rtf"
set outfile to "/Users/John/Desktop/map_processed.txt"

set sh to "textutil -convert txt -stdout " & infile's quoted form & " \\
| perl -CSD -e 'local $\\ = qq(\\n);
while (<>) {
    next unless /^\\{/o;
    pos($_) = 0;
    while ( / \\G (.*?) (?:\\x{2028}|$) /ogx ) {
        # $1 is empty for the zero-length match at end of line; skip it
        print $1 if length($1) and $1 !~ /\"Water\":\"0\"/o;
    }
}' > " & outfile's quoted form

do shell script sh
-- end of applescript

--------------------------

Of course it would be possible to use other criteria to break lines, as long as the criteria are well-defined and consistent throughout the document.

But since TextEdit is able to break lines without using such ad hoc knowledge about the document, we should be able to as well.
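For readers who find the Perl hard to follow, the same idea can be sketched in Python. This is only an illustration under the same unconfirmed assumption that records inside the "main" paragraph are separated by U+2028; the function name `filter_records` and the sample text are made up.

```python
LINE_SEPARATOR = "\u2028"  # assumed record separator, not confirmed for the real file


def filter_records(text, unwanted='"Water":"0"'):
    """Keep records from paragraphs starting with "{", minus those containing unwanted."""
    kept = []
    for paragraph in text.split("\n"):
        # Drop every paragraph that does not start with "{".
        if not paragraph.startswith("{"):
            continue
        # Break the remaining paragraph into records at each U+2028.
        for record in paragraph.split(LINE_SEPARATOR):
            if record and unwanted not in record:
                kept.append(record)
    return kept


if __name__ == "__main__":
    # Made-up sample: one header line, then one paragraph holding three records.
    sample = (
        "HTTP/1.1 200 OK\n"
        '{"Water":"0","Wood":"45"\u2028"Water":"1","Wood":"45"\u2028}\n'
    )
    print("\n".join(filter_records(sample)))
```

If the separator turns out to be a different character, only the `LINE_SEPARATOR` constant would need to change.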

Good luck,

H
