NEED A FASTER TEXT PARSING SOLUTION

Question

Level 1

14 points

can be a shell script...

This example script takes entirely too long to process larger text files.
What I need it to do is:
read a text file,
find a string,
then return the first word of its paragraph line.

But my files are 2000 lines long! In this example I'm just working with a short snippet of the text file.

Any ideas?

[code]
set findme to "8701"
set findme to SaR(findme, " ", "")
set the_file to (((path to desktop) as string) & "etch_rules.txt")
set rulez to my read_from_file(the_file) --- this file is normally 2500 lines

set rulez to "715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 727, 728, 729
736, 744, 745, 749, 742, 99, 864, 865, 661, 664, 660, 663, 662, 665, 659, 782
4484, 4479, 4476
770, 771, 772, 774, 776, 778, 779, 780, 781, 783, 785, 786, 787, 773, 282, 7702
2070, 2875, 2079, 2082, 2084, 2075
1884, 1870, 1875, 1879, 1882, 1884
8700, 8701
791, 795, 792, 796" ----- substitute for larger text file

set rulez to SaR(rulez, " ", "")
set cnt to count paragraphs of rulez

repeat with x from 1 to cnt
    set this_par to paragraph x of rulez
    set cnt2 to count text items of this_par
    set isit to the offset of findme in this_par
    if isit ≠ 0 then
        repeat with y from 1 to cnt2
            set this_word to text item y of this_par
            if this_word = findme then set findme to text item 1 of this_par
        end repeat
    end if
end repeat

display dialog findme

on read_from_file(the_file)
    set the_data to "FILE APPARENTLY DOES NOT EXIST"
    try
        set the_file to the_file as file specification
        set the_data to read the_file
    end try
    return the_data
end read_from_file

on SaR(sourceText, findText, replaceText)
    set {atid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, findText}
    set tempText to text items of sourceText
    set AppleScript's text item delimiters to replaceText
    set sourceText to tempText as string
    set AppleScript's text item delimiters to atid
    return sourceText
end SaR
[/code]

imac, Mac OS X (10.5.8), my computer is fast

Posted on Nov 26, 2010 10:41 AM


Answer 1

Pierre L.

Level 5

4,641 points

Nov 26, 2010 1:39 PM in response to handellphp

The following script seems to do what you are asking for:

set theTarget to "8701"
set theFile to choose file of type "txt"
set F to open for access theFile
set theText to read F
close access F
set theList to {}
repeat with k from 1 to (count paragraphs of theText)
    if paragraph k of theText contains theTarget then
        copy (word 1 of paragraph k of theText) to the end of theList
    end if
end repeat
theList

Did I miss something?
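
One thing that may be worth checking: AppleScript's "contains" is a substring test, so a paragraph holding 18701 or 87010 would also count as a hit for "8701". Here is a minimal shell sketch of the difference between substring and whole-word matching. The sample data and variable names are made up for illustration and are not taken from the asker's chart:

```shell
# Made-up sample data; not the asker's real chart.
sample=$(mktemp "${TMPDIR:-/tmp}/sample.XXXXXX")
cat > "$sample" <<'EOF'
100, 8701, 8702
200, 18701
300, 87010
EOF

# Plain grep matches 8701 anywhere in the line, like a substring test.
substring_hits=$(grep --count '8701' "$sample")

# --word-regexp only accepts 8701 when it stands alone as a whole word.
word_hits=$(grep --count --word-regexp '8701' "$sample")

echo "substring: $substring_hits, whole word: $word_hits"
rm -f "$sample"
```

If the real chart never contains longer numbers that embed the target, the two approaches should agree.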

Message was edited by: Pierre L.


Answer 2

Tony T1

Level 6

10,247 points

Nov 26, 2010 1:50 PM in response to handellphp

This has to be easier with grep.
Try the following from Terminal (it can be run from AppleScript with "do shell script"):

grep --word-regexp '772' Untitled.txt | grep --only-matching '^[[:alnum:]]*'

This is the data in the file Untitled.txt


715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 727, 728, 729 
736, 744, 745, 749, 742, 99, 864, 865, 661, 664, 660, 663, 662, 665, 659, 782
4484, 4479, 4476
770, 771, 772, 774, 776, 778, 779, 780, 781, 783, 785, 786, 787, 773, 282, 7702
2070, 2875, 2079, 2082, 2084, 2075
1884, 1870, 1875, 1879, 1882, 1884
8700, 8701
791, 795, 792, 796

This is the output

This will be very fast on very large files
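
For anyone who wants to try the pipeline without creating Untitled.txt by hand, here is a self-contained sketch. It writes the sample data from this thread to a temporary file (the temporary file and the variable names are assumptions for illustration) and quotes the second pattern so the shell cannot expand it:

```shell
# Temporary stand-in for Untitled.txt, filled with the sample data above.
rules_file=$(mktemp "${TMPDIR:-/tmp}/rules.XXXXXX")
cat > "$rules_file" <<'EOF'
715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 727, 728, 729
736, 744, 745, 749, 742, 99, 864, 865, 661, 664, 660, 663, 662, 665, 659, 782
4484, 4479, 4476
770, 771, 772, 774, 776, 778, 779, 780, 781, 783, 785, 786, 787, 773, 282, 7702
2070, 2875, 2079, 2082, 2084, 2075
1884, 1870, 1875, 1879, 1882, 1884
8700, 8701
791, 795, 792, 796
EOF

# Stage 1 keeps only the lines that contain 772 as a whole word.
# Stage 2 prints the leading run of alphanumerics, i.e. the first number.
result=$(grep --word-regexp '772' "$rules_file" | grep --only-matching '^[[:alnum:]]*')

echo "$result"   # prints 770 for the sample data
rm -f "$rules_file"
```

Note that 7702 on the same line is not what matches here: --word-regexp requires 772 to stand alone.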


Answer 3

Camelot

Level 9

59,229 points

Nov 26, 2010 1:43 PM in response to handellphp

There are several optimizations that come to mind, but before I go there I question whether your code works at all.

Specifically, the code:

set this_word to text item y of this_par
if this_word = findme then set findme to text item 1 of this_par

As written, your text item delimiters are empty, so 'text item y...' is going to iterate through each character in the paragraph. Given that your findme variable is a word (in this case '8701'), there is no way that any single character is going to match '8701', so your check never matches and you never change 'findme' to be the first word.
Indeed, even if it did match, the fact that you set findme to text item 1 of this_par means that findme would end up being the first character of the paragraph, not the first word, as requested.

Now, the above might work if you set the text item delimiters to a comma, but you haven't done that.

Beyond that, how slow is slow? I created a dummy 2800-line file with one occurrence of the string '8701' on the last line and it took approximately 0.01 seconds to execute. Now I consider myself pretty adept at AppleScript, but I don't think I can make much of an improvement against that, even with optimized code 🙂

So how much time does this take to run? Can you provide a real sample file?
Even though there are some obvious optimizations you could make it doesn't seem worth it at this point.
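
For anyone who wants to repeat a similar worst-case test from the command line, here is a rough shell sketch. It is not the file used for the timing above: the generated numbers, the file name, and the variable names are all made up, and the only property it keeps is a 2800-line file with '8701' on the last line.

```shell
# Build a hypothetical 2800-line file; only the last line holds 8701.
dummy=$(mktemp "${TMPDIR:-/tmp}/dummy.XXXXXX")
awk 'BEGIN {
    for (i = 1; i < 2800; i++)
        printf "%d, %d, %d\n", 100000 + i, 200000 + i, 300000 + i
    print "424242, 8700, 8701"
}' > "$dummy"

line_count=$(wc -l < "$dummy" | tr -d ' ')

# Two-stage grep: find the line, then keep its first number.
result=$(grep --word-regexp '8701' "$dummy" | grep --only-matching '^[[:alnum:]]*')

echo "$line_count lines, first number on the matching line: $result"
rm -f "$dummy"
```

Running the grep line under the time command gives a rough idea of how long the search takes on a given machine.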


Answer 4

Tony T1

Level 6

10,247 points

Nov 26, 2010 1:49 PM in response to Camelot

His problem is the reason why grep was written. AppleScript is not the best solution for this problem.


Answer 5

handellphp Author

Level 1

14 points

Nov 26, 2010 4:02 PM in response to Pierre L.

No luck here, this script took just as long. I need 2 seconds or less.

similar example text file is at:
http://tmgraphics.biz/VIKINGS/untitled.txt

(MORE INFO)
This script was designed to locate Adobe Illustrator templates. Instead of saving a template for each number, we have a chart that tells us which templates are the same. The number at the beginning of each line is the actual template number. We have at least 10,000 templates.

Maybe I'm approaching this all wrong... any suggestions?


Answer 6

handellphp Author

Level 1

14 points

Nov 26, 2010 4:15 PM in response to Camelot

I thought I had the text item delimiter set to comma; I think it might be the default. Anyway, the script works, but it's slow.

Slow is more than 1-2 seconds for me, or I would just do it manually.

similar example text file is at:
http://tmgraphics.biz/VIKINGS/untitled.txt


Answer 7

handellphp Author

Level 1

14 points

Nov 26, 2010 4:19 PM in response to Tony T1

forgive me when it comes to shell scripts but it does not recognize the file on the desktop

similar example text file is at:
http://tmgraphics.biz/VIKINGS/untitled.txt

Can it pull info from online?

{code}
do shell script "grep --word-regexp '8701' untitled.txt | grep --only-matching ^[[:alnum:]]*"


Answer 8

handellphp Author

Level 1

14 points

Nov 26, 2010 5:10 PM in response to Tony T1

Thanks Tony T1, I got it. This works perfectly, in milliseconds. I can't thank you enough.

[code]

set template to do shell script "grep --word-regexp '8701' /Users/macuser/Desktop/untitled.txt | grep --only-matching ^[[:alnum:]]*"

display dialog template


Answer 9

Pierre L.

Level 5

4,641 points

Nov 26, 2010 5:24 PM in response to handellphp

As Camelot said: “how slow is slow?”

I added a few timing lines to my script (the "set t0" line, the "repeat 100 times" loop, and the final line) and tested it on the text file you referred to (2016 paragraphs). On my computer, 100 runs take 34 seconds, so the script takes about one third of a second to return a result.

set theTarget to "8701"
set theFile to choose file of type "txt"
set t0 to (current date)
repeat 100 times
    set F to open for access theFile
    set theText to read F
    close access F
    set theList to {}
    repeat with k from 1 to (count paragraphs of theText)
        if paragraph k of theText contains theTarget then
            copy (word 1 of paragraph k of theText) to the end of theList
        end if
    end repeat
end repeat
((current date) - t0) & theList --> {34, "22319"}

Nevertheless, I can understand that you are looking for something faster.
Thanks for your feedback.


Answer 10

Tony T1

Level 6

10,247 points

Nov 26, 2010 5:33 PM in response to handellphp

handellphp wrote:
forgive me when it comes to shell scripts but it does not recognize the file on the desktop

Try:

set f to choose file
set s to text returned of (display dialog "Please enter search:" default answer "")
try
    set a to do shell script "/usr/bin/grep --word-regexp --max-count=1 " & quoted form of s & space & quoted form of POSIX path of f & " | /usr/bin/grep --only-matching '^[[:alnum:]]*'"
    display dialog "Found " & a
on error
    display dialog s & space & "Not Found"
end try

Your example file has multiple 'hits'. How do you want to handle this? The above script stops after the first 'hit' because of "--max-count=1"; remove that option to display all of them.
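
To see what "--max-count=1" changes, here is a small shell sketch. The three-line data file and the variable names are made up for illustration; the real example file will give different numbers:

```shell
# Made-up data with two lines that contain the target 8701.
hits_file=$(mktemp "${TMPDIR:-/tmp}/hits.XXXXXX")
cat > "$hits_file" <<'EOF'
100, 8701
200, 300
400, 8701
EOF

# With --max-count=1 the first grep stops after the first matching line.
first_hit=$(grep --word-regexp --max-count=1 '8701' "$hits_file" | grep --only-matching '^[[:alnum:]]*')

# Without it, every matching line contributes its first number.
all_hits=$(grep --word-regexp '8701' "$hits_file" | grep --only-matching '^[[:alnum:]]*')

echo "first: $first_hit"
echo "all:"
echo "$all_hits"
rm -f "$hits_file"
```

With this data the first variant yields 100 only, while the second yields 100 and 400 on separate lines.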


Answer 11

Tony T1

Level 6

10,247 points

Nov 26, 2010 5:42 PM in response to handellphp

handellphp wrote:
Thanks Tony T1, I got it. This works perfectly, in milliseconds. I can't thank you enough.

You're welcome (you found the file recognition problem as I was posting a reply)
BTW, to post code in these forums, enclose in :

..


Answer 12

handellphp Author

Level 1

14 points

Nov 26, 2010 6:04 PM in response to Tony T1

thanks again. i used this on 36000 lines & didn't see a pause at all. i may be able to use this in other applications as well. Is there an explode type command using grep? (similar to PHP explode)


Answer 13

Tony T1

Level 6

10,247 points

Nov 26, 2010 7:02 PM in response to handellphp

handellphp wrote:
Is there an explode type command using grep? (similar to PHP explode)

I'm not familiar with PHP explode, but if all you want to do is replace spaces and any non-alphanumeric character with a newline, you can do it with tr:

Untitled.txt: This is a 123 test
tr -cs "[:alnum:]" "\n" < Untitled.txt
output:
This
is
a
123
test

You can also pipe the output of grep through tr:


grep --word-regexp '772' Untitled.txt | grep --only-matching '^[[:alnum:]]*' | tr -cs "[:alnum:]" "\n"
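
As a rough stand-in for an explode-style split, the tr step can be wrapped in a small shell function. This is only a sketch: the function name explode and the sample strings are made up, and it splits on any run of non-alphanumeric characters rather than on one chosen delimiter.

```shell
# explode STRING  (hypothetical helper name)
# Prints each alphanumeric piece of STRING on its own line.
explode() {
    # -c complements the set, -s squeezes each run of separators
    # into a single newline.
    printf '%s\n' "$1" | tr -cs '[:alnum:]' '\n'
}

pieces=$(explode '770, 771, 772, 774')
piece_count=$(printf '%s\n' "$pieces" | wc -l | tr -d ' ')

echo "$pieces"
echo "pieces: $piece_count"
```

Feeding it a whole matching line from grep, instead of only the first number, lists every template number on that line.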

Take a look at (in Terminal) man grep (or http://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man1/grep.1.html )
also for a regular expression tutorial see: http://www.grymoire.com/Unix/Regular.html
or just google grep tutorial


Answer 14

handellphp Author

Level 1

14 points

Nov 26, 2010 8:02 PM in response to Tony T1

Thanks for the info, this looks like pretty powerful stuff, i hope i can learn some of it.

php explode breaks data into chunks (arrays) based on the delimiters
makes it easy to pick apart web page source content


Answer 15

Tony T1

Level 6

10,247 points

Nov 26, 2010 8:35 PM in response to handellphp

handellphp wrote:
php explode breaks data into chunks (arrays) based on the delimiters
makes it easy to pick apart web page source content

Looks like AWK would be useful for this if tr is not good enough.

For AWK, take a look at http://www.grymoire.com/Unix/Awk.html

For grep and sed: http://www.osxfaq.com/Tutorials/LearningCenter/UnixTutorials/GrepSedRegexp/index.ws

Also, stop by the Apple UNIX forum: http://discussions.apple.com/forum.jspa?forumID=735
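
As a taste of what AWK can do with delimited data, here is a minimal sketch that splits each line into fields and does the template lookup from earlier in this thread in one step. The function name lookup_template, the temporary file, and the three-line chart are assumptions for illustration, not part of the original problem files:

```shell
# lookup_template NUMBER FILE  (hypothetical helper name)
# Prints the first field of every line that has NUMBER as one of its fields.
lookup_template() {
    awk -F'[, ]+' -v target="$1" '
        {
            for (i = 1; i <= NF; i++)
                if ($i == target) { print $1; next }
        }
    ' "$2"
}

# Made-up chart in the same layout as the sample data.
chart=$(mktemp "${TMPDIR:-/tmp}/chart.XXXXXX")
cat > "$chart" <<'EOF'
770, 771, 772, 774
8700, 8701
791, 795, 792, 796
EOF

template=$(lookup_template 8701 "$chart")
echo "$template"
rm -f "$chart"
```

Because AWK compares whole fields, a longer number that merely contains the target does not count as a match.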
