NEED A FASTER TEXT PARSING SOLUTION

It can be a shell script...

This example script takes far too long to process larger text files.
What I need it to do is:
read a text file,
find a string,
then return the first word of the line (paragraph) that contains it.

But my files are 2000 lines long! In this example I'm just working with a short snippet of the text file.


Any ideas?




[code]
set findme to "8701"
set findme to SaR(findme, " ", "")
set the_file to (((path to desktop) as string) & "etch_rules.txt")
set rulez to my read_from_file(the_file) -- this file is normally 2500 lines

set rulez to "715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 727, 728, 729
736, 744, 745, 749, 742, 99, 864, 865, 661, 664, 660, 663, 662, 665, 659, 782
4484, 4479, 4476
770, 771, 772, 774, 776, 778, 779, 780, 781, 783, 785, 786, 787, 773, 282, 7702
2070, 2875, 2079, 2082, 2084, 2075
1884, 1870, 1875, 1879, 1882, 1884
8700, 8701
791, 795, 792, 796" -- substitute for larger text file

set rulez to SaR(rulez, " ", "")
set cnt to count paragraphs of rulez

repeat with x from 1 to cnt
    set this_par to paragraph x of rulez
    set cnt2 to count text items of this_par
    set isit to the offset of findme in this_par
    if isit ≠ 0 then
        repeat with y from 1 to cnt2
            set this_word to text item y of this_par
            if this_word = findme then set findme to text item 1 of this_par
        end repeat
    end if
end repeat

display dialog findme


-- read the whole file; returns an error message string if it cannot be read
on read_from_file(the_file)
    set the_data to "FILE APPARENTLY DOES NOT EXIST"
    try
        set the_file to the_file as file specification
        set the_data to read the_file
    end try
    return the_data
end read_from_file


-- search and replace: swap every findText in sourceText for replaceText
on SaR(sourceText, findText, replaceText)
    set {atid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, findText}
    set tempText to text items of sourceText
    set AppleScript's text item delimiters to replaceText
    set sourceText to tempText as string
    set AppleScript's text item delimiters to atid
    return sourceText
end SaR
[/code]

imac, Mac OS X (10.5.8), my computer is fast

Posted on Nov 26, 2010 10:41 AM

20 replies

Nov 26, 2010 1:39 PM in response to handellphp

The following script seems to do what you are asking for:

set theTarget to "8701"
set theFile to choose file of type "txt"
set F to open for access theFile
set theText to read F
close access F
set theList to {}
repeat with k from 1 to (count paragraphs of theText)
    if paragraph k of theText contains theTarget then
        copy (word 1 of paragraph k of theText) to the end of theList
    end if
end repeat
theList


Did I miss something?

Message was edited by: Pierre L.

Nov 26, 2010 1:50 PM in response to handellphp

This has to be easier with grep.
Try the following from Terminal (it can be run from AppleScript with "do shell script"):


grep --word-regexp '772' Untitled.txt | grep --only-matching '^[[:alnum:]]*'


This is the data in the file Untitled.txt

715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 727, 728, 729
736, 744, 745, 749, 742, 99, 864, 865, 661, 664, 660, 663, 662, 665, 659, 782
4484, 4479, 4476
770, 771, 772, 774, 776, 778, 779, 780, 781, 783, 785, 786, 787, 773, 282, 7702
2070, 2875, 2079, 2082, 2084, 2075
1884, 1870, 1875, 1879, 1882, 1884
8700, 8701
791, 795, 792, 796


This is the output

770


This will be very fast, even on very large files.
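
If you want to try the pipeline without creating Untitled.txt by hand, here is a minimal self-contained sketch. The helper name first_of_line and the temporary file are only there for illustration; they are assumptions of this sketch, not part of the original command.

```shell
#!/bin/sh
# Sketch only: first_of_line is a made-up helper name.
# Prints the first entry of every line in file $2 that contains $1 as a whole word.
first_of_line() {
    grep --word-regexp -- "$1" "$2" | grep --only-matching '^[[:alnum:]]*'
}

# Sample data from the post above, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 727, 728, 729
736, 744, 745, 749, 742, 99, 864, 865, 661, 664, 660, 663, 662, 665, 659, 782
4484, 4479, 4476
770, 771, 772, 774, 776, 778, 779, 780, 781, 783, 785, 786, 787, 773, 282, 7702
2070, 2875, 2079, 2082, 2084, 2075
1884, 1870, 1875, 1879, 1882, 1884
8700, 8701
791, 795, 792, 796
EOF

first_of_line 772 "$sample"     # prints 770
first_of_line 8701 "$sample"    # prints 8700
```

Because of --word-regexp, searching for 772 does not match 7702, which a plain substring search would.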

Nov 26, 2010 1:43 PM in response to handellphp

There are several optimizations that come to mind, but before I go there I question whether your code works at all.

Specifically, the code:

set this_word to text item y of this_par
if this_word = findme then set findme to text item 1 of this_par


As written, your text item delimiters are empty, so 'text item y...' is going to iterate through each character in the paragraph. Given that your findme variable is a word (in this case '8701'), there is no way that any single character is going to match '8701', so your check never matches and you never change 'findme' to be the first word.
Indeed, even if it did match, the fact that you set findme to text item 1 of this_par means that findme would end up being the first character of the paragraph, not the first word, as requested.

Now, the above might work if you set the text item delimiters to a comma, but you haven't done that.

Beyond that, how slow is slow? I created a dummy 2800-line file with one occurrence of the string '8701' on the last line and it took approximately 0.01 seconds to execute. Now I consider myself pretty adept at AppleScript, but I don't think I can make much of an improvement against that, even with optimized code 🙂

So how much time does this take to run? Can you provide a real sample file?
Even though there are some obvious optimizations you could make, it doesn't seem worth it at this point.
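
For anyone who wants to repeat that kind of test from Terminal, here is a rough sketch that builds a 2800-line dummy file with the target only on the last line. The filler numbers are invented and the file is not the one used for the timing above; it only has the same shape.

```shell
#!/bin/sh
# Sketch only: the filler lines are invented; only the last line holds the target.
dummy=$(mktemp)
awk 'BEGIN {
    for (i = 1; i < 2800; i++)
        print i ", " (i + 10000) ", " (i + 20000)   # 2799 filler lines
    print "8700, 8701"                               # the only line containing 8701
}' > "$dummy"

wc -l < "$dummy"    # line count of the dummy file (2800)

# In bash or zsh, put `time` in front of the next line to see how long it takes.
grep --word-regexp '8701' "$dummy" | grep --only-matching '^[[:alnum:]]*'
```

The last command prints 8700, the first entry of the only matching line.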

Nov 26, 2010 4:02 PM in response to Pierre L.

No luck here; this script took just as long. I need 2 seconds or less.

A similar example text file is at:
http://tmgraphics.biz/VIKINGS/untitled.txt


(MORE INFO)
This script was designed to locate Adobe Illustrator templates. Instead of saving a template for each number, we have a chart that tells us which templates are the same. The number at the beginning of each line is the actual template number. We have at least 10,000 templates.

Maybe I'm approaching this all wrong... any suggestions?
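
Since the chart is really a lookup table (every number on a line maps to the first number on that line), one possible approach is a single awk pass that compares each entry exactly. This is only a sketch under that assumption: template_for is a hypothetical helper name, and the two chart lines are copied from the snippet in the first post.

```shell
#!/bin/sh
# Sketch only: template_for is a hypothetical helper name.
# $1 = number to look up, $2 = chart file with comma-separated numbers per line.
template_for() {
    awk -F', *' -v target="$1" '
        { sub(/[ \t\r]+$/, "") }                  # drop trailing blanks or a stray CR
        {
            for (i = 1; i <= NF; i++)
                if (($i "") == (target "")) {     # compare as strings: exact entry only
                    print $1                      # first entry = actual template number
                    break
                }
        }
    ' "$2"
}

chart=$(mktemp)
cat > "$chart" <<'EOF'
770, 771, 772, 774, 776, 778, 779, 780, 781, 783, 785, 786, 787, 773, 282, 7702
8700, 8701
EOF

template_for 7702 "$chart"    # prints 770
template_for 8701 "$chart"    # prints 8700
```

It prints one line per matching chart line, so a number listed on several lines gives several results.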

Nov 26, 2010 5:24 PM in response to handellphp

As Camelot said: “how slow is slow?”

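I added a few timing lines to my script (the "set t0", "repeat 100 times", matching "end repeat" and final result lines below) and tested it on the text file you referred to (2016 paragraphs). The result shows that, on my computer, 100 runs take 34 seconds, so the script needs about one third of a second to return a result.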

set theTarget to "8701"
set theFile to choose file of type "txt"
set t0 to (current date)
repeat 100 times
    set F to open for access theFile
    set theText to read F
    close access F
    set theList to {}
    repeat with k from 1 to (count paragraphs of theText)
        if paragraph k of theText contains theTarget then
            copy (word 1 of paragraph k of theText) to the end of theList
        end if
    end repeat
end repeat
((current date) - t0) & theList --> {34, "22319"}



Nevertheless, I can understand that you are looking for something faster.
Thanks for your feedback.

Nov 26, 2010 5:33 PM in response to handellphp

handellphp wrote:
forgive me when it comes to shell scripts but it does not recognize the file on the desktop


Try:
set f to choose file
set s to text returned of (display dialog "Please enter search:" default answer "")
try
    set a to do shell script "/usr/bin/grep --word-regexp --max-count=1 " & quoted form of s & space & quoted form of POSIX path of f & " | /usr/bin/grep --only-matching '^[[:alnum:]]*'"
    display dialog "Found " & a
on error
    display dialog s & space & "Not Found"
end try

Your example file has multiple 'hits'. How do you want to handle this? The script above stops after the first 'hit' because of "--max-count=1"; remove that option to display all of them.
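
If you do want every hit, one way to handle it is to drop --max-count and remove duplicate results with sort -u. The sketch below assumes that is the behaviour you want; all_templates is a made-up helper name and the chart lines are invented test data, not taken from your file.

```shell
#!/bin/sh
# Sketch only: all_templates is a made-up helper name.
# Lists the first entry of every line in file $2 containing $1, without duplicates.
all_templates() {
    grep --word-regexp -- "$1" "$2" | grep --only-matching '^[[:alnum:]]*' | sort -u
}

# Invented test data: 8701 appears on three lines, two of which start with 100.
hits=$(mktemp)
cat > "$hits" <<'EOF'
100, 8701, 102
200, 201
300, 301, 8701
100, 8701
EOF

all_templates 8701 "$hits"    # prints 100 and 300, one per line
```

From AppleScript, "paragraphs of (do shell script ...)" would turn that output into a list.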

Nov 26, 2010 7:02 PM in response to handellphp

handellphp wrote:
Is there an explode type command using grep? (similar to PHP explode)


I'm not familiar with PHP explode, but if all you want to do is replace spaces and any other non-alphanumeric character with a newline, you can do it with tr:

Untitled.txt: This is a 123 test
tr -cs "[:alnum:]" "\n" < Untitled.txt
output:
This
is
a
123
test


You can also pipe the output of grep through tr:

grep --word-regexp '772' Untitled.txt | grep --only-matching '^[[:alnum:]]*' | tr -cs "[:alnum:]" "\n"
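
As a rough stand-in for explode on one line of your chart, something like the following might do. The function name explode is just borrowed from PHP for illustration; it is not a real shell command.

```shell
#!/bin/sh
# Sketch only: "explode" is a made-up function name borrowed from PHP.
# Splits $1 into one item per line, treating every run of non-alphanumeric
# characters (commas, spaces, ...) as a single separator.
explode() {
    printf '%s\n' "$1" | tr -cs '[:alnum:]' '\n'
}

explode '770, 771, 772, 774'

# The items can then be handled one at a time:
explode '8700, 8701' | while read -r item; do
    echo "item: $item"
done
```

The first call prints 770, 771, 772 and 774 on separate lines. Note that a string starting with a separator would produce an empty first line.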


Take a look at man grep in Terminal (or http://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man1/grep.1.html ).
Also, for a regular expression tutorial, see: http://www.grymoire.com/Unix/Regular.html
or just google "grep tutorial".

Nov 26, 2010 8:35 PM in response to handellphp

handellphp wrote:
php explode breaks data into chunks (arrays) based on the delimiters
makes it easy to pick apart web page source content


Looks like AWK would be useful for this if tr is not good enough.

For AWK, take a look at http://www.grymoire.com/Unix/Awk.html

For grep and sed: http://www.osxfaq.com/Tutorials/LearningCenter/UnixTutorials/GrepSedRegexp/index.ws

Also, stop by the Apple UNIX forum: http://discussions.apple.com/forum.jspa?forumID=735
