How can I use Automator to extract substring of text based on pattern?

Question

Level 1

30 points

I have inbound text in a workflow and I want to extract a substring from the text.

(inbound text)

我 wǒ 代 ① （指一人）（用作主语） I （用作宾语） me （表所属关系） my 告诉我 tell me 我为人人，人人为我 one for all and all for one 我爸/妈 my father/mother 我的祖国 my homeland 我现在没空。 I am busy at the moment. 我认为我行！ I think I can manage it. ② （指两人或以上）（用作主语） we （用作宾语） us （表所属关系） our 我厂/国/校/军 our factory/country/school/army 敌军被我全歼。 The enemy was annihilated by us. → 我方, 敌我矛盾 ③ （表泛指） [used together with 你 in parallel structures] anyone 大家你一言，我一语，献计献策。 They had a brainstorming session with anyone and everyone joining in. 市场里你来我往非常热闹。 The market is bustling with people coming and going. → 尔虞我诈, 你死我活 ④ （指自我） self → 忘我, 自我

I basically want only the non-double-byte characters between the first character and the first occurrence of ① (see sample):

我 wǒ 代 ①

Using regex101.com, I determined that this regex pattern should produce the required result, but I need help getting that result in Automator:

/ ([a-z].) /

MacBook Air (13-inch Mid 2012), Mac OS X (10.7.5), Love the mac (OSX 10.9+)

Posted on Sep 9, 2014 2:59 PM


Answer 1

Top-ranking reply

SGIII

Level 8

36,662 points

Sep 9, 2014 7:24 PM in response to mingsai

If your pinyin substring never has spaces in it, then you can use AppleScript to extract it like this:

This is the script (with the sample input). Copy/paste into AppleScript Editor and click the green triangle 'Run' button:

set input to "我 wǒ 代 ① （指一人）（用作主语） I （用作宾语） me （表所属关系） my 告诉我 tell me 我为人人，人人为我 one for all and all for one 我爸/妈 my father/mother 我的祖国 my homeland 我现在没空。 I am busy at the moment. 我认为我行！ I think I can manage it. ② （指两人或以上）（用作主语） we （用作宾语） us （表所属关系） our 我厂/国/校/军 our factory/country/school/army 敌军被我全歼。 The enemy was annihilated by us. → 我方, 敌我矛盾 ③ （表泛指） [used together with 你 in parallel structures] anyone 大家你一言，我一语，献计献策。 They had a brainstorming session with anyone and everyone joining in. 市场里你来我往非常热闹。 The market is bustling with people coming and going. → 尔虞我诈, 你死我活 ④ （指自我） self → 忘我, 自我"

set pyCharSet to {"a", "ā", "á", "ǎ", "à", "b", "c", "d", "e", "ē", "é", "ě", "è", "f", "g", "h", "i", "ī", "í", "ǐ", "ì", "j", "k", "l", "m", "n", "o", "ō", "ó", "ǒ", "ò", "p", "q", "r", "s", "t", "u", "ū", "ú", "ǔ", "ù", "ǖ", "ǚ", "ǜ", "w", "x", "y", "z"}

set {oTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, "①"}
-- keep the text before the first ①, minus the leading headword character
set cc to (input as string)'s text items's item 1's characters 2 thru -1
set py to ""
repeat with c in cc
	if c is in pyCharSet then set py to py & c
end repeat
set AppleScript's text item delimiters to oTID
return py
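For anyone who wants to try the same filtering idea outside AppleScript, here is a minimal Python sketch. It is not part of the script above: the function name `extract_pinyin` and the shortened sample are assumptions for illustration. Like the script, it takes the text before the first ①, skips the first character, and keeps only characters found in the pinyin set.

```python
import unicodedata

# Same letters as pyCharSet above, normalized so each tone-marked vowel
# is a single code point.
PINYIN_CHARS = set(unicodedata.normalize(
    "NFC", "aāáǎàbcdeēéěèfghiīíǐìjklmnoōóǒòpqrstuūúǔùǖǚǜwxyz"))

def extract_pinyin(definition: str) -> str:
    """Return the pinyin letters between the headword and the first ①."""
    text = unicodedata.normalize("NFC", definition)
    head = text.split("①", 1)[0]   # text before the first sense marker
    candidates = head[1:]           # skip the headword character itself
    return "".join(ch for ch in candidates if ch in PINYIN_CHARS)

# Shortened version of the sample input from the question.
sample = "我 wǒ 代 ① （指一人）（用作主语） I （用作宾语） me"
print(extract_pinyin(sample))
```

Note that Python's `in` test is case-sensitive, so capital letters are dropped by this filter.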

The above view is AppleScript Editor, not Automator. What do you plan to do with the results in Automator?

SG


Answer 2

mingsai Author

Level 1

30 points

Sep 9, 2014 9:30 PM in response to SGIII

I'd like to take an input string such as 我是美国人 and have a system-wide service that generates the pinyin (while retaining the original text). I've seen this as the Phonetic Guide Text in Pages on a context menu, and I just want to make it available as a system-wide service for use with other programs. Hopefully you'll tell me there is a simpler way than rolling my own script.

😊


Answer 3

mingsai Author

Level 1

30 points

Sep 9, 2014 11:26 PM in response to SGIII

Hi SGIII,

I just wanted to say thanks for kick-starting me. I am well on my way to accomplishing this task with some modifications to the AppleScript source you provided. I figured out that I would need to loop through the input and look up each individual character in the dictionary (so I created a separate service to handle simple dictionary lookup). I parse the definition using your suggested code wrapped into a function, and then append the output. Right now it's close but not perfect, showing this output for the characters above:

"wǒshìAměiAguórén" (the A is wrong and the reason is that I need to change the delimiter to factor that some definitions will have a capital A in them). I will share the finished script on my pastebin.

----on run {input, parameters}
set my_string to "我是美国人" --input
set output to ""
repeat with counter_variable_name from 1 to count of my_string
	set current_character to item counter_variable_name of my_string
	-- quoted form keeps punctuation in the input from breaking the shell command
	set current_definition to do shell script "automator -i " & quoted form of current_character & " ~/Library/Services/GetChineseDefinition.workflow"
	set output to output & parseForPinyin(current_definition)
end repeat
--end run

on parseForPinyin(character_input)
	set pyCharSet to {"a", "ā", "á", "ǎ", "à", "b", "c", "d", "e", "ē", "é", "ě", "è", "f", "g", "h", "i", "ī", "í", "ǐ", "ì", "j", "k", "l", "m", "n", "o", "ō", "ó", "ǒ", "ò", "p", "q", "r", "s", "t", "u", "ū", "ú", "ǔ", "ù", "ǖ", "ǚ", "ǜ", "w", "x", "y", "z"}
	set {oTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, "①"}
	set cc to (character_input as string)'s text items's item 1's characters 2 thru -1
	set py to ""
	repeat with c in cc
		if c is in pyCharSet then set py to py & c
	end repeat
	set AppleScript's text item delimiters to oTID
	return py
end parseForPinyin
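One possible explanation for the stray "A", offered as an assumption rather than a diagnosis: AppleScript compares text without regard to case by default, so "A" can match "a" in pyCharSet unless the test is wrapped in a `considering case` block. The Python sketch below imitates both behaviours. The function name `keep_pinyin` and the sample text " měi A " are hypothetical.

```python
import unicodedata

PINYIN_CHARS = set(unicodedata.normalize(
    "NFC", "aāáǎàbcdeēéěèfghiīíǐìjklmnoōóǒòpqrstuūúǔùǖǚǜwxyz"))

def keep_pinyin(text: str, ignore_case: bool) -> str:
    """Keep characters that are in the pinyin set, optionally ignoring case."""
    kept = []
    for ch in unicodedata.normalize("NFC", text):
        probe = ch.lower() if ignore_case else ch
        if probe in PINYIN_CHARS:
            kept.append(ch)
    return "".join(kept)

# Hypothetical text found before ① in a definition that contains a capital A.
head = " měi A "
print(keep_pinyin(head, ignore_case=True))   # the capital A slips through
print(keep_pinyin(head, ignore_case=False))  # the capital A is dropped
```

If this is the cause, a case-sensitive comparison would remove the "A" without changing the delimiter.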

The GetChineseDefinition service couldn't be simpler:


Answer 4

VikingOSX

Level 10

119,957 points

Sep 10, 2014 8:01 AM in response to mingsai

Here is an abbreviated way to capture wǒ from your input stream. Note that AppleScript requires certain characters to be escaped, such as the double quotes and backslashes in the command string below.

set inbound to "我 wǒ 代 ① （指一人）（用作主语） I （用作宾语） me （表所属关系） my 告诉我 tell me 我为人人，人人为我 one for all and all for one 我爸/妈 my father/mother 我的祖国 my homeland 我现在没空。 I am busy at the moment. 我认为我行！ I think I can manage it. ② （指两人或以上）（用作主语） we （用作宾语） us （表所属关系） our 我厂/国/校/军 our factory/country/school/army 敌军被我全歼。 The enemy was annihilated by us. → 我方, 敌我矛盾 ③ （表泛指） [used together with 你 in parallel structures] anyone 大家你一言，我一语，献计献策。 They had a brainstorming session with anyone and everyone joining in. 市场里你来我往非常热闹。 The market is bustling with people coming and going. → 尔虞我诈, 你死我活 ④ （指自我） self → 忘我, 自我"

set outbound to ""

set cmd to "perl -we \"print <> =~ /([\\w+]..)/\" <<< "

set outbound to (do shell script cmd & inbound's quoted form)

display dialog outbound


Answer 5

SGIII

Level 8

36,662 points

Sep 10, 2014 2:13 PM in response to mingsai

Thanks for the inspiration for a service, and for showing how to put a reference to another workflow in an AppleScript loop via 'do shell script'. I don't have your dictionary on my Mac (it doesn't seem to be on the US Mac App Store and I couldn't figure out where else to get it) but I do have freeware Xiao Cidian, which I've seldom used because it has lots of single characters but few combinations. That makes it quite well suited for this purpose because it's easier to get the pinyin out; just extract the first line of the definition (and filter out any "Not found" for when the script encounters punctuation or non-Chinese in the input.)

So this is now working here:

With this (placed in Workflows folder instead of Services, so it doesn't clutter up the menu; this also makes it available to the Run Workflow action in Automator in case you want that in the future):

So the result of selecting this: 第二十三回横海郡柴进留宾　景阳冈武松打虎, then choosing Get Pinyin from the Services menu, and pasting in a text editor, is this:

Not bad, though the pinyin for some characters is not "context-aware."

The script that works with Xiao Cidian is below. I have the output going to the clipboard, but of course it could go elsewhere as well.

SG

on run {input}
	set pinyin to ""
	set inbound to input as text
	repeat with c in inbound's characters
		set def to (do shell script "automator -i " & quoted form of c & " ~/Library/Workflows/ch-to-py.workflow")
		set def to def's characters 1 thru -2 as string -- strip "" and trailing return
		set py to def's paragraph 1 as string -- get first line of xiaocidian.com definition -- the pinyin
		try
			if (py's characters 1 thru 2 as string) is "No" then
				set pinyin to pinyin & c -- punctuation, etc. - "Not found"
			else
				set pinyin to pinyin & py
			end if
		on error
			set pinyin to pinyin & py -- one-letter pinyin
		end try
	end repeat
	set outbound to inbound & return & pinyin
	set the clipboard to outbound
	-- return outbound
end run


Answer 6

mingsai Author

Level 1

30 points

Sep 10, 2014 2:15 PM in response to VikingOSX

I see a couple of weird characters when using the perl command.

"wǒsh√1 mgu√ré"

I believe that when I had this tied to an Automator script there was a need to set some terminal variables to account for the double byte characters in parsing Chinese text...

PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin/: 
export PATH

LC_CTYPE=UTF-8

Would some adjustment like this be possible in the Perl command? By the way, thanks for that regex. I was up all night trying to get that pattern.


Answer 7

mingsai Author

Level 1

30 points

Sep 10, 2014 2:27 PM in response to SGIII

You can find all of the open source dictionaries that I use on my GitHub page. I think it's probably worth the effort to analyze the other dictionaries and complete a fully flexible tool no matter which dictionary is selected. That leads me to consider, however, whether there is a way (I am sure there must be) to analyze available dictionaries and possibly even allow users to select the dictionary desired (at least during the initial run)... Just thinking out loud on that one...

I will also rearrange the internal workflow. Good catch.


Answer 8

VikingOSX

Level 10

119,957 points

Sep 10, 2014 3:30 PM in response to mingsai

I used the identical multi-lingual text that you originally posted for my testing. Not certain how you got the additional output from my AppleScript code. 😉

You can tell Perl to treat the captured item as UTF-8, but when that change is made, wǒ becomes 我. That is also why I did not set the outbound variable as text (or «class utf8»). Here is the Perl change that returns 我.

-- perl -we "print <> =~ /([\w+]..)/u" <<<

set cmd to "perl -we \"print <> =~ /([\\w+]..)/u\" <<< "

If you display the contents of /etc/paths, you will see that these are default paths already set up for you, and the processes that you launch.

Here is how you would incorporate the locale change. If you omit the trailing 'u' on the Perl regex, this returns wǒ.

set cmd to "export LC_CTYPE=UTF-8;perl -we \"print <> =~ /([\\w+]..)/u\" <<< "
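The difference between the two results is consistent with the pattern being applied to bytes in one case and to characters in the other. The Python sketch below illustrates that general effect with the same pattern. It is an analogy only and does not reproduce Perl's own handling of the `u` flag or the locale. The variable names are assumptions.

```python
import re
import unicodedata

# Start of the sample text, normalized so that ǒ is one code point (U+01D2).
text = unicodedata.normalize("NFC", "我 wǒ 代 ①")
pattern = r"([\w+]..)"

# Byte-oriented match: \w only matches ASCII here, and each "." consumes
# one byte, so "w" plus the two UTF-8 bytes of ǒ are captured.
byte_hit = re.search(pattern.encode("ascii"), text.encode("utf-8")).group(1)
byte_match = byte_hit.decode("utf-8")

# Character-oriented match: \w also matches 我, and each "." consumes one
# whole character, so the capture starts at the headword instead.
char_match = re.search(pattern, text).group(1)

print(repr(byte_match))
print(repr(char_match))
```

In other words, the two-dot pattern depends on ǒ occupying exactly two bytes, which may explain why it misbehaves on other input.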


Answer 9

SGIII

Level 8

36,662 points

Sep 11, 2014 4:48 AM in response to mingsai

The regex works here on the one sample multi-lingual text. I haven't figured out how to install the dictionaries from your GitHub page (use DictUnifier.app somehow?), so I only have your one sample definition to work from. But it turns out that this returns the pinyin from your sample definition.

set py to input's paragraph 1's word 2

A rare case of AppleScript being more concise than a shell script! 🙂 (Can't speak for performance, though).

set input to "我 wǒ
①代说话人称自己。
我认识你 | 他是我的老师
②代称自己的一方，相当于“我们”。
我国 | 我军 | 我校 | 我厂 | 敌我双方
③代用于“你” “我”对举，泛指许多人。
你来我往 | 你一言，我一语。
④代自己。
自我介绍 | 忘我工作"

set py to input's paragraph 1's word 2
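For comparison, a rough Python equivalent of `paragraph 1's word 2` is sketched below. The function name `pinyin_from_entry` is an assumption, and the match is not exact: Python splits on whitespace here, whereas AppleScript applies its own word-boundary rules.

```python
def pinyin_from_entry(entry: str) -> str:
    """Return the second whitespace-separated token of the first line."""
    first_line = entry.splitlines()[0]
    return first_line.split()[1]

# First lines of the sample definition above.
entry = "我 wǒ\n①代说话人称自己。\n我认识你 | 他是我的老师"
print(pinyin_from_entry(entry))
```

Like the AppleScript one-liner, this assumes the first line always has the form "headword pinyin".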

Instead of calling another workflow to look up a word it seems it *may* be possible to use open location "dict://...". May or may not be more efficient, and not sure whether there is a way to specify the dictionary.

SG


Answer 10

VikingOSX

Level 10

119,957 points

Sep 11, 2014 5:59 AM in response to mingsai

The Apple Dictionary app has a Simplified Chinese option in its preferences. One can click, and then drag this selection to the very top of that window so that it becomes the default dictionary. Previously, I had written an application in Python that uses Apple's Dictionary Services. Provide a native word on the command-line, and it returns the local dictionary entry. When I supplied 我 to the application, I got the following in return:

odin: ~$ pydict.py 我

Because Apple's Dictionary Services will search all selected dictionaries in Dictionary app's preferences for the provided word, preferences order isn't important. The following curl dictionary lookup gets no match.

curl -s dict://dict.org/d: 我


Answer 11

mingsai Author

Level 1

30 points

Sep 11, 2014 8:28 AM in response to SGIII

Re: How to install the dictionaries.

Just download the entire repo as a zip file, then copy the dictionaries manually to /Library/Dictionaries or ~/Library/Dictionaries. Open Dictionary and enable the new dictionaries in the Dictionary preferences. I will set up a package installer for this when I have time.


Answer 12

mingsai Author

Level 1

30 points

Sep 11, 2014 8:40 AM in response to VikingOSX

I did the very same task using a command line program made with Xcode (rdef). What I've noticed is that a number of entries return different patterns of text. What is needed is a library that handles these variations to always produce valid pinyin. It seems to me that one way to approach this is to gather the first word that has accented text in it (this is why I was excited to see SGIII's initial answer containing the accented characters). Such a regex would handle 90% of the pinyin; however, some pinyin are not accented at all. In most instances this would be the first non-Chinese (and non-numeric) word in a dictionary entry following the character.

BTW, your previous regex worked just fine on the sample provided. It's when I applied it to the extended use cases that I ran into trouble again. Also, when crossing the application domain barrier between the shell and the workflow/script/service domains, I've seen some unusual behaviors when dealing with Chinese text.

You might want to download all of the dictionaries that I mentioned in an earlier post to use if you want to pursue this idea further.

P.S. - The repo needs to have a packager, which I will work on but the folders are named for the location where each file belongs.


Answer 13

mingsai Author

Level 1

30 points

Sep 11, 2014 8:50 AM in response to mingsai

--


Answer 14

mingsai Author

Level 1

30 points

Sep 11, 2014 8:55 AM in response to SGIII

SGIII,

Is there a better way to use the script to look for the first word that contains an accented character from the array you presented in your initial response? That would handle ~90% of the use cases. The others would likely be covered by the first non-Chinese word found in the inbound text. Any thoughts on how to achieve these regex patterns using AppleScript? Keep in mind there are one or two edge cases where the pinyin word will be a single character, as in 啊 (ā).
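To make the two-step heuristic concrete, here is a minimal sketch in Python rather than AppleScript. Everything in it is an assumption for illustration: the function name `guess_pinyin`, the addition of ü and ǘ to the letter sets, and the sample entries for 啊 and 吗. It prefers the first Latin-letter word containing a tone mark and otherwise falls back to the first Latin-letter word.

```python
import unicodedata

PLAIN = set("abcdefghijklmnopqrstuvwxyzü")
TONED = set(unicodedata.normalize("NFC", "āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ"))

def guess_pinyin(entry: str) -> str:
    """Guess the pinyin word in a dictionary entry, or return ''."""
    words = unicodedata.normalize("NFC", entry).split()
    # Candidate words contain only pinyin letters: no Chinese, digits, or symbols.
    latin = [w for w in words
             if all(ch in PLAIN or ch in TONED for ch in w.lower())]
    # Step 1: the first candidate that carries a tone mark.
    for w in latin:
        if any(ch in TONED for ch in w.lower()):
            return w
    # Step 2: otherwise the first candidate (unaccented pinyin).
    return latin[0] if latin else ""

print(guess_pinyin("我 wǒ 代 ① I me"))   # accented, multi-letter
print(guess_pinyin("啊 ā ① ah"))          # accented, single letter
print(guess_pinyin("吗 ma 助 ① question particle"))  # unaccented fallback
```

One known weakness: if an unaccented entry later contains an accented cross-reference, step 1 would pick that up instead, so limiting the search to the text before ① may be safer.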


Answer 15

mingsai Author

Level 1

30 points

Sep 11, 2014 8:52 AM in response to SGIII

This is very good but should probably incorporate a space between each pinyin word.
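As a sketch of that change, the Python example below collects the per-character results in a list and joins them with spaces instead of concatenating them directly. The `LOOKUP` table is a hypothetical stand-in for the dictionary workflow, filled with the readings quoted earlier in this thread.

```python
# Hypothetical lookup table standing in for the dictionary workflow.
LOOKUP = {"我": "wǒ", "是": "shì", "美": "měi", "国": "guó", "人": "rén"}

def annotate(text: str) -> str:
    """Return the original text followed by space-separated pinyin."""
    # Characters without an entry (punctuation, etc.) pass through unchanged.
    parts = [LOOKUP.get(ch, ch) for ch in text]
    return text + "\n" + " ".join(parts)

print(annotate("我是美国人"))
```

In the AppleScript version, the same effect could come from appending a space after each `py` inside the loop.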
