How to use automator to extract specific text from json txt file

I'm trying to set up an Automator folder action to extract certain data from json files. I'm pulling metadata from YouTube videos, and I want to extract the Title of the video, the URL for the video, and the date uploaded.


Sample json data excerpts:

"upload_date": "20130319"

"title": "[title of varying length]"

"webpage_url": "https://www.youtube.com/watch?v=[video id]"


Based on this thread, it seems I should be able to have Automator (or any means of using a shell script) find data and extract it into a .txt file, which I can then open as a space-delimited file in Excel or Numbers. That answer assumes a static number of digits for the text to be extracted, though. Is there a way Automator can search through the json file and extract the text - however long - after "title" and "webpage_url"?


json files are all in the same folder, and all end in .info.json.


Any help greatly appreciated!

Posted on Apr 29, 2015 9:32 AM

13 replies

Apr 29, 2015 12:45 PM in response to sjacks26

There are so many ways of doing this, either via AppleScript or a shell script.


The simplest may be a shell script, which can be as simple as:


/usr/bin/awk -F\" '{/title/ {print $4}' /path/to/your.json > /path/to/your/output.txt


This will read the file /path/to/your.json and run it through awk. awk searches for lines that match 'title' and prints out the fourth field, based on a field delimiter of " (the double quote). The output (i.e. the video title) is then written to /path/to/your/output.txt
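As a quick illustration of the field numbering, using a made-up sample line (not from your files):

```shell
# Made-up sample line; with -F\" the fields split at each double quote:
#   $1 = text before the first quote, $2 = title, $3 = ": ", $4 = the value
echo '"title": "My Video"' | awk -F\" '{print $4}'
# prints: My Video
```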


If you want to wrap this into Automator, you need a way of identifying the .json file and inserting it in the appropriate place in that command, or you can eschew Automator completely and run the command in Terminal.app. If you want all .info.json files in a given directory, that's as easy as:


cd /path/to/dir; /usr/bin/awk -F\" '{/title/ {print $4}' *.info.json >> /path/to/your/output.txt


which will process all .info.json files in the /path/to/dir directory in one shot.

Apr 29, 2015 11:25 AM in response to Camelot

Thanks, Camelot.


Using the first example, I get the following error:

/usr/bin/awk -F\" '{/title/ {print $4}' /path/to/my.json > /path/to/my/output.txt

/usr/bin/awk: syntax error at source line 1

context is

{/title/ >>> { <<<

/usr/bin/awk: illegal statement at source line 1

/usr/bin/awk: illegal statement at source line 1

missing }

Not sure if this is important, but the "title" text is:

"title": "[title]"

Does that change the ' and "? Also, the .txt file contains multiple hits for "title," including things like "subtitles" and "playlist_title." I'm guessing I should replace title with "title" to find only instances of title surrounded by quotes, right?


I'm a bit mystified as to what "fourth field" refers to here. I can't seem to figure out what field is the fourth in reference to "title."

Apr 29, 2015 12:51 PM in response to sjacks26

I found a very inelegant way of doing this.


/usr/bin/awk -F\" '{print $8", "$106", "$136", "FILENAME}' /path/to/metadata.json > /path/to/output/Report.csv

After some trial and error, those fields correspond to the "upload_date", "title", and "webpage_url" fields. I added a comma in between so I can open the output as a .csv and have each field print in its own cell. The numbers don't correspond to the proper fields in every .info.json, so I added FILENAME to the print statement so I can go back and manually correct the handful that don't give me the right results automatically. This still leaves me with one problem: if there are commas in the filename field, the filename is split into multiple cells in the .csv. An annoying but simple fix for this is to concatenate the fields using the & formula in Excel.


So the version for multiple .info.json files in one directory is:

cd /path/to/directory; /usr/bin/awk -F\" '{print $8", "$106", "$136", " FILENAME}' *.info.json > /path/to/output/Report.csv


I'm definitely still interested in having awk print the field immediately following a search term, if that's feasible and not overly complicated.

Apr 29, 2015 2:07 PM in response to sjacks26

The following brief Python program reads a valid JSON file from the command line, extracts your specified fields, and prints them as space-delimited fields in a text file. Python expects consistent indentation (conventionally 4 spaces). This was tested on Yosemite 10.10.3 and Python 2.7.6. I did not take the time to add exception handling for invalid JSON data.


JSON Data (test.info.json)


{

"upload_date" : "20130319",

"title" : "[title of varying length]",

"webpage_url" : "https://www.youtube.com/watch?v=[video id]"

}


Python 2.7.6 (Yosemite 10.10.3)

[The Python script was posted as a screenshot and is not reproduced here.]


Usage


test_json.py ~/test.info.json >> test.csv


test.csv


20130319 [title of varying length] https://www.youtube.com/watch?v=[video id]
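Since the script itself was attached as an image, here is a minimal sketch with the same behavior. This is a reconstruction, not the original code; it is written for a modern python3 and assumes the three keys are always present:

```shell
# Reconstruction sketch, not the original script (which was posted as a
# screenshot). Assumes the three keys exist in the file; python3 syntax.
python3 -c 'import json, sys
d = json.load(open(sys.argv[1]))
print(d["upload_date"], d["title"], d["webpage_url"])' test.info.json
```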

Apr 29, 2015 2:40 PM in response to sjacks26

My bad, there was a typo in the command I posted earlier (an extra { ):


awk -F\" '/\"title\"/ {print $4}' my.json


Note that this example also takes care of the 'subtitles', 'playlist_title' and other labels by searching for the surrounding quotes.


Does that get you closer?


As for titles that include commas you should be able to avoid that problem by quoting the output string:


awk -F\" '/\"title\"/ {print "\""$4"\""}' my.json

Apr 30, 2015 7:58 AM in response to sjacks26

Well, having the actual file made all the difference: the file consists of a single stream of text with no line breaks.


My script assumed that the json was formatted with each key/value pair on a separate line. That's why my script gives the result it does... it's looking for any line in the file that contains "title" and prints out the fourth quote-delimited field. Since there's only one line, there's only one match and the fourth field isn't what you expect.


The simple solution here is to tell awk that records are split on commas, not newlines. This is done by assigning the RS variable:


awk -F\" '/\"title\"/ {print "\""$4"\""}' RS=, *.info.json

Apr 30, 2015 4:09 AM in response to sjacks26

Python scripts are run from the Terminal command line. You make them executable like any other script (chmod +x test_json.py). I happen to have my PATH set up to include my home directory and a ~/bin location where I place my executable scripts, so I can run this script as in the provided usage example. If you don't have this PATH configuration, you would run the script as follows from your home directory, substituting the appropriate relative path to your json data file:


./test_json.py ~/test.info.json >> test.csv


If the Python script was in your home directory and you CD into the directory with your data file:


~/test_json.py test.info.json >> test.csv

Apr 30, 2015 4:20 AM in response to sjacks26

Hello


You might try the following perl script, which will process every *.json file in current directory and yield out.csv.


* The CSV currently uses a space for the field separator, as you requested. Note that Numbers.app cannot import such a CSV file correctly.



#!/bin/bash
/usr/bin/perl -CSDA -w <<'EOF' - *.json > out.csv
use strict;
use JSON::Syck;
$JSON::Syck::ImplicitUnicode = 1;

# json node paths to extract
my @paths = ('/upload_date', '/title', '/webpage_url');

for (@ARGV) {
    my $json;
    open(IN, "<", $_) or die "$!";
    {
        local $/;
        $json = <IN>;
    }
    close IN;
    my $data = JSON::Syck::Load($json) or next;
    my @values = map { &json_node_at_path($data, $_) } @paths;
    {
        # output CSV spec
        #   - field separator = SPACE
        #   - record separator = LF
        #   - every field is quoted
        local $, = qq( );
        local $\ = qq(\n);
        print map { s/"/""/og; q(").$_.q("); } @values;
    }
}

sub json_node_at_path ($$) {
    # $ : (reference) json object
    # $ : (string) node path
    #
    # E.g. Given node path = '/abc/0/def', it returns either
    #   $obj->{'abc'}->[0]->{'def'} if $obj->{'abc'} is ARRAY; or
    #   $obj->{'abc'}->{'0'}->{'def'} if $obj->{'abc'} is HASH.
    my ($obj, $path) = @_;
    my $r = $obj;
    for ( map { /(^.+$)/ } split /\//, $path ) {
        if ( /^[0-9]+$/ && ref($r) eq 'ARRAY' ) {
            $r = $r->[$_];
        }
        else {
            $r = $r->{$_};
        }
    }
    return $r;
}
EOF




For an Automator workflow, you may use a Run Shell Script action as follows, which will receive the json files and yield out_YYYY-MM-DD_HHMMSS.csv on the desktop.


Run Shell Script action


- Shell = /bin/bash

- Pass input = as arguments

- Code = as follows



#!/bin/bash
/usr/bin/perl -CSDA -w <<'EOF' - "$@" > ~/Desktop/out_"$(date '+%F_%H%M%S')".csv
use strict;
use JSON::Syck;
$JSON::Syck::ImplicitUnicode = 1;

# json node paths to extract
my @paths = ('/upload_date', '/title', '/webpage_url');

for (@ARGV) {
    my $json;
    open(IN, "<", $_) or die "$!";
    {
        local $/;
        $json = <IN>;
    }
    close IN;
    my $data = JSON::Syck::Load($json) or next;
    my @values = map { &json_node_at_path($data, $_) } @paths;
    {
        # output CSV spec
        #   - field separator = SPACE
        #   - record separator = LF
        #   - every field is quoted
        local $, = qq( );
        local $\ = qq(\n);
        print map { s/"/""/og; q(").$_.q("); } @values;
    }
}

sub json_node_at_path ($$) {
    # $ : (reference) json object
    # $ : (string) node path
    #
    # E.g. Given node path = '/abc/0/def', it returns either
    #   $obj->{'abc'}->[0]->{'def'} if $obj->{'abc'} is ARRAY; or
    #   $obj->{'abc'}->{'0'}->{'def'} if $obj->{'abc'} is HASH.
    my ($obj, $path) = @_;
    my $r = $obj;
    for ( map { /(^.+$)/ } split /\//, $path ) {
        if ( /^[0-9]+$/ && ref($r) eq 'ARRAY' ) {
            $r = $r->[$_];
        }
        else {
            $r = $r->{$_};
        }
    }
    return $r;
}
EOF



Tested under OS X 10.6.8.


Hope this may help,

H

Apr 30, 2015 7:58 AM in response to Camelot

One day I'll learn to give sample data with questions...


This definitely fixes the problems I was having earlier. The only remaining issue is getting the results to print in a way that I can easily convert into a usable database. In the end, I run the following to pull the fields corresponding to my three search terms:

/usr/bin/awk -F\" '/\"title\"|\"upload_date\"|\"webpage_url\"/ {print "\""$4"\""}' RS=, *.info.json > /path/to/my/output.txt

This prints the right field for each search term, with each surrounded by quotes. But it prints each field on a separate line. Is it possible to print the fields corresponding to each search term on the same line, separated by commas?
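One way this could work, as a sketch only (it assumes "webpage_url" appears after the other two keys in every file, which may not hold for all .info.json files):

```shell
# Sketch only: remember each value as its record goes by, then print one
# comma-separated line when the last expected key ("webpage_url") is seen.
# Assumes "webpage_url" appears after the other two keys in every file.
awk -F\" '
  /"upload_date"/ { d = $4 }
  /"title"/       { t = $4 }
  /"webpage_url"/ { u = $4; print "\"" d "\",\"" t "\",\"" u "\"" }
' RS=, *.info.json > /path/to/my/output.csv
```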

Apr 30, 2015 8:01 AM in response to Hiroto

Thanks, Hiroto. The first perl script runs perfectly, once I switch the output to .txt and create a .csv out of it. Printing straight to .csv prints the fields in the same cell.


I wasn't able to get the Automator script to work. It prints an empty output every time. I also tried copying the first script into Automator, but that gives me a syntax error. That said, the first script is simple enough and works fast enough that I'm happy running it each time I get a new batch of .json files.
