Question Re: Merging Spreadsheets

Question

Question Re: Merging Spreadsheets

I have a huge spreadsheet with scientific names of animals, along with some other information. Below is an an example of some data from this table:

N | TAXON | PARENT | RANK

1 | Animalia | (NULL) | 5

2 | Chordata | Animalia | 15

3 | Vertebrata | Chordata | 17

4 | Mammalia | Vertebrata | 25

Many, but not all, animals have common names. So I've created another spreadsheet that lists those common names. It looks something like this:

N | TAXON | NAME_COMMON

1 | Animalia | animal
2 | Chordata | chordate

3 | Vertebrata | vertebrate

4 | Mammalia | mammal

The two spreadsheets share the field Taxon in common. If there are 60,000 rows in the first spreadsheet there are perhaps half that many in the second.

I would like to merge these two spreadsheets into a single spreadsheet, but it's going to take a helluva lot of cutting and pasting. Is there a way to automate the procedure?

Thanks.

MacBook Pro

Posted on Aug 27, 2015 5:35 PM

Reply

Answer 1

Aug 27, 2015 8:25 PM in response to David Blomstrom

Are you wanting to change the Taxon name to the NAME_COMMON in you new table?

Reply

Answer 2

Aug 27, 2015 9:25 PM in response to charles.christian14

No, Taxon is the scientific name, Name_Common the common name. All I want to do is add the column Name_Column to the first spreadsheet. So when I'm finished, the first spreadsheet will look exactly as it did before, except there will be new column that features common names for some of the taxons.

Reply

Answer 3

Aug 27, 2015 9:36 PM in response to David Blomstrom

Yes, then we can solve that problem easily. In Numbers a great tool to use is the Index formula with a Match formula embedded. The first thing is that you need to make the column on your first sheet, where ever you would like it to be. In the first cell of this column, put the formula

=INDEX($NAME_COMMON,MATCH($B2,Table 2::$Taxon,0))

These are the tables that correspond to the formula in my example.

N	Taxon	Parent	Rank	Common Name
1	Taxon 1	Parent 1	Rank 1	common name 1
2	Taxon 2	Parent 2	Rank 2	common name 2
3	Taxon 3	Parent 3	Rank 3	common name 3
4	Taxon 4	Parent 4	Rank 4	common name 4

N	Taxon	NAME_COMMON
1	Taxon 1	common name 1
2	Taxon 2	common name 2
3	Taxon 3	common name 3
4	Taxon 4	common name 4

Once you have it working properly, incase the formula in an IFerror formula to get rid of all the error messages. It will look something similar to this

IFERROR(INDEX($NAME_COMMON,MATCH($B2,Table 2::$Taxon,0)), "")

Reply

Answer 4

Aug 28, 2015 5:21 AM in response to charles.christian14

Thanks, I'll give that a try when I get home from work. One caveat - the two spreadsheets share the column TAXON, not Name_Common. So I assume I should change your formula to this:

=INDEX($Taxon,MATCH($B2,Table 2::$Taxon,0))

Another question: I have a dozen spreadsheets that have columns named Taxon, some of which also have a column named Name_Common. How does the formula know which two spreadsheets to perform the operation on? Do I simply have to open the two spreadsheets, and it automatically selects them?

Reply

Answer 5

SGIII

Level 8

35,252 points

Aug 28, 2015 9:01 AM in response to David Blomstrom

David Blomstrom wrote:

Another question: I have a dozen spreadsheets that have columns named Taxon, some of which also have a column named Name_Common. How does the formula know which two spreadsheets to perform the operation on?

Hi David,

When you say "a dozen spreadsheets" to you mean "a dozen tables" in one document?

In Numbers (unlike Excel) a cell is always in a table. It is never directly on a sheet.

Numbers can tell which column you mean by the table name. In practice when entering formulas must people probably select the specific table column while in the Formula Editor and Numbers will insert the appropriate reference. That's easier than typing it out.

So if you want to match on Taxon you would select column B of the bottom table in Charles's example when constructing the formula in the Formula Editor. As you can see in his formula, Numbers refers to this column as Table 2::Taxon (assuming you have 'Use header names as labels' checked in Numbers > Preferences; I usually leave that unchecked, in which case the reverence will be Table 2::$B).

SG

P.S. Numbers may become sluggish with 60,000 rows. Have you considered using a database app for this... or Excel?

Reply

Answer 6

Aug 28, 2015 6:24 PM in response to SGIII

In Finder, if I navigate to a folder (e.g. MySites > Database > GeoZoo, I can find icons for all my spreadsheets there. In this case, the two spreadsheets I want to merge are named gz_life_mammals and gz_names_mammals.

I should have clarified that my DATABASE TABLE has 60,000 rows or so, representing all vertebrates. But I have a separate spreadsheet for each vertebrate class - mammals, birds, etc.

I use spreadsheets to manipulate and store data, which I then save as a CSV file so I can import it into my database table.

Reply

Answer 7

SGIII

Level 8

35,252 points

Aug 28, 2015 6:59 PM in response to David Blomstrom

David,

So by "spreadsheets" I gather you mean Numbers "documents."

What do you mean by DATABASE TABLE? Are you using a database app?

SG

Reply

Answer 8

Aug 28, 2015 7:23 PM in response to SGIII

Yes, "spreadsheets" = Numbers documents.

I'm not using a database app at the moment. I'm just trying to merge two of my Numbers documents (e.g. "spreadsheets"). After I've accomplished that, I'm going to save the merged document as a CSV file and import it into a database table.

Reply

Answer 9

SGIII

Level 8

35,252 points

Aug 28, 2015 9:43 PM in response to David Blomstrom

So you want to merge information from tables in two separate Numbers documents.... That's possible but an AppleScript will probably be necessary.

What is the name of each document? What is the name of the sheet on which each table is located? What is the name of each table? Is the structure of each table? Is it exactly as in your first post?

SG

Reply

Answer 10

Aug 29, 2015 6:23 PM in response to SGIII

The documents are named Mammals and Names_Mammals. They're in different folders, but I can put them in the same folder. Both are on "Sheet1," and when I click on "Sheet1," both say "Table 1."

Mammals has five columns - a numerical key, Taxon, Parent, Rank and Key.

Mammals_Names has a numerical key, Taxon, Slug, NameCommon, Plural, Rank.

Thanks.

Reply

Answer 11

Hiroto

Level 5

7,461 points

Aug 30, 2015 2:27 PM in response to David Blomstrom

Hello

It would be much simpler and faster to process csv text files than spreadsheet tables.

You may export your master table to master.csv and names table to names.csv and try the following ruby script to consolidate master.csv and names.csv and yield out.csv in desktop. Currently, output csv's field separator is comma and record separator is linefeed.

E.g.,

master.csv:

N,TAXON,PARENT,RANK 1,Animalia,(NULL),5 2,Chordata,Animalia,15 3,Vertebrata,Chordata,17 10,Mammalia,Vertebrata,25

names.csv:

N,TAXON,NAME_COMMON 1,Animalia,animal 2,Chordata,chordate 4,Mammalia,mammal

out.csv:

N,TAXON,PARENT,RANK,NAME_COMMON 1,Animalia,(NULL),5,animal 2,Chordata,Animalia,15,chordate 3,Vertebrata,Chordata,17, 10,Mammalia,Vertebrata,25,mammal

#!/bin/bash cd ~/desktop || exit /usr/bin/ruby -w <<'EOF' - master.csv names.csv > out.csv # # ARGV[0] = master csv # ARGV[1] = names csv # # * input and output csv fields e.g. # master csv fields = N, TAXON, ... # names csv fields = N, TAXON, NAME_COMMON # output csv fields = N, TAXON, ..., NAME_COMMON # * field TAXON is the consolidation key. # * Output csv records are sorted by field N in numerically ascending order. # * internal hash table is created such that # { pp[i][k] => [ pp[i] << qq[j][n] : qq[j][k] == pp[i][k] or '' ] : for i } # where # pp = 2d-array of master csv # qq = 2d-array of names csv # k = consolidation key column index (0-based) of master and names csv (k = 1 : TAXON) # n = name column index (0-based) of names csv (n = 2 : NAME_COMMON) # i = body row index (0-based) of master csv # j = body row index (0-based) of names csv # a[x][y] denotes data at coorditates [x][y] of array a # def csv2array(t, opts = {}) # string t : csv text # hash opts : {:fs => fs, :rs => rs} # string fs : field separator # string rs : record separator # * fs and rs may be string or compiled pattern # return array : 2d-array representing data in csv # quasi constants _QQ = %["] # U+0022 QUOTATION MARK _FS = %[\x1c] # U+001C INFORMATION SEPARATOR FOUR (FS) _RS = %[\x1e] # U+001E INFORMATION SEPARATOR TWO (RS) _US = %[\x1f] # U+001F INFORMATION SEPARATOR ONE (US) # create unique escape sequences _FS_, _RS_, _US_ = _FS, _RS, _US _FS_ += _FS while t.include?(_FS_) _RS_ += _RS while t.include?(_RS_) _US_ += _US while t.include?(_US_) # accept fs, rs fs, rs = {:fs => %[,], :rs => %[\n]}.merge(opts).values_at(:fs, :rs) # split text by _QQ and get arrays for odd & even elements - aa & bb (bb is quoted elements) p = 0 aa, bb = t.split(_QQ).partition {|x| ( p += 1) % 2 == 1} # change null string in aa except at beginning and end to _QQ, which corresponds to changing "" to " in quoted text (1 .. aa.length - 2).each {|i| aa[i] = _QQ if aa[i] == ''} # escape fs by _FS_ and rs by _RS_ in aa aa = aa.join(_US_).gsub(fs, _FS_).gsub(rs, _RS_).split(_US_) # rebuild escaped text and convert it to 2d array by using _FS_ and _RS_ s = aa.zip(bb).join('') return s.split(_RS_).map {|x| x.split(_FS_)} end def array2csv(aa, opts = {}) # array aa : 2d array # hash opts : {:fs => fs, :rs => rs} # string fs : field separator # string rs : record separator # return string : csv text representing data in 2d array # * field is quoted if it contains fs, rs or quote character fs, rs = {:fs => %[,], :rs => %[\n]}.merge(opts).values_at(:fs, :rs) q = %["] aa.map do |a| a.map do |s| s =~ /#{fs}|#{rs}|#{q}/o ? (q + s.gsub(/#{q}/o, q + q) + q) : s end.join(fs) end.join(rs) + rs end key_index = 1 # consolidation key column index (0-based) of master and names csv files name_index = 2 # name column index (0-based) of names csv file sort_index = 0 # sort column index (0-based) of output csv file csv_fs = %[,] # field separator in input and output csv's csv_rs = %[\n] # record separator in output csv hh = {} # internal hash table # create 2d-arrays from input csv's aa = [] ARGV[0,2].each do |f| File.open(f) { |f| aa << csv2array(f.read, :fs => csv_fs, :rs => /\015\012|\015|\012/)} end pp, qq = aa # process header pp1 = pp.shift qq1 = qq.shift rr1 = pp1 << qq1[name_index] # process body pp.each { |e| hh[e[key_index]] = e << '' } qq.each { |e| hh[e[key_index]][-1] = e[name_index] } rr = [] hh.keys.sort_by { |k| hh[k][sort_index].to_i }.each { |k| rr << hh[k] } # build and print output csv print array2csv(rr.unshift(rr1), :fs => csv_fs, :rs => csv_rs) EOF

If you're not familiar with shell script in Terminal.app, you may run it in Run Shell Script action in Automator workflow as follows.

1) Run AppleScript action

- Code = as follows

set in1 to (choose file of type {"csv", "txt"} with prompt "Choose master csv file")'s POSIX path set in2 to (choose file of type {"csv", "txt"} with prompt "Choose names csv file")'s POSIX path set out to (choose file name default name "out.csv" with prompt "Specify output csv file name and location")'s POSIX path return {in1, in2, out}

2) Run Shell Script action

- Shell = /bin/bash

- Pass Input = as arguments

- Code = as follows

#!/bin/bash /usr/bin/ruby -w <<'EOF' - "$1" "$2" > "$3" # # ARGV[0] = master csv # ARGV[1] = names csv # # * input and output csv fields e.g. # master csv fields = N, TAXON, ... # names csv fields = N, TAXON, NAME_COMMON # output csv fields = N, TAXON, ..., NAME_COMMON # * field TAXON is the consolidation key. # * Output csv records are sorted by field N in numerically ascending order. # * internal hash table is created such that # { pp[i][k] => [ pp[i] << qq[j][n] : qq[j][k] == pp[i][k] or '' ] : for i } # where # pp = 2d-array of master csv # qq = 2d-array of names csv # k = consolidation key column index (0-based) of master and names csv (k = 1 : TAXON) # n = name column index (0-based) of names csv (n = 2 : NAME_COMMON) # i = body row index (0-based) of master csv # j = body row index (0-based) of names csv # a[x][y] denotes data at coorditates [x][y] of array a # def csv2array(t, opts = {}) # string t : csv text # hash opts : {:fs => fs, :rs => rs} # string fs : field separator # string rs : record separator # * fs and rs may be string or compiled pattern # return array : 2d-array representing data in csv # quasi constants _QQ = %["] # U+0022 QUOTATION MARK _FS = %[\x1c] # U+001C INFORMATION SEPARATOR FOUR (FS) _RS = %[\x1e] # U+001E INFORMATION SEPARATOR TWO (RS) _US = %[\x1f] # U+001F INFORMATION SEPARATOR ONE (US) # create unique escape sequences _FS_, _RS_, _US_ = _FS, _RS, _US _FS_ += _FS while t.include?(_FS_) _RS_ += _RS while t.include?(_RS_) _US_ += _US while t.include?(_US_) # accept fs, rs fs, rs = {:fs => %[,], :rs => %[\n]}.merge(opts).values_at(:fs, :rs) # split text by _QQ and get arrays for odd & even elements - aa & bb (bb is quoted elements) p = 0 aa, bb = t.split(_QQ).partition {|x| ( p += 1) % 2 == 1} # change null string in aa except at beginning and end to _QQ, which corresponds to changing "" to " in quoted text (1 .. aa.length - 2).each {|i| aa[i] = _QQ if aa[i] == ''} # escape fs by _FS_ and rs by _RS_ in aa aa = aa.join(_US_).gsub(fs, _FS_).gsub(rs, _RS_).split(_US_) # rebuild escaped text and convert it to 2d array by using _FS_ and _RS_ s = aa.zip(bb).join('') return s.split(_RS_).map {|x| x.split(_FS_)} end def array2csv(aa, opts = {}) # array aa : 2d array # hash opts : {:fs => fs, :rs => rs} # string fs : field separator # string rs : record separator # return string : csv text representing data in 2d array # * field is quoted if it contains fs, rs or quote character fs, rs = {:fs => %[,], :rs => %[\n]}.merge(opts).values_at(:fs, :rs) q = %["] aa.map do |a| a.map do |s| s =~ /#{fs}|#{rs}|#{q}/o ? (q + s.gsub(/#{q}/o, q + q) + q) : s end.join(fs) end.join(rs) + rs end key_index = 1 # consolidation key column index (0-based) of master and names csv files name_index = 2 # name column index (0-based) of names csv file sort_index = 0 # sort column index (0-based) of output csv file csv_fs = %[,] # field separator in input and output csv's csv_rs = %[\n] # record separator in output csv hh = {} # internal hash table # create 2d-arrays from input csv's aa = [] ARGV[0,2].each do |f| File.open(f) { |f| aa << csv2array(f.read, :fs => csv_fs, :rs => /\015\012|\015|\012/)} end pp, qq = aa # process header pp1 = pp.shift qq1 = qq.shift rr1 = pp1 << qq1[name_index] # process body pp.each { |e| hh[e[key_index]] = e << '' } qq.each { |e| hh[e[key_index]][-1] = e[name_index] } rr = [] hh.keys.sort_by { |k| hh[k][sort_index].to_i }.each { |k| rr << hh[k] } # build and print output csv print array2csv(rr.unshift(rr1), :fs => csv_fs, :rs => csv_rs) EOF

Resulting workflow will look something like this:

When run, it will let you choose master.csv, names.csv and output csv's name and location and then yield the output as specified.

Briefly tested under OS X 10.6.8.

Good luck,

H

Reply

Answer 12

Aug 30, 2015 11:42 PM in response to Hiroto

Wow, thanks for going to all that trouble!

Reply