Text Filtering in Automator

Question

Level 1

0 points

Hi there,

I wonder if you could help me. I'm trying to filter a very large text file for lines containing any one of four letters. The parts of the file I want to filter are lines that are 100 characters long, with no spaces, and contain one of four letters (C, T, A and/or G). I would ideally like to extract these 100-character lines of text, so that I can save them into a new document.

I would be so grateful if someone could guide me on how to get Automator to do this for me.

Thank you so much in advance!

Sharklel48

15" Unibody MBP, iMac, iBook, iPhone 4, Mac OS X (10.6.5)

Posted on Dec 7, 2010 5:16 PM

Answer 1

Barney-15E

Level 10

122,300 points

Dec 7, 2010 7:47 PM in response to Sharklel48

It sounds like something for grep and regular expressions. While I might be able to figure it out eventually, you might get a better response on the Unix forum (http://discussions.apple.com/forum.jspa?forumID=735) or the AppleScript forum (http://discussions.apple.com/forum.jspa?forumID=724).

Answer 2

Sharklel48 Author

Level 1

0 points

Dec 8, 2010 4:51 PM in response to Barney-15E

Wow, thank you so much! I didn't expect anyone to reply - thank you. And thanks for the links - I will be scouring those forums for some answers. Thank you!

Answer 3

mns579

Level 4

1,170 points

Dec 10, 2010 1:04 PM in response to Sharklel48

TextWrangler ought to work--if I'm correctly understanding the document format.

Open the source document in TextWrangler, then

Text > Process Lines Containing . . .
--Find lines containing C|T|A|G
--check "Use grep" and "Copy to new document"
--click Process

The new document contains only those lines containing any of the characters C, T, A, or G.

If you need to restrict the selection further, solely to lines of precisely 100 characters--that is, not lines of 0-99 or 101+ characters--that's going to take more filtering and a clearer description of the document format.

http://www.barebones.com/products/textwrangler/
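For the stricter case, here is a minimal shell sketch (an illustration, not something run on the poster's file). It assumes each read sits on a line of its own, is exactly 100 characters long, and contains only A, C, G and T. The function name `extract_reads` and the file names are placeholders.

```shell
# Keep only lines that are exactly 100 characters long and consist
# solely of the letters A, C, G and T.
# Reads standard input, writes matching lines to standard output.
extract_reads() {
    grep -E '^[ACGT]{100}$'
}

# Hypothetical usage (file names are placeholders):
#   extract_reads < data.fastq > reads.txt
```

Reads that contain any other symbol (for instance N for an uncalled base) would be skipped by this pattern, as would lines that end in a Windows-style carriage return.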

Answer 4

Sharklel48 Author

Level 1

0 points

Dec 10, 2010 1:44 PM in response to mns579

Thank you so much, mns579. I've never used TextWrangler before, but it looks like it might definitely help me get the solution to my problem.

I'll try and explain more clearly what it is that I'm trying to do. Basically, I'm a biologist, and I have some DNA presented in 100-base-pair-long (100-character-long) reads. The problem is that I've got buckets of this data, and it's presented so that each piece of DNA sits in its own paragraph, with loads of annoying metadata tagged to the front of it. Below is an example of the format:

@SRR062634.321 HWI-EAS110_103327062:6:1:1446:951/2
TGATCATTTGATTAATACTGACATGTAGACAAGAAGAAAAGTATGTTTCATGCTATTTTGAGTAACTTCCATTTAGAAGCCTACTCCTGAGCACAACATT
+
B5=BD5DAD?:CBDD-DDDDDCDDB -B:;?A?CCE?;D3A?B?DB??;DDDEEABD>DAC?A-CD-=D?C5A@::AC-?AB?=:>CA@##########
@SRR062634.488 HWI-EAS110_103327062:6:1:1503:935/2
AATGTTATTAAAAATGGACACCTTTTTCTCACACATTCAGTTTCATTGTCTCGCACCCCATCGTTTTACTTTTCTTCCTTCAGAAAATGATAAATGTGGG
+
AAAA?5D?BD==ADBD:DBDDDDD5D=;@>AD-CD?D=C5=@4<7CCAA5?=?>5@BC? <:=>>:D:B5?B?5?'3::5?5<:;97:<A#########
@SRR062634.849 HWI-EAS110_103327062:6:1:1587:921/2
CAGATCAGAATAATTTTTGTGTTATGTACGTGTAAGAAAACATAGCTATTATGATATGGAAACTAGGAGTGAAATATGAGGAATTTGTGACTTTTCTGAA

Here there are 3 chunks of DNA, which is all I'm interested in for the time being. I don't want the jargon around it. So I'm basically looking for a way to extract that 100-character-long string of Cs, Ts, As and Gs at the end of each paragraph, and have it all processed into a separate text document (ideally with a line's gap between each 100 characters).

I'm sorry if my explanation is a bit confusing, but I can't thank you enough for your help. I'll look more into TextWrangler and see if I can decipher a way to do it. But if you have any obvious suggestions that spring to mind, I'd be very grateful for your help!

Thanks a million, in advance!

Sharklel48

Answer 5

mns579

Level 4

1,170 points

Dec 10, 2010 4:49 PM in response to Sharklel48

Yes, the question did look like a bio problem! It might be easier to delete everything that's not a DNA sequence.

In TextWrangler, it would be three find/replace commands:

1. To delete lines that end in #
2. To delete lines that end in /2
3. To delete the remaining + signs that are on lines of their own

In TW, it's Search > Find, then in the dialogue box (with both Grep and Wrap around checked):

1. Find: .*#$
Replace:
2. Find: .*/2$
Replace:
3. Find: \+
Replace:

In each case, the replace field is left blank.

A test on the above data results in:

TGATCATTTGATTAATACTGACATGTAGACAAGAAGAAAAGTATGTTTCATGCTATTTTGAGTAACTTCCATTTAGAAGCCTACTCCTGAGCACAACATT

AATGTTATTAAAAATGGACACCTTTTTCTCACACATTCAGTTTCATTGTCTCGCACCCCATCGTTTTACTTTTCTTCCTTCAGAAAATGATAAATGTGGG

CAGATCAGAATAATTTTTGTGTTATGTACGTGTAAGAAAACATAGCTATTATGATATGGAAACTAGGAGTGAAATATGAGGAATTTGTGACTTTTCTGAA

If that's too much space between the remaining strings, you can use TW's find/replace to change three returns to one with

Find: \r\r\r
Replace: \r
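For anyone who prefers the command line, the three deletions can be sketched with sed. This is an approximation rather than an exact equivalent of the TextWrangler steps: it deletes the matching lines outright instead of leaving empty lines behind, and it only removes a + that is alone on its line. It also assumes, as in the posted sample, that every quality line ends in # and every header line ends in /2. The function name `strip_metadata` is made up for the example.

```shell
# Delete lines ending in "#", lines ending in "/2", and lines that
# consist of a single "+". Whatever remains is printed unchanged.
strip_metadata() {
    sed -e '/#$/d' -e '/\/2$/d' -e '/^+$/d'
}

# Hypothetical usage (file names are placeholders):
#   strip_metadata < data.fastq > reads.txt
```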

Answer 6

red_menace

Level 6

17,066 points

Dec 10, 2010 5:09 PM in response to Sharklel48

Are your examples really as you posted them, or are they each one paragraph (separated by a return)? If the metadata and base pair string are all on one line, you could also just use a script action to strip everything but the last 100 characters of a line.
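As a rough sketch of that idea in shell (it only applies if each record really is a single line ending in the 100-character read; `last_100` is a made-up name):

```shell
# Print only the last 100 characters of every input line.
# Lines of 100 characters or fewer are printed unchanged.
last_100() {
    awk '{ n = length($0); print (n > 100 ? substr($0, n - 99) : $0) }'
}

# Hypothetical usage (file names are placeholders):
#   last_100 < one_record_per_line.txt > reads.txt
```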

Answer 7

mns579

Level 4

1,170 points

Dec 11, 2010 11:57 AM in response to mns579

A bit of Googling suggests we're looking at a fragment of a version (?) of this 77M dataset:

http://galaxy.rgenetics.org/datasets/1562201ac260fe38/display?to_ext=fastqsanger

It looks like a fairly simple matter to massage it into shape: retain the relevant sequences (every fourth line, starting with line 2), delete the rest, and put a blank line between base-pair sequences.

1. Find: EAS
Replace:
{Delete a specific nonrelevant, document-wide string with an A in it}

2. Find: [^ACTG\+] {That should be left squarebracket caret ACTG\+ right squarebracket, but the board is garbling it}
Replace:
{Delete every remaining character except A C T G +}

3. Find: \+
Replace: \r\r
{Replace the + delimiter with two returns}

Step 2 might take a while on over a million lines. You can speed it up by checking "Case sensitive" (along with "Grep") in the find/replace dialogue box.

Grep is your friend.
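The "every fourth line starting with 2" observation can also be used directly, without any pattern matching. A minimal sketch, assuming every record is exactly four lines (header, sequence, +, quality) with no blank lines in between; `sequence_lines` is a made-up name:

```shell
# Print line 2, 6, 10, ... of the input (the sequence line of each
# four-line record), each followed by one blank line.
sequence_lines() {
    awk 'NR % 4 == 2 { print; print "" }'
}

# Hypothetical usage (file names are placeholders):
#   sequence_lines < data.fastq > reads.txt
```

Because this relies only on line position, it breaks if a record is missing a line or a sequence is wrapped across several lines.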

Answer 8

Sharklel48 Author

Level 1

0 points

Dec 13, 2010 2:59 AM in response to mns579

Thank you so much, mns579.

That looks great - I'm running these as I speak. I will report back once they're finished. It looks like it should work, so thank you for your help!

And you probably are picking up on the same dataset as me, although I have it from a different source. Either way, the dataset is enormous, and formatted in the same way, so the method you designed should work on both, hopefully.

Thank you, once again.

Answer 9

Sharklel48 Author

Level 1

0 points

Dec 13, 2010 3:31 AM in response to mns579

Hi again, mns579,

I tried this, but I don't think it's worked quite the way we expected. If I could find a way to select the 100 characters that precede every '+' in the document, that would do the trick.
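One way that selection could be sketched in shell, assuming the + always sits alone on its own line directly after the read (as in the sample posted earlier); the function name `before_plus` is made up:

```shell
# Remember each line as it goes by; whenever a line is exactly "+",
# print the last 100 characters of the line that came before it.
# Preceding lines shorter than 100 characters are ignored.
before_plus() {
    awk '$0 == "+" && length(prev) >= 100 { print substr(prev, length(prev) - 99) }
         { prev = $0 }'
}

# Hypothetical usage (file names are placeholders):
#   before_plus < data.fastq > reads.txt
```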

Do you have any ideas? I'm a novice when it comes to programming, but I'm trying to learn Perl at the moment, in hope that I can write a quick program that would do this for me.

Thanks again for your help.

And @red_menace - that's exactly how it looks. I'm not sure how to write a script that enables me to do this, but if you have any ideas, I'd be most grateful to hear them.

Thanks, guys!

Answer 10

red_menace

Level 6

17,066 points

Dec 13, 2010 6:53 AM in response to Sharklel48

Paragraphs are separated by returns, so if each one of those was a paragraph you could just get the last 100 characters. There are several lines in each entry though, so another approach is needed. I would avoid using Automator for files this large, since it does a lot of caching and swapping around behind the scenes that really slows things down.

Give the following AppleScript a try. It uses a shell script to get lines that contain "C", "T", "G", and "\+" characters (the metadata lines can contain an "A", so I just skipped it), and then replaces the "\+" with a return. The resulting text is written to a file named "results.txt" on your desktop. I tested it with the file referenced by mns579, and it seems to work OK (the file was a bit large to count the results).

set input to POSIX path of (choose file)
set output to POSIX path of (((path to desktop folder) as text) & "results.txt")

do shell script "grep '[CTG+]' " & quoted form of input & " | sed 's/+/" & return & "/g' > " & quoted form of output

Answer 11

mns579

Level 4

1,170 points

Dec 13, 2010 7:57 AM in response to red_menace

As usual, red_menace offers an elegant solution. And it looks right to me: 1,235,385 lines go in, with the final line a blank; 308,846 ((1,235,385-1)/4) come out. Numerically at least, the script picks up one-fourth of the lines, which is what the pattern suggests should be the desired result.
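The same sanity check can be sketched in shell. It assumes four lines per record and ignores blank lines such as the trailing one; `count_records` is a made-up name.

```shell
# Count the non-blank lines in a file and print that count divided
# by four, i.e. the number of complete four-line records.
count_records() {
    awk 'NF { n++ } END { print int(n / 4) }' "$1"
}

# The arithmetic from the figures quoted above:
echo $(( (1235385 - 1) / 4 ))   # prints 308846

# Hypothetical usage (file name is a placeholder):
#   count_records data.fastq
```

Comparing that number with `wc -l` on the results file gives a quick indication of whether any reads were dropped or any extra lines slipped through.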

Answer 12

red_menace

Level 6

17,066 points

Dec 13, 2010 10:03 PM in response to red_menace

Another way that also works with the posted text is to exclude all of the lines that have various metadata/mask characters, for example:

do shell script "grep -v '[!@#$%&?/.:]' " & quoted form of input & " | sed 's/+/" & return & "/g' > " & quoted form of output

My greppage isn't all that great, so I am sure there are other methods.

Answer 13

Sharklel48 Author

Level 1

0 points

Dec 14, 2010 2:47 AM in response to red_menace

Thank you so much, red_menace. Could I possibly ask a huge favour? Is there any way you could spell that out a bit for me? I'm a bit new to Terminal, and it's freaking me out a little bit, so I'm still unsure how to execute the script in your answer. Sorry for looking daft, but that would really be appreciated! And thank you so much for helping me out and running it yourself to check - very kind of you!

Thanks!

Answer 14

Sharklel48 Author

Level 1

0 points

Dec 14, 2010 3:24 AM in response to red_menace

OK, not sure if I've done this right, but I've just copied your script as below, and put the data file, and a 'Results' folder on the desktop. I did 'cd Desktop' command first, then just ran 'osascript Test.applescript'. I don't get a results.txt file, though:

set input to POSIX path of ("data.filt.fastq")
set output to POSIX path of ((("Results") as text) & "results.txt")

do shell script "grep '[CTG+]' " & quoted form of input & " | sed 's/+/" & return & "/g' > " & quoted form of output

I'm guessing I've done something horrifically wrong? Sorry for such a silly question, and thanks so much for all your help!

Answer 15

red_menace

Level 6

17,066 points

Dec 14, 2010 8:04 AM in response to Sharklel48

Using the POSIX path of your file name alters it by adding a leading slash, so your path isn't being set correctly (the original use is to convert from a Finder path). Scripts run from the Terminal can't have user interaction (e.g. the choose file dialog), so to use it there you can cheat a bit by asking another application to do that part:

tell application "System Events" to set input to POSIX path of (choose file)

You can also use the commands directly, dragging the various files to the Terminal window (this pastes in the path to the file), or use a here-document.
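Typed directly into the Terminal, the pipeline might look roughly like the sketch below. It is a variation on the script rather than a copy: the lone + lines are emptied instead of being replaced with a return character, and the wrapper function `filter_reads` and the file paths are placeholders.

```shell
# $1 = input file, $2 = output file.
# Drop every line containing one of the listed metadata/quality
# characters, then turn each line that is just "+" into an empty line.
filter_reads() {
    grep -v '[!@#$%&?/.:]' "$1" | sed 's/^+$//' > "$2"
}

# Hypothetical usage (drag the files into the Terminal window to
# paste their real paths):
#   filter_reads ~/Desktop/data.filt.fastq ~/Desktop/results.txt
```

A quality line that happens to contain none of the listed characters would pass through this filter, so it is worth spot-checking the output.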

I didn't think you were familiar with the Terminal though, which is why I used an AppleScript. You can just paste the following script into the AppleScript Editor application and run it from there, or save it as an application. The script can also be used in an Automator workflow (for example as a Service), although as I mentioned earlier Automator chokes pretty badly on large files if you were wanting to use the results in another action. Post back with more details about your workflow to see about wrapping it around the shell script.

set input to POSIX path of (choose file)
set output to POSIX path of (((path to desktop folder) as text) & "results.txt")

do shell script "grep -v '[!@#$%&?/.:]' " & quoted form of input & " | sed 's/+/" & return & "/g' > " & quoted form of output

-- do shell script "grep -E '(([ACTG]{6,})|(^\\+$))' " & quoted form of input & " | sed 's/+/" & return & "/g' > " & quoted form of output

In the above script I included a couple of different do shell scripts in case something slips through (just use one of them at a time, though). The first one uses an option that reverses the grep matching (it selects lines that do not match) and looks for lines that contain various characters that are only in the metadata lines. It then pipes the results to sed, which converts the lines that are "+" (this character was not included in the previous search) into an extra return.

The second do shell script (commented) is another one that I was playing with. It differs in that grep looks for lines that contain 6 or more of any of the base pair characters, and lines that consist of only a single "+" character.

I've tested both of these on the example text you posted earlier, and the large file that mns579 found - both methods appear to work OK (I didn't go through the entire results for the large document).
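For reference, the second pattern can be tried on its own in the Terminal. This is only a sketch of the grep stage (the sed step is left out), and the function name `bases_or_plus` is made up:

```shell
# Keep lines that contain a run of six or more A/C/T/G characters,
# and lines that consist of a single "+".
bases_or_plus() {
    grep -E '[ACTG]{6,}|^\+$'
}

# Hypothetical usage (file names are placeholders):
#   bases_or_plus < data.fastq > reads_and_separators.txt
```

Since quality strings can also include the letters A, C, T and G, a quality line with six of them in a row would be kept too; anchoring the pattern to whole lines of a known length is stricter if that turns out to matter.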
