
How to use Microsoft Word's Find and Replace with HTML tags?

Hello to all!

I'm trying to figure out how to use the find and replace function in Word to replace html tags. I'd like to be able to change something like this:

<span class="B01-K-ITAL">random text</span>

To something like this:

<em>random text</em>

I want to replace the open and close tags without changing or interfering with the text between the tags. I'm pretty sure I should use wildcards, but I can't figure out how to use them properly.

Anyone able to lend a hand?

Automator-OTHER, OS X Mountain Lion (10.8.4)

Posted on Jul 17, 2013 8:21 AM

Question marked as Best reply

Posted on Jul 17, 2013 9:51 AM

Are we talking about a .htm or html file?

Word isn't the proper tool for editing HTML. HTML is plain text. Word works with formatted text.


Use TextWrangler. It supports regular expressions.


http://www.barebones.com/products/


TextWrangler is free.


BBEdit is paid, with more features.


Robert

10 replies

Jul 17, 2013 10:08 AM in response to rccharles

Yep; use a decent text editor.


Or a little more advanced, use sed or awk, or maybe a Python or Perl script for somewhat more complex conversion work, or a Ragel-generated compiler or such, depending on the complexity of the input HTML code. Not Microsoft Word, though.


Another potential approach is on the data acquisition and data input end of the processing. If this HTML text is (as various of these discussions seem to involve) scraped HTML text, there may be an API or a different mechanism for obtaining the data. Or you could use a better scraper.


If this HTML is from Word and you're post-processing it, there are better converters around, and there are tools to clean up Word-generated HTML code. (The quality of the HTML export from Word — at least the versions I've looked at — has been poor.)

Jul 17, 2013 12:14 PM in response to srz92

I think you will get more mileage out of calling:

automator > applescript > textwrangler


You could get even more MPG with calling sed from automator.


Disclaimer: sed works on one line at a time. Therefore, your tag cannot span lines. That could be a problem. There may be an advanced version of sed that allows line spanning.


You could use Perl, which does allow line spanning. You don't need to know Perl. Search for "perl command line replace".
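
To illustrate the line-spanning point, here is a minimal Python sketch (Python rather than sed or Perl, purely for illustration). The function names and the sample text are made up for this example. It compares a line-at-a-time replacement, which is how sed behaves by default, with a replacement over the whole text where "." is also allowed to match a newline.

```python
import re

# Literal opening tag, lazy capture of the content, literal closing tag.
PATTERN = r'<span class="B01-K-ITAL">(.*?)</span>'

def replace_per_line(text):
    # Handle each line separately, the way sed does by default.
    line_re = re.compile(PATTERN)
    lines = text.splitlines(keepends=True)
    return "".join(line_re.sub(r"<em>\1</em>", line) for line in lines)

def replace_whole_text(text):
    # Handle the text as one string; DOTALL lets "." match newlines too.
    whole_re = re.compile(PATTERN, re.DOTALL)
    return whole_re.sub(r"<em>\1</em>", text)

sample = (
    '<span class="B01-K-ITAL">one line</span>\n'
    '<span class="B01-K-ITAL">two\n'
    'lines</span>\n'
)

per_line = replace_per_line(sample)   # second span is left untouched
whole = replace_whole_text(sample)    # both spans are converted
```

The per-line version never sees the opening and closing tag of the second span on the same line, so it leaves that span alone.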


Once you figure out how to call sed or Perl, post another question and ask for advice on how to write the regular expression.


Look for

"In the Automator application group in the library (in Leopard it is the Utilities category), there should be an action called "Run Shell Script"."

in https://discussions.apple.com/thread/1220818?start=0&tstart=0


for the find & replace:

http://www.brunolinux.com/02-The_Terminal/Find_and%20Replace_with_Sed.html

http://www.ibm.com/developerworks/linux/library/l-sed2/index.html

Jul 17, 2013 12:37 PM in response to srz92

I'm with rccharles here....


Word (and Automator) are tools for editing Word-format documents. Word documents are complex format documents, and Word itself isn't good at reading and writing text files.


Use a shell script, or a Python, Perl, php or Ruby program, or sed or awk script, or whatever. Invoke that procedure from Automator if you need, but I wouldn't get Microsoft Word mixed in here.


Even with Word over on Windows and automating these tasks (though not on a text file there, either), this task would likely involve VBA and a macro or two.

Jul 18, 2013 3:09 PM in response to srz92

Making some progress with Perl:


Macintosh-HD -> Applications -> Utilities -> Terminal

# press return to run the command.


perl -0660pe 's^<span class="B01-K-ITAL">(.*?)</span>^<em>$1</em>^g' i.html >|o.html


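Note: the span element cannot span lines yet :-( The file is read as a whole, but without the s modifier "." does not match a newline. More work is needed.


The span tag must be in all lower case.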


input for test:



I'm trying to figure out how to use the find and replace function in Word to replace html tags. I'd like to be able to change something like this:


<span class="B01-K-ITAL">random text</span>


To something like this:


<em>random text</em>


I want to replace the open and close tags without changing or interfering with the text between the tags. I'm pretty sure I should use wildcards, but I can't figure out how to use them properly.

I'd like to be able to change something like this:


<span class="B01-K-ITAL">random 2 text</span>


To something like this:


<em>random text</em>


I want to replace the open and close tags without changing or interfering with the text between the tags.

I'd like to be able to change something like this:


<span class="B01-K-ITAL">random 3 
text
</span>


To something like this:


<em>random text</em>


I want to replace the open and close tags without changing or interfering with the text between the tags.



output



I'm trying to figure out how to use the find and replace function in Word to replace html tags. I'd like to be able to change something like this:



<em>random text</em>



To something like this:



<em>random text</em>



I want to replace the open and close tags without changing or interfering with the text between the tags. I'm pretty sure I should use wildcards, but I can't figure out how to use them properly.


I'd like to be able to change something like this:



<em>random 2 text</em>



To something like this:



<em>random text</em>



I want to replace the open and close tags without changing or interfering with the text between the tags.


I'd like to be able to change something like this:



<span class="B01-K-ITAL">random 3

text

</span>



To something like this:



<em>random text</em>



I want to replace the open and close tags without changing or interfering with the text between the tags.

Jul 18, 2013 3:47 PM in response to srz92

I'm all for Perl in most cases, but this seems pretty difficult, even with Perl. Perl can certainly do it, but you have to construct just the right regex to do it. There may be many, many variants in your HTML.
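
One variant that is hard to get right with a regex is a span nested inside the span you want to change: a non-greedy match stops at the first closing tag, which belongs to the inner span. As a rough sketch of a parser-based alternative (not something anyone in this thread has tested on the real files), the following uses Python's standard html.parser module. The class name SpanToEm and the function convert are invented for this example, and it assumes the class attribute is exactly B01-K-ITAL. Note that it writes end tags in lower case.

```python
from html.parser import HTMLParser

class SpanToEm(HTMLParser):
    """Rewrite <span class="B01-K-ITAL"> elements as <em>, leaving other markup alone."""

    def __init__(self, target_class="B01-K-ITAL"):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.target_class = target_class
        self.out = []
        self.stack = []  # one flag per open span: True if it became <em>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            rewrite = dict(attrs).get("class") == self.target_class
            self.stack.append(rewrite)
            if rewrite:
                self.out.append("<em>")
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "span" and self.stack and self.stack.pop():
            self.out.append("</em>")
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

def convert(html):
    parser = SpanToEm()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

nested = '<span class="B01-K-ITAL">a <span class="x">b</span> c</span>'
converted = convert(nested)
```

The stack is what lets the closing tag of the inner span stay a span while the outer one becomes an em.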


I have another approach you might want to try. Take one of your HTML documents, open it in TextEdit, and re-save it as HTML. Eventually, you will want to automate it with the textutil command. TextEdit seems pretty limited with format conversions so you may need to use textutil right away. It may not do the conversions that you want, but it will simplify the formatting and make it consistent.

Jul 20, 2013 9:28 AM in response to rccharles

Here is my latest regular expression with Perl. I think it matches the spirit of the request in the original post.


Note, this isn't as easy as you think. You need to code up the complete set of HTML rules in your implementation. You need to allow for a certain amount of malformed HTML.


perl -0660pe 's^<[sS][pP][aA][nN]\s+class="B01-K-ITAL"\s*>(.*?)</[sS][pP][aA][nN]>^<em>$1</em>^gs' i.html >|o.html
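

For anyone who would rather script this in Python, here is a rough equivalent of the Perl command above. It is only a sketch: the names SPAN_ITALIC, span_to_em and convert_file are invented for this example, and UTF-8 file encoding is an assumption. As in the Perl version, only the tag name is matched case-insensitively, and the captured content may span lines.

```python
import re

# (?i:span) matches the tag name in any letter case; the class value
# must match exactly. DOTALL lets "." match newlines, like Perl's s modifier.
SPAN_ITALIC = re.compile(
    r'<(?i:span)\s+class="B01-K-ITAL"\s*>(.*?)</(?i:span)>',
    re.DOTALL,
)

def span_to_em(html):
    # Keep the captured content, swap only the surrounding tags.
    return SPAN_ITALIC.sub(r"<em>\1</em>", html)

def convert_file(src, dst):
    # Read the whole file at once so a tag may span lines.
    # UTF-8 is an assumption about the input files.
    with open(src, encoding="utf-8") as f:
        html = f.read()
    with open(dst, "w", encoding="utf-8") as f:
        f.write(span_to_em(html))

sample = (
    '<sPan\nclass="B01-K-ITAL"\n>#5\ntext\n</Span> '
    '<span class="B01-K-ITAL"></span>'
)
result = span_to_em(sample)
```

Calling convert_file("i.html", "o.html") would mirror the file names used in the Perl command.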


input text



<html>

<head>...</head>

<body>

I'd like to be able to change something like this:


<span class="B01-K-ITAL">#1 one line</span>


I want to replace the open and close tags without changing or interfering with the text between the tags. I'm pretty sure I should use wildcards, but I can't figure out how to use them properly.


<p>note, this isn't as easy as you think. You need to code up the complete set of html rules in you implementation. You need to allow for a certain amount of mal-formed html.</p>


<span class="B01-K-ITAL">#2 don't be greedy</span>


<span class="B01-K-ITAL">$3

multiline text

</span>


<span

class="B01-K-ITAL">#4

multiline tag. I believe html allow a carriage return in white space of tags

</span>


<span

class="B01-K-ITAL"

>#5

split after the class tag. optional white space

</span>


<sPan class="B01-K-ITAL">#6 mixed case tag</Span>


<p>no text #7</p>

<span class="B01-K-ITAL"></span>


<!-- Apparently, this is valid

http://www.positioniseverything.net/articles/cc-plus.html -->


<!--[if IE]>

<div id="IEroot">

<![endif]-->

<p id="IE">This browser is IE.</p>

<p id="notIE">This browser is not IE.</p>

<!--[if IE]>

</div>

<![endif]-->

</body> </html>


output text




<html>

<head>...</head>

<body>

I'd like to be able to change something like this:


<em>#1 one line</em>


I want to replace the open and close tags without changing or interfering with the text between the tags. I'm pretty sure I should use wildcards, but I can't figure out how to use them properly.


<p>note, this isn't as easy as you think. You need to code up the complete set of html rules in you implementation. You need to allow for a certain amount of mal-formed html.</p>


<em>#2 don't be greedy</em>


<em>$3

multiline text

</em>


<em>#4

multiline tag. I believe html allow a carriage return in white space of tags

</em>


<em>#5

split after the class tag. optional white space

</em>


<em>#6 mixed case tag</em>


<p>no text #7</p>

<em></em>


<!-- Apparently, this is valid

http://www.positioniseverything.net/articles/cc-plus.html -->


<!--[if IE]>

<div id="IEroot">

<![endif]-->

<p id="IE">This browser is IE.</p>

<p id="notIE">This browser is not IE.</p>

<!--[if IE]>

</div>

<![endif]-->

</body> </html>

Jul 21, 2013 5:12 AM in response to srz92

Hello


Not that I know how to do what you're trying to do in Word...


I just thought it might help to point out that defining a style rule for that specific span element, as follows, will have the same visual effect as replacing that span element with an em element.


<style type='text/css'>
    span.B01-K-ITAL { font-style : italic !important; }
</style>


All you need to do is place the above style element in the head element of the target HTML.
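

If there are many files, placing the style element could be scripted. A minimal Python sketch, assuming each file has a closing head tag; the names STYLE and add_italic_style are made up for this example:

```python
import re

STYLE = (
    "<style type='text/css'>\n"
    "    span.B01-K-ITAL { font-style : italic !important; }\n"
    "</style>\n"
)

def add_italic_style(html):
    # Insert the style element just before the first </head>, in any letter case.
    # If there is no </head>, the text is returned unchanged.
    return re.sub(r"(?i)</head>", lambda m: STYLE + m.group(0), html, count=1)

page = "<html><head><title>t</title></head><body></body></html>"
styled = add_italic_style(page)
```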


Regards,

H
