How to remove HTML tags from data with SQL

The purpose of this article is to provide a way of cleaning up of HTML tags within the data. When we use various styles or tabular format data in UI using Rich Text Editor/ Rad Grid etc, it will save data in database with HTML tags.

When you save in database this kind of field you have:

An HTML element starts with a start tag (<p>) and ends with end tag (<p/>) and everything between Start tag and End tag is HTML element. e.g.

<b>Following are the popular databases: <br /> </b>1. SQL Server <br /> 2. Oracle <br /> 3.
Teradata <br /> 4. Sybase

We are using SQL Server CHARINDEX function to get the positions of ‘<’ and ‘>’ and once found replacing the string between <….> with blank using STUFF function. We are using WHILE Loop that will run till presence of ‘<’ & ‘>’ in the string. Below is the UDF script that performs HTML Tags clean up from data.

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE FUNCTION [dbo].[usp_ClearHTMLTags]  
    (@String NVARCHAR(MAX))  
    RETURNS NVARCHAR(MAX)  

AS  
    BEGIN  
        DECLARE @Start INT,  
                @End INT,  
                @Length INT  

        WHILE CHARINDEX('<', @String) > 0 AND CHARINDEX('>', @String, CHARINDEX('<', @String)) > 0  
        BEGIN  
            SELECT  @Start  = CHARINDEX('<', @String),   
                    @End    = CHARINDEX('>', @String, CHARINDEX('<', @String))  
            SELECT @Length = (@End - @Start) + 1  

            IF @Length > 0  
            BEGIN  
                SELECT @String = STUFF(@String, @Start, @Length, '')  
             END  
         END  
        RETURN @String  
    END  
GO

By using above UDF, We can clean the HTML tags from the data.

How to strip all HTML tags and entities and get clear text?

I was encouraged to write this Tip/Trick because of so many questions received for this issue.
Suppose you're having a bunch of HTML strings, but you just want to remove all the HTML tags and want a plain text.

You can use Regex to come to the rescue.

The Regex I had developed before was more cumbersome, then Chris made a suggestion, so I will now go further with the regex suggested by Chris that is a "\<[^\>]*\>".

I have tested it for many cases. It detects all types of HTML tags, but there may be loopholes inside so if you find any tags which are not passing through this Regex, then kindly inform me about the same.

Regex Definition

  • Regex :\<[^\>]*\>
    • Literal >
    • Any character that NOT in this class:[\>], any number of repetations
    • Literal >

Visual Basic

''' 
''' Remove HTML from string with Regex
''' 
Function StripTags(ByVal html As String) As String
    ' Remove HTML tags.
    Return Regex.Replace(html, "<.*?>", "")
End Function

C#

/// 
/// Remove HTML from string with Regex
/// 
public static string StripTags(string source)
{
    return Regex.Replace(source, "<.*?>", string.Empty);
}

Happy coding!

Skype tag

<meta name="format-detection" content="telephone=no">

<meta name="SKYPE_TOOLBAR" content="SKYPE_TOOLBAR_PARSER_COMPATIBLE">

Twitter tag

<meta name="twitter:site" content="@erossinits">

<meta name="twitter:creator" content="@erossinits">
<meta name="twitter:card" content="summary">

<meta name="twitter:domain" content="http://www.puresourcecode.com/">

<meta name="twitter:title" content="PureSourceCode">

<meta name="twitter:description" content="Only pure source code for you">

Advertsing

125X125_06

Planet Xamarin

Planet Xamarin

Calendar

<<  September 2017  >>
MonTueWedThuFriSatSun
28293031123
45678910
11121314151617
18192021222324
2526272829301
2345678

View posts in large calendar

Month List