By Administrator Updated: Monday, 20 June 2011 13:55

Extract URLs using Java regular expressions

Friday, 17 June 2011 18:20

In this sample we use Java regular expressions to extract the URLs contained in a piece of text.

Java method to extract URLs

Let's define the regular expression pattern. Each element is shown as it is written in a Java string literal, so every regex backslash is doubled:

Pattern            Description                                                  Reference
(                  Start of group #1 (the whole URL)
  (                Start of group #2 (the protocol)
    https?         http or https                                                Literal
    |
    ftp            ftp protocol                                                 Literal
    |
    gopher         gopher protocol                                              Literal
    |
    telnet         telnet protocol                                              Literal
    |
    file           file protocol                                                Literal
  )                End of group #2
  :                Colon separator                                              Literal
  (                Start of group #3
    (              Start of group #4
      //           Double slash                                                 Literal
    )              End of group #4
    |
    (              Start of group #5
      \\\\         A single backslash                                           Literal
    )              End of group #5
  )+               End of group #3, one or more times
  [                Start of a simple character class                            Character class
    \\w            A word character                                             Predefined character class
    \\d            Any digit                                                    Predefined character class
    :              Colon character                                              Literal
    #@%/;$()~_?    Number sign, at symbol, percent sign, slash, semicolon,      Literal
                   dollar sign, parenthesis, tilde, underscore or question mark
    \\+-=          Plus sign through equal sign; inside a class this is read    Range
                   as a range, so it also matches , - . / : ; < and digits
    \\\\           A backslash                                                  Literal
    \\.            A dot                                                        Literal
    &              An ampersand                                                 Literal
  ]*               End of the character class, zero or more times               Character class
)                  End of group #1
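Capturing groups are numbered by the position of their opening parenthesis, so group 1 holds the whole URL and group 2 holds only the protocol. The following is a minimal sketch of reading group 2, using the same pattern string as the extractUrls method; the class name UrlGroupsDemo and the method protocols are hypothetical and not part of the original sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlGroupsDemo {
    // Same pattern string as the article's extractUrls method.
    static final Pattern URL = Pattern.compile(
        "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the protocol (capturing group 2) of every URL found in text.
    static List<String> protocols(String text) {
        List<String> result = new ArrayList<String>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [http, file]
        System.out.println(protocols("http://www.ubiteck.com/test file://simpleFileUrl.txt"));
    }
}
```

Reading m.group(1) instead would return the complete URL, the same value as m.group().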

Java regex to extract multiple URLs


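import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Assumed to be the Apache Commons Lang 2.x class; the article does not name its origin.
import org.apache.commons.lang.NullArgumentException;

// Returns every URL found in value, in order of appearance.
private List<String> extractUrls(String value) {
    if (value == null) throw new NullArgumentException("urls to extract");
    List<String> result = new ArrayList<String>();
    String urlPattern = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(value);
    while (m.find()) {
        // group() is the whole match, same as value.substring(m.start(0), m.end(0))
        result.add(m.group());
    }
    return result;
}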
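One detail of the character class in this pattern is easy to miss: inside square brackets, the fragment \\+-= is read as a range running from the plus sign to the equal sign, not as three separate characters. The sketch below isolates that fragment and checks a few characters against it; RangeCheckDemo and accepts are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

final class RangeCheckDemo {
    // Only the "\\+-=" fragment of the character class, isolated for testing.
    static final Pattern FRAGMENT = Pattern.compile("[\\+-=]");

    // True when the single character c is accepted by the fragment.
    static boolean accepts(char c) {
        return FRAGMENT.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        // '+', '-', '=', ',' and '<' lie inside the range; '>' lies just after it.
        for (char c : new char[] {'+', '-', '=', ',', '<', '>'}) {
            System.out.println("'" + c + "' -> " + accepts(c));
        }
    }
}
```

If only the three characters are wanted, a common way to avoid the range is to put the hyphen last in the class, for example [+=-].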

Extracting the URLs using our pattern

Suppose the method is given the following content:

http://www.ubiteck.com/test/mypage.jsf?param1=ok file://simpleFileUrl.txt file:\\backslashUrl.txt

The following sample code runs the method on that content (each backslash is doubled in the Java string literal):


String content = "http://www.ubiteck.com/test/mypage.jsf?param1=ok file://simpleFileUrl.txt file:\\\\backslashUrl.txt";
List<String> result = extractUrls(content);
for (String url : result) {
    System.out.println("url :" + url);
}

Regex URL extraction result

url :http://www.ubiteck.com/test/mypage.jsf?param1=ok
url :file://simpleFileUrl.txt
url :file:\\backslashUrl.txt
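In running prose a URL is often followed by a full stop or a comma, and because the character class accepts both, they end up inside the match. The sketch below shows one possible clean-up step, under the assumption that a URL should never end in '.', ',', ';' or ':'; the helper stripTrailingPunctuation and the class TrailingPunctuationDemo are hypothetical and not part of the original method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TrailingPunctuationDemo {
    // Same pattern string as the article's extractUrls method.
    static final Pattern URL = Pattern.compile(
        "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: drops '.', ',', ';' and ':' from the end of a match.
    static String stripTrailingPunctuation(String url) {
        return url.replaceAll("[.,;:]+$", "");
    }

    // Returns the first URL in text after clean-up, or null when none is found.
    static String firstUrl(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? stripTrailingPunctuation(m.group()) : null;
    }

    public static void main(String[] args) {
        // prints http://www.ubiteck.com/test/mypage.jsf
        System.out.println(firstUrl("See http://www.ubiteck.com/test/mypage.jsf."));
    }
}
```

Whether this trimming is appropriate depends on the input: a URL that legitimately ends in one of these characters would be shortened as well.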

Comments

0 #1 Manikandan 2012-01-28 13:28

Excellent. The regular expression almost covers all the thing.
