Extract URLs using Java regular expressions

Friday, 17 June 2011 18:20

In this sample we use Java regular expressions to extract URLs from a piece of text.

Java method to extract urls

Let's define the regular expression pattern, written here as a Java string literal (so every backslash is doubled):

((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)
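Because the pattern is shown in its Java string form, it can help to see what the regex engine actually receives once the compiler has processed the escapes. The following is a minimal sketch; the class name PatternEscaping and the constant URL_PATTERN are illustrative names that do not appear in the original sample.

```java
import java.util.regex.Pattern;

public class PatternEscaping {
    // The pattern exactly as it is typed in Java source code.
    static final String URL_PATTERN =
        "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";

    public static void main(String[] args) {
        // Printing the string shows the regex after Java has halved the backslashes.
        System.out.println(URL_PATTERN);
        // Compiling it confirms that the result is a valid regular expression.
        Pattern.compile(URL_PATTERN, Pattern.CASE_INSENSITIVE);
    }
}
```

Running it prints ((https?|ftp|gopher|telnet|file):((//)|(\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*), which is the form to use when reading the pattern piece by piece.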

Pattern        Description                                                  Reference
(              Start of group #1
(              Start of group #2
https?         http or https                                                Literal
|              or
ftp            ftp protocol                                                 Literal
|              or
gopher         gopher protocol                                              Literal
|              or
telnet         telnet protocol                                              Literal
|              or
file           file protocol                                                Literal
)              End of group #2
:              Colon separator                                              Literal
(              Start of group #3
(              Start of group #4
//             Double slash                                                 Literal
)              End of group #4
|              or
(              Start of group #5
\\\\           One backslash (four characters in Java source, \\ in the     Literal
               regex, one backslash in the text)
)              End of group #5
)+             End of group #3, repeated one or more times
[              Start of a simple character class                            Character class
\\w            A word character                                             Predefined character classes
\\d            Any digit (already covered by \\w)                           Predefined character classes
:              Colon character                                              Literal
#@%/;$()~_?    Number sign, at symbol, percent sign, slash, semicolon,      Literal
               dollar sign, parentheses, tilde, underscore or question mark
\\+-=          Plus sign, minus sign or equal sign. Inside a character      Literal
               class this is read as the range from + to =, so it also
               matches the comma and the < character.
\\\\           A backslash                                                  Literal
\\.            A dot                                                        Literal
&              An ampersand                                                 Literal
]*             End of the character class, repeated zero or more times      Character class
)              End of group #1
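Groups are numbered by counting opening parentheses from the left, so group #1 is the whole URL and group #2 is the protocol. As a rough sketch of how those groups can be read back after a match (the class name UrlGroups, the helper firstScheme and the example.org address are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlGroups {
    // Same pattern as above, compiled once and reused.
    static final Pattern URL = Pattern.compile(
        "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the protocol (group #2) of the first URL found, or null if there is none.
    static String firstScheme(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        Matcher m = URL.matcher("see ftp://example.org/readme.txt");
        if (m.find()) {
            System.out.println("group 1: " + m.group(1)); // whole URL
            System.out.println("group 2: " + m.group(2)); // protocol
        }
    }
}
```

For the sample text above, group 1 is ftp://example.org/readme.txt and group 2 is ftp.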

Java regex extract multiple urls

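import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Assumed to be the Apache Commons Lang 2.x class
import org.apache.commons.lang.NullArgumentException;

private List<String> extractUrls(String value) {
    if (value == null) throw new NullArgumentException("urls to extract");
    List<String> result = new ArrayList<String>();
    String urlPattern = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    // Case-insensitive, so HTTP:// and Http:// are matched as well
    Pattern p = Pattern.compile(urlPattern, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(value);
    // Collect every match found in the input
    while (m.find()) {
        result.add(m.group());
    }
    return result;
}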

Extracting the urls using our Pattern

Suppose we run our method on the following content (shown as it is typed in a Java string, with backslashes doubled):

http://www.ubiteck.com/test/mypage.jsf?param1=ok file://simpleFileUrl.txt file:\\\\backslashUrl.txt

The following sample code calls our method and prints each URL it finds:

String content = "http://www.ubiteck.com/test/mypage.jsf?param1=ok file://simpleFileUrl.txt file:\\\\backslashUrl.txt";
List<String> result = extractUrls(content);
for (String url : result) {
    System.out.println("url :" + url);
}

regex urls extraction result

url :http://www.ubiteck.com/test/mypage.jsf?param1=ok
url :file://simpleFileUrl.txt
url :file:\\backslashUrl.txt
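One thing to watch for: inside the character class, \\+-= is interpreted as a range from + to =, which includes the comma. A URL followed by a comma in running text is therefore extracted with the comma attached. The sketch below compares the original pattern with a hypothetical variant in which the hyphen is moved to the end of the class so that it is literal; the ADJUSTED pattern, the class name RangeCaveat and the example.com address are assumptions for illustration, not part of the original sample.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RangeCaveat {
    // Pattern from the article, unchanged.
    static final Pattern ORIGINAL = Pattern.compile(
        "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical variant: the hyphen sits last in the class, so no range is formed.
    static final Pattern ADJUSTED = Pattern.compile(
        "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w:#@%/;$()~_?+=\\\\.&-]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the first match of the given pattern in the text, or null if there is none.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "Visit http://example.com, then log in.";
        System.out.println(firstMatch(ORIGINAL, text)); // keeps the trailing comma
        System.out.println(firstMatch(ADJUSTED, text)); // stops before the comma
    }
}
```

Whether the comma should be accepted depends on the URLs you expect; the variant simply makes the set of allowed characters explicit.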

Comments

0 #1 Manikandan 2012-01-28 13:28
Excellent. The regular expression almost covers all the thing.