Extract URLs using Java regular expressions

Friday, 17 June 2011 18:20


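In this sample we use Java regular expressions to extract URLs from a block of text.

Java method to extract URLs

Let's define the regular expression pattern. It is written here as it appears inside a Java string literal, so every backslash is doubled: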

((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)

Pattern        Description                                                   Reference
(              Start of group #1 (the whole URL)
(              Start of group #2 (the protocol)
https?         http or https                                                 Literal
|              or
ftp            ftp protocol                                                  Literal
|              or
gopher         gopher protocol                                               Literal
|              or
telnet         telnet protocol                                               Literal
|              or
file           file protocol                                                 Literal
)              End of group #2
:              Colon separator                                               Literal
(              Start of group #3
(              Start of group #4
//             Double slash                                                  Literal
)              End of group #4
|              or
(              Start of group #5
\\\\           One backslash (the regex \\, doubled for the Java string)     Literal
)              End of group #5
)+             End of group #3, repeated one or more times
[              Start of a simple character class                             Character class
\\w            A word character                                              Predefined character classes
\\d            Any digit                                                     Predefined character classes
:              Colon character                                               Literal
#@%/;$()~_?    Number sign, at symbol, percent sign, slash, semicolon,       Literal
               dollar sign, parentheses, tilde, underscore or question mark
\\+-=          Range from plus sign to equal sign, which covers              Character range
               + , - . / 0-9 : ; < =
\\\\           A backslash                                                   Literal
\\.            A dot                                                         Literal
&              An ampersand                                                  Literal
]*             End of the character class, repeated zero or more times       Character class
)              End of group #1
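
To make the group numbering concrete, here is a minimal sketch that reads group 2 (the protocol) and group 1 (the whole URL) from each match. The class name UrlGroupsDemo, the method describe and the sample text are assumptions made for illustration; only the pattern comes from this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: only the pattern itself is taken from the article.
class UrlGroupsDemo {
    static final Pattern URL = Pattern.compile(
        "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)",
        Pattern.CASE_INSENSITIVE);

    // Builds one "protocol -> url" line per match, from group 2 and group 1.
    static List<String> describe(String text) {
        List<String> out = new ArrayList<String>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            out.add(m.group(2) + " -> " + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints:
        // ftp -> ftp://host/a.txt
        // https -> https://example.org/x
        for (String line : describe("ftp://host/a.txt and https://example.org/x")) {
            System.out.println(line);
        }
    }
}
```

Groups are numbered by the position of their opening parenthesis, counted from left to right, which is why the protocol alternation is group 2 even though it sits inside group 1.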

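Java regex to extract multiple URLs


import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// NullArgumentException comes from Apache Commons Lang 2.x
import org.apache.commons.lang.NullArgumentException;

private List<String> extractUrls(String value) {
    if (value == null) throw new NullArgumentException("urls to extract");
    List<String> result = new ArrayList<String>();
    String urlPattern = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(value);
    // Collect the full text of every match (group 0)
    while (m.find()) {
        result.add(m.group());
    }
    return result;
}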

Extracting the URLs using our pattern

If you execute our method on the following content (shown as raw text; the last URL contains two backslashes):

http://www.ubiteck.com/test/mypage.jsf?param1=ok file://simpleFileUrl.txt file:\\backslashUrl.txt

using the following sample code, where each backslash is doubled inside the Java string literal:


String content = "http://www.ubiteck.com/test/mypage.jsf?param1=ok file://simpleFileUrl.txt file:\\\\backslashUrl.txt";
List<String> result = extractUrls(content);
for (String url : result) {
    System.out.println("url :" + url);
}

Regex URL extraction result

url :http://www.ubiteck.com/test/mypage.jsf?param1=ok
url :file://simpleFileUrl.txt
url :file:\\backslashUrl.txt
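
The sample content separates URLs with spaces only. In running prose the pattern is more permissive: inside the character class, `\\+-=` is read as a range from the plus sign to the equal sign, and that range includes the comma. The sketch below shows the effect. The class UrlTrailingDemo, its methods and the sample sentence are hypothetical, and the clean-up step is an assumption about what a caller might want, not part of the original method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: only the pattern itself is taken from the article.
class UrlTrailingDemo {
    static final Pattern URL = Pattern.compile(
        "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the first match in the text, or null when there is none.
    static String first(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group() : null;
    }

    // Assumed clean-up step: strips trailing '.', ',', ';' and ')' characters.
    static String trimTrailing(String url) {
        return url.replaceAll("[.,;)]+$", "");
    }

    public static void main(String[] args) {
        String raw = first("See http://example.com/a, then");
        System.out.println(raw);               // prints http://example.com/a,
        System.out.println(trimTrailing(raw)); // prints http://example.com/a
    }
}
```

Whether to trim depends on the input: a URL that legitimately ends with one of these characters would be shortened by this step.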
