Extract URLs using Java regular expressions

Friday, 17 June 2011 18:20

In this sample we use Java regular expressions to extract URLs from a block of text.

Java method to extract urls

Let's define the regular expression pattern. Patterns are written as they appear in the Java string literal, so every regex backslash is doubled.

Pattern        Description                                        Reference
(              Start of group #1 (the whole URL)
(              Start of group #2 (the protocol)
https?         http or https                                      Literal
ftp            ftp protocol                                       Literal
gopher         gopher protocol                                    Literal
telnet         telnet protocol                                    Literal
file           file protocol                                      Literal
)              End of group #2
:              Colon separator                                    Literal
(              Start of group #3 (the slashes after the colon)
(              Start of group #4
//             Double slash                                       Literal
)              End of group #4
|              Or
(              Start of group #5
\\\\           One backslash                                      Literal
)              End of group #5
)              End of group #3
+              One or more times
[              Start of a simple character class                  Character class
\\w            A word character                                   Predefined character classes
\\d            Any digit                                          Predefined character classes
:              Colon character                                    Literal
#@%/;$()~_?    Number sign, at symbol, percent sign, slash, semicolon, dollar sign, parentheses, tilde, underscore or question mark    Literal
\\+-=          Plus sign through equal sign (a range)             Literal
\\\\           A backslash                                        Literal
\\.&           A dot or an ampersand                              Literal
]              End of a simple character class                    Character class
*              Zero or more times
)              End of group #1

Inside the character class, \\+-= is read as a range from the plus sign to the equal sign, so besides the plus, minus and equal signs it also matches the comma, the dot, the slash, the digits, the colon, the semicolon and the less-than sign.
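Java numbers capturing groups by the position of their opening parenthesis, with group 0 standing for the whole match. The following is a minimal sketch that prints every group for one sample URL; the class name GroupDemo and the helper method groups are hypothetical names chosen for this illustration, not part of the original sample.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Same pattern as in the article, written as a Java string literal.
    static final Pattern URL_PATTERN = Pattern.compile(
            "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)",
            Pattern.CASE_INSENSITIVE);

    // Returns groups 0..groupCount() of the first match, or an empty array if nothing matches.
    public static String[] groups(String text) {
        Matcher m = URL_PATTERN.matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i); // null when the group did not take part in the match
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("see file://simpleFileUrl.txt");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group " + i + " = " + g[i]);
        }
    }
}
```

For this input, group 2 holds the protocol (file), groups 3 and 4 hold the double slash, and group 5 is null because the backslash alternative was not used.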

Java regex extract multiple urls

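import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// NullArgumentException is assumed to be org.apache.commons.lang.NullArgumentException (Commons Lang 2.x)
private List<String> extractUrls(String value) {
    if (value == null) throw new NullArgumentException("urls to extract");
    List<String> result = new ArrayList<String>();
    String urlPattern = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(value);
    while (m.find()) {
        result.add(m.group()); // text of the current match
    }
    return result;
}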

Extracting the urls using our Pattern

Suppose we run our method on the following content (backslashes are shown escaped, as in Java source): file://simpleFileUrl.txt file:\\\\backslashUrl.txt

The following sample code executes our method:

String content = " file://simpleFileUrl.txt file:\\\\backslashUrl.txt";
List<String> result = extractUrls(content);
for (String domain : result) {
    System.out.println("url :" + domain);
}

regex urls extraction result

url :
url :file://simpleFileUrl.txt
url :file:\\backslashUrl.txt
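
To reproduce this run without any external library, the method and the sample can be combined into one small program. This is only a sketch under a few assumptions: the class name UrlExtractorDemo is hypothetical, the method is made static, and the standard IllegalArgumentException stands in for NullArgumentException so that the code compiles with the JDK alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractorDemo {
    // Same pattern as in the article, compiled once and reused.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extractUrls(String value) {
        // Assumption: IllegalArgumentException replaces the article's NullArgumentException.
        if (value == null) {
            throw new IllegalArgumentException("urls to extract must not be null");
        }
        List<String> result = new ArrayList<String>();
        Matcher m = URL_PATTERN.matcher(value);
        while (m.find()) {
            result.add(m.group()); // text of the current match
        }
        return result;
    }

    public static void main(String[] args) {
        String content = " file://simpleFileUrl.txt file:\\\\backslashUrl.txt";
        for (String url : extractUrls(content)) {
            System.out.println("url :" + url);
        }
    }
}
```

With the content shown, the list contains the two file URLs, the second one with its two leading backslashes preserved.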
#1 Manikandan 2012-01-28 13:28
Excellent. The regular expression almost covers all the thing.

