Reading the webmagic HTML parsing package

I have been reading the source code of webmagic, a web crawler hosted on GitHub, and want to share some of the design patterns I found in it.

1. Overview & Prerequisites

I read the us.codecraft.webmagic.selector package, which parses HTML and extracts URLs from it. It builds on Jsoup, an open source HTML parsing library, and on Xsoup, a related project built on top of Jsoup.

1.1 Jsoup intro

Jsoup offers two ways to extract data from HTML: DOM navigation and selector syntax. The demo below is easy to follow and should give you a general sense of what Jsoup can do. HtmlCleaner is another HTML parser, but its performance is not as good as Jsoup's.

import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupTest {
    public static void main(String[] args) {
        m1();
        m2();
        m3();
        m4();
    }
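    // DOM navigation: look up an element by id, then collect its <a> descendants.
    public static void m1() {
        File input = new File("/Users/menuz/Desktop/baidu.html");
        Document doc;
        try {
            doc = Jsoup.parse(input, "UTF-8", "http://www.baidu.com/");
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to parse if the file could not be read
        }
        Element content = doc.getElementById("body");
        if (content == null) {
            return; // the page has no element with id="body"
        }
        Elements links = content.getElementsByTag("a");
        for (Element link : links) {
            String linkHref = link.attr("href");
            String linkText = link.text();
            System.out.println(linkText + " -> " + linkHref);
        }
    }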
    // Selector syntax: a CSS-style query over a local file.
    public static void m2() {
        File input = new File("/Users/menuz/Desktop/baidu.html");
        Document doc;
        try {
            doc = Jsoup.parse(input, "UTF-8", "http://www.baidu.com/");
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to query if the file could not be read
        }
        Elements links = doc.select("a[href]"); // <a> elements that have an href
        for (Element link : links) {
            String href = link.absUrl("href"); // resolved against the base URI
            System.out.println(href);
        }
        System.out.println("links size = " + links.size());
    }
    public static void m3() {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a").first();
        String text = doc.body().text(); // "An example link."
        String linkHref = link.attr("href"); // "http://example.com/"
        String linkText = link.text(); // "example"
        String linkOuterH = link.outerHtml();
            // "<a href="http://example.com/"><b>example</b></a>"
        String linkInnerH = link.html(); // "<b>example</b>"
    }
    // Selector syntax on a page fetched over HTTP.
    public static void m4() {
        Document doc;
        try {
            doc = Jsoup.connect("http://www.baidu.com").get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // the request failed, so there is no document to query
        }
        Elements links = doc.select("a[href]"); // <a> elements that have an href
        for (Element link : links) {
            String href = link.absUrl("href");
            System.out.println(href);
        }
        System.out.println("links size = " + links.size());
        Element link = doc.select("a").first();
        if (link != null) {
            // Raw attribute value as written in the page; it may be relative.
            String relHref = link.attr("href");
            // Absolute URL, e.g. "http://www.baidu.com/gaoji/preferences.html"
            String absHref = link.attr("abs:href");
        }
    }
}

2. Simple Factory Pattern

2.1 Selector and ElementSelector are two interfaces. The classes that implement Selector do the actual HTML parsing.

2.2 Selectors is a factory for Selector instances. It creates the different implementations: RegexSelector, XpathSelector, CssSelector and XsoupSelector. RegexSelector processes the URLs extracted by Jsoup. XpathSelector uses HtmlCleaner to process the whole HTML document. CssSelector processes a single Element extracted by Jsoup, such as an a element. XsoupSelector works like CssSelector but additionally uses Xsoup, which is built on Jsoup. Jsoup is therefore the most important dependency.
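To make the factory idea concrete, here is a minimal sketch of the same shape using only the JDK. It is not webmagic's code: TextSelector, SketchSelectors, RegexTextSelector and BetweenTextSelector are hypothetical names invented for this illustration, and the "between" selector is a stand-in for the Jsoup-backed selectors so that the example runs without extra dependencies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demo entry point. Every type in this file is hypothetical, not webmagic's API.
class SelectorFactoryDemo {
    public static void main(String[] args) {
        String html = "<a href='/a.html'>A</a> <a href='/b.html'>B</a>";
        // Callers only see the TextSelector interface, never a concrete class.
        TextSelector hrefs = SketchSelectors.regex("href='([^']*)'");
        TextSelector anchorText = SketchSelectors.between(">", "</a>");
        System.out.println(hrefs.selectList(html));
        System.out.println(anchorText.select(html));
    }
}

// The common interface (plays the role of Selector in the text above).
interface TextSelector {
    String select(String text);            // first match, or null if none
    List<String> selectList(String text);  // all matches, possibly empty
}

// The simple factory: static methods that hide which class is instantiated.
final class SketchSelectors {
    private SketchSelectors() {}

    static TextSelector regex(String pattern) {
        return new RegexTextSelector(pattern);
    }

    static TextSelector between(String start, String end) {
        return new BetweenTextSelector(start, end);
    }
}

// Returns capture group 1 if the pattern has one, otherwise the whole match.
final class RegexTextSelector implements TextSelector {
    private final Pattern pattern;

    RegexTextSelector(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    public String select(String text) {
        List<String> all = selectList(text);
        return all.isEmpty() ? null : all.get(0);
    }

    public List<String> selectList(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.groupCount() >= 1 ? m.group(1) : m.group());
        }
        return out;
    }
}

// Returns the text found between each start marker and the next end marker.
final class BetweenTextSelector implements TextSelector {
    private final String start;
    private final String end;

    BetweenTextSelector(String start, String end) {
        this.start = start;
        this.end = end;
    }

    public String select(String text) {
        List<String> all = selectList(text);
        return all.isEmpty() ? null : all.get(0);
    }

    public List<String> selectList(String text) {
        List<String> out = new ArrayList<>();
        int from = 0;
        while (true) {
            int s = text.indexOf(start, from);
            if (s < 0) break;
            int contentStart = s + start.length();
            int e = text.indexOf(end, contentStart);
            if (e < 0) break;
            out.add(text.substring(contentStart, e));
            from = e + end.length();
        }
        return out;
    }
}
```

The point of the design is that adding another selector type only touches the factory and the new class; code that calls SketchSelectors keeps working against the interface.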

2.3 Selectable is the main interface exposed to developers, and it provides many methods: regex(".ahref.") for a regex pattern, $("a[href]") for a CSS pattern, and xpath("//@href") for an XPath pattern. The classes PlainText and Html implement Selectable. Html has a document property that stores the whole parsed page; it uses Selectors to create the specific Selector that processes this document and produces the result we want.
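The chaining style that Selectable enables can be sketched as follows. This is a simplified illustration under stated assumptions, not webmagic's implementation: SketchSelectable and SketchPlainText are hypothetical names, only a regex step is modeled, and the matching is written inline instead of being delegated to a selector obtained from a factory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demo entry point. Every type in this file is hypothetical, not webmagic's API.
class SelectableChainDemo {
    public static void main(String[] args) {
        SketchSelectable page = new SketchPlainText(
                "<a href='http://example.com/a'>A</a><a href='http://example.com/b'>B</a>");
        // Each step returns a new SketchSelectable, so steps can be chained.
        List<String> paths = page.regex("href='([^']*)'")
                                 .regex("http://example.com(/.*)")
                                 .all();
        System.out.println(paths);
    }
}

// Plays the role of Selectable: a result set that can be narrowed further.
interface SketchSelectable {
    SketchSelectable regex(String pattern);
    String get();        // first result, or null if there is none
    List<String> all();  // every result
}

// Holds a list of strings and applies each selection step to all of them.
final class SketchPlainText implements SketchSelectable {
    private final List<String> values;

    SketchPlainText(String text) {
        this(Collections.singletonList(text));
    }

    SketchPlainText(List<String> values) {
        this.values = new ArrayList<>(values);
    }

    public SketchSelectable regex(String pattern) {
        Pattern p = Pattern.compile(pattern);
        List<String> out = new ArrayList<>();
        for (String value : values) {
            Matcher m = p.matcher(value);
            while (m.find()) {
                // Capture group 1 if the pattern has one, else the whole match.
                out.add(m.groupCount() >= 1 ? m.group(1) : m.group());
            }
        }
        return new SketchPlainText(out);
    }

    public String get() {
        return values.isEmpty() ? null : values.get(0);
    }

    public List<String> all() {
        return Collections.unmodifiableList(values);
    }
}
```

Because every step returns the interface type rather than a concrete class, a caller can mix steps freely and stop with get() or all() whenever the result is narrow enough.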

3. Conclusion

3.1 Program to an interface: use interfaces wherever possible.

3.2 When a class has many properties to set, having each setter return the object itself makes it easier to use.

3.3 Try the simple factory pattern.
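Point 3.2 is the one not yet illustrated by code in this post, so here is a small sketch of setters that return the object itself. SketchSite and its three properties are hypothetical and exist only for this example; they are not taken from webmagic.

```java
// Demo entry point. SketchSite is a hypothetical class made up for this example.
class FluentSetterDemo {
    public static void main(String[] args) {
        // Because each setter returns the same object, configuration reads as one chain.
        SketchSite site = SketchSite.create()
                .setDomain("example.com")
                .setRetryTimes(3)
                .setSleepMillis(1000);
        System.out.println(site);
    }
}

final class SketchSite {
    private String domain = "";
    private int retryTimes = 0;
    private int sleepMillis = 0;

    private SketchSite() {}

    // Static creation method, so callers never call the constructor directly.
    static SketchSite create() {
        return new SketchSite();
    }

    SketchSite setDomain(String domain) {
        this.domain = domain;
        return this; // returning this is what allows chaining
    }

    SketchSite setRetryTimes(int retryTimes) {
        this.retryTimes = retryTimes;
        return this;
    }

    SketchSite setSleepMillis(int sleepMillis) {
        this.sleepMillis = sleepMillis;
        return this;
    }

    String getDomain() { return domain; }
    int getRetryTimes() { return retryTimes; }
    int getSleepMillis() { return sleepMillis; }

    @Override
    public String toString() {
        return "SketchSite{domain=" + domain
                + ", retryTimes=" + retryTimes
                + ", sleepMillis=" + sleepMillis + "}";
    }
}
```

Without the return this lines, the same configuration would take one statement per property and a named local variable.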

Life is a process of getting to know yourself.