Reading the webmagic web page parsing package

I have been reading the source code of webmagic, a web crawler hosted on GitHub, and found some design patterns worth extracting and sharing.

1. Overview & Prerequisites

I have read the us.codecraft.webmagic.selector package, which parses HTML and extracts URLs from it. It relies on Jsoup, an open-source HTML parsing project, and on Xsoup, an XPath selector project built on top of Jsoup.

1.1 Jsoup intro

Jsoup offers two ways of extracting data from HTML: DOM navigation and selector syntax. The demo below is easy to follow and should give a general idea of what Jsoup can do. HtmlCleaner is another HTML parser, but its performance is not as good as Jsoup's.

import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupTest {
    public static void main(String[] args) {
        m1();
        m2();
        m3();
        m4();
    }
    public static void m1() {
        File input = new File("/Users/menuz/Desktop/baidu.html");
        Document doc =  null;
        try {
            doc = Jsoup.parse(input, "UTF-8", "http://www.baidu.com/");
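        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to parse if the file could not be read
        }
        // DOM navigation: look up an element by id, then its descendant <a> tags
        Element content = doc.getElementById("body");
        if (content == null) {
            return; // the page has no element with id="body"
        }
        Elements links = content.getElementsByTag("a");
        for (Element link : links) {
            String linkHref = link.attr("href");
            String linkText = link.text();
            System.out.println(linkText + " -> " + linkHref);
        }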
    }
    public static void m2() {
        File input = new File("/Users/menuz/Desktop/baidu.html");
        Document doc =  null;
        try {
            doc = Jsoup.parse(input, "UTF-8", "http://www.baidu.com/");
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to parse if the file could not be read
        }
        Elements links = doc.select("a[href]"); // a with href
        for(Element link : links) {
            String href =link.absUrl("href");
            System.out.println(href);
        }
        System.out.println("links size = " + links.size());
    }
    public static void m3() {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a").first();
        String text = doc.body().text(); // "An example link."
        String linkHref = link.attr("href"); // "http://example.com/"
        String linkText = link.text(); // "example"
        String linkOuterH = link.outerHtml();
            // "<a href="http://example.com/"><b>example</b></a>"
        String linkInnerH = link.html(); // "<b>example</b>"
    }
    public static void m4() {
        Document doc = null;
        try {
            doc = Jsoup.connect("http://www.baidu.com").get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to parse if the page could not be fetched
        }
        Elements links = doc.select("a[href]"); // a with href
        for(Element link : links) {
            String href =link.absUrl("href");
            System.out.println(href);
        }
        System.out.println("links size = " + links.size());
        Element link = doc.select("a").first();
        String relHref = link.attr("href"); // the attribute value exactly as written in the page (may be relative)
        String absHref = link.attr("abs:href"); // "http://www.baidu.com/gaoji/preferences.html"
    }
}

2. SimpleFactory Pattern

2.1 Selector and ElementSelector are two interfaces. The classes that implement Selector do the actual HTML parsing.

2.2 Selectors is in fact a selector factory that creates the different Selector implementations, such as RegexSelector, XpathSelector, CssSelector and XsoupSelector. RegexSelector handles the URLs extracted by Jsoup. XpathSelector uses HtmlCleaner to process the whole HTML document. CssSelector handles an Element extracted by Jsoup, such as an <a> element. XsoupSelector works like CssSelector but also uses Xsoup, which builds on Jsoup. Jsoup is therefore the most important dependency.

2.3 Selectable is the main interface exposed to developers, and it provides many methods: regex patterns via regex(".ahref."), CSS patterns via $("a[href]"), and XPath patterns via xpath("//@href"). The classes PlainText and Html implement Selectable. Html holds a document property that stores the whole HTML page, and it uses Selectors to create the specific Selector that processes that document and produces the result we want.
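
To make the factory idea in 2.2 and 2.3 concrete, here is a minimal sketch in plain Java. It is not webmagic's actual code: the Selector interface is reduced to a single method, and PrefixSelector, the factory method names and SimpleFactoryDemo are hypothetical stand-ins chosen so that the example runs with the JDK alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleFactoryDemo {

    /** Simplified stand-in for a selector: extract one value from text, or null. */
    interface Selector {
        String select(String text);
    }

    /** Returns the first capture group (or the whole match) of the first regex match. */
    static class RegexSelector implements Selector {
        private final Pattern pattern;

        RegexSelector(String regex) {
            this.pattern = Pattern.compile(regex);
        }

        @Override
        public String select(String text) {
            Matcher m = pattern.matcher(text);
            if (!m.find()) {
                return null;
            }
            return m.groupCount() >= 1 ? m.group(1) : m.group();
        }
    }

    /** Hypothetical second implementation: returns the text after a fixed prefix. */
    static class PrefixSelector implements Selector {
        private final String prefix;

        PrefixSelector(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public String select(String text) {
            return text.startsWith(prefix) ? text.substring(prefix.length()) : null;
        }
    }

    /** The simple factory: callers ask for a Selector and never name a concrete class. */
    static final class Selectors {
        private Selectors() {
        }

        static Selector regex(String expr) {
            return new RegexSelector(expr);
        }

        static Selector prefix(String prefix) {
            return new PrefixSelector(prefix);
        }
    }

    public static void main(String[] args) {
        Selector href = Selectors.regex("href=\"([^\"]+)\"");
        System.out.println(href.select("<a href=\"http://example.com/\">x</a>")); // http://example.com/

        Selector id = Selectors.prefix("id:");
        System.out.println(id.select("id:42")); // 42
    }
}
```

The calling code depends only on Selector and Selectors, so a new implementation can be added by extending the factory without touching the callers.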

3. Conclusion

3.1 Program to interfaces: use interfaces wherever possible.

3.2 When a class has many properties to set, having each setter return the object itself makes it easier to use.

3.3 Try the simple factory pattern.
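
Point 3.2 can be illustrated with a small sketch. CrawlConfig and its three properties are hypothetical names invented for this example, not a class from webmagic; the point is only that each setter returns this, so the calls can be chained.

```java
public class FluentConfigDemo {

    /** Hypothetical settings object; each setter returns this so calls can be chained. */
    static class CrawlConfig {
        private String domain = "";
        private int retryTimes = 0;
        private int sleepMillis = 0;

        CrawlConfig setDomain(String domain) {
            this.domain = domain;
            return this;
        }

        CrawlConfig setRetryTimes(int retryTimes) {
            this.retryTimes = retryTimes;
            return this;
        }

        CrawlConfig setSleepMillis(int sleepMillis) {
            this.sleepMillis = sleepMillis;
            return this;
        }

        @Override
        public String toString() {
            return domain + " retry=" + retryTimes + " sleep=" + sleepMillis + "ms";
        }
    }

    public static void main(String[] args) {
        // One expression configures the whole object.
        CrawlConfig config = new CrawlConfig()
                .setDomain("example.com")
                .setRetryTimes(3)
                .setSleepMillis(1000);
        System.out.println(config); // example.com retry=3 sleep=1000ms
    }
}
```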

Life is a process of getting to know yourself

Multi-threaded import of gigabytes of data into MySQL

Description: importing 10 million rows into MySQL
Mac OS X 10.8 + MySQL 5.6.3 + Eclipse 4.x
MySQL was installed with Brew

First, create the table student(id, name)

create table student(
    id int,
    name varchar(10)
);

A Java program generates the 10 million records

package com.test;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;

/**
 * Generates mock data for the student and score tables.
 *
 * @author: dmnrei@gmail.com
 * @version: 2013-10-17 3:52:09 PM
 */
public class LX1017 {
    /**
     * Number of table records, ten million by default
     */
    static int recordNum = 10000000;

    public static void main(String[] args) throws IOException {
        // Generate the SQL file for the student table
        generateStudent();
    }

    public static void generateStudent() throws IOException {
        // Overwrite student.sql if it already exists
        FileWriter fileWriter = new FileWriter("student.sql", false);
        BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);

        StringBuilder sb = new StringBuilder(1024 * 1024);
        for (int i = 1; i <= recordNum; i++) {
            String str = "insert into student values(" + i + ", '" + i + "');";
            sb.append(str);
            sb.append("\n");

            // Write out every 10,000 statements to keep memory use bounded
            if (i % 10000 == 0) {
                bufferedWriter.write(sb.toString());
                bufferedWriter.flush();
                sb.setLength(0);
            }
        }
        // Write any statements left over when recordNum is not a multiple of 10,000
        bufferedWriter.write(sb.toString());
        bufferedWriter.close();
    }
}

Importing the SQL file with the mysql client (took 30 min)

now=$(date +"%T")
echo "Current time : $now"
mysql -u username -ppwd test < student.sql
now=$(date +"%T")
echo "Current time : $now"

Because that took too long, I tried a multi-threaded import (took 15 min)
Number of threads: 1k
Each thread handles 10k records

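0) Start the threads at intervals of 50 ms; starting them all at once fails badly.

1) Each thread inserts 1k SQL statements at a time using batch inserts and rests briefly after each batch. The database helper class should not close the connection along the way; release it manually once the work is finished.

2) MySQL has to allow for 1k connections. In /usr/local/Cellar/mysql/5.6.13/my.cnf, set max_connections=2000 under [mysqld], then restart MySQL.
alias restart-mysql="/usr/local/Cellar/mysql/5.6.13/support-files/mysql.server restart"
alias stop-mysql="/usr/local/Cellar/mysql/5.6.13/support-files/mysql.server stop"
alias start-mysql="/usr/local/Cellar/mysql/5.6.13/support-files/mysql.server start"
Inside mysql, run show variables like 'max%' to verify that the change took effect.

3) Running the program from Eclipse 4.x often runs out of memory
Change the Run Configurations (this approach works)
Right-click the code, choose Run As -> Run Configurations, and enter the following value under VM arguments on the Arguments tab:
-Xms512m -Xmx1024m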
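
The threading scheme in points 0) and 1) might be sketched roughly as follows. This is not the original program, which is not shown in the post. BatchSink, partition and runImport are hypothetical names, and the sink is a plain callback so the sketch runs without a database; in a real run it would wrap a JDBC PreparedStatement with addBatch() and executeBatch() on a connection that the worker keeps open until it finishes. The short rest after each batch is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ThreadedImportSketch {

    /** Hypothetical hook: receives one batch of ids [from, to] to insert. */
    interface BatchSink {
        void insertBatch(int from, int to) throws Exception;
    }

    /** Splits ids 1..total into consecutive [from, to] ranges of at most perWorker ids. */
    static List<int[]> partition(int total, int perWorker) {
        List<int[]> ranges = new ArrayList<>();
        for (int from = 1; from <= total; from += perWorker) {
            ranges.add(new int[] {from, Math.min(from + perWorker - 1, total)});
        }
        return ranges;
    }

    /** Runs one worker per range, starting the workers staggerMillis apart. */
    static void runImport(int total, int perWorker, int batchSize,
                          long staggerMillis, BatchSink sink) throws InterruptedException {
        List<int[]> ranges = partition(total, perWorker);
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, ranges.size()));
        for (int[] range : ranges) {
            pool.submit(() -> {
                // A real worker would open its JDBC connection here and keep it
                // until all of its batches are done.
                for (int from = range[0]; from <= range[1]; from += batchSize) {
                    int to = Math.min(from + batchSize - 1, range[1]);
                    sink.insertBatch(from, to);
                }
                return null; // makes this a Callable, so checked exceptions are allowed
            });
            Thread.sleep(staggerMillis); // stagger worker start-up
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    public static void main(String[] args) throws Exception {
        AtomicLong rows = new AtomicLong();
        // Small numbers for a quick demo; the post used 1k workers x 10k rows, 50 ms apart.
        runImport(10_000, 1_000, 100, 5, (from, to) -> rows.addAndGet(to - from + 1));
        System.out.println("rows handled = " + rows.get()); // rows handled = 10000
    }
}
```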

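Finally I tried a MySQL import script: first mysqldump the student table's structure and data, then import it again (took 1 min, which stunned me)
0) Locking
LOCK TABLES `student` WRITE;
insert ...
UNLOCK TABLES;

1) Batch inserts
insert into student values(1,'1'),(2,'2')..., with a single insert statement containing 60,000 rows.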
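
As a rough illustration of the multi-row form, the generator from earlier in this post could emit one insert per group of rows instead of one per row. The sketch below is an assumption about how that might look, not what mysqldump does internally; buildInserts and MultiRowInsertSketch are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class MultiRowInsertSketch {

    /** Builds INSERT statements for ids 1..total, with up to rowsPerStatement rows each. */
    static List<String> buildInserts(int total, int rowsPerStatement) {
        List<String> statements = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= total; i++) {
            // Start a new statement, or separate this row from the previous one
            sb.append(sb.length() == 0 ? "insert into student values" : ",");
            sb.append('(').append(i).append(",'").append(i).append("')");
            // Close the statement when it is full or when the data runs out
            if (i % rowsPerStatement == 0 || i == total) {
                statements.add(sb.append(';').toString());
                sb.setLength(0);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        for (String sql : buildInserts(5, 2)) {
            System.out.println(sql);
        }
        // insert into student values(1,'1'),(2,'2');
        // insert into student values(3,'3'),(4,'4');
        // insert into student values(5,'5');
    }
}
```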

When doing something, first get it done well, then refine it

MySQL deployment guide

Mac MySQL -> Windows MySQL

Log in to MySQL as the administrator

mysql -u root -p

Select the mysql database

use mysql;

Create a user and set the password

create user 'xxx'@'localhost' identified by 'yyy';

Make the change take effect

flush privileges;

Create a database for the user

create database sms default character set utf8 collate utf8_general_ci;

Grant the user all privileges on the sms database

grant all privileges on sms.* to 'xxx'@'localhost' identified by 'yyy';

Allow remote connections from any address

GRANT ALL PRIVILEGES ON *.* TO 'xxx'@'%' IDENTIFIED BY 'yyy' WITH GRANT OPTION;

Make the change take effect

flush privileges;

Log in as the new user

mysql -u xxx -p

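Export the database with the default character set (the second command exports data only, without table definitions)

mysqldump -u xxq -p --default-character-set=utf8 sms > sms.sql
mysqldump -u root -p -t dbname > dataonly.sql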

Export the database while converting the character set

mysqldump -u xxq -p --default-character-set=utf8 --set-charset=latin1 --skip-opt sms > sms.sql

Import the database

mysql -u xxq -p --default-character-set=utf8 sms < sms.sql

Restart MySQL

$mysql_dir/bin/mysqladmin -u root -p shutdown
$mysql_dir/bin/mysqld_safe &

How to check which port MySQL is using and verify that MySQL is running

netstat -aon | findstr "3306"
telnet mysql 3306

MySQL has default settings for character set and collation at four levels: server, database, table, and column.

show variables like 'character_set_%';
show create table tb_name;
show full columns from tb_name;

How to change the character set at the server, database, table, and column levels

Edit my.ini
default-character-set = utf8
character_set_server = utf8

alter database dbname character set utf8 collate utf8_general_ci;
alter table tb_name convert to character set charset_name;
alter table tb_name modify latin1_text_col text character set utf8;
alter table tb_name change old_col_name new_column varchar(10) character set utf8 collate utf8_general_ci;

Be humble as a person, ambitious in your work