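We have a small tool that needs to scrape content from existing HTML pages.
Previously we searched the pages with regular expressions, but the patterns were hard to write.
Later we switched to HtmlCleaner + XPath.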
Consider the following code snippet:[code]TagNode tagNode = new HtmlCleaner().clean(message.getBody());
try {
w3cDoc = new DomSerializer(new CleanerProperties()).createDOM(tagNode);
xpath = XPathFactory.newInstance().newXPath();
firstName = (String) xpath.evaluate(t_FirstNameXPathPattern, w3cDoc, XPathConstants.STRING);
email = (String) xpath.evaluate(t_EmailXPathPattern, w3cDoc, XPathConstants.STRING);
phone = (String) xpath.evaluate(t_PhoneXPathPattern, w3cDoc, XPathConstants.STRING);
mlsNumber = (String) xpath.evaluate(t_MlsNumberXPathPattern, w3cDoc, XPathConstants.STRING);
comment = (String) xpath.evaluate(t_CommentXPathPattern, w3cDoc, XPathConstants.STRING);
} catch (Exception ex) {
// Log the failure together with its stack trace
logger.error("HTML XPATH PROCESS ERROR", ex);
}
[/code]In the code above, message.getBody() returns the HTML string to be processed.
The XPath expressions are defined as follows:
private String r_FirstNameXPathPattern = "/html/body/div/div/text()[3]";
private String r_LastNameXPathPattern = "/html/body/div/div/text()[4]";
private String r_MlsNumberXPathPattern = "/html/body/div/div/text()[18]";
private String r_EmailXPathPattern = "/html/body/div/div/a[1]";
private String r_PhoneXPathPattern = "/html/body/div/div/a[2]";
private String r_CommentXPathPattern = "/html/body/div/div/text()[11]";
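To illustrate how positional expressions such as text()[3] and a[1] pick out nodes, here is a minimal sketch. It is not our production code: the sample page, the class name XPathPositionDemo, and the helper evaluate are all made up for illustration, and because the sample is already well-formed it is parsed with the JDK's own DocumentBuilder instead of HtmlCleaner.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathPositionDemo {

    // Hypothetical, already well-formed sample page (an assumption, not the real input).
    // The <br/> elements split the inner div's text into separate text nodes.
    static final String SAMPLE =
        "<html><body><div><div>"
        + "Lead<br/>"
        + "Name:<br/>"
        + "Alice<br/>"
        + "<a href=\"mailto:alice@example.com\">alice@example.com</a>"
        + "<a href=\"tel:5550100\">555-0100</a>"
        + "</div></div></body></html>";

    // Parses the markup into a W3C DOM and evaluates one XPath expression as a string.
    static String evaluate(String html, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(html)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        String value = (String) xpath.evaluate(expression, doc, XPathConstants.STRING);
        return value.trim();
    }

    public static void main(String[] args) throws Exception {
        // text()[3] is the third text node that is a direct child of the inner div.
        System.out.println(evaluate(SAMPLE, "/html/body/div/div/text()[3]")); // Alice
        // a[1] and a[2] are the first and second <a> children; STRING yields their text.
        System.out.println(evaluate(SAMPLE, "/html/body/div/div/a[1]"));      // alice@example.com
        System.out.println(evaluate(SAMPLE, "/html/body/div/div/a[2]"));      // 555-0100
        // An index with no matching node evaluates to an empty string, not an exception.
        System.out.println(evaluate(SAMPLE, "/html/body/div/div/text()[9]").isEmpty()); // true
    }
}
```

Because these expressions depend on node positions, any change in the page layout (an extra line break, a new link) shifts the indexes, and a missing node silently yields an empty string, so it is worth checking the extracted values before using them.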
Using native XPath