Developing the TDD Way: Listing All Articles of a CSDN Blog

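Recently, while working on a Code Kata, I decided to list every article on my own CSDN blog, writing the tests first and the implementation afterwards (the TDD approach). I am sharing the result here. The page parsing is based on org.htmlparser.htmlparser. If you want to use it, add the following dependency to your pom.xml.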

<dependency>
  <groupId>org.htmlparser</groupId>
  <artifactId>htmlparser</artifactId>
  <version>2.1</version>
</dependency>
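It is worth noting that choosing an appropriate Filter matters a great deal when using org.htmlparser.htmlparser; a well-chosen Filter often gets twice the result with half the effort. The 16 commonly used Filters are listed below.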

The 16 Filters can be grouped into a few categories.
* Predicate filters:

  • TagNameFilter
  • HasAttributeFilter
  • HasChildFilter
  • HasParentFilter
  • HasSiblingFilter
  • IsEqualFilter

* Logical operation filters:

  • AndFilter
  • NotFilter
  • OrFilter
  • XorFilter

* Other filters:

  • NodeClassFilter
  • StringFilter
  • LinkStringFilter
  • LinkRegexFilter
  • RegexFilter
  • CssSelectorNodeFilter
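
The logical operation filters combine other filters the way boolean operators combine conditions. The idea can be sketched with the JDK's `java.util.function.Predicate` standing in for a node filter. Everything in this sketch (`Tag`, `tagName`, `hasId`, and the assumption that the `panel_Category` element is a `div`) is a hypothetical stand-in for illustration, not part of the HTMLParser API.

```java
import java.util.function.Predicate;

class FilterCompositionSketch {

  // Hypothetical stand-in for an HTML tag; not an HTMLParser class.
  static final class Tag {
    final String name;
    final String id; // null when the tag has no id attribute

    Tag(String name, String id) {
      this.name = name;
      this.id = id;
    }
  }

  // Plays the role of a TagNameFilter: matches on the tag name.
  static Predicate<Tag> tagName(String name) {
    return tag -> tag.name.equalsIgnoreCase(name);
  }

  // Plays the role of a HasAttributeFilter on "id" with a given value.
  static Predicate<Tag> hasId(String value) {
    return tag -> value.equals(tag.id);
  }

  public static void main(String[] args) {
    // Both conditions must hold, as with AndFilter.
    Predicate<Tag> categoryPanel = tagName("div").and(hasId("panel_Category"));
    // Either condition may hold, as with OrFilter.
    Predicate<Tag> linkOrHeading = tagName("a").or(tagName("h1"));
    // The condition must not hold, as with NotFilter.
    Predicate<Tag> notLink = tagName("a").negate();

    Tag panel = new Tag("div", "panel_Category");
    System.out.println(categoryPanel.test(panel));
    System.out.println(linkOrHeading.test(panel));
    System.out.println(notLink.test(panel));
  }
}
```

In HTMLParser itself the same composition is done by passing filters to the logical filter classes, for example an `AndFilter` built from an array of `NodeFilter` objects.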

#1 Tests first in TDD: the test code

package com.winneryum.csdn;

import static org.junit.Assert.*;

import java.util.List;
import org.junit.Test;


public class CSDNPageParserTest {

  @Test
  public void testListAllCategoryURLByCSDNIdURL(){
    //http://blog.csdn.net/chancein007/
    String csdnID="chancein007"; 
    CSDNPageParser csdnPageParser=new CSDNPageParser(csdnID);
    List<String> lsCategryURLs=csdnPageParser.listAllCategoryURLsByCSDNId();
    assertTrue(lsCategryURLs.size()>0);
    System.out.println(lsCategryURLs.toString());
  }
  
  @Test
  public void testListPagesByCategoryURLs(){
    CSDNPageParser csdnPageParser=new CSDNPageParser();
    String categoryURL="http://blog.csdn.net//chancein007/article/category/2331239";
    List<String> lsPages= csdnPageParser.listPagesByCategoryURL(categoryURL);
    assertTrue(lsPages.size()>0);
    System.out.println(lsPages.toString());
  }
  
  @Test
  public void testGetAllPageURLs(){
    String csdnID="chancein007"; 
    CSDNPageParser csdnPageParser=new CSDNPageParser(csdnID); 
    List<String> lsAllPages= csdnPageParser.getAllPageURLs();
    assertTrue(lsAllPages.size()>0);
    for(int i=0;i<lsAllPages.size();i++){
      System.out.println(lsAllPages.get(i));
    }
  }
}

#2 The implementation

package com.winneryum.csdn;

import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.http.ConnectionManager;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class CSDNPageParser {
 
  public final static String  CSDN_ROOT_URL="http://blog.csdn.net";
  private String csdnID;
  
  private String getCSDNRootBlogURL(){
    return CSDN_ROOT_URL+"/"+csdnID+"/";
  }
  public CSDNPageParser(String csdnID) {
    this.csdnID=csdnID;
  }
  public CSDNPageParser() {
  }
  public List<String> listAllCategoryURLsByCSDNId() {
    List<String> categoryURLs=new ArrayList<String>();
    String encoding = "UTF-8";  
    try {
      Parser onLineHtmlParser;
      onLineHtmlParser = new Parser();
      ConnectionManager connectionManager=Parser.getConnectionManager ();
      Hashtable hashTable=connectionManager.getRequestProperties();
      hashTable.put("User-Agent","Firefox");
      connectionManager.setRequestProperties(hashTable);
      onLineHtmlParser.setURL(getCSDNRootBlogURL());
      onLineHtmlParser.setEncoding(encoding);  
      NodeFilter filter = new HasAttributeFilter( "id","panel_Category" );  
      //NodeClassFilter nodeClassFilter=new NodeClassFilter(org.htmlparser.tags.LinkTag.class);
      //AndFilter andFilter=new AndFilter(new NodeFilter[]{filter,nodeClassFilter});
      NodeList nodes = onLineHtmlParser.extractAllNodesThatMatch(filter);
      String categorySegment= nodes.elementAt(1).toHtml();
      
      Parser categorySegementParser = new Parser(categorySegment);
      TagNameFilter tagFileter=new TagNameFilter("a");
      NodeList categoryNode=categorySegementParser.extractAllNodesThatMatch(tagFileter);
      for(int i=0;i<categoryNode.size();i++){
        TagNode linkNode=(TagNode)categoryNode.elementAt(i);
        categoryURLs.add(CSDN_ROOT_URL+linkNode.getAttribute("href"));
      }
      
    } catch (ParserException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }  
 
    
    return categoryURLs;
  }
  
  public List<String> listPagesByCategoryURL(String categoryURL) {
    List<String> pageURLs=new ArrayList<String>();
    String encoding = "UTF-8";  
    try {
      Parser onLineHtmlParser;
      onLineHtmlParser = new Parser();
      ConnectionManager connectionManager=Parser.getConnectionManager ();
      Hashtable hashTable=connectionManager.getRequestProperties();
      hashTable.put("User-Agent","Firefox");
      connectionManager.setRequestProperties(hashTable);
      onLineHtmlParser.setURL(categoryURL);
      onLineHtmlParser.setEncoding(encoding);  
      TagNameFilter h1TagFileter=new TagNameFilter("h1");
      NodeList h1Nodes = onLineHtmlParser.extractAllNodesThatMatch(h1TagFileter);
      
      String pageSegment= h1Nodes.toHtml();
      
      Parser categorySegementParser = new Parser(pageSegment);
      TagNameFilter tagFileter=new TagNameFilter("a");
      NodeList pageDetailedNode=categorySegementParser.extractAllNodesThatMatch(tagFileter);
      for(int i=0;i<pageDetailedNode.size();i++){
        TagNode linkNode=(TagNode)pageDetailedNode.elementAt(i);
        pageURLs.add(CSDN_ROOT_URL+linkNode.getAttribute("href"));
      }
    } catch (ParserException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    } 
    
    return pageURLs;
  }
  public List<String> getAllPageURLs() {
    List<String> allpageURLs=new ArrayList<String>(); 
    List<String> allCatgoryURL=listAllCategoryURLsByCSDNId();
    for(String categoryURL:allCatgoryURL){
      List<String> listPageURLs=listPagesByCategoryURL(categoryURL);
      allpageURLs.addAll(listPageURLs);
    }
    return allpageURLs;
  }

}
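
Both list methods build absolute links by concatenating `CSDN_ROOT_URL` with the `href` value, which works as long as every `href` is site-root-relative (starting with `/`). A minimal sketch of a more forgiving alternative, using only the JDK's `java.net.URI`, is shown here; the class and method names are hypothetical and are not part of `CSDNPageParser`.

```java
import java.net.URI;

class HrefResolverSketch {

  // Base URI with a trailing slash, so the base path is not empty.
  static final URI CSDN_ROOT = URI.create("http://blog.csdn.net/");

  // Turns an href taken from an <a> tag into an absolute URL string.
  // Root-relative hrefs are resolved against the base; hrefs that are
  // already absolute are returned unchanged.
  // Throws IllegalArgumentException if href is not a valid URI.
  static String toAbsolute(String href) {
    return CSDN_ROOT.resolve(href).toString();
  }

  public static void main(String[] args) {
    System.out.println(toAbsolute("/chancein007/article/details/53731148"));
    System.out.println(toAbsolute("http://blog.csdn.net/chancein007/article/details/7318315"));
  }
}
```

With this approach an `href` that is already a full URL does not end up with the root prefixed a second time.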


#3 Notes

Pay attention to this part of the code above:

ConnectionManager connectionManager=Parser.getConnectionManager ();
Hashtable hashTable=connectionManager.getRequestProperties();
hashTable.put("User-Agent","Firefox");
connectionManager.setRequestProperties(hashTable);

Without this code, the CSDN site treats the request as coming from a bot and responds with a 403 Forbidden status code, which surfaces as the exception below.

org.htmlparser.util.ParserException: Exception getting input stream from http://blog.csdn.net/chancein007/ (Server returned HTTP response code: 403 for URL: http://blog.csdn.net/chancein007/).;
java.io.IOException: Server returned HTTP response code: 403 for URL: http://blog.csdn.net/chancein007/
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1675)
at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1673)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1671)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1244)
at org.htmlparser.lexer.Page.setConnection(Page.java:576)
at org.htmlparser.lexer.Page.<init>(Page.java:133)
at org.htmlparser.lexer.Lexer.<init>(Lexer.java:185)
at org.htmlparser.Parser.setConnection(Parser.java:419)
at org.htmlparser.Parser.setURL(Parser.java:448)
at com.winneryum.csdn.CSDNPageParser.listAllCategoryURLsByCSDNId(CSDNPageParser.java:38)
at com.winneryum.csdn.CSDNPageParser.getAllPageURLs(CSDNPageParser.java:96)
at com.winneryum.csdn.CSDNPageParserTest.testGetAllPageURLs(CSDNPageParserTest.java:34)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
Caused by: java.io.IOException: Server returned HTTP response code: 403 for URL: http://blog.csdn.net/chancein007/
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at org.htmlparser.http.ConnectionManager.openConnection(ConnectionManager.java:661)
at org.htmlparser.http.ConnectionManager.openConnection(ConnectionManager.java:849)
... 27 more
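
At the HTTP level, the `ConnectionManager` snippet above amounts to sending a `User-Agent` request header with each request. As a rough illustration of the same idea without HTMLParser, here is a sketch using the JDK's `HttpURLConnection`; the class and method names are hypothetical, and the sketch only prepares the request without sending it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class UserAgentSketch {

  // Creates a connection object that carries the given User-Agent header.
  // openConnection() does not contact the server; the request is only sent
  // once connect(), getResponseCode() or getInputStream() is called.
  static HttpURLConnection openWithUserAgent(String url, String userAgent) {
    try {
      HttpURLConnection conn =
          (HttpURLConnection) URI.create(url).toURL().openConnection();
      conn.setRequestProperty("User-Agent", userAgent);
      return conn;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    HttpURLConnection conn =
        openWithUserAgent("http://blog.csdn.net/chancein007/", "Firefox");
    // Shows the header that would be sent with the request.
    System.out.println(conn.getRequestProperty("User-Agent"));
  }
}
```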

#4 Output

http://blog.csdn.net/chancein007/article/details/53731148 http://blog.csdn.net/chancein007/article/details/52675923 http://blog.csdn.net/chancein007/article/details/52305324 http://blog.csdn.net/chancein007/article/details/52109198 http://blog.csdn.net/chancein007/article/details/50646864 http://blog.csdn.net/chancein007/article/details/50507571 http://blog.csdn.net/chancein007/article/details/50507557 http://blog.csdn.net/chancein007/article/details/46508301 http://blog.csdn.net/chancein007/article/details/46444385 http://blog.csdn.net/chancein007/article/details/42132741 http://blog.csdn.net/chancein007/article/details/42033201 http://blog.csdn.net/chancein007/article/details/41652415 http://blog.csdn.net/chancein007/article/details/41628911 http://blog.csdn.net/chancein007/article/details/41276881 http://blog.csdn.net/chancein007/article/details/41158659 http://blog.csdn.net/chancein007/article/details/32181133 http://blog.csdn.net/chancein007/article/details/30587585 http://blog.csdn.net/chancein007/article/details/30475085 http://blog.csdn.net/chancein007/article/details/30365737 http://blog.csdn.net/chancein007/article/details/27123273 http://blog.csdn.net/chancein007/article/details/50984565 http://blog.csdn.net/chancein007/article/details/50909318 http://blog.csdn.net/chancein007/article/details/50909263 http://blog.csdn.net/chancein007/article/details/50909156 http://blog.csdn.net/chancein007/article/details/42131677 http://blog.csdn.net/chancein007/article/details/42120567 http://blog.csdn.net/chancein007/article/details/42120019 http://blog.csdn.net/chancein007/article/details/42119581 http://blog.csdn.net/chancein007/article/details/41950775 http://blog.csdn.net/chancein007/article/details/41280997 http://blog.csdn.net/chancein007/article/details/41157887 http://blog.csdn.net/chancein007/article/details/41156783 http://blog.csdn.net/chancein007/article/details/41156609 http://blog.csdn.net/chancein007/article/details/52939738 
http://blog.csdn.net/chancein007/article/details/52939565 http://blog.csdn.net/chancein007/article/details/52939194 http://blog.csdn.net/chancein007/article/details/52684365 http://blog.csdn.net/chancein007/article/details/52676040 http://blog.csdn.net/chancein007/article/details/52624514 http://blog.csdn.net/chancein007/article/details/52609930 http://blog.csdn.net/chancein007/article/details/52601915 http://blog.csdn.net/chancein007/article/details/52551955 http://blog.csdn.net/chancein007/article/details/52551852 http://blog.csdn.net/chancein007/article/details/52551722 http://blog.csdn.net/chancein007/article/details/52551686 http://blog.csdn.net/chancein007/article/details/46552887 http://blog.csdn.net/chancein007/article/details/46539231 http://blog.csdn.net/chancein007/article/details/46490361 http://blog.csdn.net/chancein007/article/details/46476601 http://blog.csdn.net/chancein007/article/details/46470389 http://blog.csdn.net/chancein007/article/details/46470201 http://blog.csdn.net/chancein007/article/details/46447351 http://blog.csdn.net/chancein007/article/details/46318359 http://blog.csdn.net/chancein007/article/details/46317799 http://blog.csdn.net/chancein007/article/details/46293391 http://blog.csdn.net/chancein007/article/details/46293031 http://blog.csdn.net/chancein007/article/details/41157887 http://blog.csdn.net/chancein007/article/details/34514439 http://blog.csdn.net/chancein007/article/details/29642625 http://blog.csdn.net/chancein007/article/details/28016097 http://blog.csdn.net/chancein007/article/details/28001087 http://blog.csdn.net/chancein007/article/details/27384237 http://blog.csdn.net/chancein007/article/details/27239605 http://blog.csdn.net/chancein007/article/details/25926035 http://blog.csdn.net/chancein007/article/details/7318315 http://blog.csdn.net/chancein007/article/details/46310051 http://blog.csdn.net/chancein007/article/details/46301553 http://blog.csdn.net/chancein007/article/details/46242685 
http://blog.csdn.net/chancein007/article/details/46241983 http://blog.csdn.net/chancein007/article/details/46241413 http://blog.csdn.net/chancein007/article/details/46238469 http://blog.csdn.net/chancein007/article/details/46238217 http://blog.csdn.net/chancein007/article/details/46137277 http://blog.csdn.net/chancein007/article/details/46136925 http://blog.csdn.net/chancein007/article/details/34537989 http://blog.csdn.net/chancein007/article/details/30340095 http://blog.csdn.net/chancein007/article/details/29653831 http://blog.csdn.net/chancein007/article/details/29645055 http://blog.csdn.net/chancein007/article/details/28142261 http://blog.csdn.net/chancein007/article/details/28104355 http://blog.csdn.net/chancein007/article/details/28083799 http://blog.csdn.net/chancein007/article/details/28023411 http://blog.csdn.net/chancein007/article/details/53983755 http://blog.csdn.net/chancein007/article/details/53889470 http://blog.csdn.net/chancein007/article/details/53792477 http://blog.csdn.net/chancein007/article/details/53731662 http://blog.csdn.net/chancein007/article/details/53731148 http://blog.csdn.net/chancein007/article/details/53730991 http://blog.csdn.net/chancein007/article/details/52109198 http://blog.csdn.net/chancein007/article/details/52109226 http://blog.csdn.net/chancein007/article/details/52108986 http://blog.csdn.net/chancein007/article/details/51813468 http://blog.csdn.net/chancein007/article/details/41950775 http://blog.csdn.net/chancein007/article/details/7316076 http://blog.csdn.net/chancein007/article/details/41157887 http://blog.csdn.net/chancein007/article/details/7315951 http://blog.csdn.net/chancein007/article/details/7315936 http://blog.csdn.net/chancein007/article/details/7315922 http://blog.csdn.net/chancein007/article/details/54296014 http://blog.csdn.net/chancein007/article/details/54295796 http://blog.csdn.net/chancein007/article/details/54260855 http://blog.csdn.net/chancein007/article/details/53120622 
http://blog.csdn.net/chancein007/article/details/53120527 http://blog.csdn.net/chancein007/article/details/51813468 http://blog.csdn.net/chancein007/article/details/51813421 http://blog.csdn.net/chancein007/article/details/51813351 http://blog.csdn.net/chancein007/article/details/51813218 http://blog.csdn.net/chancein007/article/details/51813089 http://blog.csdn.net/chancein007/article/details/46293851 http://blog.csdn.net/chancein007/article/details/41280997 http://blog.csdn.net/chancein007/article/details/26297455 http://blog.csdn.net/chancein007/article/details/5154175 http://blog.csdn.net/chancein007/article/details/5154051 http://blog.csdn.net/chancein007/article/details/46136219 http://blog.csdn.net/chancein007/article/details/32992877 http://blog.csdn.net/chancein007/article/details/32986523 http://blog.csdn.net/chancein007/article/details/29822487 http://blog.csdn.net/chancein007/article/details/7316044 http://blog.csdn.net/chancein007/article/details/7315987 http://blog.csdn.net/chancein007/article/details/7307017 http://blog.csdn.net/chancein007/article/details/7306937 http://blog.csdn.net/chancein007/article/details/50645199 http://blog.csdn.net/chancein007/article/details/50645197 http://blog.csdn.net/chancein007/article/details/46059489 http://blog.csdn.net/chancein007/article/details/46059049 http://blog.csdn.net/chancein007/article/details/41178345 http://blog.csdn.net/chancein007/article/details/30813569 http://blog.csdn.net/chancein007/article/details/26297455 http://blog.csdn.net/chancein007/article/details/53189912 http://blog.csdn.net/chancein007/article/details/53002892 http://blog.csdn.net/chancein007/article/details/52940032 http://blog.csdn.net/chancein007/article/details/53014952 http://blog.csdn.net/chancein007/article/details/53014738 http://blog.csdn.net/chancein007/article/details/53002981 http://blog.csdn.net/chancein007/article/details/42277345 http://blog.csdn.net/chancein007/article/details/27116691 
http://blog.csdn.net/chancein007/article/details/54016636 http://blog.csdn.net/chancein007/article/details/30494313 http://blog.csdn.net/chancein007/article/details/30467815 http://blog.csdn.net/chancein007/article/details/37722755 http://blog.csdn.net/chancein007/article/details/27242265 http://blog.csdn.net/chancein007/article/details/52170752 http://blog.csdn.net/chancein007/article/details/52069057 http://blog.csdn.net/chancein007/article/details/53933872 http://blog.csdn.net/chancein007/article/details/53959603 http://blog.csdn.net/chancein007/article/details/27122719 http://blog.csdn.net/chancein007/article/details/54343653 http://blog.csdn.net/chancein007/article/details/54238017 http://blog.csdn.net/chancein007/article/details/27243793 http://blog.csdn.net/chancein007/article/details/54344730
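
Some URLs appear more than once in this output (for example, .../article/details/41157887), presumably because an article filed under several categories is collected once per category. If a unique list is wanted, one possible approach is to pass the result through a `LinkedHashSet`, which drops repeats while keeping first-seen order. The class and method names in this sketch are hypothetical and are not part of `CSDNPageParser`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class DistinctUrlsSketch {

  // Returns the URLs with duplicates dropped, keeping first-seen order.
  static List<String> distinct(List<String> urls) {
    return new ArrayList<>(new LinkedHashSet<>(urls));
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList(
        "http://blog.csdn.net/chancein007/article/details/41157887",
        "http://blog.csdn.net/chancein007/article/details/46293031",
        "http://blog.csdn.net/chancein007/article/details/41157887");
    for (String url : distinct(urls)) {
      System.out.println(url);
    }
  }
}
```

The same call could be applied to the list returned by `getAllPageURLs()` before printing it.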
