How can I build an article-scraping feature?
As the title says. Points offered: 100; replies: 7
Reply #1, rola(林), posted 2006-07-03 16:53:12, score 0
I recommend a ready-made program, CyberArticle (网文快捕). Search for it online; you can use it as a reference.
Reply #2, shalen520(Love will keep us alive), posted 2006-07-03 17:02:31, score 0
The question is a bit broad.
Put simply: fetch the page's HTML, parse it, and pick out what you need.
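The "pick out what you need" step is often done by cutting the text between two known markers in the page. A minimal sketch, assuming the article body sits between a fixed start string and a fixed end string; the name extractBetween and both markers are hypothetical and do not come from this thread:

```cpp
#include <string>

// Hypothetical helper (not from the thread): copies the text between the
// first occurrence of `begin` and the next occurrence of `end` into `out`.
// Returns false, leaving `out` untouched, if either marker is missing.
bool extractBetween(const std::string& html,
                    const std::string& begin,
                    const std::string& end,
                    std::string& out)
{
    const std::string::size_type start = html.find(begin);
    if (start == std::string::npos) return false;

    // The wanted text starts right after the begin marker.
    const std::string::size_type bodyStart = start + begin.size();
    const std::string::size_type stop = html.find(end, bodyStart);
    if (stop == std::string::npos) return false;

    out = html.substr(bodyStart, stop - bodyStart);
    return true;
}
```

Calling it with the markers `<div id="content">` and `</div>` would return whatever lies between them. This only works when the target site's markup around the article is stable, so the markers usually have to be configured per site.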
Reply #3, whChina(江城老温)(as a thinker), posted 2006-07-03 17:14:00, score 0
Fetch the HTML elements, parse the tags, and use regular expressions to extract the useful information.
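As an illustration of the regular-expression step, here is a rough sketch that collects the href values of <a> tags with std::regex (C++11). The function name extractLinks and the pattern are assumptions made for this example. Regular expressions cope with simple, regular markup only, so real pages may need a proper HTML parser:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): returns the href value of every
// <a> tag whose href attribute is quoted with single or double quotes.
std::vector<std::string> extractLinks(const std::string& html)
{
    // Group 1 captures the URL; matching is case-insensitive so that
    // <A HREF="..."> is found as well.
    static const std::regex linkPattern(
        R"re(<a\b[^>]*?\bhref\s*=\s*["']([^"']*)["'])re",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), linkPattern), last;
         it != last; ++it) {
        links.push_back((*it)[1].str());
    }
    return links;
}
```

The same approach works for titles, dates, or body text by swapping in a different pattern; unquoted attribute values are deliberately not handled here.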
Reply #4, wlovenet(喝了这杯酒大家就是兄弟), posted 2006-07-03 17:37:01, score 0
Fetch the page first, then parse it, as the posters above said.
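Reply #5, whChina(江城老温)(as a thinker), posted 2006-07-03 18:09:16, score 0
I have some VC++ code for fetching web pages on hand. If the original poster needs it, I can send it over.
Here is part of the code.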
//--------------------------------------------------------------------------------
// FUNCTION: Strip the spaces on both sides of strLine.
// IN : strLine
// OUT : strLine
// AUTHOR : 2006-05-22 Created by navy .
// NOTE :
//--------------------------------------------------------------------------------
void Extract::funDelSideSpace(string& strLine)
{
    string::size_type nPos; // find_* return string::npos (not -1) on failure
    if (strLine.empty()) return;
    nPos=strLine.find_first_not_of(" ");
    if ( nPos == string::npos ) { // the line consists of spaces only
        strLine = "";
        return;
    }
    strLine=strLine.substr(nPos);
    nPos=strLine.find_last_not_of(" "); // always found: a non-space character remains
    strLine=strLine.substr(0,nPos+1);
}
//--------------------------------------------------------------------------------
// FUNCTION: Replace the strings that start with & and end with ; (HTML entities).
// IN : strResult
// OUT : strResult
// AUTHOR : 2006-05-24 Created by navy .
// 2006-05-29 Modified by navy .
// NOTE : 1."&nbsp;"-->" " 2."&lt;"-->"<" 3."&gt;"-->">"
// 4."&amp;"-->"&" 5."&quot;"-->"\"" 6."&copy;"-->"(C)"
// 7."&reg;"-->"?" 8."&trade;"-->"TM" 9."&#8226;"-->"·"
//--------------------------------------------------------------------------------
void Extract::funDelBegAnd(string& strResult,int nPos) // start searching at position nPos
{
    //a & ;; 编辑
    string strTmp1,strTmp2;
    string strSymbol;
    string::size_type nPos1,nPos2;
    nPos1 = strResult.find("&",nPos);
    nPos2 = strResult.find(";",nPos1);
    // Stop as soon as no complete "&...;" pair remains.
    while ((nPos1 != string::npos) && (nPos2 != string::npos))
    {
        strSymbol = strResult.substr(nPos1+1,nPos2-nPos1-1); // the name between & and ;
        if(strSymbol == "nbsp") strSymbol = " ";
        else if(strSymbol == "lt") strSymbol = "<";
        else if(strSymbol == "gt") strSymbol = ">";
        else if(strSymbol == "amp") strSymbol = "&";
        else if(strSymbol == "quot") strSymbol = "\"";
        else if(strSymbol == "copy") strSymbol = "(C)";
        else if(strSymbol == "reg") strSymbol = "?";
        else if(strSymbol == "trade") strSymbol = "TM";
        else if(strSymbol == "#8226") strSymbol = "·";
        else
        {
            // Unknown entity: leave it alone and look for the next &.
            nPos1 = strResult.find("&",nPos1+1);
            nPos2 = strResult.find(";",nPos1);
            continue;
        }
        strTmp1 = strResult.substr(0,nPos1) ;
        strTmp2 = strResult.substr(nPos2 + 1) ;
        strResult = strTmp1 + strSymbol + strTmp2;
        // Resume after the inserted text, so that the "&" produced from "&amp;"
        // is not decoded a second time (e.g. "&amp;lt;" must become "&lt;").
        nPos1 = strResult.find("&",strTmp1.length() + strSymbol.length());
        nPos2 = strResult.find(";",nPos1);
    }
}
//--------------------------------------------------------------------------------
// FUNCTION: Delete useless links (<TR>...</TR> rows that contain nothing but links).
// IN : strResult
// OUT : modified in place (passed by reference)
// AUTHOR : 2006-05-27 Created by navy .
// NOTE : Eg: a<TR><TD><A>bc</A></TD></TR><TR><TD>d</TD></TR>e becomes:
// a<TR><TD>d</TD></TR>e
//--------------------------------------------------------------------------------
void Extract::funDelUselessLink(string& strResult)
{
    string strTmp1,strTmp2,strTmp3;
    string::size_type nPos1,nPos2;
    nPos1 = strResult.find("<TR");
    nPos2 = strResult.find("</TR>",nPos1); // the closing tag must follow the opening one
    while ((nPos1 != string::npos) && (nPos2 != string::npos)) {
        strTmp1 = strResult.substr(0,nPos1); // text before the row
        strTmp2 = strResult.substr(nPos1,nPos2 - nPos1 + 5); // the row itself, including </TR>
        strTmp3 = strResult.substr(nPos2 + 5); // text after the row
        funDelete2(strTmp2,"A",0); // delete the content between <A> and </A>
        funKeepListLabel(strTmp2,0);
        funReplace(strTmp2,"<BR","");
        funReplace(strTmp2," ","");
        if (strTmp2.empty()) // nothing is left after stripping: drop the row
        {
            strResult = strTmp1 + strTmp3 ;
            nPos1 = strResult.find("<TR",strTmp1.length());
            nPos2 = strResult.find("</TR>",nPos1);
            continue;
        }
        nPos1 = strResult.find("<TR",nPos2 + 5);
        nPos2 = strResult.find("</TR>",nPos1);
    }
}
Extract::~Extract()
{
}
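The Extract class declaration and helpers such as funReplace are not included in the post, so the functions above cannot be compiled on their own. For experimenting with the entity-replacement step in isolation, here is a standalone sketch. The free function decodeEntities and its lookup table are assumptions for illustration and cover only part of the entity list handled above:

```cpp
#include <map>
#include <string>

// Hypothetical free function (not part of the posted Extract class):
// returns a copy of `input` with a few named HTML entities decoded.
std::string decodeEntities(const std::string& input)
{
    // Assumed subset of the entities handled by funDelBegAnd.
    static const std::map<std::string, std::string> table = {
        {"nbsp", " "}, {"lt", "<"}, {"gt", ">"}, {"amp", "&"},
        {"quot", "\""}, {"copy", "(C)"}, {"trade", "TM"}
    };

    std::string output;
    output.reserve(input.size());
    std::string::size_type pos = 0;
    while (pos < input.size()) {
        const std::string::size_type amp = input.find('&', pos);
        if (amp == std::string::npos) {
            output.append(input, pos, std::string::npos); // no more entities
            break;
        }
        output.append(input, pos, amp - pos); // copy the text before '&'
        const std::string::size_type semi = input.find(';', amp + 1);
        if (semi != std::string::npos) {
            const auto it = table.find(input.substr(amp + 1, semi - amp - 1));
            if (it != table.end()) {
                output += it->second; // known entity: emit its replacement
                pos = semi + 1;
                continue;
            }
        }
        output += '&'; // not a known entity: keep the '&' literally
        pos = amp + 1;
    }
    return output;
}
```

Because the result is built in a separate string in a single pass, decoded text is never scanned again, so "&amp;lt;" comes out as "&lt;" rather than "<".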
Reply #6, mylocoy(), posted 2006-07-27 16:48:08, score 0
火车头, C# edition, free
www.locoy.com
Reply #7, MaxIE(MaxIE), posted 2006-07-27 20:37:51, score 0
http://maxie.cnblogs.com/archive/2006/07/26/460511.html
Automatically scraping web information with C#




