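[Experts, please come in] How can I automatically read the contents of a web page with C or C++ code?
I want to write a program that:
1. Queries the Google search engine for the keyword "cppunit";
2. Writes the first three results from the results page to a file.
It is just these two steps, but I have no idea where to start. The first step in particular seems very hard: how exactly do I do this in C/C++?
Any pointers would be appreciated.
Points offered: 35, replies: 17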
Reply #1, idler (Farewell, teenage) (I'm a bean...) (Closed for a break...), posted 2005-01-26 23:35:35, score 0
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclib/html/_mfc_chttpconnection.asp
CHttpConnection
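Reply #2, tempID005 (tempID005), posted 2005-01-27 14:48:57, score 0
Would some expert be willing to give me detailed code??
If the points are not enough I can add more, no problem.
I just need your help!!!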
Reply #3, tempID005 (tempID005), posted 2005-01-27 17:30:55, score 0
Why is nobody answering????
Reply #4, DiabloWalkOnTheEarth (I have thought of a truly marvelous nickname, but this space is too small to contain it), posted 2005-01-27 17:52:04, score 35
Socket skt;
skt.Create();
skt.Connect( "www.google.com" , 80 );
skt.Send( "GET /search?hl=zh-CN&q=cppunit&lr= HTTP/1.1\r\nHOST: www.google.com\r\n\r\n" );
char recvBuf[ ... ];
skt.Recv( recvBuf , sizeof( recvBuf ) );
Then match recvBuf against: <a href=([^ ]*) onmousedown=[^>]* target=_blank> and write the first three matches to the file.
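As a rough illustration of the matching step described above, here is a sketch that uses std::regex (C++11, which did not exist when this thread was written) to pull the first three href values out of a response string and write them to a file. The function names firstLinks and writeLinks are hypothetical, and the pattern is the one quoted above, so it only fits the result markup Google served at the time.

```cpp
#include <cstddef>
#include <fstream>
#include <regex>
#include <string>
#include <vector>

// Return up to maxCount href values captured by the pattern quoted in the post.
std::vector<std::string> firstLinks(const std::string& html, std::size_t maxCount)
{
    static const std::regex linkRe(
        "<a href=([^ ]*) onmousedown=[^>]* target=_blank>");
    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), linkRe), end;
         it != end && links.size() < maxCount; ++it)
    {
        links.push_back((*it)[1].str());   // capture group 1 is the URL
    }
    return links;
}

// Write one link per line; returns false if the file could not be written.
bool writeLinks(const std::vector<std::string>& links, const std::string& path)
{
    std::ofstream out(path);
    if (!out) return false;
    for (const std::string& link : links) out << link << '\n';
    return static_cast<bool>(out);
}
```

A caller would pass the received page text, for example writeLinks(firstLinks(page, 3), "result.txt"), where page and the file name are placeholders.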
Reply #5, WiseNeuro (Dance of Spring), posted 2005-01-27 23:19:43, score 0
mark
Reply #6, DiabloWalkOnTheEarth (I have thought of a truly marvelous nickname, but this space is too small to contain it), posted 2005-01-28 10:24:00, score 0
#include <cryptlib/socketft.h>
#include <cctype>
#include <cstdlib>
#include <iostream>
#include <string>
#include <sstream>
#include <iomanip>
using namespace std;
using namespace CryptoPP;

// Starts the platform socket subsystem for the lifetime of the program.
SocketsInitializer initSocketSubSystem;

// Percent-encode every byte that is not an ASCII letter or digit.
string urlesc( const char* url )
{
    ostringstream os; os << hex << setfill( '0' );
    for( ; *url; ++url )
    {
        unsigned char ch = (unsigned char)*url;   // isalnum must not receive a negative char
        if( isalnum( ch ) ) os << *url;
        else os << '%' << setw( 2 ) << (unsigned)ch;
    }
    return os.str();   // no "ends" here: it would append a NUL byte to the string
}

// Print the text between "<a href=" and "target=_blank>" if the line contains both.
void parserline( const string& line )
{
    size_t pos1 = line.find( "<a href=" ) , pos2 = line.find( "target=_blank>" );
    if( pos1 != string::npos && pos2 != string::npos && pos2 > pos1 )
    {
        cout << line.substr( pos1 + 8 , pos2 - pos1 - 8 ) << endl;
    }
}

int main( int argc , char* argv[] )
{
    char buf[64 * 1024]; size_t szread = 0;
    string strfind = "cppunit"; long start = 0;
    if( argc == 2 )
    {
        strfind = argv[1];
    }
    else if( argc == 3 )
    {
        strfind = argv[1];
        start = atol( argv[2] );
    }
    strfind = urlesc( strfind.c_str() );
    ostringstream req; req << "GET /search?hl=zh-CN&q=" << strfind << "&start=" << start
        << "&lr= HTTP/1.1\r\nHOST: www.google.com\r\nConnection: Close\r\n\r\n";
    const string request = req.str();
    cout << request << endl;
    Socket sktConn;
    try {
        sktConn.Create();
        sktConn.Connect( "www.google.com" , 80 );
        sktConn.Send( (const byte*)request.c_str() , request.length() );
        // Read until the buffer is full, the server closes, or 100 waits have passed.
        for( int tries = 0; szread < sizeof( buf ) - 1 && tries < 100; ++tries )
        {
            timeval tmwait; tmwait.tv_sec = 10; tmwait.tv_usec = 0;
            if( sktConn.ReceiveReady( &tmwait ) )
            {
                size_t sz = sktConn.Receive( (byte*)buf + szread , sizeof( buf ) - 1 - szread );
                if( !sz ) break;
                szread += sz;
            }
        }
        if( !szread ) return 1;
        buf[szread] = 0;
    }
    catch( const std::exception& e )
    {
        cerr << "FAILED : " << e.what() << endl;
        return 1;
    }
    istringstream iss( buf ); string line;
    while( getline( iss , line ) )   // test the read itself so the last line is not handled twice
    {
        parserline( line );
    }
    system( "pause" );   // Windows only
    return 0;
}
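Reply #7, healer_kx (Licorice (OP, please close the thread; it is not easy for those of us who post while at work)), posted 2005-01-28 10:27:23, score 0
I suggest reading the Tomcat source code.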
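Reply #8, DiabloWalkOnTheEarth (I have thought of a truly marvelous nickname, but this space is too small to contain it), posted 2005-01-28 10:27:33, score 0
Remember to add more points. I have always used the socket classes under cryptlib and am used to them; change the code to whatever approach you normally use.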
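Reply #9, DiabloWalkOnTheEarth (I have thought of a truly marvelous nickname, but this space is too small to contain it), posted 2005-01-28 10:47:20, score 0
It looks like the first link often comes out wrong when the search keyword is Chinese; fix that part yourself.
Output: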
$test1 中国
GET /search?hl=zh-CN&q=%d6%d0%b9%fa&start=0&lr= HTTP/1.1
HOST: www.google.com
Connection: Close
"http://news.google.com/news?q=%E4%B8%AD%E5%9B%BD&hl=zh-CN&lr=&ie=UTF-8&newwindow=1&sa=N&tab=nn&oi=newsr"><font color=#C
C0033>中国</font>的相关新闻</a> - <font size=-1><a href="http://news.google.com/nwshp?hl=zh-CN&gl=&oi=newst" s
tyle="color:#7777cc;">今日焦点新闻</a></font><tr><td valign=top width=35><img src=/images/newspaper.gif width=40 height=
30><td valign=top><font size=-1><a href=/url?sa=X&oi=news_zh-CN&start=0&num=3&q=http://www.zaobao.com.sg/special/realtim
e/2005/01/280105_5.html><font color=#CC0033>中国</font>将调查八名人质自行到伊务工事件</a> - <font color=green>联合早报</
font> - 1小时前<br><a href=/url?sa=X&oi=news_zh-CN&start=1&num=3&q=http://www.grrb.com.cn/news/news_detail.asp%3Fnews_id
%3D212090%26type_id%3D13><font color=#CC0033>中国</font>反洗钱行动取得国际突破</a> - <font color=green>工人日报</font> -
1小时前<br><a href=/url?sa=X&oi=news_zh-CN&start=2&num=3&q=http://www.chinanews.com.cn/news/2005/2005-01-28/26/534215.s
html>新玫瑰今夜亮相<font color=#CC0033>中国</font>女足四国赛将首战俄罗斯</a> - <font color=green>中国新闻网</font> - 1小
时前<br></font></td></tr></table><p><p><div><p class=g><a href=http://cn.yahoo.com/
http://www.chinanews.com.cn/
http://www.china.com/
http://www.5460.net/
http://www.moe.edu.cn/
http://www.cctv.com/
http://www.china.org.cn/chinese/
http://www.moh.gov.cn/
Press any key to continue . . .
$test1 "CSDN TNND 挂了" 10
GET /search?hl=zh-CN&q=CSDN%20TNND%20%b9%d2%c1%cb&start=10&lr= HTTP/1.1
HOST: www.google.com
Connection: Close
http://blog.csdn.net/chaoyuebetter/archive/2005/01/02/237633.aspx
http://linuxsir.zahui.net/html/15/
http://www.chinapoet.net/archives/2004/10/
http://foolbear.blogchina.com/
http://foolbear.blogchina.com/blog/category.64736.html
http://blog.blogchina.com/category.64736.html
http://www.oracle.com.cn/viewthread.php?tid=33644
http://borland.mblogger.cn/95927
http://www.devmanclub.com/Search/default.aspx?SearchFor=1&SearchText=newbie
http://expert.csdn.net/Expert/TopicView2.asp?id=3085272
Press any key to continue . . .
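The request lines in the output above show each non-alphanumeric byte of the keyword written as % followed by two hex digits. The sketch below isolates that step; percentEncode is a hypothetical name, and the function simply encodes whatever bytes it is given. In the runs above those bytes appear to be in the console's GBK encoding, while the links Google returns use UTF-8.

```cpp
#include <string>

// Percent-encode every byte that is not an ASCII letter or digit,
// using lowercase hex digits as in the request lines shown above.
std::string percentEncode(const std::string& text)
{
    static const char hexDigits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char ch : text)
    {
        const bool alnum = (ch >= '0' && ch <= '9') ||
                           (ch >= 'A' && ch <= 'Z') ||
                           (ch >= 'a' && ch <= 'z');
        if (alnum)
        {
            out += static_cast<char>(ch);
        }
        else
        {
            out += '%';
            out += hexDigits[ch >> 4];     // high nibble
            out += hexDigits[ch & 0x0F];   // low nibble
        }
    }
    return out;
}
```

Comparing characters against explicit ranges avoids both the locale dependence of isalnum and the undefined behavior of passing it a negative char.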
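Reply #10, lygui (Dream broken on the rooftop), posted 2005-01-28 11:50:08, score 0
Heh, that is already very detailed.
If you want to understand what is going on under the hood and take that first step, the most direct way is to run the search in Google yourself, capture the traffic with a sniffer to see what packets the browser sends, and then write a program that does the same thing. In effect you are emulating the browser.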
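To make the advice above concrete, a captured browser request can be rebuilt as a plain string before it is sent over a socket. This is only a sketch: buildBrowserLikeRequest is a hypothetical name, and the header values are placeholders that would need to be replaced with whatever the sniffer actually shows.

```cpp
#include <string>

// Build an HTTP/1.1 GET request that carries browser-like headers.
// The header values are placeholders; copy the real ones from a capture.
std::string buildBrowserLikeRequest(const std::string& host,
                                    const std::string& pathAndQuery)
{
    std::string req;
    req += "GET " + pathAndQuery + " HTTP/1.1\r\n";
    req += "Host: " + host + "\r\n";
    req += "User-Agent: Mozilla/5.0 (placeholder; use the captured value)\r\n";
    req += "Accept: text/html\r\n";
    req += "Accept-Language: zh-CN\r\n";
    req += "Connection: close\r\n";   // ask the server to close after the response
    req += "\r\n";                    // an empty line ends the header section
    return req;
}
```

For the search in this thread the call might look like buildBrowserLikeRequest("www.google.com", "/search?hl=zh-CN&q=cppunit").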
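Reply #11, zhengwei1984222 (Ashkandi, Greatsword of the Brotherhood), posted 2005-01-28 22:49:14, score 0
I know nothing about network programming.
I am just getting a first taste of it here and will study it next semester.
DiabloWalkOnTheEarth, what library does your #include <cryptlib/socketft.h> come from?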
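Reply #12, DiabloWalkOnTheEarth (I have thought of a truly marvelous nickname, but this space is too small to contain it), posted 2005-01-29 12:10:40, score 0
It is the socket library that ships with cryptlib, and it is fairly easy to use. It is only a thin wrapper around socket, send, and recv.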
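For readers without that library, the same three steps can be sketched directly on the underlying calls. The version below assumes a POSIX system (on Windows the Winsock equivalents differ slightly, for example closesocket instead of close), and the helper names connectTo, sendAll, and recvAll are hypothetical.

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstddef>
#include <string>

// Resolve host and connect over TCP; returns a socket descriptor or -1.
int connectTo(const char* host, const char* port)
{
    addrinfo hints = {};
    hints.ai_family = AF_UNSPEC;       // IPv4 or IPv6
    hints.ai_socktype = SOCK_STREAM;   // TCP
    addrinfo* list = nullptr;
    if (getaddrinfo(host, port, &hints, &list) != 0) return -1;
    int fd = -1;
    for (addrinfo* ai = list; ai != nullptr; ai = ai->ai_next)
    {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) break;
        close(fd);   // this address failed, try the next one
        fd = -1;
    }
    freeaddrinfo(list);
    return fd;
}

// send() may accept fewer bytes than requested, so loop until all are sent.
bool sendAll(int fd, const std::string& data)
{
    std::size_t sent = 0;
    while (sent < data.size())
    {
        ssize_t n = send(fd, data.data() + sent, data.size() - sent, 0);
        if (n <= 0) return false;
        sent += static_cast<std::size_t>(n);
    }
    return true;
}

// Read until the peer closes the connection; returns false on a socket error.
bool recvAll(int fd, std::string& out)
{
    char buf[4096];
    for (;;)
    {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n == 0) return true;    // orderly shutdown by the peer
        if (n < 0) return false;
        out.append(buf, static_cast<std::size_t>(n));
    }
}
```

A download would then be connectTo("www.google.com", "80"), followed by sendAll with the request text and recvAll into a string, with close on the descriptor at the end. Reading until the peer closes relies on the request carrying a Connection: close header.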
Reply #13, UPCC (Omnivore), posted 2005-01-29 14:28:53, score 0
Heh, everyone here is an expert...
Reply #14, DiabloWalkOnTheEarth (I have thought of a truly marvelous nickname, but this space is too small to contain it), posted 2005-01-29 15:52:27, score 0
Using a regular expression is simpler after all.
#include <cryptlib/socketft.h>
#include <boost/regex.hpp>
#include <boost/cregex.hpp>   // declares the legacy RegEx wrapper class used below
#include <cctype>
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;
using namespace CryptoPP;
using namespace boost;

SocketsInitializer initSocketSubSystem;

// Send req to host:port and write the complete response to os.
bool downloadPage( const char* host , unsigned short port , const string& req , ostream& os )
{
    try
    {
        Socket sktConn;
        sktConn.Create( SOCK_STREAM );
        sktConn.Connect( host , port );
        sktConn.Send( (const byte*)req.c_str() , req.length() );
        while( true )
        {
            char buf[ 1024 ];
            size_t sz = sktConn.Receive( (byte*)buf , sizeof( buf ) );
            if( !sz ) break;         // the server closed the connection
            os.write( buf , sz );    // write by length so embedded NUL bytes are kept
        }
    }
    catch( const std::exception& e )
    {
        cerr << "failed : " << e.what() << endl;
        return false;
    }
    return true;
}

int main( int argc , char* argv[] )
{
    const char* const hexchar = "0123456789ABCDEF";
    string key = ( argc >= 2 ? argv[1] : "CSDN" ) , findkey;
    int start = ( argc >= 3 ? atoi( argv[2] ) : 0 );
    // Percent-encode every byte that is not an ASCII letter or digit.
    for( string::iterator it = key.begin(); it != key.end(); ++it )
    {
        unsigned char ch = (unsigned char)*it;
        if( isalnum( ch ) ) findkey += *it;
        else
        {
            findkey += '%';
            findkey += hexchar[ ch >> 4 ];
            findkey += hexchar[ ch & 0x0F ];
        }
    }
    ostringstream os , req;
    // No "ends" here: it would append a NUL byte to the request.
    req << "GET /search?hl=zh-CN&q=" << findkey << "&start=" << start
        << "&lr= HTTP/1.1\r\nHOST: www.google.com\r\nConnection: Close\r\n\r\n";
    if( downloadPage( "www.google.com" , 80 , req.str() , os ) )
    {
        vector<string> result;
        RegEx regEx( "<a href=([^ ]*) target=_blank>" );
        regEx.Grep( result , os.str() );
        for( vector<string>::iterator it = result.begin(); it != result.end(); ++it )
        {
            cout << regEx.Merge( *it , "$1" , false ) << endl;   // print only the captured URL
        }
    }
    system( "pause" );   // Windows only
    return 0;
}
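Reply #15, zhengwei1984222 (Ashkandi, Greatsword of the Brotherhood), posted 2005-01-29 21:12:25, score 0
Now even the Boost library has shown up.
I have not even figured out the STL yet.
I give up~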
Reply #16, tempID005 (tempID005), posted 2005-01-31 09:34:13, score 0
Wow, thank you, DiabloWalkOnTheEarth. I really appreciate it!
Reply #17, lovebanyi (Wind and Clouds), posted 2005-03-11 10:39:58, score 0
It would be even better if someone packaged this as a component...




