2009/01/12

Fetching a URL with curl while simulating a browser

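This script uses PHP's curl extension to simulate a browser request. "E:/www/pachong/cookie.txt" is the file where the cookies are stored. By default curl writes the response to PHP's normal output, which is why the script can capture it with output buffering; you can also redirect the response to a file on disk.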


<?php
set_time_limit(0);
ob_start();

$keyWords = rawurlencode("论语");

$url = 'http://www.amazon.cn/mn/searchApp?fea=layout&ix=sunray&keywords='.$keyWords.'&searchType=&showType=3&sortType=&node=0&searchKind=keyword&uid=168-5083975-3489864';

// Initialize the curl handle
$ch = curl_init();

curl_setopt($ch, CURLOPT_COOKIEJAR, "E:/www/pachong/cookie.txt");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; InfoPath.1; CIBA)");
curl_setopt($ch, CURLOPT_URL, $url);

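// Do not include the response headers in the output
curl_setopt($ch, CURLOPT_HEADER, FALSE);
// Mark this as a new cookie "session"
curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
// Include the HTML body in the output (FALSE means a normal request, not headers only)
curl_setopt($ch, CURLOPT_NOBODY, FALSE);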

// Optional: write the response to a file on disk instead
/*
$outputFile = fopen("E:/www/pachong/contents.txt", 'ab');
curl_setopt($ch, CURLOPT_FILE, $outputFile);
*/

curl_exec($ch);

curl_close($ch);

$content = ob_get_clean();



echo '<pre>';
echo htmlspecialchars($content);
echo '</pre>';
?>
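
As a variation on the script above, here is a minimal sketch that returns the page as a string with CURLOPT_RETURNTRANSFER instead of capturing it through output buffering. The function names buildSearchUrl and fetchPage are hypothetical helpers, not part of the original script. The use of http_build_query, the 30-second timeout, the shortened user agent string, and reading the cookies back with CURLOPT_COOKIEFILE are assumptions added for illustration.

```php
<?php
// Hypothetical helper: build the search URL from the same parameters as the
// script above. PHP_QUERY_RFC3986 encodes values the way rawurlencode() does.
function buildSearchUrl($keyword)
{
    $params = array(
        'fea'        => 'layout',
        'ix'         => 'sunray',
        'keywords'   => $keyword,
        'searchType' => '',
        'showType'   => '3',
        'sortType'   => '',
        'node'       => '0',
        'searchKind' => 'keyword',
        'uid'        => '168-5083975-3489864',
    );
    return 'http://www.amazon.cn/mn/searchApp?'
        . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

// Hypothetical helper: fetch a URL and return the body as a string.
function fetchPage($url, $cookieFile)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,        // curl_exec() returns the body instead of printing it
        CURLOPT_HEADER         => false,       // leave the response headers out
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is closed
        CURLOPT_COOKIEFILE     => $cookieFile, // assumption: also send cookies saved by earlier runs
        CURLOPT_USERAGENT      => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        CURLOPT_TIMEOUT        => 30,          // assumption: give up after 30 seconds
    ));

    $body = curl_exec($ch);
    if ($body === false) {
        // Read the error message before closing the handle.
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException('curl request failed: ' . $error);
    }

    curl_close($ch);
    return $body;
}
```

Calling fetchPage(buildSearchUrl("论语"), "E:/www/pachong/cookie.txt") should then give back the page as a string, which can be passed to htmlspecialchars() in the same way as $content above. Checking for a FALSE return value also makes a failed request visible, where the original script would just print an empty page.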
