My plan for 2011 is to mix in more programming-related posts. I figure the best place to start is a handy PHP function I wrote that uses cURL and has helped me many, many times.
function grab($url, $postparams = '', $cookiefilepath = '', $referer = '')
{
    // Pause between requests so we don't hammer the target site.
    sleep(2);

    // Build browser-like request headers.
    $postheaders = array();
    $url_parts = parse_url($url);
    $postheaders[] = 'Host: ' . $url_parts['host'];
    $postheaders[] = 'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; ' .
        'rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9';
    $postheaders[] = 'Accept: text/xml,application/xml,application/xhtml+xml,text/' .
        'html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
    $postheaders[] = 'Accept-Language: en-us,en;q=0.5';
    $postheaders[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
    $postheaders[] = 'Keep-Alive: 300';
    $postheaders[] = 'Connection: keep-alive';
    if ($referer != '')
    {
        $postheaders[] = "Referer: $referer";
    }
    if ($postparams != '')
    {
        // $postparams is expected to be an already URL-encoded string.
        $postheaders[] = 'Content-Type: application/x-www-form-urlencoded';
        $postheaders[] = 'Content-Length: ' . strlen($postparams);
    }

    $ch = curl_init();
    // Force HTTP/1.1 (this option takes the constant, not a string).
    curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $postheaders);
    if ($postparams != '')
    {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postparams);
    }
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    if ($cookiefilepath != '')
    {
        // Read cookies from, and write them back to, the same file.
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookiefilepath);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefilepath);
    }
    // Return the response body instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $got_page = curl_exec($ch);   // page body, or false on failure
    $headers = curl_getinfo($ch); // transfer info (HTTP code, timings), not the raw response headers
    $error = curl_error($ch);     // empty string when nothing went wrong
    curl_close($ch);

    $retval = array('result' => $got_page, 'headers' => $headers, 'error' => $error);
    return $retval;
}
This function is great for writing crawlers: you can send POST variables, it stores cookies in a cookie jar and sends them back on later requests, it passes a “User-Agent” header so the request looks like it came from a regular browser, and it can send a referrer so you can mimic regular page navigation.
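Here is a rough sketch of how a call might look in practice. The two helpers, `build_post_params()` and `grab_succeeded()`, are hypothetical names I'm introducing for illustration (they are not part of the function above), and the example.com URLs and form field names are placeholders. The sketch assumes `grab()` is already defined and relies on the fact that it returns an array with `result`, `headers` (the `curl_getinfo()` data) and `error`.

```php
<?php
// Hypothetical helper: turn an array of form fields into the URL-encoded
// string that grab() expects in $postparams.
function build_post_params(array $fields)
{
    // Pass '&' explicitly so the separator does not depend on php.ini.
    return http_build_query($fields, '', '&');
}

// Hypothetical helper: treat a grab() result as successful when cURL
// reported no error and the final HTTP status code is in the 2xx range.
function grab_succeeded(array $response)
{
    $code = isset($response['headers']['http_code'])
        ? (int) $response['headers']['http_code']
        : 0;

    return $response['result'] !== false
        && $response['error'] === ''
        && $code >= 200
        && $code < 300;
}

// Usage sketch: runs only when grab() from this post has been defined.
// The URLs and field names below are placeholders, not a real site.
if (function_exists('grab')) {
    $cookies = tempnam(sys_get_temp_dir(), 'grab_');

    // Step 1: POST the login form and let cURL store the session cookie.
    $login = grab(
        'https://example.com/login',
        build_post_params(array('user' => 'alice', 'pass' => 'secret')),
        $cookies
    );

    // Step 2: reuse the cookie jar and send the login page as the referrer.
    if (grab_succeeded($login)) {
        $page = grab('https://example.com/account', '', $cookies, 'https://example.com/login');
        echo grab_succeeded($page) ? $page['result'] : 'Failed: ' . $page['error'];
    }
}
```

Building the POST string with `http_build_query()` means special characters in field values get encoded for you, and checking both `error` and `http_code` catches the case where cURL itself succeeds but the server answers with something like a 404.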