Tuesday, 10 September 2013

How to make web site for screen scraping academic papers [on hold]

I need a web site that screen-scrapes academic papers for VIP members.
Specifically, I need the ability to view and download the PDF version of
papers from academic database sites like IEEE, ScienceDirect and ... It
could be done by routing requests through an academic proxy so they are
authorized correctly.
As a sample, you could see this code:
<html>
<body>
  <div align="right">
    <h5>Enter your database link (IEEE, ScienceDirect or ...) HERE</h5>
  </div>
  <form method="post" action="">
    <div align="right">
      <input type="text" name="ssurl" id="textfield" />
      <input type="submit" value="submit" />
    </div>
  </form>
</body>
</html>
<?php
/**
 * @author nnnnn
 * @copyright 2012
 */
if (isset($_POST['ssurl'])) {
    // Removes $stringToRemove from the end of $string.
    function removeFromEnd($string, $stringToRemove) {
        $pos = strlen($string) - strlen($stringToRemove);
        return substr($string, 0, $pos);
    }
    // Example: removeFromEnd('picture.jpg.jpg', '.jpg') returns 'picture.jpg'

    $url = $_POST['ssurl'];
    echo $url;
    // Example target:
    // $url = 'http://www.sciencedirect.com/science/article/pii/S0301421504000928';

    $ss = substr($url, -4);                  // last four characters, e.g. '.pdf'
    $string = removeFromEnd($url, '.pdf');   // URL without the .pdf suffix

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_PROXY, "203.64.181.50");    // your proxy host
    curl_setopt($ch, CURLOPT_PROXYPORT, "3128");         // your proxy port number
    curl_setopt($ch, CURLOPT_PROXYUSERPWD, "bjm:12345"); // username:password
    curl_setopt($ch, CURLOPT_TIMEOUT, 0);
    //curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_COOKIEJAR, "my_cookies.txt");
    curl_setopt($ch, CURLOPT_COOKIEFILE, "my_cookies.txt");
    curl_setopt($ch, CURLOPT_COOKIESESSION, true);

    if ($ss == '.pdf') {
        // Ask cURL to write the response body straight to a file.
        // Headers must be turned off here or they would corrupt the PDF.
        curl_setopt($ch, CURLOPT_HEADER, 0);
        $fp = fopen('rss5.pdf', 'w+');
        curl_setopt($ch, CURLOPT_FILE, $fp);
        $curl_scraped_page = curl_exec($ch);
        fclose($fp);
    } else {
        $curl_scraped_page = curl_exec($ch);
    }
    /*
    // Alternative without cURL (no proxy support configured):
    $contents = file_get_contents($url);            // read the remote file
    file_put_contents('somelocal.pdf', $contents);  // write the fetched data locally
    */
    echo $curl_scraped_page;
}
?>
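The same fetch-through-an-authenticated-proxy idea can be sketched in Python with only the standard library. This is illustrative, not a drop-in replacement for the PHP above: the proxy host, port, and credentials are placeholders, and `is_pdf_url` simply mirrors the PHP script's last-four-characters suffix check.

```python
import urllib.request


def is_pdf_url(url: str) -> bool:
    """Mirror the PHP script's check: does the URL end in '.pdf'?"""
    return url[-4:].lower() == ".pdf"


def fetch_via_proxy(url: str, proxy: str, out_path: str) -> None:
    """Download `url` through an authenticated HTTP proxy and save it locally.

    `proxy` looks like 'http://user:pass@host:3128' (placeholder values).
    """
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    with opener.open(url, timeout=60) as resp, open(out_path, "wb") as fp:
        fp.write(resp.read())


# Example usage (placeholder proxy credentials, not run here):
# url = "http://www.sciencedirect.com/.../S0301421504000928-main.pdf"
# if is_pdf_url(url):
#     fetch_via_proxy(url, "http://bjm:12345@203.64.181.50:3128", "paper.pdf")
```

As in the PHP version, the body is written straight to disk when the URL looks like a PDF; cookies and redirects would need extra handling.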
I also need download support and more complete code. I found some apps,
such as github.com/kanzure/pyscholar (in Python) and
codeload.github.com/mikexstudios/citeulike-parser/zip/master, but in my
own tests on AMPPS they don't work correctly.



THIS IDEA works at paper.paperdl.com. At that site, you must sign up,
then submit a paper's link and get its download link. The website uses a
modified version of the rapidleech36 mechanism, via the file
paper.paperdl.com/process.php (unfortunately we can't see its code). So
I would like to know how it works?!
Regards
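Since process.php itself is not visible, one can only guess at its flow: validate the submitted URL, fetch the file through an authorized proxy, store it server-side, and hand back an opaque local download link. A minimal sketch of the link-mapping step is below; the function names and the token scheme are my own assumptions, not paperdl's actual code, and the proxy-fetch step is omitted.

```python
import hashlib
import os


def token_for(url: str) -> str:
    """Derive a stable, opaque download token from the submitted paper URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]


def make_download_link(url: str, store_dir: str = "downloads") -> str:
    """Map a submitted paper URL to the local path the user downloads from.

    A real service would fetch the PDF through its proxy and write it to
    this path before returning the link; that step is left out here.
    """
    os.makedirs(store_dir, exist_ok=True)
    return os.path.join(store_dir, token_for(url) + ".pdf")
```

The point of the hash token is that users never see the original publisher URL or the proxy credentials, only a link the service controls.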
