Despite all the advancements in web APIs and interoperability, it's inevitable that, at some point in your career, you will have to "scrape" content from a website that was not built with web services in mind. And, despite its sometimes less-than-stellar reputation, web scraping is usually an entirely legitimate activity: for example, capturing data from an old version of a website for insertion into a modern CMS. This book, written by scraping expert Matthew Turland, covers web scraping techniques and topics that range from the simple to the exotic, using a variety of technologies and frameworks:
Understanding HTTP requests
The PHP HTTP streams wrapper
cURL
pecl_http
PEAR:HTTP
Zend_Http_Client
Building your own scraping library
Using Tidy
Analyzing code with the DOM, SimpleXML and XMLReader extensions
CSS selector libraries
PCRE pattern matching
Tips and Tricks
Multiprocessing / parallel processing
A Guide to Developing Internet Agents with PHP/CURL
The Internet is bigger and better than what a mere browser allows. Webbots, Spiders, and Screen Scrapers is for programmers and businesspeople who want to take full advantage of the vast resources available on the Web. There's no reason to let browsers limit your online experience, especially when you can easily automate online tasks to suit your individual needs.
Learn how to write webbots and spiders that do all this and more:
Programmatically download entire websites
Effectively parse data from web pages
Decode encrypted files
Automate form submissions
Send and receive email
Send SMS alerts to your cell phone
Unlock password-protected websites
Automatically bid in online auctions
Exchange data with FTP and NNTP servers
Sample projects using standard code libraries reinforce these new skills. You'll learn how to create your own webbots and spiders that track online prices, aggregate different data sources into a single web page, and archive the online data you just can't live without. You'll learn inside information from an experienced webbot developer on how and when to write stealthy webbots that mimic human behavior, tips for developing fault-tolerant designs, and various methods for launching and scheduling webbots. You'll also get advice on how to write webbots and spiders that respect website owner property rights, plus techniques for shielding websites from unwanted robots.
As a bonus, visit the author's website to test your webbots on sample target pages, and to download the scripts and code libraries used in the book.
Some tasks are just too tedious (or too important!) to leave to humans. Once you've automated your online life, you'll never let a browser limit the way you use the Internet again.