Setting up a dev environment
To gather data for Open Recipes, we are building spiders based on Scrapy, a web scraping framework written in Python. We are using Scrapy v0.16 at the moment. To contribute spiders for sites, you should have basic familiarity with:
- Python
- Git
- HTML and/or XML
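
To give you a feel for what you will be writing, here is a minimal sketch of a Scrapy 0.16-style spider. The domain, URL, and XPath expression are made-up placeholders for illustration, not a real spider from this repo — the actual spiders live in the Scrapy project directory you will set up below.

```python
# A minimal sketch of a Scrapy 0.16-style spider. The domain, URL, and
# XPath below are hypothetical placeholders, not a spider from this repo.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class ExampleSpider(BaseSpider):
    name = "example.feed"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/recipes"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # log each recipe title found on the page
        for title in hxs.select("//h2/a/text()").extract():
            self.log("found recipe: %s" % title)
```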
Note: this is strongly biased towards OS X. Feel free to contribute instructions for other operating systems.
To get things going, you will need the following tools:
- Python 2.7 (including headers)
- Git
- pip
- virtualenv
You will probably already have the first two, although you may need to install the Python headers on Linux with something like `apt-get install python-dev`.
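
If you are not sure whether the headers are installed, here is a quick sketch of a check using the standard-library `sysconfig` module (available in Python 2.7): it just looks for `Python.h` in the interpreter's include directory.

```python
# A quick sketch: check for the Python development headers by looking
# for Python.h in this interpreter's include directory.
import os
import sysconfig

include_dir = sysconfig.get_paths()["include"]
if os.path.exists(os.path.join(include_dir, "Python.h")):
    print("Python headers found in %s" % include_dir)
else:
    print("No Python.h in %s -- install the python-dev package" % include_dir)
```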
If you don't have pip, follow the installation instructions in the pip docs. Then you can install virtualenv with `pip install virtualenv`.
Once you have pip and virtualenv, you can clone our repo and install requirements with the following steps:
- Open a terminal and `cd` to the directory that will contain your repo clone. For these instructions, we'll assume you `cd ~/src`.
- `git clone https://github.com/fictivekin/openrecipes.git` to clone the repo. This will make a `~/src/openrecipes` directory that contains your local repo.
- `cd ./openrecipes` to move into the newly-cloned repo.
- `virtualenv --no-site-packages venv` to create a Python virtual environment inside `~/src/openrecipes/venv`.
- `source venv/bin/activate` to activate your new Python virtual environment.
- `pip install -r requirements.txt` to install the required Python libraries, including Scrapy.
- `scrapy -h` to confirm that the `scrapy` command was installed. You should get a dump of the help docs.
- `cd scrapy_proj/openrecipes` to move into the Scrapy project directory.
- `cp settings.py.default settings.py` to set up a working settings module for the project.
- `scrapy crawl thepioneerwoman.feed` to test the feed spider written for thepioneerwoman.com. You should get output like the following:

```
2013-03-30 14:35:37-0400 [scrapy] INFO: Scrapy 0.16.4 started (bot: openrecipes)
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled item pipelines: MakestringsPipeline, DuplicaterecipePipeline
2013-03-30 14:35:37-0400 [thepioneerwoman.feed] INFO: Spider opened
2013-03-30 14:35:37-0400 [thepioneerwoman.feed] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-30 14:35:38-0400 [thepioneerwoman.feed] DEBUG: Crawled (200) (referer: None)
2013-03-30 14:35:38-0400 [thepioneerwoman.feed] DEBUG: Crawled (200) (referer: http://feeds.feedburner.com/pwcooks)
...
```
If you do, baby you got a stew going!
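
For reference, the enabled item pipelines in the log above come from the settings module you copied with `cp settings.py.default settings.py`. Here is a rough sketch of what such a Scrapy 0.16 settings module might contain — the bot name and pipeline class names match the log output, but the module paths are assumptions based on standard Scrapy layout, so check `settings.py.default` in the repo for the real thing.

```python
# A rough sketch of a Scrapy 0.16 settings module. The bot name and
# pipeline class names match the log output above; the module paths are
# assumptions -- see settings.py.default in the repo for the real values.
BOT_NAME = 'openrecipes'

SPIDER_MODULES = ['openrecipes.spiders']
NEWSPIDER_MODULE = 'openrecipes.spiders'

# Scrapy 0.16 takes ITEM_PIPELINES as a list (later versions use a dict)
ITEM_PIPELINES = [
    'openrecipes.pipelines.MakestringsPipeline',
    'openrecipes.pipelines.DuplicaterecipePipeline',
]
```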