Put your GitHub data in the Elastic Stack.
- Linux/OSX
- cURL
- Logstash locally
- Elasticsearch, locally or in the cloud
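A quick way to check the prerequisites are in place (the `localhost:9200` endpoint is a placeholder; substitute your Elastic Cloud URL and credentials if you run in the cloud):

```sh
curl --version                 # cURL available?
logstash --version             # Logstash on the PATH? (often run as bin/logstash instead)
curl -s http://localhost:9200  # Elasticsearch reachable? (placeholder endpoint)
```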
- Edit `logstash.conf` and `run.sh`. Fill in the Elasticsearch host, port, and optionally authentication details. In `run.sh`, also fill in `PATTERN`, which describes which orgs/repos you want to store in Elasticsearch (there are examples in the file; see also the sketches below)
- Run `./download.sh bulk` to download all GitHub archives of 2015-2017, and/or
- Run `./download.sh YYYY-MM-DD` to download all GitHub archives of a single day (you can use this for daily updates to an existing index)
- At any time, run `./run.sh` to find the GitHub org/repo you're interested in and export it to Elasticsearch
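As a sketch of those steps, assuming `PATTERN` is a regular expression matched against `org/repo` names (the actual examples live in `run.sh`; these values are illustrative, not the repo's defaults):

```sh
# In run.sh -- pick the orgs/repos you want to keep:
PATTERN="elastic/(elasticsearch|logstash|kibana)"

# Typical first run:
./download.sh bulk   # fetch the 2015-2017 archives once
./run.sh             # filter by PATTERN and export to Elasticsearch
```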
Tested on OSX and on Elastic Cloud, against Elasticsearch, Logstash, and Kibana 5.1-5.6 with X-Pack Security enabled.
Run daily:

```
0 5 * * * cd ~/githubarchive-elastic; ./download.sh; ./run.sh;
```

Or, run hourly:

```
0 * * * * cd ~/githubarchive-elastic; ./download.sh; ./run.sh;
```
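If you want a record of what the scheduled runs did, a variant like this captures output to a log file (the log path is illustrative):

```
0 5 * * * cd ~/githubarchive-elastic && ./download.sh >> cron.log 2>&1 && ./run.sh >> cron.log 2>&1
```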
Note that `download.sh` downloads yesterday's data once per day, so you have to make sure that yesterday's data is complete, and allow some time for possible late delivery; 05:00 UTC is usually late enough. If you want more certainty, adjust `download.sh` to process more than just yesterday's data, or run the cron job more often. Existing data is not re-downloaded, so you're free to run it as often as you like. The pipeline also sets the Elasticsearch document ID explicitly, so records ingested multiple times simply overwrite the existing documents (see the sketch below).
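The Logstash `elasticsearch` output supports an explicit `document_id`, which is what makes re-ingestion overwrite rather than duplicate. A minimal sketch, assuming the GitHub event's `id` field is used as the document ID (check the `logstash.conf` shipped with this repo for the actual field, index name, and credentials; the values here also illustrate the host/port/auth fields from the setup step above):

```conf
output {
  elasticsearch {
    hosts       => ["localhost:9200"]   # placeholder: your host and port
    user        => "elastic"            # optional, when X-Pack Security is on
    password    => "changeme"
    index       => "githubarchive"      # illustrative index name
    document_id => "%{id}"              # GitHub event id: re-ingestion overwrites, not duplicates
  }
}
```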