Commit 0122992

Merge pull request #136 from VertNet/develop
Merge the active develop branch back into master
2 parents d93b367 + cf1813c commit 0122992

116 files changed, with 40744 additions and 169 deletions.

.gitignore

Lines changed: 12 additions & 0 deletions
@@ -1,7 +1,19 @@
+vertnet.pem
 pom.xml
 *jar
 /lib/
 /classes/
 .lein-deps-sum
 .lein-plugins
 rm-dwca-reader-clj-jars.sh
+creds.json
+s3.json
+aws.json
+target/
+#*.*
+*.*~
+*sublime*
+\#*.*\#
+.lein*
+.nrepl*
+.DS_Store

README.md

Lines changed: 53 additions & 3 deletions
@@ -1,4 +1,54 @@
-gulo
-====
+# What is Gulo?
 
-Shredding Darwin Core Archives with ferocity, strength, and Cascalog.
+![](http://3.bp.blogspot.com/-s1vAPdg_zZM/TZ3bnzUZgVI/AAAAAAAACKo/Mk-Tu-Nil74/s1600/animalangry.jpg)
+
+Gulo is the genus for wolverine, the biggest land-dwelling species of weasel on the planet. It is a stocky and muscular carnivore, resembling a small bear. The wolverine has a reputation for endurance, ferocity, and strength out of proportion to its size, with the capacity to battle with competitors many times its size.
+
+Gulo is also a VertNet project designed for harvesting Darwin Core Archives, shredding them into small pieces, and loading them into [CartoDB](http://cartodb.com). It's written in the [Clojure](http://clojure.org) programming language and rides on [Cascading](http://www.cascading.org) and [Cascalog](https://github.com/nathanmarz/cascalog) for processing "Big Data" on top of [Hadoop](http://hadoop.apache.org) using [MapReduce](http://research.google.com/archive/mapreduce.html).
+
+# Developing
+## AWS credentials
+
+Running Gulo queries with Elastic MapReduce requires adding the following to the file `credentials.json` in the project root:
+
+```json
+{
+"access-id": "your_aws_access_id",
+"private-key":"your_aws_private_key",
+"key-pair-file":"~/.ssh/vertnet.pem",
+"key-pair":"vertnet"
+}
+```
+
+Working with the `gulo.cdb` namespace requires this to be stored in `resources/aws.json`:
+
+```json
+{
+"access-id": "your_aws_access_id",
+"secret-key": "your_aws_private_key"
+}
+```
+
+## CartoDB OAuth credentials
+
+Gulo depends on an authenticated connection to CartoDB. This requires adding the following file in `resources/creds.json`:
+
+```json
+{
+"key": "your_cartodb_oauth_key",
+"secret": "your_cartodb_oauth_secret",
+"user": "your_cartodb_username",
+"password": "your_cartodb_password"
+}
+```
+
+## Dependencies
+
+For adding BOM bytes to UTF-8 files, so that CartoDB can detect the encoding, we use the `uconv` program which can be installed on Ubuntu like this:
+
+```bash
+$ sudo apt-get install apt-file
+$ sudo apt-file update
+$ apt-file search bin/uconv
+$ sudo apt-get install libicu-dev
+```
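
Reviewer note on the README's Dependencies section: the `uconv` step is there to put a UTF-8 byte order mark (the bytes `EF BB BF`) at the start of a file so that CartoDB can detect the encoding. To show what that means at the byte level, here is a minimal sketch that uses only `printf`, `head`, `od`, and `cat` instead of `uconv`. The `add_bom` helper and the file names in the usage line are hypothetical and not part of Gulo. ICU's `uconv` documents an `--add-signature` option for the same purpose, which may be the more natural choice where `uconv` is installed.

```shell
# add_bom SRC DST  (hypothetical helper, not part of Gulo)
# Copies SRC to DST, prepending the UTF-8 BOM (EF BB BF) unless SRC already starts with it.
add_bom() {
    src="$1"
    dst="$2"
    # Hex dump of the first three bytes, with od's spacing and newline removed.
    first=$(head -c 3 "$src" | od -An -tx1 | tr -d ' \n')
    if [ "$first" = "efbbbf" ]; then
        # BOM already present: plain copy, so running the helper twice is harmless.
        cp "$src" "$dst"
    else
        # Octal escapes \357\273\277 are the bytes EF BB BF.
        { printf '\357\273\277'; cat "$src"; } > "$dst"
    fi
}
```

Usage might look like `add_bom occurrences.csv occurrences-bom.csv`. Checking for an existing BOM first keeps the operation idempotent, which matters if the same file can pass through the step more than once.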

dev/bootstrap.sh

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+# configure EMR cluster for use with VertNet projects
+# put on S3 at s3://vnproject/bootstrap-actions/gulo/bootstrap.sh
+
+# install some helpful utilities
+sudo apt-get update
+sudo apt-get install -y screen s3cmd zip unzip
+
+# Setup for git
+git config --global user.name "Whizbang Systems"
+git config --global user.email "admin@whizbangsystems.net"
+
+# generate ssh key
+ssh-keygen -t rsa -N "" -f /home/hadoop/.ssh/id_rsa -C "admin@whizbangsystems.net"
+sudo chmod 600 /home/hadoop/.ssh/id_rsa  # ssh refuses private keys readable by group/others
+
+# Add github to known_hosts
+echo "github.com,207.97.227.239 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ==" >> /home/hadoop/.ssh/known_hosts
+
+
+# simple leiningen install via 'li'
+echo "alias li='cd /home/hadoop/bin; wget https://raw.github.com/technomancy/leiningen/stable/bin/lein; chmod u+x lein; ./lein; cd /home/hadoop;'" >> /home/hadoop/.bashrc
+
+# simple uberjarring
+echo "alias uj='lein do deps, compile :all, uberjar'" >> /home/hadoop/.bashrc
+
+# simple installs & configs
+echo "alias gulo='git clone git://github.com/VertNet/gulo.git'" >> /home/hadoop/.bashrc
+echo "alias teratorn='git clone git://github.com/MapofLife/teratorn.git'" >> /home/hadoop/.bashrc
+
+echo "alias dl='wget https://gist.github.com/robinkraft/5666682/download'" >> /home/hadoop/.bashrc

dev/ec2-bootstrap.sh

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+# Run this script to configure an instance for harvesting and bulkloading.
+
+# install a few things
+
+sudo apt-get update
+sudo apt-get -y install screen zip unzip git sqlite3
+wget http://s3tools.org/repo/deb-all/stable/s3cmd_1.0.0.orig.tar.gz
+tar -xvf s3cmd_1.0.0.orig.tar.gz
+cd s3cmd-1.0.0
+sudo python setup.py install
+cd
+
+# Setup for git
+git config --global user.name "David Bloom"
+git config --global user.email "dbloom@vertnet.org"
+
+# generate ssh key
+ssh-keygen -t rsa -N "" -f /home/$USER/.ssh/id_rsa -C "dbloom@vertnet.org"
+sudo chmod 600 /home/$USER/.ssh/id_rsa  # ssh refuses private keys readable by group/others
+
+# Add github to known_hosts
+echo "github.com,207.97.227.239 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ==" >> /home/$USER/.ssh/known_hosts
+
+# install Java
+sudo apt-get -y install openjdk-7-jre
+sudo apt-get -y install openjdk-7-jdk
+
+# make ~/bin directory, add to PATH
+mkdir ~/bin
+echo "export PATH=/home/$USER/bin:${PATH}" >> ~/.bashrc
+
+# install lein
+cd ~/bin
+wget https://raw.github.com/technomancy/leiningen/stable/bin/lein
+chmod u+x lein
+./lein
+cd ~/
+
+# install app engine sdk
+cd bin
+wget http://googleappengine.googlecode.com/files/google_appengine_1.8.0.zip
+unzip google_appengine_1.8.0.zip
+echo "export PATH=/home/$USER/bin/google_appengine:${PATH}" >> ~/.bashrc
+cd
+
+# simple uberjarring via uj command
+echo "alias uj='lein do deps, compile :all, uberjar'" >> /home/$USER/.bashrc
+
+# clone projects
+git clone git://github.com/VertNet/gulo.git
+git clone git://github.com/VertNet/webapp.git
+
+# configure EBS volume
+sudo mkfs -t ext3 /dev/xvdb
+sudo mkdir /mnt/beast
+sudo mount /dev/xvdb /mnt/beast
+sudo chown $USER:$USER /mnt/beast
+
+# configure credentials
+
+echo "Configuring CartoDB. Please have your credentials ready and press 'enter' to continue."
+read na
+echo "Oauth key:"
+read OAUTH_KEY
+echo "Oauth secret:"
+read OAUTH_SECRET
+echo "Username:"
+read USERNAME
+echo "Password:"
+read CDB_PASSWORD
+echo "API key:"
+read API_KEY
+
+echo "{
+\"key\": \"$OAUTH_KEY\",
+\"secret\": \"$OAUTH_SECRET\",
+\"user\": \"$USERNAME\",
+\"password\": \"$CDB_PASSWORD\",
+\"api_key\": \"$API_KEY\"
+}" > ~/gulo/resources/creds.json
+
+echo "Configuring AWS. Please have your credentials ready and press 'enter' to continue. Note that backslashes in your AWS credentials may cause errors."
+read na
+echo "Access key:"
+read ACCESS_ID
+echo
+echo "Secret key:"
+read SECRET_KEY
+echo
+
+echo "{
+\"access-id\": \"$ACCESS_ID\",
+\"secret-key\": \"$SECRET_KEY\"
+}" > ~/gulo/resources/aws.json
+
+echo "Keep those AWS credentials handy for configuring s3cmd. Press 'enter' to continue"
+read na
+s3cmd --configure
+
+# configure app engine credentials
+
+echo "Please enter your App Engine email address: "
+read EMAIL
+echo "export EMAIL=$EMAIL" >> ~/.bashrc
+
+echo "Please enter your App Engine password: "
+read GAE_PASSWORD
+echo "export GAE_PASSWORD=$GAE_PASSWORD" >> ~/.bashrc
+echo "Credentials are now set up."
+
+echo "Instance configured - go have a beer to celebrate!"
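
Reviewer note on the credential prompts in `dev/ec2-bootstrap.sh`: the script itself warns that backslashes in AWS credentials may cause errors. Two things plausibly contribute. A plain `read` treats a backslash as an escape character and drops it (`read -r` disables that), and a value containing `"` or `\` produces invalid JSON when it is pasted between quotes unescaped. The sketch below shows one way to handle the second problem. The `json_escape` and `write_aws_json` helpers are hypothetical and not part of the commit, and only backslashes and double quotes are escaped, so control characters are out of scope.

```shell
# json_escape STRING  (hypothetical helper)
# Prints STRING with backslashes and double quotes escaped for use inside a JSON string.
json_escape() {
    # Escape backslashes first so the backslashes added for quotes are not doubled again.
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# write_aws_json ACCESS_ID SECRET_KEY FILE  (hypothetical helper)
# Writes the two-field layout that the script stores in resources/aws.json.
write_aws_json() {
    printf '{\n  "access-id": "%s",\n  "secret-key": "%s"\n}\n' \
        "$(json_escape "$1")" "$(json_escape "$2")" > "$3"
}
```

A caller would pair this with `read -r ACCESS_ID` and `read -r SECRET_KEY`, then run `write_aws_json "$ACCESS_ID" "$SECRET_KEY" ~/gulo/resources/aws.json`. Passing the values as `%s` arguments keeps `printf` from interpreting any backslashes they contain.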

dev/genthrift.sh

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+#!/bin/sh
+
+# Generates Java code from the gulo.thrift DSL. Depends on the Apache Thrift compiler.
+
+rm -rf ../src/jvm/gen-java
+rm -rf ../src/jvm/gulo/schema/*
+thrift -o "../src/jvm" -r --gen java:hashcode gulo.thrift
+mv ../src/jvm/gen-java/gulo/schema ../src/jvm/gulo
+rm -rf ../src/jvm/gen-java
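
Reviewer note on `dev/genthrift.sh`: every path in the script is relative (`../src/jvm`, `gulo.thrift`), so it only does the right thing when run from inside `dev/`. Because it also runs `rm -rf` on relative paths, starting it from another directory could delete the wrong tree. A minimal sketch of one way to anchor the paths to the script's own location follows. The `script_dir` helper is hypothetical, and the commented usage assumes `gulo.thrift` sits next to the script, which the commit does not confirm.

```shell
# script_dir PATH  (hypothetical helper)
# Prints the absolute directory containing PATH. The cd happens in a subshell,
# so the caller's working directory is left unchanged.
script_dir() {
    ( cd "$(dirname "$1")" && pwd )
}

# Sketch of use inside a script such as dev/genthrift.sh
# (assumes gulo.thrift lives in the same directory as the script):
#   here=$(script_dir "$0")
#   thrift -o "$here/../src/jvm" -r --gen java:hashcode "$here/gulo.thrift"
```

Resolving the directory once and prefixing each path with it keeps the commands the same while removing the dependence on the caller's working directory.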
