Blog Archives

Open Source Relational Database System

Overview of SQLite benefits including compatibility with R, Python, and ability to work with converted Microsoft Access databases.

Light, Open Source RDBMS SQLite

Tanner Jessel Spring 2014 IS 592 – Big Data Analytics School of Information Sciences College of Communication & Information

What is SQLite?

• Big Data in a Little Package – Portable, No Server – Writes directly to common media – Cross-platform (Mac, Windows, Linux)
• Portable – Small Gadgets (Windows Phone, Android, iOS) • (500 KB, 300 KB) • Embeddable – Computer Applications (Chrome, Firefox, Skype) • Single file database – Triggers – Tables – Indices – Views

http://en.wikipedia.org/wiki/SQLite

What it is not…
• Not a full database application – No forms – No reports – No saved queries

Why should you be interested?
*Free*

• No Server (no mess)

• Command Line Interface

• GUI Interface

• Works with R

• Works with Python

• Good for students to learn advanced SQL queries, command line interaction

• Good for small database projects

RSQLite
Alter, analyze, attatch, begin, commit, create (table, trigger, view, virtual table), delete, select, insert… SQLite Query Language “Mostly” SQL

Database Interface R driver for SQLite. This package embeds the SQLite database engine in R and provides an interface compliant with the DBI package. The source for the SQLite engine (version 3.7.17) is included.” –Comprehensive R Archive Network (CRAN) http://cran.r-project.org/web/packages/RSQLite/ 
Sqlite3 http://www.sqlite.org/cli.html

Command Line Shell

$ sqlite3 myexampledb SQLite version 3.8.4 2014-02-11 16:24:34 Enter ".help" for usage hints. sqlite> CREATE table tbl1(one varchar(10), two smallint); sqlite> INSERT into tbl1 values('hello!',10); sqlite> INSERT into tbl1 values('goodbye', 20); sqlite> SELECT * from tbl1; hello!|10 goodbye|20 sqlite>

*Free* • Feature Rich GUI • Cross-platform – Windows – Mac – Linux – Solaris • http://sqlitestudio.pl/ SQLite Studio
10. • Rstudio Console

> install.packages("RSQLite") > library(RSQLite) > dbDIR <- "/Users/apple/Documents/IS592-Big- Data-Analytics/SQLite-Presentation" > dbFileName <- paste(dbDIR,"classHOBOData2013.db3", sep="/") >drv <- dbDriver("SQLite”) >con <- dbConnect(drv, dbname = dbFileName)

Connect to Data
• Rstudio Console

> allDf <- dbGetQuery(con, "select * from HoboTable ") >View(allDf) Load Data to Frame

Pendant monitor data from SQLite database in data frame SQLite Table Data in R Studio 15,186 rows

>summary(allDf) obsNum serialNo recNo dateTime temperatureC Min. : 1 Length:15186 Min. : 1.0 Length:15186 Min. :14.42 1st Qu.: 3797 Class :character 1st Qu.: 422.0 Class :character 1st Qu.:23.58 Median : 7594 Mode :character Median : 844.0 Mode :character Median :23.87 Mean : 7594 Mean : 844.5 Mean :26.43 3rd Qu.:11390 3rd Qu.:1266.0 3rd Qu.:25.12 Max. :15186 Max. :1722.0 Max. :59.81 intensityLight Min. : 0.0 1st Qu.: 0.0 Median : 10.8 Mean : 2524.8 3rd Qu.: 21.5 Max. :231468.2

Pendant monitor data from SQLite database in plot

>plot(allDf$temperatureC) Temperature Data
Rstudio Console > temperatureF <- (allDf$temperatureC * (9/5.0))+32.0 Transformation C to F
Rstudio Console > temperatureF <- (allDf$temperatureC * (9/5.0))+32.0 > newDf <- cbind(allDf,temperatureF) > dbWriteTable(con, "newTable", newDf) Update Data Frame

Pendant data .db3 file now has two tables
SQLite Table Data Modified by R 15,186 rows, new TemperatureF column added

>view(hotDf) serialNo dateTime temperatureC 1 10081427 2013-06-04 12:32:30 37.824 2 10081427 2013-06-04 12:32:40 37.935 3 10081427 2013-06-04 12:32:50 37.935 4 10081427 2013-06-04 12:33:00 37.935 5 10081427 2013-06-04 12:33:10 38.046 6 10081427 2013-06-04 12:33:20 38.046 7 10081427 2013-06-04 12:33:30 38.490 8 10081427 2013-06-04 12:33:40 38.490 9 10081427 2013-06-04 12:33:50 38.602

10081427 2013-06-04 12:34:00 38.602

Quality Assurance / Control • Import SQLite3 import sqlite3 conn=sqlite3.connect(‘/Users/apple/D ocuments/IS592-Big-Data- Analytics/SQLite- Presentation/classHOBOData2013.db 3’) c=conn.cursor()

Quality Assurance / Control • Find, list “errors” • Whatever you define them to be • > 100 F in Antarctica, for example c.execute(‘SELECT obsNum,serialNo,temperatureF FROM newTable WHERE temperatureF > 100 OR temperatureF < 10’)

Quality Assurance / Control • Check for Errors • Nice printed messsage with your selected “errors”
python QAQCdemo.py

#Fetch all the query results myListOfNames=c.fetchall() # print them out print print("Range Errors for Temperature:") for myTuple in myListOfNames: print("obsNum: "+str(myTuple[0])+" SerialNo: "+str(myTuple[1])+" Temperature: "+str(myTuple[2])) conn.commit() conn.close()
'SELECT obsNum,serialNo,temperatureF FROM newTable WHERE temperatureF &gt; 100 OR temperatureF &lt; 10' &gt;100 F: SQLite/Python QAQC 1,538 records where the temperature was over 100 F

Access SQLite Convert Access to SQLite

https://code.google.com/p/mdb-sqlite/

http://convertdb.com/access/sqlite

https://code.google.com/p/access2sqlite/

http://mdbtools.sourceforge.net/

http://sites.fastspring.com/eggerapps/product/mdbaccdbviewer

Further reading http://www.sqlite.org/books.html

 

Options for Open Source, Public Access for ATBI Data

When I worked for the NBII we used Rackspace and Amazon’s cloud service, along with MySQL for the “Species of Greatest Conservation Need” database that held a similar number of species, but no spatial data.
Since your data has a spatial component, something like PostgreSQL or SQLite might be worthwhile to look at. I took a class in environmental information management last summer and was introduced to SQLite – which apparently has some advantages over databases created with Access, including native support for GIS applications like GRASS and QGIS.
SQLite works with a CMS called Django, Scott Simmerman suggested I look at that for improvements on the ATBI mapping project’s Web interface.  There some free hosting services for small or dev projects:https://wiki.python.org/moin/FreeHosts Might be worthwhile to look at that. Another “free” hosting site here: https://www.pythonanywhere.com
Finally Tom Colson mentioned something he’s working on with GSMIT – the Otter Spotter. Tom said Google Earth Engine might be worthwhile looking at for that.  Google offers grants and hosting of data for non-profits to use Earth Engine, so I’m curious how much of the ATBI database data might be translatable to KML.
When I worked for the NBII I was interested in serving up spatial data in KML files – I like them because you can open KML in robust GIS packages along with popular virtual globe tools like Google Earth or ArcGIS Explorer – things the everyday person has access to.
Even if not all the data can translate, it’s another “view” of the ATBI data that’s pretty useful for outreach, if not doing actual science.
I will explore at this issue as I have time and hopefully help as I am able.  I am a student of RDBMS and the ATBI is one of the more interesting datasets to learn with.

SDM re-projected for Google Earth, OSM with gdal2tile

I think I have stumbled upon the solution for tiling the png image.
To re-project the PNG image, we can geo-reference the images with GDAL, then warp the geo-referenced image to the correct projection, also with GDAL.
The process is described here:
First, enter “gdalinfo Abies_fraseri.png” into terminal to get the bounds of the PNG image.
This yields the following output:
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0, 1302.0)
Upper Right ( 2899.0,    0.0)
Lower Right ( 2899.0, 1302.0)
Center      ( 1449.5,  651.0)
A template and implementation for our PNG files is demonstrated here:
Template:
gdal_translate -of VRT -a_srs EPSG:4326 -gcp 0 0 ULlong ULlat -gcp UPPERRIGHTPx 0 URlong URlat -gcp LOWERRIGHTPx LOWERRIGHTPy LRlong LRlat Abies_fraseri.png Abies_fraseri.vrt
Implementation: 
gdal_translate -of VRT -a_srs EPSG:4326 -gcp 0 0 -84.000683874 35.7889383688 -gcp 2899.0 0 -83.0424855 35.7889383688 -gcp 2899.0 1302.0 -83.0424855 35.426963641 Abies_fraseri.png Abies_fraseri.vrt

For “ULlong” (Upper Left Long) and so forth I used a bounding box tool (http://boundingbox.klokantech.com) to determine the following bounds of the PNG – however, if there is a more “official” known boundary, it would be wise to use that instead.
                                35.426963641,-84.000683874 – bottom left / lower left
                                35.426963641,-83.0424855 bottom right / lower right
                                35.7889383688,-83.0424855 – top right / upper right
                                35.7889383688,-84.000683874 – top left / upper left
Next, take the .vrt file and warp it:
    gdalwarp -of VRT -t_srs EPSG:4326 Abies_fraseri.vrt Abies_fraseri_2.vrt
This creates a folder with tiles for a KML network link.
see “doc.kml” in the attached zip folder which you may open in Google Earth.
There is also an Open Map Layers HTML page in the attached zip folder.
To add trails in Google Earth
My opinion is it would be more interesting and informative to have the species occurrences added as a separate layer.  Therefore, I would prefer to create placemarkers / waypoints for species locations rather than black squares on the images.  Is it possible to generate PNG images with the likely distribution but not the black squares?  If so, would that take a long time to re-do to make these tiled overlays?
Also, although I “hacked” the PNG bounds, I’m still worried it the overlay’s georeferencing be “off” a bit and would love to use a more trustworthy set of coordinates than those derived from my best judgement and http://boundingbox.klokantech.com.
From here, I think it would be smart to document a workflow for doing a lot of PNG to VRT to KML at a time.  I haven’t tried processing a whole directory of images at once yet – just playing with the Abies_fraseri.png for now.
The doc.kml and virtual raster dataset work with Google Map API and Open Layers map.  I think they both run on Javascript.

Extraction, Transform and Load

A friend of mine in the program commented that about 90% of the time doing data science is obtaining and cleaning data.

This is where programming is incredibly useful.  In the second year of my Masters program, my programming skills are not yet at the level that I want them to be.

I recently started some work for my research assistantship concerning Twitter data for @DataONEorg.

I’m interested in the content of posts, and the relationships between the actors in the network.

In terms of content, I’d like to look at the hashtags and links.

To illustrate how difficult it is to accomplish tasks “by hand,” I recently tried to the twitter data from a free site.  My efforts are documented here: <https://notebooks.dataone.org/data-science/harvesting-dataoneorg-twitter-mentions-via-topsy/>.

I’ve read that employers should not hire a “data scientist” if the so-called “scientist” does not have programming skills.  For this reason, I’m disappointed that the School of Information Science does not offer a programming course within the School itself.  (I’ve heard Dr. Potnis will offer a course in Fall 2014, a semester after my graduation).

I enrolled in a programming course in the College of Engineering and Computer Science – Introduction to Programming for Scientists and Engineers.  The course focuses on C++ language.  This is unfortunate, as python is increasingly favored over C++.  This means more ready-made programs are available, and a user community is growing. Content management systems are even building up around python.

Python is used by a friend of mine who does genome science.  C++ is useful for taking advantage of parallelism, but that my friend who works on supercomputers uses python suggests to me that python works as well.

Programming language popularity.

http://venturebeat.com/2014/02/03/the-most-popular-coding-languages-for-2014-are/

Further reading:

Python Displacing R As The Programming Language For Data Science by @mjasay http://readwr.it/c1ew  

http://www.scipy.org/

 

 

GIS Training / ESRI Online Courses of Interest

With the upcoming winter break, I noticed UT offers some self-paced virtual training and has a course list online at <http://www.lib.utk.edu/map/gis-services/additional-support/>

Looking over the courses offered, I noticed a few I’m interested in that could be useful for my career goals as a data manager for a natural resources organization.

I’m not sure how long each module would take, but I would like to work on a few over the break if it’s possible to work on several in one month.

I saw that course codes are needed and the instructions are to contact the Geospatial Data Librarian.

ESRI Training Courses of Interest:

Basics of the Geodatabase Data Model

Getting Started with the Geodatabase

Archiving Data in a Multiuser Geodatabase

Creating and Editing Geodatabase Features with ArcGIS Desktop (for ArcEditor and ArcInfo)

Creating and Integrating Data for Natural Resource Applications

Basics of Python (for ArcGIS 10)

Python Scripting for Geoprocessing Workflows (for ArcGIS 10)

I’m also not sure how many of these are identical to modules in other ESRI resources, like the GIS workbooks or ArcGIS 10.1 tutorial CDs. However, my home computer does not have an optical drive, so it would be useful to download data directly even if the data is identical to what is found on the CDs.