Light, Open Source RDBMS SQLite
Tanner Jessel Spring 2014 IS 592 – Big Data Analytics School of Information Sciences College of Communication & Information
What is SQLite?
• Big Data in a Little Package – Portable, No Server – Writes directly to common media – Cross-platform (Mac, Windows, Linux)
• Portable – Small Gadgets (Windows Phone, Android, iOS) • (500 KB, 300 KB) • Embeddable – Computer Applications (Chrome, Firefox, Skype) • Single file database – Triggers – Tables – Indices – Views
What it is not…
• Not a full database application – No forms – No reports – No saved queries
Why should you be interested?
• No Server (no mess)
• Command Line Interface
• GUI Interface
• Works with R
• Works with Python
• Good for students to learn advanced SQL queries, command line interaction
• Good for small database projects
Alter, analyze, attatch, begin, commit, create (table, trigger, view, virtual table), delete, select, insert… SQLite Query Language “Mostly” SQL
Database Interface R driver for SQLite. This package embeds the SQLite database engine in R and provides an interface compliant with the DBI package. The source for the SQLite engine (version 3.7.17) is included.” –Comprehensive R Archive Network (CRAN) http://cran.r-project.org/web/packages/RSQLite/
$ sqlite3 myexampledb SQLite version 3.8.4 2014-02-11 16:24:34 Enter ".help" for usage hints. sqlite> CREATE table tbl1(one varchar(10), two smallint); sqlite> INSERT into tbl1 values('hello!',10); sqlite> INSERT into tbl1 values('goodbye', 20); sqlite> SELECT * from tbl1; hello!|10 goodbye|20 sqlite>
*Free* • Feature Rich GUI • Cross-platform – Windows – Mac – Linux – Solaris • http://sqlitestudio.pl/ SQLite Studio
10. • Rstudio Console
> install.packages("RSQLite") > library(RSQLite) > dbDIR <- "/Users/apple/Documents/IS592-Big- Data-Analytics/SQLite-Presentation" > dbFileName <- paste(dbDIR,"classHOBOData2013.db3", sep="/") >drv <- dbDriver("SQLite”) >con <- dbConnect(drv, dbname = dbFileName)
Connect to Data
• Rstudio Console
> allDf <- dbGetQuery(con, "select * from HoboTable ") >View(allDf) Load Data to Frame
Pendant monitor data from SQLite database in data frame SQLite Table Data in R Studio 15,186 rows
>summary(allDf) obsNum serialNo recNo dateTime temperatureC Min. : 1 Length:15186 Min. : 1.0 Length:15186 Min. :14.42 1st Qu.: 3797 Class :character 1st Qu.: 422.0 Class :character 1st Qu.:23.58 Median : 7594 Mode :character Median : 844.0 Mode :character Median :23.87 Mean : 7594 Mean : 844.5 Mean :26.43 3rd Qu.:11390 3rd Qu.:1266.0 3rd Qu.:25.12 Max. :15186 Max. :1722.0 Max. :59.81 intensityLight Min. : 0.0 1st Qu.: 0.0 Median : 10.8 Mean : 2524.8 3rd Qu.: 21.5 Max. :231468.2
Pendant monitor data from SQLite database in plot
>plot(allDf$temperatureC) Temperature Data Rstudio Console > temperatureF <- (allDf$temperatureC * (9/5.0))+32.0 Transformation C to F Rstudio Console > temperatureF <- (allDf$temperatureC * (9/5.0))+32.0 > newDf <- cbind(allDf,temperatureF) > dbWriteTable(con, "newTable", newDf) Update Data Frame
Pendant data .db3 file now has two tables
SQLite Table Data Modified by R 15,186 rows, new TemperatureF column added
>view(hotDf) serialNo dateTime temperatureC 1 10081427 2013-06-04 12:32:30 37.824 2 10081427 2013-06-04 12:32:40 37.935 3 10081427 2013-06-04 12:32:50 37.935 4 10081427 2013-06-04 12:33:00 37.935 5 10081427 2013-06-04 12:33:10 38.046 6 10081427 2013-06-04 12:33:20 38.046 7 10081427 2013-06-04 12:33:30 38.490 8 10081427 2013-06-04 12:33:40 38.490 9 10081427 2013-06-04 12:33:50 38.602
10081427 2013-06-04 12:34:00 38.602
Quality Assurance / Control • Import SQLite3 import sqlite3 conn=sqlite3.connect(‘/Users/apple/D ocuments/IS592-Big-Data- Analytics/SQLite- Presentation/classHOBOData2013.db 3’) c=conn.cursor()
Quality Assurance / Control • Find, list “errors” • Whatever you define them to be • > 100 F in Antarctica, for example c.execute(‘SELECT obsNum,serialNo,temperatureF FROM newTable WHERE temperatureF > 100 OR temperatureF < 10’)
Quality Assurance / Control • Check for Errors • Nice printed messsage with your selected “errors”
#Fetch all the query results myListOfNames=c.fetchall() # print them out print print("Range Errors for Temperature:") for myTuple in myListOfNames: print("obsNum: "+str(myTuple)+" SerialNo: "+str(myTuple)+" Temperature: "+str(myTuple)) conn.commit() conn.close()
'SELECT obsNum,serialNo,temperatureF FROM newTable WHERE temperatureF > 100 OR temperatureF < 10' >100 F: SQLite/Python QAQC 1,538 records where the temperature was over 100 F
Access SQLite Convert Access to SQLite
Further reading http://www.sqlite.org/books.html
A friend of mine in the program commented that about 90% of the time doing data science is obtaining and cleaning data.
This is where programming is incredibly useful. In the second year of my Masters program, my programming skills are not yet at the level that I want them to be.
I recently started some work for my research assistantship concerning Twitter data for @DataONEorg.
I’m interested in the content of posts, and the relationships between the actors in the network.
In terms of content, I’d like to look at the hashtags and links.
To illustrate how difficult it is to accomplish tasks “by hand,” I recently tried to the twitter data from a free site. My efforts are documented here: <https://notebooks.dataone.org/data-science/harvesting-dataoneorg-twitter-mentions-via-topsy/>.
I’ve read that employers should not hire a “data scientist” if the so-called “scientist” does not have programming skills. For this reason, I’m disappointed that the School of Information Science does not offer a programming course within the School itself. (I’ve heard Dr. Potnis will offer a course in Fall 2014, a semester after my graduation).
I enrolled in a programming course in the College of Engineering and Computer Science – Introduction to Programming for Scientists and Engineers. The course focuses on C++ language. This is unfortunate, as python is increasingly favored over C++. This means more ready-made programs are available, and a user community is growing. Content management systems are even building up around python.
Python is used by a friend of mine who does genome science. C++ is useful for taking advantage of parallelism, but that my friend who works on supercomputers uses python suggests to me that python works as well.
Python Displacing R As The Programming Language For Data Science by @mjasay http://readwr.it/c1ew
With the upcoming winter break, I noticed UT offers some self-paced virtual training and has a course list online at <http://www.lib.utk.edu/map/gis-services/additional-support/>
Looking over the courses offered, I noticed a few I’m interested in that could be useful for my career goals as a data manager for a natural resources organization.
I’m not sure how long each module would take, but I would like to work on a few over the break if it’s possible to work on several in one month.
I saw that course codes are needed and the instructions are to contact the Geospatial Data Librarian.
ESRI Training Courses of Interest:
I’m also not sure how many of these are identical to modules in other ESRI resources, like the GIS workbooks or ArcGIS 10.1 tutorial CDs. However, my home computer does not have an optical drive, so it would be useful to download data directly even if the data is identical to what is found on the CDs.