I am researching online mapping services for a class project and found this:
National Wild Fish Health Survey Database.
It runs on PostgreSQL. I don’t know a lot about PostgreSQL beyond that it is open source and competitive with MySQL and other enterprise-level database options. I’ve mentioned before it has an extension called PostGIS that allows excellent handling of spatial data.
So, seeing this application really caught my eye. I think it is a good aspirational model for what we could do with DLIA data. It appears to be all open source. Im copying out text they shared on <http://www.fws.gov/wildfishsurvey/database/page/about>
Major software components:
- MapFish, an open source web mapping development framework
- GeoServer, an open source software server written in Java that allows users to share and edit geospatial data
- MapServer, an open source development environment for building spatially-enabled internet applications
- PHP, a free, widely used, general-purpose scripting language
- PostgreSQL, an open source database management system
- PostGIS, an open source software program that adds support for geographic objects to the PostgreSQL database
- Smarty, a web template system written in PHP
I’ll probably explore this more, but wanted to share it with you now.
(3) Introduces the concepts big data and data analytics as an emerging field. To address the opportunities and challenges of big data in cademics, businesses, sciences, the Web, etc. To understand the nature of big data analytics and their various contexts. To master basic concepts and process of data analytics. To design analytics initiatives/proposals. To practice data mining techniques and skills (ETL). To explore data modeling and visualizing. Prerequisite: INSC 584 Database Management Systems or equivalent. Consent of the instructor.
The Data Intensive Summer School focuses on the skills needed to manage, process and gain insight from large amounts of data. It is targeted at researchers from the physical, biological, economic and social sciences that are beginning to drown in data. We will cover the nuts and bolts of data intensive computing, common tools and software, predictive analytics algorithms, data management and visualization. Given the short duration of the summer school, the emphasis will be on providing a solid foundation that the attendees can use as a starting point for advanced topics of particular relevance to their work.
- Experience working in a Linux environment
- Familiarity with relational data base model
- Examples and assignments will most likely use R, MATLAB and Weka. We do not require experience in these languages or tools, but you should already have an understanding of basic programming concepts (loops, conditionals, functions, arrays, variables, scoping, etc.)
- Robert Sinkovits, San Diego Supercomputer Center
- Nuts and bolts of data intensive computing
- Computer hardware, storage devices and file systems
- Cloud storage
- Data compression
- Networking and data movement
- Data ManagementIntroduction to R programming
- Digital libraries and archives
- Data management plans
- Access control, integrity and provenance
- Introduction to Weka
- Predictive analyticsDealing with missing data
- Standard algorithms: k-mean clustering, decision trees, SVM
- Over-fitting and trusting results
- ETL (Extract, transfer and load)
- The ETL life cycle
- ETL tools – from scripts to commercial solutions
- Non-relational atabases
- Brief refresher on relational mode
- Survey of non-relational models and technologies
- Presentation of data for maximum insight
- R and ggplot package
Virtual Summer School courses are delivered simultaneously at multiple locations across the country using high-definition videoconferencing technology.
On June 26 I received a follow-up e-mail with notes from the instructors:
Preparing for the virtual summer school
Several of the instructors have requested that you preinstall software on your laptop. Given the large number of participants and the compressed schedule, we ask that you comply and do this before the start of the summer school.
R Studio (statistical programming language)
Follow “download RStudio Desktop”
WEKA (data mining software)
Follow “Download” link on left hand side of home page
Please download the Stable book 3rd ed. version
Prior knowledge of R is not required, but we do assume that you have some programming experience and familiarity with basic programming concepts (variables, arrays, loops, branching, etc.). You may find it helpful to acquaint yourself with basic R syntax ahead of time.
Reading the first two chapters of the following online introduction is recommended http://cran.r-project.org/doc/manuals/R-intro.html
A basic understanding of relational databases and SQL would also be useful. If you are unfamiliar with the SQL syntax, please consider the following tutorials
I already have R studio; I have never tried Weka. This is a little bit of added work for the summer, but it looks like a great opportunity to pick up some additional skills, or at least refresh those skills I’ve already acquired.
Light, Open Source RDBMS SQLite
Tanner Jessel Spring 2014 IS 592 – Big Data Analytics School of Information Sciences College of Communication & Information
What is SQLite?
• Big Data in a Little Package – Portable, No Server – Writes directly to common media – Cross-platform (Mac, Windows, Linux)
• Portable – Small Gadgets (Windows Phone, Android, iOS) • (500 KB, 300 KB) • Embeddable – Computer Applications (Chrome, Firefox, Skype) • Single file database – Triggers – Tables – Indices – Views
What it is not…
• Not a full database application – No forms – No reports – No saved queries
Why should you be interested?
• No Server (no mess)
• Command Line Interface
• GUI Interface
• Works with R
• Works with Python
• Good for students to learn advanced SQL queries, command line interaction
• Good for small database projects
Alter, analyze, attatch, begin, commit, create (table, trigger, view, virtual table), delete, select, insert… SQLite Query Language “Mostly” SQL
Database Interface R driver for SQLite. This package embeds the SQLite database engine in R and provides an interface compliant with the DBI package. The source for the SQLite engine (version 3.7.17) is included.” –Comprehensive R Archive Network (CRAN) http://cran.r-project.org/web/packages/RSQLite/
$ sqlite3 myexampledb SQLite version 3.8.4 2014-02-11 16:24:34 Enter ".help" for usage hints. sqlite> CREATE table tbl1(one varchar(10), two smallint); sqlite> INSERT into tbl1 values('hello!',10); sqlite> INSERT into tbl1 values('goodbye', 20); sqlite> SELECT * from tbl1; hello!|10 goodbye|20 sqlite>
*Free* • Feature Rich GUI • Cross-platform – Windows – Mac – Linux – Solaris • http://sqlitestudio.pl/ SQLite Studio
10. • Rstudio Console
> install.packages("RSQLite") > library(RSQLite) > dbDIR <- "/Users/apple/Documents/IS592-Big- Data-Analytics/SQLite-Presentation" > dbFileName <- paste(dbDIR,"classHOBOData2013.db3", sep="/") >drv <- dbDriver("SQLite”) >con <- dbConnect(drv, dbname = dbFileName)
Connect to Data
• Rstudio Console
> allDf <- dbGetQuery(con, "select * from HoboTable ") >View(allDf) Load Data to Frame
Pendant monitor data from SQLite database in data frame SQLite Table Data in R Studio 15,186 rows
>summary(allDf) obsNum serialNo recNo dateTime temperatureC Min. : 1 Length:15186 Min. : 1.0 Length:15186 Min. :14.42 1st Qu.: 3797 Class :character 1st Qu.: 422.0 Class :character 1st Qu.:23.58 Median : 7594 Mode :character Median : 844.0 Mode :character Median :23.87 Mean : 7594 Mean : 844.5 Mean :26.43 3rd Qu.:11390 3rd Qu.:1266.0 3rd Qu.:25.12 Max. :15186 Max. :1722.0 Max. :59.81 intensityLight Min. : 0.0 1st Qu.: 0.0 Median : 10.8 Mean : 2524.8 3rd Qu.: 21.5 Max. :231468.2
Pendant monitor data from SQLite database in plot
>plot(allDf$temperatureC) Temperature Data Rstudio Console > temperatureF <- (allDf$temperatureC * (9/5.0))+32.0 Transformation C to F Rstudio Console > temperatureF <- (allDf$temperatureC * (9/5.0))+32.0 > newDf <- cbind(allDf,temperatureF) > dbWriteTable(con, "newTable", newDf) Update Data Frame
Pendant data .db3 file now has two tables
SQLite Table Data Modified by R 15,186 rows, new TemperatureF column added
>view(hotDf) serialNo dateTime temperatureC 1 10081427 2013-06-04 12:32:30 37.824 2 10081427 2013-06-04 12:32:40 37.935 3 10081427 2013-06-04 12:32:50 37.935 4 10081427 2013-06-04 12:33:00 37.935 5 10081427 2013-06-04 12:33:10 38.046 6 10081427 2013-06-04 12:33:20 38.046 7 10081427 2013-06-04 12:33:30 38.490 8 10081427 2013-06-04 12:33:40 38.490 9 10081427 2013-06-04 12:33:50 38.602
10081427 2013-06-04 12:34:00 38.602
Quality Assurance / Control • Import SQLite3 import sqlite3 conn=sqlite3.connect(‘/Users/apple/D ocuments/IS592-Big-Data- Analytics/SQLite- Presentation/classHOBOData2013.db 3’) c=conn.cursor()
Quality Assurance / Control • Find, list “errors” • Whatever you define them to be • > 100 F in Antarctica, for example c.execute(‘SELECT obsNum,serialNo,temperatureF FROM newTable WHERE temperatureF > 100 OR temperatureF < 10’)
Quality Assurance / Control • Check for Errors • Nice printed messsage with your selected “errors”
#Fetch all the query results myListOfNames=c.fetchall() # print them out print print("Range Errors for Temperature:") for myTuple in myListOfNames: print("obsNum: "+str(myTuple)+" SerialNo: "+str(myTuple)+" Temperature: "+str(myTuple)) conn.commit() conn.close()
'SELECT obsNum,serialNo,temperatureF FROM newTable WHERE temperatureF > 100 OR temperatureF < 10' >100 F: SQLite/Python QAQC 1,538 records where the temperature was over 100 F
Access SQLite Convert Access to SQLite
Further reading http://www.sqlite.org/books.html
Converting a database:
I found this to be the best (and easiest) tool:
This took a 440 MB file in Access down to a 80 MB file (although, it should be noted the demo version only saves half the table data; still, assuming the other half is 80 MB, we see a huge reduction in file size).
Also I like that this goes to an open access file format. With SQLite3 in the command line, or SQLite Studio across platforms, a file format that was once proprietary and accessible only to those with Windows software is now an open -access format.
I’ll update this post with examples at a later point.
Other options I evaluated:
- This one did not work for me, although I appreciated that it’s open source. I don’t think it works with sqlite 3.
- Also, it appears to work for .mdb only; you have to convert an .accdb file to a .mdb file
Lecture 3 in Big Data Analytics reviewed some of the fundamental database concepts.
One aspect of INSC 584 (Database Management Systems) that I did not like was that the course textbook provided examples from “Pine Valley Furniture Company.” However, for Big Data Analytics, I’m taking the opportunity to explore a database that I find more interesting: the All Taxa Biodiversity Inventory Database for Great Smoky Mountains National Park.
This database has 52 tables. It’s online as a cold fusion site (which is in the process of being replaced with Microsoft SQL Server). It was formerly downloadable online as an Access database; however, the downloadable file appears not to be available as of January 2014.
Also online is an entity-relationship diagram: <http://dlia.org/sites/default/files/access_relationships.pdf>.
The 52 tables are drawn up into broad categories:
- Collection Details
- Scope Dependent
This is a useful database for me to study because I find it interesting. So, I’m grateful that the ER diagram is online.
I presented this poster (I intend to supply a digital version at some later point) at the 2010 meeting of the Organization of Fish and Wildlife Information Managers. The agenda for that conference is available online.
This was part of my contract work as content manager for the now defunct National Biological Information Infrastructure’s Southern Appalachian Information Node (SAIN).
An archive of the NBII program’s web site is available from the U.S. “Fugitive Agencies”collection.
The most recent iteration of SAIN was as “SEIN” for “Southeast Information Node,” a change resulting from some of the organizational upheaval in the final months leading to the program’s ultimate termination in a Presidential line item budget request for the U.S. Geological Survey. The archive for the overall site is available online: <https://wayback.archive-it.org/2361/20120105235453/http://www.nbii.gov/portal/server.pt/community/southeast/245>; the specific application promoted by this poster is available at <https://wayback.archive-it.org/2361/20120106000735/http://www.nbii.gov/portal/server.pt/community/species_of_greatest_conservation_need/1642>.