Category Archives: Big Data Analytics

INSC 592 – Big Data Analytics

(3) Introduces the concepts of big data and data analytics as an emerging field. To address the opportunities and challenges of big data in academics, businesses, sciences, the Web, etc. To understand the nature of big data analytics and its various contexts. To master basic concepts and processes of data analytics. To design analytics initiatives/proposals. To practice data mining techniques and skills (ETL). To explore data modeling and visualization. Prerequisite: INSC 584 Database Management Systems or equivalent. Consent of the instructor.



Open Source Relational Database System

Overview of SQLite benefits, including compatibility with R and Python and the ability to work with converted Microsoft Access databases.

A Lightweight, Open Source RDBMS: SQLite

Tanner Jessel | Spring 2014 | IS 592 – Big Data Analytics | School of Information Sciences, College of Communication & Information

What is SQLite?

• Big Data in a Little Package
  – Portable, no server required
  – Writes directly to common media
  – Cross-platform (Mac, Windows, Linux)
• Portable – runs on small devices (Windows Phone, Android, iOS) with a small footprint (roughly 300–500 KB)
• Embeddable – used inside desktop applications (Chrome, Firefox, Skype)
• Single-file database holding tables, indices, views, and triggers

http://en.wikipedia.org/wiki/SQLite

What it is not…
• Not a full database application:
  – No forms
  – No reports
  – No saved queries

Why should you be interested?
*Free*

• No Server (no mess)

• Command Line Interface

• GUI Interface

• Works with R

• Works with Python

• Good for students to learn advanced SQL queries, command line interaction

• Good for small database projects

SQLite Query Language: “Mostly” SQL
ALTER, ANALYZE, ATTACH, BEGIN, COMMIT, CREATE (table, trigger, view, virtual table), DELETE, SELECT, INSERT…

RSQLite
“Database Interface R driver for SQLite. This package embeds the SQLite database engine in R and provides an interface compliant with the DBI package. The source for the SQLite engine (version 3.7.17) is included.” – Comprehensive R Archive Network (CRAN)
http://cran.r-project.org/web/packages/RSQLite/
sqlite3 – http://www.sqlite.org/cli.html

Command Line Shell

$ sqlite3 myexampledb
SQLite version 3.8.4 2014-02-11 16:24:34
Enter ".help" for usage hints.
sqlite> CREATE TABLE tbl1(one varchar(10), two smallint);
sqlite> INSERT INTO tbl1 VALUES('hello!', 10);
sqlite> INSERT INTO tbl1 VALUES('goodbye', 20);
sqlite> SELECT * FROM tbl1;
hello!|10
goodbye|20
sqlite>

SQLite Studio
*Free*
• Feature-rich GUI
• Cross-platform: Windows, Mac, Linux, Solaris
• http://sqlitestudio.pl/
Connect to Data • RStudio Console

> install.packages("RSQLite") > library(RSQLite) > dbDIR <- "/Users/apple/Documents/IS592-Big- Data-Analytics/SQLite-Presentation" > dbFileName <- paste(dbDIR,"classHOBOData2013.db3", sep="/") >drv <- dbDriver("SQLite”) >con <- dbConnect(drv, dbname = dbFileName)

Load Data to Frame • RStudio Console

> allDf <- dbGetQuery(con, "SELECT * FROM HoboTable")
> View(allDf)

[Screenshot: SQLite table data in RStudio – pendant monitor data from the SQLite database loaded into a data frame (15,186 rows)]

> summary(allDf)
     obsNum        serialNo             recNo          dateTime          temperatureC
 Min.   :    1   Length:15186       Min.   :   1.0   Length:15186       Min.   :14.42
 1st Qu.: 3797   Class :character   1st Qu.: 422.0   Class :character   1st Qu.:23.58
 Median : 7594   Mode  :character   Median : 844.0   Mode  :character   Median :23.87
 Mean   : 7594                      Mean   : 844.5                      Mean   :26.43
 3rd Qu.:11390                      3rd Qu.:1266.0                      3rd Qu.:25.12
 Max.   :15186                      Max.   :1722.0                      Max.   :59.81
 intensityLight
 Min.   :     0.0
 1st Qu.:     0.0
 Median :    10.8
 Mean   :  2524.8
 3rd Qu.:    21.5
 Max.   :231468.2

Temperature Data – pendant monitor data from the SQLite database, plotted:

> plot(allDf$temperatureC)
Transformation C to F • RStudio Console
> temperatureF <- (allDf$temperatureC * (9/5.0)) + 32.0

Update Data Frame • RStudio Console
> newDf <- cbind(allDf, temperatureF)
> dbWriteTable(con, "newTable", newDf)

[Screenshot: SQLite table data modified by R – the pendant data .db3 file now has two tables; 15,186 rows with a new temperatureF column added]
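
The hotDf data frame shown below is not defined in the slides above; it is presumably a subset of the warmest readings. A minimal sketch, assuming the newTable written above and an arbitrary 37 °C cutoff, of how it could be built:

# Hypothetical reconstruction: the 37 C cutoff and column choice are my assumptions
hotDf <- dbGetQuery(con, "SELECT serialNo, dateTime, temperatureC
                          FROM newTable
                          WHERE temperatureC > 37
                          ORDER BY dateTime")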

> View(hotDf)
     serialNo   dateTime              temperatureC
1    10081427   2013-06-04 12:32:30   37.824
2    10081427   2013-06-04 12:32:40   37.935
3    10081427   2013-06-04 12:32:50   37.935
4    10081427   2013-06-04 12:33:00   37.935
5    10081427   2013-06-04 12:33:10   38.046
6    10081427   2013-06-04 12:33:20   38.046
7    10081427   2013-06-04 12:33:30   38.490
8    10081427   2013-06-04 12:33:40   38.490
9    10081427   2013-06-04 12:33:50   38.602
10   10081427   2013-06-04 12:34:00   38.602

Quality Assurance / Control • Import sqlite3

import sqlite3

conn = sqlite3.connect('/Users/apple/Documents/IS592-Big-Data-Analytics/SQLite-Presentation/classHOBOData2013.db3')
c = conn.cursor()

Quality Assurance / Control • Find and list “errors”
• Whatever you define them to be – > 100 °F in Antarctica, for example

c.execute('SELECT obsNum, serialNo, temperatureF FROM newTable WHERE temperatureF > 100 OR temperatureF < 10')

Quality Assurance / Control • Check for errors • Nice printed message listing your selected “errors”
python QAQCdemo.py

# Fetch all the query results
myListOfNames = c.fetchall()

# Print them out
print()
print("Range Errors for Temperature:")
for myTuple in myListOfNames:
    print("obsNum: " + str(myTuple[0]) + " SerialNo: " + str(myTuple[1]) + " Temperature: " + str(myTuple[2]))

conn.commit()
conn.close()
SQLite/Python QAQC: running
'SELECT obsNum, serialNo, temperatureF FROM newTable WHERE temperatureF > 100 OR temperatureF < 10'
returned 1,538 records where the temperature was over 100 °F.

Convert Access to SQLite

https://code.google.com/p/mdb-sqlite/

http://convertdb.com/access/sqlite

https://code.google.com/p/access2sqlite/

http://mdbtools.sourceforge.net/

http://sites.fastspring.com/eggerapps/product/mdbaccdbviewer

Further reading http://www.sqlite.org/books.html

 

Convert Access Database to SQLite

Converting a database:

BioDB_EECS.accdb

to

BioDB_EECS.db

I found this to be the best (and easiest) tool:

https://eggerapps.at/mdbviewer/

http://sites.fastspring.com/eggerapps/product/mdbaccdbviewer

This took a 440 MB Access file down to an 80 MB SQLite file (though it should be noted that the demo version only saves half of the table data; even if the other half adds another 80 MB, that is still a huge reduction in file size).

I also like that the conversion produces an open file format. With sqlite3 on the command line, or SQLite Studio across platforms, data that was once locked in a proprietary format accessible only to those with Windows software is now openly accessible.

I’ll update this post with examples at a later point.
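
In the meantime, here is a minimal sketch of opening the converted file from R with RSQLite, in the same style as the presentation above (it assumes the converted BioDB_EECS.db sits in the working directory; the table names are whatever the converter produced):

library(RSQLite)

# Connect to the converted SQLite database (the path is an assumption)
drv <- dbDriver("SQLite")
con <- dbConnect(drv, dbname = "BioDB_EECS.db")

# List the tables the converter created and peek at the first one
tables <- dbListTables(con)
tables
head(dbGetQuery(con, paste0('SELECT * FROM "', tables[1], '" LIMIT 5')))

dbDisconnect(con)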

Other options I evaluated:

https://code.google.com/p/mdb-sqlite/

  • This one did not work for me, although I appreciated that it’s open source. I don’t think it works with SQLite 3.
  • Also, it appears to work with .mdb files only; you would have to convert an .accdb file to .mdb first.

http://convertdb.com/access/sqlite

https://code.google.com/p/access2sqlite/

http://mdbtools.sourceforge.net/

Statistics for Data Science

Today’s lecture for Big Data Analytics included statistical tools for data analysis.

My Data Pro Tumblr blog includes several listings and resources concerning statistics <http://mountainsol.tumblr.com/tagged/statistics>.

From the perspective of an information scientist, statistical analysis is not just about the computation performed, but also about preserving the input, the output, and the processing.

One of the more popular statistical software packages is R, which actually does a lot more than work with statistics (as one of my recent tweets showed).

There’s a short introduction to R which explains:

R is a tool for statistics and data modeling. The R programming language is elegant, versatile, and has a highly expressive syntax designed around working with data. R is more than that, though — it also includes extremely powerful graphics capabilities. If you want to easily manipulate your data and present it in compelling ways, R is the tool for you.

http://tryr.codeschool.com/

It’s also possible to run R from the terminal in Mac OS X, but a nice interface for using R is RStudio <https://www.rstudio.com/>.
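
To make the “statistics plus graphics in one tool” point concrete, here is a minimal base-R sketch (it uses the built-in mtcars dataset; nothing here is specific to the course data):

# Summary statistics for a built-in dataset
summary(mtcars$mpg)

# A quick linear model: fuel efficiency as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)

# Base graphics: scatterplot with the fitted regression line
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel efficiency vs. weight")
abline(fit, col = "red")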

Other useful links:

http://ropensci.org/

http://cran.us.r-project.org/

http://www.statmethods.net/index.html

 

 

Extraction, Transform and Load

A friend of mine in the program commented that about 90% of the time in data science is spent obtaining and cleaning data.

This is where programming is incredibly useful. In the second year of my Master’s program, my programming skills are not yet at the level I want them to be.

I recently started some work for my research assistantship concerning Twitter data for @DataONEorg.

I’m interested in the content of posts, and the relationships between the actors in the network.

In terms of content, I’d like to look at the hashtags and links.

To illustrate how difficult it is to accomplish tasks “by hand,” I recently tried to harvest the Twitter data from a free site. My efforts are documented here: <https://notebooks.dataone.org/data-science/harvesting-dataoneorg-twitter-mentions-via-topsy/>.
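
To show what even a few lines of code can do here, a minimal R sketch that pulls hashtags, mentions, and links out of raw tweet text (the example tweets below are invented for illustration):

# Invented example tweets (not real @DataONEorg data)
tweets <- c(
  "Loving the new @DataONEorg webinar on #datamanagement http://example.org/abc",
  "Great #rstats resources for #ecology data curation"
)

# Extract hashtags, mentions, and links with regular expressions
hashtags <- regmatches(tweets, gregexpr("#\\w+", tweets))
mentions <- regmatches(tweets, gregexpr("@\\w+", tweets))
links    <- regmatches(tweets, gregexpr("https?://\\S+", tweets))

hashtags  # "#datamanagement", "#rstats", "#ecology"
mentions  # "@DataONEorg"
links     # "http://example.org/abc"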

I’ve read that employers should not hire a “data scientist” if the so-called “scientist” does not have programming skills.  For this reason, I’m disappointed that the School of Information Science does not offer a programming course within the School itself.  (I’ve heard Dr. Potnis will offer a course in Fall 2014, a semester after my graduation).

I enrolled in a programming course in the College of Engineering and Computer Science – Introduction to Programming for Scientists and Engineers. The course focuses on the C++ language. This is unfortunate, as Python is increasingly favored over C++ for this kind of work: more ready-made programs are available, a user community is growing, and content management systems are even being built around Python.

A friend of mine who does genome science uses Python. C++ is useful for taking advantage of parallelism, but the fact that my friend, who works on supercomputers, uses Python suggests to me that Python works just as well.

Programming language popularity:

http://venturebeat.com/2014/02/03/the-most-popular-coding-languages-for-2014-are/

Further reading:

Python Displacing R As The Programming Language For Data Science by @mjasay http://readwr.it/c1ew  

http://www.scipy.org/

 

 

Big Data in Ecology: Volume, Variety, and Velocity in Environmental Data

Our first INSC 592 Big Data Analytics assignment included a discussion of what “Big Data” means to each student and their respective career goals.

Because I love wildlife and the outdoors, my career goals include environmental information management.  The earth itself is a big place, so it’s easy to understand why volume, variety, and velocity might intersect as a source of big data.

This slide deck was designed to help my fellow students understand the sources of “big” data in the environment, and ways that the information can flow from field to data file.

The presentation was well received and, I think, produced the intended effect of illustrating the sources of volume, variety, and velocity in environmental data.

One of the more useful outcomes of this assignment was that I produced a collection of bookmarks available for download. The bookmarks are based on the helpful (albeit cumbersome) “Software Tools Catalog” database available from DataONE. My approach was to bookmark the links in Diigo, which allows for long-term curation, wider availability, customizable tags, and flexible output, such as an RSS feed like the one for the keyword “repository” <https://www.diigo.com/rss/user/mountainsol/repository?type=all&sort=created>. A collection for a particular tag can also be linked to directly; for example, “visualization”: https://www.diigo.com/user/mountainsol/visualization.
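
As a rough illustration of that “flexible output,” here is a sketch of pulling one of these RSS feeds into R (it assumes the xml2 package; the feed URL is the “repository” feed above):

library(xml2)

# Read the Diigo RSS feed for the "repository" tag
feed_url <- "https://www.diigo.com/rss/user/mountainsol/repository?type=all&sort=created"
feed <- read_xml(feed_url)

# Pull the bookmark titles and links out of the RSS <item> elements
titles <- xml_text(xml_find_all(feed, "//item/title"))
links  <- xml_text(xml_find_all(feed, "//item/link"))

# Combine into a data frame for further curation or export
bookmarks <- data.frame(title = titles, link = links, stringsAsFactors = FALSE)
head(bookmarks)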

I should thank Ethan White for his “Most Interesting Man” image, which I could not resist borrowing. See http://jabberwocky.weecology.org/2013/08/12/ignite-talk-big-data-in-ecology/ for Ethan’s take on “Big Data in Ecology.”

Big Data and Environmental Information

Environmental information gathered by modern computational methods has all the hallmarks of “Big Data” including volume, variety, and velocity. This includes information collected by remote sensing from earth observing satellites, information collected by terrestrial, freshwater, and marine sensor networks, and data collected with portable devices such as radio telemetry or global positioning satellite transponders. Even small animals can be paired with “passive integrated transponders” that transform an organism into a data point – and tissue samples and specimens can be taken back to the lab, herbarium, or other natural history collection, with further analysis and data generation and curation done at the molecular and genetic level.

The volume of data available to be collected from the environment approaches infinity, with complex edges like a fractal. In fact, the National Ecological Observatory Network states that the defining characteristic of ecological data is complexity. The complex interactions between elements of the environment, such as distributions of species, interactions between species, and interactions between biotic and abiotic factors, generate a wealth of data points. A February 3, 2014 article entitled “Ecologists urged to avail themselves of big data in studies” suggests that advances in data science and computational methods require a new generation of ecologists who can work with “big datasets at large scale,” meshing together both large and small datasets to better understand the complex dynamics of the natural world.

An adage in the wildlife biology field is to “know your stats and know your critters.” However, a 2012 paper by Strasser and Hampton concerning undergraduate training in data management for ecology students suggests that knowledge of data management tools and methods is lacking in the undergraduate curriculum. At the University of Tennessee, students in the Environmental Studies Program are encouraged to take an “Environmental Information Science” course available through the College of Communication and Information’s undergraduate minor in Information Science and Technology. Graduate-level training is available from the University of New Mexico’s Environmental Information Management Institute, and the University of Tennessee offers a course in “Environmental Informatics.”

These course offerings are needed. In addition to “knowing the stats” and “knowing the critters,” students of ecology are increasingly expected to work with large datasets. Further, working with data that has already been collected is in fact less costly than collecting new data. Collected data therefore becomes increasingly valuable, and personnel who can effectively manage and safeguard that investment in research effort offer the field of ecology a new skillset beyond quantitative analysis or biology. In “The Age of Big Data,” Steve Lohr wrote that the U.S. would need just shy of 200,000 workers skilled in data analysis, and 1.5 million more who are well versed in data management.

The need for data literacy in the environmental domain is underscored by the variety of applications, what the President’s Council of Advisors on Science and Technology in 2011 called “extreme heterogeneity of the data.” Along with remote-sensed data and ongoing field collection of data, museum collections hold millions of specimens, and many sites are in the process of digitizing their holdings for greater access. The U.S. Geological Survey’s Biodiversity Information Serving Our Nation (BISON) program provides federated access to over 256 museum and herbarium collections, for a total of over 100 million occurrence records. Taken together, these records can provide new opportunities for inquiry in biodiversity science: with increased data access comes increased opportunity to conduct data-driven investigations.

The extreme volume, variety, and velocity of data necessitate a suite of tools to collect, curate, and visualize environmental information. Professional organizations such as the Earth Science Information Partners, the Organization of Fish and Wildlife Information Managers, and the DataONE Users Group offer continual training in data management. Tutorials and methods abound on the Web.

A basic problem for any data scientist working with ecological data might be to curate his or her own collection of tools. DataONE addressed this by creating an online database of tools and software, which contains several hundred examples of computer tools spanning the data life cycle, from planning, to collecting, assuring quality, describing, preserving, discovering, integrating, and analyzing. The database includes brief descriptions and a rudimentary controlled vocabulary, and it is searchable. Unfortunately, the end user cannot modify the database and is forced to rely on the website manager to update the site in case of expired links, nor can the end user add new tools as they emerge.

Here, a solution is proposed: integrate the database into a collection of online bookmarks and expand on the collection with new tools and best practices. Categories closely follow the DataONE Tools and Software Database and include the following: Biodiversity, Databases, GIS, Metadata, Repository, Visualization. As the material is available online, the feeds are presented in Table 1.

Biodiversity https://www.diigo.com/rss/user/mountainsol/biodiversity?type=all&sort=created
Databases https://www.diigo.com/rss/user/mountainsol/databases?type=all&sort=created
GIS https://www.diigo.com/rss/user/mountainsol/GIS?type=all&sort=created
Metadata https://www.diigo.com/rss/user/mountainsol/Metadata?type=all&sort=created
Repository https://www.diigo.com/rss/user/mountainsol/repository?type=all&sort=created
Visualization https://www.diigo.com/rss/user/mountainsol/visualization?type=all&sort=created

Table 1. XML Feeds of Annotated Bibliography    

Note: PDF version of Assignment 1 online:  IS-592-ASSN1-JesselT

A related presentation is available in “Big Data in Ecology: Volume, Variety and Velocity in Environmental Information.”

Entity – Relationship Model for Biodiversity Database

Lecture 3 in Big Data Analytics reviewed some of the fundamental database concepts.

One aspect of INSC 584 (Database Management Systems) that I did not like was that the course textbook provided examples from “Pine Valley Furniture Company.”  However, for Big Data Analytics, I’m taking the opportunity to explore a database that I find more interesting: the All Taxa Biodiversity Inventory Database for Great Smoky Mountains National Park.

This database has 52 tables. It’s online as a ColdFusion site (which is in the process of being replaced with Microsoft SQL Server). It was formerly downloadable as an Access database; however, the downloadable file appears to no longer be available as of January 2014.

Also online is an entity-relationship diagram: <http://dlia.org/sites/default/files/access_relationships.pdf>.

The 52 tables are drawn up into broad categories (a simplified sketch of how some of them might relate follows the list):

  1. Specimens
  2. Collection Details
  3. Citations
  4. Taxonomy
  5. Scope Dependent
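
To make the entity-relationship idea concrete, here is a highly simplified sketch in R/RSQLite; the table and column names are my own invention for illustration, not the actual ATBI schema:

library(DBI)
library(RSQLite)

# Scratch in-memory database, just to illustrate the relationships
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical, stripped-down versions of three of the categories above
dbExecute(con, "CREATE TABLE Taxonomy (
  taxonID INTEGER PRIMARY KEY,
  scientificName TEXT,
  family TEXT)")

dbExecute(con, "CREATE TABLE CollectionDetails (
  collectionID INTEGER PRIMARY KEY,
  locality TEXT,
  collectionDate TEXT)")

# A specimen references both a taxon and a collection event
dbExecute(con, "CREATE TABLE Specimens (
  specimenID INTEGER PRIMARY KEY,
  catalogNumber TEXT,
  taxonID INTEGER REFERENCES Taxonomy(taxonID),
  collectionID INTEGER REFERENCES CollectionDetails(collectionID))")

dbListTables(con)
dbDisconnect(con)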

This is a useful database for me to study because I find it interesting.  So, I’m grateful that the ER diagram is online.

Decision Errors in Data Science

From Big Data Analytics lecture 2, I was most impressed by the slide concerning decision errors in logic.

I imagine most data scientists are fans of Mr. Spock.  No need to be in the Captain’s Chair, but a strong need to contribute meaningful analysis to important decisions.

Any Star Trek fan can quote Mr. Spock’s sage observation, “Logic is the beginning of wisdom, not the end.”

Logic is critical to data science and to the wisdom that can arise from it. However, some logical errors are common, as pointed out by Dr. Wang’s slide:

Typical Decision Errors: Logic

  • Not asking the right questions
  • Making incorrect assumptions and failing to test them
  • Using analytics to justify instead of learning the facts
  • Interpreting data incorrectly
  • Failing to understand the alternatives

My Geographic Information Systems – Spatial Databases and Data Management course instructor (Dr. Ralston) has a graphic on his door about “correlation and causation.” His graphic shows the declining use of Windows Internet Explorer alongside a correlated decline in murders.

The refrain is always “correlation does not imply causation.” Logic might be sound, the math might add up, but the pitfalls exist.
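
To make the pitfall concrete, a toy R sketch: two series that both happen to trend downward over time will show a very strong correlation even though neither causes the other (the numbers below are invented):

# Invented numbers: hypothetical browser share (%) and murder counts by year
years    <- 2006:2011
ie_share <- c(80, 72, 65, 58, 52, 45)
murders  <- c(17000, 16500, 16000, 15500, 15000, 14500)

cor(ie_share, murders)  # close to 1: strong correlation, no causation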

I often wonder if some of the data science “boot camps” and workshops can effectively impart these key lessons that are central to the process of science.

 

IS 592 Big Data Analytics

Catalog Description

Introduces the concepts of big data and data analytics as an emerging field. To address the opportunities and challenges of big data in academics, businesses, sciences, the Web, etc. To understand the nature of big data analytics and its various contexts. To master basic concepts and processes of data analytics. To design analytics initiatives/proposals. To practice data mining techniques and skills. To explore data modeling and visualization.

Prerequisite: Database Management Systems (completion of IS 584 or equivalent)

Goals/Objectives

  • To survey the needs and importance of data analytics in various contexts
  • To understand the challenges of managing big data
  • To practice data extraction, transformation and load techniques (ETL)
  • To develop algorithms to analyze and model data
  • To design effective ways for communicating results to special users

Methods of Teaching/Learning

    This course is built on knowledge and skills of database management systems. The focus will be on issues challenging organizational decision-making and on real-world data needs that call for methods of data management, analytics, and modeling to derive new knowledge for better decision making.

    Students are expected to read broadly and to work on real data collected from the real world. This course is managed using Blackboard courseware, which is accessible using your UT NetID and password at https://bblearn.utk.edu/. Blackboard Collaborate, a tool hosted in Blackboard, will be used for synchronous virtual class sessions; you may attend class from anywhere in the world. The course materials, assignments, and grades are accessible in Blackboard.

Course Materials

Required text:

Jeffrey Stanton with Robert De Graaf (c2013) Version 3: Introduction to Data Science at http://jsresearch.net/

Optional texts:

Bill Franks (2012) Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics (Wiley and SAS Business Series) 337 pages ISBN: 1118208781

Thomas H. Davenport, Jeanne G. Harris (2010) Analytics at Work: Smarter Decisions, Better Results [Hardcover] 240 pages. Harvard Business Review Press

Douglas W. Hubbard (2010) How to Measure Anything: Finding the Value of Intangibles in Business. (2nd edition) 320 pages. Wiley. ISBN-10: 0470539399; ISBN-13: 978-0470539392

Tasks and Evaluation Criteria
• Attendance & Participation (15%)

Prepared attendance and participation in course activities are important to success in this course. If you have to miss a class for whatever reason, you are still responsible for the material covered. If you miss a class, you may replay the recording; Blackboard Collaborate keeps track of attendance and replays.

Class activities include presentations and discussion.

• ePortfolio or Journal (10%)

Be a reflective learner! Throughout the semester, you should maintain a learning journal or ePortfolio. Write journal entries to reflect your thoughts, analyze critical incidents, and check milestones.

If you have taken the ePortfolio course, you should continue building your ePortfolio in this course by writing Posts to reflect on your learning and achievements. At the end of the semester, you will write a reflective summary for the course as a Page in your ePortfolio.

If you have not taken the ePortfolio course, you may keep a structured journal with dated entries and write a final reflection piece. You submit the reflection along with selected journal entries in any format accessible to the instructor.

Make your learning and achievements visible through the development of a course ePortfolio. Journal entries or ePortfolio posts document your learning and professional growth with evidence and through reflection on learning experiences. Collecting artifacts and reflecting in journal entries are private actions, but presenting outcomes and sharing a reflective summary are oriented toward a product for the public (or your evaluators).

What should you write in journal entries (ePortfolio posts)? You do not need to report or log what you have done during the course. You need to focus on significant learning incidents, aha moments, relevant thoughts, analysis and synthesis of important concepts, and milestone checking. Reflection is a higher level of cognitive activity in which you make sense of what and how you learned. For example, when you encounter a challenging problem, you should reflect on the strategies and the process through which you were, or were not, able to solve it. For ePortfolio students, you should classify your journal entries so that they can be easily accessed to facilitate a higher level of synthesis later in producing your final ePortfolio. For non-ePortfolio students, you should structure your journal with meaningful headings, which will help you develop a summary reflection of the semester as your last journal entry.

• Assignments (check Schedule for due dates):

1. Data Science (15%)

Understand the nature of data analytics in context. Understand the skill set of data scientists.

2. Data Preparation: Extract, Transform and Load (ETL) (30%)
Extract the relevant data from original sources (the raw data); transform raw data to appropriate format; load the transformed data to a database.

3. Data Analysis and Modeling (30%)
Explore the transformed data to derive meaningful results (statistical analysis, pattern recognition, trend visualization).

SP-2014-INSC-592.pdf