Monthly Archives: January 2014

Entity – Relationship Model for Biodiversity Database

Lecture 3 in Big Data Analytics reviewed some of the fundamental database concepts.

One aspect of INSC 584 (Database Management Systems) that I did not like was that the course textbook provided examples from “Pine Valley Furniture Company.”  However, for Big Data Analytics, I’m taking the opportunity to explore a database that I find more interesting: the All Taxa Biodiversity Inventory Database for Great Smoky Mountains National Park.

This database has 52 tables. It’s online as a ColdFusion site (which is in the process of being replaced with Microsoft SQL Server). It was formerly downloadable as an Access database; however, the downloadable file appears not to be available as of January 2014.

Also online is an entity-relationship diagram: <http://dlia.org/sites/default/files/access_relationships.pdf>.

The 52 tables are grouped into five broad categories:

  1. Specimens
  2. Collection Details
  3. Citations
  4. Taxonomy
  5. Scope Dependent

This is a useful database for me to study because I find it interesting.  So, I’m grateful that the ER diagram is online.
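
To think through what an entity-relationship model like this looks like in practice, here is a minimal sketch in Python using SQLite. The table names, column names, and rows are my own illustrative assumptions, loosely following the Specimens, Collection Details, and Taxonomy categories above; they are not the actual ATBI tables.

```python
import sqlite3

# Hypothetical, heavily simplified schema. These table and column names are
# illustrative assumptions, not the real ATBI database design.
SCHEMA = """
CREATE TABLE taxon (
    taxon_id        INTEGER PRIMARY KEY,
    scientific_name TEXT NOT NULL
);
CREATE TABLE collection_event (
    event_id     INTEGER PRIMARY KEY,
    locality     TEXT NOT NULL,
    collected_on TEXT NOT NULL
);
CREATE TABLE specimen (
    specimen_id INTEGER PRIMARY KEY,
    taxon_id    INTEGER NOT NULL REFERENCES taxon(taxon_id),
    event_id    INTEGER NOT NULL REFERENCES collection_event(event_id)
);
"""


def build_demo_db():
    """Create an in-memory database and load a few made-up demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the relationships
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO taxon VALUES (?, ?)",
        [(1, "Example species A"), (2, "Example species B")],
    )
    conn.execute(
        "INSERT INTO collection_event VALUES (1, 'Example locality', '2014-01-15')"
    )
    conn.executemany(
        "INSERT INTO specimen VALUES (?, ?, ?)",
        [(1, 1, 1), (2, 1, 1), (3, 2, 1)],
    )
    conn.commit()
    return conn


def specimens_per_taxon(conn):
    """Join specimen to taxon and count specimens for each name."""
    return conn.execute(
        """
        SELECT t.scientific_name, COUNT(s.specimen_id)
        FROM taxon AS t
        JOIN specimen AS s ON s.taxon_id = t.taxon_id
        GROUP BY t.taxon_id
        ORDER BY t.scientific_name
        """
    ).fetchall()
```

Each specimen row points to exactly one taxon and one collection event, which is the kind of one-to-many relationship an ER diagram draws as a line between two boxes.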

Decision Errors in Data Science

From Big Data Analytics lecture 2, I was most impressed by the slide concerning decision errors in logic.

I imagine most data scientists are fans of Mr. Spock.  No need to be in the Captain’s Chair, but a strong need to contribute meaningful analysis to important decisions.

Any Star Trek fan can quote Mr. Spock’s sage observation, “Logic is the beginning of wisdom, not the end.”

Logic is critical to data science and to the wisdom that can arise from it. However, logical errors can creep in, as pointed out by Dr. Wang’s slide:

Typical Decision Errors: Logic

  • Not asking the right questions
  • Making incorrect assumptions and failing to test them
  • Using analytics to justify instead of learning the facts
  • Interpreting data incorrectly
  • Failing to understand the alternatives

My Geographic Information Systems – Spatial Databases and Data Management course instructor (Dr. Ralston) has a graphic on his door about “correlation and causation.”  His graphic shows a link between decreasing use of Windows Internet Explorer and a correlated decrease in murders.

The refrain is always “correlation does not imply causation.” The logic might be sound and the math might add up, but the pitfalls remain.
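
A small sketch of how this kind of spurious correlation can appear. The two series below are synthetic and generated independently (they are not the data behind the graphic); the variable names and parameters are assumptions chosen only for illustration. Both series simply decline over time, and that shared trend alone is enough to produce a large correlation coefficient.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def declining_series(start, step, noise, n, rng):
    """A straight downward trend plus independent Gaussian noise."""
    return [start - step * t + rng.gauss(0, noise) for t in range(n)]


def changes(xs):
    """Period-over-period differences, which remove a shared linear trend."""
    return [b - a for a, b in zip(xs, xs[1:])]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
periods = 40

# Two unrelated synthetic series that both happen to trend downward.
browser_share = declining_series(90.0, 1.0, 2.0, periods, rng)
incident_rate = declining_series(60.0, 0.8, 2.0, periods, rng)

r_levels = pearson_r(browser_share, incident_rate)
r_changes = pearson_r(changes(browser_share), changes(incident_rate))

print(f"correlation of raw levels:  {r_levels:.2f}")
print(f"correlation of the changes: {r_changes:.2f}")
```

Correlating the raw levels mostly measures "both went down over time." Correlating the period-over-period changes removes that shared trend, and what is left is only the independent noise.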

I often wonder if some of the data science “boot camps” and workshops can effectively impart these key lessons that are central to the process of science.

 

IS 599 – Spring 2014 Practicum Experience

I am interested in a career in environmental information management,
particularly in a governmental natural resource management agency. My course
work to date includes classes in geographic information science,
environmental information management, and data visualization for
environmental science.

Practicum Location: Great Smoky Mountains National Park; National Institute
for Computational Sciences

Practicum Objectives:
I would like to develop advanced environmental information processing and
data visualization skills by working with species occurrence records and a
high performance computing environment as part of a technology transfer
project between the University of Tennessee and the National Park Service.

The following four practicum goals and associated outcomes are proposed for
this project:

(1) Develop proficiency in running the MaxEnt species distribution modelling
program in a PC environment for determining probability of species
distribution given environmental variables and demonstrate acquired
proficiency by providing training and instruction to Park Service staff in
use of the MaxEnt program on Park resources configured to run MaxEnt.
Training materials and sessions will be produced as an outcome of the
practicum.

(2) Gain skills with workflow and parallel processing in a high performance
computing environment on a single-system-image supercomputer and demonstrate
these skills by generating species distribution models as requested by
practicum supervisor. There are currently 540 species models out of ~36,000
species in the park. A collection of new models will demonstrate the outcome
of the practicum.

(3) Create documentation for running the MaxEnt model in a PC environment
using appropriate technology such as a wiki with walkthroughs, screen
captures, or video screencasts as appropriate. A URL will be provided to the
final online documentation to demonstrate the outcome of the practicum.

(4) Practice sound data curation principles in managing both model inputs and
model outputs by successfully building on the store of models available at .
An HPC data management system such as XSEDE (XSEDE.org) will be used to
manage the inputs and outputs to demonstrate the outcome of the practicum.
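
The batch workflow in goal (2) could be sketched roughly as follows, assuming each model run can be wrapped in a single Python function. The function `run_species_model` is a hypothetical placeholder of my own: it does not call MaxEnt, and the real run would launch the modelling program with whatever inputs the practicum supervisor specifies.

```python
from concurrent.futures import ThreadPoolExecutor

# Figures taken from goal (2): models that exist vs. species in the park.
MODELED_SPECIES = 540
TOTAL_SPECIES = 36_000
coverage_pct = 100 * MODELED_SPECIES / TOTAL_SPECIES  # share already modelled


def run_species_model(species_name):
    """Hypothetical stand-in for one model run.

    A real version would launch the modelling program for this species
    and return the location of its outputs.
    """
    return {"species": species_name, "status": "done"}


def run_batch(species_names, max_workers=4):
    """Run many independent model jobs concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_species_model, species_names))
```

Because each species model is independent of the others, the jobs can be handed out to workers in any order, which is what makes this kind of task a good fit for a high performance computing environment.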

As a rough estimate, I expect to spend about 1/3 of the required 150
hours learning MaxEnt on PC and HPC environments, 1/3 writing documentation,
and 1/3 creating and delivering training (to commence in March, 2014) to
enable NPS staff to implement MaxEnt modelling on both PC and HPC platforms.

IS 592 Big Data Analytics

Catalog Description

Introduces the concepts of big data and data analytics as an emerging field. To address the opportunities and challenges of big data in academics, businesses, sciences, the Web, etc. To understand the nature of big data analytics and their various contexts. To master basic concepts and process of data analytics. To design analytics initiatives/proposals. To practice data mining techniques and skills. To explore data modeling and visualizing.

Pre-requisite: Database Management Systems (completion of IS584 or equivalent)

Goals/Objectives

  • To survey the needs and importance of data analytics in various contexts
  • To understand the challenges of managing big data
  • To practice data extraction, transformation and load techniques (ETL)
  • To develop algorithms to analyze and model data
  • To design effective ways for communicating results to special users

Methods of Teaching/Learning

This course is built on knowledge and skills of database management systems. The focus will be on issues challenging organizational decision-making, real world data needs that call for methods of data management, analytics, and modeling to derive new knowledge for better decision making.

Students are expected to read broadly and to work on real data collected from the real world. This course is managed using Blackboard courseware, which is accessible using your UT NetID and Password at https://bblearn.utk.edu/. The Blackboard Collaborate, a tool hosted in Blackboard, will be used for synchronous virtual class sessions; you may attend classes from anywhere in the world. The course materials, assignments, and grades are accessible in Blackboard.

Course Materials

Required text:

Jeffrey Stanton with Robert De Graaf (c2013) Version 3: Introduction to Data Science at http://jsresearch.net/

Optional texts:

Bill Franks (2012) Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics (Wiley and SAS Business Series) 337 pages ISBN: 1118208781

Thomas H. Davenport, Jeanne G. Harris (2010) Analytics at Work: Smarter Decisions, Better Results [Hardcover] 240 pages. Harvard Business Review Press

Douglas W. Hubbard (2010) How to Measure Anything: Finding the Value of Intangibles in Business. (2nd edition) 320 pages. Wiley. ISBN-10: 0470539399; ISBN-13: 978-0470539392

Tasks and Evaluation Criteria
• Attendance & Participation (15%)

Prepared attendance and participation in course activities are important to success in this course. If you have to miss a class for whatever reason, you are still responsible for the material covered. If you miss a class, you may replay the recording. Blackboard Collaborate keeps track of attendance and replay.

Class activities include presentations and discussion.

• ePortfolio or Journal (10%)

Be a reflective learner! Throughout the semester, you should maintain a learning journal or ePortfolio. Write journal entries to reflect your thoughts, analyze critical incidents, and check milestones.

If you have taken the ePortfolio course, you should continue building your ePortfolio in this course by writing Posts to reflect on your learning and achievements. At the end of the semester, you will write a reflective summary for the course as a Page in your ePortfolio.

If you have not taken the ePortfolio course, you may keep a structured journal with dated entries and write a final reflection piece. You submit the reflection along with selected journal entries in any format accessible to the instructor.

Make your learning and achievements visible through the development of a course ePortfolio. Journal entries or ePortfolio Posts document your learning and professional growth with evidence and through reflection on learning experiences. Both collecting artifacts and reflecting in journal entries are private actions, but presenting outcomes and sharing a reflective summary are oriented toward a product for the public (or your evaluators).

What to write in journal entries (ePortfolio posts)? You do not need to report or log what you have done during the course. You need to focus on significant learning incidents, aha moments, relevant thoughts, analysis and synthesis of important concepts, and milestone checking. Reflection is a higher level of cognitive activity in which you make sense of what and how you learned. For example, when you encountered a challenging problem, you should reflect on the strategies and the process through which you were, or were not, able to solve the problem. For ePortfolio students, you should classify your journal entries so that they can be easily accessed to facilitate a higher level of synthesis later in producing your final ePortfolio. For non-ePortfolio students, you should structure your journal with meaningful headings, which will help you to develop a summary reflection of the semester as your last journal entry.

• Assignments (Check Schedule for Due dates):

1. Data Science (15%)

Understand the nature of data analytics in context. Understand the skill set of data scientists.

2. Data Preparation: Extract, Transform and Load (ETL) (30%)
Extract the relevant data from original sources (the raw data); transform raw data to appropriate format; load the transformed data to a database.

3. Data Analysis and Modeling (30%)
Explore the transformed data to derive meaningful results (statistical analysis, pattern recognition, trend visualization)
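
The extract, transform, and load steps in assignment 2, followed by a first exploratory query in the spirit of assignment 3, can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the CSV content, the column names, and the `observation` table are made up and are not the course data or the assignment solution.

```python
import csv
import io
import sqlite3

# Made-up raw data with typical problems: stray whitespace and a missing value.
RAW_CSV = """species,n_observed,observed_on
 Example species A ,3,2014-01-05
Example species B,,2014-01-06
Example species A,2,2014-01-07
"""


def extract(text):
    """Extract: read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, convert types, drop rows missing a count."""
    cleaned = []
    for row in rows:
        n_observed = row["n_observed"].strip()
        if not n_observed:
            continue  # skip records with no count rather than guessing one
        cleaned.append(
            (row["species"].strip(), int(n_observed), row["observed_on"].strip())
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observation "
        "(species TEXT, n_observed INTEGER, observed_on TEXT)"
    )
    conn.executemany("INSERT INTO observation VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# A first exploratory query on the loaded data: total count per species.
totals = conn.execute(
    "SELECT species, SUM(n_observed) FROM observation "
    "GROUP BY species ORDER BY species"
).fetchall()
```

Keeping the three steps as separate functions makes it easy to inspect the data after each stage, which helps when the raw source turns out to be messier than expected.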

SP-2014-INSC-592.pdf

A Human-Centered Approach to Studying the Spatial Visualization of Non-Spatial Information

Today I attended a talk at UT College of Electrical Engineering and Computer Science about the spatial visualization of non-spatial information. I would tend to argue that no information lacks a spatial component, so it might be better to say "non-geospatial" information. Below is the pertinent information from the talk:

Abstract: Many visual applications, such as visual analytics tools and educational games, employ spatial information presentations to support data exploration and improve understanding. However, it is not well understood how to take advantage of spatial information layouts, especially when dealing with large data sets, abstract information, and multiple display options. As a result, it is often unclear how to effectively design spatial visualizations for learning and sense-making. My research addresses this problem through controlled experimentation and observation. My work focuses on the evaluation of interface design factors for information presentations on physically-large 2D displays and in immersive 3D virtual reality systems. In this talk, I will discuss several projects that evaluate task performance and information processing strategies, with a specific example involving scientific data exploration. Overall, the results suggest that supplemental spatial information can affect mental strategies and support performance improvements for cognitive processing, but the effectiveness of spatial presentations is dependent on the nature of the task and a meaningful use of space. I will close with a discussion of how the lessons learned from user studies affect the design of visual analytics tools.

Bio: Eric D. Ragan is a visual analytics research scientist at Oak Ridge National Laboratory. Eric works within the Situation Awareness and Visual Analytics Team in the Cyberspace Sciences and Information Intelligence Research Group. His research interests include immersive virtual reality, interface evaluation, visual analytics, educational software, training systems, and human-computer interaction. Eric’s research involves human learning and information exploration with spatially distributed visualizations of non-spatial information. He is studying visualization systems that aid organizing evidence, communicate analytic provenance, and support streaming data. Eric received his PhD and MS degrees in computer science from Virginia Tech and a BS in mathematics and computer science at Gannon University.

Big Data Analytics Coursework

Big Data Analytics is listed as IS 592 in the catalog. Because it is a higher-level class building on material from IS 584 (Database Management Systems), I’m looking forward to exploring data analytics and data science in the formal classroom setting.

[Figure: data life cycle with the steps Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze]

Steps in the data life cycle include “Analyze”

I’m grateful to have the opportunity to study this topic at the School of Information Sciences. It is one of a handful of courses in the overall course catalog that are closely aligned with my career goal of working as a professional data manager. My intuition is that individuals who are trained to manage data must have an extensive knowledge of the data life cycle to effectively manage data across the spectrum of its life. Note that “Analyze” is one of the last steps. Even for strictly traditional views of curation, an archivist who is not familiar with the flow of information in the “Analyze” step is not well positioned to receive and curate research output that is increasingly data intensive.

From the first lecture, a quote caught my attention:

We project a need for 1.5 million additional managers and analysts in the United States who can ask the right questions and consume the results of the analysis of Big Data effectively.

This is a huge and growing field. What caught my eye is the growing number of specialty schools to turn out students with the skills to analyze data. Even at UT-Knoxville, there are three “data science” courses: one in Computer Science, one in Statistics, and one in Information Sciences. These sit in three separate colleges (Engineering, Arts & Sciences, and Communication and Information). Established information studies programs are likely seats for new data science programs and curricula. Berkeley’s School of Information just stood up a new program to support a Masters in Information and Data Science. Other short “boot camps” are offered, although I’m not sure if those programs will produce data “scientists” or just data “analysts.” The key from the quote above is “ask the right questions.” Are our new data science programs able to impart the skill to ask questions using the scientific method?

From the first lecture, here are skills that data analysts need:

  1. Database
  2. Data mining
  3. Statistical applications
  4. Predictive analytics
  5. Business intelligence
  6. Data modeling and data visualization
  7. Metacognitive strategies
  8. Interpersonal

Coursework for IS 592 will be collected at the following URL: https://mountainsol.wordpress.com/category/coursework/big-data-analtyics/

Geographic Information Management and Processing Course Textbooks

Getting to Know ArcGIS ModelBuilder

Author: David W. Allen
ISBN: 9781589482555
Status: Required

Modeling Our World

Author: Michael Zeiler; Environmental Sy…
ISBN: 9781589482784
Status: Required