Data Intensive Summer School, June 30 – July 2, 2014

From: https://www.xsede.org/web/xup/course-calendar/-/training-user/class/263/session/384


 

The Data Intensive Summer School focuses on the skills needed to manage, process and gain insight from large amounts of data. It is targeted at researchers from the physical, biological, economic and social sciences who are beginning to drown in data. We will cover the nuts and bolts of data intensive computing, common tools and software, predictive analytics algorithms, data management and visualization. Given the short duration of the summer school, the emphasis will be on providing a solid foundation that attendees can use as a starting point for advanced topics of particular relevance to their work.

Prerequisites

  • Experience working in a Linux environment
  • Familiarity with the relational database model
  • Examples and assignments will most likely use R, MATLAB and Weka. We do not require experience in these languages or tools, but you should already have an understanding of basic programming concepts (loops, conditionals, functions, arrays, variables, scoping, etc.)

Organizer

  • Robert Sinkovits, San Diego Supercomputer Center

Topics (tentative)

  • Nuts and bolts of data intensive computing
    • Computer hardware, storage devices and file systems
    • Cloud storage
    • Data compression
    • Networking and data movement
  • Data management
    • Digital libraries and archives
    • Data management plans
    • Access control, integrity and provenance
  • Introduction to R programming
  • Introduction to Weka
  • Predictive analytics
    • Dealing with missing data
    • Standard algorithms: k-means clustering, decision trees, SVM
    • Over-fitting and trusting results
  • ETL (Extract, transform and load)
    • The ETL life cycle
    • ETL tools – from scripts to commercial solutions
  • Non-relational databases
    • Brief refresher on the relational model
    • Survey of non-relational models and technologies
  • Visualization
    • Presentation of data for maximum insight
    • R and the ggplot package

Virtual Summer School courses are delivered simultaneously at multiple locations across the country using high-definition videoconferencing technology.


 

On June 26 I received a follow-up e-mail with notes from the instructors:


 

Preparing for the virtual summer school

Several of the instructors have requested that you preinstall software on your laptop. Given the large number of participants and the compressed schedule, we ask that you comply and do this before the start of the summer school.

RStudio (IDE for the R statistical programming language)

Follow “download RStudio Desktop”

http://www.rstudio.com/ide/download

WEKA (data mining software)

Follow “Download” link on left hand side of home page

http://www.cs.waikato.ac.nz/ml/weka/

Please download the Stable book 3rd ed. version

Prior knowledge of R is not required, but we do assume that you have some programming experience and familiarity with basic programming concepts (variables, arrays, loops, branching, etc.). You may find it helpful to acquaint yourself with basic R syntax ahead of time.

Reading the first two chapters of the following online introduction is recommended: http://cran.r-project.org/doc/manuals/R-intro.html

A basic understanding of relational databases and SQL would also be useful. If you are unfamiliar with SQL syntax, please consider the following tutorials:

http://sqlzoo.net

http://www.w3schools.com/sql/sql_intro.asp


I already have RStudio; I have never tried Weka. This is a little bit of added work for the summer, but it looks like a great opportunity to pick up some additional skills, or at least refresh the skills I’ve already acquired.

 


About Tanner Jessel

I am a recent M.S. in Information Science graduate from the University of Tennessee School of Information Science. I was formerly a graduate research assistant funded by DataONE (Data Observation Network for Earth). Prior, I worked for four years as a content lead and biodiversity scientist with the U.S. Geological Survey's Biodiversity Informatics Program. Building on my work experience in biodiversity and environmental informatics, my work with DataONE focused on exploring the nature of scientific collaborations necessary for scientific inquiry. I also conducted research concerning user experience and usability, and assisted in development of member nodes with an emphasis on spatial data and infrastructure. I assisted with research designed to understand sociocultural issues within collaborative research communities. Through August 1, 2014, I was based at the Center for Information and Communication Studies at the University of Tennessee School of Information Science in Knoxville, Tennessee.

Posted on June 26, 2014, in Training and Certifications.
