Blog Archives

Data Intensive Summer School, June 30 – July 2, 2014

From: https://www.xsede.org/web/xup/course-calendar/-/training-user/class/263/session/384


 

The Data Intensive Summer School focuses on the skills needed to manage, process and gain insight from large amounts of data. It is targeted at researchers from the physical, biological, economic and social sciences who are beginning to drown in data. We will cover the nuts and bolts of data-intensive computing, common tools and software, predictive analytics algorithms, data management and visualization. Given the short duration of the summer school, the emphasis will be on providing a solid foundation that attendees can use as a starting point for advanced topics of particular relevance to their work.

Prerequisites

  • Experience working in a Linux environment
  • Familiarity with the relational database model
  • Examples and assignments will most likely use R, MATLAB and Weka. We do not require experience in these languages or tools, but you should already have an understanding of basic programming concepts (loops, conditionals, functions, arrays, variables, scoping, etc.)

Organizer

  • Robert Sinkovits, San Diego Supercomputer Center

Topics (tentative)

  • Nuts and bolts of data intensive computing
    • Computer hardware, storage devices and file systems
    • Cloud storage
    • Data compression
    • Networking and data movement
  • Data Management
    • Digital libraries and archives
    • Data management plans
    • Access control, integrity and provenance
  • Introduction to R programming
  • Introduction to Weka
  • Predictive analytics
    • Dealing with missing data
    • Standard algorithms: k-means clustering, decision trees, SVM
    • Over-fitting and trusting results
  • ETL (Extract, transform and load)
    • The ETL life cycle
    • ETL tools – from scripts to commercial solutions
  • Non-relational databases
    • Brief refresher on the relational model
    • Survey of non-relational models and technologies
  • Visualization
    • Presentation of data for maximum insight
    • R and the ggplot package

Virtual Summer School courses are delivered simultaneously at multiple locations across the country using high-definition videoconferencing technology.


 

On June 26 I received a follow-up e-mail with notes from the instructors:


 

Preparing for the virtual summer school

Several of the instructors have requested that you preinstall software on your laptop. Given the large number of participants and the compressed schedule, we ask that you comply and do this before the start of the summer school.

RStudio (development environment for the R statistical programming language)

Follow “download RStudio Desktop”

http://www.rstudio.com/ide/download

WEKA (data mining software)

Follow the “Download” link on the left-hand side of the home page

http://www.cs.waikato.ac.nz/ml/weka/

Please download the stable “book 3rd ed.” version.

Prior knowledge of R is not required, but we do assume that you have some programming experience and familiarity with basic programming concepts (variables, arrays, loops, branching, etc.). You may find it helpful to acquaint yourself with basic R syntax ahead of time.

Reading the first two chapters of the following online introduction is recommended: http://cran.r-project.org/doc/manuals/R-intro.html

A basic understanding of relational databases and SQL would also be useful. If you are unfamiliar with SQL syntax, please consider the following tutorials:

http://sqlzoo.net

http://www.w3schools.com/sql/sql_intro.asp


I already have RStudio; I have never tried Weka. This is a little bit of added work for the summer, but it looks like a great opportunity to pick up some additional skills, or at least refresh the skills I’ve already acquired.

 

Statistics for Data Science

Today’s lecture for Big Data Analytics included statistical tools for data analysis.

My Data Pro Tumble blog includes several listings and resources concerning statistics <http://mountainsol.tumblr.com/tagged/statistics>.

From the perspective of an information scientist, statistical analysis software is not just about the computation performed, but also about preserving the input, the output, and the processing.
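A minimal sketch of that idea, assuming Python and only its standard library: the snippet computes two summary statistics and stores them alongside a fingerprint of the input and a note on how they were produced. The function name `summarize_with_provenance` and the layout of the record are hypothetical, chosen only for illustration; they are not taken from the lecture or from any particular statistics package.

```python
import hashlib
import json
import statistics


def summarize_with_provenance(values, method="mean_and_stdev"):
    """Compute summary statistics and keep a record of input, processing, and output.

    Hypothetical helper for illustration only.
    """
    # Serialize the input so it can be fingerprinted; the same list of
    # numbers always serializes to the same bytes and so the same hash.
    serialized = json.dumps(values).encode("utf-8")
    return {
        "input": {
            "n": len(values),
            "sha256": hashlib.sha256(serialized).hexdigest(),
        },
        "processing": {
            "method": method,
            "library": "Python statistics module",
        },
        "output": {
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values),  # sample standard deviation
        },
    }


record = summarize_with_provenance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(json.dumps(record, indent=2))
```

Saving a record like this next to the results makes it possible to check later that a given output really came from a given input and procedure.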

One of the more popular statistical software packages is R, which actually does a lot more than work with statistics, as one of my recent tweets showed.

There’s a short introduction to R which explains:

R is a tool for statistics and data modeling. The R programming language is elegant, versatile, and has a highly expressive syntax designed around working with data. R is more than that, though — it also includes extremely powerful graphics capabilities. If you want to easily manipulate your data and present it in compelling ways, R is the tool for you.

http://tryr.codeschool.com/

It’s also possible to run R from the terminal in Mac OS X, but a nice interface for using R is RStudio <https://www.rstudio.com/>.

Other useful links:

http://ropensci.org/

http://cran.us.r-project.org/

http://www.statmethods.net/index.html

 

 

Statistics for Research I


Dr. James Schmidhammer, Instructor

Statistics 537

College of Business Administration

Department of Statistics, Operations & Management Science

Instructor: Dr. James L. Schmidhammer

Syllabus Online: FA2013-STAT-537

Course Website: http://www.bus.utk.edu/stat/stat537/

 

 

 

 

Required Text: An Introduction to Statistical Methods and Data Analysis (Sixth Edition)

by R. Lyman Ott and Michael Longnecker

© 2010, Brooks/Cole, Cengage Learning

ISBN-13: 978-0-495-01758-5

Optional Text: Student Solutions Manual for

An Introduction to Statistical Methods and Data Analysis (Sixth Edition)

by R. Lyman Ott and Michael Longnecker

© 2010, Brooks/Cole, Cengage Learning

ISBN-13: 978-0-495-10915-0

Recommended Text: The Little SAS Book for Enterprise Guide 4.2

by Susan J. Slaughter & Lora D. Delwiche 

© 2010, SAS Institute Inc.
SAS PUBCODE: 61861
ISBN-13: 978-1-59994-726-6

Software: SAS Version 9.3 with Enterprise Guide 5.1

Available for download at http://oit.utk.edu/software,

and on DVDs obtained at the UT Computer Store

Available for use on https://apps.utk.edu/

Grading: Homework (⅓) / Midterm (⅓) / Final (⅓)

Topics and Coverage (coverage codes are defined in the key at the end of this section)

  • Data Collection (Population vs. Sample): I
  • Data Display (Graphics): I
  • Data Summarization (Summary Statistics): I
  • Exploratory Data Analysis: S
  • Use of Statistics Software
    • SAS: T
    • JMP: 0
    • SPSS: 0
    • S-Plus: 0
  • Probability: S
  • Probability Distributions
    • Discrete: S
    • Continuous: I
    • Sampling: S
  • Parametric Estimation (Point & Interval)
    • Single Sample (μ): T
    • Single Sample (σ²): I
    • Two Independent Samples (μ1 − μ2): T
    • Two Related (Paired) Samples (μ1 − μ2): T
  • Nonparametric Estimation
    • Single Sample (Median): S
    • Two Independent Samples: 0
    • Two Related (Paired) Samples: 0
  • Parametric Hypothesis Testing (t-tests)
    • Single Sample (μ = μ0): T
    • 2 Independent Samples (μ1 = μ2)
      • Assuming equal variances: T
      • Assuming unequal variances: T
    • 2 Independent Samples (Levene’s Test σ1 = σ2): T
    • 2 Related (Paired) Samples (μ1 = μ2): T
  • Nonparametric Hypothesis Testing
    • Single Sample
      • Sign Test: I
      • Wilcoxon Signed Ranks Test: T
    • Two Independent Samples
      • Wilcoxon Rank Sum Test: T
    • Two Related (Paired) Samples
      • Sign Test: I
      • Wilcoxon Signed Ranks Test: T
    • Rank Transformation Tests: I
  • Assessing Normality
    • Normal Probability Plots: I
    • Tests for Normality: T
  • Robustness to Violations of Assumptions: I
  • Categorical Data Analysis
    • 2 × 2 contingency tables: I
    • r × c contingency tables: I
  • Correlation
    • Pearson’s r: T
    • Spearman’s ρ: I
    • Kendall’s τ: I
  • Simple Linear Regression: S
  • One Way ANOVA
    • Testing equality of means: I
    • Post Hoc procedures: S

   
Math Requirements: Moderate
Computing Requirements: Moderate

Coverage key:

  • 0: No Coverage
  • S: Slight Coverage
  • I: Intermediate Coverage
  • T: Thorough Coverage