Blog Archives
Data Intensive Summer School, June 30 – July 2, 2014
From: https://www.xsede.org/web/xup/coursecalendar//traininguser/class/263/session/384
The Data Intensive Summer School focuses on the skills needed to manage, process and gain insight from large amounts of data. It is targeted at researchers from the physical, biological, economic and social sciences that are beginning to drown in data. We will cover the nuts and bolts of data intensive computing, common tools and software, predictive analytics algorithms, data management and visualization. Given the short duration of the summer school, the emphasis will be on providing a solid foundation that the attendees can use as a starting point for advanced topics of particular relevance to their work.
Prerequisites
 Experience working in a Linux environment
 Familiarity with the relational database model
 Examples and assignments will most likely use R, MATLAB and Weka. We do not require experience in these languages or tools, but you should already have an understanding of basic programming concepts (loops, conditionals, functions, arrays, variables, scoping, etc.)
Organizer
 Robert Sinkovits, San Diego Supercomputer Center
Topics (tentative)
 Nuts and bolts of data intensive computing
 Computer hardware, storage devices and file systems
 Cloud storage
 Data compression
 Networking and data movement
 Data Management
 Introduction to R programming
 Digital libraries and archives
 Data management plans
 Access control, integrity and provenance
 Introduction to Weka
 Predictive analytics
 Dealing with missing data
 Standard algorithms: k-means clustering, decision trees, SVM
 Overfitting and trusting results
 ETL (Extract, transform and load)
 The ETL life cycle
 ETL tools – from scripts to commercial solutions
 Nonrelational databases
 Brief refresher on relational model
 Survey of nonrelational models and technologies
 Visualization
 Presentation of data for maximum insight
 R and ggplot package
Virtual Summer School courses are delivered simultaneously at multiple locations across the country using high-definition videoconferencing technology.
On June 26 I received a follow-up email with notes from the instructors:
Preparing for the virtual summer school
Several of the instructors have requested that you preinstall software on your laptop. Given the large number of participants and the compressed schedule, we ask that you comply and do this before the start of the summer school.
RStudio (development environment for the R statistical programming language)
Follow “download RStudio Desktop”
http://www.rstudio.com/ide/download
WEKA (data mining software)
Follow “Download” link on left hand side of home page
http://www.cs.waikato.ac.nz/ml/weka/
Please download the Stable book 3rd ed. version
Prior knowledge of R is not required, but we do assume that you have some programming experience and familiarity with basic programming concepts (variables, arrays, loops, branching, etc.). You may find it helpful to acquaint yourself with basic R syntax ahead of time.
Reading the first two chapters of the following online introduction is recommended: http://cran.r-project.org/doc/manuals/R-intro.html
A basic understanding of relational databases and SQL would also be useful. If you are unfamiliar with SQL syntax, please consider the following tutorial:
http://www.w3schools.com/sql/sql_intro.asp
I already have RStudio; I have never tried Weka. This is a little bit of added work for the summer, but it looks like a great opportunity to pick up some additional skills, or at least refresh the skills I’ve already acquired.
Statistics for Data Science
Today’s lecture for Big Data Analytics included statistical tools for data analysis.
My Data Pro Tumble blog includes several listings and resources concerning statistics <http://mountainsol.tumblr.com/tagged/statistics>.
From the perspective of an information scientist, statistical analysis software is about more than the computation itself; it is also about preserving the input, the output, and the processing steps.
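As a rough illustration of that idea, here is a minimal sketch in Python of a computation that keeps its input, its processing details, and its output together in one record. The function name `run_with_provenance` and the record layout are my own assumptions for illustration, not part of any particular statistics package.

```python
import hashlib
import json
import statistics


def run_with_provenance(values, method="mean"):
    """Compute a summary statistic and return it with a provenance record."""
    # Hypothetical registry of supported computations (illustration only).
    methods = {"mean": statistics.mean, "median": statistics.median}
    result = methods[method](values)

    # Fingerprint the input so the exact data can be verified later.
    input_bytes = json.dumps(values).encode("utf-8")
    return {
        "input": values,
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "processing": {"method": method, "library": "statistics (Python stdlib)"},
        "output": result,
    }


record = run_with_provenance([2, 4, 4, 5, 10], method="mean")
print(json.dumps(record, indent=2))
```

Saving a record like this next to the results is one simple way to keep track of what was computed, from which data, and how.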
One of the more popular statistical software packages is R, which actually does a lot more than statistics (as one of my recent tweets showed).
There’s a short introduction to R which explains:
R is a tool for statistics and data modeling. The R programming language is elegant, versatile, and has a highly expressive syntax designed around working with data. R is more than that, though — it also includes extremely powerful graphics capabilities. If you want to easily manipulate your data and present it in compelling ways, R is the tool for you.
It’s also possible to run R from the terminal in Mac OS X, but a nice interface for using R is RStudio <https://www.rstudio.com/>.
Other useful links:
http://www.statmethods.net/index.html
Statistics for Research I
Statistics 537
College of Business Administration
Department of Statistics, Operations & Management Science
Instructor: Dr. James L. Schmidhammer
Syllabus Online: FA2013STAT537
Course Website: http://www.bus.utk.edu/stat/stat537/
Required Text  An Introduction to Statistical Methods and Data Analysis (Sixth Edition) by R. Lyman Ott and Michael Longnecker © 2010, Brooks/Cole, Cengage Learning ISBN-13: 9780495017585
Optional Text  Student Solutions Manual for An Introduction to Statistical Methods and Data Analysis (Sixth Edition) by R. Lyman Ott and Michael Longnecker © 2010, Brooks/Cole, Cengage Learning ISBN-13: 9780495109150
Recommended Text  The Little SAS Book for Enterprise Guide 4.2 by Susan J. Slaughter & Lora D. Delwiche © 2010, SAS Institute Inc.
Software  SAS Version 9.3 with Enterprise Guide 5.1. Available for download at http://oit.utk.edu/software, and on DVDs obtained at the UT Computer Store. Available for use on https://apps.utk.edu/
Grading  Homework (⅓) / Midterm (⅓) / Final (⅓)
Topic  Coverage
Data Collection (Population vs. Sample)  I
Data Display (Graphics)  I
Data Summarization (Summary Statistics)  I
Exploratory Data Analysis  S
Use of Statistics Software
 SAS  T
 JMP  0
 SPSS  0
 S-Plus  0
Probability  S
Probability Distributions
 Discrete  S
 Continuous  I
 Sampling  S
Parametric Estimation (Point & Interval)
 Single Sample (μ)  T
 Single Sample (σ²)  I
 Two Independent Samples (μ₁ − μ₂)  T
 Two Related (Paired) Samples (μ₁ − μ₂)  T
Nonparametric Estimation
 Single Sample (Median)  S
 Two Independent Samples  0
 Two Related (Paired) Samples  0
Parametric Hypothesis Testing (t-tests)  T
 Single Sample (μ = μ₀)  T
 2 Independent Samples (μ₁ = μ₂)  T
 2 Independent Samples (Levene’s Test σ₁ = σ₂)  T
 2 Related (Paired) Samples (μ₁ = μ₂)  T
Nonparametric Hypothesis Testing
 Single Sample
  Sign Test  I
  Wilcoxon Signed Ranks Test  T
 Two Independent Samples
  Wilcoxon Rank Sum Test  T
 Two Related (Paired) Samples
  Sign Test  I
  Wilcoxon Signed Ranks Test  T
 Rank Transformation Tests  I
Assessing Normality
 Normal Probability Plots  I
 Tests for Normality  T
Robustness to Violations of Assumptions  I
Categorical Data Analysis
 2 × 2 contingency tables  I
 r × c contingency tables  I
Correlation
 Pearson’s r  T
 Spearman’s ρ  I
 Kendall’s τ  I
Simple Linear Regression  S
One Way ANOVA
 Testing equality of means  I
 Post Hoc procedures  S
Math Requirements  Moderate 
Computing Requirements  Moderate 
0 No Coverage
S Slight Coverage
I Intermediate Coverage
T Thorough Coverage