Blog Archives

Presentation: Research Data Services Industry

This presentation complements IS 553 Assignment 2, which profiles the research data services industry. It includes some examples of research data output and touches on the challenges facing organizations that seek talent to manage and analyze research data.


Data Intensive Summer School, June 30 – July 2, 2014

From: https://www.xsede.org/web/xup/course-calendar/-/training-user/class/263/session/384


 

The Data Intensive Summer School focuses on the skills needed to manage, process and gain insight from large amounts of data. It is targeted at researchers from the physical, biological, economic and social sciences who are beginning to drown in data. We will cover the nuts and bolts of data intensive computing, common tools and software, predictive analytics algorithms, data management and visualization. Given the short duration of the summer school, the emphasis will be on providing a solid foundation that the attendees can use as a starting point for advanced topics of particular relevance to their work.

Prerequisites

  • Experience working in a Linux environment
  • Familiarity with the relational database model
  • Examples and assignments will most likely use R, MATLAB and Weka. We do not require experience in these languages or tools, but you should already have an understanding of basic programming concepts (loops, conditionals, functions, arrays, variables, scoping, etc.)

Organizer

  • Robert Sinkovits, San Diego Supercomputer Center

Topics (tentative)

  • Nuts and bolts of data intensive computing
    • Computer hardware, storage devices and file systems
    • Cloud storage
    • Data compression
    • Networking and data movement
  • Introduction to R programming
  • Data Management
    • Digital libraries and archives
    • Data management plans
    • Access control, integrity and provenance
  • Introduction to Weka
  • Predictive analytics
    • Dealing with missing data
    • Standard algorithms: k-means clustering, decision trees, SVM
    • Over-fitting and trusting results
  • ETL (Extract, transform and load)
    • The ETL life cycle
    • ETL tools – from scripts to commercial solutions
  • Non-relational databases
    • Brief refresher on relational model
    • Survey of non-relational models and technologies
  • Visualization
    • Presentation of data for maximum insight
    • R and ggplot package
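
To make the "k-means clustering" item in the topic list above a little more concrete, here is a minimal sketch in Python. It is not taken from the summer school materials, which planned to use R, MATLAB and Weka. The function name `kmeans`, the sample points, and the choice to seed the centroids with the first k points are all assumptions made for illustration.

```python
def kmeans(points, k, iterations=10):
    """Cluster 2-D points with Lloyd's algorithm; returns (centroids, labels)."""
    # Simplification (assumption): seed the centroids with the first k points.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (px - centroids[c][0]) ** 2 + (py - centroids[c][1]) ** 2,
            )
            for px, py in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = (
                    sum(x for x, y in members) / len(members),
                    sum(y for x, y in members) / len(members),
                )
    return centroids, labels


# Made-up data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Real tools usually pick the starting centroids at random and rerun the algorithm several times, because a poor start can settle on a poor clustering.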

Virtual Summer School courses are delivered simultaneously at multiple locations across the country using high-definition videoconferencing technology.


 

On June 26 I received a follow-up e-mail with notes from the instructors:


 

Preparing for the virtual summer school

Several of the instructors have requested that you preinstall software on your laptop. Given the large number of participants and the compressed schedule, we ask that you comply and do this before the start of the summer school.

RStudio (IDE for the R statistical programming language)

Follow “download RStudio Desktop”

http://www.rstudio.com/ide/download

WEKA (data mining software)

Follow the “Download” link on the left-hand side of the home page

http://www.cs.waikato.ac.nz/ml/weka/

Please download the Stable book 3rd ed. version

Prior knowledge of R is not required, but we do assume that you have some programming experience and familiarity with basic programming concepts (variables, arrays, loops, branching, etc.). You may find it helpful to acquaint yourself with basic R syntax ahead of time.

Reading the first two chapters of the following online introduction is recommended: http://cran.r-project.org/doc/manuals/R-intro.html

A basic understanding of relational databases and SQL would also be useful. If you are unfamiliar with SQL syntax, please consider the following tutorials:

http://sqlzoo.net

http://www.w3schools.com/sql/sql_intro.asp


I already have RStudio, but I have never tried Weka. This is a little bit of added work for the summer, but it looks like a great opportunity to pick up some additional skills, or at least refresh the skills I’ve already acquired.

 

Statistics for Research I

Dr. James Schmidhammer, Instructor

Statistics 537

College of Business Administration

Department of Statistics, Operations & Management Science

Instructor: Dr. James L. Schmidhammer

Syllabus Online: FA2013-STAT-537

Course Website: http://www.bus.utk.edu/stat/stat537/

Required Text: An Introduction to Statistical Methods and Data Analysis (Sixth Edition), by R. Lyman Ott and Michael Longnecker. © 2010, Brooks/Cole, Cengage Learning. ISBN-13: 978-0-495-01758-5

Optional Text: Student Solutions Manual for An Introduction to Statistical Methods and Data Analysis (Sixth Edition), by R. Lyman Ott and Michael Longnecker. © 2010, Brooks/Cole, Cengage Learning. ISBN-13: 978-0-495-10915-0

Recommended Text: The Little SAS Book for Enterprise Guide 4.2, by Susan J. Slaughter & Lora D. Delwiche. © 2010, SAS Institute Inc. SAS PUBCODE: 61861. ISBN-13: 978-1-59994-726-6

Software: SAS Version 9.3 with Enterprise Guide 5.1. Available for download at http://oit.utk.edu/software, and on DVDs obtained at the UT Computer Store. Available for use on https://apps.utk.edu/

Grading: Homework (⅓) / Midterm (⅓) / Final (⅓)

Topic and Coverage

  • Data Collection (Population vs. Sample): I
  • Data Display (Graphics): I
  • Data Summarization (Summary Statistics): I
  • Exploratory Data Analysis: S
  • Use of Statistics Software
    • SAS: T
    • JMP: 0
    • SPSS: 0
    • S-Plus: 0
  • Probability: S
  • Probability Distributions
    • Discrete: S
    • Continuous: I
    • Sampling: S
  • Parametric Estimation (Point & Interval)
    • Single Sample (μ): T
    • Single Sample (σ²): I
    • Two Independent Samples (μ₁ − μ₂): T
    • Two Related (Paired) Samples (μ₁ − μ₂): T
  • Nonparametric Estimation
    • Single Sample (Median): S
    • Two Independent Samples: 0
    • Two Related (Paired) Samples: 0
  • Parametric Hypothesis Testing (t-tests)
    • Single Sample (μ = μ₀): T
    • 2 Independent Samples (μ₁ = μ₂)
      • Assuming equal variances: T
      • Assuming unequal variances: T
    • 2 Independent Samples (Levene’s Test, σ₁ = σ₂): T
    • 2 Related (Paired) Samples (μ₁ = μ₂): T
  • Nonparametric Hypothesis Testing
    • Single Sample
      • Sign Test: I
      • Wilcoxon Signed Ranks Test: T
    • Two Independent Samples
      • Wilcoxon Rank Sum Test: T
    • Two Related (Paired) Samples
      • Sign Test: I
      • Wilcoxon Signed Ranks Test: T
    • Rank Transformation Tests: I
  • Assessing Normality
    • Normal Probability Plots: I
    • Tests for Normality: T
  • Robustness to Violations of Assumptions: I
  • Categorical Data Analysis
    • 2 × 2 contingency tables: I
    • r × c contingency tables: I
  • Correlation
    • Pearson’s r: T
    • Spearman’s r: I
    • Kendall’s τ: I
  • Simple Linear Regression: S
  • One Way ANOVA
    • Testing equality of means: I
    • Post Hoc procedures: S

Math Requirements: Moderate

Computing Requirements: Moderate

Coverage key

  • 0: No Coverage
  • S: Slight Coverage
  • I: Intermediate Coverage
  • T: Thorough Coverage
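
The syllabus lists two-sample t-tests under both equal and unequal variance assumptions. As a rough illustration of how the two differ, here is a minimal Python sketch of the pooled statistic and the unequal-variance (Welch) statistic. The course itself uses SAS, so this is not course material. The function name `two_sample_t` and the sample data are made up for the example.

```python
import math
import statistics


def two_sample_t(a, b, equal_var=True):
    """Return (t statistic, degrees of freedom) for H0: mu1 = mu2."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    # Sample variances (n - 1 in the denominator).
    v1, v2 = statistics.variance(a), statistics.variance(b)
    if equal_var:
        # Pooled-variance t-test: both groups share one variance estimate.
        df = n1 + n2 - 2
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
        se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    else:
        # Welch's t-test: separate variances, Welch-Satterthwaite df.
        se = math.sqrt(v1 / n1 + v2 / n2)
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
        )
    return (m1 - m2) / se, df


# Made-up data for illustration only.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_pooled, df_pooled = two_sample_t(group_a, group_b, equal_var=True)
t_welch, df_welch = two_sample_t(group_a, group_b, equal_var=False)
print(t_pooled, df_pooled)
print(t_welch, df_welch)
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ. With unequal group sizes the statistics generally differ as well.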