The Data Intensive Summer School focuses on the skills needed to manage, process and gain insight from large amounts of data. It is targeted at researchers from the physical, biological, economic and social sciences that are beginning to drown in data. We will cover the nuts and bolts of data intensive computing, common tools and software, predictive analytics algorithms, data management and visualization. Given the short duration of the summer school, the emphasis will be on providing a solid foundation that the attendees can use as a starting point for advanced topics of particular relevance to their work.
- Experience working in a Linux environment
- Familiarity with the relational database model
- Examples and assignments will most likely use R, MATLAB and Weka. We do not require experience in these languages or tools, but you should already have an understanding of basic programming concepts (loops, conditionals, functions, arrays, variables, scoping, etc.)
- Robert Sinkovits, San Diego Supercomputer Center
- Nuts and bolts of data intensive computing
- Computer hardware, storage devices and file systems
- Cloud storage
- Data compression
- Networking and data movement
- Data Management
- Introduction to R programming
- Digital libraries and archives
- Data management plans
- Access control, integrity and provenance
- Introduction to Weka
- Predictive analytics
- Dealing with missing data
- Standard algorithms: k-means clustering, decision trees, SVM
- Over-fitting and trusting results
- ETL (Extract, transform and load)
- The ETL life cycle
- ETL tools – from scripts to commercial solutions
- Non-relational databases
- Brief refresher on relational model
- Survey of non-relational models and technologies
- Presentation of data for maximum insight
- R and ggplot package
Virtual Summer School courses are delivered simultaneously at multiple locations across the country using high-definition videoconferencing technology.
On June 26 I received a follow-up e-mail with notes from the instructors:
Preparing for the virtual summer school
Several of the instructors have requested that you preinstall software on your laptop. Given the large number of participants and the compressed schedule, we ask that you comply and do this before the start of the summer school.
R Studio (statistical programming language)
Follow “download RStudio Desktop”
WEKA (data mining software)
Follow “Download” link on left hand side of home page
Please download the Stable book 3rd ed. version
Prior knowledge of R is not required, but we do assume that you have some programming experience and familiarity with basic programming concepts (variables, arrays, loops, branching, etc.). You may find it helpful to acquaint yourself with basic R syntax ahead of time.
Reading the first two chapters of the following online introduction is recommended http://cran.r-project.org/doc/manuals/R-intro.html
A basic understanding of relational databases and SQL would also be useful. If you are unfamiliar with the SQL syntax, please consider the following tutorials
I already have RStudio; I have never tried Weka. This is a little bit of added work for the summer, but it looks like a great opportunity to pick up some additional skills, or at least refresh those skills I’ve already acquired.
A friend of mine in the program commented that about 90% of the time doing data science is obtaining and cleaning data.
This is where programming is incredibly useful. In the second year of my Masters program, my programming skills are not yet at the level that I want them to be.
I recently started some work for my research assistantship concerning Twitter data for @DataONEorg.
I’m interested in the content of posts, and the relationships between the actors in the network.
In terms of content, I’d like to look at the hashtags and links.
To illustrate how difficult it is to accomplish tasks “by hand,” I recently tried to harvest the Twitter data from a free site. My efforts are documented here: <https://notebooks.dataone.org/data-science/harvesting-dataoneorg-twitter-mentions-via-topsy/>.
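For contrast with the by-hand approach, here is a minimal sketch of how hashtags, links, and mentioned accounts could be pulled out of tweet text with a short script. The sample tweets, the `example.org` URLs, and the function names are all made up for illustration; real harvested data would need more careful handling (shortened links, punctuation stuck to URLs, and so on).

```python
import re
from collections import Counter

# Hypothetical sample tweets; real data would come from a harvested export.
tweets = [
    "Great webinar from @DataONEorg on #datamanagement http://example.org/webinar",
    "New member node announced! #DataONE #opendata https://example.org/news",
    "Thinking about #datamanagement plans with @DataONEorg",
]

HASHTAG_RE = re.compile(r"#(\w+)")      # word characters after '#'
MENTION_RE = re.compile(r"@(\w+)")      # word characters after '@'
URL_RE = re.compile(r"https?://\S+")    # naive: runs until the next whitespace


def extract_hashtags(text):
    """Return the hashtags in one tweet, lowercased and without the '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def extract_mentions(text):
    """Return the account names mentioned in one tweet, without the '@'."""
    return MENTION_RE.findall(text)


def extract_links(text):
    """Return the URLs found in one tweet."""
    return URL_RE.findall(text)


# Content: which hashtags appear, and how often?
hashtag_counts = Counter(tag for t in tweets for tag in extract_hashtags(t))

# Content: which links are being shared?
links = [url for t in tweets for url in extract_links(t)]

# Actors: which accounts are mentioned, and how often?
mention_counts = Counter(m for t in tweets for m in extract_mentions(t))

print(hashtag_counts.most_common())
print(links)
print(mention_counts.most_common())
```

The mention counts are only a starting point for the network side; building actual relationships between actors would also need the author of each tweet, which this sample does not include.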
I’ve read that employers should not hire a “data scientist” if the so-called “scientist” does not have programming skills. For this reason, I’m disappointed that the School of Information Science does not offer a programming course within the School itself. (I’ve heard Dr. Potnis will offer a course in Fall 2014, a semester after my graduation).
I enrolled in a programming course in the College of Engineering and Computer Science – Introduction to Programming for Scientists and Engineers. The course focuses on the C++ language. This is unfortunate, as Python is increasingly favored over C++. Python’s growing popularity means more ready-made programs are available, and its user community is growing. Content management systems are even building up around Python.
Python is used by a friend of mine who does genome science. C++ is useful for taking advantage of parallelism, but the fact that my friend, who works on supercomputers, uses Python suggests to me that Python works as well.
Python Displacing R As The Programming Language For Data Science by @mjasay http://readwr.it/c1ew
From Big Data Analytics lecture 2, I was most impressed by the slide concerning decision errors in logic.
I imagine most data scientists are fans of Mr. Spock. No need to be in the Captain’s Chair, but a strong need to contribute meaningful analysis to important decisions.
Any Star Trek fan can quote Mr. Spock’s sage observation, “Logic is the beginning of wisdom, not the end.”
Logic is critical to data science and to the wisdom that can arise from it. However, logical errors can creep in, as pointed out by Dr. Wang’s slide:
Typical Decision Errors: Logic
- Not asking the right questions
- Making incorrect assumptions and failing to test them
- Using analytics to justify instead of learning the facts
- Interpreting data incorrectly
- Failing to understand the alternatives
My Geographic Information Systems – Spatial Databases and Data Management course instructor (Dr. Ralston) has a graphic on his door about “correlation and causation.” His graphic shows a link between decreasing use of Windows Internet Explorer and a correlated decrease in murders.
The refrain is always “correlation does not imply causation.” Logic might be sound, the math might add up, but the pitfalls exist.
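A small sketch makes the pitfall concrete. The numbers below are invented for illustration only (they are not real browser or crime statistics); the point is simply that any two series that both decline over time will produce a high correlation coefficient, whether or not one has anything to do with the other.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up yearly values, NOT real data: both series simply trend downward.
browser_share_pct = [70.0, 62.0, 55.0, 47.0, 40.0]
murders_per_100k = [5.8, 5.6, 5.3, 5.0, 4.8]

r = pearson_r(browser_share_pct, murders_per_100k)
print(f"r = {r:.3f}")  # very close to 1, yet no causal link is implied
```

The math adds up, and the coefficient is nearly perfect, but nothing in the calculation says why the two series move together.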
I often wonder if some of the data science “boot camps” and workshops can effectively impart these key lessons that are central to the process of science.
Big Data Analytics is listed as IS 592 in the catalog. Because it is a higher level class building on material from IS 584 (Database Management Systems), I’m looking forward to exploring data analytics and data science in the formal classroom setting.
I’m grateful to have the opportunity to study this topic at the School of Information Sciences. It is one of a handful of courses in the overall course catalog that are closely aligned with my career goals to work as a professional data manager. My intuition is that individuals who are trained to manage data must have an extensive knowledge of the data life cycle to effectively manage data across the spectrum of its life. Note that “Analyze” is one of the last steps. Even for strictly traditional views of curation, an archivist who is not familiar with the flow of information in the “Analyze” step is not well positioned to receive and curate research output that is increasingly data intensive.
From the first lecture, a quote caught my attention:
We project a need for 1.5 million additional managers and analysts in the United States who can ask the right questions and consume the results of the analysis of Big Data effectively.
This is a huge and growing field. What caught my eye is the growing number of specialty schools to turn out students with the skills to analyze data. Even at UT-Knoxville, there are three “data science” courses – one in Computer Science, one in Statistics, and one in Information Sciences. These are three separate colleges (Engineering, Arts & Sciences, and Communication and Information). Established information studies programs are likely seats for new data science programs and curricula. Berkeley’s School of Information just stood up a new program to support a Masters in Information and Data Science. Other short “boot camps” are offered, although I’m not sure if those programs will produce data “scientists” or just data “analysts” – the key from the quote above is “ask the right questions.” Are our new data science programs able to impart the skill to ask questions using the scientific method?
From the first lecture, here are skills that data analysts need:
- Data mining
- Statistical applications
- Predictive analytics
- Business intelligence
- Data modeling and data visualization
- Metacognitive strategies
Coursework for IS 592 will be collected at the following URL: https://mountainsol.wordpress.com/category/coursework/big-data-analtyics/
37.8 Million towards creating a “data science culture” at universities. I found this to be a particularly compelling reason to invest in a culture shift:
“Universities are losing much of the top data science talent they produce to industry. We need them back at the universities, working on the world’s most important science problems — not trying to make people click on ads.”
College of Business Administration
Department of Statistics, Operations & Management Science
Instructor: Dr. James L. Schmidhammer
Syllabus Online: FA2013-STAT-537
Course Website: http://www.bus.utk.edu/stat/stat537/
|Required Text||An Introduction to Statistical Methods and Data Analysis (Sixth Edition)
by R. Lyman Ott and Michael Longnecker
© 2010, Brooks/Cole, Cengage Learning
|Student Solutions Manual for
by R. Lyman Ott and Michael Longnecker
© 2010, Brooks/Cole, Cengage Learning
|The Little SAS Book for Enterprise Guide 4.2
by Susan J. Slaughter & Lora D. Delwiche
© 2010, SAS Institute Inc.
|Software||SAS Version 9.3 with Enterprise Guide 5.1
Available for download at http://oit.utk.edu/software,
and on DVDs obtained at the UT Computer Store
Available for use on https://apps.utk.edu/
|Grading||Homework (⅓) / Midterm (⅓) / Final (⅓)|
|Data Collection (Population vs. Sample)
Data Display (Graphics)
Data Summarization (Summary Statistics)
|Exploratory Data Analysis||
|Use of Statistics Software
SAS
|Parametric Estimation (Point & Interval)
Single Sample (μ)
Single Sample (σ²)
Two Independent Samples (μ1–μ2)
Two Related (Paired) Samples (μ1–μ2)
|Nonparametric Estimation
Single Sample (Median)
Two Independent Samples
Two Related (Paired) Samples
|Parametric Hypothesis Testing (t-tests)
Single Sample (μ=μ0)
2 Independent Samples (μ1=μ2)
2 Independent Samples (Levene’s Test σ1=σ2)
2 Related (Paired) Samples (μ1=μ2)
|Nonparametric Hypothesis Testing
Wilcoxon Signed Ranks Test
Two Independent Samples
Wilcoxon Rank Sum Test
Two Related (Paired) Samples
Wilcoxon Signed Ranks Test
Rank Transformation Tests
Normal Probability Plots
Tests for Normality
|Robustness to Violations of Assumptions||
|Categorical Data Analysis
2 × 2 contingency tables
r × c contingency tables
|Simple Linear Regression||
|One Way ANOVA
Testing equality of means
Post Hoc procedures
0 No Coverage
S Slight Coverage
I Intermediate Coverage
T Thorough Coverage
Except for two, the Pratt Students again split off from the UT Science Data scholars today for an afternoon visit to the world headquarters of Thomson Reuters. Again, it was a very hot day, and we were lucky to avoid direct sunlight in their most Bond villain-esque office space.
We also joked about the “Ally McBeal” unisex bathroom – which had fully enclosed stalls that might have once been massage parlors, given the mood lighting, and rectangular faucets that “spilled” water rather than “poured” it.
A fun architectural space, but probably not compliant with the Americans with Disabilities Act. For that matter, I’m confident most of London isn’t compliant with the Americans with Disabilities Act.
I burned through the last pages of notes in my notebook at Thomson Reuters. I was writing in the margins. TR might be best known for the Reuters news service. They also deal in financial risk, legal, tax and accounting, and intellectual property and science.
One of their big products is the “Impact Factor,” which traces back to founder Eugene Garfield’s 1955 paper in Science, “Citation Indexes for Science.” This is highly pertinent to my research into networks because they use a “researcher ID.” This can avoid a problem I’ve encountered in cataloging, where a single author is published under several variants of the same name, e.g., “J.T. Scienceguy” versus “Jeffry T. Scienceguy” or “Jeffry Tomas Scienceguy.” A database sees those as different entities, even though they are indeed the same person.
With a researcher ID, your database does not get complicated, and you can do a lot of data science.
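A rough sketch of why the identifier helps, using invented records: the names come from the example above, but the IDs (`R-0001`, `R-0002`), the second author, and the paper titles are hypothetical, and this is not meant to reflect how Thomson Reuters actually stores its data.

```python
from collections import defaultdict

# Hypothetical records: (author name as printed, researcher ID, paper title)
records = [
    ("J.T. Scienceguy", "R-0001", "Paper A"),
    ("Jeffry T. Scienceguy", "R-0001", "Paper B"),
    ("Jeffry Tomas Scienceguy", "R-0001", "Paper C"),
    ("A. Other", "R-0002", "Paper D"),
]

papers_by_name = defaultdict(list)  # keyed on the printed name string
papers_by_id = defaultdict(list)    # keyed on the researcher ID

for name, researcher_id, title in records:
    papers_by_name[name].append(title)
    papers_by_id[researcher_id].append(title)

# Grouping on the name string splits one person into three "authors".
print(len(papers_by_name), "distinct names")

# Grouping on the ID keeps that person's papers together.
print(len(papers_by_id), "distinct researchers")
print(papers_by_id["R-0001"])
```

Without the ID, merging the three name variants would take fuzzy matching and a fair amount of manual checking.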
Another area that I’m interested in and need to follow up on with TR is the Map of Science – particularly the EU collaboration. They use something called “ScholarOne.”
There is some research analytics, and also they “peek” into repositories – I need some follow up information on how they prioritize repositories based on “who manages, how’s it updated, how frequently, and what’s the quality.” They have a white paper on that but I have yet to find it. It could be useful for my research with DataONE and developing/prioritizing member nodes.
For the Map of Science, I need to follow up with Patricia Brannen.
Finally they have some research out of Philadelphia, U.S.A. regarding networks, influence of research over time for individuals. They are not yet that honed in on how to do it for an ad-hoc group of researchers. Disappointing, as that is what I’m hoping to do with research into collaborations resulting from DataONE interactions.
On Wednesday, July 5 we had a visit and talk from Graham Parton, a self-professed “data scientist.” He gave a talk on “Environmental Data Archival – Practices and Benefits.” He’s associated with the British Atmospheric Data Centre, the Centre for Environmental Data Archival, and the National Centre for Earth Observation.
While Pratt students had not been exposed to this, UTK Science Data Students felt familiar with much of the information presented due to taking Environmental Informatics in the Spring 2013 semester. Also, UT is heavily involved with DataONE, essentially a collaboration of U.S. environmental data centers.
A second development of interest to me was my attempt to visit the University College London Centre for Advanced Spatial Analysis, http://www.casa.ucl.ac.uk. I did wind up there, but spent more time in the tube getting there than actually doing anything of value.
I’d hoped to speak with the director, Dr. Andrew Hudson-Smith. [Major aside: Unfortunately, he was busy setting up a Web cam for Jeremy Bentham, a rich old dude who gave a lot of money to UCL and made some weird requests, including to be mummified, put on display, and apparently given a window on the world. Well, now he’s got it – via a webcam and computer display. I think the campus preoccupation with Jeremy Bentham is a little creepy – he’s a mummy after all. Here in the U.S. we have taxidermy mascots – but no human beings to my knowledge].
However I did get a card from Dr. Hudson-Smith, and may follow up with him. I spoke for literally a few seconds about the “if you build it they will come” scenario that Oak Ridge National Lab is facing with its visualization lab. What exactly is being visualized? Will the science follow on the availability of the equipment?
At UCL, apparently they are doing a lot of spatial modeling, including for community planning. There’s a blog on the business card I got, “digitalurban.org.” Inasmuch as I’m interested in using Geographic Information Science (just “GI” in Europe) to enhance transportation planning, including things like “crowdsourcing” the best biking routes, this was fascinating. They also do some agent-based modelling. I’m sure LiDAR data would be useful. As has been often discussed, I’m not sure why we need to visualize data on giant screens, though. In New Mexico we joked about the University of Arizona building a visualization lab, but not budgeting in staff to work on it or maintain it. Everyone wants that “Hollywood” moment where data is visualized to tell a story on-the-fly, but finding data, making sure it’s in compatible formats, and then throwing it up into a visualization takes time – and most of all, skilled information professionals – in GIS, data visualization, information sciences.
UCL spatial analysis lab has been doing work for 20 years – what keyed me into it was a Facebook event celebrating that milestone that my group had just missed – a showcase of visualizations where “real and virtual worlds collide” <http://www.bartlett.ucl.ac.uk/graduate/events/veiv-projection-ucl>. In fact there’s a new master’s and PhD program starting up in “UCL Engineering Doctorate Centre in Virtual Environments, Imaging and Visualisation (EngD VEIV)” that supports evidence-based decision making. This is one of my pet projects at <http://knox4greenways.blogspot.com> where I talk about transportation issues and community planning.
Today the class enjoyed a series of presentations at the first annual “Strand Symposium on Digital Scholarship and ePublishing” at King’s College London.
I’m providing the schedule here, along with an uploaded version 2013-Strand-Epub-symposium:
09:30-09:35 INTRODUCTION by Anthony Watkinson (CIBER Research), organiser and chair
09:35-10:15 Introductory Presentation by Professor Carol Tenopir (UTK) – How Scholars decide to Trust Resources
10:15-11:30 FIRST SESSION Building and evaluating cultural resources
1. Professor Tula Giannini (Pratt-SILS): How Brooklyn’s Libraries and Museums Collaborate to Create a new Digital Cultural Heritage Resource: The Brooklyn Visual Heritage Website
2. Professor David Nicholas (CIBER Research): Evaluating the Usage of Europeana
11:15-12:00 Refreshment Break
12:00-13:15 SECOND SESSION Are we publishing and, if so, for whom?
1. Dr. Stuart Dunn (KCL) The distinction between exposing data and publishing: a case study from archaeology
2. Dr. Susan Whitfield (BL) The challenge of creating a resource and interface that is accessible across linguistic, disciplinary and cultural boundaries to the everyman of the Internet
13:15-14:15 Opportunity for lunch: lunch is not provided but there are many appropriate places to eat in the Waterloo surroundings.
14:15-15:30 SESSION THREE Managing online resources
1. Dr. Richard Gartner (KCL) Digital Asset Management- the pleasures and pitfalls of metadata
2. Matt Kibble (Bloomsbury) The product management role in planning and building digital resources
15:30-16:00 Refreshment break
16:00-17:15 SESSION FOUR Investment and sustainability
1. Dr. Paola Marchionni (Jisc) The end is the beginning: the challenges of digital resources post digitisation
2. Chris Cotton (Proquest) The benefits of public private partnerships in large-scale cultural and heritage digitisation
17:15-17:30 CONCLUDING REMARKS Anthony Watkinson
I enjoyed taking in a lecture on “Trust and Authority of Scholarly Resources” from Dr. Tenopir – the first time this data has been presented. It is useful for understanding not only the information behavior of scientists, but the general public as well. The question “who do you trust” will always loom large in public discourse.
Our course leader Anthony Watkinson, himself a former publisher and University College London lecturer, did a superb job putting together speakers. In spite of the full slate of speakers and 19 graduate students, admittedly there might have been more in attendance. Anthony felt he had promoted the event a bit late. In all it was not bad for a first year on a weekday with no free lunch!
Anthony did a big favor for the science data students in recruiting Dr. Fiona Murphy from Wiley Publishing to speak with us for an informal lunchtime talk on science data. We also had one of the Pratt MLIS students come with us as she is interested in data science.
In preparation for the lunchtime meeting, Anthony sent a few key resources. What I found most interesting was that Dr. Murphy had delivered a talk at the “Open Access Infrastructure for Research in Europe” OpenAIRE/LIBER workshop on “Dealing with Data. What’s the Role for the library?”
Dr. Murphy’s talk was entitled ‘Data Publication: a Publisher’s perspective’ and there are two ways of accessing it – see the presentation online <http://www.openaire.eu/en/about-openaire/publications-presentations/doc_download/555-4wileyfionamurphy> or watch a video online <https://vimeo.com/68051358>.
From the description:
Fiona from Wiley spoke about what publishing data is all about: why it is important in terms of being cited and credited. The growing pressure of funder mandates also plays a role.
Some things I found particularly interesting:
Charting the growth of open access – the number of papers published between 2000 and 2012 was under 5,000 for most publications, with the exception of BMC, which broke 5,000 in 2005, while other publications lingered below that mark well past the first half of the first decade of the new millennium.
Dr. Murphy also commented on what exactly a data article is:
A data article describes a dataset, giving details of its collection, processing, software, file formats etc, without the requirement of novel analyses or ground breaking conclusions. It allows the reader to understand the when, how and why data was collected and what the data-product is.
Example: Geoscience Data Journal
With an example data paper:
Some other links:
“Peer REview for Publication & Accreditation of Research Data in the Earth sciences (PREPARDE)” – http://www2.le.ac.uk/projects/preparde
“The Research Data Alliance aims to accelerate and facilitate research data sharing and exchange”
It is probably worth subscribing to this mailing list <https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=DATA-PUBLICATION>
And a quote worth sharing:
Publishing an article without at the same time making the data/evidence available is scientific malpractice
Dr. Murphy is on twitter:
And is a member of the “International Organization of Scientific, Technical & Medical Publishers” research data group <http://www.stm-assoc.org/research-data-group/> focused on “exchanging information on new initiatives about the integration of research publications and research data and 2) to discuss evolving best practices in this new area.”