Course Syllabus
Management, Access, and Use of Big and Complex Data
I535, I435, B669
Syllabus and Course Roadmap
19 Aug 2016
Instructor:
Professor Beth Plale
plale@indiana.edu
https://www.linkedin.com/in/bethplale
812 855 4373
@bplale
Course Summary
| Unit | Activity # | Title | Activity type | Assign Date | Discussion Date | Due Date |
|---|---|---|---|---|---|---|
| Unit 0 | | Introduction | | | | |
| | 1 | Introductory video | Lecture | 25-Aug-16 | | 26-Aug-16 |
| | 2 | Syllabus overview | In person/on-line (26 Aug) | | 26-Aug-16 (f2f*) 29-Aug-16 (online*) | |
| Unit 1 | | Big Data Intro | | 30-Aug-16 | | 5-Sep-16 |
| | 1 | Big Data in Business | Lecture | | | |
| | 2 | Big Data in Business | Reflection | | | |
| | 3 | Big Data in Scientific Research | Lecture | | | |
| | 4 | Big Data in Scientific Research | Reflection | | | |
| | 5 | I535, B669: Jetstream tutorial | In person/on-line | | 2-Sep-16 (f2f) 5-Sep-16 (online) | |
| | 5 | I435: Citrix tutorial | In person/on-line | | 2-Sep-16 (f2f) 5-Sep-16 (online) | |
| Unit 2 | | Data Pipelines | | 6-Sep-16 | | 12-Sep-16 |
| | 1 | Data processing pipelines in science | Lecture | | | |
| | 2 | Data processing pipelines in science | Reflection | | | |
| | 3 | Data processing pipelines in business | Lecture | | | |
| | 4 | Data processing pipelines in business | Reflection | | | |
| | 5 | Introduction to MongoDB, I | In person/on-line | | 9-Sep-16 (f2f) 12-Sep-16 (online) | |
| Unit 3 | | Software Systems Design | | 13-Sep-16 | | 19-Sep-16 |
| | 1 | Software systems design overview | Lecture | | | |
| | 2 | Software systems design overview | Reflection | | | |
| | 3 | Introduction to MongoDB, II | In person/on-line | | 16-Sep-16 (f2f) 19-Sep-16 (online) | |
| Unit 4 | | Complexity in software systems | | 20-Sep-16 | 23-Sep-16 (f2f) 26-Sep-16 (online) | 26-Sep-16 |
| | 1 | Complexity in software systems | Lecture | | | |
| | 2 | Complexity in software systems | Reflection | | | |
| Unit 5 | | NoSQL data stores | | 27-Sep-16 | 30-Sep-16 (f2f) 3-Oct-16 (online) | 3-Oct-16 |
| | 1 | NoSQL data stores | Lecture | | | |
| | 2 | NoSQL data stores | Reflection | | | |
| Unit 6 | | Comparison of data models through example | | 4-Oct-16 | | 10-Oct-16 |
| | 1 | Comparison of data models through example | Lecture | | | |
| | 2 | Comparison of data models through example | Reflection | | 14-Oct-16 (f2f) | |
| | 3 | Project Part A: Twitter dataset analysis | In person/on-line | | 14-Oct-16 (f2f) 10-Oct-16 (online) | 31-Oct-16 |
| Unit 7 | | Distributed File Systems Intro | | 11-Oct-16 | 14-Oct-16 (f2f) 17-Oct-16 (online) | 17-Oct-16 |
| | 1 | Distributed File Systems Intro | Lecture | | | |
| | 2 | Distributed File Systems Intro | Reflection | | | |
| Unit 8 | | Role of caching in distributed computing | | 18-Oct-16 | 21-Oct-16 (f2f) 24-Oct-16 (online) | 24-Oct-16 |
| | 1 | Role of caching in distributed computing | Lecture | | | |
| | 2 | Role of caching in distributed computing | Reflection | | | |
| Unit 9 | | Role of fault tolerance in distributed computing | | 25-Oct-16 | 28-Oct-16 (f2f) 31-Oct-16 (online) | 31-Oct-16 |
| | 1 | Role of fault tolerance in distributed computing | Lecture | | | |
| | 2 | Role of fault tolerance in distributed computing | Reflection | | | |
| Unit 10 | | Consistency in distributed noSQL data stores | | 1-Nov-16 | 4-Nov-16 (f2f) 7-Nov-16 (online) | 7-Nov-16 |
| | 1 | Consistency in distributed noSQL data stores | Lecture | | | |
| | 2 | Consistency in distributed noSQL data stores | Reflection | | | |
| Unit 11 | | Routing in noSQL data stores | | 8-Nov-16 | 11-Nov-16 (f2f) 14-Nov-16 (online) | 14-Nov-16 |
| | 1 | Routing in noSQL data stores | Lecture | | | |
| | 2 | Routing in noSQL data stores | Reflection | | | |
| Unit 12 (I535, I435) | | Data Cleaning and Coding (I535, I435) | | 15-Nov-16 | | 21-Nov-16 |
| | 1 | Data Cleaning | Lecture | | | |
| | 2 | Data Cleaning | Reflection | | | |
| | 3 | Data Coding: Project | In person/on-line | | 18-Nov-16 (f2f) 21-Nov-16 (online) | 9-Dec-16 |
| Unit 12 (B669) | | Data Cleaning and Evaluation (B669) | | 15-Nov-16 | | 21-Nov-16 |
| | 1 | Data Cleaning | Lecture | | | |
| | 2 | Data Cleaning | Reflection | | | |
| | 3 | Project Part B: Twitter database evaluation | In person/on-line | | 18-Nov-16 (f2f) 21-Nov-16 (online) | 9-Dec-16 |
| Unit 13 | | Data Provenance | | 29-Nov-16 | 2-Dec-16 (f2f) 5-Dec-16 (online) | 5-Dec-16 |
| | 1 | Data Provenance | Lecture | | | |
| | 2 | Data Provenance | Reflection | | | |
| Unit 14 | | Overcoming Social Barriers to Data Sharing | | 6-Dec-16 | 9-Dec-16 (f2f) 12-Dec-16 (online) | 13-Dec-16 |
| | 1 | Overcoming Social Barriers to Data Sharing | Lecture | | | |
| | 2 | Overcoming Social Barriers to Data Sharing | Reflection | | | |
| Unit 15 | | Concluding Discussion | | 14-Dec-16 | 16-Dec-16 (f2f) 12-Dec-16 (online) | 16-Dec-16 |
* In the Discussion Date column, f2f refers to the sections that meet face-to-face on Fridays, i.e. 34950, 32502 and 32869. online refers to the section that meets online via Zoom, i.e. 32871.
Course Roadmap
Unit 0: Introduction and Background Competency
This course is a “flipped course”: students view recorded lectures, then reinforce their learning through discussions, reflections, and hands-on activity. The course is broken into 14 content units, plus an introduction and a concluding discussion, corresponding roughly to the length of a semester. Students should expect to put 6-7 hours per week into the course, including time spent on readings, reflections, and instructional content. Read the course policies document and watch the introductory video about the course. Students with weak backgrounds in databases and file systems are advised to work through the following well-made tutorial videos on MySQL and the Linux file system.
- MySQL Database Tutorial - 1 - Introduction to Databases, Bucky Roberts, Jan, 2012.
- MySQL Database Tutorial - 2 - Getting a MySQL Server
- MySQL Database Tutorial - 3 - Creating a Database
- MySQL Database Tutorial - 4 - SHOW and SELECT
- MySQL Database Tutorial - 5 - Basic Rules for SQL Statements
- MySQL Database Tutorial - 6 - Getting Multiple Columns
- Learning the Linux File System, Joe Collins, Jul 2015
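As a warm-up covering the same ground as the MySQL tutorials (creating a database, listing tables, basic SELECT), here is a minimal sketch that needs no MySQL server, using Python's built-in sqlite3 module as a stand-in; the table and column names are invented for illustration.

```python
import sqlite3

# Throwaway in-memory database, roughly analogous to CREATE DATABASE (tutorial 3)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, section TEXT)")
cur.executemany("INSERT INTO students (name, section) VALUES (?, ?)",
                [("Ana", "f2f"), ("Raj", "online"), ("Mei", "f2f")])

# SHOW TABLES equivalent (tutorial 4); SQLite exposes its catalog as sqlite_master
tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# Basic SELECT with a WHERE clause and ordering (tutorials 4-6)
f2f = [r[0] for r in cur.execute(
    "SELECT name FROM students WHERE section = 'f2f' ORDER BY name")]
conn.close()
```

The SQL statements themselves transfer to MySQL almost unchanged; only the catalog query (`sqlite_master` vs. MySQL's `SHOW TABLES`) differs.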
Unit 1: Big Data Intro
Activity 1-2: Big Data in Business
Key topic: business perspective on big data
Gives the student a perspective on how society thought and talked about big data as it first entered our lexicon. Activities 1 and 2 examine the topic from the perspective of business. For this activity, watch the video, read the two readings, and carry out the reflection.
Readings:
1. A special report on managing information: Data, data everywhere, The Economist, February 25, 2010
2. Data Scientist, The Sexiest Job of the 21st Century, Thomas H. Davenport and D.J. Patil, Harvard Business Review, pp. 70-76, Oct 2012
Activity 3-4: Big Data in Scientific Research
Key topic: science perspective on big data
Gives the student a science perspective on big data. The student will read a dozen or so short articles from Science magazine (Feb 2011), written by practitioners in fields spanning the social, medical, and natural sciences, each discussing their unique data issues. Taken together, the collection highlights how differently one discipline sees its data challenges from another.
For this activity, watch the video, do the readings, and carry out the reflection.
Readings:
“Dealing with Data”, Special Online Collection, Science, 11 February 2011. See Canvas under Unit 1 for articles.
Unit 2: Data Processing Pipelines
Learning Objectives: The student gains an understanding of the concept of a data pipeline, its connection to the lifecycle of data, and its use in science and business. The student acquires a basic understanding of the complexity of a software algorithm and its relationship to the number of steps in the algorithm. The student applies the data pipeline concept to an example from their own experience, and has the opportunity to research a popular form of data pipeline in business today, the Amazon Web Services (AWS) Data Pipeline.
Activity 1-2: Data Processing Pipelines in Science
Key topics: What is a data pipeline? How does a data pipeline connect to big data? How is a data pipeline used in science? Why does software complexity hurt when data gets big?
Data rarely show up instantly ready for whatever exploratory purpose you may have in mind. Between creation and use, data undergo numerous steps, some of which are end products in and of themselves. Watch the activity 1 video, do the reading, and carry out the reflection.
Readings:
- Jim Gray on eScience: A Transformed Scientific Method, in The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey, Stewart Tansley, and Kristin Tolle, eds., Microsoft Research, 2009, pp. xix-xxxiii.
- Jim Gray’s Fourth Paradigm and the Construction of the Scientific Record, Clifford Lynch, in The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey, Stewart Tansley, and Kristin Tolle, eds., Microsoft Research, 2009, pp. 177-183.
- Understanding execution time complexity: the Selection Sort versus the Heap Sort.
- UK Data Archive: The Research Data Lifecycle
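The complexity reading above contrasts selection sort, which takes O(n^2) comparisons, with heap sort at O(n log n). A minimal sketch of both, as a concrete instance of why the number of steps in an algorithm matters once data gets big:

```python
import heapq

def selection_sort(xs):
    """O(n^2): n passes, each scanning the unsorted tail for the minimum."""
    xs = list(xs)
    for i in range(len(xs)):
        j = min(range(i, len(xs)), key=xs.__getitem__)  # index of smallest remaining
        xs[i], xs[j] = xs[j], xs[i]
    return xs

def heap_sort(xs):
    """O(n log n): heapify once, then pop the minimum n times."""
    heap = list(xs)
    heapq.heapify(heap)                       # O(n) to build the heap
    return [heapq.heappop(heap) for _ in xs]  # each pop costs O(log n)
```

Doubling n roughly quadruples the work for selection sort but only slightly more than doubles it for heap sort, which is the whole argument for choosing algorithms carefully in a data pipeline.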
Activity 3-4: Data Processing Pipelines in Business
Key topics: How does business view a data pipeline? The role of cloud computing in the business view of data pipelines. Exploration with a current data pipeline tool.
Activity 3 introduces the business perspective on data pipelines. It draws inspiration from "Data Without Limits", a 2011 talk by Werner Vogels, CTO of Amazon, in which he discusses data pipelines in the context of business computing. He argues that cloud computing is core to a business model "without limits". The pipeline he proposes is: collect | store | organize | analyze | share.
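The collect | store | organize | analyze | share pipeline can be rendered as composed functions. This is a toy sketch of the shape of such a pipeline; the stage bodies and the user-event data are invented placeholders, not AWS APIs.

```python
def collect(source):
    """Gather raw events from some source."""
    return list(source)

def store(events):
    """Persist the raw data; here, just keep it in memory."""
    return {"raw": events}

def organize(db):
    """Index the stored data by a key of interest (here, the user)."""
    db["by_user"] = {}
    for e in db["raw"]:
        db["by_user"].setdefault(e["user"], []).append(e)
    return db

def analyze(db):
    """Derive a summary: number of events per user."""
    return {u: len(es) for u, es in db["by_user"].items()}

def share(report):
    """Publish the result; here, return a sorted view."""
    return sorted(report.items())

def pipeline(source):
    # The five stages of Vogels's pipeline, composed left to right
    return share(analyze(organize(store(collect(source)))))
```

The value of the framing is that each stage has a single responsibility and a well-defined handoff, so stages can be swapped out (e.g. the in-memory store for a real data store) without touching the rest.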
Readings: No readings for this lesson.
Resources:
1. Vogels talks about MapReduce extensively during his discussion of analysis. If you're not familiar with MapReduce, a decent primer on MapReduce (and Hadoop, its open-source implementation) can be found here: http://readwrite.com/2013/05/23/hadoop-what-it-is-and-how-it-works
2. AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you've set up. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
3. Awslabs/data-pipeline-samples, https://github.com/awslabs/data-pipeline-samples
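The MapReduce model the primer describes can be sketched in miniature: a map phase emits (key, value) pairs, a shuffle groups pairs by key, and a reduce phase aggregates each group. Word count is the canonical example; in real Hadoop the three phases run distributed across machines, which this single-process sketch obviously does not show.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(docs):
    """Emit a (word, 1) pair for every word in every document."""
    return [(word, 1) for doc in docs for word in doc.split()]

def shuffle(pairs):
    """Group the emitted pairs by key, as the framework's shuffle step does."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {k: [v for _, v in grp]
            for k, grp in groupby(pairs, key=itemgetter(0))}

def reduce_phase(grouped):
    """Aggregate each key's values; for word count, just sum them."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(docs):
    return reduce_phase(shuffle(map_phase(docs)))
```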
Unit 3: Software Systems Design
Keywords: distributed systems, emergent behavior, tradeoffs in software system design
Unit 3 is an introduction to the general concepts of software systems. These concepts are used during design of the large software systems needed to handle any of today’s large applications (social media, cloud services, shopping carts, video rentals …).
Readings: Principles of Computer System Design: An Introduction, Jerome H. Saltzer and M. Frans Kaashoek, Morgan Kaufmann, 2009. Read Chapter 1, Sections 1.1 and 1.2 only.
Unit 4: Complexity in Software Systems
Keywords: complexity, layering, abstraction, modularity, hierarchy
A key problem in large-scale applications is complexity. We examine the sources of that complexity, and design principles that are used to overcome the complexity.
Readings: Principles of Computer System Design: An Introduction, Jerome H. Saltzer and M. Frans Kaashoek, Morgan Kaufmann, 2009. Read Chapter 1, Sections 1.3 - 1.5 only.
Unit 5: NoSQL data stores
Keywords: document, key-value, graph, and column stores
This unit introduces large-scale data stores. It gives an overview of noSQL data stores using a talk by Martin Fowler at the GoTo Aarhus 2012 conference. Readings additionally cover the concept of data services. As the reading states, “while data services were initially conceived to solve problems in the enterprise world, the cloud is now making data services accessible to a much broader range of consumers.”
Readings: Choosing the right NoSQL database for the job: a quality attribute evaluation, João Ricardo Lourenço, Bruno Cabral, Paulo Carreiro, Marco Vieira, and Jorge Bernardino, Journal of Big Data 2:18, Springer, 2015.
You may find this resource helpful. I suggest skipping the section “Research design and methodology”.
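One attribute the noSQL families differ on is querying richness: a pure key-value store can only fetch by key, while a document store can also match on fields inside the stored record. A toy sketch of that difference, with both stores modeled as in-memory dicts (the class names and records are invented, not any real store's API):

```python
class KeyValueStore:
    """Opaque values addressed only by key: fast, but get-by-key is all you have."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

class DocumentStore(KeyValueStore):
    """Values are self-describing documents, so the store can match on fields."""
    def find(self, **criteria):
        return [doc for doc in self._data.values()
                if all(doc.get(k) == v for k, v in criteria.items())]
```

A real document store backs `find` with secondary indexes rather than a full scan, but the interface contrast, key lookup versus field-based query, is the point.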
Unit 6: Comparison of Data Models By Example
Keywords: schema, data model
Relational databases provide structured and normalized tables for rapid and precise querying. noSQL stores support less-structured data but offer less rich querying. This unit walks you through a comparison of data models across different storage systems (relational, document store, key-value pair, and column store) using real ecological data from a science research project at Indiana University.
Resources:
Dataset: The data used in this unit is available in the forms as “dumped” from the different datastores. Compliments of Dr. Peng Chen.
10 Rules for Scalable Performance in ‘Simple Operation’ Datastores, Michael Stonebraker and Rick Cattell, Communications of the ACM, 54(6), pp. 72-80. http://cacm.acm.org/magazines/2011/6/108651-10-rules-for-scalable-performance-in-simple-operation-datastores/fulltext
Optional:
Scalable SQL and NoSQL Data Stores, Rick Cattell, ACM SIGMOD Record 39(4) Dec 2010, revised 2011 version available here http://www.cattell.net/datastores/Datastores.pdf
Choosing the right NoSQL database for the job: a quality attribute evaluation, J.R. Lourenco, B. Cabral, P. Carreiro, M. Vieira, J. Bernardino, Journal of Big Data, Springer Open Journal 2(18) 2015, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-015-0025-0
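The kind of comparison this unit walks through can be previewed with one record rendered under each data model. The ecological observation below is invented for illustration, not drawn from the unit's dataset; the content is identical in all four forms, only the shape changes.

```python
# Relational: a flat row whose meaning comes from a fixed schema
# (columns: obs_id, date, species, dbh_cm).
relational_row = ("obs-17", "2016-08-19", "oak", 41.2)

# Document: a self-describing, possibly nested record.
document = {"_id": "obs-17",
            "date": "2016-08-19",
            "tree": {"species": "oak", "dbh_cm": 41.2}}

# Key-value: the value is an opaque blob, addressed only by its key.
key_value = {"obs-17": '{"date":"2016-08-19","species":"oak","dbh_cm":41.2}'}

# Column family: values grouped by column, convenient for scans over
# a single attribute across many observations.
columns = {"date":    {"obs-17": "2016-08-19"},
           "species": {"obs-17": "oak"},
           "dbh_cm":  {"obs-17": 41.2}}
```

Which shape is "best" depends on the query: the relational row suits precise joins and filters, the document suits retrieving one whole observation, and the column layout suits aggregating one attribute over millions of records.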
Unit 7: Distributed File Systems Intro
Keywords: transparencies, session semantics, fault tolerance, naming
A major focus of this course is for the student to understand the distributed systems concepts that underlie today’s noSQL stores. The next step leading up to a study of the noSQL stores themselves is distributed file systems, where key concepts like transparencies, session semantics, fault tolerance, and naming all take a form that can be easily understood in terms of files and directories, which we all work with.
Reading:
Distributed File Systems: Concepts and Examples, E. Levy, A. Silberschatz, ACM Computing Surveys, Vol 22(4), Dec 1990, pp. 321-374. Sections 1-3 only.
Unit 8: Role of Caching in Distributed Applications
Keywords: caching, locality of reference, cache replacement strategy, cache coherency
Caching of data is key to the efficient performance of a large distributed application. We turn back to the Levy and Silberschatz paper to see what the authors have to say about caching.
Reading:
Distributed File Systems: Concepts and Examples, E. Levy, A. Silberschatz, ACM Computing Surveys, Vol 22(4), Dec 1990, pp. 321-374. Section 4 (skip 4.2 and 4.3)
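A cache replacement strategy decides which entry to evict when the cache is full. The most common choice, and the one that exploits locality of reference, is LRU (least recently used): recently touched items stay, the item untouched longest goes. A minimal sketch using an ordered dict to track recency:

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache evicting the least-recently-used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()   # insertion order doubles as recency order

    def get(self, key):
        if key not in self._items:
            return None               # cache miss
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used
```

Locality of reference is what makes this pay off: if a client just read a block, it is likely to read it (or its neighbors) again soon, so keeping recent blocks cached avoids repeated trips to the remote file server.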
Unit 9: Role of Fault Tolerance in Distributed Applications
Keywords: stateful and stateless servers, idempotence, transactions
When distributed systems span multiple locations or computers, the incidence of failure increases substantially. In this lesson we’ll turn back to the Levy and Silberschatz paper to learn about fault tolerance.
Readings:
Distributed File Systems: Concepts and Examples, E. Levy, A. Silberschatz, ACM Computing Surveys, Vol 22(4), Dec 1990, pp. 321-374, Section 5.1
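One keyword above, idempotence, is worth a concrete sketch: when an acknowledgment is lost, the client must retry, so an operation is only safe to retry if applying it twice gives the same result as applying it once. The account example below is invented to show the contrast between an idempotent update ("set the balance") and a non-idempotent one ("add to the balance").

```python
class Account:
    def __init__(self):
        self.balance = 100

    def set_balance(self, value):
        """Idempotent: repeating the request leaves the same state."""
        self.balance = value

    def add(self, amount):
        """Not idempotent: each repeat compounds the effect."""
        self.balance += amount

def retry(op, times=2):
    # Simulate a client that never saw the first ACK and resends the request.
    for _ in range(times):
        op()
```

Stateless servers lean on this: if every request is self-contained and idempotent, a crashed-and-restarted server can simply process the retry with no memory of whether it saw the original.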
Unit 10: Consistency in Distributed NoSQL Stores
Keywords: eventual consistency, CAP theorem, quorum voting
When storage in a data store is distributed across multiple storage devices, consistency of reads and writes becomes a paramount issue.
Reading:
Eventually Consistent, W. Vogels, Communications of the ACM, 52, 1, Jan 2009
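Vogels's paper discusses the quorum rule behind many replicated stores: with N replicas, a write quorum W and a read quorum R satisfying R + W > N force every read set to overlap the most recent write set, so a versioned read can return the newest value. A toy sketch (the class and its version clock are invented for illustration):

```python
class QuorumStore:
    """N in-memory replicas with write quorum W and read quorum R, R + W > N."""

    def __init__(self, n, w, r):
        assert r + w > n, "quorum condition R + W > N violated"
        self.replicas = [dict() for _ in range(n)]
        self.w, self.r = w, r
        self.clock = 0   # monotonically increasing version number

    def write(self, key, value):
        self.clock += 1
        for rep in self.replicas[: self.w]:   # any W replicas acknowledge
            rep[key] = (self.clock, value)

    def read(self, key):
        # Read from R replicas; taking the *last* R is the worst case for
        # overlap with the first-W write set, yet R + W > N guarantees
        # at least one replica in common.
        answers = [rep[key] for rep in self.replicas[-self.r:] if key in rep]
        return max(answers)[1]                # highest version wins
```

With N=3, W=2, R=2 the read set and write set always share a replica; drop to R=1, W=1 and the assertion fires, which is exactly the configuration that can return stale data.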
Unit 11: Routing in NoSQL Data Stores
Keywords: routing, distributed hash tables, Chord, peer-to-peer, local versus global knowledge
When data are stored across multiple computers in a single noSQL data store, and the store can be accessed through any of the multiple servers that support it, how does the data store ensure that a request for a data object is routed to the location where the data are stored? This is the routing problem in noSQL data stores. This lesson discusses ways of keeping track of the information needed to route requests to the right server.
Readings: No readings for this lesson.
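One common answer to the routing problem, and the idea underlying Chord-style distributed hash tables, is consistent hashing: servers and keys hash onto the same ring, and a key is routed to the first server clockwise from its position, so a node needs only local knowledge of the ring rather than a global directory. A minimal single-process sketch (server names are invented; real systems add virtual nodes for balance):

```python
import bisect
import hashlib

def _hash(name):
    """Map any string to a point on the ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        # Sorted (hash, server) pairs form the ring
        self._ring = sorted((_hash(s), s) for s in servers)

    def route(self, key):
        points = [h for h, _ in self._ring]
        # First server at or after the key's position, wrapping around
        i = bisect.bisect(points, _hash(key)) % len(self._ring)
        return self._ring[i][1]
```

The payoff over naive `hash(key) % n` routing is that adding or removing one server remaps only the keys in that server's arc of the ring, not nearly every key in the store.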
Unit 12: Data Cleaning and Coding (I535, I435)
Keywords: data cleaning, missing data, quantitative coding, tagging data, categorizing data, coding data, feature extraction
The student learns data cleaning through real use cases from environmental science and social science, and gains basic knowledge about coding data: the purpose of coding and the methodology for coding. Coding is, at its most basic, the tagging or categorization of data on important features so that themes emerge. The student will get a chance to try their hand at coding a dataset.
Readings: No readings for this lesson.
Coding Project. The student selects their own project or uses this one, which draws on 278 media mentions of the Pervasive Technology Institute over the year 2013-2014. The student's categorization will be illustrated by visualizing the results as a simple pie chart.
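Coding at its most basic is assigning each item a category tag, then tallying the tags so themes emerge. The sketch below shows the mechanics with keyword rules; the mentions, keywords, and category names are invented placeholders, not the actual Pervasive Technology Institute dataset.

```python
from collections import Counter

def code(mention, rules):
    """Assign a category: first matching keyword rule wins, else a catch-all."""
    for keyword, category in rules:
        if keyword in mention.lower():
            return category
    return "other"

# Hypothetical coding scheme: (keyword, category) pairs, checked in order
rules = [("supercomputer", "HPC"),
         ("data", "data science"),
         ("cloud", "cloud")]

mentions = ["IU supercomputer ranked", "Big data center opens",
            "New cloud service", "Campus news roundup"]

# The tallied categories are exactly what feeds a pie chart
tally = Counter(code(m, rules) for m in mentions)
```

Real qualitative coding is usually done by a human reader rather than keyword matching, but the output has the same shape: a category per item, and counts per category to visualize.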
Unit 12: Data Cleaning and Evaluation (B669)
Keywords: data cleaning, missing data, quantitative coding, tagging data, categorizing data, coding data, feature extraction
The student learns data cleaning through real use cases from environmental science and social science, and gains basic knowledge about coding data: the purpose of coding and the methodology for coding. Coding is, at its most basic, the tagging or categorization of data on important features so that themes emerge. The student will get a chance to try their hand at coding a dataset.
Readings: No readings for this lesson.
Evaluation Project. Student carries out Track B of the Twitter project.
Unit 13: Data Provenance
Keywords: Data provenance, causality graph, Open Provenance Model
As data sharing increasingly moves from a friendly exchange between two parties who know each other to a transaction on an open data sharing market, the need grows for data to carry sufficient information for the recipient to establish whether or not they trust and can use the data. Data provenance lies at the heart of the descriptive data needed to discern data trustworthiness. This lesson introduces data provenance and gives you a sense of what provenance data is and how it is represented.
Reading:
A Survey of Data Provenance in e-Science, Yogesh Simmhan, Beth Plale, Dennis Gannon, Association for Computing Machinery SIGMOD Record, vol 34, no. 3, pp. 31-36, Sep 2005
Unit 14: Overcoming Social and Technical Barriers to Data Sharing
Keywords: data trustworthiness, data sharing, economies of data sharing
The student gains appreciation for the social and economic barriers to sharing data.
Readings:
1. Who Will Pay for Public Access to Research Data? Fran Berman and Vint Cerf, Science, 9 August 2013, Vol. 341, no. 6146, pp. 616-617, DOI: 10.1126/science.1241625 (www.sciencemag.org). Preprint version available at: http://www.cs.rpi.edu/~bermaf/Berman%20+%20Cerf%20Public%20Access%20--%20Author%20Version.pdf
2. If Data Sharing is the Answer, What is the Question? Christine L. Borgman, http://ercim-news.ercim.eu/en100/special/if-data-sharing-is-the-answer-what-is-the-question
3. Data sharing: empty archives, Bryn Nelson, Nature 461, published online 9 Sep 2009, pp. 160-163. Doi: 10.1038/461160a http://www.nature.com/news/2009/090909/full/461160a.html