Course Syllabus
Management, Access, and Use of Big and Complex Data
I535, I435, B669
Syllabus and Course Roadmap
19 Aug 2016
Instructor:
Professor Beth Plale
plale@indiana.edu
https://www.linkedin.com/in/bethplale
812 855 4373
@bplale
Course Summary
| Unit | Activity # | Title | Activity type | Assign Date | Discussion Date | Due Date |
|---|---|---|---|---|---|---|
| Unit 0 | | Introduction | | | | |
| | 1 | Introductory video | Lecture | 25-Aug-16 | | 26-Aug-16 |
| | 2 | Syllabus overview | In person/on-line (26 Aug) | | 26-Aug-16 (f2f*) 29-Aug-16 (online*) | |
| Unit 1 | | Big Data Intro | | 30-Aug-16 | | 5-Sep-16 |
| | 1 | Big Data in Business | Lecture | | | |
| | 2 | Big Data in Business | Reflection | | | |
| | 3 | Big Data in Scientific Research | Lecture | | | |
| | 4 | Big Data in Scientific Research | Reflection | | | |
| | 5 | I535, B669: Jetstream tutorial | In person/on-line | | 2-Sep-16 (f2f) 5-Sep-16 (online) | |
| | 5 | I435: Citrix tutorial | In person/on-line | | 2-Sep-16 (f2f) 5-Sep-16 (online) | |
| Unit 2 | | Data Pipelines | | 6-Sep-16 | | 12-Sep-16 |
| | 1 | Data processing pipelines in science | Lecture | | | |
| | 2 | Data processing pipelines in science | Reflection | | | |
| | 3 | Data processing pipelines in business | Lecture | | | |
| | 4 | Data processing pipelines in business | Reflection | | | |
| | 5 | Introduction to MongoDB, I | In person/on-line | | 9-Sep-16 (f2f) 12-Sep-16 (online) | |
| Unit 3 | | Software Systems Design | | 13-Sep-16 | | 19-Sep-16 |
| | 1 | Software systems design overview | Lecture | | | |
| | 2 | Software systems design overview | Reflection | | | |
| | 3 | Introduction to MongoDB, II | In person/on-line | | 16-Sep-16 (f2f) 19-Sep-16 (online) | |
| Unit 4 | | Complexity in software systems | | 20-Sep-16 | 23-Sep-16 (f2f) 26-Sep-16 (online) | 26-Sep-16 |
| | 1 | Complexity in software systems | Lecture | | | |
| | 2 | Complexity in software systems | Reflection | | | |
| Unit 5 | | NoSQL data stores | | 27-Sep-16 | 30-Sep-16 (f2f) 3-Oct-16 (online) | 3-Oct-16 |
| | 1 | NoSQL data stores | Lecture | | | |
| | 2 | NoSQL data stores | Reflection | | | |
| Unit 6 | | Comparison of data models through example | | 4-Oct-16 | | 10-Oct-16 |
| | 1 | Comparison of data models through example | Lecture | | | |
| | 2 | Comparison of data models through example | Reflection | | 14-Oct-16 (f2f) | |
| | 3 | Project Part A: Twitter dataset analysis | In person/on-line | | 14-Oct-16 (f2f) 10-Oct-16 (online) | 31-Oct-16 |
| Unit 7 | | Distributed File Systems Intro | | 11-Oct-16 | 14-Oct-16 (f2f) 17-Oct-16 (online) | 17-Oct-16 |
| | 1 | Distributed File Systems Intro | Lecture | | | |
| | 2 | Distributed File Systems Intro | Reflection | | | |
| Unit 8 | | Role of caching in distributed computing | | 18-Oct-16 | 21-Oct-16 (f2f) 24-Oct-16 (online) | 24-Oct-16 |
| | 1 | Role of caching in distributed computing | Lecture | | | |
| | 2 | Role of caching in distributed computing | Reflection | | | |
| Unit 9 | | Role of fault tolerance in distributed computing | | 25-Oct-16 | 28-Oct-16 (f2f) 31-Oct-16 (online) | 31-Oct-16 |
| | 1 | Role of fault tolerance in distributed computing | Lecture | | | |
| | 2 | Role of fault tolerance in distributed computing | Reflection | | | |
| Unit 10 | | Consistency in distributed noSQL data stores | | 1-Nov-16 | 4-Nov-16 (f2f) 7-Nov-16 (online) | 7-Nov-16 |
| | 1 | Consistency in distributed noSQL data stores | Lecture | | | |
| | 2 | Consistency in distributed noSQL data stores | Reflection | | | |
| Unit 11 | | Routing in noSQL data stores | | 8-Nov-16 | 11-Nov-16 (f2f) 14-Nov-16 (online) | 14-Nov-16 |
| | 1 | Routing in noSQL data stores | Lecture | | | |
| | 2 | Routing in noSQL data stores | Reflection | | | |
| Unit 12 (I535, I435) | | Data Cleaning and Coding (I535, I435) | | 15-Nov-16 | | 21-Nov-16 |
| | 1 | Data Cleaning | Lecture | | | |
| | 2 | Data Cleaning | Reflection | | | |
| | 3 | Data Coding: Project | In person/on-line | | 18-Nov-16 (f2f) 21-Nov-16 (online) | 9-Dec-16 |
| Unit 12 (B669) | | Data Cleaning and Evaluation (B669) | | 15-Nov-16 | | 21-Nov-16 |
| | 1 | Data Cleaning | Lecture | | | |
| | 2 | Data Cleaning | Reflection | | | |
| | 3 | Project Part B: Twitter database evaluation | In person/on-line | | 18-Nov-16 (f2f) 21-Nov-16 (online) | 9-Dec-16 |
| Unit 13 | | Data Provenance | | 29-Nov-16 | 2-Dec-16 (f2f) 5-Dec-16 (online) | 5-Dec-16 |
| | 1 | Data Provenance | Lecture | | | |
| | 2 | Data Provenance | Reflection | | | |
| Unit 14 | | Overcoming Social Barriers to Data Sharing | | 6-Dec-16 | 9-Dec-16 (f2f) 12-Dec-16 (online) | 13-Dec-16 |
| | 1 | Overcoming Social Barriers to Data Sharing | Lecture | | | |
| | 2 | Overcoming Social Barriers to Data Sharing | Reflection | | | |
| Unit 15 | | Concluding Discussion | | 14-Dec-16 | 16-Dec-16 (f2f) 12-Dec-16 (online) | 16-Dec-16 |
* In the Discussion Date column, f2f refers to the sections that meet face-to-face on Fridays, i.e. 34950, 32502 and 32869. online refers to the section that meets online via Zoom, i.e. 32871.
Course Roadmap
Unit 0: Introduction and Background Competency
This course is a “flipped course”: students view recorded lectures, then reinforce their learning through discussions, reflections, and hands-on activity. The course is broken into 14 content units, plus an introduction and a concluding discussion, corresponding roughly to the length of a semester. Students should expect to put 6-7 hours per week into the course, including time spent on readings, reflections, and instructional content. Read the course policies document and watch the introductory video about the course. Students with weak backgrounds in databases and file systems are advised to work through the following well-made tutorial videos on MySQL and the Linux file system.
- MySQL Database Tutorial - 1 - Introduction to Databases, Bucky Roberts, Jan, 2012.
- MySQL Database Tutorial - 2 - Getting a MySQL Server
- MySQL Database Tutorial - 3 - Creating a Database
- MySQL Database Tutorial - 4 - SHOW and SELECT
- MySQL Database Tutorial - 5 - Basic Rules for SQL Statements
- MySQL Database Tutorial - 6 - Getting Multiple Columns
- Learning the Linux File System, Joe Collins, Jul 2015
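As a warm-up covering the same ground as the MySQL tutorials (creating a database, listing tables, basic SELECT), here is a minimal sketch that needs no MySQL server, using Python's built-in sqlite3 module as a stand-in; the table and column names are invented for illustration.

```python
import sqlite3

# Throwaway in-memory database, roughly analogous to CREATE DATABASE (tutorial 3)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, section TEXT)")
cur.executemany("INSERT INTO students (name, section) VALUES (?, ?)",
                [("Ana", "f2f"), ("Raj", "online"), ("Mei", "f2f")])

# SHOW TABLES equivalent (tutorial 4); SQLite exposes its catalog as sqlite_master
tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# Basic SELECT with a WHERE clause and ordering (tutorials 4-6)
f2f = [r[0] for r in cur.execute(
    "SELECT name FROM students WHERE section = 'f2f' ORDER BY name")]
conn.close()
```

The SQL statements themselves transfer to MySQL almost unchanged; only the catalog query (`sqlite_master` vs. MySQL's `SHOW TABLES`) differs.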
Unit 1: Big Data Intro
Activity 1-2: Big Data in Business
Key topic: business perspective on big data
Gives the student a perspective on how society thought and talked about big data as it first entered our lexicon. Activities 1 and 2 examine the topic from the perspective of business. For this activity, watch the video, read the two readings, and carry out the reflection.
Readings:
1. A special report on managing information: Data, data everywhere, The Economist, February 25, 2010
2. Data Scientist, The Sexiest Job of the 21st Century, Thomas H. Davenport and D.J. Patil, Harvard Business Review, pp. 70-76, Oct 2012
Activity 3-4: Big Data in Scientific Research
Key topic: science perspective on big data
Gives the student a science perspective on big data. The student will read a dozen or so short articles from Science magazine (Feb 2011), written by practitioners in fields spanning the social, medical, and natural sciences, each discussing their unique data issues. Taken together, the collection highlights how differently one discipline sees its data challenges from another.
For this activity, watch the video, do the readings, and carry out the reflection.
Readings:
“Dealing with Data”, Special Online Collection, Science, 11 February 2011. See Canvas under Unit 1 for articles.
Unit 2: Data Processing Pipelines
Learning Objectives: The student gains an understanding of the concept of a data pipeline, its connection to the lifecycle of data, and its use in science and business. The student acquires a basic understanding of the complexity of a software algorithm and its relationship to the number of steps in the algorithm. The student applies the data pipeline concept to an example from their own experience, and has the opportunity to research a popular form of data pipeline in business today, the Amazon Web Services (AWS) Data Pipeline.
Activity 1-2: Data Processing Pipelines in Science
Key topics: What is a data pipeline? How does a data pipeline connect to big data? How is a data pipeline used in science? Why does software complexity hurt when data gets big?
Data rarely show up instantly ready for whatever exploratory purpose you may have in mind. Between creation and use, data undergo numerous steps, some of which are end products in and of themselves. Watch the activity 1 video, do the reading, and carry out the reflection.
Readings:
- Jim Gray on eScience: A Transformed Scientific Method, in The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey, Stewart Tansley, and Kristin Tolle, eds., Microsoft Research, 2009, pp. xix-xxxiii.
- Jim Gray’s Fourth Paradigm and the Construction of the Scientific Record, Clifford Lynch, in The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey, Stewart Tansley, and Kristin Tolle, eds., Microsoft Research, 2009, pp. 177-183.
- Understanding execution time complexity: the Selection Sort versus the Heap Sort.
- UK Data Archive: The Research Data Lifecycle
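The complexity reading above contrasts selection sort, which takes O(n^2) comparisons, with heap sort at O(n log n). A minimal sketch of both, as a concrete instance of why the number of steps in an algorithm matters once data gets big:

```python
import heapq

def selection_sort(xs):
    """O(n^2): n passes, each scanning the unsorted tail for the minimum."""
    xs = list(xs)
    for i in range(len(xs)):
        j = min(range(i, len(xs)), key=xs.__getitem__)  # index of smallest remaining
        xs[i], xs[j] = xs[j], xs[i]
    return xs

def heap_sort(xs):
    """O(n log n): heapify once, then pop the minimum n times."""
    heap = list(xs)
    heapq.heapify(heap)                       # O(n) to build the heap
    return [heapq.heappop(heap) for _ in xs]  # each pop costs O(log n)
```

Doubling n roughly quadruples the work for selection sort but only slightly more than doubles it for heap sort, which is the whole argument for choosing algorithms carefully in a data pipeline.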
Activity 3-4: Data Processing Pipelines in Business
Key topics: How does business view a data pipeline? The role of cloud computing in the business view of data pipelines. Exploration with a current data pipeline tool.
Activity 3 introduces the business perspective on data pipelines. It draws inspiration from "Data Without Limits", a 2011 talk by Werner Vogels, CTO of Amazon, in which he discusses data pipelines in the context of business computing. He argues that cloud computing is core to a business model "without limits". The pipeline he proposes is: collect | store | organize | analyze | share.
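The collect | store | organize | analyze | share pipeline can be rendered as composed functions. This is a toy sketch of the shape of such a pipeline; the stage bodies and the user-event data are invented placeholders, not AWS APIs.

```python
def collect(source):
    """Gather raw events from some source."""
    return list(source)

def store(events):
    """Persist the raw data; here, just keep it in memory."""
    return {"raw": events}

def organize(db):
    """Index the stored data by a key of interest (here, the user)."""
    db["by_user"] = {}
    for e in db["raw"]:
        db["by_user"].setdefault(e["user"], []).append(e)
    return db

def analyze(db):
    """Derive a summary: number of events per user."""
    return {u: len(es) for u, es in db["by_user"].items()}

def share(report):
    """Publish the result; here, return a sorted view."""
    return sorted(report.items())

def pipeline(source):
    # The five stages of Vogels's pipeline, composed left to right
    return share(analyze(organize(store(collect(source)))))
```

The value of the framing is that each stage has a single responsibility and a well-defined handoff, so stages can be swapped out (e.g. the in-memory store for a real data store) without touching the rest.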
Readings: No readings for this lesson.
Resources:
1. Vogels talks about MapReduce extensively during his discussion of analysis. If you're not familiar with MapReduce, a decent primer on MapReduce (and Hadoop, its open-source implementation) can be found here: http://readwrite.com/2013/05/23/hadoop-what-it-is-and-how-it-works
2. AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you've set up. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
3. Awslabs/data-pipeline-samples, https://github.com/awslabs/data-pipeline-samples
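The MapReduce model the primer describes can be sketched in miniature: a map phase emits (key, value) pairs, a shuffle groups pairs by key, and a reduce phase aggregates each group. Word count is the canonical example; in real Hadoop the three phases run distributed across machines, which this single-process sketch obviously does not show.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(docs):
    """Emit a (word, 1) pair for every word in every document."""
    return [(word, 1) for doc in docs for word in doc.split()]

def shuffle(pairs):
    """Group the emitted pairs by key, as the framework's shuffle step does."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {k: [v for _, v in grp]
            for k, grp in groupby(pairs, key=itemgetter(0))}

def reduce_phase(grouped):
    """Aggregate each key's values; for word count, just sum them."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(docs):
    return reduce_phase(shuffle(map_phase(docs)))
```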
Unit 3: Software Systems Design
Keywords: distributed systems, emergent behavior, tradeoffs in software system design
Unit 3 is an introduction to the general concepts of software systems. These concepts are used during design of the large software systems needed to handle any of today’s large applications (social media, cloud services, shopping carts, video rentals …).
Readings: Principles of Computer System Design: An Introduction, Jerome H. Saltzer and M. Frans Kaashoek, Morgan Kaufmann, 2009. Read Chapter 1, Sections 1.1 and 1.2 only.
Unit 4: Complexity in Software Systems
Keywords: complexity, layering, abstraction, modularity, hierarchy
A key problem in large-scale applications is complexity. We examine the sources of that complexity, and design principles that are used to overcome the complexity.
Readings: Principles of Computer System Design: An Introduction, Jerome H. Saltzer and M. Frans Kaashoek, Morgan Kaufmann, 2009. Read Chapter 1, Sections 1.3 - 1.5 only.
Unit 5: NoSQL data stores
Keywords: document, key-value, graph, and column stores
This unit introduces large-scale data stores. It gives an overview of noSQL data stores using a talk by Martin Fowler at the GoTo Aarhus 2012 conference. Readings additionally cover the concept of data services. As the reading states, “while data services were initially conceived to solve problems in the enterprise world, the cloud is now making data services accessible to a much broader range of consumers.”
Readings: Choosing the right NoSQL database for the job: a quality attribute evaluation, João Ricardo Lourenço, Bruno Cabral, Paulo Carreiro, Marco Vieira, and Jorge Bernardino, Journal of Big Data 2:18, Springer, 2015.
You may find this resource helpful. I suggest skipping the section “Research design and methodology”.
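One attribute the noSQL families differ on is querying richness: a pure key-value store can only fetch by key, while a document store can also match on fields inside the stored record. A toy sketch of that difference, with both stores modeled as in-memory dicts (the class names and records are invented, not any real store's API):

```python
class KeyValueStore:
    """Opaque values addressed only by key: fast, but get-by-key is all you have."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

class DocumentStore(KeyValueStore):
    """Values are self-describing documents, so the store can match on fields."""
    def find(self, **criteria):
        return [doc for doc in self._data.values()
                if all(doc.get(k) == v for k, v in criteria.items())]
```

A real document store backs `find` with secondary indexes rather than a full scan, but the interface contrast, key lookup versus field-based query, is the point.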
Unit 6: Comparison of Data Models By Example
Keywords: schema, data model
Relational databases provide structured and normalized tables for rapid and precise querying. noSQL stores support less-structured data but offer less rich querying. This unit walks you through a comparison of data models across different storage systems (relational, document store, key-value pair, and column store) using real ecological data from a science research project at Indiana University.
Resources:
Dataset: The data used in this unit is available in the forms as “dumped” from the different datastores. Compliments of Dr. Peng Chen.
10 Rules for Scalable Performance in ‘Simple Operation’ Datastores, Michael Stonebraker and Rick Cattell, Communications of the ACM, 54(6), pp. 72-80. http://cacm.acm.org/magazines/2011/6/108651-10-rules-for-scalable-performance-in-simple-operation-datastores/fulltext
Optional:
Scalable SQL and NoSQL Data Stores, Rick Cattell, ACM SIGMOD Record 39(4) Dec 2010, revised 2011 version available here http://www.cattell.net/datastores/Datastores.pdf
Choosing the right NoSQL database for the job: a quality attribute evaluation, J.R. Lourenco, B. Cabral, P. Carreiro, M. Vieira, J. Bernardino, Journal of Big Data, Springer Open Journal 2(18) 2015, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-015-0025-0
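The kind of comparison this unit walks through can be previewed with one record rendered under each data model. The ecological observation below is invented for illustration, not drawn from the unit's dataset; the content is identical in all four forms, only the shape changes.

```python
# Relational: a flat row whose meaning comes from a fixed schema
# (columns: obs_id, date, species, dbh_cm).
relational_row = ("obs-17", "2016-08-19", "oak", 41.2)

# Document: a self-describing, possibly nested record.
document = {"_id": "obs-17",
            "date": "2016-08-19",
            "tree": {"species": "oak", "dbh_cm": 41.2}}

# Key-value: the value is an opaque blob, addressed only by its key.
key_value = {"obs-17": '{"date":"2016-08-19","species":"oak","dbh_cm":41.2}'}

# Column family: values grouped by column, convenient for scans over
# a single attribute across many observations.
columns = {"date":    {"obs-17": "2016-08-19"},
           "species": {"obs-17": "oak"},
           "dbh_cm":  {"obs-17": 41.2}}
```

Which shape is "best" depends on the query: the relational row suits precise joins and filters, the document suits retrieving one whole observation, and the column layout suits aggregating one attribute over millions of records.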
Unit 7: Distributed File Systems Intro
Keywords: transparencies, session semantics, fault tolerance, naming
A major focus of this course is for the student to understand the distributed systems concepts that underlie today’s noSQL stores. The next step leading up to a study of the noSQL stores themselves is distributed file systems, where key concepts like transparencies, session semantics, fault tolerance, and naming all take a form that can be easily understood in terms of files and directories, which we all work with.
Reading:
Distributed File Systems: Concepts and Examples, E. Levy, A. Silberschatz, ACM Computing Surveys, Vol 22(4), Dec 1990, pp. 321-374. Sections 1-3 only.
Unit 8: Role of Caching in Distributed Applications
Keywords: caching, locality of reference, cache replacement strategy, cache coherency
Caching of data is key to the efficient performance of a large distributed application. We turn back to the Levy and Silberschatz paper to see what the authors have to say about caching.
Reading:
Distributed File Systems: Concepts and Examples, E. Levy, A. Silberschatz, ACM Computing Surveys, Vol 22(4), Dec 1990, pp. 321-374. Section 4 (skip 4.2 and 4.3)
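A cache replacement strategy decides which entry to evict when the cache is full. The most common choice, and the one that exploits locality of reference, is LRU (least recently used): recently touched items stay, the item untouched longest goes. A minimal sketch using an ordered dict to track recency:

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache evicting the least-recently-used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()   # insertion order doubles as recency order

    def get(self, key):
        if key not in self._items:
            return None               # cache miss
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used
```

Locality of reference is what makes this pay off: if a client just read a block, it is likely to read it (or its neighbors) again soon, so keeping recent blocks cached avoids repeated trips to the remote file server.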
Unit 9: Role of Fault Tolerance in Distributed Applications
Keywords: stateful and stateless servers, idempotence, transactions
When distributed systems span multiple locations or computers, the incidence of failure increases substantially. In this lesson we’ll turn back to the Levy and Silberschatz paper to learn about fault tolerance.
Readings:
Distributed File Systems: Concepts and Examples, E. Levy, A. Silberschatz, ACM Computing Surveys, Vol 22(4), Dec 1990, pp. 321-374, Section 5.1
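One keyword above, idempotence, is worth a concrete sketch: when an acknowledgment is lost, the client must retry, so an operation is only safe to retry if applying it twice gives the same result as applying it once. The account example below is invented to show the contrast between an idempotent update ("set the balance") and a non-idempotent one ("add to the balance").

```python
class Account:
    def __init__(self):
        self.balance = 100

    def set_balance(self, value):
        """Idempotent: repeating the request leaves the same state."""
        self.balance = value

    def add(self, amount):
        """Not idempotent: each repeat compounds the effect."""
        self.balance += amount

def retry(op, times=2):
    # Simulate a client that never saw the first ACK and resends the request.
    for _ in range(times):
        op()
```

Stateless servers lean on this: if every request is self-contained and idempotent, a crashed-and-restarted server can simply process the retry with no memory of whether it saw the original.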
Unit 10: Consistency in Distributed NoSQL Stores
Keywords: eventual consistency, CAP theorem, quorum voting
When storage in a data store is distributed across multiple storage devices, consistency of reads and writes becomes a paramount issue.
Reading:
Eventually Consistent, W. Vogels, Communications of the ACM, 52, 1, Jan 2009
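Vogels's paper discusses the quorum rule behind many replicated stores: with N replicas, a write quorum W and a read quorum R satisfying R + W > N force every read set to overlap the most recent write set, so a versioned read can return the newest value. A toy sketch (the class and its version clock are invented for illustration):

```python
class QuorumStore:
    """N in-memory replicas with write quorum W and read quorum R, R + W > N."""

    def __init__(self, n, w, r):
        assert r + w > n, "quorum condition R + W > N violated"
        self.replicas = [dict() for _ in range(n)]
        self.w, self.r = w, r
        self.clock = 0   # monotonically increasing version number

    def write(self, key, value):
        self.clock += 1
        for rep in self.replicas[: self.w]:   # any W replicas acknowledge
            rep[key] = (self.clock, value)

    def read(self, key):
        # Read from R replicas; taking the *last* R is the worst case for
        # overlap with the first-W write set, yet R + W > N guarantees
        # at least one replica in common.
        answers = [rep[key] for rep in self.replicas[-self.r:] if key in rep]
        return max(answers)[1]                # highest version wins
```

With N=3, W=2, R=2 the read set and write set always share a replica; drop to R=1, W=1 and the assertion fires, which is exactly the configuration that can return stale data.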
Unit 11: Routing in NoSQL Data Stores
Keywords: routing, distributed hash tables, Chord, peer-to-peer, local versus global knowledge
When data are stored across multiple computers in a single noSQL data store, and the store can be accessed through any of the multiple servers that support it, how does the data store ensure that a request for a data object is routed to the location where the data are stored? This is the routing problem in noSQL data stores. This lesson discusses ways of keeping track of the information needed to route requests to the right server.
Readings: No readings for this lesson.
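One common answer to the routing problem, and the idea underlying Chord-style distributed hash tables, is consistent hashing: servers and keys hash onto the same ring, and a key is routed to the first server clockwise from its position, so a node needs only local knowledge of the ring rather than a global directory. A minimal single-process sketch (server names are invented; real systems add virtual nodes for balance):

```python
import bisect
import hashlib

def _hash(name):
    """Map any string to a point on the ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        # Sorted (hash, server) pairs form the ring
        self._ring = sorted((_hash(s), s) for s in servers)

    def route(self, key):
        points = [h for h, _ in self._ring]
        # First server at or after the key's position, wrapping around
        i = bisect.bisect(points, _hash(key)) % len(self._ring)
        return self._ring[i][1]
```

The payoff over naive `hash(key) % n` routing is that adding or removing one server remaps only the keys in that server's arc of the ring, not nearly every key in the store.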
Unit 12: Data Cleaning and Coding (I535, I435)
Keywords: data cleaning, missing data, quantitative coding, tagging data, categorizing data, coding data, feature extraction
The student learns data cleaning through real use cases from environmental science and social science, and gains basic knowledge about coding data: the purpose of coding and the methodology for coding. Coding is, at its most basic, the tagging or categorization of data on important features so that themes emerge. The student will get a chance to try their hand at coding a dataset.
Readings: No readings for this lesson.
Coding Project. The student selects their own project or uses this one, which draws on 278 media mentions of the Pervasive Technology Institute over the year 2013-2014. The student's categorization will be illustrated by visualizing the results as a simple pie chart.
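Coding at its most basic is assigning each item a category tag, then tallying the tags so themes emerge. The sketch below shows the mechanics with keyword rules; the mentions, keywords, and category names are invented placeholders, not the actual Pervasive Technology Institute dataset.

```python
from collections import Counter

def code(mention, rules):
    """Assign a category: first matching keyword rule wins, else a catch-all."""
    for keyword, category in rules:
        if keyword in mention.lower():
            return category
    return "other"

# Hypothetical coding scheme: (keyword, category) pairs, checked in order
rules = [("supercomputer", "HPC"),
         ("data", "data science"),
         ("cloud", "cloud")]

mentions = ["IU supercomputer ranked", "Big data center opens",
            "New cloud service", "Campus news roundup"]

# The tallied categories are exactly what feeds a pie chart
tally = Counter(code(m, rules) for m in mentions)
```

Real qualitative coding is usually done by a human reader rather than keyword matching, but the output has the same shape: a category per item, and counts per category to visualize.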
Unit 12: Data Cleaning and Evaluation (B669)
Keywords: data cleaning, missing data, quantitative coding, tagging data, categorizing data, coding data, feature extraction
The student learns data cleaning through real use cases from environmental science and social science, and gains basic knowledge about coding data: the purpose of coding and the methodology for coding. Coding is, at its most basic, the tagging or categorization of data on important features so that themes emerge. The student will get a chance to try their hand at coding a dataset.
Readings: No readings for this lesson.
Evaluation Project. Student carries out Track B of the Twitter project.
Unit 13: Data Provenance
Keywords: Data provenance, causality graph, Open Provenance Model
As data sharing increasingly moves from a friendly exchange between two parties who know each other to a transaction on an open data sharing market, the need grows for data to carry sufficient information for the recipient to establish whether or not they trust and can use the data. Data provenance lies at the heart of the descriptive data needed to discern data trustworthiness. This lesson introduces data provenance and gives you a sense of what provenance data is and how it is represented.
Reading:
A Survey of Data Provenance in e-Science, Yogesh Simmhan, Beth Plale, Dennis Gannon, Association for Computing Machinery SIGMOD Record, vol 34, no. 3, pp. 31-36, Sep 2005
Unit 14: Overcoming Social and Technical Barriers to Data Sharing
Keywords: data trustworthiness, data sharing, economies of data sharing
The student gains appreciation for the social and economic barriers to sharing data.
Readings:
1. Who Will Pay for Public Access to Research Data? Fran Berman and Vint Cerf, Science, 9 August 2013, Vol. 341, no. 6146, pp. 616-617, DOI: 10.1126/science.1241625 (www.sciencemag.org). Preprint version available at: http://www.cs.rpi.edu/~bermaf/Berman%20+%20Cerf%20Public%20Access%20--%20Author%20Version.pdf
2. If Data Sharing is the Answer, What is the Question? Christine L. Borgman, http://ercim-news.ercim.eu/en100/special/if-data-sharing-is-the-answer-what-is-the-question
3. Data sharing: empty archives, Bryn Nelson, Nature 461, published online 9 Sep 2009, pp. 160-163. Doi: 10.1038/461160a http://www.nature.com/news/2009/090909/full/461160a.html