Course Details

Course Number: 95-865

Unstructured Data Analytics

Units: 6

Many organizations need to analyze large amounts of data such as text, images, audio, and video to discover useful information. For example, a company may want to monitor how the public discusses its products in social media, or a forensics team may need to discover the contents of disk drives seized by law enforcement. A recurring issue is that we often do not know what structure is present in the data initially. This course provides students with an understanding of common and emerging methods of organizing, summarizing, and analyzing large collections of this unstructured data ("unstructured data analytics"). There is a heavy emphasis on hands-on programming experience.
Students who have very limited or no programming experience, or who want a more conceptual exposition, should consider taking 94-775 instead.


Prerequisites:
Python programming experience. If you do not already know Python, we expect you to pick it up fairly rapidly on your own (this should be possible if you are already very comfortable coding in a different high-level programming language such as R or MATLAB).

Note that the course involves a fair amount of coding in Python and working with large datasets. We will make use of standard Python machine learning libraries such as scikit-learn and Keras.




Learning Objectives:

By the end of the course, students are expected to have developed the following skills. Skills are assessed by the homework assignments and the final exam.

* Recall and discuss common methods of conducting exploratory and predictive analysis of unstructured data;
* Write Python code for exploratory and predictive data analysis that handles large datasets;
* Work with the Amazon AWS cloud computing platform; and
* Apply unstructured data analysis techniques discussed in class to solve problems faced by governments and companies.



Faculty:
George Chen