Undergraduate Course: Text Technologies for Data Science (INFR11145)
Course Outline
School | School of Informatics |
College | College of Science and Engineering |
Credit level (Normal year taken) | SCQF Level 11 (Year 4 Undergraduate) |
Availability | Available to all students |
SCQF Credits | 20 |
ECTS Credits | 10 |
Summary | This course teaches the basic technologies required for text processing, focussing mainly on information retrieval and text classification. It gives a detailed overview of information retrieval and describes how search engines work. It also covers the main steps of text classification.
This is a highly practical course: at least 50% of what is taught will be implemented from scratch in coursework and labs, and students are required to complete a final project in small groups. All lectures, labs, and two pieces of coursework take place in Semester 1. The final group project is due early in Semester 2, by week 3 or 4. |
Course description |
Syllabus:
* Introduction to IR and text processing, system components
* Zipf, Heaps, and other text laws
* Pre-processing: tokenisation, normalisation, stemming, stopping
* Indexing: inverted index, boolean and proximity search
* Evaluation methods and measures (e.g., precision, recall, MAP, significance testing)
* Query expansion
* IR toolkits and applications
* Ranked retrieval and learning to rank
* Text classification: feature extraction, baselines, evaluation
* Web search
|
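As a rough illustration of two of the syllabus topics above (pre-processing and inverted-index boolean search), the following is a minimal sketch rather than course material. The function names, the tiny stopword list, and the sample documents are all assumptions made for this example, and stemming is omitted for brevity.

```python
import re
from collections import defaultdict

# Hypothetical stopword list, far smaller than any real one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def preprocess(text):
    """Tokenise on alphanumeric runs, lowercase, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return sorted IDs of documents containing every query term."""
    terms = preprocess(query)
    if not terms:
        return []
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings))


# Made-up sample collection for illustration only.
docs = {
    1: "The inverted index is the core of a search engine.",
    2: "Zipf's law describes term frequencies in text.",
    3: "A search engine ranks text documents.",
}
index = build_index(docs)
result = boolean_and(index, "search engine")
```

Here `result` holds the IDs of the documents that contain both query terms. Proximity search would additionally need term positions stored in each posting, which this sketch leaves out.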
Entry Requirements (not applicable to Visiting Students)
Pre-requisites |
Students MUST have passed:
|
Co-requisites | |
Prohibited Combinations | |
Other requirements | Maths requirements:
1. Linear algebra: Strong knowledge of vectors and matrices and the related operations (addition, multiplication, inverse, projections, etc.).
2. Probability theory: Discrete and continuous univariate random variables. Bayes' rule. Expectation, variance. Univariate Gaussian distribution.
3. Calculus: Functions of several variables. Partial differentiation. Multivariate maxima and minima.
4. Special functions: Log, Exp, Ln.
Programming requirements:
1. Python and/or Perl, and good knowledge of regular expressions
2. Shell commands (cat, sort, grep, sed, ...)
3. An additional programming language could be useful for the course project.
Team-work requirement:
The final course project will be carried out in groups of 4-6 students. Working in a team on the project is a requirement.
|
Information for Visiting Students
Pre-requisites | Maths requirements:
1. Linear algebra: Strong knowledge of vectors and matrices and the related operations (addition, multiplication, inverse, projections, etc.).
2. Probability theory: Discrete and continuous univariate random variables. Bayes' rule. Expectation, variance. Univariate Gaussian distribution.
3. Calculus: Functions of several variables. Partial differentiation. Multivariate maxima and minima.
4. Special functions: Log, Exp, Ln.
Programming requirements:
1. Python and/or Perl, and good knowledge of regular expressions
2. Shell commands (cat, sort, grep, sed, ...)
3. An additional programming language could be useful for the course project.
Team-work requirement:
The final course project will be carried out in groups of 4-6 students. Working in a team on the project is a requirement.
|
High Demand Course? |
Yes |
Course Delivery Information
|
Academic year 2022/23, Available to all students (SV1)
|
Quota: None |
Course Start |
Full Year |
Course Start Date |
19/09/2022 |
Timetable |
Learning and Teaching activities (Further Info) |
Total Hours: 200 (Lecture Hours 18, Supervised Practical/Workshop/Studio Hours 12, Summative Assessment Hours 2, Programme Level Learning and Teaching Hours 4, Directed Learning and Independent Learning Hours 164)
|
Assessment (Further Info) |
Written Exam 30%, Coursework 70%, Practical Exam 0%
|
Additional Information (Assessment) |
Written Exam 30%
Coursework 70%
Coursework (CW) accounts for 70% of the total mark, split as follows:
CW1: 10%, individual work covering the implementation of a basic search engine
CW2: 20%, individual work covering IR evaluation and web search
CW3: 40%, a group project, with 4-6 members per group
All of the coursework is heavy on system implementation, so familiarity with programming and software engineering is a prerequisite. Python is required for the implementation of CW1 and CW2. For CW3, students are free to use the implementation language they prefer. |
Feedback |
Not entered |
Exam Information |
Exam Diet | Paper Name | Hours & Minutes |
Main Exam Diet S2 (April/May) | | 2:00 |
Learning Outcomes
On completion of this course, the student will be able to:
- Build basic search engines from scratch, and use IR tools for searching massive collections of text documents
- Build feature extraction modules for text classification
- Implement evaluation scripts for IR and text classification
- Understand how web search engines (such as Google) work
- Work effectively in a team to produce working systems
|
Reading List
"Introduction to Information Retrieval", C.D. Manning, P. Raghavan and H. Schütze
"Search Engines: Information Retrieval in Practice", W. Bruce Croft, Donald Metzler, Trevor Strohman
"Machine Learning in Automated Text Categorization", F. Sebastiani
"The Zipf Mystery"
Additional research papers and videos to be recommended during lectures
|
Contacts
Course organiser | Dr Walid Magdy
Tel: (0131 6)51 5612
Email: |
Course secretary | Miss Lori Anderson
Tel: (0131 6)51 4164
Email: |
|
|