Undergraduate Course: Text Technologies for Data Science (INFR11145)
Course Outline
| School | School of Informatics | 
College | College of Science and Engineering | 
 
| Credit level (Normal year taken) | SCQF Level 11 (Year 4 Undergraduate) | 
Availability | Available to all students | 
 
| SCQF Credits | 20 | 
ECTS Credits | 10 | 
 
 
| Summary | This course teaches the basic technologies required for text processing, focussing mainly on information retrieval and text classification. It gives a detailed overview of information retrieval and describes how search engines work. It also covers the main steps of text classification.  
 
This is a highly practical course: at least 50% of what is taught will be implemented from scratch in coursework and labs, and students are required to complete a final project in small groups. All lectures, labs, and the first two pieces of coursework take place in Semester 1. The final group project is due early in Semester 2, by week 3 or 4. | 
 
| Course description | 
    
    Syllabus: 
* Introduction to IR and text processing, system components 
* Zipf, Heaps, and other text laws  
* Pre-processing: tokenization, normalisation, stemming, stopping 
* Indexing: inverted index, boolean and proximity search 
* Evaluation methods and measures (e.g., precision, recall, MAP, significance testing) 
* Query expansion 
* IR toolkits and applications 
* Ranked retrieval and learning to rank 
* Text classification: feature extraction, baselines, evaluation 
* Web search
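
As an illustration of the pre-processing, indexing, and boolean/proximity search topics listed above, here is a minimal Python sketch. It is not course material: the stop list, function names, and sample documents are hypothetical, and stemming is omitted for brevity.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop list (assumption for illustration only).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def preprocess(text):
    """Tokenise, case-fold, and remove stop words (no stemming in this sketch)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Build a positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(preprocess(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, term_a, term_b):
    """Return the sorted ids of documents containing both terms."""
    return sorted(set(index.get(term_a, {})) & set(index.get(term_b, {})))


def proximity(index, term_a, term_b, window):
    """Return ids of documents where the two terms occur within `window` positions.

    Positions are counted after stop-word removal.
    """
    hits = []
    for doc_id in boolean_and(index, term_a, term_b):
        pos_a = index[term_a][doc_id]
        pos_b = index[term_b][doc_id]
        if any(abs(i - j) <= window for i in pos_a for j in pos_b):
            hits.append(doc_id)
    return hits


# Hypothetical sample collection.
docs = {
    1: "The inverted index is the core of a search engine",
    2: "Zipf and Heaps describe the statistics of text",
    3: "A search index supports boolean and proximity search",
}
index = build_index(docs)
```

For example, `boolean_and(index, "search", "index")` returns the documents that contain both terms, while `proximity(index, "search", "index", 1)` keeps only those where the terms are adjacent after stop-word removal.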
    
    
 | 
 
 
Entry Requirements (not applicable to Visiting Students)
| Pre-requisites | 
 | 
Co-requisites |  | 
 
| Prohibited Combinations |  Students MUST NOT also be taking    
Text Technologies for Data Science (UG) (INFR11229)  
  | 
Other requirements |  MSc students must register for this course, while Undergraduate students must register for INFR11229 instead. 
 
Maths requirements: 
1. Linear algebra: Strong knowledge of vectors and matrices with all related mathematical operations (addition, multiplication, inverse, projections, etc.). 
2. Probability theory: Discrete and continuous univariate random variables. Bayes rule. Expectation, variance. Univariate Gaussian distribution. 
3. Calculus: Functions of several variables. Partial differentiation. Multivariate maxima and minima. 
4. Special functions: Log, Exp, Ln. 
 
Programming requirements: 
1. Python and/or Perl, and good knowledge of regular expressions 
2. Shell commands (cat, sort, grep, sed, ...) 
3. An additional programming language could be useful for the course project. 
 
Team-work requirement: 
The final course project is carried out in groups of 4-6 students. Working in a team for the project is a requirement. | 
 
 
Information for Visiting Students 
| Pre-requisites | As above. No part time visiting students permitted. | 
 
		| High Demand Course? | 
		Yes | 
     
 
Course Delivery Information
 |  
| Academic year 2024/25, Available to all students (SV1) 
  
 | 
Quota:  None | 
 
| Course Start | 
Full Year | 
 
| Course Start Date | 
16/09/2024 | 
 
Timetable | 
| Learning and Teaching activities (Further Info) | 
 
 Total Hours: 200 (Lecture Hours 18, Supervised Practical/Workshop/Studio Hours 12, Summative Assessment Hours 2, Programme Level Learning and Teaching Hours 4, Directed Learning and Independent Learning Hours 164)
 | 
 
| Assessment (Further Info) | 
 
  Written Exam 30%, Coursework 70%, Practical Exam 0%
 | 
 
 
| Additional Information (Assessment) | 
Exam 30% 
Coursework 70% 
 
Course Work 1 (10%): individual work covering the implementation of a basic search engine 
Course Work 2 (20%): individual work covering IR evaluation and web search 
Course Work 3 (40%): a group project, with 4-6 members per group 
 
All of the coursework is heavy on system implementation, and thus being familiar with programming and software engineering is a pre-requisite. Python is required for implementation of Course Work 1 and Course Work 2. For Course Work 3, students are free to use the implementation language they prefer. | 
 
| Feedback | 
Not entered | 
 
| Exam Information | 
 
| Exam Diet | Paper Name | Minutes | 
| Main Exam Diet S2 (April/May) | Text Technologies for Data Science (INFR11145) | 120 | 
 
Learning Outcomes 
    On completion of this course, the student will be able to:
    
 - build basic search engines from scratch, and use IR tools for searching massive collections of text documents
 - build feature extraction modules for text classification
 - implement evaluation scripts for IR and text classification
 - understand how web search engines (such as Google) work
 - work effectively in a team to produce working systems
 
     
 | 
 
 
Reading List 
"Introduction to Information Retrieval", C.D. Manning, P. Raghavan and H. Schütze 
"Search Engines: Information Retrieval in Practice", W. Bruce Croft, Donald Metzler, Trevor Strohman 
"Machine Learning in Automated Text Categorization", F. Sebastiani 
"The Zipf Mystery" 
Additional research papers and videos to be recommended during lectures |   
 
Contacts 
| Course organiser | Dr Walid Magdy 
Tel: (0131 6)51 5612 
Email:  | 
Course secretary | Miss Yesica Marco Azorin 
Tel: (0131 6)50 5194 
Email:  | 
   
 
 |