Undergraduate Course: Text Technologies for Data Science (INFR11145)
Course Outline
School | School of Informatics |
College | College of Science and Engineering |
Credit level (Normal year taken) | SCQF Level 11 (Year 4 Undergraduate) |
Availability | Available to all students |
SCQF Credits | 20 |
ECTS Credits | 10 |
Summary | The course covers the retrieval technologies behind search engines such as Google. It aims to strike a balance between theoretical and system-related aspects of the field. The course will cover:
1. Theoretical aspects, including properties of text, queries, relevance, major retrieval models and evaluation;
2. System-related aspects, including crawlers, text processing, index construction and retrieval algorithms. |
Course description |
Syllabus
1. Introduction: search applications, search tasks, user information needs.
2. Definitions: documents, queries, bag-of-words trick
3. Laws of text: Zipf, Heaps, clumping, index size.
4. Vector space: term weighting, similarity functions.
5. Vocabulary mismatch: tokenization, stemming, synonyms.
6. Indexing: inverted lists, compression, query execution.
7. Web crawling: XML feeds, crawling, expected age.
8. Content Extraction: XML tags, DOM, Finn's method.
9. Locality Sensitive Hashing: duplicates, Simhash.
10. Evaluation: recall, precision, F1, MAP, nDCG, query logs.
11. Web search: PageRank, hubs and authorities, link spam.
12. Probabilistic model: probability ranking principle, BM25.
13. Relevance models: exchangeability, cross-language search.
14. Language models for IR: query likelihood, smoothing.
15. Machine learning in IR: PA, SVM, SMO algorithms, LeToR.
16. Social media search, nature, challenges, tasks
17. Information filtering, topic drift
18. Text classification
|
Entry Requirements (not applicable to Visiting Students)
Pre-requisites |
Students MUST have passed:
|
Co-requisites | |
Prohibited Combinations | |
Other requirements | Maths requirements:
1. Linear algebra: Strong knowledge of vectors and matrices and the related mathematical operations (addition, multiplication, inverse, projections, etc.).
2. Probability theory: Discrete and continuous univariate random variables. Bayes rule. Expectation, variance. Univariate Gaussian distribution.
3. Calculus: Functions of several variables. Partial differentiation. Multivariate maxima and minima.
4. Special functions: Log, Exp, Ln.
Programming requirements:
1. Python and/or Perl, and good knowledge of regular expressions
2. Shell commands (cat, sort, grep, sed, ...)
3. An additional programming language could be useful for the course project.
Team-work requirement:
The final course project will be carried out in groups of 4-6 students. Working in a team for the project is a requirement.
|
Information for Visiting Students
Pre-requisites | Maths requirements:
1. Linear algebra: Strong knowledge of vectors and matrices and the related mathematical operations (addition, multiplication, inverse, projections, etc.).
2. Probability theory: Discrete and continuous univariate random variables. Bayes rule. Expectation, variance. Univariate Gaussian distribution.
3. Calculus: Functions of several variables. Partial differentiation. Multivariate maxima and minima.
4. Special functions: Log, Exp, Ln.
Programming requirements:
1. Python and/or Perl, and good knowledge of regular expressions
2. Shell commands (cat, sort, grep, sed, ...)
3. An additional programming language could be useful for the course project.
Team-work requirement:
The final course project will be carried out in groups of 4-6 students. Working in a team for the project is a requirement.
|
High Demand Course? |
Yes |
Course Delivery Information
|
Academic year 2017/18, Available to all students (SV1)
|
Quota: None |
Course Start |
Semester 1 |
Timetable |
Timetable |
Learning and Teaching activities (Further Info) |
Total Hours: 200 (Lecture Hours 18, Supervised Practical/Workshop/Studio Hours 12, Summative Assessment Hours 2, Programme Level Learning and Teaching Hours 4, Directed Learning and Independent Learning Hours 164)
|
Assessment (Further Info) |
Written Exam 60%, Coursework 40%, Practical Exam 0%
|
Additional Information (Assessment) |
The written examination will evaluate students' understanding of the fundamentals of text technologies and IR (worth 60% of the total course mark).
In addition, coursework will include three practical assignments that test how well students can apply the basics to real-life problems (worth 40% of the total course mark).
Assignments will be designed as follows:
1) Two assignments for students to complete individually (10% each).
2) One course project assignment for groups of 2-4 students (20%). |
Feedback |
Not entered |
Exam Information |
Exam Diet |
Paper Name |
Hours & Minutes |
|
Main Exam Diet S2 (April/May) | | 2:00 | |
|
Academic year 2017/18, Part-year visiting students only (VV1)
|
Quota: None |
Course Start |
Semester 1 |
Timetable |
Timetable |
Learning and Teaching activities (Further Info) |
Total Hours: 200 (Lecture Hours 18, Supervised Practical/Workshop/Studio Hours 12, Summative Assessment Hours 2, Programme Level Learning and Teaching Hours 4, Directed Learning and Independent Learning Hours 164)
|
Assessment (Further Info) |
Written Exam 60%, Coursework 40%, Practical Exam 0%
|
Additional Information (Assessment) |
The written examination will evaluate students' understanding of the fundamentals of text technologies and IR (worth 60% of the total course mark).
In addition, coursework will include three practical assignments that test how well students can apply the basics to real-life problems (worth 40% of the total course mark).
Assignments will be designed as follows:
1) Two assignments for students to complete individually (10% each).
2) One course project assignment for groups of 2-4 students (20%). |
Feedback |
Not entered |
Exam Information |
Exam Diet |
Paper Name |
Hours & Minutes |
|
Main Exam Diet S1 (December) | | 2:00 | |
Learning Outcomes
On completion of this course, the student will be able to:
- Describe the main algorithms for processing, storing and retrieving text.
- Show familiarity with theoretical aspects of IR, including the major retrieval models.
- Discuss the range of issues involved in building a real search engine.
- Evaluate the effectiveness of a retrieval algorithm.
- Build social media applications using text processing techniques.
|
Reading List
Text books:
"Introduction to Information Retrieval", C.D. Manning, P. Raghavan and H. Schütze
"Search Engines: Information Retrieval in Practice", W. Bruce Croft, Donald Metzler, Trevor Strohman
Readings:
"Machine Learning in Automated Text Categorization", F. Sebastiani
"The Zipf Mystery", YouTube video: https://www.youtube.com/watch?v=fCn8zs912OE
"Information Retrieval", C.J. van Rijsbergen
"Recommended Reading for IR Research Students", A. Moffat, J. Zobel, D. Hawking |
Contacts
Course organiser | Dr Walid Magdy
Tel: (0131 6)51 5612
Email: |
Course secretary | Mr Gregor Hall
Tel: (0131 6)50 5194
Email: |
|
|