Under the teaching plan for semester 2 of the 2018-2019 academic year, the Faculty of Computer Science, in coordination with the Office of Special Programs and the Undergraduate Academic Affairs Office, is opening a new course taught by Professor John F. Hurdle (University of Utah/Department of Biomedical Informatics).
Details of Prof. Hurdle's course in the course registration system are as follows:

  • Course title: Medical Text Processing
  • Class code: CS339.J21.KHTN
  • Elective course in the Computer Science major, accepted for all cohorts and training programs (from K10 onward).
  • Time: Tuesdays, periods 6-8, Room E.22
  • Eligibility: any student at the university who is able to study in English may register.
  • The class will have a Vietnamese lecturer as teaching assistant (Dr. Nguyễn Lưu Thùy Ngân).

Main course content:
Course Overview:

  • NLP and ML form the foundation of the essential online tools we use every day, such as Google’s extraordinary search engine; Netflix’s “here is a film we think you will like” suggestion technology; and Amazon’s “what to buy next” recommendation technology. NLP and ML are extremely popular with industry globally, so students who know them will be well positioned to move ahead in their careers. This course will introduce the essential theory behind these tools and will stress applying the theory to real-world problems. These tools are very easy to use poorly, so we focus on their principled application.

Course Schedule & Topics:

  • Week 1:  Introduce Prof. Hurdle; introduce the students and other attendees; Review syllabus and course requirements; and Survey student and other attendees’ programming background and skill sets.
  • Week 2:  NLP/ML Bootcamp. The rationale for using NLP; Text as data; Linguistic versus statistical approaches to NLP; How NLP is used in common apps (e.g., Google, Amazon, etc.); Career opportunities.
  • Week 3:  Our tools: Jupyter Notebook and NLTK. Introduction to the Jupyter Notebook system; Introduction to the Natural Language Toolkit (NLTK); Accessing corpora and related resources.
  • Week 4:  The NLP Pipeline and Preprocessing Stages/Modules. Why pipelines?; Overview of UIMA and UIMA-AS; Standard pre-processing of text as pipeline stages.
  • Week 5:  Basics of Information Extraction (IE) in a Pipeline. The foundation of NLP: finding discrete information in text (dictionaries, indexes, and regular expressions); Named Entity Recognition; Clinical texts: the Unified Medical Language System (UMLS); IE as a feed to ML.
  • Week 6:  Evaluation of NLP and ML systems: performance metrics. The confusion matrix; Precision, Recall, and F-Score; Sample size and bias; Problems with the accuracy measure.
  • Week 7: Midterm Exam
  • Week 8:  Basics of Information Retrieval (IR) using a Pipeline. The foundation of IR: finding a class of documents in a corpus; Indexing documents; Building on top of indexes (Google and Nutch); Bag-of-words and sparse data.
  • Week 9:  Application-specific Pipelines: Clinical text use case. Part-of-Speech (POS) tagging; Stop words; Dictionary lookup; Using the UMLS in a pipeline; Clinical context is important: FastConText.
  • Week 10:  Unsupervised Methods and Text Annotation: Clinical text use case.  Approaches to annotating text to measure NLP/ML performance; Avoiding annotation: unsupervised clustering and related methods; Brief introduction to sub-language theory. // In-class: Brief exercise illustrating the pain of annotating; Using CLUTO to discover clinical sub-languages.
  • Week 11:  Machine Learning Bootcamp: focus on classification Part 1. (All lecture this session) Classification defined; Training and testing; Hyper-spaces and the algorithms to explore them; The baseline: logistic regression; Alternatives to the baseline: Naïve Bayes (NB), Classification trees and variations (CT), Support Vector Machines (SVM), and Neural Networks (NN).
  • Week 12:  Machine Learning Bootcamp: focus on classification Part 2. (All In-class this session) A team race: class breaks up into Team NB, Team CT, Team SVM, and Team NN. All teams are given the same training set; each team tunes its model and measures performance on the training set; then each team is given the test set (no tuning allowed here).
  • Week 13:  TBA. This session will be adapted on-the-ground, tailored to topics the students want to learn more about.
  • Week 14:  Troubleshooting Final Projects. Student teams present an overview of their projects and brainstorm with the class on barriers/pitfalls/workarounds.
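
The Week 6 topics (the confusion matrix; precision, recall, and F-score; problems with the accuracy measure) can be illustrated with a minimal Python sketch. This is not part of the official course materials: the function names and the toy labels below are hypothetical, chosen only to show why accuracy can look good on imbalanced data while precision, recall, and F-score do not.

```python
def confusion_counts(gold, predicted, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif g == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels: 1 = document mentions a condition, 0 = it does not.
gold = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# A "classifier" that always predicts the majority (negative) class.
always_negative = [0] * len(gold)

tp, fp, fn, tn = confusion_counts(gold, always_negative)
accuracy = (tp + tn) / len(gold)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print("accuracy:", accuracy)
print("precision:", precision, "recall:", recall, "F1:", f1)
```

In this toy case the always-negative predictor is right on 8 of 10 documents, yet it finds none of the positive documents, so its recall and F1 are both zero.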