Invited Lecturers

Kevin Cohen

Language and linguistics in NLP/NLP for biomedical language

Introduction: Cultural changes in the language processing community have led to the odd situation that in language processing research, we usually don’t think very much about language. The consequences of this lack of attention to language can be a lot of wasted time, and probably not knowing why something didn’t work the way that you expected it to. Language has properties that make some approaches to language processing much more difficult than others, and we’ll explore the aspects of language that lead to these computational and engineering problems. This will include their implications for evaluation, for testing, and for reproducibility of results.

The three lectures will be focussed on the whys of all of this. The two hands-on sessions--and three homework assignments--will be focussed on developing a gut-level understanding of the consequences of those whys for building natural language processing systems in the biomedical domain, and thence the hows of dealing with them.

Lecture 1: Language has intrinsic properties that make it different from most other engineering challenges. In the first lecture, we will examine those properties in some depth, in particular the notions of ambiguity and variability. For example, we’ll use the nature of grammars to explore why regular expressions are fine for some language processing tasks, and terrible for others; a discussion of some of the distributional properties of language will be used to explore some of the challenges for machine learning.
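As a rough illustration of the regular-expression point (a sketch only, not material from the lecture), the Python below contrasts a flat, local pattern, which a regular expression handles well, with nested structure, which it does not. The dose pattern, the function names, and the example strings in the usage note are all hypothetical.

```python
import re

# Hypothetical pattern: a number followed by a dose unit. Flat, local
# patterns like this are the kind of thing a regular expression does well.
DOSE = re.compile(r"\b(\d+(?:\.\d+)?)\s?(mg|ml|g)\b", re.IGNORECASE)


def find_doses(text):
    """Return (amount, unit) pairs found in the text."""
    return [(m.group(1), m.group(2).lower()) for m in DOSE.finditer(text)]


# A regular expression without recursion can only describe brackets to a
# fixed depth. This one matches parentheticals that contain no brackets.
FLAT_PAREN = re.compile(r"\([^()]*\)")


def innermost_parentheticals(text):
    """Return only the innermost bracketed spans; outer ones are missed."""
    return FLAT_PAREN.findall(text)


def is_balanced(text):
    """Check arbitrarily deep nesting with a counter instead of a regex."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closing bracket with nothing to close
                return False
    return depth == 0
```

Calling find_doses("aspirin 81 mg daily, then 2.5 ml") picks out both doses, while innermost_parentheticals("IL-2 (interleukin 2 (a cytokine)) binds") recovers only the inner parenthetical. The counter in is_balanced is the smallest step beyond a regular expression: it remembers depth, which a finite-state pattern cannot do for unbounded nesting.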

Lecture 2: Just as language processing is different from most other engineering problems, biomedical language has lots of its own quirks that make it different from other language processing problems. In the second lecture, we will talk about how to approach a data set in order to understand how its properties might affect your design decisions when putting a language processing system together. We will use this to begin the discussion of how to hack together a quick solution, and then some ways that you could improve it.

Lecture 3: This lecture continues the discussion of how to put together a quick solution for a new set of linguistic data, moving towards some of the many ways to turn the quick solution into a mature solution. We will focus on two projects that (a) might be useful to you, (b) illustrate different aspects of language and of genres that can make natural language processing difficult, and (c) illustrate different uses of the lexical and ontological resources that make the biomedical domain so attractive to language processing people. Included in this lecture is a long list of every mistake I’ve ever made, so that you won’t make them.


Homework assignments:

  • An annotation project. As machine learning becomes more and more of a commodity, the differentiating factor in natural language processing becomes data. In order to understand the metaphysics and ontology of our data, you will create some. You will find a partner, and you will each spend an hour designing a project and an hour doing annotations. Then you will analyze the results.
  • An exploration of figures of merit. Because evaluation is one of the best-developed, and yet still often most misunderstood, aspects of natural language processing research, you will model the relationships between different figures of merit so that you understand both which ones are appropriate, and how to design a data set so that you’ll be able to calculate the ones that you want.
  • An analysis of biomedical linguistic data. You will have a sample of clinical or publication data, depending on your interests. Because ambiguity is a major problem for language processing, you will find and explain ambiguities in that data.
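For the figures-of-merit exercise, one possible starting point (a sketch under stated assumptions, not the assignment's required method) is to compute precision, recall, and F1 from counts, and then to model how precision changes with the proportion of positives in the data while sensitivity and specificity stay fixed. The function names and the numbers in the loop are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and balanced F-measure from counts.

    Note that true negatives are not needed for any of the three.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def expected_counts(n, prevalence, sensitivity, specificity):
    """Expected confusion-matrix cells for a data set of n items.

    prevalence is the fraction of items that are truly positive;
    sensitivity and specificity describe a hypothetical system.
    """
    positives = n * prevalence
    negatives = n - positives
    tp = sensitivity * positives
    fn = positives - tp
    tn = specificity * negatives
    fp = negatives - tn
    return tp, fp, fn, tn


if __name__ == "__main__":
    # Same hypothetical system, three hypothetical data sets.
    for prevalence in (0.5, 0.1, 0.01):
        tp, fp, fn, tn = expected_counts(1000, prevalence, 0.9, 0.9)
        p, r, f = precision_recall_f1(tp, fp, fn)
        print(f"prevalence={prevalence:.2f}  P={p:.3f}  R={r:.3f}  F1={f:.3f}")
```

Recall stays constant across the three data sets while precision falls as positives become rarer, so the same system can look quite different depending on how the data set was sampled. The sketch also makes a design point visible: accuracy and specificity need true negatives to be countable, whereas precision, recall, and F1 do not.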

This afternoon we have 90 minutes for the hands-on session, and four exercises to do. So, don't spend more than about 20 minutes on any one of them until you've at least tried all four.

This morning, we talked about ambiguity, which is a feature of all human languages, and which humans are so good at "resolving" that they usually don't even notice it. However, it is a big problem--maybe the primary problem--for computer programs that process human language, and if you cannot recognize when and how it is creating a problem for your natural language processing programs, you are unlikely to ever know how to improve them. So: the following web page has one and a half sentences of English-language text from a medical record. Find a partner, and then read through it and find 10 things that are ambiguous. Be prepared to discuss them this afternoon.

The following web page has 3 exercises on biomedical natural language processing. They are designed to help solidify your understanding of the things that we talked about this morning--if you don't come to understand them intuitively, you will have difficulty following the rest of the lectures this week. Do the exercises with your partner, and for the first one, keep notes on the kinds of problems that you had.

You can download the homework from here

Hands-on exercises:

Empirical investigations of the implications of the nature of biomedical language for the design of experiments in natural language processing.

In our hands-on sessions, we’ll look at the effects of minor choices in system design that can make your research publishable--or not. We will then explore the implications of these effects for the currently hot topic of reproducibility in biomedical research.

Short bio:

Kevin Bretonnel Cohen is the Director of the Biomedical Text Mining Group at the University of Colorado School of Medicine, and the Chair in Natural Language Processing for the Biomedical Domain at Université Paris-Saclay. His research interests include epilepsy surgery candidacy prediction, suicide, and targeted cancer therapeutics. His methodological approaches focus on bringing together linguistics, language technology, and software engineering.