Finding Date Range in Natural Language Text

This post describes a simple way to parse dates in natural language text. The code is written in Java using Eclipse.

Now, there are quite a few open-source libraries that can help you approach this problem.

Have a look at the following: PrettyTime, Natty and SUTime.

Here, I’ll be explaining how to do this using SUTime, which is part of Stanford’s CoreNLP software package.

Let’s get started.

Prerequisites:

  1. In case you are not familiar with Eclipse or with adding external jar files to a project, please refer to my previous post.
  2. To use SUTime, you can download the Stanford CoreNLP package from here.
  3. Just like we imported the POS tagger library into a new project in my previous post, add the .jar files you just downloaded to your project. There will be many .jar files in the download folder, but for now you can add the ones prefixed with “stanford-corenlp”.
  4. Download the Java suite of CoreNLP tools from GitHub. You will need the files under “CoreNLP-master\src\edu\stanford\nlp\time”.
  5. Use SUTimeSimpleParser.java from that folder and you are good to go.

At the time of writing this post, the above class does not clearly show how to get a date range from text.

For example, if the input is

I lived in New York from January 2013 to March 2014

we should get something like

range(2013-01, 2014-03)

This does not happen at the moment.
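To make the target output concrete, here is a minimal sketch that produces it with nothing but the Java standard library. It is not how SUTime works and it only understands the exact shape “from <Month> <year> to <Month> <year>”; the class name MonthRangeSketch and the method findRange are made up for this illustration.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of SUTime or CoreNLP.
class MonthRangeSketch {

    // Matches only the shape "from <Month> <yyyy> to <Month> <yyyy>".
    private static final Pattern FROM_TO = Pattern.compile(
            "from\\s+([A-Za-z]+)\\s+(\\d{4})\\s+to\\s+([A-Za-z]+)\\s+(\\d{4})",
            Pattern.CASE_INSENSITIVE);

    // Parses a full English month name followed by a four-digit year.
    private static final DateTimeFormatter MONTH_YEAR = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern("MMMM uuuu")
            .toFormatter(Locale.ENGLISH);

    // Returns "range(yyyy-MM, yyyy-MM)" or empty if no such range is found.
    static Optional<String> findRange(String text) {
        Matcher m = FROM_TO.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            YearMonth start = YearMonth.parse(m.group(1) + " " + m.group(2), MONTH_YEAR);
            YearMonth end = YearMonth.parse(m.group(3) + " " + m.group(4), MONTH_YEAR);
            return Optional.of("range(" + start + ", " + end + ")");
        } catch (DateTimeParseException e) {
            // The words matched the pattern but were not real month names.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String input = "I lived in New York from January 2013 to March 2014";
        System.out.println(findRange(input).orElse("no range found"));
    }
}
```

A hand-written pattern like this breaks as soon as the wording changes (“between ... and ...”, “since last March”), which is the reason to reach for a library such as SUTime.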

To get the range, you can replace the makeNumericPipeline() method (the one that returns an AnnotationPipeline) in SUTimeSimpleParser.java with the following:

// If Properties is not already imported in SUTimeSimpleParser.java,
// add "import java.util.Properties;" at the top of the file.
private static AnnotationPipeline makeNumericPipeline() {
    AnnotationPipeline pipeline = new AnnotationPipeline();
    pipeline.addAnnotator(new PTBTokenizerAnnotator(false));
    pipeline.addAnnotator(new WordsToSentencesAnnotator(false));
    pipeline.addAnnotator(new POSTaggerAnnotator(false));

    // Instead of the default "new TimeAnnotator()", configure SUTime
    // through properties so that time ranges are marked and included.
    Properties props = new Properties();
    props.setProperty("sutime.markTimeRanges", "true");
    props.setProperty("sutime.includeRange", "true");
    TimeAnnotator sutime = new TimeAnnotator("sutime", props);
    pipeline.addAnnotator(sutime);

    return pipeline;
}

You can then get the range with:

Range range = timeExpression.getRange();

And you’re done!

Please post a comment for questions/feedback.

Cheers!

Stanford POS tagger in Eclipse

This post will get you started with POS tagging in Java using Eclipse.

Why do it?
Well, a Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb or adjective, to each word. It is a powerful and generally accurate tool. You can use it in any application that deals with natural language text to analyze words/tokens and classify them into categories.

For the prerequisites, follow these simple steps:

  1. Download and install the Java JDK and JRE on your system from here.
  2. Edit the system environment variables by right-clicking My Computer -> Properties -> Advanced System Settings -> Environment Variables. Add the path to the bin directory of your JDK installation to the beginning of the PATH environment variable. With default settings, this will look like “C:\Program Files\Java\jdk1.7.0_09\bin;” (without the quotes, of course).
  3. Download the Eclipse IDE from here, depending upon your system configuration. You can pick “Eclipse IDE for Java and DSL Developers” if you are not sure which one to choose.
  4. Download the Stanford POS tagger from here.
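If you want to confirm that the JDK setup from step 2 took effect, a small check like the one below can help. It is only a sketch: JdkCheck and its methods are names invented for this post, and it simply reports which Java runtime is executing the code and whether any PATH entry contains the javac launcher.

```java
import java.io.File;

// Hypothetical helper for checking the setup; not part of any library.
class JdkCheck {

    // One-line summary of the Java runtime that is executing this code.
    static String describeRuntime() {
        return "Java " + System.getProperty("java.version")
                + " at " + System.getProperty("java.home");
    }

    // True when some PATH entry contains the javac compiler launcher.
    static boolean javacOnPath() {
        String path = System.getenv("PATH");
        if (path == null) {
            return false;
        }
        for (String dir : path.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            File folder = new File(dir);
            if (new File(folder, "javac").isFile()
                    || new File(folder, "javac.exe").isFile()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(describeRuntime());
        System.out.println("javac on PATH: " + javacOnPath());
    }
}
```

If the second line prints false, the PATH entry from step 2 is probably missing or points to the wrong folder.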

You’re almost ready to go. Let’s set up our workspace:

  1. Open Eclipse and choose the location of your workspace. This is where all your projects will be stored.
  2. Make a new project and name it anything you want. I’ll go with the name “practise”.
  3. Add a new class to it. You can name it “tagText”.
  4. Go to the directory where you downloaded the Stanford POS tagger, and open the “models” folder. Copy a .tagger file and its corresponding .props file. I will assume these are “left3words-wsj-0-18.tagger” and “left3words-wsj-0-18.props”. In your workspace directory, inside your project folder, make a new folder and name it “taggers”. Go to this folder and paste the tagger and props files.
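A wrong folder location is an easy mistake to make at this step, because the tagger path used later is relative. By default Eclipse runs a program with the project folder as its working directory, so “taggers/...” is resolved against that folder. The sketch below (ModelPathCheck is a made-up name, and the file name is the one assumed in step 4) prints the absolute path that will actually be looked up and whether a readable file exists there.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; it only inspects the file system and does not load the tagger.
class ModelPathCheck {

    // Resolves a relative path against the working directory and reports
    // whether a readable regular file exists at that location.
    static String check(String relativePath) {
        Path model = Paths.get(relativePath).toAbsolutePath();
        boolean usable = Files.isRegularFile(model) && Files.isReadable(model);
        return (usable ? "found: " : "missing: ") + model;
    }

    public static void main(String[] args) {
        System.out.println(check("taggers/left3words-wsj-0-18.tagger"));
    }
}
```

If this reports “missing”, compare the printed path with where you actually pasted the files.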

Alright, people. Now let’s start coding!

Add/write this code to the tagText.java file you created.

import java.io.IOException;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class tagText {

    public static void main(String[] args) throws IOException,
            ClassNotFoundException {

        // Initialize the tagger
        MaxentTagger tagger = new MaxentTagger("taggers/left3words-wsj-0-18.tagger");

        // The sample string
        String sample = "This is a sample text";

        // The tagged string
        String tagged = tagger.tagString(sample);

        // Output the tagged sample string onto your console
        System.out.println("Input: " + sample);
        System.out.println("Output: " + tagged);
    }
}

We are not done yet. We need to import the Stanford tagger library into Eclipse. To do this:
Right-click on your project “practise” -> Build Path -> Configure Build Path -> click on Add External JARs -> browse to the download directory of the Stanford POS tagger and select the stanford-postagger.jar file -> click OK.

[Screenshot: Import library to Eclipse]

That’s it, guys. Run your code and you should see output like the following (shown here for the sample sentence “i can man the controls of this machine”):

Loading default properties from trained tagger taggers/left3words-wsj-0-18.tagger
Reading POS tagger model from taggers/left3words-wsj-0-18.tagger … done [2.1 sec].
i/FW can/MD man/VB the/DT controls/NNS of/IN this/DT machine/NN

[Screenshot: The output you will get]

The “FW”, “MD”, “VB”, etc. next to each word are part-of-speech tags (classes). For example, VB stands for a verb in its base form. The complete list of classes can be found here.
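Once you have a tagged string, you will usually want the words and tags as separate values. Here is a minimal sketch, assuming the output has the “word/TAG” shape shown above with tokens separated by spaces; TaggedOutputSketch and its methods are names made up for this post and are not part of the Stanford tagger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; works on plain strings only.
class TaggedOutputSketch {

    // Splits "word/TAG word/TAG ..." into {word, tag} pairs.
    static List<String[]> split(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            // Use the last slash so a word that itself contains '/' stays intact.
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // skip anything not shaped like word/TAG
            }
            pairs.add(new String[] {
                    token.substring(0, slash), token.substring(slash + 1) });
        }
        return pairs;
    }

    // Returns the words that carry the given tag, in their original order.
    static List<String> wordsWithTag(String tagged, String tag) {
        List<String> words = new ArrayList<>();
        for (String[] pair : split(tagged)) {
            if (pair[1].equals(tag)) {
                words.add(pair[0]);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String tagged = "i/FW can/MD man/VB the/DT controls/NNS of/IN this/DT machine/NN";
        System.out.println(wordsWithTag(tagged, "DT")); // prints [the, this]
    }
}
```

If your tagger version uses a different separator between word and tag, change the '/' in split accordingly.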

To play around more with this, you can store lots of English sentences in a file, say “input.txt”, run the tagger on it, and store all the tagged sentences in another file, say “output.txt”.

To accomplish this, add a new class named “tagTextToFile” to your project with the following code:


import java.io.*;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class tagTextToFile {

    public static void main(String[] args) throws IOException,
            ClassNotFoundException {

        // Initialize the tagger
        MaxentTagger tagger = new MaxentTagger("taggers/left3words-wsj-0-18.tagger");

        // The sample string
        String sample = "i can man the controls of this machine";

        // The tagged string
        String tagged = tagger.tagString(sample);

        // Output the tagged sample string onto your console
        System.out.println(tagged);

        /* Next we pick up some sentences from a file input.txt and append the
           tagged sentences to another file output.txt. So make a file input.txt
           and write down some sentences, one per line. */

        // try-with-resources closes both files, even if tagging fails midway
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("output.txt", true))) {

            // Read input.txt line by line into the string sample
            while ((sample = br.readLine()) != null) {
                // Tag the string
                tagged = tagger.tagString(sample);
                // Write it to the file output.txt
                out.write(tagged);
                out.newLine();
            }
        }
    }
}

If Eclipse reports unhandled exceptions, you can let the method declare them with a throws clause, as main does above.

You can keep experimenting from here. Happy coding!

References:

  1. http://www.galalaly.me/index.php/2011/05/tagging-text-with-stanford-pos-tagger-in-java-applications/
  2. http://nlp.stanford.edu/software/tagger.shtml

Getting Started

In my previous post I gave an introduction to Natural Language Processing. Before we start programming in this area, it is important to develop an intuition as to why understanding natural language is a complex task. So take a moment and think of at least two reasons before you read further...

Okay, so some of the main reasons include:

  • The vast size of natural language, which consists of an infinite number of sentences.
  • Ambiguity in natural language.

Let me paint you a picture [1].

“I went to a bank to meet my friends Harold and Kumar. They have been friends for a long time now.  I came to know that they had got into a fight yesterday. Harold had hit Kumar because he likes Maria. I tried to resolve things between the two, for a long time, but to no avail.”

Consider each sentence in the paragraph above, and try to work out which of them are ambiguous. Again, take a moment.

Well, the answer, as most of you might have guessed, is that every sentence above is ambiguous in some way. For those who missed a few:

  • I went to a bank to meet my friends Harold and Kumar: Which bank? It can refer to a financial institution or a river bank. (Lexical ambiguity)
  • They have been friends for a long time now: Consider this and the last sentence together. You’ll see that “a long time” refers to different amounts of time depending upon the context. (Pragmatic ambiguity)
  • I came to know that they had got into a fight yesterday: Can you tell which day? (Pragmatic ambiguity)
  • Harold had hit Kumar because he likes Maria: Who likes Maria? Harold or Kumar? (Referential ambiguity)

So now you know what you’re stepping into. NLP is indeed a challenging branch of computer science, which makes it all the more interesting. In the coming posts, we’ll start programming on some problems related to NLP. Two languages are generally recommended for coding in NLP: Java and Python. I’ll be using Java in my posts as I am more familiar with it. However, you can pick either of them, as both are supported by rich libraries. To get started quickly, some really good libraries are available for NLP tasks such as sentence segmentation, POS tagging, etc. in [2, 3]. We’ll discuss how to integrate and use some of them in future posts.

Cheers !

References

[1] S. Gupta, J. Kumar, M. Trivedi, Artificial Intelligence, second edition, 2008

[2] http://nlp.stanford.edu/

[3] http://wordnet.princeton.edu/

Introduction

It is probably easiest to understand a scientific notion by first considering some human analogies [3]. Let’s consider one here.

A few months after a baby is born, parents, excited to hear their child speak, give her one of the first instructions of her life: “say mommy” or “say daddy”. Through several attempts that go in vain, the little child merely stares back in bewilderment. The infant’s brain, though already handling the vital processing of body functions, is unable to interpret the basic instructions given to her. Now, as you picture this, you’ll realize that there’s nothing wrong with such behavior. One simple reason for it is the lack of a knowledge base in the infant’s brain with which to process human language. As the child grows and learns, she is able to recognize her parents and understand the instruction better. She now starts to respond by smiling, and eventually by saying “mommy”. (Finally!)

We can apply similar reasoning to computers. The infant is analogous to a machine devoid of a specific compiler or necessary algorithms and knowledge-base to interpret a certain language (which in this case is natural language). The interaction is analogous to what is formally called “Human Computer Interaction”. The learning curve followed by the infant is analogous to “Machine Learning”, her initial  acknowledgement of instruction by smiling to “Natural Language Understanding” and replying back to “Natural Language Generation”. The overall process that now goes on inside her brain can be thought of as “Natural Language Processing”.

We are now in a position to define NLP as:

A field of artificial intelligence and linguistics concerned with the interactions between computers and human (natural) languages [1].

Considering the state of the art in language technology, NLP deals with several problems such as:

  • Information Extraction
  • Sentiment Analysis
  • Spam Detection
  • Question Answering

…and many more [2].

Some of these will be discussed in later posts.

I find it to be a really interesting field and have been studying it for almost a year now. Given its depth, I am still relatively new to this field. However, I would like to share my experiences and knowledge of the subject through these posts, and I hope to learn more in the process. Stay tuned!

Cheers !

References:

[1] http://en.wikipedia.org/wiki/Natural_language_processing

[2] https://www.coursera.org/course/nlp

[3] Inspired by the work of J.F. Kurose and K.W. Ross in several books written on Computer Networking.
