datapreparation.collocation
Class Collocation

java.lang.Object
  extended by control.Operation
      extended by datapreparation.TextOperation
          extended by datapreparation.collocation.Collocation

public class Collocation
extends TextOperation

Class representing the collocation extraction operation. This operation tries to find collocations of two nouns. In addition, proper nouns and proper noun groups are extracted. The user may also specify the extraction of verbs, adjectives, adverbs and numbers.

Author:
Ole Kristian Fivelstad

Constructor Summary
Collocation()
          Constructor for the collocation operation.
 
Method Summary
 boolean calculateCollocation(java.lang.String firstWord, java.lang.String secondWord)
          Method for performing the actual calculation to see if two words form a collocation.
 void extractTermsFromSentence(java.lang.String sentence, Text newText)
          Method for extracting terms from a sentence.
 Text extractTermsFromText(Text text)
          Method for extracting terms from a specific text
 java.lang.String findNounLemma(java.lang.String word)
          Method for looking up the lemma of a noun in WordNet.
 java.util.ArrayList findOverlap(java.util.ArrayList one, java.util.ArrayList two)
          Method for finding which documents the two terms appear together in
 java.util.ArrayList findSentences(java.lang.String text)
          Method for finding the sentences in a text.
 boolean firstCharIsDivider(java.lang.String word)
          Method for checking to see if the first character of a word is a divider.
 java.lang.String generateCollocation(java.lang.String firstWord, java.lang.String secondWord)
          Method for generating a collocation of two words.
 java.util.ArrayList getProperties()
          Method for getting the properties
 void performOperation(DataSet dataSet)
          Method for performing the operation
 java.lang.String removeChars(java.lang.String word)
          Method for removing specific characters from a word.
 void setProperties(java.util.ArrayList properties)
          Method for setting the properties
 boolean wordContainsChar(java.lang.String word)
          Method for checking if a word contains one of a list of dividers.
 boolean wordEndsWithComma(java.lang.String word)
          Method for checking if the last character in a word is a comma.
 boolean wordIsStopword(java.lang.String word)
          Method for checking whether a word is a stopword.
 boolean wordIsValid(java.lang.String word)
          Method for checking whether a word is valid.
 
Methods inherited from class control.Operation
getLogResult, setLogResult
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Collocation

public Collocation()
Constructor for the collocation operation. Initializes a vector of possible sentence dividers, for example the characters , ) and (. An English stopword remover is also initialized.

Method Detail

performOperation

public void performOperation(DataSet dataSet)
Method for performing the operation

Specified by:
performOperation in class Operation
Parameters:
dataSet - The dataset being used

extractTermsFromText

public Text extractTermsFromText(Text text)
Method for extracting terms from a specific text

Parameters:
text - The text being processed
Returns:
The new text

findNounLemma

public java.lang.String findNounLemma(java.lang.String word)
Method for looking up the lemma of a noun in WordNet.

Parameters:
word - The word whose lemma is wanted
Returns:
The lemma of the word

extractTermsFromSentence

public void extractTermsFromSentence(java.lang.String sentence,
                                     Text newText)
Method for extracting terms from a sentence.

Parameters:
sentence - The sentence being processed
newText - The text the sentence belongs to

generateCollocation

public java.lang.String generateCollocation(java.lang.String firstWord,
                                            java.lang.String secondWord)
Method for generating a collocation of two words. This is done by putting _ between the two words. E.g. interest_rate

Parameters:
firstWord - The first word in the collocation
secondWord - The second word in the collocation
Returns:
The collocation string
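
The joining rule described above can be sketched as follows. This is a minimal illustration under the assumption that the method does nothing more than concatenate the two words with an underscore; the class name CollocationJoinSketch is hypothetical and is not part of this package.

```java
// Hypothetical stand-in for illustration only; not the actual Collocation class.
final class CollocationJoinSketch {

    // Joins two words with an underscore, e.g. "interest" + "rate" -> "interest_rate".
    // Assumption: no trimming, lowercasing or lemmatization happens at this step.
    static String generateCollocation(String firstWord, String secondWord) {
        return firstWord + "_" + secondWord;
    }
}
```

With this sketch, CollocationJoinSketch.generateCollocation("interest", "rate") yields interest_rate, matching the example in the description.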

wordIsValid

public boolean wordIsValid(java.lang.String word)
Method for checking whether a word is valid. This is done by both looking it up in WordNet, and checking if it is a stop word.

Parameters:
word - The word being checked
Returns:
The result

wordIsStopword

public boolean wordIsStopword(java.lang.String word)
Method for checking whether a word is a stopword.

Parameters:
word - The word being checked.
Returns:
The result

calculateCollocation

public boolean calculateCollocation(java.lang.String firstWord,
                                    java.lang.String secondWord)
Method for performing the actual calculation to see if two words form a collocation.

Parameters:
firstWord - The first word
secondWord - The second word
Returns:
The result

findOverlap

public java.util.ArrayList findOverlap(java.util.ArrayList one,
                                       java.util.ArrayList two)
Method for finding which documents the two terms appear together in

Parameters:
one - The documents of word one
two - The documents of word two
Returns:
ArrayList containing the overlap
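
One plausible way to compute such an overlap is a list intersection. The sketch below is an assumption about the behaviour, not the documented implementation: the element type of the lists is not stated here, so the sketch is generic, and the class name OverlapSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for illustration only; not the actual Collocation class.
final class OverlapSketch {

    // Returns the documents that occur in both lists, in the order of the first list.
    // Assumption: each element identifies one document and duplicates are not wanted.
    static <T> ArrayList<T> findOverlap(List<T> one, List<T> two) {
        Set<T> lookup = new HashSet<>(two);   // fast membership test for list two
        ArrayList<T> overlap = new ArrayList<>();
        for (T doc : one) {
            if (lookup.contains(doc) && !overlap.contains(doc)) {
                overlap.add(doc);
            }
        }
        return overlap;
    }
}
```

The size of the returned list would then be the number of documents in which both terms occur, which is the kind of co-occurrence count a collocation test typically needs.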

findSentences

public java.util.ArrayList findSentences(java.lang.String text)
Method for finding the sentences in a text.

Parameters:
text - The text
Returns:
ArrayList containing the sentences

wordContainsChar

public boolean wordContainsChar(java.lang.String word)
Method for checking if a word contains one of a list of dividers.

Parameters:
word - The word being checked
Returns:
The result
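
A check of this kind might look like the sketch below. The actual divider list is initialized in the constructor and is not listed in full on this page, so the DIVIDERS array here is an assumed sample, and the class name ContainsDividerSketch is hypothetical.

```java
// Hypothetical stand-in for illustration only; not the actual Collocation class.
final class ContainsDividerSketch {

    // Assumed sample of dividers; the real list may contain other characters.
    private static final char[] DIVIDERS = {',', '(', ')', '\''};

    // Returns true if any divider occurs anywhere in the word.
    static boolean wordContainsChar(String word) {
        for (char divider : DIVIDERS) {
            if (word.indexOf(divider) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```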

firstCharIsDivider

public boolean firstCharIsDivider(java.lang.String word)
Method for checking to see if the first character of a word is a divider. E.g. ( or '

Parameters:
word - The word being checked
Returns:
The result
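
As a rough sketch of such a check, assuming the divider set includes the two characters named above plus a few others (the DIVIDERS string and the class name FirstCharDividerSketch are hypothetical):

```java
// Hypothetical stand-in for illustration only; not the actual Collocation class.
final class FirstCharDividerSketch {

    // Assumed divider characters; only ( and ' are confirmed by the description.
    private static final String DIVIDERS = "(')\",";

    // Returns true if the word is non-empty and starts with a divider.
    static boolean firstCharIsDivider(String word) {
        return !word.isEmpty() && DIVIDERS.indexOf(word.charAt(0)) >= 0;
    }
}
```

The empty-string guard is a design choice of the sketch; how the real method treats an empty word is not documented here.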

wordEndsWithComma

public boolean wordEndsWithComma(java.lang.String word)
Method for checking if the last character in a word is a comma.

Parameters:
word - The word being checked
Returns:
The result
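
A minimal sketch of this check, assuming it is a plain suffix test with no trimming of trailing whitespace (the class name TrailingCommaSketch is hypothetical):

```java
// Hypothetical stand-in for illustration only; not the actual Collocation class.
final class TrailingCommaSketch {

    // Returns true if the last character of the word is a comma.
    // String.endsWith returns false for an empty word, so no extra guard is needed.
    static boolean wordEndsWithComma(String word) {
        return word.endsWith(",");
    }
}
```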

removeChars

public java.lang.String removeChars(java.lang.String word)
Method for removing specific characters from a word.

Parameters:
word - The word
Returns:
The new word

getProperties

public java.util.ArrayList getProperties()
Method for getting the properties

Specified by:
getProperties in class Operation
Returns:
properties

setProperties

public void setProperties(java.util.ArrayList properties)
Method for setting the properties

Specified by:
setProperties in class Operation
Parameters:
properties - The properties to set