newsloader
Class Extractor

java.lang.Object
  extended by java.lang.Thread
      extended by newsloader.Extractor
All Implemented Interfaces:
java.lang.Runnable

public class Extractor
extends java.lang.Thread

A class that reads the contents of a file and extracts news items.

Author:
Ole Kristian Fivelstad

Nested Class Summary
 
Nested classes/interfaces inherited from class java.lang.Thread
java.lang.Thread.State, java.lang.Thread.UncaughtExceptionHandler
 
Field Summary
 int items
           
 
Fields inherited from class java.lang.Thread
MAX_PRIORITY, MIN_PRIORITY, NORM_PRIORITY
 
Constructor Summary
Extractor(java.lang.String url, java.lang.String output, java.lang.String dataSetName)
          Constructor for the Extractor class.
 
Method Summary
 java.lang.String anchorTextItem(java.lang.String content)
          Method for finding anchor-text items
 java.lang.String cleanWebpage(java.lang.String content)
          Method for removing scripts from a webpage
 int getNoOfFiles()
          Method for getting the number of files
 java.lang.String prepareBbcNews(java.lang.String contents)
           
 java.lang.String prepareFinancialTimes(java.lang.String contents)
           
 void readFile(java.io.File file)
          Method for reading a file
 java.lang.String removeTags(java.lang.String string)
          Method for removing HTML tags from a string
 void run()
           
 void setCharset(java.io.File file)
           
 java.lang.String textBasedItem(java.lang.String content)
          Method for finding text-based items
 void writeXML()
          Method for writing the result to an XML file
 
Methods inherited from class java.lang.Thread
activeCount, checkAccess, countStackFrames, currentThread, destroy, dumpStack, enumerate, getAllStackTraces, getContextClassLoader, getDefaultUncaughtExceptionHandler, getId, getName, getPriority, getStackTrace, getState, getThreadGroup, getUncaughtExceptionHandler, holdsLock, interrupt, interrupted, isAlive, isDaemon, isInterrupted, join, join, join, resume, setContextClassLoader, setDaemon, setDefaultUncaughtExceptionHandler, setName, setPriority, setUncaughtExceptionHandler, sleep, sleep, start, stop, stop, suspend, toString, yield
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

items

public int items
Constructor Detail

Extractor

public Extractor(java.lang.String url,
                 java.lang.String output,
                 java.lang.String dataSetName)
Constructor for the Extractor class.

Parameters:
url - Location of the directory
output - File to store the result in
dataSetName - Name of the dataset
Method Detail

run

public void run()
Specified by:
run in interface java.lang.Runnable
Overrides:
run in class java.lang.Thread
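
Because Extractor extends java.lang.Thread, a caller would normally invoke start() rather than run(), then join() before reading the public items field. The sketch below illustrates that pattern only. SketchExtractor, its constructor argument, and the counting rule inside run() are hypothetical stand-ins, since this page does not show what Extractor.run() actually does.

```java
// Hypothetical stand-in for newsloader.Extractor; only the Thread behaviour is modelled.
final class SketchExtractor extends Thread {
    // Mirrors the public `items` field listed in the Field Summary.
    public int items;

    private final String[] pages;

    SketchExtractor(String[] pages) {
        this.pages = pages;
    }

    @Override
    public void run() {
        // Assumed behaviour for illustration: count one item per non-empty page.
        int count = 0;
        for (String page : pages) {
            if (page != null && !page.trim().isEmpty()) {
                count++;
            }
        }
        items = count;
    }

    // Starts the worker, waits for it to finish, and returns the item count.
    static int runAndWait(String[] pages) {
        SketchExtractor worker = new SketchExtractor(pages);
        worker.start(); // runs run() on a new thread
        try {
            worker.join(); // after join() returns, the write to `items` is visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return worker.items;
    }

    public static void main(String[] args) {
        System.out.println(runAndWait(new String[] {"<p>a</p>", "", "<p>b</p>"}));
    }
}
```

Reading items before join() returns could observe a stale value, which is why the helper waits for the thread first.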

setCharset

public void setCharset(java.io.File file)

prepareFinancialTimes

public java.lang.String prepareFinancialTimes(java.lang.String contents)

prepareBbcNews

public java.lang.String prepareBbcNews(java.lang.String contents)

readFile

public void readFile(java.io.File file)
Method for reading a file


getNoOfFiles

public int getNoOfFiles()
Method for getting the number of files

Returns:
number of files

textBasedItem

public java.lang.String textBasedItem(java.lang.String content)
Method for finding text-based items

Parameters:
content - The content of the webpage
Returns:
content after text-based items have been found

anchorTextItem

public java.lang.String anchorTextItem(java.lang.String content)
Method for finding anchor-textx items

Parameters:
content - The content of a webpage
Returns:
content after anchor-text items have been found

cleanWebpage

public java.lang.String cleanWebpage(java.lang.String content)
Method for removing scripts from a webpage

Parameters:
content - The content of a webpage
Returns:
content after scripts have been removed
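
The implementation of cleanWebpage is not shown on this page. One plausible approach is a regular expression that drops every script element together with its body, as in this minimal sketch. The class name CleanWebpageSketch and the pattern are assumptions, not the author's code.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual newsloader.Extractor implementation.
final class CleanWebpageSketch {
    // (?i) ignores case, (?s) lets '.' match line breaks inside the script body.
    // The reluctant .*? stops at the first closing tag, so separate script
    // elements are removed one at a time and the text between them is kept.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    static String cleanWebpage(String content) {
        return SCRIPT.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>a</p><script>var i = 1;</script><p>b</p>";
        System.out.println(cleanWebpage(page));
    }
}
```

A regular expression is adequate for simple pages but can be fooled by a closing script tag inside a string literal; an HTML parser would be the sturdier choice.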

removeTags

public java.lang.String removeTags(java.lang.String string)
Method for removing HTML tags from a string

Parameters:
string - The string to remove tags from
Returns:
string without HTML tags
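
How removeTags strips markup is not documented here. A common lightweight technique is to delete anything between a '<' and the next '>', sketched below under that assumption. RemoveTagsSketch is a hypothetical name.

```java
// Hypothetical sketch; not the actual newsloader.Extractor implementation.
final class RemoveTagsSketch {
    static String removeTags(String string) {
        // Deletes each "<...>" run; the text between tags is left untouched.
        return string.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(removeTags("<p>Hello <b>world</b></p>"));
    }
}
```

This pattern does not decode entities such as &amp; and would misfire on a '>' inside an attribute value, so it suits only reasonably clean input.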

writeXML

public void writeXML()
Method for writing the result to an XML file
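
The XML layout produced by writeXML is not specified on this page. The sketch below shows one way such output could be generated with the JDK's StAX writer. The element names dataset and item, the name attribute, and the class WriteXmlSketch are illustrative assumptions, and the result is returned as a string rather than written to the output file.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch; element and attribute names are assumptions.
final class WriteXmlSketch {
    static String toXml(String dataSetName, List<String> items) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("dataset");
            xml.writeAttribute("name", dataSetName);
            for (String item : items) {
                xml.writeStartElement("item");
                xml.writeCharacters(item); // escapes '&' and '<' in the text
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not serialise items", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml("demo", Arrays.asList("First item", "Tom & Jerry")));
    }
}
```

Using a stream writer instead of string concatenation means reserved characters in the news text are escaped automatically.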