dataloader
Class XMLParseren

java.lang.Object
  extended by dataloader.XMLParseren

public class XMLParseren
extends java.lang.Object

The class reads config.xml and gets the URL or directory containing the HTML files to be tokenized. These are then sent to HTMLStripper. When HTMLStripper has tokenized the documents, XMLParseren generates the file tokenized.xml.

Author:
Kjell-Inge Skogstad and Trond Øivind Eriksen

Constructor Summary
XMLParseren(java.io.File file, boolean fromUrl)
          Constructor, reads the XML file that is to be parsed.
XMLParseren(java.lang.String fromUrl, java.lang.String toUrl)
          Constructor, reads the XML file that is to be parsed.
 
Method Summary
 java.lang.String getNumberOfTexts()
          Returns the number of texts (news items) in the collection.
static void main(java.lang.String[] args)
          Main method for testing.
 void makeXmlTokenizer(java.lang.String title, java.lang.String url, java.lang.String body)
          The method creates the tokenized.xml file.
 void makeXmlTokenizer2(java.lang.String text, java.lang.String url)
          The method creates tokenized.xml, the file of news to be loaded into the system
 java.util.ArrayList parseUri(java.lang.String urls)
          The method finds all URLs that were listed in the configuration file.
 java.lang.String readFile(java.io.File filename)
          The method reads the content of the file and stores it in a string.
 java.util.ArrayList readXML(java.io.File file)
          The method reads the content in the configuration file (the XML file)
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

XMLParseren

public XMLParseren(java.io.File file,
                   boolean fromUrl)
The constructor to the class. Reads the XML file that is to be parsed.

Parameters:
file - The config file
fromUrl - Indicates whether the input files come from a file or from a URL.

XMLParseren

public XMLParseren(java.lang.String fromUrl,
                   java.lang.String toUrl)
Constructor, reads the XML file that is to be parsed.

Parameters:
fromUrl - The config file
toUrl - The dataset file
Method Detail

readXML

public java.util.ArrayList readXML(java.io.File file)
The method reads the content in the configuration file (the XML file)

Parameters:
file - The input file
Returns:
The doclist with the documents
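
As a rough illustration of what reading the configuration file could involve, the sketch below parses an XML document with the standard JAXP DOM parser and collects the text of every url element. The element name url, the class name ReadXmlSketch and the method readUrls are assumptions made for this example; the actual layout of config.xml and the behaviour of readXML are not documented on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ReadXmlSketch {

    // Collects the trimmed text of every <url> element.
    // "url" is a hypothetical element name, not taken from config.xml.
    static List<String> readUrls(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
        NodeList nodes = doc.getElementsByTagName("url");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            urls.add(nodes.item(i).getTextContent().trim());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the configuration file (assumed layout).
        String xml = "<config><url>http://example.org/a.html</url>"
                   + "<url>http://example.org/b.html</url></config>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(readUrls(in));
    }
}
```

A real configuration file would be opened with a FileInputStream instead of the in-memory string used here.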

parseUri

public java.util.ArrayList parseUri(java.lang.String urls)
The method finds all URLs that were listed in the configuration file.

Parameters:
urls - The string of URLs.
Returns:
The list of URIs
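
One possible way to turn a single string of URLs into a list is sketched below. The separator is an assumption: the page does not say how the URLs are delimited, so this example accepts whitespace, commas and semicolons. The class name ParseUriSketch and the method splitUrls are hypothetical and are not part of XMLParseren.

```java
import java.util.ArrayList;
import java.util.List;

public class ParseUriSketch {

    // Splits the input on runs of whitespace, commas or semicolons
    // (assumed separators) and drops empty fragments.
    static List<String> splitUrls(String urls) {
        List<String> result = new ArrayList<>();
        for (String part : urls.split("[\\s,;]+")) {
            if (!part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String urls = "http://example.org/a.html, http://example.org/b.html";
        for (String url : splitUrls(urls)) {
            System.out.println(url);
        }
    }
}
```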

makeXmlTokenizer

public void makeXmlTokenizer(java.lang.String title,
                             java.lang.String url,
                             java.lang.String body)
                      throws java.io.FileNotFoundException,
                             java.lang.SecurityException
The method creates the tokenized.xml file.

Parameters:
title - The title of the document
url - The URL of the document
body - The body text of the document
Throws:
java.io.FileNotFoundException
java.lang.SecurityException
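
The following sketch shows one way a title, URL and body could be written to an XML file. The element names document, title, url and body, the class name TokenizedXmlSketch and the method writeDocument are all assumptions; the real structure of tokenized.xml is not described on this page. Opening the file through FileOutputStream can raise FileNotFoundException and, under a security manager, SecurityException, which matches the exceptions listed above.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class TokenizedXmlSketch {

    // Escapes the characters that may not appear raw in XML character data.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Writes one document entry to the given file (hypothetical layout).
    static void writeDocument(File out, String title, String url, String body)
            throws FileNotFoundException {
        try (PrintWriter w = new PrintWriter(
                new OutputStreamWriter(new FileOutputStream(out), StandardCharsets.UTF_8))) {
            w.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            w.println("<document>");
            w.println("  <title>" + escape(title) + "</title>");
            w.println("  <url>" + escape(url) + "</url>");
            w.println("  <body>" + escape(body) + "</body>");
            w.println("</document>");
        }
    }

    public static void main(String[] args) throws Exception {
        File out = File.createTempFile("tokenized", ".xml");
        writeDocument(out, "Example title", "http://example.org/a.html", "Example body text");
        System.out.println("Wrote " + out.getAbsolutePath());
    }
}
```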

makeXmlTokenizer2

public void makeXmlTokenizer2(java.lang.String text,
                              java.lang.String url)
                       throws java.io.FileNotFoundException,
                              java.lang.SecurityException
The method creates tokenized.xml, the file of news to be loaded into the system

Parameters:
text - The text
url - The URL
Throws:
java.io.FileNotFoundException
java.lang.SecurityException

readFile

public java.lang.String readFile(java.io.File filename)
                          throws java.io.IOException
The method reads the content of the file and stores it in a string.

Parameters:
filename - The file to read
Returns:
The content of the file as a string
Throws:
java.io.IOException
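
Reading a whole file into a string can be sketched with java.nio.file.Files as shown below. The UTF-8 encoding is an assumption, since the page does not state which encoding readFile uses, and the class name ReadFileSketch is hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class ReadFileSketch {

    // Reads every byte of the file and decodes it as UTF-8 (assumed encoding).
    static String readFile(File file) throws IOException {
        byte[] bytes = Files.readAllBytes(file.toPath());
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file, then read it back.
        File tmp = File.createTempFile("sample", ".html");
        Files.write(tmp.toPath(), "<html><body>news</body></html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readFile(tmp));
        tmp.delete();
    }
}
```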

getNumberOfTexts

public java.lang.String getNumberOfTexts()
Returns the number of texts (news items) in the collection.

Returns:
numberOfTexts

main

public static void main(java.lang.String[] args)
Main method for testing.

Parameters:
args - Command-line arguments