dataloader
Class HTMLStripper

java.lang.Object
  extended by dataloader.HTMLStripper

public class HTMLStripper
extends java.lang.Object

The class loads an HTML page from a file or a URL and strips the HTML tags from it.

Author:
Trond Øivind Eriksen and Kjell-Inge Skogstad

Constructor Summary
HTMLStripper(java.lang.String url, java.lang.String sBoundary, java.lang.String pBoundary, java.lang.String allsmall, XMLParseren xmlP)
          The constructor that uses a URL and parameters from the configuration file.
HTMLStripper(java.lang.String filename, XMLParseren xmlP, java.lang.String sBoundary, java.lang.String pBoundary)
          The constructor that loads an HTML page from a file.
 
Method Summary
 java.lang.String findSource(java.lang.String urlForFrame)
          The method finds the source code of a URL.
 java.util.ArrayList getMeta(java.lang.String file)
          Method that collects metadata from the file.
 java.util.ArrayList getNewsList()
          Method that gets the ArrayList of news items.
 java.lang.String getOriginalFile()
          Method that gets the original file.
 java.lang.String letterStripping(java.lang.String file)
          The method changes letters that are written as special characters in HTML to regular letters.
static void main(java.lang.String[] args)
          The main method that starts everything.
 void parseFile(java.lang.String file)
          This method calls all the different methods that parse the file.
 java.lang.String parseFilen(java.lang.String file)
          Another method that calls all the different methods that parse the file.
 java.lang.String readFile(java.io.File filename)
          The method reads the file and puts it in a string variable.
 java.util.ArrayList searchFrame(java.lang.String file)
          Method that finds frames in the HTML file.
 java.lang.String searchNews(java.lang.String file)
          The method finds a news item in the HTML page.
 java.lang.String searchTitle(java.lang.String file)
          The method finds the title of the HTML page.
 java.lang.String strip(java.lang.String file)
          The method removes most HTML tags and spaces.
 java.lang.String stripSpecialChar(java.lang.String file)
          The method removes special characters that might occur in HTML.
 java.lang.String stripWhiteSpace(java.lang.String file)
          The method removes all superfluous whitespace.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HTMLStripper

public HTMLStripper(java.lang.String url,
                    java.lang.String sBoundary,
                    java.lang.String pBoundary,
                    java.lang.String allsmall,
                    XMLParseren xmlP)
The constructor that uses a URL and parameters from the configuration file.

Parameters:
url - String
sBoundary - String
pBoundary - String
allsmall - String
xmlP - XMLParseren

HTMLStripper

public HTMLStripper(java.lang.String filename,
                    XMLParseren xmlP,
                    java.lang.String sBoundary,
                    java.lang.String pBoundary)
The constructor that loads an HTML page from a file.

Parameters:
filename - String
xmlP - XMLParseren
sBoundary - String
pBoundary - String

Method Detail

getOriginalFile

public java.lang.String getOriginalFile()
Method that gets the original file.

Returns:
The original file.

getMeta

public java.util.ArrayList getMeta(java.lang.String file)
Method that collects metadata from the file.

Parameters:
file - The input file.
Returns:
metadata
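
The following is a minimal sketch of how metadata could be collected from an HTML string. It is not the HTMLStripper source: the class name MetaSketch, the helper names, and the "name=content" result format are assumptions made for illustration, and the regular expressions only handle quoted attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the dataloader package.
final class MetaSketch {
    // Matches one complete <meta ...> tag, in any letter case.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the quoted value of the named attribute inside one tag, or null if absent.
    static String attribute(String tag, String name) {
        Matcher m = Pattern.compile(
            "\\b" + name + "\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(tag);
        return m.find() ? m.group(2) : null;
    }

    // Collects "name=content" pairs from every meta tag that carries both attributes.
    static List<String> collectMeta(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = META_TAG.matcher(html);
        while (m.find()) {
            String name = attribute(m.group(), "name");
            String content = attribute(m.group(), "content");
            if (name != null && content != null) {
                result.add(name + "=" + content);
            }
        }
        return result;
    }
}
```

Tags that lack either attribute, such as a charset declaration, are skipped in this sketch.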

parseFile

public void parseFile(java.lang.String file)
This method calls all the different methods that parse the file.

Parameters:
file - String

parseFilen

public java.lang.String parseFilen(java.lang.String file)
Another method that calls all the different methods that parse the file.

Parameters:
file - String

readFile

public java.lang.String readFile(java.io.File filename)
                          throws java.io.IOException
The method reads the file and puts it in a string variable.

Parameters:
filename - File
Returns:
String
Throws:
java.io.IOException
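
As an illustration of reading a whole file into a string, here is a minimal sketch. The class name ReadFileSketch and the UTF-8 charset are assumptions; the actual HTMLStripper implementation and the encoding it expects are not documented here.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper, not part of the dataloader package.
final class ReadFileSketch {
    // Reads every byte of the file and decodes it as UTF-8 (assumed encoding).
    static String readWhole(File file) throws IOException {
        byte[] bytes = Files.readAllBytes(file.toPath());
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Pages saved in another encoding, for example ISO-8859-1, would need the matching charset constant instead.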

searchTitle

public java.lang.String searchTitle(java.lang.String file)
The method finds the title of the HTML page.

Parameters:
file - String
Returns:
String
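
One way a title lookup could work is a regular expression over the title element. This is a sketch under that assumption, not the documented behaviour of searchTitle; the class name TitleSketch and the empty-string fallback are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the dataloader package.
final class TitleSketch {
    // Captures the text between <title> and </title>, across lines and in any letter case.
    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or an empty string when the page has no title element.
    static String findTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }
}
```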

searchNews

public java.lang.String searchNews(java.lang.String file)
The method finds a news item in the HTML page.

Parameters:
file - String
Returns:
String

getNewsList

public java.util.ArrayList getNewsList()
Method that gets the ArrayList of news items.

Returns:
newsList

searchFrame

public java.util.ArrayList searchFrame(java.lang.String file)
Method that finds frames in the HTML file.

Parameters:
file - String
Returns:
ArrayList
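
A frame search could, for example, collect the src attribute of every frame tag so that each frame can be loaded afterwards. The sketch below assumes that approach; the class name FrameSketch is hypothetical, and including iframe tags is a choice made here, not something the documentation states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the dataloader package.
final class FrameSketch {
    // Matches <frame ...> or <iframe ...> and captures the src value, quoted or unquoted.
    // The word boundary after the tag name keeps <frameset> from matching.
    private static final Pattern FRAME_SRC = Pattern.compile(
        "<(?:frame|iframe)\\b[^>]*?\\bsrc\\s*=\\s*[\"']?([^\"'\\s>]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the src values in document order.
    static List<String> findFrameSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = FRAME_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(1));
        }
        return sources;
    }
}
```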

findSource

public java.lang.String findSource(java.lang.String urlForFrame)
                            throws java.net.MalformedURLException,
                                   java.io.IOException
The method finds the source code of a URL.

Parameters:
urlForFrame - String
Returns:
String
Throws:
java.net.MalformedURLException
java.io.IOException
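
Fetching the source behind a URL can be sketched with the standard java.net classes. The class name FetchSketch, the UTF-8 decoding, and the newline appended after each line are assumptions of this example rather than documented behaviour of findSource.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the dataloader package.
final class FetchSketch {
    // Opens the address and returns its content line by line as one string.
    // MalformedURLException is a subclass of IOException, so one throws clause covers both.
    static String fetch(String address) throws IOException {
        URL url = URI.create(address).toURL();
        StringBuilder source = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                source.append(line).append('\n');
            }
        }
        return source.toString();
    }
}
```

The same call works for file: addresses, which makes the sketch easy to try without network access.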

strip

public java.lang.String strip(java.lang.String file)
The method removes most HTML tags and spaces.

Parameters:
file - String
Returns:
file without HTML tags
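
Tag removal of this kind is often done with a few regular-expression passes. The following sketch shows one such sequence; the class name TagStripSketch and the exact passes are assumptions, and regular expressions are not a full HTML parser, so unusual markup can slip through.

```java
// Hypothetical helper, not part of the dataloader package.
final class TagStripSketch {
    static String stripTags(String html) {
        return html
            // Drop script and style elements together with their content.
            .replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ")
            // Drop HTML comments.
            .replaceAll("(?s)<!--.*?-->", " ")
            // Replace every remaining tag with a space so adjacent words stay separated.
            .replaceAll("<[^>]+>", " ")
            // Collapse the whitespace left behind and trim the ends.
            .replaceAll("\\s+", " ")
            .trim();
    }
}
```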

stripSpecialChar

public java.lang.String stripSpecialChar(java.lang.String file)
The method removes special characters that might occur in HTML.

Parameters:
file - String
Returns:
file
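
The documentation does not say which characters count as special. Assuming it refers to HTML character references such as &nbsp; or &#8211;, a sketch could look like the one below; the class name EntityStripSketch and the decision to turn non-breaking spaces into ordinary spaces are hypothetical.

```java
// Hypothetical helper, not part of the dataloader package.
final class EntityStripSketch {
    static String removeEntities(String text) {
        return text
            // Non-breaking spaces become ordinary spaces so words do not run together.
            .replaceAll("&nbsp;|&#160;", " ")
            // Every other named or numeric character reference is removed.
            .replaceAll("&#?[A-Za-z0-9]+;", "");
    }
}
```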

stripWhiteSpace

public java.lang.String stripWhiteSpace(java.lang.String file)
The method removes all superfluous whitespace.

Parameters:
file - String
Returns:
String
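
What counts as superfluous is not specified. The sketch below assumes it means repeated spaces, spaces around line breaks, and blank lines; the class name WhitespaceSketch is hypothetical.

```java
// Hypothetical helper, not part of the dataloader package.
final class WhitespaceSketch {
    static String collapse(String text) {
        return text
            // Runs of spaces, tabs, carriage returns and form feeds become one space.
            .replaceAll("[ \\t\\r\\f]+", " ")
            // Remove the single space that may remain on either side of a line break.
            .replaceAll(" ?\\n ?", "\n")
            // Consecutive line breaks (blank lines) become one line break.
            .replaceAll("\\n{2,}", "\n")
            .trim();
    }
}
```

Line breaks are kept in this sketch so that separate paragraphs remain on separate lines.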

letterStripping

public java.lang.String letterStripping(java.lang.String file)
The method changes letters that are written as special characters in HTML to regular letters.

Parameters:
file - String
Returns:
String
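
Assuming the method replaces HTML character entities for letters with the letters themselves, the idea can be sketched with a small lookup table. The class name EntityLetterSketch and the six Norwegian entries are illustrative only; the real method may cover a different set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the dataloader package.
final class EntityLetterSketch {
    // Illustrative subset: the Norwegian letters written as HTML entities.
    private static final Map<String, String> LETTERS = new LinkedHashMap<>();
    static {
        LETTERS.put("&aelig;", "\u00e6");   // ae ligature, lower case
        LETTERS.put("&oslash;", "\u00f8");  // o with stroke, lower case
        LETTERS.put("&aring;", "\u00e5");   // a with ring, lower case
        LETTERS.put("&AElig;", "\u00c6");   // ae ligature, upper case
        LETTERS.put("&Oslash;", "\u00d8");  // o with stroke, upper case
        LETTERS.put("&Aring;", "\u00c5");   // a with ring, upper case
    }

    // Replaces each known entity with its letter; unknown entities are left untouched.
    static String decodeLetters(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : LETTERS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```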

main

public static void main(java.lang.String[] args)
The main method that starts everything.

Parameters:
args - String[]