Lab 4 in Object Oriented Programming

Web scraping - to process data from the net.

Introduction

The Internet is an almost inexhaustible source of information and data of varying quality. In addition, agencies and organizations nowadays frequently make information available to citizens. As an example, in this lab we consider the Nord Pool data on electricity consumption. When you visit their website
http://wwwdynamic.nordpoolspot.com/marketinfo/consumption/sweden/consumption.cgi?interval=last8 you see a table of the total electricity consumption in Sweden in MWh during the last eight days. The data is very recent; the latest value in the table is less than an hour old.

One drawback of seeing the data in a web browser is that we cannot continue processing it in a program. The table shows the mean, maximum and minimum values per day for the entire period, and one can also choose to see the data in graphical form, but there are many other interesting ways to process the data. The technique of retrieving data over the network and then refining it for further processing is called web scraping.

In this lab you will first write a program that reads the electricity consumption data from the network and stores it in an integer array of suitable size, i.e. int[24][8]. When the program was run on Wednesday 20 January 2010 at 10:40, it gave the following output:

19643 19780 18990 17975 17464 17639 17925 17825
19283 19450 18507 17684 17236 17547 17626 17523
19245 19202 18374 17457 16994 17325 17366 17521
19340 19206 18303 17414 16961 17561 17455 17389
19671 19417 18665 17473 17011 17763 17780 17683
20344 20196 19536 17781 17128 18662 18767 18755
22756 22127 21556 18228 17592 20762 20738 20954
24925 24305 23649 18846 18041 22902 22926 23239
25272 24424 24145 19481 18538 23322 23078 23386
24737 24349 23976 20029 19049 23185 22939 23175
24920 24415 23993 20396 19608 23387 23162 0
24768 24330 23828 20512 20134 23437 23304 0
24616 23829 23626 20426 20410 23076 22967 0
24334 23772 23220 20360 20433 23002 22763 0
24153 23636 23081 20186 20517 22831 22715 0
24805 23958 23033 20582 20843 23253 23096 0
25413 24382 23409 21383 21546 23831 23499 0
25523 24530 23593 21763 21714 24132 23732 0
25036 24286 23099 21296 21679 23637 23368 0
24368 24001 22261 20869 21439 22962 22742 0
23677 22896 21247 20093 20697 21972 21688 0
22760 21975 20332 19460 19901 21128 20735 0
21754 20897 19636 18718 19018 20111 19682 0
20303 19618 18813 18050 18143 18723 18545 0

The last column (representing "today") contained only the data up to 10 o'clock at that time; the remaining values are given as 0.

To illustrate how this data can be used, we provide a class ECIcon that can create icons showing last week's electricity consumption. An icon is a graphical representation of the data that can be used in clickable components. An object of our class ECIcon can be created in different sizes; here are the sizes 1, 3 and 6:

The size indicates the number of hours that merge into one pixel, so an icon of size 3 measures 8 * 24 / 3 = 64 pixels in both width and height. The red line marks a consumption level of 15,000 MWh/h.
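The size arithmetic can be checked with a few lines of Java. This is only a sketch of the formula stated above; the class name IconSize and the method sideLength are made up for illustration and are not part of ECIcon.

```java
public class IconSize {
    // Side length in pixels of an icon in which `hoursPerPixel` hours
    // merge into one pixel (8 days of 24 hours each).
    // Hypothetical helper, not part of the provided ECIcon class.
    static int sideLength(int hoursPerPixel) {
        return 8 * 24 / hoursPerPixel;
    }

    public static void main(String[] args) {
        for (int size : new int[] {1, 3, 6}) {
            System.out.println("size " + size + ": " + sideLength(size) + " pixels");
        }
    }
}
```

For the three sizes mentioned above, this gives 192, 64 and 32 pixels.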

We also provide a graphical component ECPanel, which contains a text beside an ECIcon of size 4. When the icon is clicked, the component opens a browser showing the Nord Pool data. The picture below shows what the component looks like when it is run:

The idea is that an ECPanel should be a standard component that can be placed into the interface of a whole variety of applications. In the exercise you will add this component to your program from lab 3.

Task 1. Getting data from Nord Pool.

Your task is to define a method that reads data from Nord Pool over the network and fills an array with integers that are printed as above. The matrix is given as a parameter. The method has the following signature:

public static void getEightDays(int[][] data) throws Exception

The method expects data to be a 24x8 matrix, which will be filled in with last week's consumption. We do this by requesting data from Nord Pool in exactly the same way as a browser does. We are thus dependent on retrieving data over the network from the Nord Pool site. This may of course fail, in which case the method throws an exception.

Fetching data over the network from a Java application is very easy with the help of the Java libraries. The problem is that what we get is the HTML that describes the whole web page. The HTML consists of many hundreds of lines; the data we are looking for is there, but it is not so easy to find. The main difficulty is to find these values and store them in the matrix data.

Start by downloading the file URLReader.java to your directory (suggested directory lab4). Compile and run the application (no command line arguments). If everything works, you will see in the terminal a listing of the full HTML of the web page from Nord Pool. If you run the program on your own computer, it must of course be connected to the Internet for this to work. Now read through the program. It takes only two lines at the beginning of main to get access to the object in of the standard class Scanner. It contains the data from Nord Pool:

URL url = new URL(nordpoolURL);
Scanner in = new Scanner(new InputStreamReader(url.openStream()));

nordpoolURL is a string containing the address of Nord Pool, as defined earlier in the same file. For this lab, we do not need to know exactly what the classes URL and InputStreamReader do (you can read more in Eck, section 11.4, if you want). We do need to know, however, how to use the class Scanner in a more sophisticated way than before. By using its methods, we will search for the data that we need. The given main method uses in in the simplest way: as long as there are lines, it fetches a line and prints it on the terminal.
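The simple line-by-line loop described here can be sketched as follows. To keep the example runnable without a network connection, it reads from a short string standing in for the web page; the class name LineEcho and the method printAllLines are made up for illustration and need not match the given URLReader.java.

```java
import java.io.StringReader;
import java.util.Scanner;

public class LineEcho {
    // Prints every remaining line of the scanner and returns how many were printed.
    // Hypothetical helper; the provided main method may be written differently.
    static int printAllLines(Scanner in) {
        int count = 0;
        while (in.hasNextLine()) {
            System.out.println(in.nextLine());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the HTML that would otherwise come from url.openStream().
        String page = "<html>\n<body>\n</body>\n</html>\n";
        Scanner in = new Scanner(new StringReader(page));
        System.out.println(printAllLines(in) + " lines");
    }
}
```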

We suggest that you look at the HTML code of the Nord Pool page. Click on the link to Nord Pool and select Page Source in the View menu in Firefox. Search for the first number in the array: in our case it is 19643, but in your case it will be something else, corresponding to the electricity consumption in Sweden between midnight and 1:00 eight days ago. You will find it in a part of the HTML file with the following basic appearance:

<tr bgcolor="#e1e1e1">
<td>00-01</td>
<td align="right">19643</td>
<td align="right">19780</td>
<td align="right">18990</td>
<td align="right">17975</td>
<td align="right">17464</td>
<td align="right">17639</td>
<td align="right">17925</td>
<td align="right"><b>17825</b></td>
</tr>

It is not important to understand the HTML code. It describes a table row (HTML tag tr) consisting of a number of cells (HTML tag td). We recognize the same numbers as those found in the first row of the table on the web page. The other rows follow later in the HTML, in sequence and with the same basic appearance. What needs to be done is to skip the beginning of the file up to this part, then access the actual numbers and weed out the rest. Here the class Scanner comes to our help: it provides a method

String findInLine(String pattern)

which searches for a string that matches the pattern argument in the current line of the input. The matching is done using regular expressions. Regular expressions help to implement searches in many different contexts, so it is well worth learning something about them. One can search the web, but we also offer a brief introduction.

We observe that the string "00-01" is on the line before the first number in the table. It can be used to identify the right line in the file, since this string does not occur elsewhere. Thus, we can call in.findInLine("00-01") to see whether the current line of in contains the string "00-01". The method findInLine works as follows: if the string is found in the current line, the input up to and including the matched string is read and the match is returned. Otherwise no input is read and null is returned. So we just have to read new lines as long as findInLine returns null. When the result becomes something other than null, we know that the result is "00-01". It is important that the current position is now immediately after this string in the input, i.e. before </td> on the second line of the excerpt above. We can read through the rest of the line, and then we are ready to read the line containing the first number. The following code does what is described in this paragraph:

while (in.findInLine("00-01") == null) in.nextLine();
in.nextLine(); // skip rest of the line containing "00-01"
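To see the effect of these two lines in isolation, they can be tried on a few lines of sample HTML held in a string. This is a sketch only; the class name SkipToRow and the method firstLineAfterLabel are made up for illustration.

```java
import java.util.Scanner;

public class SkipToRow {
    // Skips input up to and including the line containing `label`
    // and returns the line that follows it.
    // Hypothetical helper, shown only to illustrate findInLine.
    static String firstLineAfterLabel(Scanner in, String label) {
        while (in.findInLine(label) == null) in.nextLine();
        in.nextLine(); // skip rest of the line containing the label
        return in.nextLine();
    }

    public static void main(String[] args) {
        String html =
            "<table>\n" +
            "<tr bgcolor=\"#e1e1e1\">\n" +
            "<td>00-01</td>\n" +
            "<td align=\"right\">19643</td>\n";
        Scanner in = new Scanner(html);
        System.out.println(firstLineAfterLabel(in, "00-01"));
    }
}
```

Note that if the label never occurs, nextLine eventually throws a NoSuchElementException when the input runs out.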

It now remains to continue to read the file to fill the matrix. We do this 24 times (once for each hour of the day):
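As a starting point, reading the cells of a single row might look like the sketch below. It assumes that each cell sits on a line of its own, as in the excerpt above, and that a cell without data contains no digits at all; the class name RowReader and the method readRow are made up for illustration. Moving from one row to the next and repeating this 24 times is left to you.

```java
import java.util.Arrays;
import java.util.Scanner;

public class RowReader {
    // Reads row.length cells, one per input line, starting at the current line.
    // The regular expression \d+ matches the first run of digits in the line;
    // a cell without digits (assumed to be the "-" case) is stored as 0.
    // Hypothetical helper, not part of the given URLReader.java.
    static void readRow(Scanner in, int[] row) {
        for (int col = 0; col < row.length; col++) {
            String digits = in.findInLine("\\d+");
            row[col] = (digits == null) ? 0 : Integer.parseInt(digits);
            in.nextLine(); // move on to the line of the next cell
        }
    }

    public static void main(String[] args) {
        String cells =
            "<td align=\"right\">19643</td>\n" +
            "<td align=\"right\">19780</td>\n" +
            "<td align=\"right\"><b>-</b></td>\n";
        int[] row = new int[3];
        readRow(new Scanner(cells), row);
        System.out.println(Arrays.toString(row));
    }
}
```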

Note that the last column ("today") must contain 0 for the hours where no data is available yet; in these cases the Nord Pool table shows -. Note also that the method getEightDays should have only one parameter, data. This means that the creation of in, which in the given file takes place in main, must be moved to getEightDays. The method must also not catch any exceptions thrown by the creation of in; clients decide what should be done in these cases.

The class should also contain a main method that declares and creates a 24x8 matrix, calls getEightDays to fill it with last week's electricity consumption and finally prints the matrix on the terminal as above.
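Printing the matrix in the layout shown in the introduction could be done along the following lines. This is a sketch; the class name MatrixPrinter and the method format are made up for illustration, and here the matrix is simply left filled with zeros instead of calling getEightDays.

```java
public class MatrixPrinter {
    // Formats the matrix with one row per line and single spaces between values.
    // Hypothetical helper; any equivalent printing loop works as well.
    static String format(int[][] data) {
        StringBuilder sb = new StringBuilder();
        for (int[] row : data) {
            for (int col = 0; col < row.length; col++) {
                if (col > 0) sb.append(' ');
                sb.append(row[col]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[][] data = new int[24][8]; // all zeros until it is filled with real data
        System.out.print(format(data));
    }
}
```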

Task 2. Using a component

We provide the classes ECIcon and ECPanel which you have to download to your directory. You can test ECPanel as follows:

$ java ECPanel

The class is primarily meant to provide a graphical component, but it also contains a main method for testing purposes. Note that the method update in ECIcon calls the method URLReader.getEightDays(). It is therefore important that your class and method have exactly those names. Try clicking on the icon and check that a browser showing the table opens. (How to open a web browser varies between platforms. The class ECPanel is written to work in our labs; on a different computer, the argument for Runtime.exec may need to be changed. On a Mac, for instance, replace the string "firefox" in the argument to exec with "open".)

To see that an ECPanel can easily be added to a graphical application, we'll use the one thing we have: Game of Life from Lab 3. Copy all Java files from Lab 3 to your working directory. Open the class LifeView in an editor and add an ECPanel with an appropriate text to the interface. Conveniently, LifeView uses a BorderLayout, which hosts up to five components, and so far we have only added four (the world and three buttons). Compile all the files and enjoy being able to play the Game of Life while also being able to click and see Sweden's current electricity consumption ...
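The idea of using the fifth BorderLayout position can be sketched with placeholder components. Everything here is an assumption made for illustration: the class name FifthSlot, which positions the four existing components occupy, and the JLabel standing in for the ECPanel (whose constructor is not shown in this instruction). Your LifeView may be arranged differently.

```java
import java.awt.BorderLayout;
import javax.swing.JButton;
import javax.swing.JLabel;
import javax.swing.JPanel;

public class FifthSlot {
    // Builds a panel whose BorderLayout already holds four components
    // and then fills the remaining position with a fifth one.
    static JPanel build() {
        JPanel view = new JPanel(new BorderLayout());
        // Placeholders for the world and the three buttons (positions assumed).
        view.add(new JLabel("world"), BorderLayout.CENTER);
        view.add(new JButton("button 1"), BorderLayout.NORTH);
        view.add(new JButton("button 2"), BorderLayout.SOUTH);
        view.add(new JButton("button 3"), BorderLayout.WEST);
        // In the lab, an ECPanel would be added here instead of this label.
        view.add(new JLabel("ECPanel goes here"), BorderLayout.EAST);
        return view;
    }

    public static void main(String[] args) {
        System.out.println(build().getComponentCount() + " components");
    }
}
```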

Finally, please note that your program has contact with Nord Pool in two ways:

A backup file

We depend on the Nord Pool web page to provide data at the URL given at the beginning of this lab instruction. Unfortunately, it is quite possible that Nord Pool will rearrange their web site at any time, which would make the above URL incorrect. This has happened twice since this lab instruction was written. If the URL above does not work, you can instead try your program with
http://www.cse.chalmers.se/edu/course/TDA547/lab4/nordpool.html

This is a file with the right structure, but of course not with the current data.

Supervision

As always, present your lab by showing the program to a tutor and getting approval.