EECE 352: PS2

Due: Tuesday, 7 September 1999

Problem 1 (50%): The first part of this problem set will exercise your knowledge of objects and strings (as defined in the C++ standard library).

For this problem you will convert part of the RDF dump of the Open Directory Project into an HTML page suitable for viewing. (I am using the TOP/Computers/Computer_Science branch, since the whole dump now stands at 250MB.) Basically, you will convert this file (also available in the class directory) into this file (use the "View, Source" option of your browser to view the HTML) by running the main program shown below.

You will achieve this by implementing a URL class that provides operator>>, operator<<, and the appropriate constructors and destructor. Even though the main program below does not use them, you also need to implement a copy constructor, a constructor that takes only one string parameter (the url), and a constructor that takes three string parameters (the url, the title, and the description). A URL object stores the actual url, the title, and the description associated with that url. You should use the string class since it will make your life much easier and full of joy.
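To make the required interface concrete, here is one possible shape for URL.h. This is only a sketch based on the description above; the member names and layout are up to you.

URL.h (a sketch)
#ifndef URL_H
#define URL_H

#include <iostream>
#include <string>

class URL {
public:
	URL();                                    // default constructor
	URL(const std::string& url);              // url only
	URL(const std::string& url, const std::string& title,
	    const std::string& description);      // url, title, and description
	URL(const URL& other);                    // copy constructor
	~URL();                                   // destructor

	// Reads one ExternalPage block from an RDF stream.
	friend std::istream& operator>>(std::istream& in, URL& u);
	// Writes the url out as HTML.
	friend std::ostream& operator<<(std::ostream& out, const URL& u);

private:
	std::string url;          // the actual url
	std::string title;        // the Title
	std::string description;  // the Description (may be empty)
};

#endif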

operator>> will be the hardest to implement since it needs to parse the RDF file. Take a look at the input file. You will notice Topic blocks and ExternalPage blocks. You can completely ignore the Topic blocks and anything in them. Each call to operator>> needs to read exactly one ExternalPage block into a URL object---this includes the actual URL, the Title, and the Description (if any). To put it more succinctly, the call to operator>> should return only right after it has read a whole ExternalPage block, not before and not later (unless, of course, you reach the EOF). Notice that each element in the RDF file is on its own line. You might find the getline(istream &, string &) function useful. Also, note that the string find member function can search for arbitrary-length strings. That is, you can write string s = "abracadabra"; s.find("cada") and it will return the starting position of the substring "cada" within s.
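As a starting point, here is a rough sketch of how these operators might be implemented with getline and find. The tag strings ("<ExternalPage about=", "<d:Title>", "<d:Description>") are assumptions based on the usual Open Directory RDF format, and the HTML written by operator<< is only a guess; check both against the actual input and sample output files.

URL.cpp (a sketch)
#include "URL.h"
#include <iostream>
#include <string>

using namespace std;

istream& operator>>(istream& in, URL& u) {
	string line;
	u.title = u.description = "";  // clear leftovers from a previous read
	// Skip everything (including Topic blocks) until an ExternalPage starts.
	while (getline(in, line)) {
		string::size_type pos = line.find("<ExternalPage about=\"");
		if (pos != string::npos) {
			string::size_type start = pos + 21;  // past '<ExternalPage about="'
			u.url = line.substr(start, line.find('"', start) - start);
			break;
		}
	}
	// Read the rest of the block, picking out the Title and Description.
	while (getline(in, line)) {
		if (line.find("</ExternalPage>") != string::npos)
			break;  // return only once the whole block has been read
		string::size_type pos = line.find("<d:Title>");
		if (pos != string::npos) {
			string::size_type start = pos + 9;  // past "<d:Title>"
			u.title = line.substr(start, line.find("</d:Title>") - start);
		}
		pos = line.find("<d:Description>");
		if (pos != string::npos) {
			string::size_type start = pos + 15;  // past "<d:Description>"
			u.description = line.substr(start, line.find("</d:Description>") - start);
		}
	}
	return in;
}

// operator<< writes one url as HTML. The exact markup must match the
// sample output file; it would be something along these lines:
ostream& operator<<(ostream& out, const URL& u) {
	return out << "<a href=\"" << u.url << "\">" << u.title << "</a> "
	           << u.description;
}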

main.cpp
#include "URL.h"
#include <iostream>
#include <fstream>
#include <string>

using namespace std;

//Note: having operator>> read in one format (RDF)
// and operator<< write in another format (HTML)
// is almost always a bad idea. We are just doing it
// as an exercise. In this case you would want to
// give these functions different names (e.g. writeAsHTML)
int main(int argc, char* argv[]){

	ifstream opendir("//Engr_asu/ECE352/computer_science.txt");
	if (!opendir) {
		cerr << "Could not open input file" << endl;
		return 1;
	}

	//Change this filename to write to your own folder.
	ofstream fout("//Engr_asu/ECE352/ps2out.html");
	if (!fout) {
		cerr << "Could not open output file" << endl;
		return 1;
	}

	fout << "<html><head><title>PS2 Output</title></head><body>" << endl 
		<< "<h1>PS2 Output</h1>" << endl << "<ul>" << endl;
	
	URL next;
	while (opendir >> next){
		fout << "<li>" << next << endl;
	}
	fout << "</ul>" << endl << "</body></html>" << endl;
	fout.close();
	return 0;
}

In this example you can see the utility of objects. The URL knows how to read itself from a stream and write itself out. If we needed more I/O methods we would just need to define more functions. We could also define other operators on the URL, such as operator==, which could do a smart comparison of two URLs (e.g., ignoring a trailing "index.htm" or "index.html").
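As a sketch, such an operator== might canonicalize both urls before comparing. The helper function canonical below is hypothetical, and operator== would have to be declared a friend of URL to read the private url member.

// A possible "smart" operator== for URL (a sketch).
static string canonical(string url) {
	// Strip a trailing "index.html" or "index.htm", if present.
	const string suffixes[] = { "index.html", "index.htm" };
	for (int i = 0; i < 2; ++i) {
		const string& s = suffixes[i];
		if (url.size() >= s.size() && url.substr(url.size() - s.size()) == s)
			return url.substr(0, url.size() - s.size());
	}
	return url;
}

bool operator==(const URL& a, const URL& b) {
	return canonical(a.url) == canonical(b.url);
}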

Problem 2: (50%)
  1. Using the database from above, and assuming that you are reading a file of size N, answer the following questions using big-O notation, providing explanations where needed:
    1. If you were to implement a function that searched for a particular URL, how long would it take?
    2. If the URLs were sorted alphabetically, how long would the search take?
    3. Say you took the whole open directory (250MB of data) and set up your own website. You use a search function that takes O(N) time; specifically, it takes 0.1 seconds to do a search on the whole 250MB. Since you are such a talented webmaster, the number of hits (and searches) on your site doubles each month. Not only that, but the open directory people keep adding data, so the directory doubles in size every six months. Say you start with one hit per day on September 1; how long will it be before your website crashes? (That is, how long before it spends all 24 hours of the day doing searches?)
    4. Answer the same question as above, but assume that you have an O(log N) search function.
    5. Finally, assuming you use a nice SQL database that gives your search function O(1) time, answer the same question.
    6. What assumption are you making in the previous three questions that is not realistic for a real-world website? Explain how you might adjust your numbers to better approximate a real-world scenario.
  2. Exercise 5.15-a (only part a) from the textbook.

Jose M. Vidal
Last modified: Thu Sep 2 17:06:04 EDT 1999