Searching Google from a Java Program

Background and Download

Each year the Elementary Patterns Working Group meets several times. One annual meeting is at ChiliPLoP where we gather to try to extend our work of making patterns accessible to novice programmers. In April 2006 we met there again and one project was to extend our work of the previous year on providing examples that might be used in elementary courses and that show a bit of patterns.

One of our sub goals is to show how beginning students can access large data sets and manipulate them. Rather than sort 10 items, we would like to have students work with millions (at least) of bits in some interesting and creative way. As an aid in this, we build scaffolding that the students use in elementary projects.

Google, of course, provides a large data set. It also provides an API to access the main features of its search service. Unfortunately, the API, while complete and even accessible, does not have an interface that should really be shown to novices as it is all getters and setters with little thought to an abstract interface. We present here a facade to that api that we hope is more suitable for novices. It doesn't provide all of the features, but the intent is to provide something that the students can extend in their own work.

The people who have contributed directly to this small project are Dan Steinberg, Eugene Wallingford, Rick Mercer, Robert Duvall (no, the other, smarter, one), and myself. Download the Elementary Patterns Working Group Google Search API.

To use this you will need to visit http://www.google.com/apis/ and download the Google kit, as you will need the googleapi.jar from there. You will also need to obtain your own personal Google search key from the same location. This requires registration, but is free. The key entitles you to make 1000 searches per day with this api, and hence via our code as well. Part of the google kit is the Java Docs for their original API and this can be effectively used, but as it has little true encapsulation, you may not want to show this to beginners. Programming with getters and setters rather than a more abstract (conceptual) interface should probably be avoided with novices. Our API is an attempt to address this with a simple wrapper. Our download (above) also includes Java Docs for the more conceptual version.

To use this you will also need to build a project in your favorite IDE that contains the google jar, the JUnit jar, and our source code along with any code you write yourself.

Patterns

The code for this project demonstrates a few simple design patterns. Some of these can be mentioned to beginners as they learn to program, but not all need be. Patterns are about good code, good design, and designing solutions, and not about adding more course material.

The main classes in this API represent a Facade (see Design Patterns, by Gamma, Helm, Vlissides, and Johnson), hiding the details of another API behind one more appropriate.

All of the classes here provide Immutable objects. The objects are instantiated completely in their constructors and have no mutator (setter) methods to change the state of the object. Such objects, and the programs they contain, are easier to reason about than otherwise, as the objects are like structured constants.

The viewer class represents the MVC (Model-View-Controller) pattern as it has complete separation from all of the "model" code of the search itself. Here in fact the separation is complete as the viewer can view any simple web page, not just the results from here. The viewer also implements the Observer pattern in its listener structure, with separate listeners for each control. This, of course, is typical in Java.

The Query classes implement something like a Decorator. If you have a simple Query, you can wrap it with a SiteQuery (decorator) that will restrict the search to a single web site.

The handling of SiteQueries and DefinitionQueries is something like a Visitor. A recursive call is passed down a tree and information is gathered on the unrolling phase of the recursions and therefore passed back up the tree where it is handled by the original caller, that started the process by sending a message to the root of the tree of queries. We have a tree here, not simply a list, as some of the query types have more than one "child". Note that there is no explicit TREE DATA STRUCTURE here, but the natural linking of objects results in a tree.

We use the Null Object pattern extensively. The null value isn't generally returned here. Instead we have NullPages and NullSearchResult (inner) classes, that provide empty behavior when they arise. In particular this absolves the user from null checks on returned results from the messages. You will always get back an object that will behave correctly, even if it does nothing. This means that messages are safer and that the beginner need pay less attention to messy details that, while important, can be delayed so as to focus on behavior elicited via message passing.

Note that the NULL_QUERY object is defined via an anonymous inner class. Also note that it is a Singleton. There is no other object of this class possible. This is a special case of a Typesafe Enumeration where there is only one value.

The API

This API lets you formulate Google queries, save them, construct new queries from old, obtain the search results, display the html, and even image the results of searches in a simple browser.

Part of the motivation of this is that the main author currently (Bergin) usually forgets the details of advanced searches and wanted to explore the possibility of combining searches with logical keywords, such as AND and OR. All of this is possible within the Google search mechanism, but, as each search service use its own syntax, it is easy to forget the details.

More important, is that you sometimes might want to search Google (or other service) from within a program and process the results. For example, you might want to know if Google will provide more hits for "Nuclear War" than for "Everlasting Peace". You can easily do this by hand, of course, but you can also easily write a program with this API to make arbitrary such comparisons.

The main abstractions provided here are

Query -- formulate a Google Query, a simple word or an arbitrarily complex search using the Google syntax. This class has many subclasses to, for example, or two queries together , which will produce pages that contain either of two words. You can also restrict a search to a single web site, etc.

SearchResult -- a collection of pages returned by a Google search along with some additional information about the search. For example you can get an estimate of how many pages fulfill the search criteria.

SearchResultPage -- a single page returned by a Google search. You can obtain the URL of the page, its HTML, and image it in a special minimal browser provided here.

The Java docs are provided in our download zip, but you can also read them here.

Uses in Beginning Courses

The distribution also includes a page of examples exercises that might be given to students with this API. We welcome additions to this list of exercises.

There are four levels of exercise that might be contemplated with this package.

Using the API

At this level, students simply create objects using the constructors here and send messages defined by the API. the instructor can surround this with interesting projects, such as mentioned above, to compare the Google rankings of a set of sites.

Extending the API

Here the student can extend this API, by improving our implementation, addiing additional methods, and writing additional sub-classes to those provided. For example, we have done nothing here with SafeSearch or with searches constrained to dates.

Looking Under the Hood

The questions here are mostly discussions (essay questions) asking about the implications of some of the things we have done and getting students thinking about whether something in an API can be improved.

Over the Horizon

Over the Horizon questions take you beyond the everyday. Build something like this for the Amazon service, for example, or for other search engines.

Notes on the Code

One of the decisions made in this code was to use Java 5 collections and it shows some simple use if template collections. This permits you to avoid casting results retrieved from containers like List and HashMap, for example. The code also hides the fact that the official Google API returns an array of pages in a result. We map this to a List (actually List<SearchResultPage>). We prefer the List and Maps of Java over such things as arrays, as they can be manipulated with message syntax, rather than the more specialized array syntax. The concepts of contiguous and linked structures can be taught with these, while not requiring additional syntax. If the course stresses object-oriented programming, then the students will be familiar with message passing so we just capitalize on this. Since we are filtering the result pages anyway we also filter out any malformed urls that might be returned so that they don't appear in the result list.

The code shows a use of Abstract Classes, but not interfaces. Polymorphism is demonstrated in many ways, but primarily through the buildSearchString method of the Query classes. The Query class provides quite a lot of functionality that is used unchanged in the subclasses. Also note that subclasses don't extend the interface of the Query superclass. They implement exactly the same methods. The implication is that the only type we need for reference variables to Query objects is the Query class itself. All subclass objects implement exactly the same set of methods, though in different ways, perhaps.

We use static methods in a few places. We have minimized this, however, and don't consider it part of the mainstream Java use at this level.

We need to trap exceptions in a few places. In one place we have transformed an Exception (MalformedURLException) into a runtime exception so that it may be trapped, but need not be.

Public features are generally provided with JavaDoc comments.

We provide JUnit test classes for much of what is here, but have left some undone as an opportunity for exercises. There is also a suite class (AllTestSuite) that runs all the tests in a single execution.

Last Updated: April 21, 2006