english & programming & web Franchu on 05 Jun 2007 10:19 am
How to create a mashup and not die trying…
This post is an English translation of my previous post Como hacer un mashup y no morir en el intento…, as kindly requested by D. Canos and T. Forza on behalf of the other students of the J2EE Programming with Passion course by Sang Shin.
Before attending the Google Developer Day in Madrid, I wanted to play around with the Google Maps API to see how hard it was to use and to get a feeling for how difficult the Google Maps API workshop I was going to attend would be. That motivated me to create a little mashup that displays the status of the dams on a map of Spain, showing where we have water and where we are not doing so well.
The first thing you need in order to build a mashup is at least one data source. It is becoming more common to get data in a syndicated format (RSS, Atom) or to access it through APIs that expose it as XML or JSON. Nevertheless, you may face the case in which the information you want to use in your mashup is available on a website but there is no easy way to extract it. That was my case, and the solution is to create an RSS feed through web scraping.
How to create a feed from a website
You can always hack together, in a language you are familiar with (I’d go for Python), a web scraper that extracts the information you are interested in and returns a feed, but that might not be trivial for everybody. Fortunately, there is an easier way: a GUI tool, freely hosted by a third-party provider, that does exactly what we are trying to do. As this mashup was just a proof-of-concept exercise, I didn’t feel like hacking my own web scraper, and it also gave me an opportunity to discover how Openkapow works. In a couple of minutes I had my feed up and running, publicly accessible and created from the data on a page that gathers all the information about Spanish dams extracted from the website of the Spanish Ministry of the Environment. I could have scraped the original source directly, but the structure of that website made it more complex for the web scraper.
To create the web scraper you need to download the tool from the openkapow website. Its only downside is that it is available only for Windows and Linux, leaving us poor Mac users a little bit out of the game, but we can always find ways around this. You can also download the robot that I have created from the openkapow website.
Once you publish the robot, the feed is available for use. I have set the robot’s update rate to 1 day in order not to generate useless traffic, as the contents are updated once a week. Moreover, so that the feed does not load directly from the openkapow website (I’ve had some service downtime problems), I use FeedBurner as a feed cache.
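For readers who do want to go the hand-written route, here is a minimal sketch in Python using only the standard library. It is not the robot I built with Openkapow: the table layout, the column meanings and the example.com link are all assumptions made up for illustration. It reads the cells of an HTML table and emits one RSS item per row.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text fragments of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made only of <th> cells stay empty
                self.rows.append(self._row)
            self._row = None


def html_to_rss(html, title, link):
    """Turn every table row found in `html` into an RSS 2.0 item."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Scraped from " + link
    for cells in parser.rows:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = cells[0]
        ET.SubElement(item, "description").text = " | ".join(cells[1:])
    return ET.tostring(rss, encoding="unicode")


# Hypothetical page fragment; the real page is structured differently.
SAMPLE_HTML = """
<table>
  <tr><th>Dam</th><th>Capacity (hm3)</th><th>Stored (hm3)</th></tr>
  <tr><td>Example Dam A</td><td>100</td><td>40</td></tr>
  <tr><td>Example Dam B</td><td>250</td><td>200</td></tr>
</table>
"""

feed_xml = html_to_rss(SAMPLE_HTML, "Dam status (sample)", "http://example.com/dams")
```

In a real scraper you would first download the page and pass its text to html_to_rss; the hard part is usually adapting the parser to the actual markup, which is exactly the work a GUI tool saves you.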
Feed manipulation and relevant data extraction
As the feed contains a lot of information that I am not interested in displaying on the map in this first experiment (e.g. global data on the status of the Spanish dams, or regional data), I need to manipulate the data before using it in the map. To do this I use an online tool called Yahoo! Pipes, which lets you manipulate different data sources and output an RSS feed, a JSON object, an XML document, …
You can see the pipe that I have created. The nice thing about this tool is that it lets you see how a pipe was built, so you learn really fast how to do useful things.
Basically, my pipe filters out the items I am not interested in, extracts the values I will use as variables in my mashup (water level, coordinates of the dam, name, …) and outputs them in the feed.
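The same filter-and-extract step could be sketched in Python, for those who prefer code to a visual pipe. This is only an illustration of the idea, not a description of what Yahoo! Pipes does internally: the item layout, the "category" field and the "key=value" description format are assumptions I made up for the example.

```python
import json
import re

# Matches "key=number" pairs such as "level=40.0" or "lon=-3.7".
FIELD_RE = re.compile(r"(\w+)=([-\d.]+)")


def extract_dams(items):
    """Keep only per-dam items and pull out the values the map needs."""
    dams = []
    for item in items:
        if item.get("category") != "dam":
            continue  # drop global and regional summary items
        fields = dict(FIELD_RE.findall(item.get("description", "")))
        try:
            dams.append({
                "name": item["title"],
                "level": float(fields["level"]),
                "lat": float(fields["lat"]),
                "lon": float(fields["lon"]),
            })
        except (KeyError, ValueError):
            continue  # skip items with missing or malformed values
    return dams


# Hypothetical feed items, already parsed into dictionaries.
SAMPLE_ITEMS = [
    {"title": "Spain total", "category": "global",
     "description": "level=55.0"},
    {"title": "Example Dam A", "category": "dam",
     "description": "level=40.0;lat=40.4;lon=-3.7"},
    {"title": "Example Dam B", "category": "dam",
     "description": "level=80.0;lat=41.6;lon=-0.9"},
]

dams = extract_dams(SAMPLE_ITEMS)
dams_json = json.dumps(dams)  # what the map page would load
```

Items that lack a value are skipped rather than guessed, so the map never receives a marker without coordinates.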
Showing it on the map
After loading the JSON object directly from the pipe output, displaying the dams on the map is almost trivial. I decided to use a different marker icon depending on the water level of the dam, and as there are lots of markers to display, it is better to use a marker manager. You can take a look at the source code that I use to generate the map. Instead of the PHP include, I could have used the Google AJAX Feed API, but I didn’t have time to start playing with another API.
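Choosing an icon from the water level is a simple threshold lookup. The sketch below shows only that selection logic, in Python for consistency with the other examples on this page; the thresholds and icon file names are assumptions for illustration, not the values used in the actual map code.

```python
def icon_for_level(percent_full):
    """Return a marker icon name for a dam that is `percent_full` % full."""
    if not 0 <= percent_full <= 100:
        raise ValueError("percent_full must be between 0 and 100")
    if percent_full < 30:    # hypothetical "low" threshold
        return "marker_red.png"
    if percent_full < 60:    # hypothetical "medium" threshold
        return "marker_yellow.png"
    return "marker_green.png"
```

Rejecting out-of-range values up front makes bad feed data visible instead of silently drawing a green marker.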
The final result can be seen in my other blog entry: “Status of the Spanish Dams”
If you have any questions, do not hesitate to ask them in the comments and I will do my best to answer them.
on 15 Jun 2007 at 6:53 1.discoverall said …
Hello Franchu ,
I asked the question “What is the best way to get information from Internet automatically?” on J2EE free course…
Thank you for your writing.
Now I have one more question about this.
I have used HttpClient from http://jakarta.apache.org/commons/httpclient/
and I can download a web page. But I would like to get only part of that page.
Example: I have a link to a news item on another website, but I want to get only the news, not the banner, the advertisements…
How can I do that?
Thank you very much.
on 15 Jun 2007 at 12:23 2.Franchu said …
Hi!
Well, I am afraid there is no easy answer to your question.
In an ideal world, the page would be written in XHTML or would follow some accessibility standards, and it would be easy to extract semantic content from it.
In that case you would need an XML parser, and using XPath you could extract the node of the document that contains the information you are interested in.
Otherwise, you would have to parse the string, looking for the delimiters of the content you want to extract.
The only thing I can do is suggest that you search for more information on web scraping (that’s the technical name for what you are trying to do). Keep in mind that the only thing HttpClient does is retrieve the web page and give it to you as a Stream or as a String. After that point, it is up to you to parse the content with web scraping techniques.
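To make the two approaches concrete, here is a minimal sketch in Python (the same ideas apply in Java with an XML parser and String.indexOf). The sample page, the "news" id and the comment delimiters are assumptions invented for the example; a real page will need its own XPath expression or markers.

```python
import xml.etree.ElementTree as ET

XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_by_id(xhtml, element_id):
    """Well-formed XHTML: locate a <div> by id with an XPath expression."""
    root = ET.fromstring(xhtml)
    node = root.find(".//x:div[@id='%s']" % element_id, XHTML_NS)
    if node is None:
        return None
    # Join all nested text and collapse runs of whitespace.
    return " ".join("".join(node.itertext()).split())


def extract_between(page, start_marker, end_marker):
    """Any HTML: return the text between two known delimiters."""
    start = page.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = page.find(end_marker, start)
    if end == -1:
        return None
    return page[start:end]


# Hypothetical page used by both functions.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="banner">Buy now!</div>
    <div id="news">
      <h1>Headline</h1>
      <p>Body of the <em>news</em> item.</p>
    </div>
  </body>
</html>"""

news = extract_by_id(SAMPLE_PAGE, "news")
raw = extract_between(SAMPLE_PAGE, '<div id="banner">', "</div>")
```

The XPath version survives layout changes as long as the id stays the same, while the delimiter version works on pages that are not well-formed but breaks as soon as the surrounding markup changes.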
Hope this reply was useful to put you on track towards the answer.
on 11 Dec 2007 at 8:15 3.Chris_C said …
Nice work Franchu. We are well aware of the Mac client issue and will be addressing it. Timing is uncertain at this point.
Additionally, you can create a REST robot on openkapow that can return data to you in multiple formats without rewriting the robot. Formats include:
- REST (XML)
- JSON
- CSV
- HTML
- XHTML
on 01 Feb 2008 at 3:12 4.Tim said …
“What is the best way to get information from Internet automatically?”
I have had good success with the free iMacros for Firefox extension
https://addons.mozilla.org/en-US/firefox/addon/3863
http://wiki.imacros.net/Data_Extraction
Tim