goldb.org home

AS OF MAY 2008, THIS BLOG IS NO LONGER BEING UPDATED.
Visit the new blog at: http://coreygoldberg.blogspot.com



 Thursday, April 03, 2008

Python - Script - Which Webserver Does That Site Run?

You can use this little Python function to see what type of web server a site is running.  All it does is send an HTTP request to the host and read the 'server' header in the response.


import httplib

def get_server_type(host):
    conn = httplib.HTTPConnection(host)
    conn.request('GET', '/')
    resp = conn.getresponse()
    return resp.getheader('server')


print get_server_type('www.pylot.org')
print get_server_type('www.techcrunch.com')

Output:

lighttpd/1.4.19
Apache/2.0.52


Note: This doesn't work for all sites. Some servers omit or obscure the 'server' header, in which case the function returns None.
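
For readers on Python 3, where httplib became http.client, the same idea might look roughly like the sketch below. The port and timeout parameters and the try/finally cleanup are additions for illustration and are not part of the original function.

```python
import http.client


def get_server_type(host, port=80, timeout=10):
    """Return the value of the 'Server' response header, or None if absent."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request('GET', '/')
        resp = conn.getresponse()
        # Header lookup is case-insensitive; a missing header yields None.
        return resp.getheader('Server')
    finally:
        # Always release the socket, even if the request fails.
        conn.close()
```

Calling get_server_type('www.example.com') returns the header value as a string, or None when the server does not send one.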

#    Comments [7] |
 Tuesday, February 12, 2008

Python - 15 Line HTTP Server - Web Interface For Your Tools

I write a lot of command line tools and scripts in Python. Sometimes I need to kick them off remotely. A simple way to do this is to launch a tiny web server that listens for a specific request to start the script.

I add a "WebRequestHandler" class to my script and call it from my main method. There is a "do_something()" method in the class. You call your code from this method.

All you have to do is launch your script and it will sit there and wait for requests. If the request is bad, it spits back a 404 error. If the request path matches what we are looking for (in this case "/foo"), the code is launched.

Now you have an easy way to call your script remotely. Just open a browser and type in the URL: http://your_server/foo, or call it with a tool like 'wget' or 'curl'.


import BaseHTTPServer

class WebRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/foo':
            self.send_response(200)
            self.end_headers()  # finish the header block so the client gets a complete response
            self.do_something()
        else:
            self.send_error(404)
            
    def do_something(self):
        print 'hello world'
        
server = BaseHTTPServer.HTTPServer(('',80), WebRequestHandler)
server.serve_forever()

(this was adapted from a code sample in "Python In A Nutshell" by Alex Martelli)
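
On Python 3 the BaseHTTPServer module lives in http.server. A rough port of the handler above is sketched here; make_server is a hypothetical helper name, and the default port of 8080 is an assumption chosen because binding to port 80 usually needs elevated privileges.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class WebRequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/foo':
            self.send_response(200)
            self.end_headers()  # terminate the header block
            self.do_something()
        else:
            self.send_error(404)

    def do_something(self):
        # Call your own code from here.
        print('hello world')


def make_server(port=8080):
    """Build a server that listens on all interfaces (hypothetical helper)."""
    return HTTPServer(('', port), WebRequestHandler)
```

To run it, call make_server().serve_forever() and request http://your_server:8080/foo.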

#    Comments [1] |
 Wednesday, February 06, 2008

C# .NET 2.0 HTTP GET Class

Sending HTTP Requests from a C# program seems unnecessarily hard.  I wrote a small helper class to deal with sending and timing GET requests:
http://www.goldb.org/httpgetcsharp.html

You use it like this:


using System;

public class Program
{
    static void Main(string[] args)
    {
        HTTPGet req = new HTTPGet();
        req.Request("http://www.google.com");
        Console.WriteLine(req.StatusLine);
        Console.WriteLine(req.ResponseTime);
    }
}
#    Comments [2] |
 Monday, October 22, 2007

OpenSTA 1.4.4 Release (Open Source HTTP Performance Test Tool)

The OpenSTA team has announced the release of version 1.4.4.

OpenSTA is a distributed software testing architecture designed around CORBA.  The applications that make up the current OpenSTA toolset were designed to be used by performance testing practitioners for web load testing.

Info:
http://portal.opensta.org/index.php?name=News&file=article&sid=51

Download:
http://opensta.org/download.html

Congrats and thanks to Bernie Velivis, Daniel Sutcliffe, and Jerome Delemarche for making this release possible.




#    Comments [1] |
 Sunday, October 14, 2007

Python - Simple Multithreaded HTTP Load Generator/Timer

This is a module for generating concurrent requests to an HTTP server.  Each thread makes HTTP GET requests to a single URL at the specified interval.  Threads are added over a given rampup time if you want to generate increasing load.  Response times are printed to STDOUT.  It can be used for cursory performance benchmarking or load testing a web resource.

load_generator.py module

sample usage:


#!/usr/bin/env python

from load_generator import LoadManager

lm = LoadManager()
lm.msg = ('www.example.com', '/')
lm.start(threads=5, interval=2, rampup=2)
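
The load_generator module itself is not reproduced here. As a rough Python 3 illustration of the approach described above (worker threads started gradually over a rampup period, each issuing a request at a fixed interval and printing the response time), a sketch might look like the following. The names timed_get and run_load and the duration parameter are assumptions made for this example and are not the module's API.

```python
import threading
import time
import urllib.request


def timed_get(url, timeout=10):
    """Fetch url once and return the elapsed time in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start


def run_load(fetch, threads=5, interval=2.0, rampup=2.0, duration=10.0):
    """Call fetch() from several threads and collect the response times.

    Threads are started evenly across `rampup` seconds. Each one calls
    fetch() every `interval` seconds until `duration` seconds have passed.
    """
    results = []
    lock = threading.Lock()
    deadline = time.monotonic() + duration

    def worker():
        while time.monotonic() < deadline:
            try:
                elapsed = fetch()
            except Exception as exc:
                print('error: %s' % exc)
            else:
                with lock:
                    results.append(elapsed)
                print('%.3f secs' % elapsed)
            time.sleep(interval)

    workers = []
    spacing = rampup / threads if threads else 0
    for _ in range(threads):
        t = threading.Thread(target=worker)
        t.start()
        workers.append(t)
        time.sleep(spacing)  # stagger thread start times over the rampup
    for t in workers:
        t.join()
    return results
```

A call such as run_load(lambda: timed_get('http://www.example.com/'), threads=5, interval=2, rampup=2) would mirror the sample usage above. Passing the fetch function in as an argument keeps the threading logic separate from the HTTP call.
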
#    Comments [3] |
 Tuesday, September 11, 2007

Python httplib2 - Handling Cookies in HTTP Form Posts

I often need to automate tasks in web based applications.  I like to do this at the protocol level by simulating a real user's interactions via HTTP.  Python comes with two built-in modules for this: urllib (higher level Web interface) and httplib (lower level HTTP interface).

However, I usually don't use either of these.  I prefer to use Joe Gregorio's excellent httplib2 module (btw, I really wish this could make its way into Python's Standard Library).  It is a much richer library and has a lot of nice features for dealing with HTTP.

When automating something, you often need to "login" to maintain some sort of session/state with the server.  This is usually achieved with form-based authentication. You post a form to the server, and it responds with a cookie in the incoming HTTP header.  You need to pass this cookie back to the server in subsequent requests to maintain state or to keep a session alive.

Here is an example of how to deal with cookies when doing your HTTP Post.


First, let's import the modules we will use:


import urllib
import httplib2


Now, let's define the data we will need: In this case, we are doing a form post with 2 fields representing a username and a password.


url = 'http://www.example.com/login'   
body = {'USERNAME': 'foo', 'PASSWORD': 'bar'}
headers = {'Content-type': 'application/x-www-form-urlencoded'}


Now we can send the HTTP request:


http = httplib2.Http()
response, content = http.request(url, 'POST', headers=headers, body=urllib.urlencode(body))


At this point, our "response" variable contains a dictionary of HTTP header fields that were returned by the server. If a cookie was returned, you would see a "set-cookie" field containing the cookie value. We want to take this value and put it into the outgoing HTTP header for our subsequent requests:


headers['Cookie'] = response['set-cookie']

Now we can send a request using this header and it will contain the cookie, so the server can recognize us.



So... here is the whole thing in a script. We login to a site and then make another request using the cookie we received:


#!/usr/bin/env python

import urllib
import httplib2

http = httplib2.Http()

url = 'http://www.example.com/login'   
body = {'USERNAME': 'foo', 'PASSWORD': 'bar'}
headers = {'Content-type': 'application/x-www-form-urlencoded'}
response, content = http.request(url, 'POST', headers=headers, body=urllib.urlencode(body))

headers = {'Cookie': response['set-cookie']}

url = 'http://www.example.com/home'   
response, content = http.request(url, 'GET', headers=headers)
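
One caveat with copying 'set-cookie' straight into 'Cookie': the value a server sends usually carries attributes such as Path or Expires, which are instructions for the client and are not normally sent back. A rough Python 3 sketch of trimming the value down with the standard library follows. cookie_header_from is a hypothetical helper name, and the sketch assumes the header holds a single cookie.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlencode


def cookie_header_from(set_cookie):
    """Reduce a Set-Cookie value to the name=value form used in 'Cookie'."""
    jar = SimpleCookie()
    jar.load(set_cookie)
    # Keep only name=value; attributes like Path or HttpOnly are dropped.
    return '; '.join('%s=%s' % (name, morsel.value)
                     for name, morsel in jar.items())


# In Python 3, urlencode lives in urllib.parse rather than urllib.
form_body = urlencode({'USERNAME': 'foo', 'PASSWORD': 'bar'})
```

With this helper, the cookie step becomes headers = {'Cookie': cookie_header_from(response['set-cookie'])}.
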
#    Comments [2] |
 Friday, August 31, 2007

JavaScript - Anti-Spam Email Link

Posting a link to your email address on your website is inviting spiders to grab it and spam you.  I get around this by using a JavaScript snippet so my email address link renders on the client but not in the actual HTML source.

While I'm sure there are much better ways to do this, this has been successful for me so far.

This script creates a "mailto" link for my email address:


<script type="text/javascript">
    <!--
    var name = "corey"
    var emailHost = "goldb.org"
    document.write("<a href=" + "mail" + "to:" + name + "@" + emailHost + ">"
        + name + "@" + emailHost + "</a>")
    //-->
</script>
#    Comments [1] |
 Thursday, August 23, 2007

Scalability - TechCrunch Runs On 1 Web Server And 1 Database Server. Huh?

highscalability.com pointed out an article from Pingdom titled "What the Web’s most popular sites are running on", which shows the results from an infrastructure survey of 7 popular web "super sites" (TechCrunch, FeedBurner, iStockPhoto, YouSendIt, Meebo, Vimeo, Alexaholic).

There is some intriguing information in the survey, but one thing stood out.  Apparently, TechCrunch runs on 1 web server and 1 database server, with no server clustering:


I almost find this hard to believe.  They serve > 1 million unique visitors per month off of this?

#    Comments [3] |
 Friday, June 29, 2007

PyLT - Dev Update #3 - Web Performance/Load Test Tool

(Update: PyLT has been renamed to Pylot)

(PyLT is the open source web performance/load test tool that I am developing)

A quick update on PyLT development...

The load generating engine is looking pretty solid and seems to work really well so far. It uses threading for concurrency and seems to scale well (though I haven't put it through its paces enough yet).

The GUI is evolving more and starting to look like a real performance/load testing tool:

This is my first project using wxWidgets and wxPython.  I am finding the toolkit very powerful, and designing nice user interfaces with it is relatively straightforward.  However, this is a big jump for me.  The past few years I have mostly done web programming and work with distributed systems.  It took a bit to get my head back into traditional GUI application development and event-driven programming.

More to come...

#    Comments [0] |
 Monday, June 11, 2007

PyLT - Dev Update #1 - Web Performance/Load Test Tool

(Update: PyLT has been renamed to Pylot)

A quick update on PyLT development...

I have a working version of the guts of my tool (the multi-threaded load generator).  I have now started working on the user interface.  My initial idea was to use Tk for the GUI Toolkit.  I started developing a minimal GUI and quickly realized I need a Toolkit more powerful than Tk.

My original justification for using Tkinter (from blog comments):

"I will probably eventually move to a richer toolkit (like wxPython) if I take this thing far. For right now, Tk works. It comes distributed with core python, it's super fast and light, it's easy to use, and I know it pretty well. Though it looks like crap and is limited in many ways."

As of today I am rewriting the GUI with wxPython, which uses the wxWidgets Toolkit.  This should give me the ability to create a rich cross-platform UI for my tool.

[For posterity] Here is what the original prototype of the Tk UI looked like:


R.I.P. Tk... Hello wxWidgets


Related:
PyLT - Scratching My Itch - New Web Performance/Load Test Tool (Open Source) 

#    Comments [2] |
 Friday, June 01, 2007

PyLT - Scratching My Itch - New Web Performance/Load Test Tool (Open Source)

(Update: PyLT has been renamed to Pylot)

I have started development on a new web performance/load testing tool.  It is targeted at testing Web Services.


Here is some Q&A with myself:


You know you are reinventing the wheel, right?

Yes, I know.  There are already open source web load testing tools available (OpenSTA, JMeter, Grinder, WebLOAD, etc).  I have used all of these as well as proprietary tools for years.  I am a performance engineer and I feel like I need a tool set that I am intimately familiar with.  I need the ability to easily alter and tweak the tool at will.  I don't have enough time, budget, or patience to wait on vendors when I need something.  I also want a tool that is fun to hack and adapt.  For this, I need to understand the code base deeply.

What language are you using?

Python.  The initial GUI uses Tk, but this may be changed down the road. I use Python's threading module for concurrency. If this doesn't scale well enough, I will be exploring other models of concurrency (perhaps generator based coroutines).

Why do you think you can write a tool like this?

I have worked in performance testing for nearly 10 years.  I have written many tools that work with various protocols to do distributed load generation and testing.  Creating a simple HTTP load generator is sort of my Hello World 2.0 for each language I try (I have written these from scratch in Python, Perl, Java, and C#).  This tool takes that basic concept and organizes it into a robust application.

Will it be Free and Open Source?

Of course!  Licensed under GNU GPL.



For an early look, check out the source repository at:  http://pylt.googlecode.com/svn/trunk

More details to come.

-Corey

#    Comments [6] |
 Wednesday, May 30, 2007

goldb.org - OS Breakdown - Last 30 Days

Operating Systems of visitors to my website and blog over the past 30 days.
(Results with less than 1% have been removed)

#    Comments [2] |
 Tuesday, May 29, 2007

WebInject and op5 Monitor for Advanced Web Site Monitoring

One cool thing about developing Open Source software is seeing where people end up using your software. I have seen my WebInject test tool show up in various places for various uses.

The most recent example of this comes from op5 AB:

"op5 is a leading product developer of systems and network monitoring and management software. Our aim is to give our customers an increased and measurable availability to the IT system – both in terms of quality and quantity. Our products are op5 Monitor, op5 Statistics and op5 LogServer."

op5 Monitor (Linux Open Source Awards: Best Open Source Application 2006) is their network and application monitoring solution:

"op5 Monitor is a system that monitors the whole network. op5 Monitor is unique in its flexibility and it can monitor all net connected components from servers, routers and printers to individual processors, for example mail services, web servers and virus programmes. All these functions are handled by a web – browser. The system can handle a net with a several thousand units."

Carl Ekman wrote a nice paper explaining how to use WebInject as an intelligent agent/plugin for use in web application monitoring:
WebInject and op5 Monitor - Setting up advanced website monitoring with WebInject (PDF)

The paper serves as a nice example and tutorial for using WebInject as a monitoring plugin (it can also be used in a similar fashion with Nagios):

"Most op5 Monitor users have detailed monitoring of all inhouse applications and servers, but sometimes not even that is enough. What if something unforeseen happens that makes dynamically generated content spout jibberish to thousands of visitors without you even knowing about it?

With WebInject you can monitor the actual content of the web pages, and you can perform simulated user actions such as logging in and checking an account balance. If a search string is not present, an error message occurs or a link is broken, you can get an alert with a customized, descriptive message."
#    Comments [0] |
 Thursday, May 17, 2007

RESTful Web Services - 10 Years of 'Programmable Web' Books

I just got the RESTful Web Services book (Leonard Richardson & Sam Ruby, O'Reilly, 2007) in the mail today.  I've only read the beginning, but so far it is great.  In fact, it brings me back to when I first started working with the "programmable web".  I got into the programmable web back when the web was only a few years old.  I spent years doing performance/scalability testing and tuning for large Web 1.0 applications and bizarre custom Web APIs (think huge financial services rushing to get online).  Building tools to run realistic workloads through a system involves writing custom clients to simulate real user/browser interaction.  This is pretty ugly stuff when you are dealing with an application that was designed with only humans in mind (AKA all of them).  It involves lots of HTTP protocol level work.. screen scraping.. protocol sniffing and analyzing.. requests.. header mangling.. cookie handling.. redirects.. authentication.. session information parsing.. etc, etc.

Application simulation is pretty messy work.  There is no simple API to hide behind; you had to figure out what the API was for yourself.  See.. *every* web application has an API.  Though it might have been designed by accident.  This allowed me to see first hand how developers and frameworks butchered the use of the "Web" as a platform.  Staring at naked HTTP let me see every little bit of the hairball underneath.  Alas, any standardization around web services (or the concept to be officially named) was far off.

A friend (bearded Perl hacker) let me borrow a book to show me how Perl can do this cool web stuff:  Web Client Programming with Perl (Clinton Wong, O'Reilly, 1997).  This book helped me build my first web clients to do application simulation and testing.  There wasn't a ton of documentation at the time to do this sort of thing, so I relied heavily on this book.

So now.. 10 years later..  the Web has changed..  it has morphed into *the* distributed platform..  it is becoming organized.

As I flip through RESTful Web Services, it all just looks right..  REST looks right..   It is simple..  it is HTTP..  it is all the guts I already know.  It almost feels like a sequel to my old favorite:

I have traded Perl for Python as my preferred scripting language the past few years, but I am still building simulators, web clients, and virtual users. I am excited to work on some new stuff in this area.

#    Comments [0] |
 Wednesday, May 16, 2007

WebInject - Open Source Web Service Testing Tool Gets High Marks

InfoWorld article:

Three open source Web service testing tools get high marks

Rick Grehan of InfoWorld reviewed 3 popular open source tools for testing web services.  Rick is a contributing editor of the InfoWorld Test Center.  One of the tools he reviewed was WebInject (which I wrote).

"In this roundup, I examined three tools that purport to verify that your Web services do what they are supposed to do, that they resist graceless failure, and (in some cases) that they conduct themselves with efficiency. The tools are soapUI, TestMaker, and WebInject. All are open source, and are available for free download and incorporation into your next Web services project."
My tool (WebInject) scored pretty well in the comparison.

From the article:

WebInject

WebInject is a super-lightweight testing tool that can automate the testing of both Web services and Web applications. In fact, WebInject's ability to test XML/SOAP Web services appears to be a recent addition to the tool, as earlier versions could not readily handle the SOAP protocol.

Written in Perl, WebInject is primarily a command-line tool, though its author provides a thin Perl/Tk user interface that at least simplifies the execution of tests for those unwilling to spend too much time at the command prompt. If you're not familiar with Perl, don't panic. WebInject is built so that you can construct your tests without having to touch so much as a byte of Perl code.

WebInject is really an execution and reporting engine. Unlike the other tools, it has no IDE-style user interface, so tests must be written in an editor outside of the WebInject UI. This gives WebInject a less professional feel, but doesn't hamper the tool. I envision users of WebInject having directories filled with text files of various test “templates.” To add a new test case, the user just pops open his or her favorite editor, does some cutting, some pasting, and a bit of tweaking to alter the template to fit the specific circumstance, and ba-ding!, you've got a new test case.

...

In essence, a WebInject “project” is nothing more than an XML file filled with a set of elements strung one after the other. WebInject's simple structure lets you build tests with amazing rapidity. You must, however, have a moderately good understanding of the mechanics of SOAP protocols as well as a tool that lets you generate and capture HTTP/SOAP requests and responses. You'll need the requests to build the POST body and the responses so that you can create proper “verifypositive” and “verifynegative” regular expressions to check for success or failure. I used the Web Service Toolkit add-on for Eclipse to grab requests and responses for WebInject; once I had gotten the hang of it, I fell easily into the groove of building test cases.


Criteria        Score   Weight
Documentation   8       20%
Features        8       20%
Scalability     8       20%
Ease-of-use     8       15%
Portability     9       15%
Value           9       10%

Review Score:
Very Good 8.3

Cost:
Free download - open source

Platforms
Any platform that runs Perl or has a Perl interpreter installed

Bottom Line:
Much less feature-rich than the other tools, the lightweight WebInject nonetheless bolts out of the starting gate. If you need testing that will be off the ground and flying in minutes, reach for WebInject. On the other hand, it has far fewer capabilities than the other two products in this test, and unless you want to hack the Perl code, WebInject's feature set is pretty much what you install.


visit www.webInject.org
for more of my tools, visit: www.goldb.org

#    Comments [0] |
 Friday, May 11, 2007

Mike Shaver on New RIA Tools vs. Web Standards

Via The high cost of some free tools (Mike Shaver):

"If you choose a platform that needs tools, if you give up the viral soft collaboration of View Source and copy-and-paste mashups and being able to jam jQuery in the hole that used to have Prototype in it, you lose what gave the web its distributed evolution and incrementalism. You lose what made the web great, and what made the web win. If someone tells you that their platform is the web, only better, there is a very easy test that you can use:

When the tool spits out some bundle of shining Deployment-Ready Code Artifact, do you get something that can be mashed up, styled, scripted, indexed by search engines, read aloud by screen readers, read by humans, customized with greasemonkey, reformatted for mobile devices, machine-translated, excerpted, transcluded, edited live with tools like Firebug? Or do you get a chunk of dead code with some scripted frills about the edges, frozen in time and space, until you need to update it later and have to figure out how to get the same tool setup you had before, and hope that the platform is still getting security and feature updates? (I’m talking to you, pre-VB.NET Visual Basic developers.)"

All hail "View Source".

#    Comments [0] |
 Wednesday, May 09, 2007

PerfLog - Performance Analysis Tool for Web Server Logs (Python)

I wrote a small tool that I have found useful.  It is a Python script that parses and analyzes web log files (in W3C Extended Log File Format).  It creates an HTML report with data and PNG images showing graphs of things like: request throughput, error rates, HTTP method distribution, content type distribution, time-series, etc.

Many log parsing/analysis tools exist, but I was looking for something more specific to Performance than something a webmaster would want to look at.

The script is pretty basic. It was very useful for my own needs, but others might want to modify it.  If anyone has good suggestions to add to it, I am willing to enhance it at some point (or just grab my code and hack it yourself if you know Python).
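
PerfLog's own code is not shown here. To give a feel for the kind of parsing involved, below is a rough Python 3 sketch that reads W3C Extended Log File Format lines (field names come from the '#Fields:' directive) and computes a method distribution and an error rate. The function names parse_w3c_log and summarize are hypothetical, and treating any status of 400 or above as an error is an assumption made for this example.

```python
from collections import Counter


def parse_w3c_log(lines):
    """Yield one dict per log entry, keyed by the names in '#Fields:'."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('#'):
            # Directive lines start with '#'; only '#Fields:' matters here.
            if line.startswith('#Fields:'):
                fields = line[len('#Fields:'):].split()
            continue
        yield dict(zip(fields, line.split()))


def summarize(entries):
    """Count requests per HTTP method and compute the share of errors."""
    methods = Counter()
    errors = 0
    total = 0
    for entry in entries:
        total += 1
        methods[entry.get('cs-method', '-')] += 1
        status = entry.get('sc-status', '')
        if status.isdigit() and int(status) >= 400:
            errors += 1
    error_rate = errors / total if total else 0.0
    return {'requests': total, 'methods': dict(methods),
            'error_rate': error_rate}
```

Feeding it an open log file, as in summarize(parse_w3c_log(open('access.log'))), returns a small dictionary that could then be handed to a plotting library.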


Project Home

Features

  • Produces metrics and graphs from web logs (W3C Extended Log File Format)
  • Useful during performance testing and analysis
  • Output is created in XHTML/CSS with embedded PNG images
  • PerfLog is written in Python and uses Matplotlib for graphs and plotting

License

Project Info

Requirements

  • Python 2.4+
  • Matplotlib (requires Numeric or Numpy)

Platforms

  • Cross-Platform.  PerfLog will run on any system that supports Python and Matplotlib.
#    Comments [1] |
 Thursday, May 03, 2007

Mark Pilgrim on Vendor-Specific Hype

Mark Pilgrim speaks the truth about this hype going on with the new announcements of proprietary/vendor specific web stacks and runtimes (Microsoft Silverlight, Adobe Apollo, etc).  Don't get fooled again!:

"Y’all have fun. Play with your vendor-specific runtimes. Don’t call me when you wake up one morning with a pink line in the round window and your BFF vendor won’t return your calls. If you need me (but of course you won’t), I’ll be holed up in my drab unpainted toolshed around the corner, quietly building applications on the web that works."

Love it.

#    Comments [0] |
 Monday, April 09, 2007

Geo Location Mashup - Python, Yahoo Maps AJAX API

Mapping User Metro Concentration by IP Address

I just posted this: http://www.goldb.org/geo_maps

It is a tutorial/example showing how to create a geolocation mashup by generating HTML/JavaScript code from a Python script.  The resulting code is an HTML page with embedded JavaScript that you can open with your browser.  It works with the Yahoo Maps AJAX API to plot markers at specified locations.  I also explain how this technique can be used to create a [near] real-time map of user concentration based on IP addresses.

... feedback welcome.


It generates cool AJAXy eye-candy like this:

and this:

Since I use the AJAX control, the rendered map has a zooming, panning, dynamic, tiled interface.  Pretty Slick.
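
The tutorial's code lives at the link above. As a generic Python 3 sketch of the technique (a script that writes out an HTML page with the marker data embedded as JavaScript), something like the following could serve as a starting point. The template, the render_map_page name, and the marker dictionary layout are assumptions for illustration, and the calls into the mapping API are deliberately left out.

```python
import html
import json

PAGE_TEMPLATE = """<html>
<head><title>%(title)s</title></head>
<body>
<div id="map"></div>
<script type="text/javascript">
var markers = %(markers)s;
// Hand each entry in `markers` to the mapping API of your choice here.
</script>
</body>
</html>
"""


def render_map_page(title, points):
    """Return an HTML page embedding (lat, lon, label) tuples as JS data."""
    markers = [{'lat': lat, 'lon': lon, 'label': label}
               for lat, lon, label in points]
    # Escape '</' so a label cannot close the script element early.
    data = json.dumps(markers).replace('</', '<\\/')
    return PAGE_TEMPLATE % {'title': html.escape(title), 'markers': data}
```

Writing the returned string to a file and opening it in a browser gives a page whose script block already holds the coordinates.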

#    Comments [1] |
 Monday, April 02, 2007

I Need Better Web Hosting

My website and blog were down most of today, after getting pounded with traffic from Reddit Programming.

The day started great... I already had 600 visitors today when I woke up for work at 7AM.  Then one of my posts started floating near the top of Reddit.  My server couldn't handle the traffic and soon fell over.  It didn't come back online until just now.

Granted, I am using ultra cheap shared hosting, so this shouldn't come as a huge surprise.  However, I am now looking for some hosting that is slightly more reliable.  Aside from the heavy traffic today, my site goes up and down intermittently all the time anyways.

Can anybody recommend some good cheap web hosting?  Basically I am looking for about 1 gig of storage and at least 3 gigs of transfer per month.  I understand that reliability and availability are something one must pay for (and usually mutually exclusive with shared hosting).  So.. I would sacrifice availability for price, as long as availability and reliability were decent.

I need both Windows (with ASP.NET 2.0) and Linux (with Python/Perl) hosting. These can be from a single provider, or with 2 different providers.  I have used lots of shared hosting services over the years and all of them generally suck.

.. any good hosting recommendations?

#    Comments [1] |
 Saturday, March 31, 2007

Digital Ethnography

(I can't even tell you how many times I've watched this video since it came out a few months ago)

For posterity...

Professor Michael Wesch:

teaching the machine.
the machine is us.

we'll need to rethink a few things...
copyright
authorship
identity
ethics
aesthetics
rhetorics
governance
privacy
commerce
love
family
ourselves

- The Machine is Us/ing Us

#    Comments [0] |
 Wednesday, March 28, 2007

Microsoft IIS - Welcome to Last Decade (Performant CGI)

Wow..
CGI will run well on IIS
Rails will run well on IIS.

Rob Conery on running Ruby on Rails (or other CGI based platforms) on IIS:

"Rails works using CGI - basically an executable that gets run each time a request comes into a web site. Most of the frameworks out there do NOT support multi-threading, so each time a request comes in that requires anything dynamic, CGI is "instanced" and executed. If you have a lot of requests at once, this isn't really a good thing. Now some servers are built to mitigate this (Apache, Lighttpd, etc); IIS is not.

... I would imagine that in the next 6 months we'll see a great addition to IIS 6 and 7 for all the CGI-enabled platforms out there."


hmm.. good to hear. (seriously)
but damn... weren't we doing this 10 years ago with Perl/Apache? :)

#    Comments [2] |
 Sunday, March 25, 2007

Real World Web Scalability

(via reddit programming)

Very lengthy overview of performance and scalability issues for web systems by Ask Bjørn Hansen.  This presentation covers a vast range of information:

Real World Web Scalability  (warning large PDF)

The takeaway?
Create horizontally scalable distributed systems..  always.

#    Comments [0] |
 Saturday, February 24, 2007

The Web as a Data Integration and Machine-Oriented Publishing Layer

This is what I was talking about when I wrote:

"one of the most attractive things about the Web is the ability to use HTTP as a simple transport protocol abstraction.  [...]  with this additional transport abstraction in place, you can build another application layer protocol on top of this and use that as your API for distributed operations.  That is where the rubber meets the road in modern large scale systems, and that is where the action is taking place in the current debate about SOA, REST, Web Services, and distributed architectures.  Furthermore, the foundation for this style is built directly into HTTP 1.1."

Bill de hÓra states it better in his "Confederacy" post:

"the Web is not just the presentation tier anymore; it's becoming a data integration and machine-oriented publishing layer.  The presentation layer is being pushed down to the client machine in the form of AJAX, XUL and Flex."

A new web/middleware layer has been forming, and this is the engine that is driving Web 2.0 and creating a new level of integration and interoperability.

#    Comments [0] |
 Friday, February 23, 2007

“Humanity Lobotomy” - Net Neutrality Open Source Documentary

Spread Awareness.

Awesome new video (via Lessig): 

“Humanity Lobotomy” - Net Neutrality Open Source Documentary


What a great month for videos!  I feel inspired to be working in technology again.  We control the future.

Check out more here

#    Comments [0] |

Google Apps Premier - The Office Battle Is On

Last year, I made a prediction/bet that Google was gonna make a huge push into office applications and we were gonna see the MS Office monopoly start to erode.  Well, it's on!  We actually have quite a cool phenomenon brewing, with Open Office striking from one side and online office apps striking from the other.

It is a classic disruptive play.. still far from a tipping point, but serious shots were fired over Microsoft's bow.   Google is going at it pretty aggressively too.  I just got an invite for a seminar in Boston that explains the new enterprise office tools (Google@Work Seminar).

It will be interesting to watch this unfold.

#    Comments [3] |
 Saturday, February 17, 2007

Clarifying Architectural Styles for the Web

In his latest finely crafted post, REST and WS, Joe Gregorio gives the quick definitive overview of web services and modern distributed architecture, while clarifying much confusion.


First of all, what REST really is:

"REST is not a specific piece of technology but an Architectural Style that was abstracted from HTTP during the transition from HTTP 1.0 to HTTP 1.1."


OK..  I get it.  From a network perspective, going up the OSI Model/TCP Stack... starting from Layer 4, TCP is the transport layer protocol.   HTTP is the [Layer 7] application layer protocol that rides on top of it.  However, one of the most attractive things about the Web is the ability to use HTTP as a simple transport protocol abstraction, rather than interfacing with TCP directly.  So with this additional transport abstraction in place, you can build another application layer protocol on top of this and use that as your API for distributed operations. That is where the rubber meets the road in modern large scale systems, and that is where the action is taking place in the current debate about SOA, REST, Web Services, and distributed architectures.  Furthermore, the foundation for this style is built directly into HTTP 1.1.

The problem with the whole debate going on is that we are talking apples and oranges. Different architectural styles offer certain advantages, and these become apparent as your system grows in scale:


"REST and WS-* are two different tools whose strengths shine at different scales. The easiest way to think about this is an example from nature: at the scale of the atom the forces responsible for most of the action are different from the forces at the scale of a cell. Quantum effects and the strong nuclear force determine the structure and operation of an atom, while the operation of a cell is dominated by molecular reactions and Van der Waals' forces.

Another example closer to home; when programming and making calls into other functions and libraries, you pass along classes and types in the function call parameters. You expect those classes and types to be perfectly understood on the other side of that function call. Those are the rules at that scale; that type information can be counted on to survive and be useful over the function call boundary. As your scale grows, as you move outside the single executable, the same machine, or the same platform, that assumption begins to weaken, to the point that when you get to Internet scale services that assumption is actually harmful.

When working at the smaller scale the assumption that types can move across a boundary is powerful and allows many optimizations. Working in a homogeneous environment such as Java, WS-* has real advantages; you can very quickly create interfaces in your target programming language and expose those interfaces via WSDL and have them consumed just as easily on the calling side using the same WSDL.

As you move to larger systems, either many more clients connecting, or a non-homogeneous pool of clients, this paradigm starts to break down. If there are many clients then the demands for caching semantics will begin to dominate. In that case you need to abandon HTTP as just a simple transport and start using the application level semantics of HTTP to start leveraging the caching architecture already built into the Internet."


Well.. that pretty much cements the whole idea in my head.  When you move towards larger distributed systems and/or less-homogeneous environments, scalability and interoperability become a concern.  There have been some clever approaches to solving these issues. Systems continue to become larger, more loosely coupled, and more interoperable... this is good... but as you approach this space, there are some tradeoffs you must make.

The real question is: should you think in those larger and better organized terms right from the start, or do you want to quickly exploit some of the advantages and optimizations available in another approach?  And of course the answer is context...  "It depends on the system".
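As a rough illustration of what "using the application level semantics of HTTP" for caching can look like, here is a small sketch of a conditional GET. It is Python 3, and the resource body, the CachingHandler class and the fetch/demo helpers are all hypothetical names invented for this example. The server tags its response with an ETag header. When a client sends that tag back in If-None-Match, the server answers 304 Not Modified with no body, which is the hook that browser and proxy caches rely on.

```python
import hashlib
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

DOCUMENT = b"<resource>42</resource>\n"  # hypothetical resource body
ETAG = '"%s"' % hashlib.sha256(DOCUMENT).hexdigest()  # validator for this body


class CachingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Client already holds the current version: send headers only.
        if self.headers.get("If-None-Match") == ETAG:
            self.send_response(304)
            self.send_header("ETag", ETAG)
            self.end_headers()
            return
        # Otherwise send the full body plus the caching metadata.
        self.send_response(200)
        self.send_header("ETag", ETAG)
        self.send_header("Cache-Control", "max-age=60")
        self.send_header("Content-Length", str(len(DOCUMENT)))
        self.end_headers()
        self.wfile.write(DOCUMENT)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch(port, etag=None):
    """GET '/' from the local server; return (status, etag_header, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    headers = {"If-None-Match": etag} if etag else {}
    conn.request("GET", "/", headers=headers)
    resp = conn.getresponse()
    body = resp.read()
    result = (resp.status, resp.getheader("ETag"), body)
    conn.close()
    return result


def demo():
    """First request is unconditional, second one revalidates with the ETag."""
    server = HTTPServer(("127.0.0.1", 0), CachingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        first = fetch(port)
        second = fetch(port, etag=first[1])
        return first, second
    finally:
        server.shutdown()
        server.server_close()
```

The first fetch carries the whole document. The second carries only headers. A tunnel-everything-through-POST style of RPC gives that up, which is the tradeoff the quote is pointing at.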


 Friday, February 16, 2007

Google - All Your Search Traffic Are Belong To Us

(Yes, the title is intentionally ungrammatical.)

I run a personal website: www.goldb.org. This is where I host my blog as well as content pages mostly dealing with computer programming. I was just looking over my traffic/visitors stats for the past month and noticed something interesting.

Basically, all of my search traffic comes from Google (I am indexed in every major search engine). I keep reading about search volume comparisons and how Google is slightly leading, and how more parity in the search market now exists.

Obviously my website visitors are skewed towards technical types, and the search terms they use to find my site are all technical/programming/software terms. The takeaway, at least from my own numbers, is that nearly all technical users are searching with Google instead of the other popular search engines.


Here is a breakdown of some stats from the last 30 days:

Where did my traffic come from?

  • 14.8% came directly
  • 70.4% from searches
  • 14.8% from other sites


Search Engine - # Visitors

  • Google - 1729
  • Yahoo - 20
  • Microsoft Live - 17
  • Technorati - 4
  • Del.icio.us - 2
  • AOL Search - 1



97.52% of the visitors who reached my site via search in the past 30 days came from Google.
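That percentage is just Google's visitor count divided by the total across all six engines in the list above. A quick sketch of the arithmetic in Python 3 (the dictionary and function names are made up for illustration; the counts are the ones listed above):

```python
# Search-engine visitor counts from the list above
visitors = {
    "Google": 1729,
    "Yahoo": 20,
    "Microsoft Live": 17,
    "Technorati": 4,
    "Del.icio.us": 2,
    "AOL Search": 1,
}


def search_share(counts, engine):
    """Return one engine's share of all search visitors, as a percentage."""
    total = sum(counts.values())
    return 100.0 * counts[engine] / total


total_visitors = sum(visitors.values())                     # 1773
google_share = round(search_share(visitors, "Google"), 2)   # 97.52
```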

 Tuesday, February 13, 2007

Screen Scraping in Python

Mads Kristensen just posted an article: Screen scraping in C#, where he shows several ways to make HTTP requests in C# that can be used for screen scraping.

From Mads:

"Some say that screen scraping is a lost art because it is no longer an advanced discipline. That may be right, but there are different ways of doing it. Here are some different ways that all are perfectly acceptable, but can be used for various different purposes."


Not to be outdone... here are 2 examples of how to do the same thing in Python:

using httplib:

import httplib

conn = httplib.HTTPConnection("www.python.org")
conn.request("GET", '/')  # request the home page
print conn.getresponse().read()  # print the raw HTML


using urllib:

import urllib

f = urllib.urlopen('http://www.python.org/')
print f.read()  # urlopen returns a file-like object
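Fetching the page is only half of screen scraping; the other half is pulling something out of the HTML. Here is a minimal sketch of that second step using only the standard library. It is written for Python 3, where the fetch above would be spelled urllib.request.urlopen. The TitleParser class, the extract_title helper and the SAMPLE_PAGE string are all hypothetical and stand in for a real downloaded page so the example runs offline.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    """Return the stripped <title> text of an HTML document ('' if none)."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


# Stand-in for the string you would get back from the HTTP request
SAMPLE_PAGE = (
    "<html><head><title> Example Page </title></head>"
    "<body><p>hi</p></body></html>"
)
```

For anything beyond a quick one-off, a dedicated HTML parsing library is usually less painful than hand-written handlers like these.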


 Thursday, November 30, 2006

Feed Reading in Firefox 2

From Ben Goodger (Googler and Lead Firefox Dev) on RSS in Firefox:

"... Firefox philosophy of having enough features, not too many or too few. In general, we felt that RSS reader was a very personal choice to be made by the user, and that we did not want to compete with existing RSS readers, which are very competent in a variety of ways. Rather, we wanted to allow users to easily subscribe to feeds using their favorite reader."

I love this feature.. it lets me quickly subscribe to feeds through Bloglines.

I think Firefox nailed this one.  A feed reader belongs in a plugin if you choose to use your browser for such tasks.  Keep the core simple and extensible.

 Wednesday, November 01, 2006

Google Code Search - Indexing Source Code Inside Zip Files

I was playing with Google Code Search (search engine for public source code) and I noticed it had indexed some code I released a while back.

I knew the google bot was indexing public CVS and SVN repositories...
But the interesting thing is that I never checked this code into any public repository.  All I did was place a zip file on my webserver and link to it from my homepage.

I searched around a little and found this explaining what it does:

"The two ways that source code lives on the Internet is in archives, things like Zip files, gzip, etc. And then in software-control repositories like SourceForge.net, Google's code hosting, and other places," Google product manager Tom Stocky told internetnews.com.

"We'll be crawling all of that."

Google isn't just going to index the Zip archive files. They're actually going to open up the files and index all the individual files within them.

This is pretty cool.  By doing a Google Code Search you can see the full contents of the zipped source files, as indexed by Google.
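Nothing here says how Google actually does this, but the basic idea of opening an archive and indexing each file inside it is easy to sketch with Python's zipfile module. This is a toy illustration in Python 3 under made-up assumptions: the archive is built in memory, the file names and contents are invented, and the "index" is just a dictionary mapping each identifier-like word to the files that contain it.

```python
import io
import re
import zipfile


def build_sample_archive():
    """Create a small zip in memory (hypothetical file names and contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr(
            "tool/main.py",
            "def get_server_type(host):\n    return host\n",
        )
        archive.writestr("tool/README.txt", "A tiny tool.\n")
    return buf.getvalue()


def index_archive(zip_bytes):
    """Map each word to the list of archive members whose text contains it."""
    index = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # Read the member without unpacking the archive to disk
            text = archive.read(name).decode("utf-8", errors="replace")
            for word in set(re.findall(r"[A-Za-z_]\w*", text)):
                index.setdefault(word, []).append(name)
    return index
```

Looking up a function name in the result of index_archive(build_sample_archive()) gives back the file inside the zip that defines it, which is the same kind of per-file hit Code Search shows for zipped source.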
